Screaming Frog Custom Extractions

Websites hold a ton of useful information, but most of the time it’s too laborious or complicated to visit every page on a website and copy product data, metadata, title tags, and anchor text into a spreadsheet.

This is where Screaming Frog comes to the rescue, with custom data extractions that automate the process. Custom extraction is a form of web scraping (also called web harvesting or web data extraction) that pulls data from websites and lets you store it locally on your computer.

If you’re a beginner, here are some questions you might have:

What is the Screaming Frog SEO Spider?

The Screaming Frog SEO Spider is a website crawler with a graphical user interface (GUI) that helps you improve onsite SEO by extracting and analyzing your website’s data.

What are custom extractions?

Custom extractions are a set of functions in the Screaming Frog SEO Spider that extract specific information from webpages. These extractions help you optimize your site for technical SEO, including how it appears in search results, gather essential data on your copy, and locate and fix errors.

How is Data Extraction done?

The process of data extraction involves pulling the required data from your website using the Screaming Frog SEO Spider. The information is saved within Screaming Frog’s memory, giving you the option to export your crawl results to Excel or Google Sheets for further review.

Why is Data Extraction critical?

Data extraction allows you to harvest large amounts of data quickly and efficiently. This automation gives you an immediate view of your site’s architecture. The process saves you time and resources while giving you the valuable data you’ll need to plan your search engine optimization strategy.

Screaming Frog is the go-to web scraper tool for SEOs. The options are nearly endless, so here are a ton of custom web-scraping expressions.

How to use Custom Extraction settings in Screaming Frog

In Screaming Frog, go to Configuration > Custom > Extraction.

Screaming Frog Custom Extraction

Next, click +Add and set up your extraction rules.

Custom Extraction Settings
Select elements of internal HTML using Custom Extraction tab

Add a title, choose whether you need CSSPath, XPath, or Regex, then add your expression. If you aren’t sure which selector or expression you need, look at the examples below or use the Inspect function in Google Chrome Dev Tools. You can open Dev Tools by right-clicking an element in the Google Chrome browser and choosing Inspect.

Example:

Here is an example of how you would scrape for a Facebook Pixel ID:

Facebook Pixel ID Extraction

In the results, you can see that one of my pages is missing a Facebook Pixel:

Missing Facebook ID
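
Once the crawl is exported, a check like the one above can also be scripted. The snippet below is a minimal sketch, not Screaming Frog functionality: the inline CSV stands in for an export, and the column names "Address" and "Facebook Pixel ID 1" are assumptions about how the export is labelled, so adjust them to match your own file.

```python
import csv
import io

# Stand-in for an exported crawl. The column names are assumptions;
# check the header row of your own export before relying on them.
EXPORT = (
    "Address,Facebook Pixel ID 1\n"
    "https://example.com/,1234567890123456\n"
    "https://example.com/about,1234567890123456\n"
    "https://example.com/contact,\n"
)


def pages_missing(column, csv_text):
    """Return the URLs whose extraction column is empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["Address"]
        for row in reader
        if not (row.get(column) or "").strip()
    ]


missing = pages_missing("Facebook Pixel ID 1", EXPORT)
print(missing)
```

To run it against a real file, swap `io.StringIO(csv_text)` for an opened file handle.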

Basic Syntax for using XPath Web Scraping

SYNTAX   FUNCTION
//       Search anywhere in the document
/        Search within the root
@        Select a specific attribute of an element
*        Wildcard, used to select any element
[ ]      Find a specific element
.        Specifies the current element
..       Specifies the parent element
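
If you want to try this syntax outside Screaming Frog, here is a rough sketch using Python’s built-in ElementTree module. Treat it as an illustration only: ElementTree supports just a subset of XPath, it needs well-formed markup, paths must start with "." to search from the current node, and the sample page is made up.

```python
import xml.etree.ElementTree as ET

# Tiny, well-formed sample page (hypothetical markup, for illustration only).
PAGE = """
<html>
  <body>
    <div class="author"><p class="bio">Jane writes about SEO.</p></div>
    <div><a href="/contact" title="Contact us">Contact</a></div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# //  search anywhere below the current node
paragraphs = [p.text for p in root.findall(".//p")]

# [ ] with @  keep only elements whose attribute has a given value
author_divs = root.findall(".//div[@class='author']")

# *  wildcard: any element that carries class="bio"
bio_tags = [el.tag for el in root.findall(".//*[@class='bio']")]

# ..  step up to the parent of the matched element
parent_class = root.find(".//p[@class='bio']/..").get("class")

# ElementTree returns elements rather than attribute values,
# so attributes are read with .get() instead of a trailing /@href
hrefs = [a.get("href") for a in root.findall(".//a")]
```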

XPath functions

XPATH                                   OUTPUT
//h1                                    Extract all H1 tags
(//h3)[1]                               Extract the first H3 tag on the page
(//h3)[2]                               Extract the second H3 tag on the page
//div/p                                 Extract the text of any <p> that is a direct child of a <div>
//div[@class='author']                  Extract any <div> with class “author”
//p[@class='bio']                       Extract any <p> with class “bio”
//*[@class='bio']                       Extract any element with class “bio”
//ul/li[last()]                         Extract the last <li> in a <ul>
//ol[@class='cat']/li[1]                Extract the first <li> in a <ol> with class “cat”
count(//h2)                             Count the number of H2s (set extraction filter to “Function Value”)
//a[contains(.,'click here')]           Extract any link with anchor text containing “click here”
//a[starts-with(@title,'Written by')]   Extract any link with a title starting with “Written by”
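
To see how a few of these behave, here is a small sketch in Python’s ElementTree against a made-up page. ElementTree handles positional predicates such as [1] and [last()], but it does not implement count(), contains(), or starts-with(), so those three are emulated in plain Python below. That is a stand-in for the XPath functions, not how Screaming Frog evaluates them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup.
PAGE = """
<html><body>
  <h2>Intro</h2>
  <h2>Details</h2>
  <ol class="cat"><li>Shoes</li><li>Boots</li><li>Sandals</li></ol>
  <p><a href="/buy" title="Written by Jane">click here to buy</a></p>
  <p><a href="/about" title="About">About us</a></p>
</body></html>
"""
root = ET.fromstring(PAGE)

# Positional predicates pick an item relative to its parent
first_item = root.find(".//ol[@class='cat']/li[1]").text
last_item = root.find(".//ol[@class='cat']/li[last()]").text

# Stand-in for count(//h2)
h2_count = len(root.findall(".//h2"))

# Stand-in for //a[contains(.,'click here')]
click_links = [
    a.get("href") for a in root.iter("a")
    if "click here" in "".join(a.itertext())
]

# Stand-in for //a[starts-with(@title,'Written by')]
written_by = [
    a.get("href") for a in root.iter("a")
    if a.get("title", "").startswith("Written by")
]
```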

How to Extract Common HTML Elements

XPATH                                         OUTPUT
//@href                                       Extract all links
//a[starts-with(@href,'mailto')]/@href        Extract link that starts with “mailto” (email address)
//img/@src                                    Extract all image source URLs
//img[contains(@class,'aligncenter')]/@src    Extract all image source URLs for images with the class name containing “aligncenter”
//link[@rel='alternate']                      Extract elements with the rel attribute set to “alternate”
//@hreflang                                   Extract all hreflang values
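
The same extractions can be approximated in a few lines of Python, which is handy for sanity-checking what an expression should return. This is only a sketch on invented markup: ElementTree cannot select attributes directly or run contains(), so the attribute values are read with .get() and filtered in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; void tags are self-closed so it parses as XML.
PAGE = """
<html>
  <head>
    <link rel="alternate" hreflang="en-au" href="https://example.com/au/" />
    <link rel="stylesheet" href="/style.css" />
  </head>
  <body>
    <a href="mailto:hello@example.com">Email us</a>
    <a href="/pricing">Pricing</a>
    <img class="wp-image aligncenter" src="/img/hero.png" />
    <img class="thumb" src="/img/thumb.png" />
  </body>
</html>
"""
root = ET.fromstring(PAGE)

# Like //@href: every href value, whatever element carries it
all_hrefs = [el.get("href") for el in root.iter() if el.get("href") is not None]

# Like //a[starts-with(@href,'mailto')]/@href
mailto = [
    a.get("href") for a in root.iter("a")
    if a.get("href", "").startswith("mailto")
]

# Like //img/@src
img_src = [img.get("src") for img in root.findall(".//img")]

# Like //img[contains(@class,'aligncenter')]/@src
centered = [
    img.get("src") for img in root.iter("img")
    if "aligncenter" in img.get("class", "")
]

# Like //link[@rel='alternate'] and //@hreflang
alternates = root.findall(".//link[@rel='alternate']")
hreflangs = [el.get("hreflang") for el in root.findall(".//*[@hreflang]")]
```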

Extract Meta Tags (use inner HTML element)

XPATH                                                 OUTPUT
//meta[@property='article:published_time']/@content   Extract the article publish date (commonly-found meta tag on WordPress websites)

Extract Open Graph

XPATH                                          OUTPUT
//meta[@property='og:type']/@content           Extract the Open Graph type object
//meta[@property='og:image']/@content          Extract the Open Graph featured image URL
//meta[@property='og:updated_time']/@content   Extract the Open Graph updated time

Extract Twitter Cards

XPATH                                    OUTPUT
//meta[@name='twitter:card']/@content    Extract the Twitter Card type
//meta[@name='twitter:title']/@content   Extract the Twitter Card title
//meta[@name='twitter:site']/@content    Extract the Twitter Card site object (Twitter handle)
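
The meta tag, Open Graph, and Twitter Card expressions all follow one pattern: find a <meta> by its property or name attribute, then read its content attribute. Here is a minimal sketch of that pattern in Python. The helper name meta_content and the sample <head> are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical <head> section with self-closed tags so it parses as XML.
HEAD = """
<head>
  <meta property="article:published_time" content="2021-03-10T09:00:00+00:00" />
  <meta property="og:type" content="article" />
  <meta property="og:image" content="https://example.com/cover.png" />
  <meta name="twitter:card" content="summary_large_image" />
  <meta name="twitter:site" content="@example" />
</head>
"""
head = ET.fromstring(HEAD)


def meta_content(attr, value):
    """Return the content of the first <meta> whose attr equals value, or None."""
    tag = head.find(f".//meta[@{attr}='{value}']")
    return tag.get("content") if tag is not None else None


og_type = meta_content("property", "og:type")
twitter_card = meta_content("name", "twitter:card")
updated = meta_content("property", "og:updated_time")  # not present in the sample
```

Note that Open Graph tags are matched on property while Twitter Card tags are matched on name, which is the same split the XPath expressions use.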

Extract Schema Types

XPATH                      OUTPUT
//*[@itemtype]/@itemtype   Extract all of the types of schema markup on a page

Extract Breadcrumb Schema

XPATH                                                               OUTPUT
//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop]/a/@href      Extract all breadcrumb links
//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop][1]/a/@href   Extract the first breadcrumb link
//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop]              Extract breadcrumb names (set extraction filter to “Extract Text”)
count(//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop])       Count the number of breadcrumb list items (set extraction filter to “Function Value”)

Extract Product Schema

XPATH                                     OUTPUT
//*[@itemprop='name']/@content            Extract product name
//*[@itemprop='description']/@content     Extract product description
//*[@itemprop='price']/@content           Extract product price
//*[@itemprop='priceCurrency']/@content   Extract product currency
//*[@itemprop='availability']/@href       Extract product availability
//*[@itemprop='sku']/@content             Extract product SKU
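
These expressions assume the product is marked up with microdata, with each value stored in a content attribute (or href for availability). The sketch below runs the same lookups in Python on an invented product snippet. The itemprop helper is hypothetical, and real pages often put values in the element text instead, in which case you would read the text rather than an attribute.

```python
import xml.etree.ElementTree as ET

# Made-up microdata for illustration only.
PRODUCT = """
<div itemscope="" itemtype="https://schema.org/Product">
  <meta itemprop="name" content="Trail Runner" />
  <meta itemprop="sku" content="TR-100" />
  <div itemprop="offers" itemscope="" itemtype="https://schema.org/Offer">
    <meta itemprop="price" content="89.95" />
    <meta itemprop="priceCurrency" content="AUD" />
    <link itemprop="availability" href="https://schema.org/InStock" />
  </div>
</div>
"""
root = ET.fromstring(PRODUCT)


def itemprop(name, attribute="content"):
    """Read one attribute from the first element with the given itemprop."""
    el = root.find(f".//*[@itemprop='{name}']")
    return el.get(attribute) if el is not None else None


product = {
    "name": itemprop("name"),
    "price": itemprop("price"),
    "currency": itemprop("priceCurrency"),
    "availability": itemprop("availability", "href"),
    "sku": itemprop("sku"),
    "description": itemprop("description"),  # missing in the sample
}

# Like //*[@itemtype]/@itemtype: every schema type on the page
schema_types = [el.get("itemtype") for el in root.iter() if el.get("itemtype")]
```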

Extract Review Schema

XPATH                                                           OUTPUT
//*[@itemprop='reviewCount']                                    Extract review count
//*[@itemprop='ratingValue']                                    Extract rating value
//*[@itemprop='bestRating']                                     Extract best review rating
//*[@itemprop='review']/*[@itemprop='name']                     Extract review name
//*[@itemprop='review']/*[@itemprop='author']                   Extract review author
//*[@itemprop='review']/*[@itemprop='datePublished']/@content   Extract the publish date of reviews
//*[@itemprop='review']/*[@itemprop='reviewBody']               Extract the body content of reviews

Extract Local Business & Organization Schema

XPATH                                                         OUTPUT
//*[contains(@itemtype,'Organization')]/*[@itemprop='name']   Extract the organization’s name
//*[@itemprop='address']/*[@itemprop='streetAddress']         Extract the street address
//*[@itemprop='address']/*[@itemprop='addressLocality']       Extract the address locality
//*[@itemprop='address']/*[@itemprop='addressRegion']         Extract the address region
//*[@itemprop='telephone']                                    Extract the telephone number
//*[@itemprop='sameAs']/@href                                 Extract the “sameAs” links

Extract Article Schema

XPATH                                                        OUTPUT
//*[contains(@itemtype,'Article')]/*[@itemprop='headline']   Extract the article headline
//*[@itemprop='author']/*[@itemprop='name']/@content         Extract author name
//*[@itemprop='publisher']/*[@itemprop='name']/@content      Extract publisher name
//*[@itemprop='datePublished']/@content                      Extract publish date
//*[@itemprop='dateModified']/@content                       Extract modified date

Custom Data Extraction with Regex

Wildcards

SYNTAX   FUNCTION
.        Match any 1 character
*        Match preceding character 0 or more times
?        Match preceding character 0 or 1 time
+        Match preceding character 1 or more times
|        OR

Anchors

SYNTAX   FUNCTION
^        String begins with the succeeding character
$        String ends with the preceding character

Groups

SYNTAX   FUNCTION
( )      Match enclosed characters in exact order
[ ]      Match any one of the enclosed characters
-        Inside [ ], match any character within the specified range (for example, [a-z])

Escape

SYNTAX   FUNCTION
\        Treat character literally, not as regex
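
Here is a quick way to get a feel for these building blocks, sketched with Python’s re module. The URLs and the item code are invented, and regex flavours differ in places, so treat this as a playground rather than a guarantee of how a pattern behaves elsewhere.

```python
import re

urls = [
    "http://example.com/blog",
    "https://example.com/shop",
    "ftp://example.com/files",
]

# ^ anchors to the start, ? makes the "s" optional, \. escapes the dot
web_urls = [u for u in urls if re.search(r"^https?://example\.com", u)]

# $ anchors to the end, ( | ) groups two alternatives
blog_or_shop = [u for u in urls if re.search(r"/(blog|shop)$", u)]

# [ ] matches one character from a set, A-Z and 0-9 are ranges,
# + repeats the preceding item one or more times
sku = re.search(r"[A-Z]+-[0-9]+", "Item code: TR-100 (new)").group(0)

# . matches any single character, * repeats it zero or more times
wildcard = re.search(r"T.*0", "TR-100").group(0)
```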

Regex Custom Data Extraction

REGEX                              OUTPUT
["'](UA-.*?)["']                   Extract the Google Analytics tracking ID
["'](AW-.*?)["']                   Extract the Google Ads conversion ID and/or remarketing tag
["'](GTM-.*?)["']                  Extract the Google Tag Manager and/or Google Optimize ID
fbq\(["']init["'], ["'](.*?)["']   Extract the Facebook Pixel ID
\{ti:["'](.*?)["']\}               Extract the Bing Ads UET tag
adroll_adv_id = ["'](.*?)["']      Extract the AdRoll Advertiser ID
adroll_pix_id = ["'](.*?)["']      Extract the AdRoll Pixel ID
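
Before running a crawl, it can help to test a pattern against a snippet of page source. The sketch below does that in Python for four of the tracking patterns. The script snippet and all of the IDs are invented, and extract_ids is just an illustrative helper name. Each pattern’s capture group, the part in parentheses, is what ends up as the extracted value.

```python
import re

# Invented page source with fake tracking IDs.
SOURCE = """
<script>
  gtag('config', 'UA-1234567-1');
  gtag('config', "AW-987654321");
  (function(w,d,s,l,i){})(window,document,'script','dataLayer','GTM-ABC123');
  fbq('init', '1234567890123456');
</script>
"""

# Straight quotes matter here: curly quotes pasted from a web page
# will not match the quotes used in real page source.
PATTERNS = {
    "google_analytics": r"""["'](UA-.*?)["']""",
    "google_ads": r"""["'](AW-.*?)["']""",
    "tag_manager": r"""["'](GTM-.*?)["']""",
    "facebook_pixel": r"""fbq\(["']init["'], ["'](.*?)["']""",
}


def extract_ids(html):
    """Return the first captured ID per pattern, or None when absent."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, html)
        found[name] = match.group(1) if match else None
    return found


ids = extract_ids(SOURCE)
no_tags = extract_ids("<html><body>No tags here</body></html>")
```

A None value plays the same role as an empty cell in the crawl results: the tag was not found on that page.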

Extract All Schema Markup and Schema Types

REGEX                                         OUTPUT
["']application/ld\+json["']>(.*?)</script>   Extract all of the JSON-LD schema markup
["']@type["']: *["'](.*?)["']                 Extract all of the types of JSON-LD schema markup on a page
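
As a rough illustration, the two patterns can be tried in Python on an invented JSON-LD block. One thing to watch: in Python the dot does not match line breaks unless the re.DOTALL flag is set, so the first pattern needs that flag when the markup spans several lines. Once the block is captured, parsing it with the json module is usually more robust than further regex work.

```python
import json
import re

# Invented page fragment with a multi-line JSON-LD block.
HTML = """
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article",
 "headline": "Custom Extractions",
 "publisher": {"@type": "Organization", "name": "Example Co"}}
</script>
"""

JSON_LD = re.compile(r"""["']application/ld\+json["']>(.*?)</script>""", re.DOTALL)
TYPE = re.compile(r"""["']@type["']: *["'](.*?)["']""")

blocks = JSON_LD.findall(HTML)     # raw JSON-LD text, one entry per script tag
types = TYPE.findall(blocks[0])    # every @type value in the first block
data = json.loads(blocks[0])       # structured access once the block is isolated
headline = data["headline"]
```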

Extract Breadcrumb Schema

REGEX                                                                       OUTPUT
["']item["']: *\{["']@id["']: *["'](.*?)["']                                Extract breadcrumb links
["']item["']: *\{["']@id["']: *["'].*?["'], *["']name["']: *["'](.*?)["']   Extract breadcrumb names

Extract Product Schema

REGEX                                                                     OUTPUT
["']@type["']: *["']Product["'].*?["']name["']: *["'](.*?)["']            Extract product name
["']@type["']: *["']Product["'].*?["']description["']: *["'](.*?)["']     Extract product description
["']@type["']: *["']Product["'].*?["']price["']: *["'](.*?)["']           Extract product price
["']@type["']: *["']Product["'].*?["']priceCurrency["']: *["'](.*?)["']   Extract product currency
["']@type["']: *["']Product["'].*?["']availability["']: *["'](.*?)["']    Extract product availability
["']@type["']: *["']Product["'].*?["']sku["']: *["'](.*?)["']             Extract product SKU
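
All six product patterns share one shape: find the Product type, skip ahead lazily to the field name, then capture the quoted value. The Python sketch below builds that pattern for any field so you can test it on sample markup. The JSON-LD and the product_field helper are invented, and the final line shows a limitation worth knowing: the pattern expects the value to be in quotes, so a bare numeric price is not captured.

```python
import re

# Made-up JSON-LD for illustration only.
JSON_LD = (
    '{"@type": "Product", "name": "Trail Runner", "sku": "TR-100", '
    '"offers": {"@type": "Offer", "price": "89.95", "priceCurrency": "AUD", '
    '"availability": "https://schema.org/InStock"}}'
)


def product_field(field, text=JSON_LD):
    """Capture the quoted value of `field` that follows a Product @type."""
    pattern = (
        r"""["']@type["']: *["']Product["'].*?["']"""
        + re.escape(field)
        + r"""["']: *["'](.*?)["']"""
    )
    # DOTALL lets .*? cross line breaks in multi-line markup
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None


name = product_field("name")
price = product_field("price")
currency = product_field("priceCurrency")
availability = product_field("availability")

# An unquoted numeric price does not match this pattern
unquoted = product_field("price", '{"@type": "Product", "price": 89.95}')
```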

Extract Review Schema

REGEX                                 OUTPUT
["']reviewCount["']: *["'](.*?)["']   Extract review count
["']ratingValue["']: *["'](.*?)["']   Extract rating value
["']bestRating["']: *["'](.*?)["']    Extract best rating

Extract Local Business & Organization Schema

REGEX                                                                 OUTPUT
["']@type["']: *["']Organization["'].*?["']name["']: *["'](.*?)["']   Extract organization name
["']streetAddress["']: *["'](.*?)["']                                 Extract the street address
["']addressLocality["']: *["'](.*?)["']                               Extract the address locality
["']addressRegion["']: *["'](.*?)["']                                 Extract the address region
["']telephone["']: *["'](.*?)["']                                     Extract the telephone number
["']sameAs["']: *\[(.*?)\]                                            Extract the “sameAs” links

Extract Article or BlogPosting Schema

REGEX                                              OUTPUT
["']headline["']: *["'](.*?)["']                   Extract article headline
["']author["'].*?["']name["']: *["'](.*?)["']      Extract author name
["']publisher["'].*?["']name["']: *["'](.*?)["']   Extract publisher name
["']datePublished["']: *["'](.*?)["']              Extract publish date
["']dateModified["']: *["'](.*?)["']               Extract modified date

The possibilities are endless. Please let me know if you want any extractions added to this list.


Published on: 2021-03-10
Updated on: 2021-03-20

Isaac Adams-Hands

Isaac Adams-Hands is an SEO Director, Full Stack Developer, and InfoSec enthusiast. He received his Bachelor’s Degree from the University of Western Sydney before holding various marketing positions at search portals, in higher education, and at addiction recovery marketing agencies.