
Screaming Frog Cheatsheet

Configure Screaming Frog settings to crawl sites, exclude URLs, search for strings, bypass noindex/nofollow, and fix 404s.

Find Lorem Ipsum text

Screaming Frog has a built-in feature for this.

  • Crawl the site
  • Go to the Content tab
  • Filter by “Lorem Ipsum Placeholder”

Exclude URLs from the crawl

When crawling a big site with Screaming Frog, the results can be overwhelming due to the sheer volume of URLs. It can be really helpful to exclude the URLs you don’t need. This tutorial shows you how to exclude URLs from the Screaming Frog crawl so it returns only the URLs relevant to your goal.

Go to Configuration > Exclude

Exclude a word from the crawl

This regex excludes any URL containing _pos, such as the ?_pos= query parameter:

.*_pos.*

Exclude a parameter

This regex excludes any URL containing a ?filter= query parameter:

.*\?filter=.*

Exclude a directory from the crawl

This pattern excludes everything under the /exclude/ directory:

https://example.com/exclude/.*

Use cases

You can exclude duplicate URLs, such as on e-commerce sites where URLs contain query parameters for size, color, and so on.
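
Before running a long crawl, it can help to sanity-check a pattern. Here is a minimal Python sketch (the URLs are made-up examples; the patterns need the leading and trailing .* because the regex is matched against the full URL):

import re

# Made-up URLs to test the exclude patterns against
urls = [
    "https://example.com/products/shirt?_pos=2",
    "https://example.com/shop?filter=blue",
    "https://example.com/exclude/old-page",
    "https://example.com/about",
]

# The three patterns from this section
patterns = [r".*_pos.*", r".*\?filter=.*", r"https://example\.com/exclude/.*"]

for url in urls:
    excluded = any(re.fullmatch(p, url) for p in patterns)
    print(("EXCLUDED" if excluded else "crawled "), url)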

Use Screaming Frog to search for strings

Being able to search an entire website and return every URL that contains an occurrence of a specific word, string, or phrase is extremely useful.

Prerequisites

Screaming Frog installed on your computer. Screaming Frog is a desktop application you download to your computer, and there is a free version.

Use Screaming Frog to find all occurrences of an HTML snippet or string

  1. Go to Configuration > Custom > Search
  2. Enter the search string (you can search for more than one string at a time)
  3. In the crawl results, find the custom search column relating to your search

Case sensitivity

By default the search is not case sensitive, so if you search for the word “lawyers” it will also pick up “Lawyers” and “LAWYERS”.

Use cases

There are many use cases, but the most practical one I use it for is locating a string of text on a website of any size, large or small.

For instance, use Screaming Frog to find “Lorem ipsum” dummy text that needs updating.
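
For a sense of what the custom search is doing, here is a minimal Python sketch of the same idea (not Screaming Frog’s actual implementation): crawl same-host pages and report the URLs containing a string, case-insensitively as described above. It assumes the requests and beautifulsoup4 packages are installed; the start URL and search string are placeholders.

from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START = "https://example.com/"  # placeholder start URL
NEEDLE = "lorem ipsum"          # placeholder search string, lowercase

host = urlparse(START).netloc
seen, queue, hits = set(), [START], []

while queue and len(seen) < 200:  # cap the sketch so it stays polite
    url = queue.pop()
    if url in seen:
        continue
    seen.add(url)
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue
    if NEEDLE in html.lower():  # case-insensitive, like the default search
        hits.append(url)
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"]).split("#")[0]
        if urlparse(link).netloc == host and link not in seen:
            queue.append(link)

print("\n".join(hits))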

Crawl a Password Protected Site

If the site has browser-level password protection (an authentication prompt), that’s no problem for Screaming Frog. It will ask you to enter the site’s username and password before crawling.
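
For comparison, a script authenticating against such a site sends the same credentials. A minimal Python sketch, with a placeholder URL and placeholder credentials:

import requests

# Placeholder URL and credentials for a site behind a browser password prompt
r = requests.get("https://staging.example.com/", auth=("username", "password"))
print(r.status_code)  # 401 until the credentials are accepted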

Cannot see the text

If Screaming Frog can see text on your website that you cannot, the text is hidden. It could be in a slideshow, an accordion, or tabs. View the HTML source of the page in your browser and search for the string in the code; you will find it there.

Other Options

Command-line tools such as curl let you search a single webpage for a string of text, but not an entire site. To search an entire site for a string of text, you will need Screaming Frog.

Fix 404s with Screaming Frog

What are 404 errors and why should I care about fixing them?

A 404 error is an HTTP status code that indicates the server couldn’t find the requested webpage. It’s a standard response code meaning that the client (usually a web browser) was able to communicate with the server, but the server could not find what was requested.
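
You can see the status code for yourself with a couple of lines of Python (the URL is a placeholder):

import requests

# A HEAD request is enough to read the status code without the body
r = requests.head("https://example.com/missing-page", allow_redirects=True)
print(r.status_code)  # prints 404 when the server can't find the page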

Benefits of fixing 404 errors:

  1. Improved User Experience: Users encountering a 404 error can be frustrated or confused. By fixing these errors and providing relevant content or redirection, you enhance user experience and keep visitors engaged on your website.
  2. Maintaining SEO Performance: 404 errors can negatively impact your website’s search engine rankings because search engine crawlers interpret them as broken links or missing content. By fixing these errors, you ensure that search engines can properly index your site, leading to better visibility and ranking.
  3. Preserving Link Equity: Broken links can disrupt the flow of link equity (or “link juice”) within your website. By redirecting or fixing broken links, you preserve the value of incoming links and distribute it effectively throughout your site.
  4. Reduced Bounce Rate: When users encounter a 404 error, they are more likely to leave your site. By minimizing these errors, you reduce your site’s bounce rate, which is a positive signal to search engines and can improve your overall website performance.
  5. Enhanced Trust and Credibility: A website that regularly maintains and fixes errors demonstrates professionalism and reliability. Users are more likely to trust and return to websites that provide a seamless browsing experience.

Overall, fixing 404 errors is essential for maintaining a healthy website that provides a positive user experience, preserves SEO performance, and builds trust with visitors and search engines alike.

How to find the 404 errors

This is arguably the hardest part about 404 errors, and it’s the heart of this tutorial.

  1. Download and install Screaming Frog
  2. Run the crawl on your site’s URL
  3. Sort by status code so the 404s appear at the top
  4. Select any of the 404s from the returned results
  5. Open the “Inlinks” tab in the bottom pane of Screaming Frog
  6. Right-click the URL in the “From” column
  7. Open it in your browser
  8. Search for the anchor text of the broken link with Cmd+F (Ctrl+F on Windows) in your browser window

(Image: finding 404s with a Screaming Frog crawl)

  1. If you can’t replicate the broken link, it could be because you are logged in. Test the broken link in a logged-out state.
  2. Try searching the HTML source code if searching for the string on the website front end returns no results.
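
If you export the inlinks of the broken URLs to a CSV, a short Python sketch can re-check which destinations still return a 404. The inlinks.csv file name and the “Source”/“Destination” column names are assumptions here; check the headers of your actual export.

import csv

import requests

# Column names are assumptions; adjust to match your Screaming Frog export
with open("inlinks.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        status = requests.head(row["Destination"], allow_redirects=True).status_code
        if status == 404:
            print(row["Source"], "->", row["Destination"], "is still broken")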

Crawl a noindex site with Screaming Frog

What are the noindex/nofollow site settings?

The “noindex” and “nofollow” site settings are directives that can be implemented in the HTML code of a webpage to instruct search engines on how to handle the page and its links:

  1. Noindex: This directive tells search engines not to index the content of the page. In other words, the page won’t appear in search engine results pages (SERPs). This is commonly used for pages that are not intended to be publicly visible or for duplicate content that shouldn’t be indexed.
  2. Nofollow: This directive tells search engines not to follow the links on the page. It means that search engine crawlers won’t pass authority or PageRank to the linked pages, and those pages won’t benefit from being linked to from the “nofollow” page.

Combining both directives, “noindex, nofollow” is often used for pages that the website owner doesn’t want to appear in search results and doesn’t want to pass authority to other pages through links on that page.

These directives are typically implemented using meta tags in the HTML code of a webpage. For example:

<meta name="robots" content="noindex, nofollow">

They can also be set with an HTTP response header, and crawling can be restricted via the robots.txt file. These directives are useful for controlling how search engines interact with specific pages or sections of a website.
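
The HTTP response header equivalent looks like this:

X-Robots-Tag: noindex, nofollow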

How to bypass the nofollow when crawling a site with Screaming Frog

  1. Go to ‘Config > Spider’
  2. Scroll down to “Crawl Behaviour”
  3. Enable ‘Follow Internal Nofollow’


How to bypass the noindex

  1. Go to ‘Config > Spider > Advanced’.
  2. Uncheck ‘Ignore Non-Indexable URLs for Issues’


Now you can crawl a site set to noindex, nofollow.

How to bypass robots.txt

Use this when crawling is blocked by the site’s robots.txt file.

  1. Go to ‘Config > Robots.txt > Settings’
  2. Choose ‘Ignore robots.txt’.
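
For reference, a robots.txt that blocks all crawlers from the whole site looks like this:

User-agent: *
Disallow: /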


Disable crawling of JavaScript-rendered pages

To disable JavaScript rendering in Screaming Frog, follow these steps:

  1. Open Screaming Frog SEO Spider: launch the application.
  2. Go to Configuration: from the top menu, click Configuration.
  3. Select Spider settings: in the drop-down menu, go to Spider….
  4. Disable JavaScript: in the Spider Configuration window, open the Rendering tab and switch rendering from ‘JavaScript’ back to plain HTML. This ensures that Screaming Frog won’t crawl the JavaScript-rendered versions of your pages and will focus only on the raw HTML.
  5. Click OK: save your settings by clicking OK.

By disabling JavaScript rendering, Screaming Frog will crawl your site in its default HTML-only mode.

Techniques for Technical SEO


Find links to PDF files

  • Run the crawl
  • View all HTML links
  • Select all HTML links
  • Select “Inlinks” from the bottom row
  • Filter by PDF

Find all images missing alt text

  • Run the crawl
  • Select the Images tab from the top panel
  • Filter by “Missing Alt Text”
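
For a sense of what the filter checks, here is a minimal single-page Python sketch (Screaming Frog does this site-wide; the URL is a placeholder and the beautifulsoup4 package is assumed):

import requests
from bs4 import BeautifulSoup

# List images whose alt attribute is missing or empty on one page
html = requests.get("https://example.com/").text
for img in BeautifulSoup(html, "html.parser").find_all("img"):
    if not img.get("alt"):
        print(img.get("src"))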


Crawl a WordPress site set to noindex/nofollow

WordPress often applies noindex/nofollow sitewide because of the “Discourage search engines from indexing this site” checkbox in its Reading Settings.

You could uncheck this box, quickly crawl the site with Screaming Frog, and re-enable it, but this is not best practice. It’s risky because you may forget to re-tick the checkbox, which could ultimately lead to your staging site being served up in the SERPs to your audience.

It’s best to leave the box alone and tweak the Screaming Frog crawl settings instead, using the nofollow, noindex, and robots.txt bypass steps outlined above.

