Robots Cheatsheet

Code examples to help you configure how search engines crawl your site.

Sitemap: https://example.com/sitemap_index.xml
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Examples:

User-agent: *
Disallow: /admin/
Disallow: /assets/components/
Disallow: /core/
Disallow: /connect/
Disallow: /index.php
Disallow: /*?
Allow: /*.js
Allow: /*.css

Host: example.com
Sitemap: http://example.com/sitemap.xml

* wildcards

The * wildcard character matches any sequence of characters, including an empty one. This is useful whenever there are clear URL patterns that you want to disallow, such as filters and parameters.

$ wildcards

The $ wildcard character is used to denote the end of a URL. This is useful for matching specific file types, such as .pdf.
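
To make the two wildcards concrete, here is a minimal Python sketch of how a crawler might test a URL path against a single rule. It assumes the common interpretation in which a rule matches from the start of the path, * matches any run of characters, and a trailing $ anchors the end of the URL. The function names pattern_to_regex and rule_matches are made up for this illustration and do not come from any particular crawler or library.

```python
import re

def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics: '*' matches any sequence of characters and a
    trailing '$' anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def rule_matches(pattern: str, url_path: str) -> bool:
    # re.match anchors at the start, so this is a prefix match on the path.
    return pattern_to_regex(pattern).match(url_path) is not None

print(rule_matches("/*?", "/shop?color=red"))           # True
print(rule_matches("/*?", "/shop"))                     # False
print(rule_matches("/*.pdf$", "/files/doc.pdf"))        # True
print(rule_matches("/*.pdf$", "/files/doc.pdf?x=1"))    # False
```

The last call shows why a $ rule stops matching once parameters are appended: the path no longer ends with .pdf.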

Block search engines from accessing any URL that has a ? in it:

User-agent: *
Disallow: /*?
Block search engines from crawling any search results page URL (query?kw=)
User-agent: *
Disallow: /query?kw=*
Block search engines from crawling any URL with the ?color= parameter in it, except for ?color=blue
User-agent: *
Disallow: /*?color
Allow: /*?color=blue
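
The Allow exception above works because crawlers that follow the Robots Exclusion Protocol (RFC 9309) apply the most specific matching rule, meaning the longest pattern, and prefer Allow when an Allow and a Disallow rule tie. The Python sketch below illustrates that precedence under those assumptions. The helper names matches and is_allowed are hypothetical, and real crawlers may differ in details.

```python
import re

def matches(pattern: str, path: str) -> bool:
    """Prefix-match a robots.txt pattern ('*' wildcard, optional trailing '$')."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in core.split("*"))
    return re.match(regex + ("$" if anchored else ""), path) is not None

def is_allowed(rules, path: str) -> bool:
    """Longest matching pattern wins; Allow wins ties; no match means allowed."""
    best_len, allowed = -1, True
    for directive, pattern in rules:
        # An empty pattern (e.g. a bare "Disallow:") restricts nothing.
        if not pattern or not matches(pattern, path):
            continue
        is_allow = directive.lower() == "allow"
        if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
            best_len, allowed = len(pattern), is_allow
    return allowed

rules = [("Disallow", "/*?color"), ("Allow", "/*?color=blue")]
print(is_allowed(rules, "/shirts?color=blue"))  # True
print(is_allowed(rules, "/shirts?color=red"))   # False
print(is_allowed(rules, "/shirts"))             # True
```

Note that /*?color only matches when color is the first parameter after the question mark; a URL such as /shirts?size=m&color=red would need an additional pattern like /*&color.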
Block search engines from crawling comment feeds in WordPress
User-agent: *
Disallow: /comments/feed/
Block search engines from crawling URLs in a common child directory
User-agent: *
Disallow: /*/child/
Block search engines from crawling URLs in a specific directory that contain three or more dashes
User-agent: *
Disallow: /directory/*-*-*-
Block search engines from crawling any URL that ends with “.pdf”. Note: if parameters are appended to the URL, this rule will not prevent crawling, because the URL no longer ends with “.pdf”.
User-agent: *
Disallow: /*.pdf$
Block access to every URL that contains a question mark "?"
User-agent: *
Disallow: /*?
This post is licensed under CC BY 4.0 by the author.