
robots.txt cheatsheet

Sitemap: https://example.com/sitemap_index.xml
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
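The WordPress rules above only work because the more specific Allow overrides the broader Disallow. A minimal Python sketch of that precedence, assuming the longest-match behaviour described in RFC 9309 and Google's documentation (the longest matching rule wins, and Allow wins a tie). The `is_allowed` helper and `wordpress_rules` list are hypothetical names for illustration, and wildcards are deliberately not handled here:

```python
def is_allowed(path, rules):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, prefix) pairs for one user-agent group.
    Only plain prefixes are handled; wildcards are out of scope here.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # Longer prefix wins; on a tie, Allow beats Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


wordpress_rules = [
    ("Disallow", "/wp-admin/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed("/wp-admin/options.php", wordpress_rules))     # False
print(is_allowed("/wp-admin/admin-ajax.php", wordpress_rules))  # True
print(is_allowed("/blog/hello-world/", wordpress_rules))        # True
```

Not every parser works this way: some apply the first matching rule in file order, so it is worth testing with the crawler you actually care about.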

examples:

User-agent: *
Disallow: /admin/
Disallow: /assets/components/
Disallow: /core/
Disallow: /connect/
Disallow: /index.php
Disallow: /*?
Allow: /*.js
Allow: /*.css

Host: example.com
Sitemap: http://example.com/sitemap.xml
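To see how a file like this breaks down, here is a rough Python sketch that splits robots.txt text into user-agent groups, sitemap URLs, and everything else. The `parse_robots` function and the `sample` text are hypothetical, written for this illustration. The sketch assumes that Sitemap lines apply to the whole file rather than to one group, and it sets aside fields such as Host, which is not defined in RFC 9309:

```python
def parse_robots(text):
    """Split robots.txt text into groups, sitemap URLs, and other fields."""
    groups = []    # each group: {"agents": [...], "rules": [(field, value)]}
    sitemaps = []  # Sitemap lines are independent of user-agent groups
    other = []     # unrecognised fields, e.g. Host
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "sitemap":
            sitemaps.append(value)
        elif field == "user-agent":
            # Consecutive User-agent lines share one group.
            if current is None or current["rules"]:
                current = {"agents": [], "rules": []}
                groups.append(current)
            current["agents"].append(value)
        elif field in ("allow", "disallow"):
            if current is not None:
                current["rules"].append((field, value))
        else:
            other.append((field, value))
    return groups, sitemaps, other


sample = """\
User-agent: *
Disallow: /admin/
Disallow: /core/

Host: example.com
Sitemap: http://example.com/sitemap.xml
"""

groups, sitemaps, other = parse_robots(sample)
print(groups[0]["agents"])  # ['*']
print(groups[0]["rules"])   # [('disallow', '/admin/'), ('disallow', '/core/')]
print(sitemaps)             # ['http://example.com/sitemap.xml']
print(other)                # [('host', 'example.com')]
```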

//wildcards

The * wildcard character will simply match any sequence of characters. This is useful whenever there are clear URL patterns that you want to disallow such as filters and parameters.

$ wildcards

The $ wildcard character is used to denote the end of a URL. This is useful for matching specific file types, such as .pdf.
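One way to get a feel for both characters is to translate a rule into a regular expression. The Python sketch below is an approximation for experimenting, not the matcher any particular crawler uses. `pattern_to_regex` and `matches` are hypothetical helper names, and the sample paths are made up:

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # A trailing '$' pins the match to the end of the URL.
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern, url_path):
    # re.match anchors at the start, like the prefix matching of rules.
    return pattern_to_regex(pattern).match(url_path) is not None


print(matches("/*?", "/shop?size=m"))                       # True
print(matches("/*?", "/shop/"))                             # False
print(matches("/*.pdf$", "/docs/guide.pdf"))                # True
print(matches("/*.pdf$", "/docs/guide.pdf?download=1"))     # False
print(matches("/*/child/", "/parent/child/page"))           # True
```

Without the trailing $, a rule is a prefix match, so `/*.pdf` would also match `/docs/guide.pdf?download=1`.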

Block search engines from accessing any URL that has a ? in it:

User-agent: *
Disallow: /*?

Block search engines from crawling any search results page (query?kw=):

User-agent: *
Disallow: /query?kw=*

Block search engines from crawling any URL with the ?color= parameter in it, except for ?color=blue:

User-agent: *
Disallow: /*?color
Allow: /*?color=blue

Block search engines from crawling comment feeds in WordPress:

User-agent: *
Disallow: /comments/feed/

Block search engines from crawling URLs in a common child directory:

User-agent: *
Disallow: /*/child/

Block search engines from crawling URLs in a specific directory which contain 3 or more dashes:

User-agent: *
Disallow: /directory/*-*-*-

Block search engines from crawling any URL that ends with “.pdf”. Note that if parameters are appended to the URL, this rule will not prevent crawling, since the URL no longer ends with “.pdf”:

User-agent: *
Disallow: /*.pdf$

Block access to every URL that contains a question mark “?”:

User-agent: *
Disallow: /*?

This post is licensed under CC BY 4.0 by the author.