
Robots.txt file

A robots.txt file is a file at the root of your site that indicates the parts of your site you don't want accessed by search engine crawlers. The file uses the Robots Exclusion Standard, a protocol with a small set of commands that can be used to indicate access to your site by section and by specific kinds of web crawlers (such as mobile crawlers versus desktop crawlers).

What is robots.txt used for?

Non-image files

For non-image files (that is, web pages), robots.txt should only be used to control crawling traffic, typically because you don't want your server to be overwhelmed by Google's crawler or to waste crawl budget crawling unimportant or similar pages on your site. You should not use robots.txt as a means to hide your web pages from Google Search results. This is because other pages might point to your page, and your page could get indexed that way, bypassing the robots.txt file. If you want to block your page from search results, use another method, such as password protection or the options found on the SEO tab of your CMS pages.
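
For example, here is a minimal entry that discourages all crawlers from a hypothetical /internal-search/ section (the path is illustrative, not a required name):

User-agent: *
Disallow: /internal-search/

Keep in mind that this only reduces crawl traffic to that section; it does not remove pages that are already indexed from search results.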


Image files

robots.txt does prevent image files from appearing in Google Search results. (However, it does not prevent other pages or users from linking to your image.)
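
For instance, assuming your images live under a hypothetical /images/ directory, an entry like the following would keep them out of Google Image Search:

User-agent: Googlebot-Image
Disallow: /images/

You can also block a single image by listing its full path, for example Disallow: /images/example-photo.jpg.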

Resource files

You can use robots.txt to block resource files such as unimportant image, script, or style files, if you think that pages loaded without these resources will not be significantly affected by the loss. However, if the absence of these resources makes the page harder for Google's crawler to understand, you should not block them; otherwise, Google won't do a good job of analyzing pages that depend on those resources.
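
As a sketch, assuming non-essential tracking scripts live in a hypothetical /scripts/tracking/ directory, you could block them like this:

User-agent: *
Disallow: /scripts/tracking/

Only do this if you are confident that pages still render and make sense to the crawler without those files.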

Understand the limitations of robots.txt

Before you build your robots.txt file, you should know the risks of this URL-blocking method. At times, you might want to consider other mechanisms to ensure your URLs are not findable on the web.

Robots.txt instructions are directives only

The instructions in robots.txt files cannot enforce crawler behavior on your site; instead, these instructions act as directives to the crawlers accessing your site. While Googlebot and other reputable web crawlers obey the instructions in a robots.txt file, other crawlers might not. Therefore, if you want to keep information secure from web crawlers, it's better to use other blocking methods, such as password-protecting private files on your server.
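
As one illustration of password protection, on an Apache server you could restrict a directory with HTTP Basic Authentication via an .htaccess file; the sketch below assumes a hypothetical password file path:

AuthType Basic
AuthName "Restricted area"
AuthUserFile /path/to/.htpasswd
Require valid-user

Other servers and CMS platforms offer equivalent options; the point is that authentication, unlike robots.txt, is actually enforced.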

Different crawlers interpret syntax differently

Although reputable web crawlers follow the directives in a robots.txt file, each crawler might interpret the directives differently. You should know the proper syntax for addressing different web crawlers, as some might not understand certain instructions.

Your robots.txt directives can’t prevent references to your URLs from other sites

While Google won't crawl or index the content blocked by robots.txt, it might still find and index a disallowed URL from other places on the web. As a result, the URL and, potentially, other publicly available information such as anchor text in links to the site can still appear in Google search results. You can stop your URL from appearing in Google Search results completely by using other URL-blocking methods, such as password-protecting the files on your server or using the noindex meta tag or response header.
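
For reference, the standard noindex directive can be placed in a page's HTML head:

<meta name="robots" content="noindex">

or sent as an HTTP response header:

X-Robots-Tag: noindex

Note that a crawler can only see a noindex directive if it is allowed to fetch the page, so the page must not also be blocked by robots.txt.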

Make your own robots.txt file

To make a robots.txt file, you need access to the root of your domain, which you can find in the File Manager module.

You can create a new robots.txt file, or edit an existing one, using the robots.txt Tester tool. This allows you to test your changes as you adjust your robots.txt file.

Learn robots.txt syntax

The simplest robots.txt file uses two keywords, User-agent and Disallow. User-agents are search engine robots (or web crawler software); most user-agents are listed in the Web Robots Database. Disallow is a command that tells the user-agent not to access a particular URL. To give Google access to a particular URL that is a child directory within a disallowed parent directory, you can use a third keyword, Allow.

Google uses several user-agents, such as Googlebot for Google Search and Googlebot-Image for Google Image Search. Most Google user-agents follow the rules you set up for Googlebot, but you can override this default and create specific rules for particular Google user-agents.
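
For example (using the syntax detailed below), the following illustrative entries block all Google crawlers from a hypothetical /archive/ directory while additionally blocking Googlebot-Image from /photos/:

User-agent: Googlebot
Disallow: /archive/

User-agent: Googlebot-Image
Disallow: /photos/

A Google crawler obeys only the most specific entry that matches it, so in this sketch Googlebot-Image would follow its own entry and ignore the Googlebot one.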

The syntax for using the keywords is as follows:

User-agent: [the name of the robot the following rule applies to]

Disallow: [the URL path you want to block]

Allow: [the URL path of a subdirectory, within a blocked parent directory, that you want to unblock]

The User-agent and Disallow lines together are considered a single entry in the file, where the Disallow rule applies only to the user-agent(s) specified above it. You can include as many entries as you want, and multiple Disallow lines can apply to multiple user-agents, all in one entry. You can make the User-agent command apply to all web crawlers by listing an asterisk (*), as in the example below:

User-agent: *

Note: If you leave the Disallow line blank, you're telling crawlers that all files on the site may be accessed.
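
Putting the pieces together, a small illustrative robots.txt file might look like this (the directory and file names are hypothetical):

User-agent: *
Disallow: /private/
Allow: /private/public-page.html

This entry blocks every crawler from the /private/ directory except for the single page explicitly unblocked with Allow.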

Save your robots.txt file

You must apply the following saving conventions so that Googlebot and other web crawlers can find and identify your robots.txt file:

You must save your robots.txt code as a text file,
You must place the file in the highest-level directory of your site (or the root of your domain), and
The robots.txt file must be named robots.txt.

As an example, a robots.txt file saved at the root of example.com, at http://www.example.com/robots.txt, can be discovered by web crawlers, but a robots.txt file at http://www.example.com/not_root/robots.txt cannot be found by any web crawler.

Test your robots.txt file

Use this free tool from Google to test your robots.txt file.

Add a robots.txt file for another domain

This is generally used when you have multiple sites or domain names that serve the same or similar content as your main site. You may want to block such a site so that no search engine can access it. It's also a good idea to do this for any beta sites that are not password protected.

The only difference for such a file is its name, which must use the following format: www.example.com-robots.txt
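
The contents of such a blocking file are simple; the following two lines disallow the entire site for all crawlers:

User-agent: *
Disallow: /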

