
How to Use Robots.txt

The robots.txt file helps control how search engines and bots access and index your website. This article explains its purpose and usage.

What is the purpose of the Robots File?
Where does robots.txt go?
How to block Robots and Search Engines from crawling
Google and Microsoft

What is the purpose of the Robots File?

When a search engine crawls your website, it usually looks for your robots.txt file first, although some bots skip this step. The file tells search engines which parts of your site they may and may not crawl, which in turn affects what they index (save and make available to the public as search results). It may also indicate the location of your XML sitemap. The search engine's "bot," "robot," or "spider" then crawls your site as directed in the robots.txt file, or stays away if you have disallowed it.
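As an illustration of how a crawler might read the sitemap location from robots.txt, here is a minimal sketch using Python's standard-library parser. The file contents and the sitemap URL are assumptions made up for the example, and `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the sitemap URL is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /tmp/
Sitemap: http://www.yoursitesdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of Sitemap URLs, or None if the file lists none.
sitemaps = parser.site_maps()
print(sitemaps)
```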

Google's bot is called Googlebot, and Microsoft's bot is called Bingbot. Many search engines, like Excite, Lycos, Alexa, and Ask Jeeves, also have their own bots. Most bots are from search engines, although sometimes other sites send out bots for various reasons. For example, some sites may ask you to put code on your website to verify you own that website, and then they send a bot to see if you put the code on your site.

Where does robots.txt go?

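The robots.txt file belongs in your document root folder, so that bots can request it from the top level of your domain (for example, http://www.yoursitesdomain.com/robots.txt).

You can simply create a blank file and name it robots.txt. This will reduce site errors caused by bots requesting a missing file and allow all search engines to crawl anything they want.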
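The sketch below shows, under stated assumptions, where a bot would look for the file and how a blank file is interpreted. The helper name `robots_txt_url` is hypothetical, and the behavior shown is that of Python's standard-library parser, not necessarily that of any particular search engine.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Bots request the file from the top level of the host.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


location = robots_txt_url("http://www.yoursitesdomain.com/junk/index.html")
print(location)

# A blank robots.txt contains no rules, so nothing is disallowed.
empty = RobotFileParser()
empty.parse([])
blank_file_allows = empty.can_fetch(
    "*", "http://www.yoursitesdomain.com/junk/index.html"
)
print(blank_file_allows)
```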

How to block Robots and Search Engines from crawling

If you want to ask all bots to stay away from your site and stop search engines from crawling it, use this code:

# Blocks all robots from the entire site
User-agent: *
Disallow: /
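To check what these rules mean for a compliant bot, the following sketch feeds them to Python's standard-library parser. The bot names passed to `can_fetch` are only examples; the wildcard rule applies to any name.

```python
from urllib.robotparser import RobotFileParser

# The block-everything rules shown above.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(BLOCK_ALL.splitlines())

# "User-agent: *" matches every bot, and "Disallow: /" covers every path.
home_allowed = parser.can_fetch("Googlebot", "http://www.yoursitesdomain.com/")
page_allowed = parser.can_fetch(
    "Bingbot", "http://www.yoursitesdomain.com/index.html"
)
print(home_allowed, page_allowed)
```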

You can also prevent robots from crawling parts of your site while allowing them to crawl other sections. The following example would request search engines and robots not to crawl the cgi-bin folder, the tmp folder, the junk folder, and everything in those folders on your website.

# Blocks robots from specific folders / directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

In the above example, http://www.yoursitesdomain.com/junk/index.html would be one of the URLs blocked, but http://www.yoursitesdomain.com/index.html and http://www.yoursitesdomain.com/someotherfolder/ would be crawlable.
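The same three URLs can be checked programmatically. This is a minimal sketch using Python's standard-library parser as a stand-in for a well-behaved bot; real crawlers may apply their own matching rules.

```python
from urllib.robotparser import RobotFileParser

# The folder-specific rules from the example above.
FOLDER_RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
"""

parser = RobotFileParser()
parser.parse(FOLDER_RULES.splitlines())

urls = [
    "http://www.yoursitesdomain.com/junk/index.html",
    "http://www.yoursitesdomain.com/index.html",
    "http://www.yoursitesdomain.com/someotherfolder/",
]

# Map each URL to True (crawlable) or False (blocked) for any bot.
results = {url: parser.can_fetch("*", url) for url in urls}
for url, allowed in results.items():
    print(url, "crawlable" if allowed else "blocked")
```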

Keep in mind that robots.txt works like a "No Trespassing" sign. It tells robots whether you want them to crawl your site or not. It does not block access. Honorable and legitimate bots will honor your directive on whether they can visit or not. Rogue bots may simply ignore robots.txt.

Google and Microsoft

Google and Microsoft do not honor every directive in the robots.txt file. Googlebot follows Allow and Disallow rules but ignores the nonstandard Crawl-delay directive. You can create Google and Microsoft accounts and configure how your domains are crawled through their webmaster tools. Read Google's official stance on the robots.txt file. You must use Google Search Console (formerly Webmaster Tools) to set most of the parameters for Googlebot.

We still recommend configuring a robots.txt file. It can reduce the rate at which compliant crawlers send requests to your site and the system resources they consume, leaving more capacity for legitimate traffic.
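For bots that honor it, a Crawl-delay line is one way to request a slower crawl rate. The sketch below shows how a compliant client might read that value. The 10-second delay is an assumed example, and Crawl-delay is a nonstandard directive that some crawlers, including Googlebot, ignore.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; the 10-second delay is an assumption for illustration.
RATE_RULES = """\
User-agent: *
Crawl-delay: 10
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(RATE_RULES.splitlines())

# crawl_delay() returns the delay in seconds for the given bot,
# or None if no Crawl-delay applies to it.
delay = parser.crawl_delay("Bingbot")
print(delay)
```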

If you would like to reduce traffic from crawlers such as Yandex or Baidu, this typically needs to be done with something like an .htaccess block.


Have another question? HostGator support is here to help; please contact us via phone or chat.
