Robots.txt File

A Good Blogger Knows About Robots.txt

What is a robots.txt file?

A robots.txt file is a plain text file containing directives for search engine bots. Before a search engine bot crawls a URL such as http://example.com, it first fetches the site's robots.txt file to check which URLs it is permitted to crawl. A robots.txt file cannot force any robot to obey it; most crawlers and spiders honor it, but some spam bots and malware bots that scan the web for security vulnerabilities ignore it. The robots.txt file plays a significant role in on-page SEO, so you need to understand it well.
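As a rough sketch of how this works on the crawler side, the following Python snippet uses the standard urllib.robotparser module to check permission before fetching a page (the URLs and the bot name MyCrawlerBot are placeholders, not part of any real crawler):

from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt file (placeholder URL).
rp = RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()  # download and parse the file

# A well-behaved crawler asks for permission before fetching a page.
url = "http://example.com/some-page.html"
if rp.can_fetch("MyCrawlerBot", url):
    print("Allowed to crawl:", url)
else:
    print("Blocked by robots.txt:", url)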

How to create a robots.txt file?

You can use any plain text editor, such as Notepad or WordPad, to create this file. Write the directive lines inside the file and save it as a text file named robots.txt. Note that the file name must be in lower case; a name such as Robots.txt will not work.

Example #1. A simple robots.txt file:

User-agent: *
Disallow:

In the example given above, the first line, User-agent: *, means the directives apply to all search engine bots. The second line, Disallow:, is left empty, so nothing is blocked and all bots are allowed to crawl and index the site.

Example #2. Disallow crawling of an entire site:

User-agent: *
Disallow: /

In example #2, the second line, Disallow: /, blocks the whole site, so no search engine bot is permitted to crawl or index it.

Example #3. To disallow a single robot:

User-agent: Googlebot
Disallow: /

In this example, all bots are permitted to crawl the site except Googlebot, which is not permitted to crawl any part of it.

Example #4. To allow a single robot only:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

In the example given above, all bots other than Googlebot are disallowed from crawling this site; only Googlebot is permitted.
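A quick way to confirm how such rules are interpreted is Python's built-in urllib.robotparser; this is only an illustrative sketch, and the bot name Bingbot and the page URL are examples:

from urllib.robotparser import RobotFileParser

# The rules from example #4: only Googlebot is allowed.
rules = [
    "User-agent: Googlebot",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "http://example.com/page.html"))  # True
print(rp.can_fetch("Bingbot", "http://example.com/page.html"))    # False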

In our next example, we will show how to prevent bots from crawling a specific directory or file on your site. Suppose your site has a directory named tutorial that contains several HTML files: abc.html, xyz.html, and 123.html. If you do not want bots to crawl that folder, or those particular files, the directives are as follows:

Example #5. Disallow crawling of a specific directory or specific files:

User-agent: *
Disallow: /tutorial/

OR

User-agent: *
Disallow: /tutorial/abc.html
Disallow: /tutorial/xyz.html
Disallow: /tutorial/123.html

Example #6. Disallow a specific directory except one file:

User-agent: *
Disallow: /tutorial/
Allow: /tutorial/xyz.html

In the example given above, the directives disallow the entire “tutorial” folder but allow the xyz.html file from that folder. (The Allow directive is an extension honored by major search engines such as Google and Bing.)

Sometimes your website may generate URLs that contain a question mark (?). This will be familiar to anyone who works with WordPress, because WordPress creates this type of URL by default. You can block access to all URLs that include a question mark (?) with the directives below (the * wildcard is an extension supported by major search engines such as Google and Bing):

User-agent: *
Disallow: /*?

Using the robots meta tag:

Meta tags are used in the head section of an HTML page. By default, even without a robots meta tag, a page is treated as “index, follow”, so search bots are permitted to index it and follow its links. You can use a robots meta tag on any page of your website to set rules for that specific page, controlling whether it may be indexed and whether its links may be followed. The robots meta tag may be used in the following ways:

<meta name="robots" content="noindex, follow">
<!-- Do not index this page; links may be followed -->
or
<meta name="robots" content="noindex, nofollow">
<!-- Do not index this page and do not follow its links -->

The main differences between the robots meta tag and the robots.txt file are:

  • The robots.txt file controls bot behavior for the entire site.
  • The robots meta tag defines rules for a single page only.

One thing worth mentioning: if you use a robots meta tag with “noindex, nofollow” and at the same time disallow the same URL in the robots.txt file, the meta tag may have no effect, because bots that obey robots.txt will never visit the page and so will never see the meta tag.

Some important points to consider when using a robots.txt file:

  • It must be placed in the root directory of your site, for example http://example.com/robots.txt.
  • The rules are case sensitive. If you have two files named file.html and File.html, the directive Disallow: /file.html will only block file.html and will have no effect on File.html, so be careful (see the sketch after this list).
  • It is good practice to have a robots.txt file on every site, even if it is empty or contains only the default directives User-agent: * and Disallow:.
  • Even when robots.txt tells bots not to crawl a specific URL, Google may still show that URL in search results if it is linked from an external source. To keep the URL out of search results completely, use a robots noindex meta tag in the head section of the page.
  • The robots.txt file is publicly available; anyone can view it to see which sections of your site you allow or disallow.
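To illustrate the case-sensitivity point above, here is a small Python sketch using the standard urllib.robotparser module (the file names are hypothetical):

from urllib.robotparser import RobotFileParser

# Hypothetical rules that block only the lower-case file name.
rules = [
    "User-agent: *",
    "Disallow: /file.html",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "http://example.com/file.html"))  # False - blocked
print(rp.can_fetch("*", "http://example.com/File.html"))  # True - not matched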

You can learn more about the robots.txt file from Google's support documentation.