Is Your Robots.txt File Doing Its Job?

First off, do you even have a robots.txt file? If not, chances are good you need one. The robots.txt file's job is to tell the search engines which files and directories they should and should not crawl. Creating it is simple and can be done with any plain-text editor such as Notepad; if you use a word processor like Microsoft Word, save the file as plain text. This handy little file is essential if you are using a CMS like WordPress, or hand-coding websites that contain private files or folders with design information.

While most search engines won't crawl or index the content of pages blocked by robots.txt, they may still index the URLs if they find them on other pages on the Web. This can result in the URL of the blocked page, and other content such as the anchor text of links to it, appearing in search results. That makes it all the more critical that sensitive pages are also protected by a login requiring a user name and password. Plus, there's also the potential that a less than reputable spider will ignore the robots.txt file altogether.

Robots.txt for WordPress

Here is an example of the contents of a robots.txt file for a WordPress site. If the WordPress install is in a subdirectory, prefix each path with that subdirectory. See the WordPress Codex for more information: http://codex.wordpress.org/Search_Engine_Optimization_for_WordPress.

User-agent: *
Allow: /
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content
Disallow: /e/
Disallow: /show-error-*
Disallow: /xmlrpc.php
Disallow: /trackback/
Disallow: /comment-page-
Allow: /wp-content/uploads/

User-agent: Mediapartners-Google
Allow: /

User-agent: Adsbot-Google
Allow: /

User-agent: AdsBot-Google-Mobile-Apps
Allow: /

User-agent: Googlebot
Allow: /

User-agent: Googlebot-Image
Allow: /

User-agent: Googlebot-Mobile
Allow: /

User-agent: Googlebot-News
Allow: /

User-agent: Googlebot-Video
Allow: /

Sitemap: http://YOUR_SITEMAP_URL
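
The wildcard group in the example mixes a broad Allow: / with narrower Disallow lines, so it helps to see how a crawler might decide between them. Below is a minimal Python sketch, assuming the crawler uses the "most specific (longest) matching rule wins, and Allow wins a tie" precedence described in RFC 9309 and in Google's documentation. The names RULES and is_allowed are hypothetical, and only a subset of the rules above is included. Some simpler parsers apply the first matching rule instead, in which case the leading Allow: / would win for every path.

```python
import re

# A subset of the "User-agent: *" group from the example above.
RULES = [
    ("allow", "/"),
    ("disallow", "/cgi-bin"),
    ("disallow", "/wp-admin"),
    ("disallow", "/wp-includes"),
    ("disallow", "/wp-content"),
    ("disallow", "/xmlrpc.php"),
    ("allow", "/wp-content/uploads/"),
]


def _to_regex(pattern):
    """Turn a robots.txt path pattern into a regex anchored at the start.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def is_allowed(path, rules=RULES):
    """Return True if the longest matching rule for `path` is an Allow."""
    best_len, best_allow = -1, True  # no matching rule means allowed
    for kind, pattern in rules:
        if _to_regex(pattern).match(path):
            allow = kind == "allow"
            # Longer pattern wins; on equal length, Allow wins.
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow


if __name__ == "__main__":
    for p in ["/about/", "/wp-admin/options.php",
              "/wp-content/themes/style.css",
              "/wp-content/uploads/photo.jpg"]:
        print(p, "allowed" if is_allowed(p) else "blocked")
```

Under that assumption, /wp-content/uploads/photo.jpg stays crawlable even though /wp-content is disallowed, because the Allow pattern is longer than the Disallow pattern that also matches it.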

Google provides a free testing tool to make sure your robots.txt file is correctly formatted. You can access it in your Google Webmaster Tools under Site configuration/Crawler access or learn more about it at Google Webmaster Tools Help.

Note: The robots.txt file belongs in the root folder of the server. If you don't have access to that, you can use the robots meta tag to provide this information to the spider.

Robots Meta Tag

You can use a special HTML tag to tell robots not to index the content of a page, and/or not to scan it for links to follow, in a manner similar to the robots.txt file. The default for the robots meta tag is INDEX,FOLLOW, so you do not need to add a tag for that. Pages only need a tag if you want to give the spider other directions, such as NOINDEX or NOFOLLOW.

By comparison, the same kind of instruction in robots.txt is written as Disallow rules such as:

Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
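
For the robots meta tag itself, a page you want kept out of the index would typically carry a tag such as <meta name="robots" content="noindex,nofollow"> in its head. As a rough way to check what a given page is telling spiders, here is a small Python sketch using the standard library's html.parser. The class RobotsMetaParser and the function robots_directives are hypothetical names chosen for illustration, and the fallback to INDEX,FOLLOW simply mirrors the default described above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated and case-insensitive.
            self.directives.update(
                part.strip().lower()
                for part in content.split(",")
                if part.strip()
            )


def robots_directives(html):
    """Return the page's robots directives as a set of lowercase strings."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # With no robots meta tag present, assume the default INDEX,FOLLOW.
    return parser.directives or {"index", "follow"}


if __name__ == "__main__":
    page = ('<html><head><meta name="robots" content="NOINDEX, NOFOLLOW">'
            '</head><body></body></html>')
    print(sorted(robots_directives(page)))
```

Keep in mind that a spider can only read this tag if it is allowed to fetch the page, so a page blocked in robots.txt will never have its meta tag seen.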

Karie Barrett

Creative Development Director at QAT Global
Karie has over 20 years of diverse marketing, design, and business experience. She is responsible for driving creative strategy and execution to develop and produce quality creative web and marketing solutions that meet internal and external clients' business objectives and goals. @KarieBarrett