Is Your Robots.txt File Doing Its Job?
First off, do you even have a robots.txt file? If not, chances are good you need one. The robots.txt file's job is to tell the search engines which files and directories they should and should not crawl. Creating it is simple and can be done with any text editor, from Notepad to Microsoft Word, as long as the file is saved as plain text. This handy little file is essential if you are using a CMS like WordPress, or hand-coding websites that contain secure files or folders with design information.
While most search engines won't crawl or index the content of pages blocked by robots.txt, they may still index the URLs if they find them on other pages on the Web. As a result, the URL of a blocked page, along with other information such as the anchor text of links pointing to it, can appear in search results. That makes it all the more critical that such pages are also protected by a login requiring a user name and password. There is also the risk that a less than reputable spider will simply ignore the robots.txt file.
Robots.txt for WordPress
Here is an example of the contents of a robots.txt file for a WordPress site. If the WordPress install is in a subdirectory, prefix each path with that subdirectory. See the WordPress Codex for more information: http://codex.wordpress.org/Search_Engine_Optimization_for_WordPress.
User-agent: *
Allow: /
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content
Disallow: /e/
Disallow: /show-error-*
Disallow: /xmlrpc.php
Disallow: /trackback/
Disallow: /comment-page-
Allow: /wp-content/uploads/

User-agent: Mediapartners-Google
Allow: /

User-agent: Adsbot-Google
Allow: /

User-agent: AdsBot-Google-Mobile-Apps
Allow: /

User-agent: Googlebot
Allow: /

User-agent: Googlebot-Image
Allow: /

User-agent: Googlebot-Mobile
Allow: /

User-agent: Googlebot-News
Allow: /

User-agent: Googlebot-Video
Allow: /

Sitemap: http://YOUR_SITEMAP_URL
Google provides a free testing tool to make sure your robots.txt file is correctly formatted. You can access it in your Google Webmaster Tools under Site configuration/Crawler access or learn more about it at Google Webmaster Tools Help.
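If you would also like a quick local check, the sketch below uses Python's standard urllib.robotparser module to ask whether a given URL is blocked. The rule set, the helper name is_allowed, and the example.com URLs are assumptions made for illustration, not part of any real site. Keep in mind that Python's parser applies the first rule that matches, in file order, while Google documents a most-specific-match rule, so the two can disagree on files that mix Allow and Disallow lines. Treat this as a rough sanity check rather than a substitute for Google's tool.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set for illustration: Disallow lines only, so that
# first-match and most-specific-match parsers give the same answer.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /xmlrpc.php
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com and both paths are placeholders.
    for path in ("/wp-admin/options.php", "/2015/01/hello-world/"):
        url = "http://example.com" + path
        print(path, is_allowed(ROBOTS_TXT, "Googlebot", url))
```

Run as a script, this prints one line per path with True or False, so you can see at a glance whether a rule catches more or less than you intended.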
Note: The robots.txt file belongs in the root folder of the server. If you don't have access to that, you can use the robots meta tag to provide this information to the spider.
Robots Meta Tag
You can use a special HTML tag to tell robots not to index the content of a page and/or not to scan it for links to follow, much as the robots.txt file does for whole directories. The default for the robots meta tag is INDEX,FOLLOW, so you do not need to add a tag for that. Pages only need a tag if you want to give the spider other directions, such as NOINDEX or NOFOLLOW. These are the page-level counterpart of robots.txt rules such as:
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
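To make the tag itself concrete, here is a minimal Python sketch that reads the robots meta tag out of a page using the standard html.parser module. The sample page, the class name RobotsMetaParser, and the helper robots_directives are hypothetical names chosen for this illustration. The tag in the sample, with name="robots" and a comma-separated content attribute, is the usual form of the robots meta tag.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values may be None.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            content = attr_map.get("content", "")
            self.directives += [
                part.strip().lower() for part in content.split(",") if part.strip()
            ]


def robots_directives(html):
    """Return the robots directives for a page as a lowercase list."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # With no tag present, fall back to the default of INDEX,FOLLOW.
    return parser.directives or ["index", "follow"]


# Hypothetical page that asks robots not to index it or follow its links.
SAMPLE_PAGE = """<html><head>
<title>Private page</title>
<meta name="robots" content="NOINDEX, NOFOLLOW">
</head><body></body></html>"""

if __name__ == "__main__":
    print(robots_directives(SAMPLE_PAGE))  # ['noindex', 'nofollow']
```

A page with no robots meta tag comes back as index, follow, which matches the default described above.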