Preventing non-www hostnames from being indexed

W3C's main web site www.w3.org is load-balanced across several servers using the excellent HAProxy load-balancing software. Before we discovered HAProxy we used round-robin DNS for a number of years, rotating www.w3.org among servers with names like web1.w3.org, web2.w3.org, etc.
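For illustration, the HAProxy side of such a setup can be quite small. Here is a minimal sketch; the frontend/backend names, the timeouts, and the use of web1/web2 as backend servers are assumptions for the example, not our actual configuration:

defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s

frontend www_frontend
    # hypothetical example: accept HTTP traffic for the site
    bind *:80
    default_backend www_servers

backend www_servers
    # rotate requests over the individual web servers
    balance roundrobin
    server web1 web1.w3.org:80 check
    server web2 web2.w3.org:80 check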

One unfortunate side effect of this practice was that these individual hostnames would get picked up by Google and other site crawlers, with undesirable consequences such as increased server load, diluted PageRank, less effective caching, and broken links when old hostnames went out of service.

We came up with a simple way to avoid this issue: we created a file called nobots.txt that forbids all crawling/indexing by bots, and started returning it instead of our usual robots.txt file when the HTTP host is not www.w3.org, using this rule in our Apache config:

RewriteCond %{HTTP_HOST} !^www\.w3\.org$ [NC]
RewriteRule ^/robots\.txt$ /nobots.txt [L]
This prevents our site from being indexed at individual hostnames that may come and go over time.
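The nobots.txt file itself is just a standard robots exclusion file that disallows everything; a minimal version amounts to:

User-agent: *
Disallow: /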

Of course, the best way to indicate a site's preferred hostname is to issue real HTTP redirects pointing to the right place; we didn't do that in this case because we wanted to keep the ability to view our site on a specific server, for example to verify that our web mirroring software is working correctly or to work around some temporary network issue. We do issue HTTP redirects in many other cases, e.g. to redirect w3.org or w3c.org to www.w3.org.
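As a sketch of that kind of redirect (the virtual-host layout shown here is an assumption for illustration, not our actual configuration), it can be done in Apache with mod_alias:

<VirtualHost *:80>
    # Hypothetical virtual host: permanently redirect the bare
    # domains to the canonical www.w3.org hostname.
    ServerName w3.org
    ServerAlias w3c.org
    Redirect permanent / http://www.w3.org/
</VirtualHost>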
