preventing non-www hostnames from being indexed

W3C’s main web site www.w3.org is load-balanced over a number of servers using the excellent HAProxy load-balancing software. Before we discovered HAProxy we used round-robin DNS for several years, with www.w3.org rotated among servers with names like web1.w3.org, web2.w3.org, etc.

One unfortunate side effect of this practice was that these individual hostnames would get picked up by Google and other site crawlers, with undesirable consequences such as increased server load, diluted pagerank, less effective caching, and broken links when old hostnames went out of service.

We came up with a simple way to avoid this issue: we created a file called nobots.txt that forbids all crawling/indexing by bots, and started returning it instead of our usual robots.txt file whenever the HTTP Host header is not www.w3.org, using this rule in our Apache config:

RewriteCond %{HTTP_HOST} !^www\.w3\.org$ [NC]
RewriteRule ^/robots\.txt$ /nobots.txt [L]
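The post doesn’t show the contents of nobots.txt, but a file that forbids all crawling would conventionally look like this:

```
User-agent: *
Disallow: /
```

Any bot that honors the Robots Exclusion Protocol will skip the entire site when served this file.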

This prevents our site from being indexed at individual hostnames that may come and go over time.

Of course, the best way to indicate a site’s preferred hostname is to issue real HTTP redirects pointing to the right place; we didn’t do that in this case because we wanted to keep the ability to view our site on a specific server, for example to verify that our web mirroring software is working correctly or to work around some temporary network issue. We do issue HTTP redirects in many other cases, e.g. to redirect w3.org or w3c.org to www.w3.org.
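A hostname redirect of the kind described above can be expressed in Apache configuration along these lines (a sketch of the general pattern, not W3C’s actual config):

```apache
<VirtualHost *:80>
    ServerName w3.org
    ServerAlias w3c.org
    # Send clients a permanent (301) redirect to the canonical hostname
    Redirect permanent / http://www.w3.org/
</VirtualHost>
```

Because the redirect is a real HTTP 301, crawlers transfer any accumulated pagerank to the canonical hostname rather than merely skipping the alternate one.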

7 Responses to preventing non-www hostnames from being indexed

  1. What other load balancing solutions are there and what is approved by w3? Round Robin DNS doesn’t sound like a good solution if one has to config the web servers to redirect crawlers. How can this be done transparently yet support checking each server for availability, etc? Is there a way to get the best of both worlds?

    • What other load balancing solutions are there and what is approved by w3?

There are many possibilities; as I wrote in the article, we are now using HAProxy with www.w3.org mapped to a single IP address.

I don’t think W3C has any guidelines on this kind of thing. (This blog post is just a description of what we have chosen to use ourselves.)

      Round Robin DNS doesn’t sound like a good solution if one has to config the web servers to redirect crawlers.

It’s possible to use round-robin DNS in a way that doesn’t require this extra step; that’s just how we happened to set things up.

      How can this be done transparently yet support checking each server for availability, etc?

That’s part of the reason we chose to do things this way: it allowed us to connect to each system on its individual hostname for convenient testing/debugging, while web robots would only index the site at its canonical hostname.

  2. There are other ways to do this to prevent indexing of a site: Google Webmaster tools, configure the domain favorites or meta robot -> noindex, nofollow. Excuse my bad English, I’m French.

    • There are other ways to do this to prevent indexing of a site: Google Webmaster tools, configure the domain favorites

      That has a number of disadvantages compared to our approach: it would only inform Google and not the dozens/hundreds of other crawlers, and it needs active maintenance over time as new hostnames are added/removed.

      or meta robot -> noindex, nofollow.

      That would prevent all our content from being indexed no matter what hostname it is served from, which would be very undesirable.
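For reference, the robots meta tag the commenter suggests goes in each page’s head; since it travels with the page content regardless of hostname, serving it site-wide would indeed block indexing everywhere:

```
<meta name="robots" content="noindex, nofollow">
```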

      • It should work this way but I’m not sure about that. Google can crawl the page anyway.

        I like using the .htaccess method but it’s best suited for specific file extensions:

        Header set X-Robots-Tag "noindex, nofollow"

        • I like using the .htaccess method but it’s best suited for specific file extensions:

          Header set X-Robots-Tag "noindex, nofollow"

          That might be useful in other situations but not for our use case described in the blog post — we would just end up with bots crawling the entire site on the wrong hostname for no reason.
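As a side note on the X-Robots-Tag approach discussed in this thread: limiting the header to specific file extensions is typically done with a FilesMatch block in .htaccess (a generic sketch; the extensions here are just examples):

```apache
# Ask crawlers not to index matching file types (PDFs, Word docs, archives)
<FilesMatch "\.(pdf|doc|zip)$">
    Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
```

This requires mod_headers to be enabled, and unlike robots.txt it lets bots fetch the files; it only asks them not to index what they fetched.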
