preventing non-www hostnames from being indexed

By: Gerald Oskoboiny

Published: 23 March 2012

W3C's main web site www.w3.org is load-balanced over a number of servers using the excellent HAProxy load balancing software. Before we discovered HAProxy we used round-robin DNS for a number of years, with www.w3.org rotated among a number of servers with names like web1.w3.org, web2.w3.org, etc.

One unfortunate side effect of this practice was that these individual hostnames would get picked up by Google and other site crawlers, with undesirable consequences such as increased server load, diluted pagerank, less effective caching, and broken links when old hostnames went out of service.

We came up with a simple way to avoid this issue: we created a file called nobots.txt that forbids all crawling/indexing by bots, and started returning it instead of our usual robots.txt file when the HTTP host is not www.w3.org, using this rule in our Apache config:

RewriteCond %{HTTP_HOST} !^www\.w3\.org$ [NC]
RewriteRule ^/robots.txt$ /nobots.txt [L]

This prevents our site from being indexed at individual hostnames that may come and go over time.
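
For readers who can only edit a per-directory .htaccess file, a rough equivalent might look like the sketch below. It assumes mod_rewrite is enabled and that overrides are allowed; in .htaccess context the matched path has no leading slash. The nobots.txt contents shown in the comments are an assumption (the standard deny-all form), not a copy of W3C's actual file.

```apache
# Assumed contents of nobots.txt (standard "disallow everything" rules):
#   User-agent: *
#   Disallow: /

RewriteEngine On

# Any Host header other than www.w3.org (case-insensitive)...
RewriteCond %{HTTP_HOST} !^www\.w3\.org$ [NC]
# ...gets nobots.txt in place of robots.txt. No leading slash in the
# pattern here, because .htaccess rules match a directory-relative path.
RewriteRule ^robots\.txt$ /nobots.txt [L]
```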

Of course, the best way to indicate a site's preferred hostname is to issue real HTTP redirects pointing to the right place; we didn't do that in this case because we wanted to keep the ability to view our site on a specific server, for example to verify that our web mirroring software is working correctly or to work around some temporary network issue. We do issue HTTP redirects in many other cases, e.g. to redirect w3.org or w3c.org to www.w3.org.
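
A minimal sketch of such a redirect in the same mod_rewrite style is shown below. It assumes server-level configuration (where the path keeps its leading slash); the 301 status and the http target scheme are assumptions for illustration, not W3C's actual rules.

```apache
RewriteEngine On

# Requests for the bare w3.org or w3c.org hostnames (case-insensitive)...
RewriteCond %{HTTP_HOST} ^w3c?\.org$ [NC]
# ...are redirected to the same path on the canonical hostname.
RewriteRule ^/(.*)$ http://www.w3.org/$1 [R=301,L]
```

Unlike the robots.txt approach, this makes the non-canonical hostnames unusable for direct viewing, which is the trade-off described above.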

Comments (13)

Mathias Bynens - 23 March 2012 at 11:18:50 UTC

For an example of how to issue HTTP redirects to the canonical hostname, see Removing ‘www.’ from your URI.
Charlie Hendricks - 30 January 2013 at 00:43:02 UTC

What other load balancing solutions are there and what is approved by w3? Round Robin DNS doesn't sound like a good solution if one has to config the web servers to redirect crawlers. How can this be done transparently yet support checking each server for availability, etc? Is there a way to get the best of both worlds?
- Gerald Oskoboiny - 31 January 2013 at 21:11:56 UTC
  
  > What other load balancing solutions are there and what is approved by w3?
  
  There are many possibilities; as I wrote in the article, we are now using HAProxy with www.w3.org mapped to a single IP address.
  
  I don't think W3C has any guidelines on this kind of thing. (This blog is just a description of what we have chosen to use ourselves.)
  
  > Round Robin DNS doesn’t sound like a good solution if one has to config the web servers to redirect crawlers.
  
  It's possible to use round robin DNS in a way that doesn't require this extra step; that's just how we happened to set things up.
  
  > How can this be done transparently yet support checking each server for availability, etc?
  
  That's part of the reason we chose to do things this way: it allowed us to connect to each system on its individual hostname for convenient testing/debugging, while web robots would only index the site at its canonical hostname.
yves - 9 February 2013 at 08:26:14 UTC

There are other ways to prevent indexing of a site: Google Webmaster tools, configure the domain favorites or meta robot -> noindex, nofollow. Excuse my bad English, I'm French.
- Gerald Oskoboiny - 11 February 2013 at 18:33:09 UTC
  
  > There are other ways to do this to prevent indexing of a site: Google Webmaster tools, configure the domain favorites
  
  That has a number of disadvantages compared to our approach: it would only inform Google and not the dozens/hundreds of other crawlers, and it needs active maintenance over time as new hostnames are added/removed.
  
  > or meta robot -> noindex, nofollow.
  
  That would prevent all our content from being indexed no matter what hostname it is served from, which would be very undesirable.
- LIJE Creative - 14 February 2013 at 07:11:47 UTC
  
  It should work this way, but I'm not sure about that. Google can crawl the page anyway.
  
  I like using the .htaccess method, but it's best suited to specific file extensions:
  
  Header set X-Robots-Tag "noindex, nofollow"
- Gerald Oskoboiny - 14 February 2013 at 18:52:38 UTC
  
  > I like using the .htaccess method but it’s most suited for specific file extension:
  >
  > Header set X-Robots-Tag "noindex, nofollow"
  
  That might be useful in other situations but not for our use case described in the blog post; we would just end up with bots crawling the entire site on the wrong hostname for no reason.
MyGoodSite - 1 May 2020 at 15:26:45 UTC

Hello, the subject is old, but today if you don't have a robots.txt file, Google doesn't crawl and index the site. Regards.
- Gerald Oskoboiny - 1 May 2020 at 16:29:18 UTC
  
  We do have a robots.txt. This config is used to return different contents for robots.txt depending on the HTTP Host header.
Cihan - 2 May 2020 at 14:01:04 UTC

I have a site. It shows up as ns1.domain.com/* and ns2.domain.com/* in Google search (when I search site:domain.com). How can I prevent ns1 and ns2 (the nameserver hostnames) from being indexed?
- Gerald Oskoboiny - 2 May 2020 at 17:32:48 UTC
  
  If you have access to the server's configuration you can do as we did, documented above. If you are only concerned about Google, you may be able to tell it to ignore ns1 and ns2 using Google's search console.
- CIHAN - 2 May 2020 at 18:57:13 UTC
  
  How can I ignore them with Search Console? I cannot find it.
- CIHAN - 2 May 2020 at 19:13:52 UTC
  
  I think I found it. URL: https://search.google.com/search-console/removals?

Comments for this post are closed.
