Preventing non-www hostnames from being indexed
W3C's main web site, www.w3.org, is load-balanced over a number of servers using the excellent HAProxy load balancing software. Before we discovered HAProxy we used round-robin DNS for a number of years, with www.w3.org rotated among a number of servers with names like web1.w3.org, web2.w3.org, etc.
One unfortunate side effect of this practice was that these individual hostnames would get picked up by Google and other site crawlers, with undesirable consequences such as increased server load, diluted pagerank, less effective caching, and broken links when old hostnames went out of service.
We came up with a simple way to avoid this issue: we created a file called nobots.txt that forbids all crawling/indexing by bots, and started returning it instead of our usual robots.txt file when the HTTP host is not www.w3.org, using this rule in our Apache config:
RewriteCond %{HTTP_HOST} !^www\.w3\.org$ [NC]
RewriteRule ^/robots\.txt$ /nobots.txt [L]
This prevents our site from being indexed at individual hostnames that may come and go over time.
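A file that forbids all crawling normally contains just the two lines `User-agent: *` and `Disallow: /`. To show where the rule might sit, here is a minimal sketch of a virtual host that answers for both the canonical name and the per-server names. The `<VirtualHost>` layout and the `ServerAlias` list are assumptions for illustration, not W3C's actual configuration:

```apache
# Hypothetical layout: one virtual host serving the canonical hostname
# and the individual server hostnames.
<VirtualHost *:80>
    ServerName www.w3.org
    ServerAlias web1.w3.org web2.w3.org

    RewriteEngine On

    # True for any Host header other than www.w3.org (case-insensitive).
    RewriteCond %{HTTP_HOST} !^www\.w3\.org$ [NC]
    # Internally serve nobots.txt whenever /robots.txt is requested.
    RewriteRule ^/robots\.txt$ /nobots.txt [L]
</VirtualHost>
```

Because this is an internal rewrite rather than a redirect, a crawler asking web1.w3.org for /robots.txt simply receives the disallow-all contents, while every other URL on that hostname keeps working for people testing a specific server.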
Of course, the best way to indicate a site's preferred hostname is to issue real HTTP redirects pointing to the right place; we didn't do that in this case because we wanted to keep the ability to view our site on a specific server, for example to verify that our web mirroring software is working correctly or to work around some temporary network issue. We do issue HTTP redirects in many other cases, e.g. to redirect w3.org or w3c.org to www.w3.org.
For an example of how to issue HTTP redirects to the canonical hostname, see Removing ‘www.’ from your URI.
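As a rough illustration of the redirect case, the following mod_rewrite sketch sends requests for the bare hostnames to the canonical one. It is an assumption about how such a rule could look, not the configuration W3C actually uses; the `https` scheme and the permanent (301) status are also assumptions:

```apache
RewriteEngine On

# Match the bare hostnames w3.org and w3c.org, case-insensitively.
RewriteCond %{HTTP_HOST} ^(w3|w3c)\.org$ [NC]
# Redirect to the same path on the canonical hostname;
# the query string is carried over by default.
RewriteRule ^/(.*)$ https://www.w3.org/$1 [R=301,L]
```

Unlike the robots.txt rule, this changes the URL the visitor ends up on, which is why it is not suitable for hostnames that need to stay directly reachable for testing.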
What other load balancing solutions are there and what is approved by w3? Round Robin DNS doesn't sound like a good solution if one has to config the web servers to redirect crawlers. How can this be done transparently yet support checking each server for availability, etc? Is there a way to get the best of both worlds?
There are many possibilities; as I wrote in the article, we are now using HAProxy with www.w3.org mapped to a single IP address. I don't think W3C has any guidelines on this kind of thing (this blog is just a description of what we have chosen to use ourselves).
It's possible to use round robin DNS in a way that doesn't require this extra step; that's just how we happened to set things up.
That's part of the reason we chose to do things this way: it allowed us to connect to each system by its individual hostname for convenient testing/debugging, while web robots would only index the site at its canonical hostname.
There are other ways to prevent indexing of a site: Google Webmaster Tools, configure the domain favorites or meta robot -> noindex, nofollow. Excuse my bad English, I'm French.
That has a number of disadvantages compared to our approach: it would only inform Google and not the dozens/hundreds of other crawlers, and it needs active maintenance over time as new hostnames are added/removed.
That would prevent all our content from being indexed no matter what hostname it is served from, which would be very undesirable.
It should work this way but I'm not sure about that. Google can crawl the page anyway.
I like using the .htaccess method but it's most suited for specific file extensions:
Header set X-Robots-Tag "noindex, nofollow"
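For reference, scoping that header to particular file types usually means wrapping it in a `<FilesMatch>` block. A minimal sketch, assuming mod_headers is enabled and using PDF and Word documents as hypothetical examples (the extensions are not from the comment above):

```apache
# Requires mod_headers. The extension list is only an example.
<FilesMatch "\.(pdf|docx?)$">
    # Ask crawlers not to index matching files or follow links in them.
    Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
```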
That might be useful in other situations but not for our use case described in the blog post — we would just end up with bots crawling the entire site on the wrong hostname for no reason.
Hello, the subject is old but today if you don't have robots.txt files Google doesn't crawl and index it. regards.
We do have a robots.txt. This config is used to return different contents for robots.txt depending on the HTTP Host header.
i have a site. it shows up as ns1.domain.com/* and ns2.domain.com/* in google search (when i search site:domain.com). how can i prevent ns1 and ns2 (nameserver) from being indexed?
If you have access to the server's configuration you can do as we did, documented above. If you are only concerned about Google, you may be able to tell it to ignore ns1 and ns2 using Google's search console.
how can i ignore them with search console? i can not find it.
i think i found it. url: https://search.google.com/search-console/removals?