Making your website valid: a step by step guide.

Abstract

In this article we imagine a situation in which a webmaster wishes to make an entire website compliant with web standards (valid (X)HTML, valid CSS, etc.). The article describes the usual ways to approach this problem, and suggests a painless alternative using a new tool developed by the W3C's QA Activity.

Status

This article has been produced as part of the W3C Quality Assurance Interest Group work. Please send any public feedback on it to the publicly archived mailing list public-evangelist@w3.org or for private feedback to ot@w3.org.

This document has been translated into other languages.

Improving an existing site: a difficult decision

Creating a Web site that complies with standards such as HTML, CSS, or the Web Accessibility Guidelines is the right thing to do, and is also a profitable choice.

Guidelines and tools are readily available to help you create a Web site that conforms to Web standards, ensuring a broad audience, cost-effective development, and easier maintenance.

But deciding how to convert an existing site to a standards-compliant format is difficult. Your site may contain legacy, unmaintained documents in multiple formats, or may serve a large number of documents, making it hard to update. Your site may be backed by good design and flexible technologies, which will simplify the task; in any case, updating the site will require a resource commitment.

However, the method you choose determines how many resources you will need to dedicate, and how you will dedicate them.

There are two typical ways to make an existing Web site standards compliant: start completely over (the wrong way), or manually validate each page (the hard way). For IT managers, neither is very attractive, which makes the decision to switch to a standards-compliant site difficult: it simply does not seem worthwhile given the amount of work needed.

After looking in detail at these two approaches (and analyzing why they fall short), we will see a third, better one: systematically update one section at a time.

The wrong way: Re-starting from scratch

The wrong way to improve the quality of an existing site is to delete everything and restart the site from scratch.

This approach may be tempting for the freedom it allows and the opportunity to start with a clean framework. However, in addition to the cost of fully redesigning, rewriting, and debugging the site, starting over may create more problems than it solves, beginning with broken links.

The Hard Way: The whole works

The usual way is also the hard way: the site administrator lists all available resources (provided the technologies used make this feasible) and runs them, either one by one or in batch, through "validating" technologies such as HTML validation, CSS validation, and spell checking, or through corrective filters (such as HTML Tidy).

This approach has many advantages, and it does not carry the specific risks of the previous method. However, especially for sites with thousands of documents, it requires an enormous amount of work and cannot be completed without excellent organization. Just figuring out where to start is itself a tricky question when it comes to checking a full site.

A suggested alternative

There may be no perfect way to fix a whole site, but some ways are better, or easier, than others. Using the tools introduced below, we will explain a relatively easy method that we believe is good. This method has its limits, unfortunately: it is best used with static content, or with dynamic/generated content if you have control over the templates.
If you do not have control over the templates and they produce invalid markup, we encourage you to send a bug report to the software vendor, or to the service provider managing your content.

Step by Step approach

"The Hard Way" would certainly be the best method to fix an existing site for someone with unlimited resources dedicated to the task. In the real world, unless the site is very small, this approach is not realistic, unless you make the process gradual and ordered.

With careful planning and an extended time-line, you can eventually clean up the site. However, this process requires careful management, so that a given number of files are cleaned up at regular intervals and all new resources are valid.

Do the math

The number of resources you will clean up during each period depends upon the volume of content (and the ratio of invalid documents). When allocating resources, ask yourself how many documents you serve, what proportion of them is likely to be invalid, and how much time you can commit per period.

No deadline?

We have not yet mentioned any deadline for this cleanup work. In most cases you probably have no idea of the initial ratio of invalid content, and you may not even know how many documents you have. Without this information, how can you estimate how long the work will take?

Of course, like every project, this cleanup project needs limits and deadlines. One limit you can set before starting is: "what is the acceptable ratio of invalid documents for my site?" If you have a small or moderate-sized site, "zero" may be your answer; for a big site, we suggest a more modest figure, 10% for example.

Once you have set the limit and dedicated resources to the cleanup project, the first few rounds of the step-by-step method will give you an idea of how long it will take to reach that limit. You can then reconsider the amount of time dedicated, or your target "quality ratio", if necessary.
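To make "how long will it take" concrete, here is a rough sketch of the arithmetic. The function name and every figure in it are hypothetical placeholders, not part of the LogValidator; plug in the numbers your first rounds reveal.

```python
# Back-of-the-envelope estimate of how many cleanup rounds remain.
# All names and figures here are hypothetical placeholders.

def rounds_needed(invalid_docs, target_ratio, total_docs, fixed_per_round):
    """Rounds until the invalid ratio drops to the acceptable target."""
    to_fix = invalid_docs - target_ratio * total_docs
    if to_fix <= 0:
        return 0
    return -(-int(to_fix) // fixed_per_round)  # ceiling division

# Example: roughly 300 invalid documents out of 1000, target ratio 10%,
# 25 documents fixed per round.
print(rounds_needed(300, 0.10, 1000, 25))  # 8
```

With those (made-up) numbers, eight rounds of 25 fixes bring the site under the 10% target.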

Traffic-based approach

Here is a simple example to explain the traffic-based approach.

Imagine you have 4 documents on a site (call them 1, 2, 3 and 4), accounting for, respectively, 40%, 30%, 20% and 10% of the site's traffic.

Now imagine that documents 1 and 4 are invalid. That's 50% of the documents and 50% of the traffic, which is bad. If you have time to fix both documents, fine; but what if you only have time to fix one?

The usual approach would be to fix either one, so that only 25% of the documents are invalid. The traffic-based approach tells you to choose document 1: fix it, and 90% of your traffic is now valid.

This is a cost-efficient approach to the problem: given a limited amount of resources, you want to focus on the improvements that will have the greatest effect.
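The reasoning above can be sketched in a few lines. The document names and traffic shares are the ones from the example; the function itself is only an illustration, not part of any tool.

```python
# Sketch of traffic-based prioritization, using the four-document
# example above. Everything here is illustrative.

def pick_next_fix(traffic_share, invalid):
    """Return the invalid document carrying the most traffic."""
    return max(invalid, key=lambda doc: traffic_share[doc])

traffic_share = {"doc1": 0.40, "doc2": 0.30, "doc3": 0.20, "doc4": 0.10}
invalid = {"doc1", "doc4"}

first = pick_next_fix(traffic_share, invalid)
# Traffic served by valid documents once `first` is fixed:
valid_traffic_after = 1.0 - sum(traffic_share[d] for d in invalid - {first})
print(first)                 # doc1
print(valid_traffic_after)   # 0.9
```

Fixing document 1 first leaves only document 4 (10% of traffic) invalid, hence the 90% figure.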

Estimating the Quality of a site using the Traffic approach

The traffic-based approach is also a more accurate way to estimate the quality of a website. As we will see in the following section, given a site (with an unknown number of documents served, but known logs for a given period of time), the LogValidator sorts the documents served during this period by popularity (traffic), then tries to find X invalid documents among the most popular ones.

Now, let's imagine a case where 100 documents have been served. The tool needs to go through 20 documents to find 2 (we set X=2 for this example) that are invalid HTML documents. These 20 documents account for 45% of the traffic. The table below gives estimates of the quality of the site (with regard to HTML validity) with a "file approach" and with a "traffic approach".

                                     Using the file approach   Using the traffic approach
                                     Lower      Upper          Lower      Upper
Before validating the 2 documents    18%        98%            40.5%      95.5%
After validating the 2 documents     20%        100%           45%        100%

(Traffic-based estimates before validation: lower = (45*18/20)% = 40.5%;
upper = (45*18/20)% + 55% = 95.5%.)

The file-based estimates are loose and inaccurate, whereas the traffic-based estimates are much tighter. Once you have fixed the 2 documents and restart the process, the traffic-based estimates become more accurate still (and higher, since more and more of the traffic is valid!).
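The estimates in the table follow directly from the figures in the example; here is the arithmetic spelled out. The variable names are ours, the numbers are the article's.

```python
# Reproducing the estimates from the table above: 100 documents served,
# the 20 most popular (45% of the traffic) were scanned, 2 were invalid.

scanned, invalid, total_docs = 20, 2, 100
scanned_traffic = 0.45
valid_scanned = scanned - invalid  # 18

# File-based estimates: only the scanned documents are known.
file_lower = valid_scanned / total_docs                             # 18%
file_upper = (valid_scanned + (total_docs - scanned)) / total_docs  # 98%

# Traffic-based estimates: weight the scanned traffic share by its
# validity ratio; the upper bound adds the unscanned 55% of traffic.
traffic_lower = scanned_traffic * valid_scanned / scanned  # ~40.5%
traffic_upper = traffic_lower + (1 - scanned_traffic)      # ~95.5%

print(file_lower, file_upper, traffic_lower, traffic_upper)
```

The file-based bounds span 80 percentage points; the traffic-based bounds span only 55, and the gap shrinks further with each round.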

Practical Case: Using the LogValidator and other tools to cleanup your site's markup

Here we describe a practical example of this "cleanup strategy", using a limited set of free tools to validate a Web site's HTML. As stated before, HTML is just an example; you can use the techniques described (and some of the tools) in many other cases.

Get the tools

The LogValidator will be the primary (if not the only) tool you will need. You can download it freely and install it on any system running Perl (which your Web server almost certainly does).

You will also need a few other components that the LogValidator depends on to run smoothly. They can all be downloaded and installed free of charge.

If you are not an HTML expert and cleaning up code is not your hobby, you can use HTML Tidy to do it for you. It is a (semi-)automatic markup cleanup tool, available for many platforms.

The LogValidator checks your documents through W3C's online Markup Validator. If you have a big site, or want to save bandwidth, you can also install the validator locally.
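As an aside, you can also query the Markup Validator directly over HTTP: its check endpoint reports the verdict in an X-W3C-Validator-Status response header. A minimal Python sketch (the function name and User-Agent string are our own inventions):

```python
# Minimal sketch: ask the W3C Markup Validator about a single URL.
# The validator's "check" endpoint returns its verdict in the
# X-W3C-Validator-Status response header ("Valid", "Invalid" or "Abort").
from urllib.parse import urlencode
from urllib.request import Request, urlopen

VALIDATOR = "https://validator.w3.org/check"

def check_url(page_url):
    """Return the validator's verdict for page_url (makes a network call)."""
    req = Request(VALIDATOR + "?" + urlencode({"uri": page_url}),
                  headers={"User-Agent": "site-cleanup-sketch"})
    with urlopen(req) as resp:
        return resp.headers.get("X-W3C-Validator-Status", "Unknown")

# Usage (performs a real HTTP request):
# print(check_url("https://www.w3.org/"))
```

Be considerate with such scripts: batch-checking a large site is exactly why installing the validator locally is recommended above.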

Running the LogValidator

We will assume that you have installed at least the LogValidator, and that you have read its Manual carefully.

You first need to set up a configuration file to match your server configuration. To do so you (mainly) need access to a log file for your Web server (it will be used to compute traffic statistics). You can easily create the configuration file by copying the sample configuration file distributed with the tool and editing it as explained in the Manual.
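If you are curious what the tool does with that log file, here is the gist in a few lines of Python: ranking requested paths by hit count. The log lines below are made up for the example, and the real LogValidator does considerably more (filtering, validation, reporting).

```python
# Sketch: rank URLs by popularity from Common Log Format access-log
# lines, conceptually what the LogValidator does with your server logs.
from collections import Counter

def popularity(log_lines):
    """Count requests per path from Common Log Format lines."""
    hits = Counter()
    for line in log_lines:
        try:
            request = line.split('"')[1]  # e.g. 'GET /index.html HTTP/1.0'
            path = request.split()[1]
        except IndexError:
            continue                      # skip malformed lines
        hits[path] += 1
    return hits.most_common()

sample = [
    '127.0.0.1 - - [24/Jun/2002:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 512',
    '127.0.0.1 - - [24/Jun/2002:10:00:01 +0000] "GET /news.html HTTP/1.0" 200 1024',
    '127.0.0.1 - - [24/Jun/2002:10:00:02 +0000] "GET /index.html HTTP/1.0" 200 512',
]
print(popularity(sample))  # [('/index.html', 2), ('/news.html', 1)]
```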

Once this is done, you can run the LogValidator. Don't set the number of results too high; 10 should be enough to begin with.

You should get back a list of your 10 most popular invalid documents. Take some time to analyze them. You can run them through the Markup Validator to see where the bad HTML is. If you are using templates, does it seem like something is wrong with them? Can you check the templates with the validator?

Next, fix the first documents on the list. Remember, these are the most popular documents on your site that are not valid, so this is an important step! It may also be a difficult one, especially if "big" documents are on the list. HTML Tidy can help clean up your code, and you can search the Web for guidelines on fixing Web pages and for people to assist you.

For example, if you don't understand the output of the validator, check out its documentation or contact the public list www-validator@w3.org.

Done? Congratulations! You can now set up the LogValidator to run every week, day, or month (see the tip to do this), and start again with other documents...

Keep up the good work. If you have a really big site made of static documents, chances are you won't reach 100% of valid pages, but that's OK. After some time, the invalid pages that are left will account for a tiny portion of your site.

Credits

Thanks a lot to Kim Nylander for a thorough review of this document and many invaluable suggestions.
Thanks to Karl Dubost and Dominique Hazael-Massieux, W3C, for their comments and suggestions.

Contact

Olivier Thereaux, W3C : <ot@w3.org>


Created Date: 2002-06-24 by Olivier Thereaux
Last modified $Date: 2011/12/16 02:57:04 $ by $Author: gerald $

Copyright © 2000-2003 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements.