Rene Saarsoo has published a survey of Coding practices of Web pages. It contains a lot of very useful information for those trying to understand how the Web is authored in the wild. One of the major concerns of the HTML WG is to design HTML 5 so that it is mostly compatible with what authors actually do on the Web.
It is not an easy task. There are different types of authors on the Web, and different types of requirements for different products. A while ago, I posted to the mailing list, trying to work out some of the possible categories of products.
Web author (hand coding)
From the point of view of the author, HTML is a set of tags with a clearly defined meaning (e.g. ‘q’) or functional semantics (e.g. ‘a’). Sometimes the definitions given by previous specifications, books, and tutorials lead to misunderstandings, and the features are then not properly used. There are many categories of HTML hand coders, with different capabilities and knowledge. Some authors will see HTML just as a support for CSS, for example, and do not care that much about the meaning. Some will be very precise and be frustrated by the lack of defined elements.
Web author (wysiwyg)
CMS developer, scripting libraries
HTML is a language that, in the best case, has some nesting rules for tags and helps to put content on a Web page. It is something for putting bits of content coming from a database on the Web. It is very rare that the semantics are understood or even cared about. It is very rare for a CMS to include a quality process in the publishing step. Their conception is more HTML fragment than document.
Web authoring Wysiwyg tool
HTML is a very difficult thing to implement. Past specifications were not written with Wysiwyg tools in mind. The tools had to produce a document which respects the syntactic rules of the language, but there is little or no guidance on implementing the language at the UI level. Right now, we have a tendency to define a lot more how to render and not that much how to create.
Web Visual Browser
Assistive Technologies Browser
They see HTML as a powerful language for giving easy access to content to people who had no access to it in the past. Giving someone who is blind access to a paper book has a high cost; it becomes easy on the Web. It is nevertheless difficult to implement a useful tool, because not many Web authors and CMSs care about accessibility. So the people using these browsers fill the gap when they can, using their own skills and intelligence.
Strange world. It is not a uniform world. There are at least two big sub-classes:
Web search services (Yahoo!, MS Live, Google and Quaero)
They need to parse Web content, which is not only HTML and which is mostly a few tags and a lot of content. They are interested in links and some of the meaningful tags, but not that much.
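To make this concrete, here is a minimal sketch in Python of that kind of processing: keep the links and the text, ignore almost all of the markup. It uses the standard library `html.parser` module. The names `LinkExtractor` and `extract` are made up for this illustration, and nothing here describes how any real search service actually works.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collects link targets and visible text only."""

    def __init__(self):
        super().__init__()
        self.links = []       # href values found on <a> tags
        self.text_parts = []  # non-empty chunks of character data

    def handle_starttag(self, tag, attrs):
        # Only <a href="..."> is of interest; every other tag is ignored.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def extract(html):
    """Return (list of hrefs, text joined with single spaces)."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of input
    return parser.links, " ".join(parser.text_parts)
```

Note that the parser does not complain about unclosed tags, which matters when the input is whatever authors happened to publish.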
Web search engines (ht://Dig, Nutch, etc.)
More skilled and more powerful, they are used on corporate, academic, and personal Web sites. They are crafted to index all kinds of metadata and semantics. For them, HTML is a fully meaningful language. It helps users get a more precise answer within the context of a corporate site. Initiatives like explicit data (RDFa, microformats), metadata in head, etc. are very important for them. Some of these engines work on the desktop and so are a tool for desktop users (Apple's Spotlight, for example).
Validators, Conformance checkers, Helping tools
HTML is a set of rules and definitions that help to determine whether a document is in contradiction with these rules. Some of the rules can be checked easily by a machine; others are a lot more difficult.
HTML is a set of rules and syntactic constraints with defined semantics that can be used by, or encapsulated in, another technology.
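As an illustration of the easy end of that spectrum, here is a small sketch in Python, using the standard library `html.parser`, that checks one rule: every `img` element carries an `alt` attribute, as HTML 4 requires. The names `AltChecker` and `check` are hypothetical and this is not how any existing validator is built. Whether the `alt` text is actually a useful description of the image is the kind of rule a machine cannot check this way.

```python
from html.parser import HTMLParser


class AltChecker(HTMLParser):
    """Hypothetical checker for a single machine-checkable rule."""

    def __init__(self):
        super().__init__()
        self.errors = []

    def handle_starttag(self, tag, attrs):
        # Presence of the attribute is easy to test; its quality is not.
        if tag == "img" and "alt" not in dict(attrs):
            line, _ = self.getpos()
            self.errors.append(f"line {line}: <img> without alt attribute")


def check(html):
    """Return a list of error messages, empty if the rule holds."""
    checker = AltChecker()
    checker.feed(html)
    checker.close()
    return checker.errors
```

An empty `alt=""` passes this check, which is the point: the syntactic rule is satisfied, and judging the rest needs a human.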