Separation of semantic and presentational markup, to the extent possible, is architecturally sound

1. What is the problem?

The problem is often stated imprecisely, or as a solution rather than as a problem statement, and this has led to confusion in past discussions. One common statement is "separate structure from presentation". This sounds good, but it accidentally implies that all presentation is unstructured, does not say what separation means, and mentions neither a benefit gained nor a danger avoided. A related statement is "separate content from presentation", which implies that presentational resources (images, sounds, scripts) are not content - a view with which server administrators would disagree. Another common statement is that semantics should be separate from presentation. Again, this provides no justification, and it implies that 'semantics' - meaning - is the preserve of one sort of information.

Having looked at some imprecise solution statements, what is the problem statement? The actual problem seems to be that some content embodies inappropriate design choices that limit restylability. Communities that depend on styling very different from that chosen are thus unable to use the content.

2. Content and Presentation

Sometimes inconsistent choices are made that muddle content and presentation. As an example, consider documentation about drug interactions and side effects being moved from a paper-based to an electronic, online format. Suppose that severe drug interactions had always been printed in red. Here is an example of poor design:

<!ATTLIST Drug Interaction (none | mild | moderate | red) #REQUIRED>

When these documents are presented on a monochrome display, what happens? Here is the same example but with a more thoughtful design:

<!ATTLIST Drug Interaction (none | mild | moderate | severe) #REQUIRED>
Drug[Interaction="severe"] { color: red}

The information has been modeled at a consistent level of abstraction, and the traditional presentation is indicated by styling which - although specific to the markup vocabulary in use - has been separated from the content. Different styling can be used on the same information when it is presented on a monochrome device (for example, bold) or using speech output (for example, a shift in speech stress, or an audio icon).
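The restyling described above can be sketched in a few lines of Python. This is only an illustration: the modality names, the `STYLES` table and the `style_for` function are hypothetical, and the property names merely echo the kinds of styling mentioned in the text (red, bold, speech stress).

```python
# Hypothetical style tables, one per output modality. The content only ever
# says "severe"; each modality decides how that is presented.
STYLES = {
    "color_screen": {"severe": {"color": "red"}},
    "monochrome_screen": {"severe": {"font-weight": "bold"}},
    "speech": {"severe": {"stress": "strong"}},
}


def style_for(interaction: str, modality: str) -> dict:
    """Return the presentation for an abstract interaction level.

    Levels with no entry for the modality get no special styling.
    """
    return STYLES[modality].get(interaction, {})
```

Had the content recorded "red" instead of "severe", the monochrome and speech tables would have nothing meaningful to key on.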

As a more complex example, a highly information-dense 'Web interface' application may assume a large display screen and that all the displayed elements can be seen at once. The navigation works poorly if at all on a small screen, for example a PDA. This might be addressed by producing a compromise design that works moderately well on both large and small screens, or by offering alternate (up to date, automatically generated) views that allow sequential navigation of a series of smaller screens. Other examples of problems caused by lack of restylability are voice or multimodal access to content that assumes visual presentation. Here, the entire presentational structure would need to be re-thought, especially in terms of navigation elements, to be usable by a voice browser.

As another example, a simpler, more traditional 'document-like' resource such as a technical report might be more readily adapted to different output modalities - visual presentation on different sizes of display, Braille output, speech synthesis - but users of the non-visual modalities could be hampered by contamination of the content by presentational assumptions. References in the text to 'the previous section' can be verified at a glance visually but require a feat of memory or a rewind-and-replay for the Braille and speech presentations, which are more highly serialized. Worse, references to 'the red text' are clearly meaningless in non-visual modalities.

These examples show that an identified problem is lack of access to restyled content, and also that some content can be trivially reworked to improve this aspect while other content requires a complete redesign.

3. Abstraction and Concreteness

There is no hard and fast division between what is 'purely semantic content' and what is 'just presentation'. The term 'semantics' is often used or misused in this context; however, any structured format is likely to have some semantics, and some semantics refer to precise details of presentation. Thus, 'semantics' is not used in this section.

Instead, a given format may be seen to reside somewhere on a continuum from highly abstract to highly concrete. The more concrete it is, the less presentational flexibility remains. Highly abstract formats often require extensive transformation before being presented. Less abstract formats can often be presented directly just by decorating the source tree with formatting properties, for example using CSS. Highly concrete formats may still retain some presentational flexibility, for example restyling the colors of a pie diagram to fit into a different presentation.

In this model, all resources occupy a point or a range along this continuum - they have qualities of both abstractness and concreteness.

It is sometimes asserted that XML is content, or structure, and styling is presentation. However, presentational information may itself be complex and structured, and thus a good candidate for encoding in XML. Examples of such highly concrete XML formats are XSL Formatting Objects, SVG, the presentational part of MathML, and VoiceXML. The presence of structure, of itself, is no indication of where on the abstraction-concreteness axis a particular format belongs.

In general, moving from more abstract to more concrete can be done (with loss of some abstractions) and is frequently done (for presentation); moving from a more concrete to a more abstract representation (e.g., HTML to RDF) is sometimes possible in specific cases but may require extra information to be added to the more concrete form for that purpose, and is not possible in the general case.
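A minimal sketch of why the reverse direction fails in general, assuming a hypothetical `to_concrete` function and made-up abstract kinds: two different abstractions are rendered to the same concrete markup, so nothing in the output says which one it came from.

```python
from html import escape

# Hypothetical mapping from abstract kinds to a concrete color.
COLOR_FOR_KIND = {
    "severe-interaction": "red",
    "error": "red",
    "note": "black",
}


def to_concrete(kind: str, text: str) -> str:
    """Render an abstract (kind, text) pair as presentational HTML.

    The kind is consumed by the transformation and does not survive in
    the output; only the chosen color does.
    """
    color = COLOR_FOR_KIND[kind]
    return f'<span style="color: {color}">{escape(text)}</span>'
```

Because `severe-interaction` and `error` both yield red text, recovering the kind from the output would need extra information, for example a class attribute carried through for that purpose.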

From this discussion, two questions arise. Firstly, should presentational information be as separate as possible from the content, regardless of the level of abstraction of the content (and what does separation mean, here)? Secondly, should all information on the Web be made available at as high a level of abstraction as possible, to allow maximal opportunities for restyling? These two questions will be addressed in the following sections.

4. Separation of presentational information from content

The historical development of HTML (to take a well-known example) demonstrates that the level of abstraction of the markup was not clearly understood. In addition to the more abstract elements such as h1..h6, p and address, HTML inherited less abstract elements such as b, i and tt from the IBM SGML starter set. Vendor extensions such as font clearly showed that the presentational aspects of HTML were seen as the main advance over plain text files. The introduction of the table element also showed that the auto-resizing grid layout presentation was seen as more compelling than the tabular, matrix organisation. Given appropriate labelling of rows and columns, data with a tabular organization can have the rows and columns swapped without affecting the meaning; this is a good differentiator of presentational as opposed to abstract use of tables.
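The row and column swap test can be sketched as follows. The table layout (labels in the first row and first column) and the `transpose` and `lookup` helpers are assumptions made for illustration, and the figures are invented sample values.

```python
def transpose(table):
    """Swap rows and columns of a rectangular table (list of lists)."""
    return [list(column) for column in zip(*table)]


def lookup(table, row_label, col_label):
    """Find a cell by its labels; labels are in row 0 and column 0."""
    col = table[0].index(col_label)
    for row in table[1:]:
        if row[0] == row_label:
            return row[col]
    raise KeyError(row_label)


# Invented sample data: regions down the side, quarters across the top.
SALES = [
    ["", "Q1", "Q2"],
    ["North", 10, 12],
    ["South", 7, 9],
]
```

For genuinely tabular data, `lookup(SALES, "North", "Q2")` and `lookup(transpose(SALES), "Q2", "North")` give the same value. A table used as a layout grid has no such labels, and swapping its rows and columns simply breaks the page.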

@@ style attribute, style element, external linked stylesheet, external packaged but non linked stylesheet.

@@ necessity to combine these at some point to give the product of styling operation. css tree. XSL tree. FO. separation depends on level of abstraction; presentation is progressively harder to differentiate from content as one becomes more concrete (e.g. SVG geometry is 'content' and colors are 'presentation', but other choices could have been defended).

Benefits of good separation - restylability, reusability, maintainability (different skill sets work on different files).

4.1 Types of separation

Separate resources (files). Separate location (client and server).

One way to fully separate different levels of abstraction is to have the most abstract forms on the server, but not made available to the public. Instead, they are used to automatically generate more concrete forms, which are made available. This has some advantages, and some disadvantages.

4.2 Keeping presentation out of the content

Pharmacy example above. style attribute limits restylability. presentation attributes.

4.3 Keeping content out of the presentation

Generated content. List numbering issues - appropriate choices are not clear.

The problem of templates. Bulk of content can end up in the template. Difficulties of localization when content is mixed up with transformational instructions.

4.4 Inappropriate separation

Writing direction and nesting of embedding - markup required for more than one level. Block vs inline needed for speech synthesis as well as visual formatting.

Separation does not mean isolation from the workflow. For example, someone using a simple text browser on a mobile phone, or a visually disabled user with a text-to-speech interface, is directed to the 'text only accessible' version of a site, only to find that, compared to the 'real' version of the site, the information is three months out of date and half of the links do not work. This might be because the text-only version was created by hand, separated from the constantly updated information used to create the visually oriented parts of the site, and not subsequently updated. Using a single form of the content with different client-side style sheets is one way this could be fixed; having a single, more abstract source (such as a database-backed content management system) with automatic transformation to the visual and text-only versions triggered by any change to the source, and with automatic link checking and rewriting, would be another approach. Both approaches maintain the logical separation of content and presentation while ensuring that an optimized presentation is available for each modality. Both arguably separate presentation from content (but in different ways) or, rather, make appropriate design choices about the abstraction-concreteness axis, as well as about timeliness, usability and quality assurance, to make a better Web site.
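The second approach, a single source with automatic transformation, might look roughly like this. It is a sketch under stated assumptions: the source is reduced to a list of (title, URL) pairs, and `render_visual`, `render_text_only` and `publish` are hypothetical names, not part of any particular content management system.

```python
from html import escape


def render_visual(items):
    """Render the source as an HTML list for visually oriented browsers."""
    rows = "".join(
        f'<li><a href="{escape(url)}">{escape(title)}</a></li>'
        for title, url in items
    )
    return f"<ul>{rows}</ul>"


def render_text_only(items):
    """Render the same source as numbered plain text, one link per line."""
    return "\n".join(
        f"{n}. {title} <{url}>" for n, (title, url) in enumerate(items, 1)
    )


def publish(items):
    """Regenerate every view from the one source, so none can go stale."""
    return {
        "visual": render_visual(items),
        "text_only": render_text_only(items),
    }
```

Calling `publish` whenever the source changes keeps both views in step; neither is edited by hand, so the text-only view cannot drift three months behind.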

5. Maximally Abstract is not the solution

There are several reasons why the unrestricted serving of maximally abstract information is not an appropriate solution. Three such reasons are given below:

5.1 The Value of Information

Consider an example of a publicly traded company that is required to make some financial data available to the shareholders and to the public. At its most abstract, the financial data might include:

the customer name
the precise type of goods sold
the quantity of goods sold
special discounts
the person or people from the sales force that got the sale
the date of the sale

for every sale that the company made in a ten year period. Clearly, this is highly abstract and capable of analysis in various ways; it is, in some sense, the most accessible information. It is also the most financially valuable - it is possible to rank the sales staff by performance and see which ones do well, which poorly, which are improving in performance and which are declining. It is possible to see which customers buy which types of goods, and to track the discounts they get and how these change over time. Combined with similar data from other companies, it could be even more valuable. All of this data is highly company confidential, and in some cases confidential within the company as well (for example, made available only to certain levels of management, human resources personnel, and so on).

Thus, it would clearly be foolish to suggest that all this valuable data should be made freely available to the general public just because it exists. At the other extreme, making the summarized financial report available only as raster images 'to stop people harvesting sales data' would be equally foolish (and readily circumvented by optical character recognition, besides being illegal in some countries on the grounds of discriminating against blind investors).

One cannot say that information must be made available in its most abstract or most detailed form; rather, a suitable balance should be chosen between abstractness and concreteness.

5.2 The Volume of Information

The rawest, most abstract information is often very large. To return to the sales example, daily sales on a per-salesman, per-sale basis are highly detailed, give the most information, and (in a company with thousands of salespeople and decades of trading) the total volume of data is large. To get an overview of the performance of that company, for example to make investment decisions, summarized total sales data per quarter is both more relevant and more useful than megabytes of raw, detailed data.
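As a rough illustration of the reduction involved, the following sketch collapses per-sale records into per-quarter totals. The record shape (a date and an amount) and the `quarterly_totals` function are assumptions made for the example; real sales records would carry the other fields listed in section 5.1.

```python
from collections import defaultdict
from datetime import date


def quarterly_totals(sales):
    """Sum sale amounts by (year, quarter).

    `sales` is an iterable of (date, amount) pairs. Everything else about
    each sale is deliberately discarded by the summary.
    """
    totals = defaultdict(int)
    for sale_date, amount in sales:
        quarter = (sale_date.month - 1) // 3 + 1
        totals[(sale_date.year, quarter)] += amount
    return dict(totals)
```

However many individual sales go in, at most four numbers per year come out, which is the form an investor is likely to want.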

5.3 The Craft of Presentation

The more abstract information is, the harder it is to make a readily understandable presentation of it. This presentational task is a craft; it requires an awareness of writing technique, communication, typography and graphical design; it is not readily automatable for new and unexpected information.

Thus, in moving from abstraction to concreteness, new information (explanation, comparison, analysis) is frequently added. This aids human understandability. For example, sales data might at its most abstract contain just numbers; the sales report would contain an explanation (and the sales graph a color key) to clarify that values are in millions of US dollars per year, inflation corrected to a given date in 2002. There would most likely be an analysis of the results and the profits in each period related to costs such as mergers, restructuring charges, and so forth to allow more sense to be made of the data. In that sense, the more concrete form is more accessible.

6. Maximally Separate is not the solution

The preceding section has shown some of the perils of inappropriately mixing information at different levels of abstraction. There are several reasons, however, why maximal separation between the more abstract form and a more concrete form is not desirable either.

6.1 The client/server divide

@@ not clear this is the best place, introduce earlier

7. @@ @@ The Scrapheap @@ @@

server sep and inaccessibility - great separation, but not what is generally wanted.

restylability

sep of con from presentation as well as presentation from con - content in XSLT for example. localizability.

'final form' never really final, just highly concrete.

shadow trees, pseudo elements and other departures from the content tree - but still attached, not separated as with a batch transformation. connectivity and bidirectionality. event models. where do scripts run. where do animations run. how does information flow back towards abstractness (e.g., editing applications)?

graphical content falls on this gradient too. as well as textual gradient - e.g. CAD/CAM file with NURBS, can be used to make the actual part - high value, lots of content. 2D illustration, technical drawing. 2D bitonal tiff image with stippling and hatching - entirely concrete.

Example for discussion - list numbering (more unusual, internationalized example) generate hard-coded in html from XML using XSLT (looks correct everywhere, html markup does not say it is a list) vs generate html list with css 2 to control styling of generated numbering (depends on correct css implementation). Associated issue of referring to specific numbered items in the body of the text, css does not solve this.

separation of code and data - scripts, again, affects localizability of content that is intertwingled with code.

need some best practices

@@ @@