ChangeProposals/RDFaPrefixesNoChange

From HTML WG Wiki

Summary

Clarify how prefixes work in RDFa, and that they're an optional feature.

Rationale

Dropping prefixes will break existing content

The following set of circumstances leads me to believe that there is significant existing use of prefixes with RDFa in text/html:

  • By default, Drupal 7 supports RDF and publishes RDFa using prefixes as text/html. This means that Drupal 7 documents will be interpreted as HTML5 regardless of what their authors specified via DOCTYPE. While Drupal 7 has only been recently released, Drupal 5 has been estimated as powering over 1% of the Web; and the recent end-of-life announcement for Drupal 5 is likely to act as additional impetus for sites to upgrade to the latest version. If prefixes are removed from HTML+RDFa, RDFa markup that is currently valid in tens of thousands of current sites and hundreds of thousands of future Drupal 7 sites-to-be will break.
    • Three weeks after the release of Drupal 7, at least 23,522 sites are running version 7. This is an underestimate, because it counts only the sites that agreed to ping drupal.org to keep up to date with security releases.
  • Further, Google, Facebook and Yahoo have all encouraged Web publishers to use RDFa with prefixes; none of them suggest that this will only be supported when content is labelled with the application/xhtml+xml media type.
  • At least two W3C member submissions (ccREL and Representing vCard Objects in RDF) document patterns for using RDFa with prefixes in HTML without mentioning any media type requirements.
  • Many current RDFa processors accept text/html input and process it the same as application/xhtml+xml.
  • Researchers at Yahoo have estimated that 3.5% of the web is using RDFa. (It's likely that close to 100% of that 3.5% is using prefixes, as the alternatives - using profile-defined terms, and using absolute URIs - have only been specified in draft form so far, and these options are not yet widely promoted.)
  • The HTML+RDFa draft is only one facet of a multi-faceted deployment of RDFa. Beyond (X)HTML+RDFa, that deployment includes the use of RDFa in any XML dialect, SVG, Atom, ODF, and ePub, all of which support prefixes. If this mechanism were disallowed in HTML5, a division would be created between HTML5 and other, non-HTML RDFa host languages.

The HTML5 design principles include Support Existing Content and Pave The Cowpaths. The current text of the HTML+RDFa draft allows implementors to support that existing content, and formalises this use of RDFa.

Further to that, the last call XHTML+RDFa 1.1 draft supports the use of prefixes. If HTML+RDFa did not, this would lead to a difference between XHTML and HTML; differences between XHTML and HTML are known to be confusing to markup authors.

RDFa with prefixes is not especially complicated

Indeed, some articles have cited prefixes as a feature of RDFa that makes the technology easier to use:

  • This interview with a Drupal developer cites CURIEs as a syntax that makes RDFa easier, and that will be familiar to many web developers.
  • Feedback from the Dublin Core community comparing RDFa and Microdata has suggested that RDFa is preferable to Microdata owing to the latter's inability to abbreviate URIs, which they fear could lead to authors developing workarounds that reduce interoperability.

What's more, lots of people are already using RDFa, with prefixes, and getting it right. CURIEs are a success story.

A recent blog post on the deployment of RDFa, based on an analysis of 12 billion web pages indexed by Yahoo! Search, reveals that 3.6% of web pages (or 430 million pages in Yahoo!’s sample of 12 billion) use RDFa with prefixes. It is also important to note that these figures exclude what the blog calls “trivial” RDFa pages, i.e., documents that contain only the triples stemming from the usual HTML link elements in the header (e.g., references to stylesheets). The growth in deployment is also significant: between March 2009 and October 2010 the usage of RDFa increased by 510% (and this is before the appearance of, e.g., Drupal 7, discussed below, which may influence such figures even more).

This type of fast evolution is well exemplified by the London Gazette (one of the official journals of record of the British government), which publishes its records annotated with RDFa. At the moment, this includes 300,000 pages, growing by roughly 500 pages a day. Other typical users of RDFa include Newsweek (see, e.g., a recent news item, generating RDF data using Open Graph Protocol, DCMI, SIOC, or FOAF vocabulary terms), the O'Reilly Catalog pages (see, e.g., a page on a camera, generating RDF data using Good Relations or DCMI vocabulary terms), and BestBuy (see, e.g., a page on a specific BestBuy store, generating RDF data using Good Relations, vcard, Google’s vocabularies, etc.). The last example is also interesting because the maintainers of the site reported a significant increase in Google ranking, as well as an increase in traffic on those pages, after the introduction of RDFa.

Instead of listing individual sources of RDFa usage, it may be more important to draw attention to some larger-scale, general drivers of RDFa development.

Drupal

An estimated half a million web sites use Drupal, including high-profile ones like the Economist and the White House. Drupal is ranked as one of the top 5 content management systems in use today. Drupal 7, published in January 2011, has RDF in general, and RDFa in particular, at its core; that is, pages produced with Drupal 7 will contain RDFa. Of course, not all sites move from Drupal 6 to Drupal 7 within a few weeks. At the time of writing, some have not yet done so (e.g., the Economist), some have done so but have not really made use of the RDFa facilities (e.g., the White House), and some have already made more extensive use of RDFa in annotating their site (just a few days after the publication of Drupal 7, for example, the Examiner.com news site added RDFa information using a mixture of the FOAF, DC, and SIOC vocabularies).

Facebook

Facebook has adopted RDFa to implement their Open Graph Protocol. Although the amount of RDFa information to be added is only a handful of triples, the acceptance is extremely high. High-profile sites like CNN, the Washington Post, the New York Times, and ReadWriteWeb use it, to name just a few.

It is correct to say that the usage of the Facebook terms also reveals problems around namespaces, insofar as many sites do not follow Facebook's advice and do not add the right namespaces. This is the case for, e.g., the CNN site (see, e.g., a recent news item on their site: http://edition.cnn.com/2011/WORLD/europe/01/25/russia.airport.explosion/index.html?hpt=T1), while other sites do it correctly (see, e.g., a recent Washington Post news item). However, if the competing change proposal were accepted, all pages using the Facebook terms would become invalid, regardless of whether they follow the webmaster advice correctly. If, instead, the "Default Profile" approach proposed by the RDFa Working Group were adopted (see Issue-78), then all pages using the Facebook terms could become correct (by adding Facebook’s prefix definition to the default profile) and the right RDF terms would be generated.

Creative Commons, Flickr

The Creative Commons terms have become a de facto standard for expressing copyright information related to various Web resources. The snippets provided by Creative Commons are based on RDFa, with a mixture of CC’s own vocabulary and some DCMI terms.

As an example of widespread usage, Flickr allows its contributors to attach Creative Commons-based copyright information to their photos, and the corresponding Flickr pages are rendered using (fully valid) RDFa (see, e.g., an arbitrary Flickr photo page, generating the CC statement in RDF).

DBPedia

DBPedia collects the data harvested from the Wikipedia instances in different languages and links these to other datasets. The database itself may be accessed (and downloaded) or, alternatively, the RDF data related to a single term can be accessed directly via HTTP. These data sets are available in RDF/XML or Turtle, but the HTML pages themselves are also encoded in RDFa to provide the same information (see, e.g., the page for London, generating RDF data using a large number of vocabularies linked to other databases).

The use of prefixes should be clarified

Evidence presented by Ian Hickson suggests that prefixes may be confusing to some authors, so it makes sense to clarify this part of RDFa as much as possible, and offer routes for authors to avoid this optional part of RDFa if they wish.

Responses to anticipated objections

The following is a list of counter-arguments to the competing change proposal for ISSUE-120. Keep in mind that, in general, the counter-arguments do not assert that there are no risks when using a name-spacing or prefixing mechanism. There certainly are risks and trade-offs associated with such a mechanism. The RDFa Working Group has spent a very long time attempting to balance the risks and the rewards that authors benefit from when a prefixing mechanism is provided. So, the question isn't "could namespaces and prefixes be mis-used" but rather "do the benefits outweigh the drawbacks".

Copy and paste

Copy-and-paste of the source becomes very brittle when two separate parts of a document are needed to make sense of the content. Copy-and-paste is how the Web evolved, so I think it is important to keep it functional and easy.

By this line of reasoning, support for relative URLs should be removed from HTML5 because they are brittle when copy-pasted. Since HTML5 markup requires certain elements to be embedded in other elements, markup that is copy-pasted between documents may become invalid. The same argument also applies to CSS: classes that are not in the proper hierarchical order will not have the proper CSS applied to them, and that is if the author remembers to copy the originating CSS file at all.

This is not to say that these dangers don't exist with RDFa markup, but to express that the same logic applies to other types of copy-paste errors in HTML5. Copy-and-paste is how the Web evolved, and it evolved in spite of these potential issues. Another way to look at this is that web developers do understand the implications of copy-and-paste and do pay attention to how their copy-and-paste markup performs after it has been inserted into a new location.

Copy and paste is a danger for all of HTML5 - not just RDFa. However, Web authors seem to manage quite well in spite of this danger.

RDFa Core 1.1 attempts to go further in order to ensure that the most common prefixes are pre-defined in RDFa 1.1. So, in the event that there is a spurious copy-paste error, the correct triples will be generated because prefixes like "rdf" and "dc" and "foaf" are automatically loaded by RDFa 1.1 processors before a document is processed. This approach corrects most markup that is copy-pasted by novices.
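The fallback described above can be sketched as follows. This is a minimal illustration, not the RDFa 1.1 processing algorithm: the function name expandCurie is hypothetical, and the three-entry default map is an assumption standing in for whatever set of prefixes the RDFa Working Group finally pre-defines.

```javascript
// Illustrative default prefix map (an assumption, not the official list).
const DEFAULT_PREFIXES = new Map([
  ["rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#"],
  ["dc", "http://purl.org/dc/terms/"],
  ["foaf", "http://xmlns.com/foaf/0.1/"],
]);

// Expand a CURIE such as "foaf:name" into a full URI.
// Prefixes declared in the document take precedence; the pre-defined
// prefixes are used only when the document declares nothing for that prefix.
function expandCurie(curie, declared = new Map()) {
  const colon = curie.indexOf(":");
  if (colon < 0) return null; // no prefix part at all
  const prefix = curie.slice(0, colon);
  const reference = curie.slice(colon + 1);
  const base = declared.has(prefix)
    ? declared.get(prefix)
    : DEFAULT_PREFIXES.get(prefix);
  return base === undefined ? null : base + reference;
}

// A pasted snippet whose prefix declaration was left behind in the source page:
const pasted = expandCurie("foaf:name");

// The same snippet in its original page, where the prefix is declared:
const original = expandCurie(
  "foaf:name",
  new Map([["foaf", "http://xmlns.com/foaf/0.1/"]])
);

console.log(pasted === original); // true
```

In this sketch an unknown prefix that is neither declared nor pre-defined yields null, which corresponds to no triple being generated for that property.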

Cognitive difficulty

The competing change proposal asserts that RDFa prefixes are an indirection mechanism and are similar to XML Namespaces. The competing change proposal goes on to gather anecdotal evidence, in the form of 20 e-mails, that assert that XML Namespaces were difficult to understand for authors. Out of the 20 examples listed, there is not a single scientifically rigorous study or set of hard data that backs up the assertions made about cognitive difficulty.

That is not to say that there are not issues with cognitive difficulty when using namespaces - just that there is no hard data proving the point. There is, however, hard data on millions of documents being generated correctly using the xmlns: mechanism provided by RDFa 1.0.

It is also fairly easy to find anecdotal evidence speaking to how prefixing mechanisms and namespaces are useful when creating distributed systems:

The list above was compiled by searching Google for 30 minutes and gathering anecdotal evidence. However, none of the links above actually provide any sort of hard data as to whether or not prefixes are effective in RDFa, and readers should not consider them, just as they should not consider the anecdotal evidence provided by the counter proposal.

There is currently no scientific study that demonstrates that prefixes are difficult to use for authors writing RDFa. There is, however, a very compelling scientific study that proves that prefixes are being used correctly in RDFa in the wild across hundreds of millions of pages.

Importance of simplicity

HTML documents are frequently maintained by different people than the original authors. If the original author is more knowledgeable than the maintainer, and uses features that the maintainer does not understand, then the quality of the document will suffer dramatically. In the context of RDFa, for instance, the original author might use prefixes, when the maintainer doesn't know RDFa — the maintainer might then move nodes around and break the relationship between the declaration of the prefix and the use of the prefix, breaking the page's RDFa annotations. What's worse, with metadata annotation formats the maintainer likely won't notice that anything broke.

While it is true that this is a danger, it applies to almost every other technology in the Web platform: CSS animation, advanced JavaScript APIs like Web Workers or Web Sockets, and cross-browser HTML5 video codec requirements. If a maintainer does not understand the code that they're maintaining, they will inevitably cause harm at some point. The most likely scenario with RDFa is that the triples that the HTML+RDFa page was generating will cease to be generated. In other words, the worst case is that their page does not express semantic information. This may cause their search ranking to drop and may have other side-effects that may or may not cause them to notice the mistake (validators may fail, check.rdfa.info fails to extract the expected triples, etc.). However, their website will continue to work in a manner that gracefully degrades.
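The failure mode discussed here can be illustrated with a toy model. The sketch below is an assumption-laden simplification: it uses plain objects rather than the DOM, the helper names (makeNode, lookupPrefix, resolveProperty) are hypothetical, and it ignores pre-defined prefixes. It shows only that moving a node out of the subtree that declares a prefix makes the property unresolvable, so nothing is generated for it.

```javascript
// A toy element: a name, the prefixes it declares, and a link to its parent.
function makeNode(name, prefixes, parent) {
  return { name: name, prefixes: prefixes || {}, parent: parent || null };
}

// Find the URI bound to a prefix by walking from the node up to the root.
function lookupPrefix(node, prefix) {
  for (let n = node; n !== null; n = n.parent) {
    if (Object.prototype.hasOwnProperty.call(n.prefixes, prefix)) {
      return n.prefixes[prefix];
    }
  }
  return null; // no declaration in scope
}

// Resolve "prefix:reference" against the declarations in scope, or null.
function resolveProperty(node, curie) {
  const colon = curie.indexOf(":");
  if (colon < 0) return null;
  const base = lookupPrefix(node, curie.slice(0, colon));
  return base === null ? null : base + curie.slice(colon + 1);
}

const body = makeNode("body");
const article = makeNode("div", { foaf: "http://xmlns.com/foaf/0.1/" }, body);
const sidebar = makeNode("div", {}, body);
const span = makeNode("span", {}, article);

// While the span sits inside the declaring subtree, the CURIE resolves.
const before = resolveProperty(span, "foaf:name");

// A maintainer moves the span to a subtree with no declaration in scope.
span.parent = sidebar;
const after = resolveProperty(span, "foaf:name");

console.log(before); // http://xmlns.com/foaf/0.1/name
console.log(after);  // null
```

The page itself still renders in both cases; only the extracted data changes, which is the graceful degradation the paragraph above describes.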

RDFa strives to be as simple as it can be given the large set of use cases that the language supports across a number of different Host Languages (SVG, ODF, ePub, XML, and all the HTML family languages). Prefixes provide simplicity for authors: they don't have to remember a large number of long URLs. Most website authors will declare their prefixes in a template once, and use the same template to create each web page. This reduces the cognitive burden on them: it's easier to remember "foaf:name" than it is to remember "http://xmlns.com/foaf/0.1/name".

Simplicity, in this case, means not having to remember a slew of hard-to-remember URLs.

Other technologies

The competing change proposal does outline a few Web technologies that don't use "re-bindable prefix mechanisms" and goes on to conclude that:

Other than Atom, no hand-authored format uses namespaces and is anywhere near as widely deployed as HTML (and Atom doesn't really use namespace prefixes much). Indeed, no technology that uses the anti-pattern described above is as widely used as HTML. Plenty of other technologies that don't use the anti-pattern are, even just on the Web, like CSS, JS, HTTP, DOM, etc.

Voters should note the clever wordsmithing in the statement above, specifically: "no hand-authored format uses namespaces and is anywhere near as widely deployed as HTML". This holds for every hand-authored format today: there is no other hand-authored format that is anywhere close to the deployment that HTML enjoys on the Web, period. HTML is king; all other hand-authored publishing languages pale in comparison. What is more interesting is to examine why a particular feature is necessary. Just because an airplane and the space shuttle both fly through the air does not mean that the space shuttle only needs the features of an airplane and no more than that limited set. If we followed this line of reasoning, the space shuttle would never reach higher than the airplane.

There are other issues with the assertion. HTTP does, in fact, have a "re-bindable mechanism": the 302 redirect. The Persistent URL (PURL) service is a great example of a redirection mechanism in HTTP that provides a non-static outcome.

CSS supports re-bindable classes. By referring to a different stylesheet, all the class names in the document are bound to different styling values. For example, the two examples below would result in the "important" class being rendered in two completely different ways (assuming that each CSS file is different):

<link rel="stylesheet" href="http://example.org/green.css" />
...
<div class="important">This is an important notice.</div>

<link rel="stylesheet" href="http://example.org/purple.css" />
...
<div class="important">This is an important notice.</div>

JavaScript is perfectly capable of generating and using "re-bindable prefixes":

var vocab = "http://example.com/vocab#";
var term1 = vocab + "name";   // bound to the first vocabulary
vocab = "http://example.com/other-vocab#";
var term2 = vocab + "name";   // same "prefix", now bound to the second

In the example above, "vocab" was used as a re-bindable prefix.

DOM Level 2 supports XML Namespaces. Granted, it is a feature that is often railed against, but the DOM was successful in spite of that feature. Assume for a moment that prefixes are terrible - if they are, people will stop using them over time. Eventually, they will be removed due to market forces even if we are wrong in our assertion that a prefix binding mechanism will be helpful to authors over the next 5-10 years.

The important thing to pay attention to is /why/ these re-bindable prefix mechanisms exist in the first place. Often it is to ease authoring burden, support decentralized innovation and provide modularity. These are the reasons that the prefix mechanism is provided in RDFa.

Dynamic changes

The competing change proposal notes that RDFa is sensitive to changes made further up the DOM tree. This is true; however, RDFa is no different from HTML5 and CSS in this respect. If the elemental structure of an HTML5 document changes, the layout of the page could change. If the elemental structure or values of an HTML5 document change, the CSS applied to the page could change. This is a fundamental mode of operation for structured documents: if the structure changes, the document is affected.

This is not an argument against prefixing mechanisms - it is an argument against structured documents. Note that this is the same way that Microformats operate as well as Microdata. RDFa is no different in this case - if semantic attributes or structure changes further up the tree in Microformats or Microdata, the semantics expressed in a document will inevitably change. The counter proposal does not propose a solution for this problem that is independent of document structure.

This argument should not be considered as a valid reason to remove prefixing. Every document-based semantic language is affected by dynamic changes to the structure of the document - it is the nature of the beast.

Intentional mis-implementation

At least one implementation that is frequently cited as an argument for keeping the xmlns:*="" feature (though not the prefix="" and profile="" features, which are new) is that Google implements it.

While it is true that Google is cited as an argument for not removing xmlns:, it is never the only implementation cited. As stated previously, there are over 430 million pages that definitely use xmlns: at this moment. That number could be as high as 35 billion if you extrapolate the 3.5% figure that Yahoo discovered against the over 1 trillion URLs that Google claimed in 2008 as the size of the Internet - the number today could be even higher than that. However, let's not extrapolate numbers in that manner and rather assert that there are at least 430 million pages that utilize xmlns: in RDFa 1.0 correctly - which is a sizable figure.
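The arithmetic behind these figures can be checked with a short sketch. The variable and function names are hypothetical; the inputs are the numbers quoted in this document (a 12-billion-page Yahoo! sample, the 3.6% and 3.5% shares, and the 1 trillion URLs reported by Google in 2008). Shares are expressed in tenths of a percent so that all arithmetic stays on whole numbers.

```javascript
// Figures quoted in the text.
const yahooSamplePages = 12e9; // pages in the Yahoo! Search sample
const googleUrls2008 = 1e12;   // URLs reported by Google in 2008

// Share is given in tenths of a percent, e.g. 35 means 3.5%.
function pagesAtShare(total, tenthsOfPercent) {
  return (total * tenthsOfPercent) / 1000;
}

const sampled = pagesAtShare(yahooSamplePages, 36);    // 3.6% of 12 billion
const extrapolated = pagesAtShare(googleUrls2008, 35); // 3.5% of 1 trillion

console.log(sampled);      // 432000000
console.log(extrapolated); // 35000000000
```

The first result, 432 million, is the figure the text rounds to 430 million; the second is the 35 billion upper estimate that the text mentions and then sets aside.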

However, it was accepted a long time ago that search companies will attempt to extract as much semantic data as possible out of a page, and authoring error is one area that search companies attempt to fix in order to more accurately discover the contents of a web page. RDFa 1.1 takes this into account and provides a mechanism, called Default Profiles, where common prefixes are pre-loaded by the RDFa processor before a document is processed. That is, if Google requests to have their vocabulary included in the default profile (which works to their advantage), this mis-implementation is no longer a mis-implementation. The RDFa WG is currently seeking input from the broader community as to which common prefixes should be included in the RDFa Default Profile.

That is, the RDFa Working Group has implemented a feature (Default Profiles) in RDFa 1.1 that is backwards-compatible and addresses the "intentional mis-implementation" issue in a way that is forwards-compatible.

An abbreviation mechanism is unnecessary

The competing change proposal states:

In a usability study for microdata, it was discovered that authors in fact have no difficulty dealing with straight URLs rather than shortening them with prefixes

The usability study review referred to by the competing change proposal states:

"About half the participants had trouble with properties being URLs -- they started trying to use them as values."

That usability study is also flawed in design: only six people were involved in the study, and the sample size was roughly 18 web pages. One cannot draw any conclusions from the study referred to by the counter proposal and be even close to sure that the findings reflect how prefixes/CURIEs and URLs are used in RDFa.

The RDFa Working Group has consistently seen authoring issues where full URLs are used. Even so, a subset of people prefer to use full URLs, and so RDFa 1.1 supports using full URIs in all places that accept prefixes and terms. That is, if authors do not want to use prefixes, terms or CURIEs, they can use full URIs everywhere.
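As a rough illustration of the two authoring styles ending up at the same place, the sketch below resolves a property value that is either a CURIE or a full URI. The function name resolveTerm and its rule are simplifying assumptions, not the RDFa 1.1 algorithm: a value is expanded when the text before its first colon is a declared prefix, and is otherwise kept as-is if it looks like an absolute URI.

```javascript
// Hypothetical prefix declarations, as an author might put in a page template.
const prefixes = new Map([["foaf", "http://xmlns.com/foaf/0.1/"]]);

// Resolve a property value to a full URI, or null if it cannot be resolved.
function resolveTerm(value, declared) {
  const colon = value.indexOf(":");
  if (colon > 0 && declared.has(value.slice(0, colon))) {
    // Declared prefix: expand "prefix:reference".
    return declared.get(value.slice(0, colon)) + value.slice(colon + 1);
  }
  // Otherwise accept anything shaped like "scheme:rest" unchanged.
  return /^[A-Za-z][A-Za-z0-9+.-]*:\S+$/.test(value) ? value : null;
}

// An author who uses the prefix declared in the template:
const viaPrefix = resolveTerm("foaf:name", prefixes);

// An author who avoids prefixes altogether and declares nothing:
const viaFullUri = resolveTerm("http://xmlns.com/foaf/0.1/name", new Map());

console.log(viaPrefix === viaFullUri); // true
```

Under these assumptions both authors produce the same property URI, so a consumer of the extracted data cannot tell which style was used.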

Outside of a small handful of people, the RDFa WG has not received a formal complaint calling for the removal of prefixes or CURIEs; in fact, the prefix/CURIE feature is often cited as one of the more helpful aspects of the RDFa language.

Details

While I recognise that the HTML+RDFa specification is not an authoring guide, its editor should:

  1. Either in the HTML+RDFa specification itself, or via a reference to a W3C Recommendation published by the RDFa Working Group, provide a realistic illustration of RDFa used without prefixes. Note that prefixes may be used to abbreviate URIs used repeatedly, and provide a realistic, second example that uses prefixes to good effect.
  2. Either in the HTML+RDFa specification itself, or via a reference to a W3C Recommendation published by the RDFa Working Group, or a reference to an RDFa Working Group Note, provide best-practice advice for organisations that wish to offer snippets of RDFa for authors to copy and paste into their pages. Suggest that these snippets use full URIs rather than CURIEs unless the snippet significantly benefits from an abbreviation mechanism.

Impact

Positive Effects

  • Existing content remains supported.
  • Authors familiar with XHTML+RDFa 1.0 don't need to relearn RDFa for use with HTML5.
  • Page authors not already familiar with RDFa should realise that they do not need to use prefixes.
  • Page authors not already familiar with RDFa learn the situations when prefixes might be useful; and how to employ them in these cases.

Negative Effects

  • I do not know of any negative effects that would result from clarifying the use of prefixes in RDFa.

Conformance Classes Changes

No changes to conformance classes.

Risks

If the clarifications do not go far enough, some confusion may still surround prefixes.

References