An Analysis of XML Literals in RDF
In the beginning ...
One hypothesis is that the group working on RDF M&S decided that there were use cases where folks wanted to have literals that contained XML markup. Escaping the markup being rather tedious for the writer of RDF/XML, they introduced the rdf:parseType="Literal" mechanism. This enabled the user to write the markup without escaping it, making it easier to read and write, e.g.
<eg:prop rdf:parseType="Literal"><br /></eg:prop>
According to this hypothesis, the above was intended to be shorthand for:
<eg:prop>&lt;br /&gt;</eg:prop>
and that both of these would represent a property with a literal value of "<br />".
Aside: I wonder why CDATA couldn't be used for this? Hmmm... the current approach uses the parser to verify that the literal is well-formed XML, and also means that CDATA can be used in the literal value itself.
Just a wee problemette ...
Unfortunately, XML makes this simple arrangement not quite so straightforward. Consider:
<eg:prop rdf:parseType="Literal"> <eg:prop2 eg:b='b' eg:a='a' /> </eg:prop>
The order of attributes is irrelevant to XML. Consider an RDF parser sitting on top of an XML parser: it can't tell what order the attributes in the above example had in the original text. Unless we do something, one parser might produce the literal "<eg:prop2 eg:b='b' eg:a='a' />" and another might produce "<eg:prop2 eg:a='a' eg:b='b' />", and they would both be right.
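A small Python sketch of the problem, using only the standard library. The eg namespace URI is a placeholder I have made up for illustration, and ElementTree merely stands in for whatever XML parser an RDF parser might sit on.

```python
import xml.etree.ElementTree as ET

EG = "http://example.org/eg#"  # placeholder namespace URI (assumption)

first = f"<eg:prop2 xmlns:eg='{EG}' eg:b='b' eg:a='a' />"
second = f"<eg:prop2 xmlns:eg='{EG}' eg:a='a' eg:b='b' />"

# ElementTree exposes attributes as a mapping; comparing mappings ignores order.
attrs_first = dict(ET.fromstring(first).attrib)
attrs_second = dict(ET.fromstring(second).attrib)

same_as_xml = attrs_first == attrs_second  # equivalent as far as XML is concerned
same_as_text = first == second             # but different character strings

print(same_as_xml, same_as_text)
```

The two inputs are the same XML yet different strings, which is exactly why two conforming parsers could hand back different plain literals.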
RDFCore is insisting that there is a deterministic algorithm to determine when two literals are equal. So if rdf:parseType='Literal' is to represent literals that are what we now call plain literals, then we would have to canonicalise to ensure that all parsers produce the same plain literal.
If I recall correctly, canonicalization puts the attributes in alphabetical order, thus the example above would produce the literal "<eg:prop2 eg:a='a' eg:b='b' />". This might surprise the user: if she were to search a graph for literals containing "eg:b='b' eg:a='a'", which would seem reasonable given the input XML, she would not find it.
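The effect can be tried out with Python's xml.etree.ElementTree.canonicalize (Python 3.8 and later), which implements C14N 2.0. This is only a sketch: the namespace URI is a placeholder, and that canonical form also normalizes things this page does not discuss (attribute quoting, empty-element tags), so the printed string differs from the input in more than attribute order.

```python
import xml.etree.ElementTree as ET

EG = "http://example.org/eg#"  # placeholder namespace URI (assumption)

as_written = f"<eg:prop2 xmlns:eg='{EG}' eg:b='b' eg:a='a' />"
reordered = f"<eg:prop2 xmlns:eg='{EG}' eg:a='a' eg:b='b' />"

# Both inputs canonicalize to one and the same string.
canonical = ET.canonicalize(as_written)
same_canonical_form = canonical == ET.canonicalize(reordered)

# A naive substring search for the text as originally typed comes up empty.
found = "eg:b='b' eg:a='a'" in canonical

print(canonical)
print(same_canonical_form, found)
```

Whatever the exact canonical form chosen, the point stands: the literal stored in the graph need not contain the characters the user typed.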
An RDF/XML Writer
Consider now a program that is writing RDF/XML and it is about to write the literal "<eg:prop2 eg:a='a' eg:b='b' />". It could examine the literal to determine whether it is well formed canonical XML, but it is likely that most writers won't do that, and will just write out the plain literal in escaped form:
<eg:prop>&lt;eg:prop2 eg:a='a' eg:b='b' /&gt;</eg:prop>
Not very readable.
Note that had the literal been "<eg:prop2 eg:b='b' eg:a='a' />" it could not be written using rdf:parseType="Literal". Whilst it is well formed XML, it is not in canonical form as the attributes are the wrong way round. If it were written using rdf:parseType="Literal" it would be changed to "<eg:prop2 eg:a='a' eg:b='b' />" by the parser that read it in.
Implementations can do a bit better than this. If they read an rdf:parseType="Literal" from an RDF/XML document, they can record the fact that it is well formed canonical XML, then a writer can easily know that it can use rdf:parseType="Literal" to write it out again. This need not be thought of as creating a new type of literal; it is just caching some information about the nature of a plain literal.
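A minimal sketch of that caching idea in Python. PlainLiteral, known_canonical_xml and write_property are hypothetical names invented here, not part of any RDF toolkit, and the writer trusts the cached flag rather than re-checking the text.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass(frozen=True)
class PlainLiteral:
    """A plain literal plus a cached hint (hypothetical type for illustration)."""
    text: str
    known_canonical_xml: bool = False  # set by a reader that saw parseType="Literal"


def write_property(name: str, literal: PlainLiteral) -> str:
    """Serialize one property element, using the readable form when known safe."""
    if literal.known_canonical_xml:
        # The reader already established this is canonical XML, so emit it as-is.
        return f'<{name} rdf:parseType="Literal">{literal.text}</{name}>'
    # No cached knowledge: fall back to escaping the plain literal.
    return f'<{name}>{escape(literal.text)}</{name}>'


from_parser = PlainLiteral("<eg:prop2 eg:a='a' eg:b='b' />", known_canonical_xml=True)
from_api = PlainLiteral("<eg:prop2 eg:b='b' eg:a='a' />")

readable = write_property("eg:prop", from_parser)
escaped = write_property("eg:prop", from_api)
print(readable)
print(escaped)
```

The flag lives beside the literal's text and takes no part in equality of the literal value as RDF sees it, which is the sense in which it is a cache rather than a new kind of literal.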
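Consider also what happens when the content of an rdf:parseType="Literal" property is itself escaped:
<eg:prop rdf:parseType="Literal">&lt;foo&gt;</eg:prop>
This turns into the plain literal "&lt;foo&gt;". It cannot turn into "<foo>" because then we would lose the distinction between the markup and the content. Note that this is different from:
<eg:prop>&lt;foo&gt;</eg:prop>
which represents the literal "<foo>".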
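The distinction between markup and escaped character data can be sketched in Python. literal_value is a hypothetical helper written for this page; it follows the reading described above (content of rdf:parseType="Literal" is kept as XML text) and uses ElementTree's own serialization for child elements, which is not canonical XML. The eg namespace URI is a placeholder.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EG = "http://example.org/eg#"  # placeholder namespace URI (assumption)
NS = f'xmlns:rdf="{RDF}" xmlns:eg="{EG}"'


def literal_value(property_xml: str) -> str:
    """Return the plain literal a property element denotes under this reading."""
    elem = ET.fromstring(property_xml)
    if elem.get(f"{{{RDF}}}parseType") == "Literal":
        # Content is markup: keep it as XML text, so character data stays escaped.
        children = "".join(ET.tostring(child, encoding="unicode") for child in elem)
        return escape(elem.text or "") + children
    # Ordinary property: the parser's unescaped character data is the literal.
    return elem.text or ""


as_markup = literal_value(f'<eg:prop {NS} rdf:parseType="Literal">&lt;foo&gt;</eg:prop>')
as_text = literal_value(f'<eg:prop {NS}>&lt;foo&gt;</eg:prop>')
print(as_markup)
print(as_text)
```

The same characters in the source document therefore yield two different plain literals, depending only on the presence of the parseType attribute.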
A Coherent Position
At this point, we have reached a coherent position. rdf:parseType="Literal" is a syntactic mechanism for writing down some plain literals in a more readable manner. There are some surprises awaiting the unwary user in that they might not always get the plain literal they are expecting. This position has the advantage that all the XMLisms (apart from the caching optimization) are handled in the XML readers and writers - RDF just deals with plain literals.
Why are XML Literals considered to be different to plain literals in the RDF Specs?
One reason suggested by Pat is to record in the RDF graph that a literal is well formed canonical XML. This does not argue for XML literals being different to plain literals; they could be a subclass of plain literals.
Another reason might be to force the use of rdf:parseType="Literal" when serializing a graph. It is apparent that some RDF applications have XML documents that can be processed by RDF tools and by XML tools. RSS and CC/PP are examples. If a writer were required to serialize XML literals using rdf:parseType="Literal", then the developer could rely on being able to use standard XML tools, such as XPath and XSLT, to process those literals in the serialization.
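To illustrate what an XML tool sees in each serialization, here is a sketch using the limited XPath subset in Python's ElementTree. The eg namespace URI and the element names are placeholders carried over from the examples on this page.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EG = "http://example.org/eg#"  # placeholder namespace URI (assumption)
NS = f'xmlns:rdf="{RDF}" xmlns:eg="{EG}"'

with_parse_type = ET.fromstring(
    f'<eg:prop {NS} rdf:parseType="Literal"><eg:prop2 eg:a="a" eg:b="b" /></eg:prop>'
)
hand_escaped = ET.fromstring(
    f'<eg:prop {NS}>&lt;eg:prop2 eg:a="a" eg:b="b" /&gt;</eg:prop>'
)

# Select every descendant eg:prop2 element.
path = f".//{{{EG}}}prop2"
hits_markup = with_parse_type.findall(path)
hits_escaped = hand_escaped.findall(path)

print(len(hits_markup), len(hits_escaped))
```

With rdf:parseType="Literal" the markup is real elements that a path expression can reach; in the escaped form it is just a text node, so the same query matches nothing.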
Mixed content is a special case of internalized media type
One of the principal purposes of rdf:parseType="Literal" is to allow XML text fragments to be exchanged "in band". Not because RDF has an XML serialization, but because marked up text is a useful datatype in and of itself. Technically, as more of a media type or encoding than a datatype, not unlike PNG or JPEG, it could be "outside" the RDF and referenced by URI. The benefit of having a "markup" data type in RDF is convenience for the tools and exchange (not having to use MIME multipart/related containers, for example). The only thing that makes XML special in this case is that, at this age of RDF and XML, XML is the only principal markup language out there.
And what about the RSS community now, where lots of people use manual escaping instead of rdf:parseType="Literal"? Would this be good for them?
Manual escaping in RSS is an error. It should never have happened. Feel free to have some "out of band" way to indicate that the datatype "is really markup", but don't let RSS's use of escaped markup otherwise affect proper handling of mixed content in RDF. -- KenMacLeod
I've been feeling for a while that all markup in RDF should have RDF semantics. It's not that hard to give HTML, etc, RDF semantics. But I don't have a good demonstration of all of this yet. One sketch is at StripeSkipping. --SandroHawke.
Re: the underlying parser problem. Could the XML be processed in such a way that Literal attributes are recognised, and their contents treated as CDATA by the parser? An event-based approach would allow this (wouldn't it?), and aren't pretty much all parsers based on events? guest