User:Rcygania2/XML literals

From RDF Working Group Wiki

Background

There are lots of things in an XML text that don't affect its meaning (or value), like choice of single vs double quotes, order of attributes, and so on. So, many different XML texts can serialize the same XML value. It's a given that in RDF we want value-based equality at some point (at least in the semantics). So somewhere, someone has to canonicalize the text into a value. This canonicalization can be done in different places:

  1. by the author of the XML text
  2. by the RDF parser
  3. by the RDF toolkit when it does value-based comparison (in the L2V mapping)
  4. by the application

The status quo is that each serialization format makes a decision between 1) and 2). RDF/XML goes for 2, everyone else for 1.
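As a small illustration of "many texts, one value", the sketch below canonicalizes two XML texts that differ only in quote style, attribute order, and whitespace inside the tag. It uses Python's standard-library `xml.etree.ElementTree.canonicalize`, which implements C14N 2.0. That is an assumption made for convenience here: it is a stand-in for whichever canonicalization the spec actually mandates, and the sample strings are made up.

```python
from xml.etree.ElementTree import canonicalize

# Two texts a human (or two different tools) might write for the same XML value.
text_a = "<span title='t' class='x'>hi</span>"
text_b = '<span   class="x" title="t">hi</span>'

# Canonicalization fixes quote style, attribute order and in-tag whitespace.
canon_a = canonicalize(text_a)
canon_b = canonicalize(text_b)

print(text_a == text_b)    # the raw texts differ
print(canon_a == canon_b)  # the canonical forms agree
```

Whoever runs this step (author, parser, toolkit, or application) is exactly the choice between options 1 to 4.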

Option 1 is ok when interchanging between RDF systems (N-Triples?), but really bad for anyone who transforms XML from a non-RDF context to RDF (RDFizers, and of course anyone who authors RDF by hand). In languages like Turtle it makes XML literals unusable.

Option 2 is ok in an XML-based language, because it already has an XML parser, and C14N shouldn't be a big deal. Otherwise it sucks because now a Turtle parser would have to ship an XML parser too.

Option 3 means that SPARQL engines and reasoners have to do it, or hopefully the underlying RDF API can handle it. I think this sucks less than 2), for reasons I have not tried to articulate.

Option 4 is what happens when Option 3 is made optional. It's kind of acceptable. Comparing XML values doesn't seem to be a huge use case.

There are two main reasons for the choice of 1) and 2) in the status quo. First, without it the output of an RDF/XML parser would have to be allowed to be somewhat non-deterministic, because such parsers work with the output of an underlying XML parser and never know what kind of quotes were used. Second, RDF/XML was the only game in town, and as stated above, 2) doesn't suck too much in XML-based syntaxes. Option 3 seemed less attractive because an OWL reasoner doesn't want to ship an RDF/XML parser.

Contrasting with other datatypes

All other datatypes (the XSD types) basically let implementations choose between 3) and 4). Canonicalization only has to happen when value-based comparison is needed. And XSD support is essentially optional.

However, unlike with rdf:XMLLiteral, it's trivial to implement deterministic parsers for all XSD datatypes, in any syntax we know of.
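To make the contrast concrete, here is a rough sketch of how an XSD type behaves under options 3) and 4), using xsd:integer. The parser keeps each lexical form verbatim (which is trivially deterministic), and the mapping to a value only runs when someone asks for a value-based comparison. The function name `integer_l2v` is hypothetical, not taken from any RDF library.

```python
import re

# Lexical space of xsd:integer: optional sign followed by ASCII digits.
_INTEGER = re.compile(r"[+-]?[0-9]+")

def integer_l2v(lexical: str) -> int:
    """Hypothetical lexical-to-value mapping for xsd:integer."""
    if not _INTEGER.fullmatch(lexical):
        raise ValueError(f"not a valid xsd:integer lexical form: {lexical!r}")
    return int(lexical)

# Three distinct lexical forms, stored exactly as written in the source.
lexical_forms = ["1", "01", "+1"]

# Only when value-based comparison is needed do we map them to values.
values = {integer_l2v(s) for s in lexical_forms}
print(len(set(lexical_forms)), len(values))
```

The three lexical forms stay distinct as terms but collapse to a single value once the mapping is applied.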

Proposal

  • Canonicalization happens in the L2V mapping
  • rdf:XMLLiteral support is optional
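A minimal sketch of what the first bullet could look like in a toolkit, under several assumptions: the class `XMLLiteral`, its `value` method, and `same_value` are hypothetical names; the canonical string stands in for the XML value; and Python's C14N 2.0 implementation stands in for whatever canonicalization would be specified. Because a literal may be a fragment (text plus several top-level elements), the sketch wraps it in a throwaway root element before canonicalizing.

```python
from dataclasses import dataclass
from xml.etree.ElementTree import canonicalize

@dataclass(frozen=True)
class XMLLiteral:
    # Lexical form, kept exactly as the parser saw it. No XML parsing needed here.
    lexical: str

    def value(self) -> str:
        # Hypothetical L2V mapping: canonicalize on demand. The wrapper element
        # is an assumption so that fragments with several top-level nodes parse.
        return canonicalize(f"<wrapper>{self.lexical}</wrapper>")

def same_value(a: XMLLiteral, b: XMLLiteral) -> bool:
    # Value-based comparison, only available if XML literal support is present.
    return a.value() == b.value()

lit_a = XMLLiteral("H<sub>2</sub>O is <em class='c'>wet</em>")
lit_b = XMLLiteral('H<sub>2</sub>O is <em class="c">wet</em>')

print(lit_a == lit_b)            # term equality compares lexical forms
print(same_value(lit_a, lit_b))  # value equality goes through the L2V mapping
```

A Turtle parser built this way never touches an XML parser; only code that calls `value()` pays for canonicalization, which is what makes the second bullet (optional support) workable.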

Discussion

I18n objection?

Support for XML literals was introduced partly to support I18n concerns like bidi, ruby markup, and mixed-language text. Making XML literal support optional would be bad from an i18n point of view. On the other hand, the rather low level of actual use of XML literals for these purposes shows that it's not a big loss. (Data???)

Arbitrarily different lexical forms

Some funny stuff could happen.

  • Two RDF/XML parsers would likely obtain two different XML literals from the same RDF/XML file. (The literals would have the same value but different lexical forms, so they yield two distinct triples.)
  • Serialize a graph containing XML literals as Turtle and as RDF/XML. Load them both. Likely you have two different graphs now.
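The two bullets above can be simulated with plain Python sets standing in for graphs. Everything here is illustrative: the triples, the IRIs, and the helper `normalize` are made up, and `canonicalize` (C14N 2.0 from the standard library) is assumed as the value mapping.

```python
from xml.etree.ElementTree import canonicalize

S, P = "http://example.org/doc", "http://example.org/title"

# The same source data, as two parsers (or two syntaxes) might hand it over.
graph_a = {(S, P, "<em class='x'>Title</em>")}
graph_b = {(S, P, '<em class="x">Title</em>')}

def normalize(graph):
    # Hypothetical helper: replace each XML literal by its canonical form.
    return {(s, p, canonicalize(o)) for (s, p, o) in graph}

print(graph_a == graph_b)                        # graphs differ term by term
print(len(graph_a | graph_b))                    # merging gives two triples
print(normalize(graph_a) == normalize(graph_b))  # same graph by value
```

So under the proposal, graph equality and triple counts depend on whether one compares terms or values.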

Cost for SPARQL and reasoners

If they want to support value-based comparisons for rdf:XMLLiteral then they need to ship the canonicalizer. Migration cost.

Other problems with XML literals

There are plenty of other warts. Is it worth trying to fix it?

  • Language tags (xml:lang) in scope in the enclosing RDF/XML are not inherited by the XML literal
  • To use XHTML in Turtle, the XHTML namespace needs to be declared on every top-level element!? Oh that sucks.
  • If they ever catch on, we get to deal with XSS exploits in Tabulator.
  • XHTML is obsolete anyway, better focus on HTML5
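To show why the namespace wart in the list above bites, the sketch below parses a two-paragraph fragment in which only the first top-level element declares the XHTML namespace. The wrapper element and the sample fragment are assumptions made for illustration; the point is only that a default namespace declaration does not carry over to sibling elements.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# A literal with two top-level elements; only the first declares the namespace.
fragment = f'<p xmlns="{XHTML}">first</p><p>second</p>'

# Wrap the fragment in a throwaway root so an XML parser accepts it.
root = ET.fromstring(f"<wrapper>{fragment}</wrapper>")
first, second = list(root)

print(first.tag)   # namespaced: this really is an XHTML p element
print(second.tag)  # no namespace: just some element called p
```

Since the literal has no enclosing element of its own in Turtle, there is nowhere to declare the namespace once, hence the repetition on every top-level element.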

Use cases for XML literals

  • Storing XML content as opaque blobs
  • Transmitting content snippets, e.g., RSS
  • Rich literals, e.g., a title with some markup like sub, sup, em