XML Literals

From RDF Working Group Wiki
Revision as of 14:36, 9 May 2012 by Rcygania2 (Talk | contribs)

This page discusses ongoing work on re-designing the rdf:XMLLiteral datatype. It is related to ISSUE-13.

The poll

A poll broke the issue down into a number of sub-questions and gathered opinions from the WG. The full questions and responses are in this thread: http://lists.w3.org/Archives/Public/public-rdf-wg/2011Nov/0163.html

Twelve responses were received. Breakdown by question follows, including salient commentary from the responses.

Q1. Should the specs define a way to compare XML literals based on value?

Responses: -1 -1 +1 -1 +1 -1 -1 +1 +0 +1 +1 +1 (Sum: +1 from 12 responses)

  • Use case: capturing HTML markup in RDFa
  • Issue: RDFa use requires support for non-closing tags, so infoset wouldn't work
  • Issue: Value space is too complex
  • Comment: Yes, this is inherent in a datatype
  • Comment: XML comparison shouldn't be in the RDF specification
  • Proposal: use XML infosets
  • (The sixth column is Andy's -1 – he seemed to respond to Q6?)

Q2. Should the specs say that RDF implementations MUST support value-based comparison?

Responses: -1 -1 0 -1 -1 -1 -1 +1 -1 0 -1 -1 (Sum: -8 from 12 responses)

Q3. Should the lexical space be in canonical form?

Responses: -1 -1 -1 -1 -1 -1 -1 +1 0 +1 -1 -1 (Sum: -7 from 12 responses)

  • Comment: Round-tripping other datatypes creates same issues, people manage
  • Comment: C14N shouldn't be in the spec, just use infosets as values
  • Comment: Better stick with the devil we know

Q4. Should invalid XML be allowed in the lexical space?

Responses: +1 -1 +1 -1 -1 -½ -½ -1 -1 -1 -1 -1 (Sum: -7 from 12 responses)

  • Use case: rdf:XMLLiteral in HTML+RDFa takes tag soup as parser input, so output should match that
  • Comment: Don't encourage it, but it will happen anyways
  • Comment: Better stick with the devil we know

Q5. Should the specs say that RDF/XML parsers MUST canonicalize when handling parseType="literal"?

Responses: -1 -1 0 +½ -1 +½ +½ +1 +1 +1 -1 +1 (Sum: +1½ from 12 responses)

  • Proposal: don't add in-scope namespaces (but that would invalidate most existing RDF/XML content)
  • Comment: Different lexical form doesn't matter as long as same value
  • Comment: Maintain current behaviour for RDF/XML
  • Comment: It pulls in stuff from context, so changes the XML anyways
  • Comment: MAY/SHOULD would be fine
  • Comment: If rdf:XMLLiteral is optional then this has to be -1, no?

Q6. Should it be required that producers of XML literals in concrete syntaxes (Turtle, N-Triples, other parseTypes in RDF/XML) canonicalize the literals themselves?

Responses: -1 -1 -1 -1 -1 -1 -1 +1 0 +1 -1 -1 (Sum: -7 from 12 responses)

  • Comment: Yes. Better stick with the devil we know

Points of consensus and disagreement

(Almost) consensus

  • rdf:XMLLiteral should be optional
  • The lexical space shouldn't be canonicalized
  • The lexical space should include only well-formed XML
  • Authors shouldn't be required to canonicalize

Disagreement

  • Should RDF/XML parsers canonicalize on input?
  • Should the value space be just non-canonicalized strings? Or something based on the infoset/C14N?

Towards a proposal

This section discusses the issues that were not settled by the poll. See also User:Rcygania2/XML_literals for more discussion.

Should the value space be just strings? Or something based on the infoset/C14N?

The value space should be canonicalized. That's what a datatype is supposed to do, and if the datatype is optional, then implementers can elect to just treat it as a string.

The value space should be based on XML infosets, rather than “something in 1:1 correspondence with C14N'd strings”. That's easy enough to specify, and again it's what a datatype is supposed to do.

Should RDF/XML parsers canonicalize on input for parseType="literal"?

Yes. That is the current behaviour, and we should avoid changing RDF/XML if possible. Turning a DOM into a string is slightly tricky to get right (namespaces!), parsers need to implement it somehow anyway, and Exclusive XML Canonicalization (XC14N) is no worse than other approaches. It at least makes tight test cases possible. It also helps with round-tripping and consistent parsing: with a DOM-to-string algorithm whose resulting attribute order is random, you would get a different lexical form each time you parse. SHOULD would really be sufficient, but MUST is easier to say in the RDF/XML spec.
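The point about unstable serializations can be illustrated with a small sketch. It assumes Python 3.8 or later and uses the standard library's xml.etree.ElementTree.canonicalize(), which implements C14N 2.0 rather than Exclusive XML Canonicalization, so it is only a stand-in for the algorithm discussed here. The two input strings are made up for illustration.

```python
from xml.etree.ElementTree import canonicalize

# Two serializations of the same element: attribute order, quoting,
# whitespace and empty-element syntax all differ.
variant_1 = '<foo b="2" a="1"/>'
variant_2 = "<foo a='1'   b='2'></foo>"

# NOTE: canonicalize() implements C14N 2.0. It is used here only as a
# stand-in for Exclusive XML Canonicalization (an assumption of this sketch).
canonical_1 = canonicalize(variant_1)
canonical_2 = canonicalize(variant_2)

print(variant_1 == variant_2)      # the raw strings differ
print(canonical_1 == canonical_2)  # the canonical forms agree
```

A parser that emits the canonical form yields the same lexical form on every run, which is what makes exact-match test cases feasible.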

Consequences of this design

  • Nothing changes for RDF/XML
  • RDF/XML parsers that violate the RDF/XML spec by not canonicalizing parseType="literal" on input are still non-conforming, but no longer produce ill-typed literals
  • Turtle "<foo/>"^^rdf:XMLLiteral is no longer ill-typed
  • Turtle "<foo/>"^^rdf:XMLLiteral and RDF/XML <x:y rdf:parseType="literal"><foo/></x:y> now result in different triples (but with the same meaning). This is potentially bad where content negotiation is involved, because the choice of serialization syntax for the same graph can now yield different triples.

Should there be an HTML datatype?

  • Would be a good idea. Lexical space – any HTML fragment; value space – HTML DOM trees? XML infosets?
  • Should perhaps be defined by another WG though. RDFa WG?
  • If it becomes a separate document, then there might be an opportunity to move the XML datatype into that as well

rdf:XMLLiteral Mk. II proposal A (Value space based on XML Infoset)

Summary of proposal

  • Make rdf:XMLLiteral optional in the datatype map
  • Change rdf:XMLLiteral lexical space to no longer require the lexical forms to be in canonical form
  • Define a canonical lexical form for rdf:XMLLiteral that is equivalent to the old lexical space
  • Re-define the value space in terms of XML infosets (this should be in 1:1 correspondence with the old value space)

Normative changes to RDF Concepts

The current definition of the rdf:XMLLiteral lexical space is:

The lexical space is the set of all strings:

  • which are well-balanced, self-contained XML content [XML10];
  • for which encoding as UTF-8 [UTF-8] yields exclusive Canonical XML (with comments, with empty InclusiveNamespaces PrefixList) [XML-EXC-C14N];
  • for which embedding between an arbitrary XML start tag and an end tag yields a document conforming to XML Namespaces [XML-NAMES]

DELETE the second bullet point in this definition.
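With the canonicalization bullet removed, membership in the lexical space comes down to the two remaining conditions. A rough way to approximate that check, assuming Python's xml.dom.minidom (whose default expat-based parser is namespace-aware), is to wrap the string in a dummy element and see whether it parses. The function name in_lexical_space and the wrapper element name are hypothetical, and the check is a sketch rather than a full conformance test.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def in_lexical_space(lexical_form: str) -> bool:
    """Approximate membership test for the revised lexical space.

    Hypothetical helper: wraps the string in a dummy element and parses
    it with a namespace-aware parser. Parse errors (unbalanced tags,
    unbound prefixes, undefined entities) mean the string is rejected.
    """
    wrapped = "<wrapper>" + lexical_form + "</wrapper>"
    try:
        minidom.parseString(wrapped)
    except ExpatError:
        return False
    return True


print(in_lexical_space("<foo/>"))             # balanced element
print(in_lexical_space("foo<sub>bar</sub>"))  # mixed content
print(in_lexical_space("<foo>"))              # unbalanced start tag
print(in_lexical_space("<a:b/>"))             # prefix with no declaration
```

Non-canonical forms such as "<foo/>" pass this check, which is the intended effect of dropping the second bullet.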

REPLACE the current definition of the rdf:XMLLiteral value space with this definition:

The value space is the set of all ordered lists of information items [XML-INFOSET] that contain only character information items, element information items, comment information items, and processing instruction information items.

REPLACE the current definition of the rdf:XMLLiteral L2V mapping with this definition:

The lexical-to-value mapping is defined as follows:

  • Wrap the lexical form between an arbitrary XML start-tag and matching end-tag, yielding an XML document [XML10]
  • Take the XML infoset corresponding to the XML document
  • Return the list of children of its document element information item
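The three steps above can be sketched in Python as follows. This is a simplified illustration, not a complete infoset implementation: it assumes xml.dom.minidom as the parser, represents information items as plain tuples, groups runs of characters into one string instead of one item per character, and leaves namespace declarations out of the attribute list. The names lexical_to_value, item and items are hypothetical.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

XMLNS = "http://www.w3.org/2000/xmlns/"  # namespace of xmlns declarations


def item(node):
    """Summarize a DOM node as a tuple standing in for an information item."""
    if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        return ("characters", node.data)
    if node.nodeType == Node.COMMENT_NODE:
        return ("comment", node.data)
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
        return ("pi", node.target, node.data)
    if node.nodeType == Node.ELEMENT_NODE:
        # Attribute order is not significant, so sort; skip xmlns
        # declarations (a simplification made by this sketch).
        attributes = sorted(
            (a.namespaceURI or "", a.name.split(":")[-1], a.value)
            for a in node.attributes.values()
            if a.namespaceURI != XMLNS
        )
        local_name = node.tagName.split(":")[-1]
        return ("element", node.namespaceURI or "", local_name,
                tuple(attributes), items(node.childNodes))
    raise ValueError("unexpected node type: %r" % node.nodeType)


def items(nodes):
    """Summarize a node list, merging adjacent runs of character data."""
    result = []
    for summary in map(item, nodes):
        if summary[0] == "characters" and result and result[-1][0] == "characters":
            result[-1] = ("characters", result[-1][1] + summary[1])
        else:
            result.append(summary)
    return tuple(result)


def lexical_to_value(lexical_form):
    # Step 1: wrap the lexical form in an arbitrary start-tag and end-tag.
    document = minidom.parseString("<wrapper>" + lexical_form + "</wrapper>")
    # Steps 2 and 3: take the parsed document and return the children
    # of its document element.
    return items(document.documentElement.childNodes)


# Different lexical forms, same value:
print(lexical_to_value("<foo/>") == lexical_to_value("<foo></foo>"))
```

Because values are compared structurally, lexical forms that differ only in attribute order or empty-element syntax map to the same value.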

ADD the following definition for an rdf:XMLLiteral canonical mapping:

The canonical mapping is defined as Exclusive XML Canonicalization [XML-EXC-C14N] (with comments, with empty InclusiveNamespaces PrefixList).


Informative changes to RDF Concepts

REMOVE the following sentence:

This allows the inclusion of text that contains markup, such as XHTML [XHTML11].

Instead, ADD the following sentence:

This allows the inclusion of XML payloads in RDF graphs, as well as text that contains markup, such as XHTML [XHTML11].

ADD the following informative Note:

Any XML namespace declarations (xmlns) and language annotation (xml:lang) desired in the XML content must be included explicitly in the XML literal. Note that some concrete RDF syntaxes may define mechanisms for inheriting them from the context (@parseType="literal" in RDF/XML [RDFXML]).

REMOVE the following three informative notes:

  • XML values can be thought of as the [XML-INFOSET] or the [XPATH] nodeset corresponding to the lexical form, with an appropriate equality function.
  • RDF applications may use additional equivalence relations, such as that which relates an xsd:string with an rdf:XMLLiteral corresponding to a single text node of the same string.
  • If language annotation of XML literals is required, it must be explicitly included as markup, usually by means of an xml:lang attribute.

Maybe add an example or two for rdf:XMLLiteral, using editorial discretion. Possibly:

  • Example: foo<sub xmlns="…html…">bar</sub>
  • Example: <span xmlns="…html…" xml:lang="en">…</span>
  • Example: <svg>…</svg>

Changes to other documents

  • Change RDF Semantics so that rdf:XMLLiteral is no longer interpreted in RDF-Entailment, but only in D-Entailment.
  • No changes to RDF/XML or Turtle.


rdf:XMLLiteral Mk. II proposal B (Value space based on DOM)

This is just like the proposal above, except that the value space is defined in terms of the DOM, not in terms of XML infosets. This allows equality to be defined in terms of a function that already exists in the DOM API, which gives a precise definition and readily available implementations. It would also be more readily applicable to a possible HTML5 datatype, because HTML5 is defined in terms of the DOM.

In the proposal above, use the following definition of the rdf:XMLLiteral value space:

The value space contains sets of DocumentFragments [DOM3]. Any two DocumentFragments A and B are considered the same value if and only if the DOM method A.isEqualNode(B) returns true.

In the proposal above, use the following definition of the rdf:XMLLiteral L2V mapping:

The lexical-to-value mapping is defined as follows:

  • Let xmldoc be the lexical form wrapped between an arbitrary XML start-tag and matching end-tag (an XML document [XML10])
  • Let domdoc be a DOM Document object [DOM3] corresponding to xmldoc
  • Return a DocumentFragment whose childNodes list is the same as the childNodes list of domdoc's documentElement attribute
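The DOM-based mapping can be sketched in Python with xml.dom.minidom. Two caveats apply. First, to the best of my knowledge minidom does not provide isEqualNode, so the sketch uses a hand-written approximation (is_equal_node, a hypothetical helper) that compares node type, node name, node value, namespace, attributes and children. Second, the sketch moves the child nodes into the fragment rather than sharing the same childNodes list. The function name lexical_to_fragment is also hypothetical.

```python
from xml.dom import minidom
from xml.dom.minidom import Node


def lexical_to_fragment(lexical_form):
    """Map a lexical form to a DocumentFragment, following the steps above."""
    # xmldoc: the lexical form wrapped in an arbitrary start-tag and end-tag.
    xmldoc = "<wrapper>" + lexical_form + "</wrapper>"
    # domdoc: a DOM Document corresponding to xmldoc.
    domdoc = minidom.parseString(xmldoc)
    root = domdoc.documentElement
    fragment = domdoc.createDocumentFragment()
    # appendChild() detaches each node from root before adding it.
    while root.firstChild is not None:
        fragment.appendChild(root.firstChild)
    return fragment


def is_equal_node(a, b):
    """Hand-written approximation of DOM Level 3 isEqualNode (an assumption)."""
    if (a.nodeType, a.nodeName, a.nodeValue) != (b.nodeType, b.nodeName, b.nodeValue):
        return False
    if a.nodeType == Node.ELEMENT_NODE:
        if a.namespaceURI != b.namespaceURI:
            return False
        # Attributes are compared as unordered maps.
        attrs_a = {(x.namespaceURI, x.name): x.value for x in a.attributes.values()}
        attrs_b = {(x.namespaceURI, x.name): x.value for x in b.attributes.values()}
        if attrs_a != attrs_b:
            return False
    if len(a.childNodes) != len(b.childNodes):
        return False
    return all(is_equal_node(x, y) for x, y in zip(a.childNodes, b.childNodes))


same = is_equal_node(lexical_to_fragment("<foo/>"),
                     lexical_to_fragment("<foo></foo>"))
print(same)
```

Like isEqualNode, this comparison includes node names with their prefixes, so two fragments that use different prefixes for the same namespace are treated as different values.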