
Formal Objection to RDF Removal of External Language Information from XML Literals

The I18N WG herewith formally objects to the post-lastcall removal of external language information from XML Literals. This document gives the reasons for this objection, and some background on our motivation and the history of this discussion.

This document is perhaps not as polished as one would like. Some sections are rather well worked out, others are not. A lot of links could be added.

Overview

Main mail messages: RDF decision (point 12: Language tags in typed literals); notice of this decision to I18N WG

Main specs/proposals: RDF M&S, lastcall WDs: Primer, Concepts, Semantics, Syntax, Tests, Schema, post-lastcall internal WDs: Primer, Concepts, Semantics, Syntax, Tests, Schema.

Test case

The I18N WG requests that the following XML/RDF document produce two triples (as it did at lastcall), rather than one (as at post-lastcall):

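<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:eg="http://example.org/terms#">
  <!-- The eg: namespace URI is an assumption, added only so that the example is namespace-well-formed. -->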
  <rdf:Description rdf:about="http://example.org/node">
    <eg:property xml:lang="fr" rdf:parseType="Literal">chat</eg:property>
    <eg:property xml:lang="en" rdf:parseType="Literal">chat</eg:property>
  </rdf:Description>
</rdf:RDF>

[It would have been possible to express this in terms of the last call test cases, but those tests have since been changed, and they depended on syntactic details that are irrelevant here.]
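
The difference between the two readings of the test case can be illustrated with a small script. This is only a sketch under stated assumptions, not an RDF parser: it reads the xml:lang attribute directly off each property element, the eg: namespace URI is invented so that the document parses, and the function name triples and its flag keep_lang_on_xml_literals are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The test document, with namespace declarations added so that it parses.
# The eg: namespace URI is an assumption made for this sketch.
DOC = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:eg="http://example.org/terms#">
  <rdf:Description rdf:about="http://example.org/node">
    <eg:property xml:lang="fr" rdf:parseType="Literal">chat</eg:property>
    <eg:property xml:lang="en" rdf:parseType="Literal">chat</eg:property>
  </rdf:Description>
</rdf:RDF>
"""

def triples(doc, keep_lang_on_xml_literals):
    """Return the set of distinct (subject, predicate, text, lang) tuples."""
    root = ET.fromstring(doc)
    result = set()
    for desc in root.findall(f"{{{RDF}}}Description"):
        subject = desc.get(f"{{{RDF}}}about")
        for prop in desc:
            # Last call keeps the language; post-lastcall drops it.
            lang = prop.get(XML_LANG) if keep_lang_on_xml_literals else None
            result.add((subject, prop.tag, prop.text, lang))
    return result

lastcall = triples(DOC, keep_lang_on_xml_literals=True)
post_lastcall = triples(DOC, keep_lang_on_xml_literals=False)
print(len(lastcall), len(post_lastcall))
```

With the language kept, the French and English "chat" remain two distinct statements; with it dropped, they collapse into a single one.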

Requirements for Language Information in RDF

These are our main requirements for language information in RDF:

Why the Post-Lastcall Approach is not Satisfactory

The reasons for our objection are listed below grouped as follows:

Conflicting with XML 1.0 and general expectations

The post-lastcall proposal is in direct violation of the provisions for xml:lang in XML 1.0. This will lead to problems for both tools and humans, and sets a bad precedent for other specifications:

Creating New RDF Data

The post-lastcall approach relies on the use of <dummy> elements to carry language information inside XML Literals (for all forms of RDF, not only for RDF/XML). This raises the following problems:

Reasoning and Query

For internationalization purposes, text sometimes needs micro-markup. In many cases, this need is not evident to data designers and application designers. It is therefore important to provide for a transition from plain literals to XML Literals that is as smooth as possible. This applies in particular to XML Literals without any markup.

Change of Interpretation of xml:lang for Existing RDF/XML Documents

The change from lastcall to post-lastcall interpretation of xml:lang in RDF/XML documents has several problems:

Availability of Other Solutions

Many alternative solutions are available. Any of them would be acceptable for us, because they avoid the problems listed above.

@@@Central Arguments

Original design in RDF M&S

The original design in RDF M&S is best shown by the following example:

<rdf:Description
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/metadata/dublin_core#"
  xmlns="http://www.w3.org/TR/REC-mathml"
  rdf:about="http://mycorp.com/papers/NobelPaper1">

  <dc:Title rdf:parseType="Literal">
    Ramifications of
       <apply>
      <power/>
      <apply>
        <plus/>
        <ci>a</ci>
        <ci>b</ci>
      </apply>
      <cn>2</cn>
    </apply>
    to World Peace
  </dc:Title>
  <dc:Creator>David Hume</dc:Creator>
</rdf:Description>

This example shows the following salient design points in RDF M&S:

Need for 'micro-markup'

Micro-markup here refers to markup at the phrasal level. This is important for the following reasons, the first four of which are related to Internationalization:

This is clearly documented, among other places, in the I18N last call comments on M&S.

Main Requirement/Goal for Internationalization

A consistent and easy-to-use way of identifying the language of pieces of text.

Why should language identification be consistent

It is very important to have a consistent way to identify the language of a piece of text in any technology so that generic operations needing this information can use it easily. Such operations include rendering-related operations such as (CJK) glyph disambiguation, font selection, hyphenation, text-to-speech conversion (important for accessibility), proofing operations such as spell-checking, as well as operations related to the semantics of the text.

How should language identification be consistent

Language identification should not be different for each application, but should be the same independent of the application, i.e. it should depend only on the underlying technology. The best example for this is xml:lang. XML applications are not required to use xml:lang if they do not need it, but they can use it off-the-shelf whenever needed.

Consistency also applies across base technologies. All W3C technologies, and all IETF technologies we know of, use the same RFC 3066 language tags for language identification.
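
To illustrate what "off-the-shelf" means for xml:lang, the following sketch resolves the language in effect for every element of an arbitrary XML document, without knowing anything about its vocabulary. It is a minimal illustration only: the function name effective_languages and the sample document are invented for this sketch, and an empty xml:lang value is treated as "no language", which is an assumption of the sketch.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def effective_languages(elem, inherited=None):
    """Yield (tag, language) pairs, applying xml:lang scoping top-down."""
    # An element without its own xml:lang inherits the value from its parent.
    lang = elem.get(XML_LANG, inherited)
    # Assumption of this sketch: xml:lang="" resets the language to "none".
    if lang == "":
        lang = None
    yield elem.tag, lang
    for child in elem:
        yield from effective_languages(child, lang)

# Hypothetical sample document; the element names carry no meaning.
doc = ET.fromstring(
    '<doc xml:lang="en"><p>cat</p><p xml:lang="fr">chat</p><p xml:lang="">x</p></doc>'
)
langs = list(effective_languages(doc))
print(langs)
```

The same function works unchanged on any XML vocabulary, which is the kind of application-independent consistency argued for here.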

Why should language identification be easy to use

Language information is in many cases obvious to human readers. Also, humans often deal with information that is mostly in a single language. Therefore, it is easy for humans, from data providers to application programmers, to ignore the importance of language information. If given a choice between preserving language information and preserving other aspects of information, language information easily loses out.

Example of strings without markup

This example uses RDF/XML notation because this notation is more stable; the example is about the model rather than the notation. Consider the following six statements:

<rdf:Description rdf:about='resource'>
  <prop                                      >foo</prop>   <!-- (A) -->
  <prop                         xml:lang='en'>foo</prop>   <!-- (B) -->
  <prop                         xml:lang='fr'>foo</prop>   <!-- (C) -->
  <prop rdf:parseType='Literal'              >foo</prop>   <!-- (D) -->
  <prop rdf:parseType='Literal' xml:lang='en'>foo</prop>   <!-- (E) -->
  <prop rdf:parseType='Literal' xml:lang='fr'>foo</prop>   <!-- (F) -->
</rdf:Description>

In a widely shared understanding of M&S, there are two possible interpretations:

  1. [ignoring language codes]: All six statements mutually entail each other.
  2. [considering language codes]: (A) and (D), (B) and (E), as well as (C) and (F), entail each other in pairs.

At last call, there was the following interpretation: None of the above entails any other one.

After last call, this was changed to the following interpretation: (D), (E), and (F) mutually entail each other, but (A), (B), and (C) are mutually distinct and are all distinct from the (D)-(F) group. To recover the distinction implied by the different xml:lang attribute values in (D)-(F), RDF Core is proposing to add 'dummy' elements, as follows:

  <prop rdf:parseType='Literal'              >foo</prop>                                <!-- (D) -->
  <prop rdf:parseType='Literal' xml:lang='en'><dummy xml:lang='en'>foo</dummy></prop>   <!-- (E')-->
  <prop rdf:parseType='Literal' xml:lang='fr'><dummy xml:lang='fr'>foo</dummy></prop>   <!-- (F')-->
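
The four interpretations of statements (A) to (F) can be compared side by side by treating each interpretation as a rule for when two statements count as the same. The sketch below is an illustration under that simplifying assumption only: mutual entailment is modelled as equality of a key, and the names STATEMENTS, INTERPRETATIONS and groups are invented for this sketch and do not come from any RDF specification.

```python
from collections import defaultdict

# (label, is_xml_literal, lang, text) for statements (A) to (F)
STATEMENTS = [
    ("A", False, None, "foo"),
    ("B", False, "en", "foo"),
    ("C", False, "fr", "foo"),
    ("D", True, None, "foo"),
    ("E", True, "en", "foo"),
    ("F", True, "fr", "foo"),
]

# Each interpretation maps a statement to the key that decides equivalence.
INTERPRETATIONS = {
    "M&S, ignoring language": lambda xml, lang, text: text,
    "M&S, considering language": lambda xml, lang, text: (lang, text),
    "last call": lambda xml, lang, text: (xml, lang, text),
    # Post last call: language is dropped on XML Literals, kept on plain ones.
    "post last call": lambda xml, lang, text: (xml, None if xml else lang, text),
}

def groups(key):
    """Group statement labels whose keys are equal under one interpretation."""
    buckets = defaultdict(list)
    for label, xml, lang, text in STATEMENTS:
        buckets[key(xml, lang, text)].append(label)
    return sorted("".join(labels) for labels in buckets.values())

for name, key in INTERPRETATIONS.items():
    print(name, groups(key))
```

Under the post last call rule, (D), (E) and (F) fall into one group while (A), (B) and (C) stay apart, which is the asymmetry between plain literals and XML Literals that this section objects to.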

Table of observable artefacts and their handling by RDF:

M&S Last Call Post Last Call
plain XML xsd:string plain XML xsd:string
Text X X
Text with language info X
Text with markup X
Text with language info and markup X
XML data (X)

Inconsistencies

'dummy' elements in XML literals create problems

Independent XML blobs vs. integrated RDF/XML document

In discussion, two contrasting uses of XML Literals in RDF and RDF/XML have become apparent, and can roughly be characterized as follows:

The post-lastcall proposal makes usages according to the second view unduly difficult. On the other hand, the lastcall proposal does not needlessly complicate usages according to the first view. Adding xml:lang="" is much easier than adding arbitrary dummy elements.

There is also a serious concern that users will simply ignore the potential of micro-markup if it is too difficult to use.

Effects on existing data

RDF data created according to RDF M&S or to lastcall.

Message calling for "unacceptably adversely affected" cases.

Internationalization Approach

The following things are important for Internationalization:

Process

Process History

Process Problems

Last Call comments claimed against I18N agreement

@@@ add link to Jeremy's mail

Massimo Marchiori (look for *** Section 3.2.2): This is the only comment asking for explicit removal of XML Literals as a special case.

Joseph Reagle one mail, other mail: Joseph wanted to make sure there is no confusion between Canonical XML and exclusive canonicalization, but did not say anything one way or another on xml:lang.

Peter Patel-Schneider:

Tim Berners-Lee: A good interpretation of Tim's comments is provided by Patrick Stickler. The comments are not related to xml:lang.

Eric Prud'hommeaux:

Discussion by RDF Core before Decision

First round of proposals by Jeremy (nuking language information on XML Literals is option 4)

Notable reply by Patrick (coming to the same interpretation of Tim's last call comments and the relation to M&S and charter as we do)

Confirmation from Pat that any of the solutions would be "Not very difficult."... "I am ready for almost any decision we make,"

Solution to "API issues" with wrapper proposal (Jeremy)

Ugly parade (Jeremy)

Unsubstantiated Arguments

Jeremy's summary of arguments by RDF Core

This section lists some of the arguments that have been made for the post-lastcall solution that we think are unsubstantiated:

Against the wrapper solution: Unclear where wrapper comes from (Patrick): The differentiation is very easy: if there is a wrapper in the RDF/XML, there will be two wrappers in the wrapped literal. (@@@ add link to Martin's answer to Brian)

Exclusive Canonicalization says so: Exclusive Canonicalization is a tool with some limitations. The tool should not be used without taking into account its limitations. (@@@ add links)

Use XML Fragments: XML Fragments (CR) is not designed to include independent document pieces in another document. It is not directly applicable.

rdf:parseType="Literal" as an enveloping mechanism for XML content

@@@ Jeremy's mail to Jena-Devel
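
The reply to the first argument in this section (that it is unclear where a wrapper comes from) can be made concrete with a small sketch. It rests on an assumed reading of the wrapper solution: a parser always adds exactly one wrapper element carrying the language, and a consumer always strips exactly one. The element name wrapper, the functions wrap and unwrap, and the string-based handling are all hypothetical and are not taken from any proposal text.

```python
import re

WRAPPER = "wrapper"  # hypothetical element name, not taken from any proposal text

# Matches exactly one outer wrapper; the greedy (.*) keeps any inner wrapper intact.
_OUTER = re.compile(
    rf'^<{WRAPPER}(?: xml:lang="([^"]*)")?>(.*)</{WRAPPER}>$', re.DOTALL
)

def wrap(content, lang=None):
    """Assumed parser behaviour: always add one wrapper carrying the language."""
    attr = f' xml:lang="{lang}"' if lang else ""
    return f"<{WRAPPER}{attr}>{content}</{WRAPPER}>"

def unwrap(literal):
    """Assumed consumer behaviour: strip exactly one wrapper, return (content, lang)."""
    match = _OUTER.match(literal)
    if match is None:
        raise ValueError("literal has no outer wrapper")
    lang, content = match.groups()
    return content, lang

# An author who happens to write a wrapper element in the RDF/XML source.
authored = wrap("foo", "de")

stored_plain = wrap("foo", "en")        # one wrapper: added by the parser
stored_authored = wrap(authored, "en")  # two wrappers: parser's plus author's

print(unwrap(stored_plain))
print(unwrap(stored_authored))
```

Because the parser's wrapper is always the outermost one, content that already contains a wrapper ends up with two, and stripping one recovers exactly what the author wrote.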

Alternative Solution



Richard Ishida, WG chair
Martin J. Dürst, W3C staff contact & IG chair
last revised $Date: 2003/09/26 14:32:11 $ by $Author: connolly $