
Formal Objection to RDF Removal of External Language Information from XML Literals

The I18N WG herewith formally objects to the post-lastcall removal of external language information from XML Literals. This document gives the reasons for this objection, and some background on our motivation and the history of this discussion.

This document is perhaps not as polished as one would like. Some sections are rather well worked out, others are not. A lot of links could be added.

Overview

Main mail messages: RDF decision (point 12: Language tags in typed literals); notice of this decision to I18N WG

Main specs/proposals: RDF M&S; lastcall WDs: Primer, Concepts, Semantics, Syntax, Tests, Schema; post-lastcall internal WDs: Primer, Concepts, Semantics, Syntax, Tests, Schema.

Test case

The I18N WG requests that the following XML/RDF document produce two triples (as it did at lastcall), rather than one (as at post-lastcall):

<rdf:RDF>
  <rdf:Description rdf:about="http://example.org/node">
    <eg:property xml:lang="fr" rdf:parseType="Literal">chat</eg:property>
    <eg:property xml:lang="en" rdf:parseType="Literal">chat</eg:property>
  </rdf:Description>
</rdf:RDF>

[It would have been possible to express this in terms of test cases in the last call, but the tests have been changed in the meantime, and depended on syntactic details that are irrelevant.]
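As an illustration only, the following Python sketch counts the triples for the test document under both interpretations. It is not an RDF/XML parser: it reads only the xml:lang attribute written directly on each property element, it models the object of a triple as a (text, language) pair, and the names `DOC` and `triples` as well as the namespace URI chosen for the eg: prefix are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The test document, with namespace declarations added so that it parses.
# The eg: namespace URI is an assumption made for this sketch.
DOC = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:eg="http://example.org/ns#">
  <rdf:Description rdf:about="http://example.org/node">
    <eg:property xml:lang="fr" rdf:parseType="Literal">chat</eg:property>
    <eg:property xml:lang="en" rdf:parseType="Literal">chat</eg:property>
  </rdf:Description>
</rdf:RDF>
"""


def triples(doc, keep_external_lang):
    """Return the set of (subject, predicate, object) triples.

    The object is modelled as (text, lang) when the external xml:lang
    is kept (lastcall), and as (text, None) when it is dropped
    (post-lastcall). Only text-only literals are handled.
    """
    result = set()
    root = ET.fromstring(doc)
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            lang = prop.get(XML_LANG) if keep_external_lang else None
            result.add((subject, prop.tag, (prop.text, lang)))
    return result


print(len(triples(DOC, keep_external_lang=True)))   # lastcall reading
print(len(triples(DOC, keep_external_lang=False)))  # post-lastcall reading
```

Because a graph is a set of triples, the two property elements collapse into a single triple as soon as the external language information is no longer part of the literal.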

Requirements for Language Information in RDF

These are our main requirements for language information in RDF:

For natural language text, have a standard way to associate language information, so that it can be found and used by generic tools.
For consistency, provide solutions as close as possible for plain literals and XML Literals.
For RDF/XML, reuse xml:lang according to the XML 1.0 specification.
Do not require programmers/users to trade off use of language information against other goals, such as markup integrity.
Make extraction/creation of text with language information as straightforward and easy as possible.

Why the Post-Lastcall Approach is not Satisfactory

The reasons for our objection are listed below grouped as follows:

Conflicting with XML 1.0 and general expectations
Creating new RDF data
Reasoning and query
Change of interpretation of xml:lang for existing RDF/XML documents
Availability of other Solutions

Conflicting with XML 1.0 and general expectations

The post-lastcall proposal is in direct violation of the provisions for xml:lang in XML 1.0. This will lead to problems for both tools and humans, and sets a bad precedent for other specifications:

We don't know any XML-based specification that openly contradicts the inheritance rules for xml:lang as given by XML 1.0. This sets a very bad precedent.
Any kind of generic tool (e.g. screenreader for XML) may make wrong assumptions.
The very peculiar way of language information inheritance in RDF/XML (plain literals inherit from the root down, but XML Literals only inherit internally) would need a lot of additional outreach effort (and would be difficult to explain). [Looking back, there was quite some effort needed to get people used to the hierarchical scope of html:lang and later xml:lang (as opposed to linear 'new language starts here' codes).]
The burden of convincing people to use language information, and of telling them how to use it correctly, will mostly lie with the GEO taskforce of the I18N WG. It may take them years to reach RDF, and their job shouldn't be made more difficult.
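To make concrete what a generic tool assumes, here is a minimal Python sketch of xml:lang scoping as XML 1.0 describes it: a declaration applies to the element and all of its content until overridden, and xml:lang="" removes the language information. The function name `effective_languages` and the sample document are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample document, not taken from any specification.
DOC = """\
<doc xml:lang="en">
  <p>inherits en</p>
  <p xml:lang="fr">declares fr <em>inherits fr</em></p>
  <p xml:lang="">no language <em>still none</em></p>
</doc>
"""


def effective_languages(elem, inherited=None):
    """Yield (element, language) pairs using hierarchical xml:lang scoping."""
    # An explicit attribute overrides the inherited value; the empty
    # string means "no language information" and is normalised to None.
    lang = elem.get(XML_LANG, inherited) or None
    yield elem, lang
    for child in elem:
        yield from effective_languages(child, lang)


for elem, lang in effective_languages(ET.fromstring(DOC)):
    print(elem.tag, lang)
```

A tool written this way has no reason to stop inheriting at an element that happens to carry rdf:parseType="Literal", which is the mismatch described above.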

Creating New RDF Data

The post-lastcall approach relies on the use of <dummy> elements to carry language information inside XML Literals (for all forms of RDF, not only for RDF/XML). This raises the following problems:

There is no straightforward way to associate language information with XML Literals, because there is no single <dummy> element.
The post-lastcall approach fails to preserve markup integrity for XML literals (while also preserving language information in a standard way) when extracting or otherwise transferring XML fragments from other documents. We think there are many cases where markup integrity is important, or is perceived to be important. The post-lastcall approach in such cases leaves the choice between preserving markup or preserving easily usable language information. We think that too often, language information will lose (i.e. be omitted in order to preserve markup integrity).
We think that the post-lastcall proposal is unnecessarily complicated for the user (not end user, but programmer/extractor/...), and that this will introduce a strong risk that users do not implement language information properly.
The current approach assumes the existence of otherwise neutral markup constructs to describe and carry language information in the native markup associated with a fragment. Such constructs may not exist, in which case it seems impossible to ascribe such information at a meta level.
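The trade-off can be sketched in a few lines of Python. The two functions below are hypothetical helpers, not part of any RDF toolkit, and the literal is modelled simply as a (markup, language) pair. The first keeps the extracted fragment untouched and carries the language beside it; the second has to change the fragment, and the result depends on which wrapper element the producer happens to pick.

```python
from xml.sax.saxutils import quoteattr

# Hypothetical fragment extracted from some other document.
FRAGMENT = "He said <em>Oui</em>"


def lastcall_literal(fragment, lang):
    """Keep the fragment as is and carry the language externally."""
    return (fragment, lang)


def postlastcall_literal(fragment, lang, wrapper):
    """Move the language inside the literal by wrapping the fragment.

    The wrapper element name is the producer's choice; there is no
    single agreed dummy element.
    """
    wrapped = f"<{wrapper} xml:lang={quoteattr(lang)}>{fragment}</{wrapper}>"
    return (wrapped, None)


print(lastcall_literal(FRAGMENT, "fr"))
print(postlastcall_literal(FRAGMENT, "fr", "span"))
print(postlastcall_literal(FRAGMENT, "fr", "dummy"))
```

Two producers starting from the same fragment and the same language end up with different literals as soon as they choose different wrappers.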

Reasoning and Query

For internationalization purposes, text sometimes needs micro-markup. In many cases, this need is not evident to data designers and application designers. It is therefore important to provide for a transition from plain literals to XML Literals that is as smooth as possible. This in particular applies to XML literals without any markup.

Once <dummy> elements are inserted to carry language information, it is impossible for a general application or a general technology such as a future RDF Query mechanism to know whether an element was inserted as a dummy element or carries actual meaning.
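A minimal Python sketch of the query side, under stated assumptions: literals are modelled as tagged tuples, XML Literals are restricted to a single element so that they can be parsed on their own, and `matches` is a hypothetical helper rather than part of any RDF Query proposal. To find the text "chat" in French in both plain and XML Literals, the query has to be told which element names count as dummies.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def matches(literal, text, lang, dummy_names):
    """Return True if the literal is `text` in language `lang`.

    `literal` is ("plain", text, lang) or ("xml", xml_string).
    An XML Literal only matches if its single root element is one of
    the element names the caller believes to be dummies.
    """
    if literal[0] == "plain":
        return literal[1:] == (text, lang)
    root = ET.fromstring(literal[1])
    return (
        root.tag in dummy_names
        and len(root) == 0
        and root.text == text
        and root.get(XML_LANG) == lang
    )


plain = ("plain", "chat", "fr")
span = ("xml", '<span xml:lang="fr">chat</span>')
dummy = ("xml", '<dummy xml:lang="fr">chat</dummy>')

for literal in (plain, span, dummy):
    print(literal, matches(literal, "chat", "fr", dummy_names={"span"}))
```

The third literal is missed because the query does not know about its wrapper, and nothing in the second literal tells the query whether the span was inserted as a dummy or was part of the original markup.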

Change of Interpretation of xml:lang for Existing RDF/XML Documents

The change from lastcall to post-lastcall interpretation of xml:lang in RDF/XML documents has several problems:

We do not know how much RDF/XML data is out there that relies on M&S/lastcall conventions, i.e. would lose relevant language information for XML literals. A call for "existing data or applications that will be unacceptably adversely affected" was made, but we do not think that the mostly negative results are conclusive. We think that it would be wrong to put the burden of change on people who have followed existing specifications and did what we think was the right thing, even if the number of such people is not terribly high.
Because there is no general, totally neutral <dummy> element, converting old data to the new syntax is a non-trivial process. This is similar to creating new documents (see above), except that the connection with the original data (if there was such data) is already lost.
Adding <dummy> elements to carry language information may cause problems for applications that previously worked well with the same data without the <dummy> elements.
Because the change between lastcall and post-lastcall is just a change in interpretation, not a change in syntax, it is difficult if not impossible for an RDF/XML parser to issue sensible warnings for documents using the old convention. [It is possible to issue a warning when seeing an XML Literal in the context of a non-empty xml:lang attribute, but such a warning would in most cases quickly be ignored, and could only be suppressed by adding xml:lang="" for each XML Literal, thus effectively making the post-lastcall change irrelevant.]
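The warning mentioned in the bracketed note could look roughly like the following Python sketch. It is not a real RDF/XML parser: it only walks the element tree, tracks the xml:lang value in scope, and reports elements with rdf:parseType="Literal" that sit in the scope of a non-empty language. The function name `literal_lang_warnings`, the sample document, and the eg: namespace URI are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

RDF_PARSETYPE = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}parseType"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical document written with the M&S/lastcall convention in mind.
DOC = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:eg="http://example.org/ns#" xml:lang="en">
  <rdf:Description rdf:about="http://example.org/node">
    <eg:title rdf:parseType="Literal">chat</eg:title>
    <eg:blob rdf:parseType="Literal" xml:lang=""><x/></eg:blob>
    <eg:label>chat</eg:label>
  </rdf:Description>
</rdf:RDF>
"""


def literal_lang_warnings(elem, inherited=""):
    """Return (tag, language) for each XML Literal in a non-empty xml:lang scope."""
    lang = elem.get(XML_LANG, inherited)
    if elem.get(RDF_PARSETYPE) == "Literal":
        # The content of the literal is not RDF/XML, so do not descend.
        return [(elem.tag, lang)] if lang else []
    found = []
    for child in elem:
        found.extend(literal_lang_warnings(child, lang))
    return found


for tag, lang in literal_lang_warnings(ET.fromstring(DOC)):
    print(f"warning: XML Literal {tag} is in the scope of xml:lang={lang!r}")
```

Only the first property triggers a warning; the second one silences it with xml:lang="", which is exactly the marker that the lastcall design already used for independent XML content.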

Availability of Other Solutions

Many alternative solutions are available. Any of them would be acceptable for us, because they avoid the problems listed above.

I18N has not been convinced that any of the alternative proposals for including language information are problematic, and feels they are more intuitive and workable than the current proposal because they do not entail the problems cited above.@@@
The alternative solutions do not seem to pose any serious issues for implementers.
The alternative solutions do not seem to pose any serious issues on a theoretical level.
The need for integration of independent, block-level XML pieces into RDF/XML is covered by xml:lang="".
Proposals involving more far-reaching changes in how RDF handles language information have also been made, including using reification and using languages as properties. It seems to be possible to define a consistent and easy to use mechanism to handle language information in RDF using such a technique.

@@@Central Arguments

Micro-markup, and the possibility to easily add or introduce micro-markup to cases where only text is used, is important for internationalization (multilingual texts, bidi, ruby, glyph variants,...).

Original design in RDF M&S

The original design in RDF M&S is best shown by the following example:

<rdf:Description
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/metadata/dublin_core#"
  xmlns="http://www.w3.org/TR/REC-mathml"
  rdf:about="http://mycorp.com/papers/NobelPaper1">

  <dc:Title rdf:parseType="Literal">
    Ramifications of
    <apply>
      <power/>
      <apply>
        <plus/>
        <ci>a</ci>
        <ci>b</ci>
      </apply>
      <cn>2</cn>
    </apply>
    to World Peace
  </dc:Title>
  <dc:Creator>David Hume</dc:Creator>
</rdf:Description>

This example shows the following salient design points in RDF M&S:

Literals can be simple text, or text with (XML) markup.
Markup is used to give more information@@@

Need for 'micro-markup'

Micro-markup here refers to markup at the phrasal level. This is important for the following reasons, the first four of which are related to Internationalization:

Multilingual values: He said <span xml:lang='fr'>Oui</span> because he spoke French fluently.
Ruby Annotation (e.g. Japanese)
Glyph variant selection
Bidirectionality (e.g. Arabic, Hebrew)
Mathematical formulae
Chemical formulae
Accessibility-related annotations, e.g. expansions on acronym or abbr
Other kinds of micro-markup

This is clearly documented, among other places, in: I18N last call comments on M&S

Main Requirement/Goal for Internationalization

Consistent and easy to use way of identifying the language of text pieces

Why should language identification be consistent

It is very important to have a consistent way to identify the language of a piece of text in any technology so that generic operations needing this information can use it easily. Such operations include rendering-related operations such as (CJK) glyph disambiguation, font selection, hyphenation, text-to-speech conversion (important for accessibility), proofing operations such as spell-checking, as well as operations related to the semantics of the text.

How should language identification be consistent

Language identification should not be different for each application, but should be the same independent of the application, i.e. it should depend only on the underlying technology. The best example for this is xml:lang. XML applications are not required to use xml:lang if they do not need it, but they can use it off-the-shelf whenever needed.

Consistency also applies across base technologies. All W3C technologies, and all IETF technologies we know, use the same RFC 3066 language tags for language identification.

Why should language identification be easy to use

Language information is in many cases obvious to human readers. Also, humans often deal with information that is mostly in a single language. Therefore, it is easy for humans, from data providers to application programmers, to ignore the importance of language information. If given a choice between preserving language information and preserving other aspects of information, language information easily loses.

Example of strings without markup

This example uses RDF/XML notation because this notation is more stable; the example is about the model rather than the notation. Consider the following six statements:

<rdf:Description rdf:about='resource'>
  <prop                                      >foo</prop>   <!-- (A) -->
  <prop                         xml:lang='en'>foo</prop>   <!-- (B) -->
  <prop                         xml:lang='fr'>foo</prop>   <!-- (C) -->
  <prop rdf:parseType='Literal'              >foo</prop>   <!-- (D) -->
  <prop rdf:parseType='Literal' xml:lang='en'>foo</prop>   <!-- (E) -->
  <prop rdf:parseType='Literal' xml:lang='fr'>foo</prop>   <!-- (F) -->
</rdf:Description>

In a widely shared understanding of M&S, there are two possible interpretations:

[ignoring language codes]: All six statements mutually entail each other.
[considering language codes]: (A) and (D), (B) and (E), as well as (C) and (F), entail each other in pairs.

At last call, there was the following interpretation: None of the above entails any other one.

After last call, this was changed to the following interpretation: (D), (E), and (F) mutually entail each other, but (A), (B), and (C) are mutually different and are all different from the D-F group. To get the distinction implied by the different xml:lang attribute values in D-F, RDF Core is proposing to add 'dummy' elements, as follows:

  <prop rdf:parseType='Literal'              >foo</prop>                                <!-- (D) -->
  <prop rdf:parseType='Literal' xml:lang='en'><dummy xml:lang='en'>foo</dummy></prop>   <!-- (E')-->
  <prop rdf:parseType='Literal' xml:lang='fr'><dummy xml:lang='fr'>foo</dummy></prop>   <!-- (F')-->
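The four interpretations of statements (A) to (F) can be compared with a small Python sketch. It is a simplification under stated assumptions: each statement is reduced to a (kind, text, language) tuple, "mutually entail each other" is modelled as equality of a key derived from that tuple, and all names (`STATEMENTS`, `INTERPRETATIONS`, `groups`) are chosen for this sketch only.

```python
from collections import defaultdict

# Statements (A)-(F) reduced to (kind, text, language).
STATEMENTS = {
    "A": ("plain", "foo", None),
    "B": ("plain", "foo", "en"),
    "C": ("plain", "foo", "fr"),
    "D": ("xml", "foo", None),
    "E": ("xml", "foo", "en"),
    "F": ("xml", "foo", "fr"),
}

# Each interpretation maps a statement to the key that decides equality.
INTERPRETATIONS = {
    "M&S, ignoring language": lambda kind, text, lang: text,
    "M&S, considering language": lambda kind, text, lang: (text, lang),
    "last call": lambda kind, text, lang: (kind, text, lang),
    "post last call": lambda kind, text, lang: (
        kind, text, lang if kind == "plain" else None
    ),
}


def groups(key):
    """Group statement labels whose keys are equal under `key`."""
    buckets = defaultdict(list)
    for label, statement in STATEMENTS.items():
        buckets[key(*statement)].append(label)
    return sorted(buckets.values())


for name, key in INTERPRETATIONS.items():
    print(name, groups(key))
```

Under the post-lastcall key, (D), (E), and (F) fall into one group, while the plain literals (A), (B), and (C) stay apart; the language distinction for XML Literals reappears only if the text itself is changed, as in (E') and (F').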

Table of observable artefacts and their handling by RDF:

                                   | M&S | Last Call                | Post Last Call
                                   |     | plain | XML | xsd:string | plain | XML | xsd:string
Text                               | X   |       |     |            |       |     | X
Text with language info            | X   |       |     |            |       |     |
Text with markup                   | X   |       |     |            |       |     |
Text with language info and markup | X   |       |     |            |       |     |
XML data                           | (X) |       |     |            |       |     |

Inconsistencies

Plain literals have language information, but XML literals do not.
XML literals 'inherit' (through exclusive canonicalization) 'visible' namespace prefix declarations, but do not inherit xml:lang.

'dummy' elements in XML literals create problems

Problems with conversions: When to include such an element, what element to include
Problem for query: it is difficult to write queries that search for both plain literals and XML literals, because additional knowledge is needed about what dummy element to expect, and in which situations it is really just a dummy and in which others it is relevant.
In general, creates alternate code paths.
Artificial change of markup; it is not possible to keep markup as is without losing language information.
Impression that html:span is the neutral element to use, when it may not be appropriate in some contexts.
Incompatibility with anything out there based on RDF M&S
Additional effort for users who edit by hand

Independent XML blobs vs. integrated RDF/XML document

In discussion, two contrasting uses of XML Literals in RDF and RDF/XML have become apparent, and can roughly be characterized as follows:

XML Literals as independent blobs of XML
- XML Literals are preferably single elements with content, or maybe element content (several elements), but not
XML Literals as textual XML fragments

The post-lastcall proposal makes it unduly difficult for usages according to the second view. On the other hand, the lastcall proposal does not needlessly complicate usages according to the first view. Adding xml:lang="" is much easier than adding arbitrary dummy elements.

There is also a serious concern that users will simply ignore the potential of micro-markup if it is too difficult to use.

Effects on existing data

RDF data created according to RDF M&S or to lastcall.

Message calling for "unacceptably adversely affected" cases.

Internationalization Approach

The following things are important for Internationalization:

Make things easy even for cases with special requirements (e.g. bidi, ruby)
Make it easy to extend things to such cases
Consistency from a user's point of view
Reduced need of explanation
Reuse of features and consistency amongst specifications
Not to punish people doing the right thing by arbitrary spec changes
Careful lookahead into upcoming needs to avoid divergence early on

Process

Process History

I18N was quite involved in the design of RDF M&S, in particular of the parseType="Literal" syntax and the handling of xml:lang. See also I18N last call comments. On xml:lang, RDF M&S says:

The xml:lang attribute may be used as defined by [XML] to associate a language with the property value. There is no specific data model representation for xml:lang (i.e., it adds no triples to the data model); the language of a literal is considered by RDF to be a part of the literal. An application may ignore language tagging of a string. All RDF applications must specify whether or not language tagging in literals is significant; that is, whether or not language is considered when performing string matching or other processing.
After the RDF Core WG was chartered to clarify the RDF M&S syntax, delegations from the RDF Core WG and the I18N WG met at the Technical Plenary in Cannes. The ambiguity of whether language tagging is significant or not was resolved by making it significant in the RDF model, which would still allow applications to ignore it in application-specific comparisons.
As a result of the Technical Plenary and input from other groups, the I18N WG worked on a solution for dealing with subtrees of unknown language, checked this proposal with experts in language identification (Library of Congress), proposed the solution to the XML Core WG (keeping the RDF Core WG informed), and discussed it with XML Core, followed by a decision by the XML Core WG.

Process Problems

There was quite a bit of i18n involvement for RDF M&S; in particular Misha Wolf was a full member of the WG. This resulted in the M&S design. The current design
RDF Core was aware of relevance of XML Literals and micromarkup to I18N due to a last call comment from XML Schema (expressing our concerns probably better than we could have done ourselves) @@@
RDF Core was apparently aware of our interest in language information on XML Literals, because they asked about our opinion

Last Call comments claimed against I18N agreement

@@@ add link to Jeremy's mail

Massimo Marchiori (look for *** Section 3.2.2): This is the only comment asking for explicit removal of XML Literals as a special case.

Joseph Reagle one mail, other mail: Joseph wanted to make sure there is no confusion between Canonical XML and exclusive canonicalization, but did not say anything one way or another on xml:lang.

Peter Patel-Schneider:

Tim Berners-Lee: A good interpretation of Tim's comments is provided by Patrick Stickler. The comments are not related to xml:lang.

Eric Prud'hommeaux:

Discussion by RDF Core before Decision

First round of proposals by Jeremy (nuking language information on XML Literals is option 4)

Notable reply by Patrick (coming to the same interpretation of Tim's last call comments and the relation to M&S and charter as we do)

Confirmation from Pat that any of the solutions would be "Not very difficult."... "I am ready for almost any decision we make,"

Solution to "API issues" with wrapper proposal (Jeremy)

Ugly parade (Jeremy)

Unsubstantiated Arguments

Jeremy's summary of arguments by RDF Core

This section lists some of the arguments that have been made for the post-lastcall solution that we think are unsubstantiated:

Against the wrapper solution: Unclear where wrapper comes from (Patrick): The differentiation is very easy, if there is a wrapper in the RDF/XML, there will be two wrappers in the wrapped literal. (@@@ add link to Martin's answer to Brian)

Exclusive Canonicalization says so: Exclusive Canonicalization is a tool with some limitations. The tool should not be used without taking into account its limitations. (@@@ add links)

Use XML Fragments: XML Fragments (CR) is not designed to include independent document pieces in another document. They are not directly applicable.

rdf:parseType="Literal" as an enveloping mechanism for XML content

@@@ Jeremy's mail to Jena-Devel

Alternative Solution

Removing language information from XML Literals was not a forced or clear-cut solution, but one of 5 'ugly alternatives' @@@
A lot of last call comments are cited for removing language information from XML Literals @@@. However, only one of them @@@ explicitly asks for removal.

Richard Ishida, WG chair
Martin J. Dürst, W3C staff contact & IG chair
last revised $Date: 2003/09/26 14:32:11 $ by $Author: connolly $