I18N last call comments on XQuery/XPath Data Model

Dear XML Query WG and XSL WG,

Below please find the I18N WG's comments on your last call document
"XQuery 1.0 and XPath 2.0 Data Model"
(http://www.w3.org/TR/2003/WD-xpath-datamodel-20030502/).

Please note the following:
- Please address all replies to these comments to the I18N IG mailing
   list (w3c-i18n-ig@w3.org), not just to me.
- All i18n-relevant comments are marked with ***. There are also general
   comments on the spec which we hope you will find useful.
- We have not yet reviewed the other documents, such as XQuery 1.0
   or XSLT 2.0, and so we might be unaware of i18n issues that appear
   in these specs but may have to be traced back to the data model.
   There are also cases where we have identified an i18n issue here,
   but we are not sure exactly what the best solution will be.
- Our comments are numbered in square brackets [nn].

We look forward to further discussion with you.


General:

[1] In general, this is a very extensive and rather boring document.
Where possible, it should be shortened and compacted, to make it
easier to get the relevant points.

[2] There are mappings between the following different things:
- properties of nodes of the data model
- the corresponding accessors
- the mapping from the PSVI (post-schema validation infoset)
   to the properties
- the mapping from accessor output to XML infoset properties
(in the other draft we are reviewing, there are also functions
corresponding to the accessors).

At least one of these could easily be removed (e.g.
the properties or the accessors).


[3] 1. Intro: 'stylesheet or query' should be replaced by 'transform or query'

[4] 2. expanded-QName: Does this allow handling of special cases such as
    XSLT that transforms XSLT, or XQuery that queries XQuery,...?

[5] 3.2 Document order: "The relative order of nodes in distinct documents
is implementation-dependent but stable. In other words, given two distinct
documents A and B, if a node in document A is before a node in document B,
then every node in document A is before every node in document B."
    The second sentence sounds like a corollary from the first, but is
    a non sequitur. It could as well be that an implementation decides
    to order first all the first nodes from all the documents, then all
    the second nodes, and so on. If indeed all nodes of one document
    have to be before all nodes of another document, that should be
    said explicitly, and not only as 'in other words'.
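
    As a minimal sketch of such an ordering, in Python: the node model
    below (a document id plus a position) is hypothetical and not taken
    from the draft. The order is fixed and repeatable, yet it interleaves
    the nodes of the two documents.

```python
from itertools import product

# Hypothetical model: a node is (document_id, position_within_document).
nodes = [(doc, pos) for doc, pos in product("AB", range(3))]


def interleaved_key(node):
    doc, pos = node
    # Order by position first, then by document id: a fixed total order
    # that still mixes nodes from different documents.
    return (pos, doc)


ordered = sorted(nodes, key=interleaved_key)


def before(x, y):
    """True if node x precedes node y in the chosen order."""
    return ordered.index(x) < ordered.index(y)


# Repeating the sort gives the same result, so the order is stable.
assert sorted(nodes, key=interleaved_key) == ordered

# A node of A is before a node of B ...
print(before(("A", 0), ("B", 0)))
# ... but not every node of A is before every node of B.
print(before(("A", 1), ("B", 0)))
```

    Within each document the original order is preserved, so only the
    explicit "every node" requirement rules this ordering out.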

[6] 3.3, markup of [Definition]. Using square brackets for indicating
    definitions doesn't look good at all. Also, there should not
    be a period before and after the closing ].


[7] *** 3.3 data model support of values that are not supported
    by the XML Infoset: What about pcdata with an associated
    language information? What about document fragments with
    associated inherited attributes in general? RDF is dealing
    with such things, and it would be very good if they could be handled.

[8] *** The handling of inherited attributes in general is an important
    issue for I18N (because of xml:lang) that wasn't dealt with at
    all in XSLT 1.0. Apart from what may be needed in the data model,
    support is also important on a higher level.

[9] 3.3 "The data model supports incompletely validated documents, but 
inconsistent data models are forbidden."
     What is an inconsistent data model? What actually happens when there
     is such a model? Does an error get thrown?

[10] 3.4 "In either case, the type names must also appear in the In-scope
Schema Definitions (as defined in [XPath 2.0]) available to the processor."
     'type name' or rather 'type definitions' or 'types'? There are
     anonymous type names, but it seems strange to say that these appear
     somewhere.

[11] *** anyType, anySimpleType, anyAtomicType, untypedAtomic, string,
     text nodes:
     This is a very general concern, but very important for
     internationalization. There seems to be a proliferation of type
     variants dealing with the simplest of things in XML, namely simple
     text. This seems to ruin quite a bit of the benefits of using
     Unicode; now that we have solved the character encoding problems,
     we don't want to create arbitrary differences for simple pieces of
     text. But various specs (e.g. also RDF) seem to come up with
     additional ways of creating arbitrary differences.
     anyAtomicType and untypedAtomic seem to be badly explained
     and justified. We have to make sure that whenever possible, there
     are no arbitrary boundaries in functionality. Rather than treating
     string, text nodes, and untyped as three completely different
     things, they should work as much as possible in an overloaded
     way similar to the number operators.

[12] *** 3.6.1 date and time mappings: for things with timezones,
     canonicalizing
     the time zone and then representing the original time zone separately
     seems to make sense. But for values without a timezone, representing
     them as if they were in UTC is inherently wrong and will lead to
     a lot of misunderstandings. (having things with timezones and things
     without timezones as separate types would have been the better
     solution originally, and maybe it's still not too late for that)
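
     The difference can be sketched with Python's datetime module, used
     here only as an analogy for the two kinds of values. The sample
     timestamps and the +09:00 offset are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# A value with a timezone: normalizing to UTC loses nothing as long as
# the original offset is kept alongside the normalized value.
with_tz = datetime.fromisoformat("2003-07-07T11:42:55+09:00")
normalized = with_tz.astimezone(timezone.utc)
original_offset = with_tz.utcoffset()

# A value without a timezone: nothing says which instant is meant.
without_tz = datetime.fromisoformat("2003-07-07T11:42:55")

# Treating it "as if UTC" silently picks one instant. If the author meant
# local time at +09:00, the assumed instant is nine hours off.
assumed_utc = without_tz.replace(tzinfo=timezone.utc)
error = assumed_utc - with_tz

# Python keeps the two kinds apart: ordering them raises TypeError.
try:
    without_tz < with_tz
    comparable = True
except TypeError:
    comparable = False

print(normalized.isoformat(), original_offset, error, comparable)
```

     Keeping the two kinds of value apart, as this library does, avoids
     the silent nine-hour error.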

[13] 3.6.1, editorial: "Lexical representations that do not have a timezone
are assumed to be in UTC for the purposes of normalization." ->
    "Lexical representations that do not have a timezone are assumed to be
    in UTC for the purposes of normalization ONLY."

[14] 4.1.6 typed-value: It would be good to have some explanation of what
     the idea/purpose of this accessor is. It seems to be strange that
     some cases produce errors. Why does mixed content produce a string,
     but complex content, a subset of mixed content, produce an error?

[15] *** 4.1.8 children Accessor: "The sequence of children will never 
contain adjacent text nodes." (see also 4.2.1)
     It is good that text nodes are always merged. But this should
     be stated as a property of the data model, not just mentioned
     in an accessor description.

4.2.1 "The children must consist exclusively of element, processing 
instruction, comment, and text nodes if it is not empty. Attribute, 
namespace, and document nodes can never appear as children"
     [16] - 'if it is not empty' seems irrelevant; obviously an empty document
       won't contain any nodes of other types either.
     [17] - There should be a period at the end

[18] 4.2.1, "Implementations that support DTD processing and access to the 
unparsed entity accessors, use the unparsed-entities property to associate 
information about an unordered collection of unparsed entities with a 
document node."
     There is a spurious comma after 'accessors'.

[19] 4.2.2 typed-value: why does document return the string value, but
    any of its elements could return an error?

[20] *** 4.2.2 and many other places: As far as we understand from previous
    discussions, xs:string is often used instead of xs:anyURI for
    convenience (to avoid additional casts). It is important in these
    cases to clearly state that the values actually have to be anyURIs,
    AND are treated according to anyURI syntax.

[21] *** 4.2.4: [character encoding scheme]: "The values of these 
properties are implementation-defined but must be consistent with the rest 
of the Infoset constructed."
     What does 'consistent' mean here? There is a dependency between
     non-ASCII element/attribute/... names and the encoding chosen.
     But for a data model that produces an infoset that is (not yet)
     intended for serialization, it almost seems that any specific
     value would be inappropriate. On the other hand, when actually
     being written out, at least for XSLT, the property is not
     implementation-dependent, but determined by the <output>
     element. So we suggest the following text:
     "irrelevant during processing, determined by XQuery or XSLT
     for output"

[22] *** 4.3.1: processing instructions and comments: Is there a way
     to ignore these (if not in the data model, then in XQuery and XSLT?)
     Because they are not part of the actual text, ignoring them
     is often desirable. In that case, the text nodes should merge
     automatically.
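
     The desired behaviour can be sketched with Python's xml.dom.minidom.
     This is a DOM, not the XQuery/XPath data model, and the helper
     strip_comments_and_pis is a hypothetical name; the sketch only shows
     text nodes merging once comments and processing instructions are
     dropped.

```python
from xml.dom import minidom
from xml.dom.minidom import Node


def strip_comments_and_pis(node):
    """Recursively remove comment and processing-instruction children."""
    for child in list(node.childNodes):
        if child.nodeType in (Node.COMMENT_NODE,
                              Node.PROCESSING_INSTRUCTION_NODE):
            node.removeChild(child)
        else:
            strip_comments_and_pis(child)


doc = minidom.parseString(
    "<p>inter<!-- note -->national<?pi x?>ization</p>")
p = doc.documentElement
children_before = len(p.childNodes)   # text, comment, text, PI, text

strip_comments_and_pis(p)
children_stripped = len(p.childNodes)  # three adjacent text nodes

# normalize() merges adjacent text nodes into a single node.
p.normalize()
children_merged = len(p.childNodes)
merged_text = p.firstChild.data

print(children_before, children_stripped, children_merged, merged_text)
```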

[23] 4.3.2 "If the element node's type is xs:anyType, the
dm:typed-value accessor returns the node's string value as
xs:anySimpleType. If the type is a complex type with complex content,
invoking dm:typed-value raises an error."
     Doesn't anyType include complex types?

[24] 4.3.2: One additional accessor: Why is this accessor not listed in the
     table?

[25] 4.3.3: Are xml:base attributes treated as special attributes or like
     namespace declarations?

[26] 4.4.1: "Attribute nodes encapsulate XML attributes": 'represent' may be
     better than 'encapsulate'.

[27] 4.4.2: The details about typed-value are useless duplications. It would
     be better to specify this very clearly in one single place, and
     just point to it from other places.

[28] 4.4.3: "The xs:QName IS computed..."

[29] 4.4.4: [owner element] -> [parent]

[30] *** 4.5.1: uri -> anyURI (or an equivalent explanation)

[31] *** 4.8.3: "The string-value is not W3C normalized as described in the 
Character Model for the World Wide Web version 1.0 draft."
     This may be misunderstood as meaning that the string value has to be
     non-normalized. It should at least be clarified as follows:
     "The string-value is not necessarily W3C normalized as described
      in the Character Model for the World Wide Web version 1.0 draft.
      It is the responsibility of data providers to provide appropriately
      normalized text, and the responsibility of programmers to make
      sure that operations do not de-normalize text."
     Even better clarification, in particular of the first sentence,
     is highly desirable, to clearly say that this refers to a state,
     and not an action.
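
     How an operation can de-normalize text is sketched below in Python.
     Plain Unicode NFC (via unicodedata) stands in for W3C normalization,
     which is an assumption: the Character Model draft adds further
     constraints that this sketch ignores. The sample string is made up.

```python
import unicodedata


def is_nfc(s):
    """State check: is the string already in NFC? Nothing is modified."""
    return unicodedata.normalize("NFC", s) == s


# "e", "x", COMBINING ACUTE ACCENT: there is no precomposed x-with-acute,
# so this string is already in NFC as provided.
provided = "ex\u0301"
provided_ok = is_nfc(provided)

# A harmless-looking edit (deleting "x") leaves "e" followed directly by
# the combining accent, which NFC would compose into U+00E9.
edited = provided.replace("x", "")
edited_ok = is_nfc(edited)

# Action: the programmer restores normalization after the edit.
repaired = unicodedata.normalize("NFC", edited)

print(provided_ok, edited_ok, repaired == "\u00e9")
```

     The check is_nfc reports a state, while the final normalize call is
     an action; the quoted sentence should be read in the first sense.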

[32] 5. "The values of nodes whose type is derived by union from an XML 
Schema primitive type are represented by a sequence of atomic values each 
of whose type is one of the individual types from the union. The union type 
information is lost and only the specific types of each individual item is 
retained."
     This seems to apply to lists of unions, or maybe unions of lists,
     but not to simple unions. This should be clarified.

[33] 5. "Using the canonical lexical representation for atomic values may 
not always be compatible with XPath 1.0.": Please say when this is not the 
case.

D. Example:
     [34] *** xml:lang should be used in the instance, not only appear
       in the schema (and in the schema be allowed higher-up so
       that it can be inherited)
     [35] *** Defining a default currency in the schema is bad design
       practice. Without the schema, the data is basically useless.
       Please choose something different for an example of default attribute
       handling.
     [36] *** The monetaryAmount type works well for some currencies
       (USD, EUR,...), but does not work for others (Yen,...).
       Please generalize. The number of fractional digits needed
       currently is 0, 2, or 3.
       For details, please see:
http://www.bsi-global.com/Technical+Information/Publications/_Publications/tig90x.doc
     [37] *** The pop-culture example may make it difficult for non-native
       readers to understand the example, or to create a reasonable
       translation.
     [38] - "Literal strings are shown without the xs:string() constructor"
       this should say that strings are shown in quotes
     [39] - Why are N1-N5 before P1 and E1?
     [40] - A4: why is typed-value xs:token?
     [41] - typed-value of E5: inconsistent.
     [42] - other inconsistencies include: children(E5)->T2,
       string-value(A7), (A8), (A9), (A10), (A11) (string and typed
       values seem out of sync)
     [43] - Graphic representation of the data model. [large view]: This
       should be provided in SVG
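
     To illustrate comment [36], here is a minimal Python sketch of a
     check that depends on the currency instead of assuming two
     fractional digits. The table and the function name are hypothetical;
     the digit counts for the four sample currencies follow ISO 4217.

```python
from decimal import Decimal

# Assumed sample of ISO 4217 minor-unit digits; not a complete table.
FRACTION_DIGITS = {"USD": 2, "EUR": 2, "JPY": 0, "KWD": 3}


def is_valid_amount(value: str, currency: str) -> bool:
    """True if value has at most the fraction digits the currency allows."""
    allowed = FRACTION_DIGITS[currency]
    # For finite decimals the exponent is minus the number of fraction
    # digits, e.g. Decimal("10.50") has exponent -2.
    exponent = Decimal(value).as_tuple().exponent
    return -exponent <= allowed


print(is_valid_amount("10.50", "USD"))   # two digits fit USD
print(is_valid_amount("10.50", "JPY"))   # yen has no minor unit
print(is_valid_amount("1.234", "KWD"))   # three digits fit KWD
print(is_valid_amount("1.234", "EUR"))   # three digits do not fit EUR
```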


Regards,    Martin.

Received on Monday, 7 July 2003 11:42:55 UTC