Fwd: I18N Last call comments on Schema Part 2 from Martin J. Duerst on 2000-05-31 (www-xml-schema-comments@w3.org from April to June 2000)

From: Martin J. Duerst <duerst@w3.org>
Date: Wed, 31 May 2000 14:22:49 +0900
To: www-xml-schema-comments@w3.org
Message-Id: <4.2.0.58.J.20000531142142.03430280@sh.w3.mag.keio.ac.jp>
Forwarded, on request of C. M. Sperberg-McQueen.

>Date: Mon, 29 May 2000 18:57:09 +0900
>From: "Martin J. Duerst" <duerst@w3.org>
>Subject: I18N Last call comments on Schema Part 2
>
>Dear Schema WG,
>
>[This mail is crossposted to the I18N IG to allow for further discussion.
>Please feel free to forward these comments to another list, including
>a public list, but please make sure that you don't reveal the mail
>addresses of the various groups.]
>
>These are the last call comments on XML Schema Part 2: Datatypes
>from the I18N WG/IG.
>
>The comments are numbered by [n], but their order does not
>reflect their importance.
>
>
>[1] The definition of 'match' has been copied from XML 1.0. There
>     are proposals for clarifying XML 1.0. The Schema WG should
>     work together with the XML Core WG and the I18N IG to
>     make sure everything is in sync.
>
>[2] The spec says that length for strings is measured in terms of [Unicode]
>     codepoints. This is technically correct, but it should say
>     it's measured in terms of characters as used in the XML Recommendation
>     (production [2]).
>
>[3] In 2.4.2.12, it says 'For example, "20" is the hex encoding for the
>     US-ASCII space character'. It should say something like '"20" encodes
>     a byte value represented e.g. in C as 0x20, which may stand for the
>     space character if US-ASCII (or UTF-8) is used to encode it.'
>     But actually this is a bad example, because encoding text with
>     base64 is a bad idea and is against the spirit of XML.
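
A minimal Python sketch of the distinction drawn in [3]: the hex string "20" denotes a byte value, and only a choice of character encoding turns that byte into a character. The helper name decode_as and the use of the EBCDIC code page cp037 as a counterexample are assumptions made for illustration, not something taken from the spec.

```python
# "20" is the hex encoding of a single byte with value 0x20 (decimal 32).
raw = bytes.fromhex("20")


def decode_as(data: bytes, encoding: str) -> str:
    """Interpret the bytes under an explicitly chosen character encoding."""
    return data.decode(encoding)


as_ascii = decode_as(raw, "ascii")    # a space, because US-ASCII was chosen
as_utf8 = decode_as(raw, "utf-8")     # a space, because UTF-8 agrees with US-ASCII here
as_ebcdic = decode_as(raw, "cp037")   # not a space: EBCDIC puts the space at 0x40
```

The same two hex digits give a space only under some encodings, which is why the wording proposed in [3] speaks of a byte value first and of a character second.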
>
>[4] related to [3]: XML is based on Unicode and therefore makes it possible to
>     represent a huge range of characters. However, XML explicitly
>     excludes most control characters in the C0 range. There are fields
>     in databases and programming languages that allow and potentially
>     contain these characters. A user of XML and XML Schema has various
>     alternatives, all not very satisfactory:
>     1) Drop the forbidden characters
>     2) Using XML Schema 'binary' with an encoding: This does not
>        encode characters, but bytes, and therefore loses all i18n
>        features of XML. There is a serious danger that this is used
>        even when the data item in question or even the whole database
>        does not contain a single such character.
>     3) Invent a private convention
>     This is a serious problem, and should be duly addressed by
>     XML Schema.
>     [There is a related problem with respect to names (GIs in
>      SGML terminology), but this is more an XML 1.0 problem than
>      an XML Schema problem, and there is no danger of losing all
>      i18n information just because of a single character.]
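
To make the exclusion discussed in [4] concrete, here is a small Python sketch that checks a string against the Char production of XML 1.0 (production [2]). The function names are hypothetical; the ranges are those of XML 1.0.

```python
def is_xml_char(ch: str) -> bool:
    """True if ch is allowed by the XML 1.0 Char production (production [2])."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def forbidden_chars(text: str) -> list:
    """Return the characters of text that cannot appear in an XML document."""
    return [ch for ch in text if not is_xml_char(ch)]
```

A database field containing, say, U+0000 or U+001F would be reported here, and the user is then left with the three unsatisfactory alternatives listed above.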
>
>[5] related to [4]: 3.2.1 seems to allow all Unicode/ISO 10646 characters;
>     this is not true (see [4]).
>
>[6] 3.2.1: Expand 'UCS' to Universal Character Set.
>
>[7] Make sure that functionality for locale-independent representation
>     and locale-dependent information is clearly distinguished.
>     This is the only way to assure both appropriate localization
>     (which we consider to be very important) and worldwide data exchange.
>     The specification is rather close to half of this goal, namely to
>     provide locale-independent datatypes. Some serious improvements
>     however are still possible and necessary (see below).
>     It is clearly desirable that W3C also address locale-dependent
>     data representations. We think that these go beyond simple datatyping/
>     exchange issues and include:
>     - Conversion from locale-independent to locale-dependent representation
>     - Conversion from locale-dependent to locale-independent representation
>     - Association between locale-dependent and locale-independent information
>     - ...
>     These issues therefore have to be examined on a wider level
>     involving various groups such as the XML Schema WG, the XSL WG,
>     the CSS WG, the XForms WG, and so on, and this should be done as
>     soon as possible.
>
>     We would like to repeat that any mixup between locale-independent
>     and locale-dependent data representation will lead to confusion and
>     will hurt, and not benefit, internationalization and localization.
>     (This point is further addressed in detail in some of the points below:
>     [8], [9], [10]-[16], [20]).
>
>[8] Say explicitly in the specification and in the primer that the lexical
>     representations you provide for various datatypes (in particular
>     things such as date, numbers,...) are designed for locale-independent
>     data exchange, and that they are inappropriate for locale-dependent
>     data representation. In the primer, an example such as
>     <date value='2000-05-16'>Tuesday, 16th of May, 2000</date>
>     (or even just something like <date value='2000-05-16'>next Tuesday</date>)
>     with value defined as a date and the <date> content as string, would help.
>     Also, explicitly warn that where there is some similarity between
>     localized representations and the locale-independent representation,
>     this must not be exploited when presenting the data to a user,
>     and that similarities are due to
>     - Having to choose *some* kind of representation
>     - Making this representation somewhat manageable in raw text
>       for when raw text is needed (debugging, plain text editing,...)
>     and that the fact that some representations are more similar to
>     some locales than others is done reluctantly, and not explicitly to
>     disadvantage certain users. [Indeed, where possible, we would prefer
>     representations that avoid any similarity to any existing locale.]
>
>[9] As said above and explained below, addressing localized representations
>     as a whole is a huge problem. The one contribution that seems most
>     appropriate and relevant from XML Schema is to associate locale-
>     independent and locale-dependent representations. Taking the example
>     above, <date value='2000-05-16'>Tuesday, 16th of May, 2000</date>,
>     the association between the locale-independent 'value' and the
>     locale-dependent element content is implicit; XML Schema should
>     provide a way to make this association explicit. Including in the
>     association some way to indicate the local format used / the conversion
>     functions necessary seems also desirable, although we are not yet
>     aware of an interoperable way to do so.
>
>[10] Several datatypes have more than one lexical representation for
>     a single value. This gives the impression that these lexical
>     representations actually allow some kind of localization or
>     variation of representation. However, as explained above,
>     such an impression is a dangerous misunderstanding, and has
>     to be avoided at all costs.
>     We therefore strongly request that all duplicate lexical
>     representations be removed. The following points ([11]-[16], [20],
>     [22], [27]) give details for each affected datatype. For each datatype,
>     we indicate where duplicate representations exist, and
>     how they may be removed. Unless otherwise indicated, we do
>     not have any particular preferences of how to remove the
>     duplicates; we just explain one way to do so to allow you
>     to reuse the analysis we (mostly Mark Davis) have already done.
>     We would like to point out that reducing the lexical representations
>     to a single one for each value also makes using digital signatures
>     on such data a lot easier, and to a large extent and at very
>     little cost, avoids the creation of another WG and spec like
>     in the case of XML Canonicalization.
>
>[11] 3.2.2 'boolean': There are currently four lexical reps. for
>     two values. This has to be reduced to two lexical reps. The
>     I18N WG/IG here has a clear preference:
>     most desirable:   0/1
>     less desirable:   true/false
>     clearly absolutely undesirable: 0/1/true/false
>
>[12] 3.2.3.1 'float' allows multiple representations. This must be fixed,
>     e.g. as follows:
>
>     Float values have a single standard lexical representation consisting of a
>     mantissa, followed by the character "E" (upper case only), followed by an
>     exponent. The exponent must be an integer. The mantissa must be a decimal
>     number. The representations for exponent and mantissa must follow the
>     lexical rules for integer and decimal numbers discussed above[below?]. The
>     absolute value of the mantissa must be either zero, or greater than or
>     equal to 1 and less than 10. If the mantissa is zero, then the exponent must
>     be zero. For example:
>     Valid: "-1.23E5", "9.9999E14", "1.0000001E-14", "0E0", "1E0"
>     Invalid: "+1.23E5", "100000.0E3", "1.0E3", "1.0E0", "012.E3", "0E1"
>     [This leaves one issue open, namely the issue of too high precision.
>      One way to solve this is to define that the lexical rep. chosen is
>      the one with the shortest lexical rep of the mantissa that corresponds
>      to the desired value according to [Clinger/Gay], or if two lexical
>      reps with the same shortest mantissa correspond, then the closer
>      one should be chosen, and if both are equally close, then the
>      one with an even end digit is chosen. [This should cover all cases,
>      but there may be more accurate or more easy to calculate alternatives,
>      and this should be checked by experts.]]
>     [Some people may claim that e.g. the free choice of exponent or the
>      use of leading digits is necessary to be able to mark up existing
>      data; we would like to point out that if such claims should be made,
>      we would have to request that not only such variations, but also
>      other variations, e.g. due to the use of a different series of
>      digits (Arabic-Indic, Devanagari,... Thai,..., Tibetan,...,
>      ideographic,...) and so on be dealt with at the same level.]
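
The single-representation rule proposed in [12] can be sketched as a pattern check. This is only one reading of the rules above: the regular expression and the name is_canonical_float are assumptions, and the open precision issue (shortest mantissa) is not addressed.

```python
import re

# One reading of the proposal in [12]:
#   - zero is written exactly as "0E0"
#   - otherwise: optional "-", one non-zero digit, optional fraction without
#     trailing zeros, upper-case "E", and an integer exponent without "+"
#     or leading zeros
CANONICAL_FLOAT = re.compile(
    r"0E0"
    r"|-?[1-9](\.[0-9]*[1-9])?E(0|-?[1-9][0-9]*)"
)


def is_canonical_float(text: str) -> bool:
    """True if text is in the single lexical form proposed for 'float'."""
    return CANONICAL_FLOAT.fullmatch(text) is not None
```

Under this reading the five strings listed as valid are accepted and the six listed as invalid are rejected.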
>
>[13] 3.2.4.1 'double' allows multiple representations. This must be fixed.
>     The solution outlined in [12] can be applied.
>
>[14] 3.2.5.1 'decimal' allows multiple representations. This must be fixed,
>     e.g. as follows:
>
>     Decimal values have a single, unique, lexical representation. This consists
>     of a string of digits (x30 to x39) with a period (x2E) as a decimal
>     indicator (in accordance with the scale and precision facets), and a leading
>     minus sign (x2D) to indicate a negative number. The decimal indicator
>     must be omitted if there are no fraction digits. Leading and trailing zeros
>     are illegal, except for zero itself (which is written as "0"). For example:
>     Valid: "-1.23", "100000", "12678967.543233", "0"
>     Invalid: "+1.23", "100000.0", "12,678,967.543233",
>     "0.0", "012."
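
A comparable sketch for the unique 'decimal' form proposed in [14]. Two points are assumptions of this sketch, since the text above does not settle them: values between -1 and 1 keep a single "0" before the period (for example "0.5"), and "-0" is rejected. The name is_canonical_decimal is hypothetical.

```python
import re

# Optional "-", integer part without leading zeros, optional fraction
# without trailing zeros. The lookahead rejects "-0" (assumption).
CANONICAL_DECIMAL = re.compile(
    r"(?!-0$)-?(0|[1-9][0-9]*)(\.[0-9]*[1-9])?"
)


def is_canonical_decimal(text: str) -> bool:
    """True if text is in the single lexical form proposed for 'decimal'."""
    return CANONICAL_DECIMAL.fullmatch(text) is not None
```

The same check would apply unchanged to the types derived from 'decimal' if, as suggested in [15], they simply reuse its lexical representation.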
>
>[15] Lexical representation of derived datatypes: The lexical
>      representation of all datatypes derived (directly or indirectly)
>      from 'decimal' (13 types from 'integer' to 'positiveInteger')
>      must be changed to be unique. The easiest and most consistent way
>      to do this is to just specify for each datatype that the lexical
>      representation for all the values of the type is the same as for 'decimal'.
>      If you want to be specific, you can find some details at:
>      http://lists.w3.org/Archives/Member/w3c-i18n-wg/1999Nov/0007.html
>      (members only). In any case, disallowing a '+' (done on some types,
>      but not consistently) and disallowing leading zeroes should do
>      the job.
>
>[16] For elementary types, there may be a desire to allow whitespace
>      around the actual data. To be clear, the spec should explicitly
>      say that this is disallowed (except for cases where it has to
>      be allowed for XML/SGML conformance, i.e. ENTITY, ID,...).
>      Another way of expressing this comment is to say that the
>      spec should make clear for which datatypes CDATA attribute-value
>      normalization should be chosen, and for which datatypes not.
>
>[17] The time-related datatypes (timeDuration and recurringDuration
>     and derived datatypes) need to be redesigned to avoid a number
>     of serious problems. For details, please see points [18]-[25].
>
>[18] The specification assumes that usual arithmetic can be done
>     with TimePeriod, but due to the representation chosen, this
>     is not the case. For example, it is absolutely unclear which of
>     P3.01M or P90.5D is greater, or whether they are equal. There are
>     two ways to solve this, either to choose a different representation
>     or to remove orderedness and min/maxIn/Exclusive. The former
>     is clearly desirable because of additional reasons, please see [19].
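
The ordering problem in [18] can be shown with a few lines of Python. The sketch assumes that a month is converted to a fixed number of days, which is exactly the assumption the representation does not license; the function name is hypothetical.

```python
from decimal import Decimal


def compare_p3_01m_with_p90_5d(days_per_month: int) -> str:
    """Compare P3.01M with P90.5D under an assumed month length in days."""
    months_in_days = Decimal("3.01") * days_per_month
    days = Decimal("90.5")
    if months_in_days < days:
        return "<"
    if months_in_days > days:
        return ">"
    return "="


results = {n: compare_p3_01m_with_p90_5d(n) for n in (28, 30, 31)}
```

With 30-day months P3.01M is the shorter duration (90.30 days), with 31-day months it is the longer one (93.31 days), so no single order exists without fixing the month length.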
>
>[19] The use of culture-specific time length units is highly problematic.
>     This in particular applies to years and months in timeDuration.
>     Various calendars use different month and year lengths; the main
>     distinction being the one between lunar calendars and solar calendars.
>     The Islamic, Hebrew, and Chinese months and years, for example, are
>     all different from the corresponding western units.
>     A system either has to be able to represent these units in all
>     calendars (extremely difficult) or should be limited to representations
>     that are to an extremely high degree culturally neutral. In order
>     to deal with [18], too, we propose to do the latter.
>
>[20] Unique representation of timeDuration: There must be only one
>     lexical representation for each timeDuration. This can be achieved
>     as follows:
>     Based on the representation of ISO 8601, only PnDTnHnMnS is used
>     (i.e. no years or months). If any unit is zero, the number and the
>     letter after it are removed, except for the zero duration, which
>     is represented as P0D. If only Days are present, 'T' is omitted.
>     Overflows in lower units have to be converted to higher units
>     (i.e. PT24H -> P1D, PT60M -> PT1H, PT60S -> PT1M; except for
>     leap second cases). Decimal fractions are only allowed for
>     seconds, and do not allow trailing zeroes.
>     [A serious alternative to this would be to remove timeDuration
>      altogether.]
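
The normalization rules in [20] can be sketched as a function from a number of seconds to the single PnDTnHnMnS form. This is an illustration under stated assumptions: durations are non-negative, leap seconds are ignored, and the name canonical_duration is hypothetical.

```python
from decimal import Decimal


def canonical_duration(total_seconds: Decimal) -> str:
    """Render a non-negative duration in the unique form proposed in [20]."""
    if total_seconds < 0:
        raise ValueError("negative durations are outside this sketch")
    whole = int(total_seconds)            # whole seconds
    fraction = total_seconds - whole      # fractional part, seconds only
    # Carry overflows into higher units: 60S -> 1M, 60M -> 1H, 24H -> 1D.
    days, rest = divmod(whole, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)

    date_part = f"{days}D" if days else ""
    time_part = ""
    if hours:
        time_part += f"{hours}H"
    if minutes:
        time_part += f"{minutes}M"
    if secs or fraction:
        text = format(Decimal(secs) + fraction, "f")
        if "." in text:
            text = text.rstrip("0").rstrip(".")   # no trailing zeros
        time_part += f"{text}S"

    if not date_part and not time_part:
        return "P0D"                      # the zero duration
    # 'T' is omitted when only days are present.
    return "P" + date_part + ("T" + time_part if time_part else "")
```

For example, 86400 seconds come out as P1D rather than PT24H, and 3600 seconds as PT1H rather than PT60M.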
>
>[21] The problems with timeDuration ([18]-[20]) heavily affect
>     recurringDuration and all datatypes derived from it. In addition
>     to the arguments above, recurringDuration is clearly of very limited
>     use even for the areas of the world that use the Gregorian calendar
>     for all their activities. Being able to specify e.g. the 5th
>     of May every year is only of limited value; most events are
>     decided according to a much more complex pattern. The 3rd
>     Wednesday of each month, a certain date if it is not a Sunday,
>     otherwise the Monday after it, and so on, are easier examples,
>     and things can get more complex. With the current solution
>     only a small part of the actual requirements can be addressed.
>     Therefore, the datatype 'recurringDuration' must be removed.
>     Several derived datatypes will be removed as a consequence
>     (e.g. timePeriod, recurringDate, recurringDay, time,...).
>     [The only viable alternative to this is to work on a more
>      powerful representation that can address both various cultures
>      and more complicated rules.]
>
>[22] Having a datatype for timeInstant is clearly desirable. The current
>     derived type should be promoted to a base type. Ideally, the representation
>     should be based only on days (and seconds within the day) from an
>     arbitrary but clearly specified base time instant (this would greatly
>     simplify conversions to internal representation of all kinds of
>     OSs and libraries). If this is judged to be not readable enough
>     in plain text, the current scheme based on ISO 8601 may be
>     kept (but should be verified to be absolutely clean of double
>     lexical representations). Please note that while the representation
>     in this case would not be culturally neutral, each timeInstant
>     can with appropriate calculations be represented in a different
>     calendar without problems.
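
A sketch of the days-plus-seconds representation suggested in [22], using Python's datetime. The base instant 2000-01-01T00:00:00Z is an arbitrary choice made for this illustration, the function names are hypothetical, and Python's datetime does not model leap seconds.

```python
from datetime import datetime, timedelta, timezone

# Arbitrary but clearly specified base instant (an assumption of this sketch).
BASE = datetime(2000, 1, 1, tzinfo=timezone.utc)


def to_days_seconds(instant: datetime) -> tuple:
    """Express an aware datetime as (days, seconds within the day) from BASE."""
    delta = instant.astimezone(timezone.utc) - BASE
    return delta.days, delta.seconds


def from_days_seconds(days: int, seconds: int) -> datetime:
    """Convert (days, seconds) from BASE back to a UTC datetime."""
    return BASE + timedelta(days=days, seconds=seconds)


# The Date header of this message, 31 May 2000 14:22:49 +0900.
sent = datetime(2000, 5, 31, 14, 22, 49, tzinfo=timezone(timedelta(hours=9)))
pair = to_days_seconds(sent)
```

The pair converts back to the same instant, and presenting it in another calendar is then a matter of calculation on the receiving side.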
>
>[23] It may be reasonable to consider a datatype 'date', which
>     is related to timeInstant but most probably best defined as
>     a separate base type. 'month', 'year', and 'century' have
>     to be removed for the reasons given above. It may be worth
>     defining a 'composite' datatype 'actualTimePeriod', which
>     consists of a start timeInstant and an end timeInstant.
>     This would cover a lot more (and a lot more useful) cases
>     in a much more uniform manner than what is currently possible,
>     and could even replace 'date'.
>
>[24] ISO 8601 is based on the Gregorian calendar, but there
>     seems to be no indication as to whether this is applicable
>     before 1582, nor how exactly it would be applied. Also,
>     it is unclear how far into the future the Gregorian calendar
>     will be used without corrections. A representation purely
>     based on days and seconds would avoid these problems; if
>     this is not possible, then the spec needs some additional
>     explanations or references.
>
>[25] Several details in appendix D have to be fixed. It has to
>     be clear that leading zeroes for months and days are needed.
>     Hours obviously go from 0 to 23, minutes from 0 to 59.
>     Seconds indeed can go to 60 in the case of leap seconds,
>     but only in that case.
>
>[26] For international data interchange, a uniform way to transmit
>     measurements not only for time lengths and time instants,
>     but all kinds of other units, seems highly desirable. If
>     this cannot be provided in the first version of XML Schema,
>     it clearly should be taken up soon for the next version.
>
>[27] The lexical representation of 'hex' encoding (2.4.2.12)
>     must be changed to allow only one case (e.g. only upper
>     case) for hex digits.
>
>[28] String length: There should be a note saying that string
>     length as defined here does not always coincide with string
>     length as perceived by the user or with an actual amount
>     of storage units in some digital representation, and that
>     therefore care should be taken both when specifying some
>     bounds as well as when using these bounds to try to derive
>     some storage requirements.
>     [Although this is not an i18n issue, our group also found
>      the simultaneous availability of 'length', 'minLength',
>      and 'maxLength' highly confusing.]
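
The note requested in [28] could be backed by an example like the following Python sketch, which measures one string in three ways. The function name measure and the sample strings are chosen for illustration only.

```python
def measure(text: str) -> dict:
    """Count a string in code points, UTF-8 bytes, and UTF-16 code units."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


# "e" followed by COMBINING ACUTE ACCENT: a user perceives one character.
decomposed = measure("e\u0301")
# U+10400 lies outside the Basic Multilingual Plane.
supplementary = measure("\U00010400")
```

A length facet of 1 would accept the second string but not the first, although a user would call both a single character, and neither count predicts the storage needed.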
>
>[29] String ordering: This feature seems to be present for no
>     real use, and should be removed. User-oriented string
>     ordering is highly complex and locale-dependent, and is
>     dealt with in other standards (ISO/IEC 14651 and Unicode TR #10).
>     Locale-independent ordering only makes sense if it is usable
>     for something. This may be actually the case if it were
>     possible to specify that all subelements of a given element
>     have to appear in a given order (just to avoid variation).
>     If this is possible with XML Schema, the orderedness of
>     string may be kept. If not, orderedness as a facet should
>     be removed altogether.
>     In any case, the related facets min/maxIn/Exclusive must be
>     removed, because they never lead to any useful subset of
>     strings. (E.g. assume minInclusive='a' and maxExclusive='b'.
>     This makes sure the first letter is a lower case 'a', but
>     allows any letter whatsoever (from the whole Unicode repertoire)
>     after the 'a'. This is most probably not what a naive user
>     is expecting (but as good as we can get), and for an advanced
>     user, this (and many other useful things) are much easier specified
>     by patterns).
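
The example in [29] can be tried out directly. The sketch below assumes that string ordering is by code point, which is how Python compares str values; the function name is hypothetical.

```python
def in_range(value: str) -> bool:
    """minInclusive='a' and maxExclusive='b' under code-point ordering."""
    return "a" <= value < "b"


accepted = [s for s in ["a", "apple", "a\u65e5\u672c", "a\U0001F600", "b", "A", ""]
            if in_range(s)]
```

Everything that starts with a lower-case 'a' is accepted, whatever follows it, which is the behaviour described above as unlikely to match what a naive user expects.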
>
>[30] URI Reference: This definition must be changed to allow for
>     characters not allowed in URI References, in order to be in
>     accordance with the relevant section of the W3C Character Model
>     (http://www.w3.org/TR/charmod/#URIs) and all the W3C Recommendations
>     and upcoming Recommendations in accordance with it (HTML 4.0,
>     XML 1.0, RDF, XPointer, XLink,...).
>     [While at it, please also remove the definitions of 'absolute
>      uriReference' and 'relative uriReference' if you don't use it,
>      and make sure you mention that RFC 2396 has been updated by
>      RFC 2732: Format for Literal IPv6 Addresses in URL's
>      R. Hinden, B. Carpenter, L. Masinter, December 1999.
>      e.g. at http://www.ietf.org/rfc/rfc2732.txt]
>
>[31] 3.3.1 language: The 'LanguageID' production in XML 1.0 is too
>     narrow. It fits the currently allowed languageIDs of RFC 1766
>     tightly, but RFC 1766 is being upgraded (see
>http://search.ietf.org/internet-drafts/draft-alvestrand-lang-tags-v2-01.txt).
>     The I18N WG/IG are working together with the XML Core WG to
>     make sure XML can be adjusted appropriately, and that no
>     premature overly restrictive decisions are taken. The XML
>     Schema WG should work together with the above WG to coordinate
>     this issue.
>
>[32] The 'length/minLength/maxLength' facets on 'language' are
>     highly doubtful; they do not correspond to any useful
>     concepts in the value domain of this datatype.
>
>[33] It is unclear why certain datatypes are derived from
>     'string' (e.g. language, nmtoken, name, ncname), but not
>     others (e.g. ID, idref, entity, notation, qname).
>
>[34] Pattern combinations: Section 5.2.4 says that multiple patterns
>     in a derivation of a single type are combined as if they were
>     separate branches of a regular expression. Branches result
>     in an 'OR' combination, i.e. the actual string can conform
>     to either branch. It seems much better to change this to
>     an 'AND' condition, i.e. the actual string has to conform
>     to BOTH regular expressions. There are several reasons for
>     this:
>     - Restrictions on all kinds of facets, on the same derivation
>       or on subsequent derivations, can very generally be modeled
>       as AND conditions (i.e. for a derived simple type, all
>       conditions on that type and any base types apply simultaneously).
>       This makes it possible to deal uniformly with all such restrictions,
>       and to avoid special cases. E.g. instead of saying that
>       having both a minInclusive and a minExclusive on the same
>       derivation is illegal, one of them just becomes redundant.
>     - The regular expression syntax does not allow AND conditions.
>       However, such conditions are frequently used in programming.
>       In programming, they don't have to be part of the regexp
>       syntax, because they can be modelled as two subsequent checks.
>       In XML Schema, there is no device for subsequent checks.
>     - AND conditions on regular expressions are in particular
>       important for i18n (see point [35]).
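
The difference between the 'OR' and 'AND' combinations discussed in [34] can be sketched as two subsequent checks in Python. Both patterns are invented for this illustration: one restricts the repertoire, the other the structure of the string.

```python
import re

REPERTOIRE = r"[A-Za-z0-9-]*"   # hypothetical legacy-encoding restriction
STRUCTURE = r".{3}-.{4}"        # hypothetical structural restriction
PATTERNS = [REPERTOIRE, STRUCTURE]


def matches_any(value: str) -> bool:
    """'OR' combination: the value conforms to at least one pattern."""
    return any(re.fullmatch(p, value) for p in PATTERNS)


def matches_all(value: str) -> bool:
    """'AND' combination: the value conforms to every pattern."""
    return all(re.fullmatch(p, value) for p in PATTERNS)
```

The string "abc-12\u00e94" has the right structure but leaves the repertoire; it passes the 'OR' combination and fails the 'AND' combination, which is the case of interest for i18n.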
>
>[35] It has to be possible to specify various restrictions on
>     a string simultaneously. In particular, we expect that
>     combining a restriction regarding the character repertoire
>     (e.g. to deal with encoding restrictions in legacy systems)
>     and a restriction on the structure of a string will be
>     quite frequent. See also point [36].
>
>[36] Some of the regular expressions needed will be quite long.
>     As an example, the regular expression to limit the repertoire
>     to those characters expressible in the traditional Japanese
>     encodings results in a character class with about 6000
>     characters. To make this reasonably possible, we suggest:
>     - To allow XML spaces in regular expressions (including
>       character classes) in the same way they are allowed in the
>       newer Perl versions. This will lead to greater readability
>       for many other applications, too.
>     - To allow defining character classes or regular expressions
>       in general as objects of their own that can be referenced
>       either in a 'pattern' element or directly in a regular
>       expression or character class.
>     - If the point just above is not possible, in any case
>       to make sure that patterns are combined by 'AND' in
>       the derivation hierarchy.
>
>[37] In appendix E, remove the 'CS   Surrogate' character property.
>     Surrogates do not appear on the level that XML Schema is working.
>
>[38] For character sequence '\w', please make sure that the character
>     class does not end at &#xFFFF but at &#x10FFFF, and that this
>     is consistent in the primer.
>
>[39] Upgrade the reference to ISO 10646 to the year 2000 version,
>     removing the reference to the amendments.
>
>[40] Upgrade the reference to Unicode to version 3.0.
>
>[41] Make sure whether/that block escapes are normative (i.e.
>     change the various 'may' in their definition to something more
>     appropriate).
>
>[42] Try to give less US-centric examples.
>
>[43] Make sure that the character property categories and block
>     escape classes for Unicode characters are not bound to a single
>     version of Unicode. This would create an update problem as soon
>     as Unicode is updated, which is sure to happen rather soon.
>     XML Schema should be independent of such upgrades, otherwise
>     this part of it will soon be less and less useful. The pointer
>     to version 3.0.0 of the Unicode Database should be changed to
>     a generic pointer to the latest version.
>
>[44] The current regular expression syntax does not take into account
>     combinations of base characters and combining marks easily.
>     This can be inconvenient for certain scripts, and will become
>     more and more inappropriate because the encoding of precomposed
>     characters has been stopped. There should be a note pointing
>     out this problem, and the XML Schema WG should have a plan of
>     how and when to address this (i.e. the upgrade to the next level
>     of regular expressions according to Unicode TR #18).
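
The inconvenience described in [44] shows up with any regular expression engine that works on single code points. The Python sketch below uses the standard re and unicodedata modules; the sample word and pattern are chosen for illustration, and normalization is shown only as a partial workaround, since it helps only where a precomposed character exists.

```python
import re
import unicodedata

PATTERN = r"caf."                 # "." matches exactly one code point

precomposed = "caf\u00e9"         # e with acute as one code point
decomposed = "cafe\u0301"         # e followed by COMBINING ACUTE ACCENT

match_precomposed = re.fullmatch(PATTERN, precomposed) is not None
match_decomposed = re.fullmatch(PATTERN, decomposed) is not None

# Normalizing to NFC composes the pair, so the same pattern matches again.
normalized = unicodedata.normalize("NFC", decomposed)
match_normalized = re.fullmatch(PATTERN, normalized) is not None
```

The two inputs look identical to a user, yet only the precomposed one matches; for sequences with no precomposed equivalent, no normalization form removes the problem.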
>
>[45] Several examples could be less US-centric. In particular, the
>     example in 5.2.11 should be changed from Fahrenheit to Celsius.
>
>[46] In appendix A, all prose should fall under xml:lang='en'.
>
>[47] There are a number of inconsistencies and typos, but given
>     the large number of needs for changes as discussed above, it
>     seems more appropriate to check and report such problems on
>     a second reading after an update.
>
>Regards,   Martin.
Received on Wednesday, 31 May 2000 01:53:50 UTC