This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 13712 - Line ends within xs:token
Summary: Line ends within xs:token
Status: RESOLVED WORKSFORME
Alias: None
Product: XML Schema
Classification: Unclassified
Component: Datatypes: XSD Part 2
Version: 1.1 only
Hardware: PC Linux
Importance: P2 normal
Target Milestone: ---
Assignee: David Ezell
QA Contact: XML Schema comments list
URL: http://www.w3.org/TR/xmlschema11-2/
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-08-09 13:04 UTC by saasha
Modified: 2011-08-22 08:18 UTC

Description saasha 2011-08-09 13:04:48 UTC
Hello!

Reading http://www.w3.org/TR/xmlschema11-2/#token with great interest, I noticed the following passage:

> The value space [as well as the] lexical space of token is the set of strings that do not contain the carriage return (#xD), line feed (#xA) nor tab (#x9) characters, that have no leading or trailing spaces (#x20) and that have no internal sequences of two or more spaces.

In other words, the characters #x85 (NEL) and #x2028 (line separator) seem to be allowed within the xs:token value space. According to http://www.w3.org/TR/xml11/#sec-line-ends, those two characters are considered line ends.
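
The observation above can be checked with a minimal Python sketch. It encodes only the constraints stated in the quoted definition; the function name is hypothetical, and the sketch ignores any other restriction (such as which characters XML itself permits).

```python
# Characters that the quoted xs:token definition excludes outright.
FORBIDDEN = {"\r", "\n", "\t"}


def is_xs_token(s: str) -> bool:
    """Return True if s meets the quoted xs:token constraints (sketch only)."""
    if any(ch in FORBIDDEN for ch in s):
        return False
    # No leading or trailing space (#x20).
    if s.startswith(" ") or s.endswith(" "):
        return False
    # No internal run of two or more spaces.
    return "  " not in s


# NEL and the line separator are not named in the definition, so they pass.
print(is_xs_token("a\u0085b"), is_xs_token("a\u2028b"), is_xs_token("a\nb"))
```

Under these assumptions, only the line feed case is rejected, which is the gap the report is asking about.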

Furthermore, at
http://www.w3.org/TR/xmlschema11-2/#status one can read that

> Support for XML 1.1 has been added. It is now implementation defined whether datatypes dependent on definitions in [XML] and [Namespaces in XML] use the definitions as found in version 1.1 or version 1.0 of those specifications.

I therefore think a clarification could be added to explain whether those two characters are allowed within an xs:token at all and, if so, whether their status is "implementation defined".

Clarifying the status of #x2029 (paragraph separator) would also be welcome. These clarifications would also be welcome when it comes to xs:normalizedString at
http://www.w3.org/TR/xmlschema11-2/#normalizedString

Regards!

Saaha,
Comment 1 Michael Kay 2011-08-09 13:48:05 UTC
XML 1.1 classifies these as line ending characters (which means they are normalized to newlines by the XML parser). This means that they will not normally be seen by a schema validator, unless they were escaped as numeric character references. If they are escaped then they can appear in an NMTOKEN (they are not classed as whitespace characters under either XML 1.0 or XML 1.1).

It seems to me that this is all fairly clear from a reading of the two specs and I can't see that any further clarification is required, unless you are looking for an affirmation that the specification actually means what it says.

(Personal response)
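
The behaviour described in comment 1 can be illustrated with a rough Python sketch of XML 1.1 line-end handling (section 2.11 of that spec). This is not a parser; the function name is hypothetical, and the sketch only shows why a literal NEL or line separator is replaced before validation while a numeric character reference is left alone at this stage.

```python
import re

# XML 1.1 section 2.11: the pairs #xD #xA and #xD #x85, and any remaining
# lone #xD, #x85 or #x2028, are each translated to a single #xA.
# The two-character alternatives come first so they are matched as pairs.
_LINE_END = re.compile("\r\n|\r\x85|\r|\x85|\u2028")


def normalize_line_ends_xml11(text: str) -> str:
    """Apply XML 1.1 end-of-line normalization to raw document text."""
    return _LINE_END.sub("\n", text)


# A literal line separator becomes a line feed ...
print(repr(normalize_line_ends_xml11("a\u2028b")))
# ... but a character reference is plain ASCII here and is not touched.
print(repr(normalize_line_ends_xml11("a&#x2028;b")))
```

Note that U+2029 (paragraph separator) is not in the pattern, since XML 1.1 does not list it as a line end.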
Comment 2 David Ezell 2011-08-12 16:17:08 UTC
The WG discussed this bug, and decided to close it as WORKSFORME because:

1) XML Core has left these characters out for reasons that don't clearly align with the needs of XSD processing;
2) we align with the XML 1.0 spec here; and
3) introducing more magic could have unintended consequences.

We hope you understand our decision, and we really appreciate your input on this matter.
Comment 3 saasha 2011-08-21 21:57:52 UTC
(In reply to comment #2)

Hello!

In light of Unicode's recommendations (2011) explaining that:

"U+2029 paragraph separator (PS) and U+2028 line separator (LS). [...] should be used wherever the desired function is unambiguous."

http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf (page 150)

one may wonder what happens if a system (of any kind: operating system, DBMS, etc.) using XML begins to apply Unicode's recommendations.

Being aware that

"Conforming implementations of this specification may provide either the 1.1-based datatypes or the 1.0-based datatypes, or both. If both are supported, the choice of which datatypes to use in a particular assessment episode should be under user control."

http://www.w3.org/TR/xmlschema11-1/#intro-relatedWork

and that according to

http://www.w3.org/TR/xml11/#sec-line-ends and
http://www.w3.org/TR/xml/#sec-line-ends the character U+2029 paragraph separator (PS) may be present within XML data, including within an xs:token (both in XML 1.0-based and XML 1.1-based contexts), I will try to formulate three possibilities for an addition to the specification of XML Schema 1.1 (part 2).

(1) One (in my opinion acceptable) possibility would be to add two new datatypes (xs:paragraph and xs:line) to XML Schema 1.1 for portability. Keeping xs:token unchanged would ensure backward compatibility. These two additions would be:

(1a) The datatype xs:paragraph could be defined as an xs:token containing no U+2029 (paragraph separator) and (in XML 1.0-based contexts) no U+0085 (NEL) either. In an XML 1.1-based context, no U+0085 (NEL) would be present anyway.

(1b) Within an XML 1.0-based context: The datatype xs:line could be defined as an xs:paragraph containing no U+2028 (line separator).

(2) One other (in my opinion problematic) possibility would be instead to redefine xs:token to take into account U+2029 and U+2028. This would compromise backward compatibility, though.

(3) A short and honest, but in my opinion not really satisfying, possibility would be to add a note clarifying that: "Neither xs:token nor any other XML Schema 1.1 datatype supports unambiguous use of U+2029 paragraph separator and U+2028 line separator as recommended by Unicode."
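
To make possibility (1) concrete, here is a minimal Python sketch of the two proposed types as predicates. The names xs:paragraph and xs:line exist only in this proposal, and the function names are hypothetical. As a simplification, NEL is rejected unconditionally rather than only in XML 1.0-based contexts.

```python
PS = "\u2029"   # paragraph separator
LS = "\u2028"   # line separator
NEL = "\u0085"  # next line


def is_xs_token(s: str) -> bool:
    """Constraints from the xs:token definition quoted in the description."""
    if any(ch in "\r\n\t" for ch in s):
        return False
    return not (s.startswith(" ") or s.endswith(" ") or "  " in s)


def is_paragraph(s: str) -> bool:
    """Proposed xs:paragraph, per (1a): a token with no PS and no NEL."""
    return is_xs_token(s) and PS not in s and NEL not in s


def is_line(s: str) -> bool:
    """Proposed xs:line, per (1b): a paragraph with no LS."""
    return is_paragraph(s) and LS not in s


# Two lines joined by LS form one paragraph but not one line.
print(is_paragraph("one\u2028two"), is_line("one\u2028two"))
```

Each predicate narrows the previous one, mirroring the derivation-by-restriction chain the proposal describes.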

Regards!

Saaha,
Comment 4 Michael Kay 2011-08-22 08:18:28 UTC
In XML-based systems I think it is likely that people will delimit lines and paragraphs using XML markup rather than Unicode delimiter characters. However, if people want to use Unicode delimiter characters for the purpose, they are welcome to define appropriate data types that reflect this usage. The derived types in XSD such as xs:token should be thought of as a "starter set"; there is no intention that they should meet all possible requirements, and the fact that for some particular requirement it is necessary to define a user-defined type is not in any way a defect in the specification.

xs:token actually has a much bigger problem which you don't mention - it is misnamed. Its value space is not a single token, but a space-separated sequence of tokens. There's no built-in data type that conveniently represents a single token; but users can easily define their own, and commonly do so, so this isn't a big problem in practice.

(Personal response)

(If you feel the WG needs to look at this again, please reopen the bug. Please bear in mind that XSD 1.1 is now very close to becoming a Recommendation, which means that the WG will only make changes if the spec is broken: the time for adding things that are "good ideas" is long past, however good the ideas.)
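
Comment 4 mentions that users commonly define their own single-token type. As an illustration only, the Python sketch below checks the kind of constraint such a type might impose: a non-empty string with none of the four XML whitespace characters. The function name and the exact rule are assumptions, not anything the thread or the spec prescribes.

```python
import re

# An explicit character class is used instead of Python's \S, because
# Python's Unicode-aware \s also treats U+0085, U+2028 and U+2029 as
# whitespace, which is broader than the four XML whitespace characters.
_SINGLE_TOKEN = re.compile(r"[^ \t\n\r]+")


def is_single_token(s: str) -> bool:
    """Return True if s is non-empty and contains no XML whitespace."""
    return _SINGLE_TOKEN.fullmatch(s) is not None


print(is_single_token("abc"), is_single_token("a b"))
```

With this rule, a string containing U+2028 still counts as a single token, so a stricter user-defined type would need to exclude the Unicode separators explicitly.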