28024 – Does "[Unicode] characters" EQUAL "Char production in [XML]?

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	CLOSED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Data Model 3.1
Version:	Candidate Recommendation
Hardware:	PC Linux

Importance:	P2 normal
Target Milestone:	---
Assignee:	Norman Walsh
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2015-02-13 22:45 UTC by Patrick Durusau
Modified:	2015-03-16 23:55 UTC
CC List:	4 users

Description Patrick Durusau 2015-02-13 22:45:57 UTC

One of the problems caused by not referencing definitions rather than repeating them is the creation of conflicting definitions. For example:

XDM 3.1 reads at 2.7.3 XML and XSD Versions:

[Definition: A string is a value in the value space of the xs:string data type; equivalently, it is a sequence of characters.]

[Definition: A character is an instance of the Char production in [XML]. It is recommended that the implementation use the latest definition, currently XML 1.1 Second Edition.]

And yet, XPath 3.1 reads at 2 Basics:

The basic building block of XPath 3.1 is the expression, which is a string of [Unicode] characters; the version of Unicode to be used is implementation-defined.

"[Unicode] characters" does NOT EQUAL "Char production in [XML]". So, as a reader of the specification, which is it? Can any Unicode character appear in an expression, or is XPath 3.1 limited to the Char production?

*****

The better course is to have numbered definitions (I would suggest starting with the data model, but that may be a personal preference) that are then cited by other parts of the same group of standards in their own numbered definitions. Having all those definitions in separate sections makes comparison of the definitions trivial.

Thus, the data model could say:

N Character: A character is an instance of the Char production in [XML]. It is recommended that the implementation use the latest definition, currently XML 1.1 Second Edition.

N String: A string is a value in the value space of the xs:string data type; equivalently, it is a sequence of characters.

(I would insert a full location for the references. N = subsection number and header.)

And then XPath 3.1 says:

N Character: As defined by XDM 3.1 N.

N String: As defined by XDM 3.1 N.

Not only does it save space, it also binds the documents together.

Not to mention making it possible to substitute the definition for every occurrence of the term in the text, which should help with proofing.

I marked this for the data model but obviously it applies to all the drafts under review.

Comment 1 Michael Dyck 2015-02-14 06:11:00 UTC

(In reply to Patrick Durusau from comment #0)
> One of the problems caused by not referencing definitions rather than
> repeating them is the creation of conflicting definitions. For example:
> 
> XDM 3.1 reads at 2.7.3 XML and XSD Versions:
> 
> [Definition: A string is a value in the value space of the xs:string data
> type; equivalently, it is a sequence of characters.]
> 
> [Definition: A character is an instance of the Char production in [XML]. It
> is recommended that the implementation use the latest definition, currently
> XML 1.1 Second Edition.]
> 
> And yet, XPath 3.1 reads at 2 Basics:
> 
> The basic building block of XPath 3.1 is the expression, which is a string
> of [Unicode] characters; the version of Unicode to be used is
> implementation-defined.

The latter is not a definition of "string", since there is no "[Definition: ]" markup, and the word "string" isn't even bolded, so this certainly isn't a case of "conflicting definitions".

Moreover, it isn't a case of a non-definition disagreeing with a definition, because they're not talking about the same thing. The definition in XDM is talking about values in an abstract value space, whereas the statement in Basics is talking about (concrete) program texts.

Comment 2 Patrick Durusau 2015-02-14 16:26:27 UTC

Thanks! You make my case better than I could. If "string" is being defined and then used differently in the case I cited, it is even more of a problem. 

Or are you saying that sometimes XPath 3.1 relies on XDM definitions and at other times it doesn't? (a different problem than the one I posed)

"string" cannot (should not) be defined to mean one thing and then used to mean a variety of other things. In general that is why standards have definitions, to fix the meaning of terms. 

Perhaps a definition that conflicts with usage would have been more accurate.

Comment 3 Michael Dyck 2015-02-14 23:52:45 UTC

Okay, the XDM spec gives a technical definition for the word "string", and an introductory section of the XPath spec uses the word "string" in one of its common English senses while talking in a very general way about program texts.
If this concerns you, you could simply suggest that a different word be used in the latter context. Personally, I don't think this example warrants anything more involved than that.

(Of course, I'm not speaking for the Working Group, so it might decide that there is indeed a larger problem needing a larger solution.)

Comment 4 Michael Kay 2015-02-15 09:11:25 UTC

Actually, an XPath expression is a string of unicode characters that conforms to the XPath grammar, and it so happens that the set of characters allowed by the grammar is the same as the set of characters allowed by the xs:string data type, which is the same as the set of characters that can appear in XML documents. But the relationship is extensional rather than intensional. I don't think it would be helpful to anyone to describe an XPath expression as a value in the value space of xs:string. The word "string" here is used in its everyday sense, and this can be inferred from the absence of a link to any definition of "string" as a term of art.

Comment 5 Patrick Durusau 2015-02-20 15:23:58 UTC

I am puzzled by:

"The word "string" here is used in its everyday sense, and this can be inferred from the absence of a link to any definition of "string" as a term of art."

I say that because under 3 Data Model Construction I find:

"The data model supports some kinds of values that are not supported by [Infoset]."

Infoset with a hyperlink and thus a term of art but I also find under 6.2.3 Construction from an Infoset:

"schema-type

    All Element Nodes constructed from an infoset have the type xs:untyped."

Where infoset appears without a hyperlink. Is that a different definition of infoset than [infoset] with a hyperlink?

I ask because the distinction of hyperlinked terms and non-hyperlinked versions of the same term isn't stated in the draft. Moreover, I don't think non-English speakers are likely to divine that two different meanings are intended by hyperlinking, bolding, etc. 

As I noted with RFC2119, the difference between UPPERCASE and lowercase is an explicit rule for use of that standard. Personally I would call it out when invoking RFC2119 as a reminder to readers.

Comment 6 Michael Kay 2015-02-20 15:39:38 UTC

You write:

<quote>
All Element Nodes constructed from an infoset have the type xs:untyped."

Where infoset appears without a hyperlink. Is that a different definition of
infoset than [infoset] with a hyperlink?
</quote>

This seems a bit disingenuous. Where the reference is hyperlinked, we guide readers to a definition of the term. Where it is not hyperlinked, we expect them to be able to read English, and to infer the meaning of a term from their knowledge of the English language, the context in which the term appears, and the terminology of the technical area with which the spec is concerned. If a term has multiple meanings and we don't highlight one of them by underlining, we expect the reader to work it out. That's what English is like. If this leaves sentences that leave the semantics of the specification ambiguous, please point them out and we will try to fix them. But we aren't going to switch at this stage from writing in English to writing in formal mathematical notation.

Comment 7 Patrick Durusau 2015-02-20 18:08:51 UTC

I had no intention of being "a bit disingenuous" with my comment about hyperlinking.

You raised the distinction between hyperlinked terms and not hyperlinked text as though multiple meanings of a term are so signaled in the document.

First, the document makes no mention of that practice.

Secondly, it isn't necessary to resort to formal notation to define "string" (or any other term) and then to use it consistently throughout a document. 

It is using a term relying on the "understanding" of a reader that causes the difficulty. Either the document states what it means, consistently, or it doesn't. Opening up the "understanding" of the reader puts non-native English speakers at a clear disadvantage. 

I don't understand the difficulty with consistent usage of terminology. I assume that every time infoset is used, it is the XML infoset that is meant. But that isn't the case for string.

Comment 8 Michael Kay 2015-02-20 18:30:15 UTC

>You raised the distinction

No, Michael Dyck did.

I think the point about the particular sentence you're complaining about is that the precise meaning doesn't actually matter. It's introductory in nature. I think it's pretty clear to any intelligent reader that an expression is something that matches a relevant production in the XPath grammar, and the XPath grammar is unambiguous about what characters are allowed and where. The spec could be improved, of course, but it's not wrong and it's not misleading.

Comment 9 Patrick Durusau 2015-02-20 19:44:53 UTC

Apologies for the confusion on names.

Relying on an "intelligent reader" for editing decisions leaves me with no response.

Comment 10 C. M. Sperberg-McQueen 2015-03-03 19:28:53 UTC

[Speaking for myself and not on behalf of anyone else]

Having read this bug report and the ensuing discussion more than once, I remain puzzled. The bug report starts by talking about "conflicting definitions", but I don't see any conflicts identified.

The set of characters identified by the Char production in the XML spec is a subset of the set of Unicode characters (or it was the last time I looked). So it is necessarily true that any sequence of instances of the Char production in [XML] is (by definition) a sequence of Unicode characters. The sentence quoted from XPath 3.1, section 2 Basics, is identifying a property of XPath 3.1 expressions which some people find important: they are strings of characters.

The basic building block of XPath 3.1 is the expression, which is a
string of [Unicode] characters; the version of Unicode to be used is
implementation-defined.

It is not a definition of "string" or "expression" or "character", and it does not say that the set of expressions is the same as the set of strings of Unicode characters, only that every expression is a string of Unicode characters. There are plenty of strings of Unicode characters which are not XPath expressions -- strings including Unicode characters that don't match the XML Char production among them. Strings beginning with right-parenthesis or ending with left-parenthesis are also in that set. The sentence quoted says nothing that contradicts any of these straightforward points.
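
The subset relationship described above can be made concrete with a small check. The following is only an illustrative sketch in Python, not text from any of the specifications: the range tables transcribe the Char productions of XML 1.0 and XML 1.1, while the names XML10_CHAR_RANGES, XML11_CHAR_RANGES, is_xml_char and all_xml_chars are hypothetical helpers invented for this example.

```python
# XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML10_CHAR_RANGES = [
    (0x9, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]

# XML 1.1 Char production:
#   [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML11_CHAR_RANGES = [
    (0x1, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]


def is_xml_char(ch, ranges=XML11_CHAR_RANGES):
    """Return True if the single code point `ch` matches the Char production."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


def all_xml_chars(text, ranges=XML11_CHAR_RANGES):
    """Return True if every code point in `text` matches the Char production."""
    return all(is_xml_char(ch, ranges) for ch in text)


# Every code point in this sample expression matches Char.
print(all_xml_chars("//para[@id='x']"))
# U+0000 and U+FFFE are Unicode code points that Char excludes in both versions.
print(is_xml_char("\x00"), is_xml_char("\ufffe"))
# U+0001 is allowed by the XML 1.1 production but not by XML 1.0.
print(is_xml_char("\x01"), is_xml_char("\x01", XML10_CHAR_RANGES))
```

Passing this check says nothing about whether a string is a valid XPath expression; that still depends on the XPath grammar, as noted above.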

It is quite true that if a reader assumes that any bolded string in the spec marks a definition, and infers from the bolding of the word 'expression' that this is intended as a definition, then the reader is apt to find the text more than a bit confusing. The problem in this case, however, is that the assumption does not hold.

If I knew a good way to make readers stop assuming things that aren't true, I'd be a happier man.

The specs might indeed be easier to read, in some ways, and for some people, if they had numbered definitions, or an alphabetical list of definitions sequestered in a terminology section. But there is ample anecdotal evidence that many people find that format off-putting and perhaps a bit confusing; the preference of many W3C working groups for embedding definitions in the exposition instead of sequestering them seems to suggest that some of those people inhabit W3C WGs. (And I would urge caution before replacing occurrences of terms with their definitions -- that will only work when the definiens has the same part of speech as the definiendum, and that is another kind of formality that apparently strikes many people involved in spec development as artificial and awkward, and not to be enforced.)

Comment 11 Andrew Coleman 2015-03-04 14:30:12 UTC

At the teleconference on 2015-03-03, the Working Groups agreed to fix this bug by adding a definition of 'codepoint' to XDM, removing the now-redundant definitions in F and O, XQuery, and any others that have definitions of 'string', 'character' and 'codepoint', and ensuring that all specs point to XDM as having the normative definitions of 'character', 'string', and 'codepoint'.

Many thanks for taking the time to review these specifications.

Comment 12 Norman Walsh 2015-03-05 15:52:21 UTC

Fixed:

[Definition: A string is a sequence of zero or more characters, or equivalently, a value in the value space of the xs:string data type.]

[Definition: A character is an instance of the Char production in [XML].]

[Definition: A codepoint is a non-negative integer assigned to a character by the Unicode consortium, or reserved for future assignment to a character.]
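
To illustrate how the three definitions relate, here is a rough Python sketch that maps between a string and its codepoints. It is an assumption-laden illustration rather than part of the fix: the helper names string_to_codepoints and codepoints_to_string are hypothetical Python functions, chosen to echo the F and O functions fn:string-to-codepoints and fn:codepoints-to-string, and a Python str can hold code points outside the Char production, so it is only an approximation of an XDM string.

```python
def string_to_codepoints(text):
    """Return the codepoints (non-negative integers) of the characters in `text`."""
    return [ord(ch) for ch in text]


def codepoints_to_string(codepoints):
    """Build a string from a sequence of codepoints."""
    return "".join(chr(cp) for cp in codepoints)


# Escapes are used so the example does not depend on how the source file
# encodes or normalizes the accented letters.
name = "Th\u00e9r\u00e8se"
cps = string_to_codepoints(name)
print(cps)  # [84, 104, 233, 114, 232, 115, 101]
print(codepoints_to_string(cps) == name)  # True
# A string is a sequence of zero or more characters, so the empty case is valid.
print(string_to_codepoints(""))  # []
```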