3838 – fn:codepoints-to-string should allow any infoset character

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	CLOSED LATER

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Functions and Operators 1.0
Version:	Candidate Recommendation
Hardware:	PC Linux

Importance:	P2 enhancement
Target Milestone:	---
Assignee:	Ashok Malhotra
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2006-10-15 20:08 UTC by Per Bothner
Modified:	2006-10-19 00:29 UTC
CC List:	0 users

Description Per Bothner 2006-10-15 20:08:31 UTC

The specification of 7.2.1 fn:codepoints-to-string says:

  If any of the code points in $arg is not a legal XML character, an error is raised.

Why?  What is the rationale for this?  I thought XQuery was supposed to be useful not *only* for XML files, but more generally for datasets compatible with XML infosets, which are not always XML files.

Furthermore, there is a "Text" output method.  One might want to emit text files which are not always XML files.

The infoset specification says about characters:
  [character code] The ISO 10646 character code (in the range 0 to #x10FFFF, though not every value in this range is a legal XML character code) of the character.

codepoints-to-string should allow all Unicode characters, possibly excepting surrogates.  (Though allowing a pair of surrogate characters might be useful too.)

See also 3776, relating to tests in the testsuite.
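
A minimal Python sketch of the behaviour under discussion, not taken from the specification or from any implementation. `is_xml10_char` follows the Char production of XML 1.0; `codepoints_to_string` and its `strict` flag are hypothetical names used only for illustration. With `strict=True` it raises on any code point that is not a legal XML 1.0 character, as the quoted rule requires; with `strict=False` it accepts every Unicode code point except surrogates, which is roughly the relaxation proposed above.

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_surrogate(cp: int) -> bool:
    """Return True for code points in the UTF-16 surrogate block."""
    return 0xD800 <= cp <= 0xDFFF


def codepoints_to_string(codepoints, strict=True):
    """Hypothetical model of fn:codepoints-to-string.

    strict=True  : reject anything that is not an XML 1.0 character.
    strict=False : reject only out-of-range values and surrogates.
    """
    chars = []
    for cp in codepoints:
        if not 0 <= cp <= 0x10FFFF or is_surrogate(cp):
            raise ValueError(f"not a Unicode scalar value: {cp:#x}")
        if strict and not is_xml10_char(cp):
            raise ValueError(f"not a legal XML 1.0 character: {cp:#x}")
        chars.append(chr(cp))
    return "".join(chars)
```

Under these assumptions, `codepoints_to_string([72, 105])` succeeds in both modes, while `codepoints_to_string([0x1])` raises in strict mode and returns a one-character string in relaxed mode.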

Comment 1 Michael Kay 2006-10-15 22:14:12 UTC

There are of course use cases for such an enhancement, but this is not the time to be considering enhancements.

There are also considerable technical complications in allowing the string data type to use a wider character set than XML permits. For example, it would require a careful look at the rules for regular expressions. There's also a potentially significant performance penalty if characters in a string have to be checked for XML-validity at the time text or attribute nodes are constructed, rather than at the time the string is constructed.

Apart from that, the XML working group chose consciously to disallow certain characters, and we should respect that decision unless there are very compelling reasons. Sometimes it's more important to be consistent than to be right. The XPath data model should have XML as its foundation.

Comment 2 Per Bothner 2006-10-16 01:02:48 UTC

(In reply to comment #1)
> There are of course use cases for such an enhancement, but this is not the time
> to be considering enhancements.

I can certainly understand that.

I might suggest making it implementation-defined what happens if a codepoint is not an XML character, rather than requiring an error to be raised, but I understand there would be reluctance even for that.

> There are also considerable technical complications in allowing the string data
> type to use a wider character set than XML permits. For example, it would
> require a careful look at the rules for regular expressions.

I can see one would want to read through the rules to avoid any inconsistencies, but I can't imagine any fundamental problems.  Java (and Perl) regular expressions obviously aren't restricted to XML characters.
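
As a rough illustration of that point, using Python's `re` module as a stand-in for the Java and Perl engines mentioned above (an assumption made for this sketch), a general-purpose regular-expression engine matches characters outside the XML 1.0 range without any special handling. The pattern name and the sample string are invented for the example.

```python
import re

# C0 control characters that XML 1.0 does not allow: everything below
# U+0020 except tab (U+0009), line feed (U+000A) and carriage return (U+000D).
NON_XML_CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

# Sample string containing U+0001 and U+0000 alongside a legal tab.
sample = "a\x01b\x00c\td"

# The engine finds and replaces the non-XML characters like any others.
found = NON_XML_CONTROLS.findall(sample)
cleaned = NON_XML_CONTROLS.sub("?", sample)
```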

> There's also a
> potentially significant performance penalty if characters in a string have to
> be checked for XML-validity at the time text or attribute nodes are
> constructed, rather than at the time the string is constructed.

Alternatively, you do the checking for XML-validity at serialization time.  You have to do that anyway, to check which characters need to be encoded or escaped.  
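
A minimal sketch of that idea in Python, assuming a hypothetical `serialize_text` function for text-node content: the serializer already inspects each character to decide whether it needs escaping, so the XML-validity test can share the same pass. The escape table is deliberately small and is not a complete serializer.

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Characters that must be escaped in text content (illustrative subset).
ESCAPES = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}


def serialize_text(value: str) -> str:
    """Escape text content and reject non-XML characters in one pass."""
    out = []
    for ch in value:
        if ch in ESCAPES:
            out.append(ESCAPES[ch])
        elif is_xml10_char(ord(ch)):
            out.append(ch)
        else:
            # Validity is only enforced here, at serialization time.
            raise ValueError(f"cannot serialize U+{ord(ch):04X} as XML 1.0")
    return "".join(out)
```

With this arrangement a string holding U+0001 can exist in memory and be written by a text output method; only an attempt to serialize it as XML fails.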

This makes sense if one extends the datamodel to match the infoset model, in allowing arbitrary characters in text nodes (and also attribute values, comments, and processing-instructions).  But I agree the issues to be considered are too big for this point.

> Apart from that, the XML working group chose consciously to disallow certain
> characters, and we should respect that decision unless there are very
> compelling reasons. Sometimes it's more important to be consistent than to be
> right.

The XML working group seems to have disallowed many fewer characters in XML 1.1, now only disallowing 0, surrogates, FFFE, and FFFF.
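
The difference can be checked with a short Python sketch that encodes the Char productions of XML 1.0 and XML 1.1 as predicates and compares them over the whole code point range. The function and variable names are illustrative only.

```python
def is_xml10_char(cp: int) -> bool:
    """Char production of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """Char production of XML 1.1."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


ALL_CODE_POINTS = range(0x110000)

# Code points that XML 1.1 accepts but XML 1.0 rejects (C0 controls).
only_in_11 = [cp for cp in ALL_CODE_POINTS
              if is_xml11_char(cp) and not is_xml10_char(cp)]

# Code points that even XML 1.1 rejects: 0, the surrogates, FFFE and FFFF.
rejected_by_11 = [cp for cp in ALL_CODE_POINTS if not is_xml11_char(cp)]
```

Under these definitions `only_in_11` holds the C0 control characters other than tab, line feed and carriage return, and `rejected_by_11` holds exactly the values listed in the comment above.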

> The XPath data model should have XML as its foundation.

I have mixed feelings about this philosophy.  A more general data model that just happens to subsume XML infoset model can also be very useful.

One issue is being able to use general-purpose libraries that do not have to be specially written for XQuery/XPath.  For example, one would like to use common libraries for regular expressions, or more generally string handling, across multiple languages rather than writing XPath-specific string functions.  That is often difficult because of minor specification differences.

If there is a list of issues to be considered for XQuery 1.1, perhaps this could be added to it?  Apart from that, I won't object to closing this issue.

Comment 3 Jim Melton 2006-10-19 00:28:40 UTC

Per, thanks for your reasoned response.  Based on that response, I will mark this item as an ENHANCEMENT and CLOSE it on that basis (changing the resolution to LATER).  In addition, I will add it to the list of things being considered for any possible future version of XQuery. 

Jim (acting unilaterally as co-chair)