This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 14858 - [FO30] format-integer picture string
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Functions and Operators 3.0
Version: Member-only Editors Drafts
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: ---
Assignee: Michael Kay
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
URL: http://www.w3.org/XML/Group/qtspecs/s...
Reported: 2011-11-17 16:10 UTC by Tim Mills
Modified: 2012-05-18 15:26 UTC

Description Tim Mills 2011-11-17 16:10:18 UTC
The specification states that:

"A picture consists of a primary format token, followed by an optional format modifier. If the picture is two or more characters in length and the final character is one of those permitted as a format modifier, then the primary format token consists of the entire picture except for its final character; otherwise the primary format token is the entire picture."

and that:

"The format modifier, if present, is one or more of the following, in any order:

    either c or o, optionally followed by a sequence of characters enclosed between parentheses, to indicate cardinal or ordinal numbering respectively, the default being cardinal numbering

    either a or t to indicate alphabetic or traditional numbering respectively, the default being 
Comment 1 Michael Kay 2011-11-28 23:34:10 UTC
I think it should say that the format modifier, if present, consists of a single character, optionally followed by a sequence of characters starting with '(' and ending with ')' and not containing either '(' or ')' in between; and then for 'final character' read 'the last character in the string, or the last character before the "("'. Editorial work needed to find the right way to say this...
Comment 2 Michael Kay 2011-12-06 21:01:34 UTC
The WG encouraged the editor to specify the format of the picture by means of a formal grammar or regular expression. The following regular expression has been submitted for review by the WG, and is included in the status quo text on the presumption that it is OK:

^((\p{Nd}|#|[^\p{N}\p{L}])+?)(([co](\([^()]\))?)?[at]?)$

(this needs to be supplemented by additional prose rules, of course)
Comment 3 Michael Kay 2011-12-13 16:38:21 UTC
The last call draft of 13 December 2011 includes a regular expression to define the picture string, as given in comment #2, but unfortunately (as pointed out in subsequent email) this is incorrect: it works for the "decimal" form of picture string, but not for other forms such as A, I, or Ww.
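
The failure described above can be reproduced with a small sketch. Python's standard `re` module has no `\p{...}` support, so the pattern below is only an approximation of the regex in comment #2: `\d` stands in for `\p{Nd}` and `[\W_]` for `[^\p{N}\p{L}]`. The names `PICTURE_RE` and `accepted`, and the sample pictures, are illustrative and not part of the specification.

```python
import re

# Approximation of the regex from comment #2 (assumption: \d ~ \p{Nd},
# [\W_] ~ [^\p{N}\p{L}], because stdlib `re` lacks \p{...} classes).
PICTURE_RE = re.compile(r"^((\d|#|[\W_])+?)(([co](\([^()]\))?)?[at]?)$")


def accepted(picture: str) -> bool:
    """Return True if the picture matches the approximated regex."""
    return PICTURE_RE.match(picture) is not None


# Decimal-style pictures followed by letter-based format tokens.
for picture in ["1", "001", "#,##0", "1o", "A", "I", "Ww"]:
    print(f"{picture!r}: {accepted(picture)}")
```

Under this approximation the decimal-style pictures are accepted, while `A`, `I` and `Ww` are rejected, because the first group admits only digits, `#`, and characters that are neither letters nor numbers.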
Comment 4 Michael Kay 2012-01-19 11:13:04 UTC
I think the regex for the format modifier should be as follows: if the string ends with

(([co](\([^()]+\))?)?[at]?)

then this is taken as the format modifier. This differs from the published regex by the addition of the "+" quantifier. There may be better ways of expressing this (or paraphrasing it).
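
A minimal sketch of what the added "+" changes, using Python's `re` module. The sample modifier strings (such as `o(-er)`) and the helper name `is_modifier` are illustrative assumptions, not taken from the specification or the test suite.

```python
import re

# Format-modifier regex as published: exactly one character is allowed
# between the parentheses.
PUBLISHED = re.compile(r"(([co](\([^()]\))?)?[at]?)")

# Format-modifier regex with the "+" quantifier proposed in this comment:
# one or more characters are allowed between the parentheses.
PROPOSED = re.compile(r"(([co](\([^()]+\))?)?[at]?)")


def is_modifier(pattern: re.Pattern, text: str) -> bool:
    """Return True if the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None


for sample in ["o", "o(x)", "o(-er)", "o()", "ot"]:
    print(sample, is_modifier(PUBLISHED, sample), is_modifier(PROPOSED, sample))
```

With the published form, a parenthesized part longer than one character is not recognized as a format modifier; with the proposed form it is, and an empty `()` is still rejected.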

The regex for the first part of the picture is wrong. The part preceding the format modifier must either be a decimal digit pattern, which matches

((\p{Nd}|#|[^\p{N}\p{L}])+?)

or some other format token defined in the specification, which matches

(A|a|I|i|W|w|Ww)

or an implementation-defined format token, which I propose should be unrestricted except that it must not be empty.

I think we should express this as follows.

(1) The value of $picture must match the regular expression ^ (.+) ( ([co]( \([^()]\))? )? [at]? )$ according to the rules of the matches() function with flags "xs". (This is not a severe restriction: only the zero-length string fails to match this pattern.)

(2) Following this match, the content of captured group 1 is referred to as the primary format token, and the content of captured group 2 is referred to as the format modifier. The semantics of the format modifier are covered by existing rules.

(3) The primary format token is classified as follows:

(3a) if it contains a decimal digit or "#" then it is taken as a decimal-digit-pattern and must follow the rules for decimal digit patterns (otherwise, error)

(3b) if it is one of (A|a|I|i|W|w|Ww) then it is handled as defined in the specification for that format token

(3c) otherwise, its meaning (if any) is implementation-defined; if the implementation does not attach any other meaning to the format token then it is handled in the same way as the primary format token "1" (Currently the spec says it must use a format token of "1", which discards the modifier: this is a change.)

These rules change the result of some tests that are currently deemed errors, to being implementation-defined, with a fallback that uses format picture "1". Examples of such tests include format-integer-038, whose picture is "()Wwo", and format-integer-057, which uses a picture of "boo".
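
The classification in step (3) can be sketched as follows. This is an illustration only: the function name, the returned labels, and the sample tokens are assumptions, and the sketch does not validate a decimal digit pattern against the detailed rules, nor does it perform the split described in steps (1) and (2).

```python
import unicodedata

# Format tokens named in rule (3b).
NAMED_TOKENS = {"A", "a", "I", "i", "W", "w", "Ww"}


def classify_primary_token(token: str) -> str:
    """Classify a primary format token following rules (3a) to (3c)."""
    if not token:
        raise ValueError("primary format token must not be empty")
    # (3a) contains a decimal digit (Unicode category Nd) or "#"
    if any(ch == "#" or unicodedata.category(ch) == "Nd" for ch in token):
        return "decimal-digit-pattern"
    # (3b) one of the tokens defined in the specification
    if token in NAMED_TOKENS:
        return "named"
    # (3c) anything else: implementation-defined, falling back to "1"
    return "implementation-defined"


for token in ["001", "#,##0", "Ww", "i", "bo"]:
    print(f"{token!r}: {classify_primary_token(token)}")
```

Checking (3a) before (3b) mirrors the order in which the rules are listed. The two sets cannot overlap, since none of the named tokens contains a digit or "#".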
Comment 5 Michael Kay 2012-01-24 16:38:35 UTC
The WG was minded to improve the syntax of the picture string to avoid these ambiguities by introducing a delimiter between the primary and secondary format tokens. A new proposal will be forthcoming.
Comment 6 Michael Kay 2012-02-27 17:45:03 UTC
Proposal:

(a) replace this text:

<old>
The value of $picture must match the regular expression:

^((\p{Nd}|#|[^\p{N}\p{L}])+?)(([co](\([^()]\))?)?[at]?)$

The substring that matches the first capturing group in this regular expression are referred to as the primary format token. The substring that matches the second capturing group (which may be empty) is referred to as the format modifier. A picture thus consists of a primary format token, followed by an optional format modifier.
</old>

with this:

<new>

The value of $picture consists of a primary format token, optionally followed by a format modifier. If the string contains a semicolon then everything that precedes the last semicolon is taken as the primary format token and everything that follows is taken as the format modifier; if the string contains no semicolon then the entire picture is taken as the primary format token, and the format modifier is taken to be absent (which is equivalent to supplying a zero-length string).

</new>

(b) add the semicolon to the examples that require it
Comment 7 Michael Kay 2012-02-28 16:58:41 UTC
Proposal accepted with two amendments: change "contains a semicolon" to "contains one or more semicolons", and keep the regex for the case of numeric primary format tokens.
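
The accepted splitting rule can be sketched in a few lines. The function name `split_picture` and the sample pictures are illustrative assumptions. The sketch covers only the split at the last semicolon, not the retained regex check for numeric primary format tokens or any validation of the format modifier.

```python
from typing import Tuple


def split_picture(picture: str) -> Tuple[str, str]:
    """Split a picture into (primary format token, format modifier).

    Everything before the last semicolon is the primary format token and
    everything after it is the format modifier. Without a semicolon the
    whole picture is the primary token and the modifier is taken to be
    the zero-length string.
    """
    primary, sep, modifier = picture.rpartition(";")
    if not sep:
        # No semicolon at all: rpartition puts the whole string last.
        return picture, ""
    return primary, modifier


for picture in ["Ww;o", "1", "1;;o", "Ww;"]:
    print(f"{picture!r}: {split_picture(picture)}")
```

Because the split is made at the last semicolon, a primary format token may itself contain semicolons, as in the `1;;o` sample.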
Comment 8 Tim Mills 2012-03-02 10:51:11 UTC
I think I'm correct in saying that anything before the semicolon delimiter is a valid primary format token.

However, is the format modifier required to match some regex?

If it does not match, is that an error or a silent failure?
Comment 9 Michael Kay 2012-05-18 11:14:32 UTC
The outstanding point in comment 8 is covered by bug 17099. Closing this one.