Bug 19004 - [FO30] format-integer tests expecting err:FODF1310
Status: RESOLVED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Functions and Operators 3.0
Version: Working drafts
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: ---
Assignee: Michael Kay
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
URL: https://www.w3.org/XML/Group/qtspecs/...
Reported: 2012-09-25 08:42 UTC by Tim Mills
Modified: 2013-03-11 19:03 UTC
Description Tim Mills 2012-09-25 08:42:01 UTC
The F&O 3.0 specification states that

"If the primary format token contains at least one Unicode digit then it is taken as a decimal digit pattern, and in this case it must match the regular expression ^((\p{Nd}|#|[^\p{N}\p{L}])+?)$. If it contains a digit but does not match this pattern, an error is raised [err:FODF1310]."

Thus FODF1310 is ONLY raised when a primary format token contains at least one Unicode digit AND fails to match the given regular expression.

This is later stated as

"An error is raised [err:FODF1310] if the primary format token contains a digit but does not match the required regular expression."

The following tests expect an error for the given $pattern.

 format-integer-020: '' (empty string).  This does NOT contain a Unicode digit.
 format-integer-023: '0,000,': This matches the regex.
 format-integer-024: '11#0,000'.  This matches the regex.
 format-integer-025: '#'.  This does NOT contain a Unicode digit.
 format-integer-026: '#a'.  This does NOT contain a Unicode digit.
 format-integer-027: ',123'.  This matches the regex.
 format-integer-028: '0,00,,000'.  This matches the regex.
 format-integer-034: '###𐄀0,00'.  This matches the regex.
 format-integer-037: '1;o(-er)z'.  Primary format token '1' matches the regex.
 format-integer-038: 'Ww;o()('.  Primary format token 'Ww' does NOT contain digit.
 format-integer-039: '&#xA;' (a single line feed character).  This does NOT contain a Unicode digit.
 format-integer-040: '123١'.  This matches the regex.
 format-integer-054: '0#'.  This matches the regex.
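The per-test claims above can be checked mechanically. What follows is a minimal Python sketch, not taken from the spec or the test suite. Python's standard `re` module has no `\p{...}` support, so the regular expression is emulated one character at a time with `unicodedata`. The helper names `has_digit` and `matches_regex` are hypothetical, "Unicode digit" is assumed to mean general category Nd, the token for test 039 is assumed to be a single line feed, and only a subset of the primary format tokens is listed.

```python
import unicodedata

def has_digit(token):
    """True if the token contains at least one character in category Nd."""
    return any(unicodedata.category(ch) == "Nd" for ch in token)

def matches_regex(token):
    # Emulates ^((\p{Nd}|#|[^\p{N}\p{L}])+?)$ one character at a time.
    def allowed(ch):
        cat = unicodedata.category(ch)
        return cat == "Nd" or ch == "#" or cat[0] not in "NL"
    return token != "" and all(allowed(ch) for ch in token)

# Primary format tokens keyed by test number (illustrative subset).
tokens = {
    "020": "",
    "023": "0,000,",
    "024": "11#0,000",
    "025": "#",
    "026": "#a",
    "027": ",123",
    "028": "0,00,,000",
    "037": "1",
    "038": "Ww",
    "039": "\n",          # assumption: a single line feed
    "040": "123\u0661",   # ends with ARABIC-INDIC DIGIT ONE
    "054": "0#",
}

for test, token in tokens.items():
    # Original wording: FODF1310 only if the token has a digit AND fails the regex.
    error = has_digit(token) and not matches_regex(token)
    print(test, repr(token), has_digit(token), matches_regex(token), error)
```

Under the original wording, the last column is false for every token listed, which is the discrepancy reported here.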

One of the following is possible.
a) I'm wrong.
b) The tests are wrong.
c) The specification is wrong.
Comment 1 O'Neil Delpratt 2012-09-26 15:47:09 UTC
(In reply to comment #0)
> The F&O 3.0 specification states that
> 
> "If the primary format token contains at least one Unicode digit then it is
> taken as a decimal digit pattern, and in this case it must match the regular
> expression ^((\p{Nd}|#|[^\p{N}\p{L}])+?)$. If it contains a digit but does not
> match this pattern, an error is raised [err:FODF1310]."

Further down in the format-integer section of the specification there are some 'must' statements, indicating rules which the picture must abide by. The tests are edge cases of these.

Therefore, the specification should be clearer with regard to defining an error code for the 'must' statements and the regex.

I am raising this as a F&O spec issue. The change required is trivial enough, but it does require flagging up to the WG.
Comment 2 Michael Kay 2012-09-27 08:56:22 UTC
I propose to fix the spec by extending the meaning of FODF1310 to cover the other errors described, which currently specify no error code. Specifically, change:

If the primary format token contains at least one Unicode digit then it is taken as a decimal digit pattern, and in this case it must match the regular expression ^((\p{Nd}|#|[^\p{N}\p{L}])+?)$. If it contains a digit but does not match this pattern, an error is raised [err:FODF1310].

to

If the primary format token contains at least one Unicode digit then it is taken as a decimal digit pattern, and in this case it must match the regular expression ^((\p{Nd}|#|[^\p{N}\p{L}])+?)$, and must satisfy certain other rules described below. If it contains a digit but does not meet these requirements, an error is raised [err:FODF1310].

and change the "Error conditions" section from

An error is raised [err:FODF1310] if the primary format token contains a digit but does not match the required regular expression.

to

An error is raised [err:FODF1310] if the primary format token contains a digit but is not a valid decimal-digit-pattern
Comment 3 Tim Mills 2012-09-27 16:27:29 UTC
I think the problem I have here is the precedence of the rule:

"Any other format token, which indicates a numbering sequence in which that token represents the number 1..."

and the raising of FODF1310.

Exactly what conditions have to be met to determine that a primary format token is an invalid "decimal-digit-pattern" and not an "any other format token"?
Comment 4 Michael Kay 2012-09-27 16:30:06 UTC
> Exactly what conditions have to be met to determine that a primary format token
> is an invalid "decimal-digit-pattern" and not an "any other format token"?

I thought we now stated very clearly that it falls into the first category if it contains at least one digit.
Comment 5 Tim Mills 2012-09-27 17:19:15 UTC
Several of the patterns in comment #0 do not contain any digit, e.g. the empty string and '#'.
Comment 6 Michael Kay 2012-09-27 22:16:20 UTC
(In reply to comment #5)
> Several of the patterns in comment #0 do not have one digit, e.g. Empty string,
> #.

Fair enough. I would be inclined to say:

(a) a zero-length primary format token is an error (spec change)

(b) in the other cases where the regex is not matched (025, 026, 039), the implementation should fall back to a format token of "1".

(c) for the cases where the format modifier is wrong: we currently say "The format modifier, if present, is one or more of the following, in order". We don't say what happens if it isn't. I would recommend (spec change) changing the "is" to "must be" and raising an error (the same error code). Affects 037 and 038; there should probably be more tests of this condition.

Michael Kay
Comment 7 Tim Mills 2012-09-28 11:30:29 UTC
> Fair enough. I would be inclined to say:
> 
> (a) a zero-length primary format token is an error (spec change)

This would require a change to the expected result of format-integer-061:

format-integer(1, ';')
Comment 8 Tim Mills 2012-09-28 11:33:13 UTC
> (a) a zero-length primary format token is an error (spec change)
> 
> (b) in the other cases where the regex is not matched (025, 026, 039), the
> implementation should fall back to a format token of "1".

Surely where the regex isn't matched, you'd have to fall through to "Any other format token...".
Comment 9 Michael Kay 2012-09-28 12:28:21 UTC
(In reply to comment #8)
> > (a) a zero-length primary format token is an error (spec change)
> > 
> > (b) in the other cases where the regex is not matched (025, 026, 039), the
> > implementation should fall back to a format token of "1".
> 
> Surely where the regex isn't matched, you'd have to fall through to "Any other
> format token...".

Yes of course. But for the tests in question, although this technically leaves other options open, most processors might be expected to then take the path that says if the format token isn't recognized, treat it as "1". Technically the test should have a dependency, for example in the case of -039 that the implementation does not recognize xA as a format token.
Comment 10 Tim Mills 2012-09-28 13:25:17 UTC
I suspect that

"All mandatory-digit-signs within the format token must be from the same digit family, where a digit family is a sequence of ten consecutive characters in Unicode category Nd, having digit values 0 through 9. "

should also be a trigger for FODF1310.  Note that the "must" here isn't in bold.
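The digit-family rule quoted above can be illustrated with a short Python sketch. It assumes that subtracting a digit's numeric value from its code point identifies the zero of its family, which holds when each family is ten consecutive code points as the quoted rule describes. `digit_families` is a hypothetical helper name.

```python
import unicodedata

def digit_families(token):
    """Return the code point of the zero digit of each family used in token."""
    return {
        ord(ch) - unicodedata.digit(ch)
        for ch in token
        if unicodedata.category(ch) == "Nd"
    }

# One family: ASCII digits only.
print(digit_families("0,000"))
# Two families: ASCII digits plus ARABIC-INDIC DIGIT ONE (test 040).
print(digit_families("123\u0661"))
```

A token would violate the rule whenever the returned set has more than one member.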
Comment 11 Tim Mills 2012-09-28 15:00:33 UTC
That still leaves format-integer-037 and format-integer-038 which are presumably expecting an error due to the format modifier.  The specification doesn't mention raising errors for this.
Comment 12 Tim Mills 2012-09-28 15:17:31 UTC
(In reply to comment #11)
> That still leaves format-integer-037 and format-integer-038 which are
> presumably expecting an error due to the format modifier.  The specification
> doesn't mention raising errors for this.

Sorry, make that

format-integer-037: format-integer(1, '1;o(-er)z')
format-integer-038: format-integer(1234, 'Ww;o()(')

Test format-integer-038 seems to suggest that if the primary format token isn't recognised, then the format modifier should be ignored as well.  The specification doesn't seem to suggest that.  I'd have expected the result 1234th rather than 1234.

Other than a couple of examples, there doesn't seem to be much describing how to use the argument appearing within brackets.  Is it an error if it doesn't include a hyphen?
Comment 13 Michael Kay 2012-12-10 12:32:55 UTC
PROPOSAL.

1. Add to the paragraph starting: "The value of $picture consists of ...", the sentence "The primary format token is always present and MUST NOT be zero-length."

1a. In the sentence "All mandatory-digit-signs within the format token must be from the same digit family" make "must" an RFC2119 keyword.

2. Change the sentence "If the [$lang] argument is specified, the value should be a string that is castable to the type xs:language." and the following paragraph to use the same formulation as the corresponding argument of format-date():

The value of the argument should(*) be either the empty sequence or a value that would be valid for the xml:lang attribute (see [XML]). Note that this permits the identification of sublanguages based on country codes (from [ISO 3166-1]) as well as identification of dialects and of regions within a country.

If the $language argument is omitted or is set to an empty sequence, or if it is set to an invalid value or a value that the implementation does not recognize, then the processor uses the default language defined in the dynamic context.

(*) format-date() uses "must" here, but since supplying an invalid value is not an error, "should" would be more appropriate.

3. Change the sentence "The format modifier, if present, is one or more of the following, in order:" to:

"The format modifier *must* be a string that matches the regular expression ^([co]\(.+\))?[at]?$. That is, if it is present then it must consist of one or more of the following, in order:"

3a. Add before the "In some languages" paragraph:

"The string of characters between the parentheses, if present, is used to select between other possible variations of cardinal or ordinal numbering sequences. The interpretation of this string is implementation-defined. No error occurs if the implementation does not define any interpretation for the supplied string."

3b. Within the "In some languages" paragraph, change "preferred" to "recommended" in RFC2119 styling.

4. In the errors section, clarify that FODF1310 is raised for any violation of mandatory rules concerning the format picture.
Comment 14 Michael Kay 2013-01-09 22:14:20 UTC
Note that the proposal in comment #13 has been implemented in the Candidate Recommendation of 8 Jan 2013, despite the absence of any evidence that the WG accepted it as a resolution.
Comment 15 Tim Mills 2013-01-18 16:26:49 UTC
Following the changes, I think the following test cases need fixing in the test suite.

format-integer-025: format-integer(1500000, '#')
format-integer-026: format-integer(1500000, '#a')

These should act as if the default primary format token was used rather than expecting an error.

format-integer-061: format-integer(1, ';')

This should expect an error, and not behave as if the default primary format token was supplied because:

"If the string contains one or more semicolons then everything that precedes the last semicolon is taken as the primary format token"

giving a zero-length primary format token and 

"The primary format token is always present and must not be zero-length."

hence the error.
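The splitting rule quoted here can be sketched in a few lines of Python. `split_picture` is a hypothetical helper, not a function from any XPath library. Treating a picture without a semicolon as consisting entirely of the primary format token is an assumption, since the quoted sentence only covers the semicolon case.

```python
def split_picture(picture):
    """Split a picture at its last semicolon into (primary token, modifier)."""
    primary, sep, modifier = picture.rpartition(";")
    if not sep:
        # Assumption: no semicolon means the whole picture is the primary token.
        return picture, ""
    return primary, modifier

for picture in [";", "1;o(-er)z", "Ww;o()(", "0,000"]:
    primary, modifier = split_picture(picture)
    status = "zero-length primary token" if primary == "" else "ok"
    print(repr(picture), repr(primary), repr(modifier), status)
```

For the picture `';'` the primary format token comes out empty, which is the case the new "must not be zero-length" rule turns into an error.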

I also believe the regex governing the format modifier to be incorrect.  It is stated as:

^([co]\(.+\))?[at]?$. 

but I think it should be

^([co](\(.+\))?)?[at]?$. 

because the former requires an argument to 'c' or 'o' whereas the specification states:

"either c or o, optionally followed by a sequence of characters enclosed between parentheses, to indicate cardinal or ordinal numbering respectively, the default being cardinal numbering"
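The difference between the two expressions can be seen by running both against a few modifiers. This is a minimal sketch using Python's `re` module, on the assumption that for patterns this simple Python's regex dialect behaves the same as the XPath one. The names `OLD` and `NEW` are illustrative only.

```python
import re

OLD = r"^([co]\(.+\))?[at]?$"      # as stated in the spec
NEW = r"^([co](\(.+\))?)?[at]?$"   # correction suggested above

for modifier in ["", "a", "o", "ot", "c(x)a", "o(-er)", "o(-er)z", "o()("]:
    old_ok = re.fullmatch(OLD, modifier) is not None
    new_ok = re.fullmatch(NEW, modifier) is not None
    print(repr(modifier), "old:", old_ok, "new:", new_ok)
```

The two expressions disagree only on modifiers such as `o` and `ot`, where `c` or `o` appears without a parenthesized argument. The modifiers from tests 037 and 038 are rejected by both.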
Comment 16 Michael Kay 2013-01-22 18:44:07 UTC
The Working Group retrospectively accepted the change described in comment #13, and also accepted the correction to the regular expression given in comment #15.

I have applied this correction to the spec and have also made the corrections to test cases identified in comment #15.
Comment 17 Christian Gruen 2013-03-10 20:35:20 UTC
Sorry for my jumping in late. I am wondering if the new results of the following two test cases...

format-integer-025:
  format-integer(1500000, '#')
  1500000

format-integer-026
  format-integer(1500000, '#a')
  1500000

...harmonize with the specification, which says that "there MUST be at least one mandatory-digit-sign" in decimal-digit-patterns.

Next, I am not sure where to spot the "default primary format" mentioned in comment #15.
Comment 18 Michael Kay 2013-03-11 08:45:46 UTC
(In reply to comment #17)
> Sorry for my jumping in late. I am wondering if the new results of the
> following two test cases...
> 
> format-integer-025:
>   format-integer(1500000, '#')
>   1500000
> 
> format-integer-026
>   format-integer(1500000, '#a')
>   1500000
> 
> ...harmonize with the specification, which says that "there MUST be at least
> one mandatory-digit-sign" in decimal-digit-patterns.

Yes, they do, because the rule in question only applies when you have a "decimal digit pattern", and you only have a decimal digit pattern when the primary format token contains at least one digit. In fact, the rule cited is now always true by definition.
> 
> Next, I am not sure where to spot the "default primary format" mentioned in
> comment #15.

I think comment #15 is using the phrase "default primary format token" to refer to the token used when this rule is invoked: "If an implementation does not support a numbering sequence represented by the given token, it must use a format token of 1." Which in this case means that "#" and "#a" are not errors, they are format tokens with an implementation-defined meaning, which is likely to be "1".
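The decision order described in this reply can be summarized as a rough Python sketch. It is an illustration under simplifying assumptions, not spec text: `classify` is a hypothetical name, "digit" is taken to mean Unicode category Nd, validation of the decimal digit pattern itself is left out, and all digit-free tokens (including recognized ones such as "Ww") are lumped together.

```python
import unicodedata

def classify(primary_token):
    """Classify a primary format token following the order discussed above."""
    if primary_token == "":
        return "error: zero-length primary format token"
    if any(unicodedata.category(ch) == "Nd" for ch in primary_token):
        # At least one digit: the token is a decimal digit pattern and must
        # satisfy the pattern rules, otherwise FODF1310 is raised.
        return "decimal-digit-pattern"
    # No digit: never a decimal digit pattern, so the pattern rules do not
    # apply. An implementation that does not support the token uses "1".
    return "other format token"

for token in ["", "0,000", "#", "#a", "Ww"]:
    print(repr(token), "->", classify(token))
```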
Comment 19 Christian Gruen 2013-03-11 09:30:27 UTC
(In reply to comment #18)
> Which in this case means that "#" and "#a"
> are not errors, they are format tokens with an implementation-defined
> meaning, which is likely to be "1".

Thanks! I see. Maybe things get more obvious if we...

- always classify a pattern as "decimal-digit-pattern" if the primary format token starts with (or contains?) "#"
- limit "any other format token" to single characters, and raise FODF1310 otherwise

Without those changes, I fear that users may easily get confused, and will hardly understand when an error is raised and when not.

The changed rules would then yield errors for the following queries:

- format-integer(1, '#')
- format-integer(1, '#,')
- format-integer(1, '#a')
- format-integer(1, 'x#')
- format-integer(1, '!!!')

I would even suggest raising an error whenever "an implementation does not support a numbering sequence represented by the given token" in order to reject queries like "format-integer(1, '(')".


> Yes, they do, because the rule in question only applies when you have a
> "decimal digit pattern", and you only have a decimal digit pattern when the
> primary format token contains at least one digit. In fact, the rule cited is
> now always true by definition.

I would propose to rephrase this sentence into "There is at least one mandatory-digit-sign.". If we should decide to change the parsing rules as proposed above, this sentence could stay as is.
Comment 20 Michael Kay 2013-03-11 10:07:05 UTC
At this stage of the game, clarifying and fixing bugs in the spec is still allowed, but making improvements isn't. I think what you are suggesting falls into the category of suggesting improvements.

Some of the details of format-integer() are derived directly from xsl:number, which may explain some of the oddities.

Requiring the primary format token to be a single character would disallow "Ww",  and also (I believe) some other options that various xsl:number implementations have traditionally provided.

The design principle for error handling in functions such as format-integer(), format-date() etc is that if the picture syntax is wrong, it's an error, but if the picture has implementation-defined semantics, then implementations that don't recognize the picture must use a fall-back representation. This is designed to maximize interoperability.
Comment 21 Christian Gruen 2013-03-11 10:16:24 UTC
..enough reasons to change the status back to RESOLVED. Thanks for the helpful insights.
Comment 22 Christian Gruen 2013-03-11 15:55:25 UTC
Another one.. The specification says that 

"It is ·implementation-defined· what combinations of values of the format token, the language, and the cardinal/ordinal modifier are supported. If ordinal numbering is not supported for the combination of the format token, the language, and the string appearing in parentheses, the request is ignored and cardinal numbers are generated instead."

As already observed in comment #12, the test "format-integer-038" requires "1234" to be returned, whereas I would also have expected "1234th" as the result or (at least) as an alternative. Did I miss something?
Comment 23 Michael Kay 2013-03-11 17:51:55 UTC
I agree, the result of format-integer-038 should be "1234th". Will fix.
Comment 24 Christian Gruen 2013-03-11 19:03:38 UTC
In the current specification, error FODF1310 is described as follows:

"This error is raised if the picture string supplied to fn:format-number has invalid syntax."

I believe that fn:format-integer should be added:

"This error is raised if the picture string supplied to fn:format-number or fn:format-integer has invalid syntax."