Re: I18N-ISSUE-246: Clarify character encoding behavior when calculating storage size [ITS-20]

On Feb 22, 2013, at 7:28, Stephan Walter wrote:

> Hi,
>  
> Since I can see Norbert’s point here but, on the other hand, I’m not really convinced, I’d like to involve the group in this discussion once more.
>  
> I find the case somewhat strange because the values to be expected for the encoding attribute depend completely on external factors (encodings in users’ databases or CMSs) and are quite unrelated to the data being marked up. This seems to make it a bit hard to delimit a reasonable set of required encodings apart from UTF-8. I also think that it might turn out to be quite a large requirement introduced by just this one data category.

Requiring just UTF-8 and leaving support for other encodings optional would be fine.

> I think I do agree now that it would be good if an application could rely on a specific behavior of a processor in the error cases mentioned by Norbert. So I had a look to see how we deal with parallel cases. It seems to me that ‘Allowed Characters’ is actually the only other category where we provide constraints on the tagged data that one would expect to be evaluated automatically on the data itself in a standard way.
> In this case we have regular expressions and (as I understand it) we import the requirements on processing behavior by pointing to the section on Character Classes of XML Schema. So there is no real analogy here, either.
>  
> If we require any specific behaviour in the error cases mentioned in the issue, how detailed would this have to be?

It could be as simple as "If an ITS processor doesn't support the specified character encoding, it must report this as an error and terminate processing. If the selected nodes contain characters that the specified character encoding cannot represent, the processor must report this as an error and terminate processing." Or you could try to be nice in the second case and specify a fallback strategy, e.g., by saying that the first replacement character among U+FFFD, U+003F, U+FF1F that can be represented in the specified character encoding must be used instead of any character that can't. Or you could have an attribute selecting different behaviors. You probably have a better understanding than I do of what would work for your target audience.
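
To make the two options concrete, here is a minimal Python sketch of what a processor might do. It is only an illustration under my own assumptions: the names storage_size, pick_replacement, StorageSizeError and the fallback flag are hypothetical and not taken from the ITS 2.0 draft, and "supported" here just means "known to Python's codec registry".

```python
# Hypothetical sketch; none of these names come from the ITS 2.0 draft.

# Replacement candidates in the order suggested above.
REPLACEMENT_CANDIDATES = ("\ufffd", "\u003f", "\uff1f")


class StorageSizeError(Exception):
    """Raised where the strict wording says 'report an error and terminate'."""


def _encodable(text, encoding):
    """Return True if every character of text can be represented in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def pick_replacement(encoding):
    """Return the first candidate the encoding can represent, or None."""
    for candidate in REPLACEMENT_CANDIDATES:
        if _encodable(candidate, encoding):
            return candidate
    return None


def storage_size(text, encoding="utf-8", fallback=False):
    """Return the number of bytes text occupies in the given encoding.

    fallback=False: strict behavior, any unrepresentable character is an error.
    fallback=True: substitute the first representable replacement character.
    """
    # Error case 1: the processor does not support the encoding at all.
    try:
        "".encode(encoding)
    except LookupError:
        raise StorageSizeError(f"unsupported encoding: {encoding!r}")

    if _encodable(text, encoding):
        # Codecs such as Python's "utf-16" prepend a BOM, which is counted here.
        return len(text.encode(encoding))

    # Error case 2: some character cannot be represented.
    if not fallback:
        raise StorageSizeError(f"text not representable in {encoding!r}")

    replacement = pick_replacement(encoding)
    if replacement is None:
        raise StorageSizeError(f"no replacement character fits {encoding!r}")

    # Substitute per character, then encode the whole string once.
    substituted = "".join(
        ch if _encodable(ch, encoding) else replacement for ch in text
    )
    return len(substituted.encode(encoding))
```

With this sketch, storage_size("héllo", "utf-8") counts two bytes for the "é", while the same string in "iso-8859-1" needs one byte per character. An attribute selecting the behavior would correspond to the fallback flag.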

Norbert

Received on Wednesday, 27 February 2013 07:50:02 UTC