Re: [P&C] Low-level internationalization, XML deserialization, IRI or URI, IRI normalization

2009/7/25 Marcin Hanclik <Marcin.Hanclik@access-company.com>:
> Hi Marcos, All,
>
> Regarding the usage of IRI in the widget configuration document, I do not know which specification is responsible for mandating the IRI normalization.
> It is possible that I simply have not yet found the proper existing explanation to the issue, so if you know it, I would be grateful to get this information.
>
> These are more details.
>
> The P&C spec mixes the targets of the grammars (or low-level format specifications) it operates on.
> E.g.
> the sections about Zip archive operate on bytes
> http://www.w3.org/TR/widgets/#zip-archive
> http://www.w3.org/TR/widgets/#version-of-zip-needed-to-extract-a-file-
>
> and
> zip-rel-path grammar
> http://www.w3.org/TR/widgets/#zip-rel-path
> operates on characters, not bytes (it may not be fully clear from the P&C text).

We currently say this in the spec: "For interoperability,
manipulations of Zip relative paths must be performed on the string
obtained by decoding the file name field using the appropriate
encoding, and not on the bytes initially stored in the archive. For
the sake of comparison and matching, it is recommended that a user
agent treat all Zip-relative paths as [UTF-8]."
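
To make the bytes-versus-characters point concrete, here is a rough
Python sketch (not spec text). It assumes the Zip convention that
general purpose bit 11 marks a file name as UTF-8 and that names are
otherwise CP437; the name decode_zip_name is made up for illustration.

```python
UTF8_FLAG = 0x800  # general purpose bit 11: file name field is UTF-8


def decode_zip_name(raw: bytes, flag_bits: int) -> str:
    """Decode the raw file name field of a Zip entry into a string."""
    # Assumption: names without the UTF-8 flag are CP437.
    encoding = "utf-8" if flag_bits & UTF8_FLAG else "cp437"
    return raw.decode(encoding)


raw = "Łódź/icon.png".encode("utf-8")
# Path manipulation happens on characters, so split the decoded string,
# never the raw bytes.
print(decode_zip_name(raw, UTF8_FLAG).split("/"))
```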

> XML Fifth Edition refers only to URI specification, it does not know about IRI.

right.

> WUA must support XML and UTF-8:
> http://www.w3.org/TR/widgets/#dependencies-on-other-specifications-and

However, in Step 7, we have not said what happens if the user agent
does not understand the encoding. This would obviously result in the
widget being treated as an invalid widget:

To Step 7, I've added "If doc is encoded in a format that is
unsupported by the user agent, then the user agent must terminate this
algorithm and treat this widget package as an invalid Zip archive."
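
A rough sketch of what that check could look like, in Python. This is
only an illustration: Python's codec registry stands in for whatever
encodings a real user agent supports, the function names are
hypothetical, and the declaration sniffing only handles
ASCII-compatible encodings (no BOM or UTF-16 detection).

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the
# very start of the document.
DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(doc: bytes) -> str:
    """Return the encoding named in the XML declaration, else UTF-8."""
    match = DECL.match(doc)
    return match.group(1).decode("ascii") if match else "utf-8"


def is_supported(doc: bytes) -> bool:
    """True if this (hypothetical) user agent can decode the document."""
    try:
        codecs.lookup(declared_encoding(doc))
        return True
    except LookupError:
        return False
```

When is_supported returns False, the user agent would terminate the
algorithm as described above.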

> The configuration document is only required to be XML:
> http://www.w3.org/TR/widgets/#configuration-document
> and its encoding may be virtually any that is registered with IANA (my assumption).

Yes. We don't impose restrictions on the encoding. I don't think we should.

However, to the authoring guideline in the Configuration document
section, I've added "To ensure interoperability, encode the
configuration document as [UTF-8]."

> So we can have the following situation:
> The WUA, that I develop widgets for, has a very interesting feature, whose IRI is really international (Polish in this case):
>
> http://example.com/ŁódzkiŚpiewnikŹdźbłowy

Yep

> I.e. the IRI contains characters outside of the US-ASCII character set.

Yep

> Then, I may not have a UTF-X capable editor at hand, so I convert the IRI to a URI as in
> http://tools.ietf.org/html/rfc3987#section-3.1, Step 2, and
> I write the following config.xml with the legacy US-ASCII encoding:
>
> <?xml version="1.0" encoding="us-ascii"?>
> <widget …>
> …
> <feature name="http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy" />
> …
> </widget>
>
> http://tools.ietf.org/html/rfc3987#section-3.2 provides a method to convert URI to IRI.
> However, I am not sure whether this conversion is mandated in P&C, since P&C just says that e.g. the name attribute is an IRI:
> I am not sure whether it should have IRI syntax in config.xml (not possible in my case, since I use US-ASCII only) or later.
>
> Percent encoding is allowed in IRIs:
> http://tools.ietf.org/html/rfc3987#section-2.2
> and
> "Terminals in the ABNF are characters, not bytes."
>
> Therefore it seems possible that the above config.xml, when parsed by XML- and UTF-8-supporting WUA, will refer to a feature whose IRI would be
>
> http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy
>
> on the character level. Then, this valid IRI has to be checked for equivalence with
>
> http://example.com/ŁódzkiŚpiewnikŹdźbłowy

ok....

> based on the algorithm specified in http://tools.ietf.org/html/rfc3987#section-5.1
> and
> http://tools.ietf.org/html/rfc3987#section-5.3.1
>
> http://tools.ietf.org/html/rfc3987#section-5.3.2, specifically section 5.3.2.3 mentions percent-encoding normalization.

> I am not sure whether DOM3Core Load&Save mechanisms perform such normalization (as also below).

I doubt it, as DOM3Core has no way of knowing if the value of an
attribute is a URI or not. This is application dependent.
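
So the application would have to do the mapping itself. As a rough
illustration of the IRI-to-URI direction (RFC 3987, Section 3.1,
Step 2), here is a Python sketch. It is a simplification: it
percent-encodes every non-ASCII character rather than checking the
ucschar/iprivate productions, and it applies NFC first as an
assumption. The name iri_to_uri is made up.

```python
import unicodedata


def iri_to_uri(iri: str) -> str:
    """Percent-encode the UTF-8 bytes of every non-ASCII character."""
    out = []
    for ch in unicodedata.normalize("NFC", iri):
        if ord(ch) < 0x80:
            # ASCII, including any existing %XX escapes, is left untouched.
            out.append(ch)
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


print(iri_to_uri("http://example.com/ŁódzkiŚpiewnikŹdźbłowy"))
```

Applied to Marcin's Polish example, this should give the same
percent-encoded string he wrote into his US-ASCII config.xml.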

> P&C does not specify it.
>
> P&C says:
> "An attribute defined as containing a valid IRI. A valid IRI is one that matches the IRI token of the [RFC3987] specification."

Well, it defines the expected value.

> Again, is it the syntax in config.xml (i.e. impossible on byte level) or later?
>
> DOM3Core http://www.w3.org/TR/DOM-Level-3-Core/core.html says
> "A solution for loading a Document and saving it persistently is proposed in [DOM Level 3 Load and Save]."
>
> DOM3LS http://www.w3.org/TR/DOM-Level-3-LS/load-save.html will normalize the entities AFAIK, but probably will not normalize percent-encoded characters in URI/IRI.
>

I don't think it would.

> Proposal
>
> http://www.w3.org/TR/REC-xml-names/#iri-use says:
> "Because of the risk of confusion between URIs that would be equivalent if dereferenced, the use of %-escaped characters in namespace names is strongly discouraged."
>
> So maybe P&C shall state something similar, e.g.
>
> "Because of the risk of confusion between IRIs that would be equivalent if dereferenced, the use of %-escaped characters in feature names is strongly discouraged."
>

Like I said in my last email, the above is OK as an authoring
guideline. What is needed in Step 7 is to state that normalization
needs to happen on URI attributes.
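
To give an idea of what such a normalization step might involve, here
is a Python sketch. It is not proposed spec text: it maps both forms to
a percent-encoded URI, uppercases the hex digits and decodes escapes of
unreserved characters (along the lines of RFC 3986, Section 6.2.2),
then compares the strings. It does not touch scheme or host case, and
the function names are hypothetical.

```python
import re
import unicodedata
from urllib.parse import quote

UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def _fix_escape(match):
    ch = chr(int(match.group(1), 16))
    # Decode escapes of unreserved characters; otherwise uppercase the hex.
    return ch if ch in UNRESERVED else "%" + match.group(1).upper()


def normalize_feature_name(name: str) -> str:
    # Map non-ASCII characters to %XX (UTF-8); keep delimiters and "%".
    nfc = unicodedata.normalize("NFC", name)
    uri = quote(nfc, safe="%:/?#[]@!$&'()*+,;=")
    return ESCAPE.sub(_fix_escape, uri)


def same_feature(a: str, b: str) -> bool:
    return normalize_feature_name(a) == normalize_feature_name(b)
```

With something like this, the percent-encoded feature name from the
US-ASCII config.xml and the native Polish IRI would compare as equal.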

> This could result in percent-encoded IRIs not being present in the configuration document, and in the need for the configuration document developer to use a UTF-8 capable editor (it may be too hard a requirement, it is just a proposal).
>

I don't know how much of an issue this is... even Notepad will spit
out UTF-8 (if politely asked to, via the encoding box). I don't think
the use of a UTF-8 capable editor is a real concern. However, I'm
approaching this from an English-centric perspective. I don't know what
the reality is, for instance, in China, India, etc...

I think we should ask i18n for guidance.

> Alternatively, we could specify in P&C that the attributes – that are currently specified as being IRI – shall actually be "IRI or URI" depending on the encoding of the config.xml.
>

This is what we had previously. The i18n WG made us change that to IRI
because URIs are a subset of IRIs (and, in theory, a URI is fully
compatible with an IRI).

> A third option would be to say something about IRI/URI normalization.

Yes, like I said, it might be good to make it explicit in Step 7. We
might even reference the Resolving Web addresses section of
http://www.w3.org/html/wg/href/draft-ietf.html

> More comments:
>
> The part of
> http://www.w3.org/html/wg/href/draft.html#parsing-urls
> namely:
> "How does this compare to just parsing using the IRI grammar of RFC 3987?"
> makes me think that the problem (I assume my problem and the Web addresses are similar) is not yet fully solved in any spec.
> I am sorry for any ignorance if such is identified.
>
> The latest draft for IRI is this one:
> http://tools.ietf.org/html/draft-duerst-iri-bis-06
> and it is being discussed also in W3C, see e.g. very recent comments from Anne at
> http://lists.w3.org/Archives/Public/public-iri/2009Jul/
>
> These are the documents that could help more:
> http://www.w3.org/International/articles/idn-and-iri/
> http://www.w3.org/html/wg/href/draft-ietf.html (seems to be just a newer version of the above draft)
>

Thanks for the pointers.


-- 
Marcos Caceres
http://datadriven.com.au

Received on Friday, 7 August 2009 11:43:56 UTC