Comments on section 7 from Björn Höhrmann on 2004-04-11 (www-i18n-comments@w3.org from April 2004)

From: Björn Höhrmann <bjoern@hoehrmann.de>
Date: Mon, 12 Apr 2004 05:31:33 +0900
To: www-i18n-comments@w3.org
Cc: bjoern@hoehrmann.de (Björn Höhrmann)
Message-Id: <815106943.20040411203133@toro.w3.mag.keio.ac.jp>
This is a last call comment from Björn Höhrmann (bjoern@hoehrmann.de) on
the Character Model for the World Wide Web 1.0
(http://www.w3.org/TR/2002/WD-charmod-20020430/).

Semi-structured version of the comment:

Submitted by: Björn Höhrmann (bjoern@hoehrmann.de)
Submitted on behalf of (maybe empty): 
Comment type: substantive
Chapter/section the comment applies to: 7 Character Encoding in URI References
The comment will be visible to: public
Comment title: Comments on section 7
Comment:
Section 7, Character Encoding in URI References

[...]
  According to the definition in RFC 2396 [RFC 2396], URI references are
  restricted to a subset of US-ASCII, with an escaping mechanism to
  encode arbitrary byte values, using the %HH convention. However, the
  %HH convention by itself is of limited use because there is no
  definitive mapping from characters to bytes.
[...]

This statement is rather confusing. There is no definitive mapping from characters to bytes for the entire URI reference; there is no difference between using 'a' or %61 in a URI reference. If I want to refer to a fragment "a" in some document, the URI spec does not say whether I would use #a or #b or #5, nor does it say that a URI reference #a matches a fragment 'a'; #a could well refer to a fragment "/" if the URI producer chooses EBCDIC as the encoding. The text quoted above suggests that there is a difference between characters and %HH escapes, which does not reflect the theory on this matter. Please make the text less confusing.
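To illustrate the point, here is a minimal Python sketch. It assumes Python's `urllib.parse`, which itself treats a literal 'a' as the octet 0x61, and it uses code page 037 (`cp037`) as a stand-in for "EBCDIC"; both choices are assumptions made for the illustration and are not mandated by RFC 2396.

```python
from urllib.parse import unquote_to_bytes

# The literal character and the %HH escape yield the same octet (0x61).
literal = unquote_to_bytes("a")
escaped = unquote_to_bytes("%61")
same_octets = literal == escaped

# Which character that octet stands for depends on the encoding the
# URI producer had in mind; the URI spec does not settle this.
as_ascii = escaped.decode("ascii")    # US-ASCII interpretation
as_ebcdic = escaped.decode("cp037")   # one EBCDIC code page (assumed here)

print(same_octets, repr(as_ascii), repr(as_ebcdic))
```

Under these assumptions the same fragment identifier reads as 'a' in US-ASCII and as '/' in code page 037.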

[...]
  C058  [S]  Specifications that define protocol or format elements
  (e.g. HTTP headers, XML attributes, etc.) which are to be interpreted
  as URI references (or specific subsets of URI references, such as
  absolute URI references, URIs, etc.) SHOULD use Internationalized
  Resource Identifiers (IRIs) [I-D IRI] (or an appropriate subset
  thereof).
[...]

One cannot use IRIs if the protocol element is to be interpreted as a URI reference, since the two are not compatible. This would also be rather confusing if the protocol element is, e.g.,

  background-image: url(foo.svg#bar)

Using url() to refer to an *IRI* *Reference* rather than a URL, as the protocol element suggests, is quite difficult to teach to non-experts. This can of course be solved:

  http://www.w3.org/mid/406dbf02.1731301630@smtp.bjoern.hoehrmann.de

A better text might be

  ... elements for resource names or resource locators ...

[...]
  C059  [S]  Specifications MUST define when the conversion from IRI
  references to URI references (or subsets thereof) takes place, in
  accordance with Internationalized Resource Identifiers (IRIs) [I-D
  IRI]. 
[...]

This is ambiguous. Are specifications required to define that such a conversion takes place? Even if they do not use IRIs? If they use IRIs, is only the "when" a concern or are specifications also required to define *how* it must take place? And if so, are they required to define that in accordance with IRIs, too?

For me as an editor it is most difficult to define this in accordance with IRIs: the "when" is difficult to gather from the draft and the "how" is overly complex. There are MAYs and SHOULDs and several variants, and there appear to be some inconsistencies as to whether NFC or NFKC should be used. Maybe this text suggests that specifications state,

  Implementations MUST convert in accordance with IRIs

In this case I would have the same problems as an implementer, which is probably even worse.
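Why the NFC versus NFKC question matters can be sketched in a few lines of Python. The sample strings are hypothetical and chosen only for illustration; `unicodedata.normalize` and `urllib.parse.quote` are standard library functions, and nothing here is taken from the IRI draft itself.

```python
import unicodedata
from urllib.parse import quote

decomposed = "Bjo\u0308rn"   # 'o' followed by a combining diaeresis
ligature = "\ufb01le"        # U+FB01 LATIN SMALL LIGATURE FI + "le"

# Both forms compose the combining sequence into U+00F6.
nfc_name = unicodedata.normalize("NFC", decomposed)
nfkc_name = unicodedata.normalize("NFKC", decomposed)

# Only NFKC also folds the compatibility ligature into plain "fi".
nfc_lig = unicodedata.normalize("NFC", ligature)
nfkc_lig = unicodedata.normalize("NFKC", ligature)

# The resulting URI references therefore differ depending on the form used.
uri_nfc = quote(nfc_lig)
uri_nfkc = quote(nfkc_lig)

print(uri_nfc, uri_nfkc)
```

An implementation that picks NFC and one that picks NFKC would produce different URI references from the same input, which is why an unambiguous statement in the specification matters.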

[...]
  NOTE: Many current specifications already contain provisions in
  accordance with Internationalized Resource Identifiers (IRIs) [I-D
  IRI].
[...]

This is not my reading of the IRI draft; IRIs appear to require NFC normalization for non-Unicode encodings, while most of the cited specifications do not.

I would also like to point out that this does not work reliably in existing implementations if IDNs are involved, see e.g. <http://tidy.sf.net/bug/924809>. Hence it seems more reasonable to require Punycode encoding for the host/regname part, which the IRI draft does not really do. The cited specifications at least don't, which results in the breakage mentioned in the cited bug report.
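The difference between the two treatments of the host part can be sketched as follows. The host name is a hypothetical example, and Python's built-in `idna` codec (which implements the IDNA 2003 ToASCII operation) is used as a stand-in for "Punycode encoding of the host"; this is an illustration, not what any of the cited specifications prescribe.

```python
from urllib.parse import quote

host = "b\u00fccher.example"   # hypothetical internationalized host name

# UTF-8 plus %HH escaping, as applied to the other URI components.
percent_encoded = quote(host)

# IDNA ToASCII: each non-ASCII label becomes an "xn--" Punycode label.
punycode = host.encode("idna").decode("ascii")

print(percent_encoded)
print(punycode)
```

A resolver that only understands ASCII host names can look up the second form, while the first form is generally not resolvable as it stands.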

[...]
  C060  [S]  Specifications that define new syntax for URIs, such as a
  new URI scheme or a new kind of fragment identifier, MUST specify that
  characters outside the US-ASCII repertoire are encoded using UTF-8 and
  %HH-escaping.
[...]

As I wrote above, this is inappropriate. All components need to be encoded using UTF-8. If you encode some part of a component using an ASCII-incompatible encoding and another part using UTF-8, you are likely to run into problems.
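A small Python sketch of such a mixed component, assuming UTF-16BE as the ASCII-incompatible encoding (a choice made here purely for illustration):

```python
from urllib.parse import quote, unquote_to_bytes

# One part escaped from UTF-16BE octets, the other from UTF-8 octets.
part_utf16 = quote("Bj\u00f6rn".encode("utf-16-be"))
part_utf8 = quote("Bj\u00f6rn".encode("utf-8"))
component = part_utf16 + part_utf8

# A consumer that assumes UTF-8 throughout cannot decode the result.
raw = unquote_to_bytes(component)
try:
    raw.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(component, decodes_as_utf8)
```

The consumer has no way to tell where one encoding ends and the other begins, so the only robust rule is a single encoding for the whole component.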

Specifications must also define how to unescape/decode this syntax back. It is useless to say that 'Björn' must be escaped as 'Bj%C3%B6rn' if the specification does not also say that 'Bj%C3%B6rn' must be turned back into 'Björn'. My understanding of XPointer in this regard is that implementations are required to perform this unescaping, yet I've been unable to find an implementation that actually does this in a reliable manner. It seems, for example, typically impossible to refer to an id = 'Björn' using '#Bj%C3%B6rn'.
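The round trip described above can be sketched with Python's `urllib.parse`; the Latin-1 decoding step is a hypothetical consumer added only to show what happens when the decoding rule is left unspecified.

```python
from urllib.parse import quote, unquote

name = "Bj\u00f6rn"

# Escaping: UTF-8 octets, then %HH for every octet outside US-ASCII.
escaped = quote(name, safe="")

# Unescaping with the matching rule restores the original name.
restored = unquote(escaped, encoding="utf-8")

# A consumer that guesses a different encoding gets a different string,
# so a lookup for the id "Björn" would fail.
misread = unquote(escaped, encoding="latin-1")

print(escaped, restored, misread)
```

Only when both directions are specified does '#Bj%C3%B6rn' reliably identify the element whose id is 'Björn'.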

[...]
  C061  [S]  Such specifications SHOULD also define the normalization
  requirements for the syntax they introduce. 
[...]

I do not understand what you mean by "normalization requirements" in this regard; please clarify and illustrate this in the specification.


Structured version of  the comment:

<lc-comment
  visibility="public" status="pending"
  decision="pending" impact="substantive" id="LC-">
  <originator email="bjoern@hoehrmann.de"
      >Björn Höhrmann</originator>
  <represents email=""
      >-</represents>
  <charmod-section href='http://www.w3.org/TR/2004/WD-charmod-20040225/#sec-URIs'
    >7</charmod-section>
  <title>Comments on section 7</title>
  <description>
    <comment>
      <dated-link date="2004-04-11"
         href="http://www.w3.org/mid/815106943.20040411203133@toro.w3.mag.keio.ac.jp"
        >Comments on section 7</dated-link>
      <para>Section 7, Character Encoding in URI References

[...]
  According to the definition in RFC 2396 [RFC 2396], URI references are
  restricted to a subset of US-ASCII, with an escaping mechanism to
  encode arbitrary byte values, using the %HH convention. However, the
  %HH convention by itself is of limited use because there is no
  definitive mapping from characters to bytes.
[...]

This statement is rather confusing. There is no definitive mapping from characters to bytes for the entire URI reference; there is no difference between using &#x27;a&#x27; or %61 in a URI reference. If I want to refer to a fragment &#x22;a&#x22; in some document, the URI spec does not say whether I would use #a or #b or #5, nor does it say that a URI reference #a matches a fragment &#x27;a&#x27;; #a could well refer to a fragment &#x22;/&#x22; if the URI producer chooses EBCDIC as the encoding. The text quoted above suggests that there is a difference between characters and %HH escapes, which does not reflect the theory on this matter. Please make the text less confusing.

[...]
  C058  [S]  Specifications that define protocol or format elements
  (e.g. HTTP headers, XML attributes, etc.) which are to be interpreted
  as URI references (or specific subsets of URI references, such as
  absolute URI references, URIs, etc.) SHOULD use Internationalized
  Resource Identifiers (IRIs) [I-D IRI] (or an appropriate subset
  thereof).
[...]

One cannot use IRIs if the protocol element is to be interpreted as a URI reference, since the two are not compatible. This would also be rather confusing if the protocol element is, e.g.,

  background-image: url(foo.svg#bar)

Using url() to refer to an *IRI* *Reference* rather than a URL, as the protocol element suggests, is quite difficult to teach to non-experts. This can of course be solved:

  http://www.w3.org/mid/406dbf02.1731301630@smtp.bjoern.hoehrmann.de

A better text might be

  ... elements for resource names or resource locators ...

[...]
  C059  [S]  Specifications MUST define when the conversion from IRI
  references to URI references (or subsets thereof) takes place, in
  accordance with Internationalized Resource Identifiers (IRIs) [I-D
  IRI]. 
[...]

This is ambiguous. Are specifications required to define that such a conversion takes place? Even if they do not use IRIs? If they use IRIs, is only the &#x22;when&#x22; a concern or are specifications also required to define *how* it must take place? And if so, are they required to define that in accordance with IRIs, too?

For me as an editor it is most difficult to define this in accordance with IRIs: the &#x22;when&#x22; is difficult to gather from the draft and the &#x22;how&#x22; is overly complex. There are MAYs and SHOULDs and several variants, and there appear to be some inconsistencies as to whether NFC or NFKC should be used. Maybe this text suggests that specifications state,

  Implementations MUST convert in accordance with IRIs

In this case I would have the same problems as an implementer, which is probably even worse.

[...]
  NOTE: Many current specifications already contain provisions in
  accordance with Internationalized Resource Identifiers (IRIs) [I-D
  IRI].
[...]

This is not my reading of the IRI draft; IRIs appear to require NFC normalization for non-Unicode encodings, while most of the cited specifications do not.

I would also like to point out that this does not work reliably in existing implementations if IDNs are involved, see e.g. &#x3C;http://tidy.sf.net/bug/924809&#x3E;. Hence it seems more reasonable to require Punycode encoding for the host/regname part, which the IRI draft does not really do. The cited specifications at least don&#x27;t, which results in the breakage mentioned in the cited bug report.

[...]
  C060  [S]  Specifications that define new syntax for URIs, such as a
  new URI scheme or a new kind of fragment identifier, MUST specify that
  characters outside the US-ASCII repertoire are encoded using UTF-8 and
  %HH-escaping.
[...]

As I wrote above, this is inappropriate. All components need to be encoded using UTF-8. If you encode some part of a component using an ASCII-incompatible encoding and another part using UTF-8, you are likely to run into problems.

Specifications must also define how to unescape/decode this syntax back. It is useless to say that &#x27;Björn&#x27; must be escaped as &#x27;Bj%C3%B6rn&#x27; if the specification does not also say that &#x27;Bj%C3%B6rn&#x27; must be turned back into &#x27;Björn&#x27;. My understanding of XPointer in this regard is that implementations are required to perform this unescaping, yet I&#x27;ve been unable to find an implementation that actually does this in a reliable manner. It seems, for example, typically impossible to refer to an id = &#x27;Björn&#x27; using &#x27;#Bj%C3%B6rn&#x27;.

[...]
  C061  [S]  Such specifications SHOULD also define the normalization
  requirements for the syntax they introduce. 
[...]

I do not understand what you mean by &#x22;normalization requirements&#x22; in this regard; please clarify and illustrate this in the specification.</para>
    </comment>
  </description>
</lc-comment>
Received on Sunday, 11 April 2004 16:31:36 UTC