<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>26614</bug_id>
          
          <creation_ts>2014-08-20 15:36:32 +0000</creation_ts>
          <short_desc>change &quot;violation of Unicode&quot; notes to something less scary (per Unicode)</short_desc>
          <delta_ts>2014-09-01 17:15:22 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WHATWG</product>
          <component>Encoding</component>
          <version>unspecified</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Windows NT</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>Unsorted</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Addison Phillips">addison</reporter>
          <assigned_to name="Anne">annevk</assigned_to>
          <cc>ishida</cc>
          <cc>mike</cc>
          <cc>www-international</cc>
          <qa_contact>sideshowbarker+encodingspec</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>110356</commentid>
    <comment_count>0</comment_count>
    <who name="Addison Phillips">addison</who>
    <bug_when>2014-08-20 15:36:32 +0000</bug_when>
    <thetext>Ken Whistler advised us:
==
Re the note in Section 4.2, I do not understand at all why you word this
as “In violation of section 1.4 of UTS #22…” How is this a “violation” of
anything? The wording in UTS #22 is:

“… For best results, names should be compared after applying the
following transformations: …”

That is simply a recommendation for how to minimize non-recognition
of variations in spelling of charset names in labels. It doesn’t really have
anything to do with the actual conformance clause of UTS #22. So I
don’t see how anybody could actually be in “violation” of it.

The W3C “Encodings” document just makes a much more detailed
and prescriptive mapping of charset labels to the specified encodings
it enumerates. Why don’t you just say *that*, instead:

=============================================================

Note: This specification provides a more detailed and prescriptive
mapping of charset labels to encodings than the loose matching
for charset aliases recommended by UTS #22 … etc., etc.

=============================================================

See? No violation anywhere.

I have a similar reaction to your notes in 14.2 and 14.4. I also do not see
those as “violations” of the Unicode Standard (which, by the way, I would
spell with a capitalized “Standard”).

Start with 14.4 utf-16le. The Unicode Standard does not specify “labels” for
charsets, so I don’t see how you’d be in violation of the standard by
defining how you interpret charset labels. Essentially, you are saying:

=========================================================

Note: For [insert reason here] the label “utf-16” is treated as synonymous
with the label “utf-16le”, and also identifies the utf-16le encoding.

===========================================================

And for your note in 14.2, I think the statement is just wrong. This is
not a violation of the Unicode Standard. It is very much in the spirit
of the definition of the UTF-16 encoding scheme to treat the BOM
as a signature and use it to identify the actual byte order of a stream.
And if that is used to override an explicit (but erroneous) charset
labeling, so be it. See Asmus’ comment, which just crossed mine.

In any case, I would advise rewording all three of these notes in
your document. Rather than having a rhetorical stance that
says, “We violate the Unicode Standard, but that’s o.k., because
this item is uncontroversial, and …”, why would you need to state
any violations here at all? Just put in clarifying notes to forestall
people from *claiming* that these practices violate the Unicode
Standard (or UTS #22).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>110357</commentid>
    <comment_count>1</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-08-20 15:41:12 +0000</bug_when>
    <thetext>Interesting. Are these comments archived somewhere? As well as Asmus&apos; comment referred to above?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>110358</commentid>
    <comment_count>2</comment_count>
    <who name="i18n IG">www-international</who>
    <bug_when>2014-08-20 15:43:43 +0000</bug_when>
    <thetext>http://lists.w3.org/Archives/Public/public-i18n-core/2014JulSep/0060.html</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>110395</commentid>
    <comment_count>3</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-08-21 10:41:27 +0000</bug_when>
    <thetext>So for this change I should acknowledge Ken Whistler?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>110396</commentid>
    <comment_count>4</comment_count>
    <who name="i18n IG">www-international</who>
    <bug_when>2014-08-21 11:15:06 +0000</bug_when>
    <thetext>yes</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>110481</commentid>
    <comment_count>5</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-08-22 16:03:01 +0000</bug_when>
    <thetext>https://github.com/whatwg/encoding/commit/a0429a6b2b043d9b7e130554529d66636c73133f</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>110543</commentid>
    <comment_count>6</comment_count>
    <who name="Addison Phillips">addison</who>
    <bug_when>2014-08-25 16:15:41 +0000</bug_when>
    <thetext>For reference, here is Asmus&apos;s note, which he gave permission to include here:

--
The first note refers to a SHOULD specification in UTS#22. It would be overstating to call it a &quot;violation&quot; to deviate from it.

If I understand the issue correctly, it is that you need the BOM to be able to override conflicting external designations.
Hence, an encoding is only &quot;known&quot; to be correctly labeled when it doesn&apos;t contradict an internal BOM. Otherwise, you implicitly treat the declared encoding as erroneous.

Seems a fine approach by me, given the realities.

Those are my 2 cents.

A./
--</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>110771</commentid>
    <comment_count>7</comment_count>
    <who name="Richard Ishida">ishida</who>
    <bug_when>2014-08-29 15:23:14 +0000</bug_when>
    <thetext>Martin Duerst provided some additional feedback on wording:

--
&gt; [Section 14.2]
&gt; -- 
&gt; Checking for and using a byte order mark happens before an encoding to decode a byte stream is chosen, as seen in the decode algorithm, as is deemed more accurate than any label.
&gt; -- 

I really had problems parsing this sentence. One problem is that two clauses start with &quot;as&quot;. I suggest changing the connective for the second clause to &quot;because&quot;, and maybe moving that clause to the start of the sentence. Other improvements might work too.

Also, unless this is clear from e.g. a link that&apos;s missing in the text version in this email, it would be useful to be specific about whether the &quot;decode algorithm&quot; is something in the Unicode spec or in the encoding spec.

In addition, &quot;before an encoding ... is chosen&quot; looks problematic to me because 1) &quot;checking and using a BOM&quot; may itself choose an encoding, and 2) the &quot;to decode a byte stream&quot; makes the structure difficult to parse (my first (and second and third) parse was &quot;checking ... happens before an encoding&quot;).

I suggest something along the lines of:
&quot;A byte order mark has priority over an encoding label...&quot;
--

And Asmus Freytag also said: 

--
I had difficulties as well with that sentence, but couldn&apos;t put my finger on it; thanks for pointing out the reason. There&apos;s nothing objectionable to the content that it intends to express, but it would be improved if reworded along the lines you suggest.
--</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>110799</commentid>
    <comment_count>8</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-08-31 12:15:20 +0000</bug_when>
    <thetext>Reopening per comment 7.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>110823</commentid>
    <comment_count>9</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-09-01 17:15:22 +0000</bug_when>
    <thetext>https://github.com/whatwg/encoding/commit/31b5075d63115f2f07ee970224653611a7dbcea5</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>