<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>13676</bug_id>
          
          <creation_ts>2011-08-05 01:44:05 +0000</creation_ts>
          <short_desc>Clarify what the code-point length of a string with an isolated surrogate is.</short_desc>
          <delta_ts>2011-10-06 23:50:17 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>HTML WG</product>
          <component>HTML5 spec</component>
          <version>unspecified</version>
          <rep_platform>Other</rep_platform>
          <op_sys>other</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc>http://www.whatwg.org/specs/web-apps/current-work/#common-parser-idioms</bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P3</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter>contributor</reporter>
          <assigned_to name="Ian &apos;Hixie&apos; Hickson">ian</assigned_to>
          <cc>annevk</cc>
    
    <cc>ayg</cc>
    
    <cc>cam</cc>
    
    <cc>hsivonen</cc>
    
    <cc>ian</cc>
    
    <cc>kennyluck</cc>
    
    <cc>mike</cc>
    
    <cc>public-html-admin</cc>
    
    <cc>public-html-wg-issue-tracking</cc>
          
          <qa_contact name="HTML WG Bugzilla archive list">public-html-bugzilla</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>54208</commentid>
    <comment_count>0</comment_count>
    <who name="">contributor</who>
    <bug_when>2011-08-05 01:44:05 +0000</bug_when>
    <thetext>Specification: http://www.whatwg.org/specs/web-apps/current-work/multipage/common-microsyntaxes.html
Multipage: http://www.whatwg.org/C#common-parser-idioms
Complete: http://www.whatwg.org/c#common-parser-idioms

Comment:
Clarify what the code-point length of a string with an isolated surrogate is.

Posted from: 114.43.115.245 by kennyluck@csail.mit.edu
User agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; ja-jp) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>54209</commentid>
    <comment_count>1</comment_count>
    <who name="KangHao Lu">kennyluck</who>
    <bug_when>2011-08-05 02:09:01 +0000</bug_when>
    <thetext>Assuming the intention of the current text is to count

&quot;\ud840\udc87&quot; // as code-point length = 1
&quot;\ud840+\udc87&quot; // as code-point length = 3

(which isn&apos;t very clear as far as I can tell), I would suggest that the spec include a sentence like &quot;Unpaired surrogates count as one code point each.&quot; (wording from [1])
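
For illustration only, the two counts above can be checked in Java, whose String.codePointCount is where the suggested wording in [1] comes from. This is a minimal sketch and not spec text; the class and method names (CodePointLengthDemo, codeUnitLength, codePointLength) are made up for the example.

```java
public class CodePointLengthDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnitLength(String s) {
        return s.length();
    }

    // Number of Unicode code points; Java counts each unpaired
    // surrogate as one code point.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String paired = "\ud840\udc87";   // one surrogate pair
        String split = "\ud840+\udc87";   // two isolated surrogates around '+'

        System.out.println(codePointLength(paired)); // 1
        System.out.println(codePointLength(split));  // 3
        System.out.println(codeUnitLength(paired));  // 2
        System.out.println(codeUnitLength(split));   // 3
    }
}
```

The two measures only disagree on the paired string (1 code point, 2 code units); on the string with isolated surrogates both give 3.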

Alternatively, it might be clearer to replace the sentence

# The code-point length of a string is the number of Unicode code points in that string.

by

| The code-point length of a string is the number of Unicode characters after the string is converted to a sequence of Unicode characters[2].

This will then work for both a string of Unicode characters (theory) and a DOMString (reality), before the internal representation of the value of an input element[3] is made clear.
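
A rough sketch of what counting after conversion could look like, in Java. It assumes the conversion in [2] pairs up surrogates and replaces each unpaired surrogate with U+FFFD; that is my reading of it and should be checked against the Web IDL text. The names ObtainUnicodeSketch and toUnicodeCharacters are hypothetical.

```java
public class ObtainUnicodeSketch {
    // Assumption: unpaired surrogates become U+FFFD during conversion.
    static final int REPLACEMENT = 0xFFFD;

    // Converts a sequence of 16-bit code units to Unicode characters.
    static int[] toUnicodeCharacters(String s) {
        int[] out = new int[s.length()];
        int count = 0;
        int i = 0;
        while (i != s.length()) {
            char c = s.charAt(i);
            // A pair starts here only if a high surrogate is directly
            // followed by a low surrogate.
            boolean pairStart = false;
            if (Character.isHighSurrogate(c)) {
                if (i + 1 != s.length()) {
                    pairStart = Character.isLowSurrogate(s.charAt(i + 1));
                }
            }
            if (pairStart) {
                out[count] = Character.toCodePoint(c, s.charAt(i + 1));
                i += 2;
            } else if (Character.isSurrogate(c)) {
                out[count] = REPLACEMENT; // isolated surrogate
                i += 1;
            } else {
                out[count] = c;
                i += 1;
            }
            count++;
        }
        return java.util.Arrays.copyOf(out, count);
    }

    // Code-point length as the number of characters after conversion.
    static int codePointLength(String s) {
        return toUnicodeCharacters(s).length;
    }
}
```

Under that assumption each isolated surrogate still contributes exactly one character, so the resulting length matches the "one code point each" wording.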

Having said that, I am not convinced that defining @maxlength this way is best. I tried to analyze other possibilities[4] but wasn&apos;t confident enough to file a bug (my preference is to count 16-bit code units).

[1] http://download.oracle.com/javase/1,5.0/docs/api/java/lang/String.html (definition of String.codePointCount)
[2] http://dev.w3.org/2006/webapi/WebIDL/#dfn-obtain-unicode
[3] http://www.whatwg.org/specs/web-apps/current-work/multipage/association-of-controls-and-forms.html#concept-fe-value
[4] http://lists.w3.org/Archives/Public/www-international/2011AprJun/0105</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>54298</commentid>
    <comment_count>2</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2011-08-06 06:12:57 +0000</bug_when>
    <thetext>Yeah this is fallout from the decision to stop having the spec be in terms of Unicode but instead have the spec in terms of UTF-16 code points. Sigh. I&apos;ll need to fix this definition somehow.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>55218</commentid>
    <comment_count>3</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2011-08-16 09:17:58 +0000</bug_when>
    <thetext>Web IDL will probably introduce the term &quot;code units&quot; for 16-bit code units. I think all our specifications should use that, and we should just accept that they are not exactly ideal.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>55238</commentid>
    <comment_count>4</comment_count>
    <who name="Aryeh Gregor">ayg</who>
    <bug_when>2011-08-16 15:06:31 +0000</bug_when>
    <thetext>I use the term &quot;element&quot;, following ES:

http://es5.github.com/#x8.4

But that&apos;s admittedly quite confusing, even though I xref it, so someplace defining &quot;code unit&quot; would be nice.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>55462</commentid>
    <comment_count>5</comment_count>
    <who name="Cameron McCormack">cam</who>
    <bug_when>2011-08-19 03:33:53 +0000</bug_when>
    <thetext>I just added http://dev.w3.org/2006/webapi/WebIDL/#dfn-code-unit, if that is helpful.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>55507</commentid>
    <comment_count>6</comment_count>
    <who name="Aryeh Gregor">ayg</who>
    <bug_when>2011-08-19 20:00:39 +0000</bug_when>
    <thetext>Thanks, I switched over to using that:

http://aryeh.name/gitweb.cgi?p=editing;a=commitdiff;h=73d8d864</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>57851</commentid>
    <comment_count>7</comment_count>
    <who name="">contributor</who>
    <bug_when>2011-10-05 17:31:47 +0000</bug_when>
    <thetext>Checked in as WHATWG revision r6633.
Check-in comment: Redefine code-point length in terms of UTF-16 16bit code units.
http://html5.org/tools/web-apps-tracker?from=6632&amp;to=6633</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>57852</commentid>
    <comment_count>8</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2011-10-05 17:35:21 +0000</bug_when>
    <thetext>This affects all manner of algorithms. I&apos;m not even sure what I want for some of them. I&apos;m pretty sure I don&apos;t want everything to work with code units rather than characters...

I still have to add an xref for code unit; I also have to figure out what to do for a lot of these cases. I guess I need to define that isolated surrogates in a Unicode string are considered as being characters too, or something (with their code point being the value of the surrogate).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>57921</commentid>
    <comment_count>9</comment_count>
    <who name="">contributor</who>
    <bug_when>2011-10-06 23:27:58 +0000</bug_when>
    <thetext>Checked in as WHATWG revision r6648.
Check-in comment: Try to tidy up some more of the Unicode/code unit mess with a probably over-reaching definition (there&apos;s over 2000 uses of the word &apos;character&apos; in the text, so I didn&apos;t check that all of them use this new definition... hopefully it works out; otherwise, we&apos;ll just have to try something else again).
http://html5.org/tools/web-apps-tracker?from=6647&amp;to=6648</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>57922</commentid>
    <comment_count>10</comment_count>
    <who name="">contributor</who>
    <bug_when>2011-10-06 23:33:32 +0000</bug_when>
    <thetext>Checked in as WHATWG revision r6649.
Check-in comment: Define &apos;code unit&apos;.
http://html5.org/tools/web-apps-tracker?from=6648&amp;to=6649</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>57923</commentid>
    <comment_count>11</comment_count>
    <who name="">contributor</who>
    <bug_when>2011-10-06 23:38:08 +0000</bug_when>
    <thetext>Checked in as WHATWG revision r6650.
Check-in comment: Define &apos;Unicode code point&apos;.
http://html5.org/tools/web-apps-tracker?from=6649&amp;to=6650</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>57924</commentid>
    <comment_count>12</comment_count>
    <who name="">contributor</who>
    <bug_when>2011-10-06 23:44:41 +0000</bug_when>
    <thetext>Checked in as WHATWG revision r6651.
Check-in comment: Reorder the definitions and fix them so that they aren&apos;t cyclic.
http://html5.org/tools/web-apps-tracker?from=6650&amp;to=6651</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>57926</commentid>
    <comment_count>13</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2011-10-06 23:50:17 +0000</bug_when>
    <thetext>EDITOR&apos;S RESPONSE: This is an Editor&apos;s Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Accepted
Change Description: see diff given below
Rationale: Ok I think I&apos;m done here.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>