<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>9663</bug_id>
          
          <creation_ts>2010-05-06 06:13:36 +0000</creation_ts>
          <short_desc>Should sequences of bytes be replaced by a single U+FFFD, or one U+FFFD per input byte?</short_desc>
          <delta_ts>2010-10-04 14:29:09 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>HTML WG</product>
          <component>pre-LC1 HTML5 spec (editor: Ian Hickson)</component>
          <version>unspecified</version>
          <rep_platform>Other</rep_platform>
          <op_sys>other</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc>http://www.whatwg.org/specs/web-apps/current-work/#parsing-0</bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords>NE</keywords>
          <priority>P3</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>LC</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter>contributor</reporter>
          <assigned_to name="Ian &apos;Hixie&apos; Hickson">ian</assigned_to>
          <cc>annevk</cc>
          <cc>ian</cc>
          <cc>mike</cc>
          <cc>public-html-admin</cc>
          <cc>public-html-wg-issue-tracking</cc>
          <cc>w3c</cc>
          
          <qa_contact name="HTML WG Bugzilla archive list">public-html-bugzilla</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>35453</commentid>
    <comment_count>0</comment_count>
    <who name="">contributor</who>
    <bug_when>2010-05-06 06:13:36 +0000</bug_when>
    <thetext>Section: http://www.whatwg.org/specs/web-apps/current-work/#parsing-0

Comment:
Should sequences of bytes be replaced by a single U+FFFD, or one U+FFFD per
input byte?

Posted from: 213.236.208.46</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>37241</commentid>
    <comment_count>1</comment_count>
    <who name="Adam Barth">w3c</who>
    <bug_when>2010-08-04 23:26:43 +0000</bug_when>
    <thetext>The WebKit HTML5 parser replaces each byte individually.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>37284</commentid>
    <comment_count>2</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2010-08-08 07:09:51 +0000</bug_when>
    <thetext>This depends on the octet-to-character conversion layer. More specifically, the error handling can presumably differ depending on the encoding. I think it should be fixed at the encoding layer.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>37286</commentid>
    <comment_count>3</comment_count>
    <who name="Adam Barth">w3c</who>
    <bug_when>2010-08-08 07:12:19 +0000</bug_when>
    <thetext>I misunderstood which part of the spec you were referring to. You seem to be linking to WebSRT, but I thought you were linking to the &quot;preprocessing the input stream&quot; section for HTML parsing. :)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>38859</commentid>
    <comment_count>4</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2010-09-10 22:44:39 +0000</bug_when>
    <thetext>EDITOR&apos;S RESPONSE: This is an Editor&apos;s Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Did Not Understand Request
Change Description: no spec change
Rationale: This phrase appears all over the place. What should it be replaced with?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>38981</commentid>
    <comment_count>5</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2010-09-13 18:42:23 +0000</bug_when>
    <thetext>First try, probably isn&apos;t quite right:

Numbers are bytes in hex. &quot;Anything but ...&quot; includes EOF.

Stray 80-BF:
FE-FF:
replace with one U+FFFD.

C0-C1 followed by 80-BF:
replace the 2-byte sequence with one U+FFFD.

C0-FD followed by anything but 80-BF:
replace the first byte with one U+FFFD and reprocess the second byte.

E0-FD followed by 80-BF followed by anything but 80-BF:
replace the first two bytes with one U+FFFD and reprocess the third byte.

F0-FD followed by two 80-BF followed by anything but 80-BF:
replace the first three bytes with one U+FFFD and reprocess the fourth byte.

F0-F4 followed by three 80-BF that represent a code point above U+10FFFF:
replace all four bytes with one U+FFFD.

F5-FD followed by three 80-BF followed by anything but 80-BF:
replace the first four bytes with one U+FFFD and reprocess the fifth byte.

FC-FD followed by four 80-BF followed by anything but 80-BF:
replace the first five bytes with one U+FFFD and reprocess the sixth byte.

Overlong forms (e.g. F0 80 80 A0):
replace the whole byte sequence with one U+FFFD.</thetext>
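<!--
Editorial sketch, not part of the comment above and not the text checked in
as r5530. It is one possible reading of the proposed rules: a lead byte
claims up to N continuation bytes (80-BF) under the historical 1 to 6 byte
scheme, and the consumed run becomes a single U+FFFD when it is truncated,
overlong, above U+10FFFF, or longer than four bytes. The names
decode_sketch and expected_continuations are hypothetical. Two points are
assumptions that the proposal does not state: complete 5 and 6 byte
sequences map to one U+FFFD, and encoded surrogates map to one U+FFFD.

```python
REPLACEMENT = "\ufffd"

# Smallest code point that legitimately needs 1, 2 or 3 continuation bytes;
# anything below this is an overlong form.
MIN_CODE_POINT = {1: 0x80, 2: 0x800, 3: 0x10000}


def expected_continuations(lead):
    """Continuation bytes implied by a lead byte (historical 1 to 6 byte scheme)."""
    if 0xC0 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF7:
        return 3
    if 0xF8 <= lead <= 0xFB:
        return 4
    if 0xFC <= lead <= 0xFD:
        return 5
    return None  # stray 80-BF, FE or FF


def decode_sketch(data):
    """Decode bytes, emitting one U+FFFD per malformed run (assumed reading)."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:  # ASCII passes through unchanged
            out.append(chr(lead))
            i += 1
            continue
        need = expected_continuations(lead)
        if need is None:  # stray continuation byte, FE or FF
            out.append(REPLACEMENT)
            i += 1
            continue
        # Consume at most `need` continuation bytes; the first byte that is
        # not 80-BF (or EOF) ends the run and is reprocessed on the next pass.
        j = i + 1
        while j < len(data) and j - i <= need and 0x80 <= data[j] <= 0xBF:
            j += 1
        cp = None
        if j - i - 1 == need and need <= 3:
            cp = lead & (0x7F >> (need + 1))  # payload bits of the lead byte
            for b in data[i + 1:j]:
                cp = (cp << 6) | (b & 0x3F)
            overlong = cp < MIN_CODE_POINT[need]
            surrogate = 0xD800 <= cp <= 0xDFFF  # assumption, see note above
            if overlong or surrogate or cp > 0x10FFFF:
                cp = None
        out.append(chr(cp) if cp is not None else REPLACEMENT)
        i = j
    return "".join(out)
```

Under this reading, decode_sketch(b"\xe2\x82A") yields one U+FFFD followed
by "A", where a per-byte replacement strategy would yield two U+FFFD
characters before the "A".
-->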
  </long_desc><long_desc isprivate="0" >
    <commentid>39782</commentid>
    <comment_count>6</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2010-09-28 19:16:05 +0000</bug_when>
    <thetext>Status: Partially Accepted
Change Description: see diff given below
Rationale: Ok, I tried. Let me know what I screwed up! :-)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>39783</commentid>
    <comment_count>7</comment_count>
    <who name="">contributor</who>
    <bug_when>2010-09-28 19:16:37 +0000</bug_when>
    <thetext>Checked in as WHATWG revision r5530.
Check-in comment: Tighten up UTF-8 error handling definitions
http://html5.org/tools/web-apps-tracker?from=5529&amp;to=5530</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>