<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>921</bug_id>
          
          <creation_ts>2004-10-17 22:58:15 +0000</creation_ts>
          <short_desc>Validator inserts text in the middle of a UTF-8 character</short_desc>
          <delta_ts>2005-08-18 03:16:35 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>Validator</product>
          <component>check</component>
          <version>0.6.7</version>
          <rep_platform>All</rep_platform>
          <op_sys>All</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc>http://validator.w3.org/check?uri=http://forum.druzya.org</bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Viktor Pilpenok">bdew</reporter>
          <assigned_to name="Terje Bless">link</assigned_to>
          <cc>warcraft2002</cc>
          
          <qa_contact name="qa-dev tracking">www-validator-cvs</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>2536</commentid>
    <comment_count>0</comment_count>
    <who name="Viktor Pilpenok">bdew</who>
    <bug_when>2004-10-17 22:58:16 +0000</bug_when>
    <thetext>I&apos;ve discovered a bug in the checker. I tried to check my site 
(http://forum.druzya.org) and the first error looked broken. This is roughly 
what appeared on my screen; as you can see, it&apos;s broken, and some HTML code 
produced by the validator leaks onto the screen:

...663b&quot; title=&quot;&amp;#1057;&amp;#1087;&amp;#1080;&amp;#1089;&amp;#1086;&amp;#1082; &amp;#1092;&amp;#1086;&amp;#1088;&amp;#1091;&amp;#1084;&amp;#1084;strong title=&quot;Position where error was detected.&quot;&gt;?
&amp;#1074; &amp;#1044;&amp;#1088;&amp;#1091;&amp;#1079;&amp;#1077;&amp;#1081;&quot; /&gt;

Here&apos;s a hex dump of the HTML that produced this:

      0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
000: 3C 6C 69 3E 3C 70 3E 3C 65 6D 3E 4C 69 6E 65 20  &lt;li&gt;&lt;p&gt;&lt;em&gt;Line
010: 37 2C 20 63 6F 6C 75 6D 6E 20 31 30 33 3C 2F 65  7, column 103&lt;/e
020: 6D 3E 3A 20 3C 73 70 61 6E 20 63 6C 61 73 73 3D  m&gt;: &lt;span class=
030: 22 6D 73 67 22 3E 63 68 61 72 61 63 74 65 72 20  &quot;msg&quot;&gt;character
040: 64 61 74 61 20 69 73 20 6E 6F 74 20 61 6C 6C 6F  data is not allo
050: 77 65 64 20 68 65 72 65 3C 2F 73 70 61 6E 3E 3C  wed here&lt;/span&gt;&lt;
060: 2F 70 3E 3C 70 3E 3C 63 6F 64 65 20 63 6C 61 73  /p&gt;&lt;p&gt;&lt;code clas
070: 73 3D 22 69 6E 70 75 74 22 3E 2E 2E 2E 63 36 64  s=&quot;input&quot;&gt;...c6d
080: 36 26 23 33 34 3B 20 74 69 74 6C 65 3D 26 23 33  6&amp;#34; title=&amp;#3
090: 34 3B D0 A1 D0 BF D0 B8 D1 81 D0 BE D0 BA 20 D1  4;&amp;#1056;&amp;#1038;&amp;#1056;&amp;#1111;&amp;#1056;&amp;#1105;&amp;#1057;_&amp;#1056;_&amp;#1056;&amp;#1108; &amp;#1057;
0A0: 84 D0 BE D1 80 D1 83 D0 BC D0 3C 73 74 72 6F 6E  &quot;&amp;#1056;_&amp;#1057;_&amp;#1057;_&amp;#1056;_&amp;#1056;&lt;stron

As you can see, it&apos;s UTF-8, and at 0x0A9 there is the beginning of a UTF-8 
character (lead byte D0) that got broken in two by the message markup. The 
first character of the message HTML (&quot;&lt;&quot;) got processed as the second byte 
of that character.
                                  
0B0: 67 20 74 69 74 6C 65 3D 22 50 6F 73 69 74 69 6F  g title=&quot;Positio
0C0: 6E 20 77 68 65 72 65 20 65 72 72 6F 72 20 77 61  n where error wa
0D0: 73 20 64 65 74 65 63 74 65 64 2E 22 3E BE 3C 2F  s detected.&quot;&gt;_&lt;/

The character that the checker complains about seems to be at position 0x0DD, 
and I don&apos;t see anything wrong with it, so that is probably a bug too.

0E0: 73 74 72 6F 6E 67 3E D0 B2 20 D0 94 D1 80 D1 83  strong&gt;&amp;#1056;_ &amp;#1056;&quot;&amp;#1057;_&amp;#1057;_
0F0: D0 B7 D0 B5 D0 B9 26 23 33 34 3B 20 2F 26 23 36  &amp;#1056;·&amp;#1056;&amp;#1095;&amp;#1056;&amp;#8470;&amp;#34; /&amp;#6
100: 32 3B 3C 2F 63 6F 64 65 3E 3C 2F 70 3E           2;&lt;/code&gt;&lt;/p&gt;
[END OF HEXDUMP]
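
A minimal sketch of the split described above, assuming the highlight markup is 
inserted at a byte offset instead of a character offset. This is Python, not 
the validator&apos;s own code; the helper names highlight_at_byte and 
highlight_at_char are hypothetical, and "[[" and "]]" stand in for the strong 
element that the validator emits.

```python
# Sketch only: highlight_at_byte and highlight_at_char are hypothetical helpers,
# not the validator's actual code. "[[" and "]]" stand in for the highlight markup.

TITLE = "Список форумов Друзей"  # the title text from the report


def highlight_at_byte(data, offset):
    """Wrap the single byte at `offset` in markers (the buggy approach)."""
    return data[:offset] + b"[[" + data[offset:offset + 1] + b"]]" + data[offset + 1:]


def highlight_at_char(text, index):
    """Wrap the whole character at `index` in markers (a safe approach)."""
    return text[:index] + "[[" + text[index] + "]]" + text[index + 1:]


utf8 = TITLE.encode("utf-8")

# Character 12 is the second "о" of "форумов"; in UTF-8 it is the two bytes D0 BE.
char_index = 12
byte_offset = len(TITLE[:char_index].encode("utf-8"))

# Pointing at the trail byte BE leaves the lead byte D0 stranded before the marker.
broken = highlight_at_byte(utf8, byte_offset + 1)
fixed = highlight_at_char(TITLE, char_index)

print(broken.decode("utf-8", errors="replace"))
print(fixed)
```

Decoding the first result gives replacement characters on both sides of the 
opening marker, which matches the lone D0 at 0x0A9 and the lone BE at 0x0DD in 
the dump above. Working on decoded characters keeps the two bytes together.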

This is how it looked in the original file (it was encoded in CP-1251; the 
recoding to UTF-8 was done by the checker):

      0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
000: 3C 6C 69 6E 6B 20 72 65 6C 3D 22 74 6F 70 22 20   &lt;link rel=&quot;top&quot;
010: 68 72 65 66 3D 22 2E 2F 69 6E 64 65 78 2E 70 68   href=&quot;./index.ph
020: 70 3F 73 69 64 3D 30 34 34 38 62 32 66 62 61 63   p?sid=0448b2fbac
030: 38 38 66 31 65 39 62 31 66 35 65 65 39 37 36 63   88f1e9b1f5ee976c
040: 64 65 30 36 38 32 22 20 74 69 74 6C 65 3D 22 D1   de0682&quot; title=&quot;&amp;#9572;
050: EF E8 F1 EE EA 20 F4 EE F0 F3 EC EE E2 20 C4 F0   &amp;#1103;&amp;#1096;&amp;#1105;&amp;#1102;&amp;#1098; &amp;#1031;&amp;#1102;&amp;#1025;&amp;#1108;&amp;#1100;&amp;#1102;&amp;#1090; &amp;#9472;&amp;#1025;

It seems that it barfed at 0x05B; as I said, I see nothing wrong with this 
character whatsoever.

060: F3 E7 E5 E9 22 20 2F 3E                           &amp;#1108;&amp;#1095;&amp;#1093;&amp;#1097;&quot; /&gt;
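
A rough sketch of how the offsets move during recoding, assuming the checker 
does a plain CP-1251 to UTF-8 conversion. The bytes are copied from the dump 
above; the variable names are only illustrative.

```python
# Sketch only: the byte string is the title copied from the CP-1251 hex dump.
cp1251_bytes = bytes.fromhex(
    "D1 EF E8 F1 EE EA 20 F4 EE F0 F3 EC EE E2 20 C4 F0 F3 E7 E5 E9"
)

text = cp1251_bytes.decode("cp1251")
utf8_bytes = text.encode("utf-8")

print(text)
print(len(cp1251_bytes), "bytes in CP-1251,", len(utf8_bytes), "bytes in UTF-8")

# Map each character index to the byte offset where it starts in UTF-8.
# In CP-1251 the two numbers are equal; after recoding they drift apart.
starts = []
offset = 0
for ch in text:
    starts.append(offset)
    offset += len(ch.encode("utf-8"))

# Character 12 of the title is the one at 0x05B in the original file.
print(starts[12], utf8_bytes[starts[12]:starts[12] + 2].hex())
```

Index 12 is 0x05B minus 0x04F, where the title starts in the CP-1251 dump. Its 
UTF-8 start, byte 23, added to 0x092, where the title starts in the first dump, 
gives 0x0A9, the position of the stranded D0.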

That&apos;s all. I hope that my bug report helps (and that it won&apos;t be corrupted 
because of all those chars :)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>2537</commentid>
    <comment_count>1</comment_count>
    <who name="Viktor Pilpenok">bdew</who>
    <bug_when>2004-10-17 23:00:10 +0000</bug_when>
    <thetext>Darn, it did get corrupted. Anyway, you can still see what&apos;s going on from the 
hexcodes ...</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>2538</commentid>
    <comment_count>2</comment_count>
    <who name="Olivier Thereaux">ot</who>
    <bug_when>2004-10-18 01:09:07 +0000</bug_when>
    <thetext>Seems very similar to the truncate bug Martin was working on earlier.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>2996</commentid>
    <comment_count>3</comment_count>
    <who name="Olivier Thereaux">ot</who>
    <bug_when>2005-06-02 06:14:01 +0000</bug_when>
    <thetext>*** Bug 1488 has been marked as a duplicate of this bug. ***</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>3014</commentid>
    <comment_count>4</comment_count>
    <who name="Olivier Thereaux">ot</who>
    <bug_when>2005-06-02 22:48:12 +0000</bug_when>
    <thetext>*** Bug 1488 has been marked as a duplicate of this bug. ***</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>5431</commentid>
    <comment_count>5</comment_count>
    <who name="Bj">bjoern</who>
    <bug_when>2005-08-18 03:16:35 +0000</bug_when>
    <thetext>Hi Viktor, thanks for your report. I&apos;ve fixed this issue in HEAD,
http://qa-dev.w3.org/wmvs/HEAD/check?uri=http://forum.druzya.org
should be valid.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>