This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
The BOM handling algorithm is specified as removing leading bytes from the stream before invoking the encoding's decoder algorithm (currently step 4). This can be specified in an alternate fashion, after running the encoding's decoder algorithm: if the BOM seen flag is unset and the output stream is non-empty, set the BOM seen flag and, if encoding is one of utf-8, utf-16le, utf-16be and the first code point of the output stream is U+FEFF, remove it from the stream.
For symmetry with section 6 it may be better to leave it as-is, but I thought I'd point it out.
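The alternate formulation above can be sketched as a filter over decoder output. This is only an illustration of the proposal, not spec text: the class name `BomStripper` and its `filter` method are hypothetical, and the output stream is modelled as successive string chunks.

```python
import codecs

BOM = "\ufeff"
BOM_ENCODINGS = {"utf-8", "utf-16le", "utf-16be"}


class BomStripper:
    """Hypothetical helper: drops a leading U+FEFF from decoder output."""

    def __init__(self, encoding):
        self.encoding = encoding
        self.bom_seen = False  # mirrors the "BOM seen flag" in the proposal

    def filter(self, code_points):
        # Nothing to do once the flag is set, or while no output exists yet.
        if self.bom_seen or not code_points:
            return code_points
        self.bom_seen = True
        # Only the very first code point of the output is ever removed.
        if self.encoding in BOM_ENCODINGS and code_points[0] == BOM:
            return code_points[1:]
        return code_points


# Usage: the BOM bytes arrive split across two chunks.
decoder = codecs.getincrementaldecoder("utf-8")()
stripper = BomStripper("utf-8")
text = stripper.filter(decoder.decode(b"\xef\xbb"))
text += stripper.filter(decoder.decode(b"\xbfhi\xef\xbb\xbf"))
```

Because the check runs on code points, it does not matter how the BOM bytes are split across input chunks; a later U+FEFF is left in place.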
That'd require we clearly define the output stream.
Fixed as part of bug 16688. https://github.com/whatwg/encoding/commit/dc8e4c10c9b4a91f188f3145c2e31ddec4d52a78 This is a massive change, review appreciated!
In 6 Decode and encode, step 3: if buffer doesn't match any lines in the byte-order table, buffer should be prepended to stream. Prior to that revision, this was handled by updating an offset only if the BOM matched.
We no longer have offsets or code point pointers. You can only read from a stream and put stuff back to it. There's no peek or skip... At least in the specification abstractions, you can implement them however you want of course.
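A minimal sketch of such a stream, assuming only the two operations mentioned (read and put back). The names `Stream`, `END_OF_STREAM`, and `peek` are illustrative choices, not the specification's definitions.

```python
from collections import deque

END_OF_STREAM = object()  # hypothetical sentinel for "no more items"


class Stream:
    """A queue of items supporting only read and prepend."""

    def __init__(self, items=()):
        self._items = deque(items)

    def read(self):
        """Remove and return the first item, or END_OF_STREAM if empty."""
        return self._items.popleft() if self._items else END_OF_STREAM

    def prepend(self, items):
        """Put items back at the front, preserving their order."""
        self._items.extendleft(reversed(list(items)))


def peek(stream):
    """An implementation-side peek built from read followed by prepend."""
    item = stream.read()
    if item is not END_OF_STREAM:
        stream.prepend([item])
    return item
```

An implementation is free to offer `peek` or an offset directly; the sketch only shows that the two primitives are sufficient.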
(In reply to Anne from comment #5)
> We no longer have offsets or code point pointers. You can only read from a
> stream and put stuff back to it. There's no peek or skip... At least in the
> specification abstractions, you can implement them however you want of
> course.

Right.... So step 2 reads 3 bytes from the stream. Step 3 checks if they match the BOM. If they don't match the BOM, as written those 3 bytes are dropped on the floor and unavailable in step 4. So step 3 needs to also say the equivalent of "otherwise, prepend buffer to stream".
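The requested "otherwise, prepend buffer to stream" behaviour can be sketched as follows. This is an illustration under assumptions, not the spec's algorithm: the stream is modelled as a `collections.deque` of byte values, and `sniff_bom` and its `fallback` parameter are hypothetical names.

```python
from collections import deque

# Byte-order marks and the encodings they select.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
]


def sniff_bom(stream, fallback):
    """Consume a leading BOM from stream, if any, and return the encoding."""
    # Read up to three bytes; they are removed from the stream at this point.
    buffer = bytes(stream.popleft() for _ in range(min(3, len(stream))))
    for bom, encoding in BOMS:
        if buffer.startswith(bom):
            # Put back anything read beyond the BOM itself.
            stream.extendleft(reversed(buffer[len(bom):]))
            return encoding
    # No match: without this prepend the bytes in buffer would be lost.
    stream.extendleft(reversed(buffer))
    return fallback


plain = deque(b"abcd")
chosen = sniff_bom(plain, "windows-1252")
```

After the call, `plain` still holds all four bytes, so the decoder that runs next sees the full input.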
My bad, fixed.
(In reply to Anne from comment #7)
> My bad, fixed.

That fix lgtm.

Another edge case: if stream is 0xFF 0xFE (that is, exactly two bytes), then step 3 reads two bytes then end-of-stream, so buffer ends up being [ 0xFF, 0xFE ]. Step 4 matches utf-16le and BOM seen is set. Step 6 will then prepend the last byte of buffer to stream, so stream ends up as 0xFE. Oops. I suppose we say that buffer ends up being [ 0xFF, 0xFE, end-of-stream ], so the end-of-stream token gets prepended to stream?!?
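The edge case can be reproduced with a small sketch. The function names are hypothetical, the stream is modelled as a `collections.deque`, and the guarded variant is just one possible way to avoid the problem; it is not claimed to be what the linked commit does.

```python
from collections import deque

UTF16LE_BOM = b"\xff\xfe"


def read_three(stream):
    """Read up to three bytes, stopping early at end-of-stream."""
    return [stream.popleft() for _ in range(min(3, len(stream)))]


def put_back_last(stream, buffer):
    """The questioned wording: always prepend the last byte of buffer."""
    stream.appendleft(buffer[-1])


def put_back_extra(stream, buffer, bom_length):
    """Guarded variant: only prepend bytes read beyond the BOM."""
    stream.extendleft(reversed(buffer[bom_length:]))


# Stream holding exactly the utf-16le BOM and nothing else.
naive = deque(UTF16LE_BOM)
put_back_last(naive, read_three(naive))

guarded = deque(UTF16LE_BOM)
put_back_extra(guarded, read_three(guarded), len(UTF16LE_BOM))
```

With the naive rule, `naive` ends up holding 0xFE, a byte that belongs to the BOM; with the guarded rule, `guarded` is empty, as expected for a stream that contained only a BOM.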
https://github.com/whatwg/encoding/commit/b891a81b34d17ae4da01eb4cad074d09bf843e09