23155 – Algorithm for BOM handling in TextDecoder.decode can be simplified

This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.

Summary: Algorithm for BOM handling in TextDecoder.decode can be simplified

Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	Encoding
Version:	unspecified
Hardware:	All All

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+encodingspec

URL:
Whiteboard:
Keywords:

Depends on:	16688
Blocks:

Reported:	2013-09-04 16:32 UTC by Joshua Bell
Modified:	2014-03-28 11:21 UTC
CC List:	1 user

Description Joshua Bell 2013-09-04 16:32:35 UTC

The BOM handling algorithm is specified as removing the leading BOM bytes from the stream before invoking the encoding's decoder algorithm (currently step 4).

This can be specified in an alternate fashion, as a step that runs after the encoding's decoder algorithm: if the BOM seen flag is unset and the output stream is non-empty, set the BOM seen flag and, if encoding is one of utf-8, utf-16le, or utf-16be and the first code point of the output stream is U+FEFF, remove that code point from the stream.
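
The alternate formulation above can be sketched in Python as follows. This is a minimal illustration, not the specification's algorithm: the class name PostDecodeBomStripper and the BOM_ENCODINGS set are hypothetical, and Python's incremental decoders stand in for the encoding's decoder algorithm. It relies on Python's "utf-8", "utf-16-le", and "utf-16-be" codecs passing U+FEFF through to the output rather than stripping it.

```python
import codecs

# Python codec names standing in for utf-8, utf-16le and utf-16be (assumption).
BOM_ENCODINGS = {"utf-8", "utf-16-le", "utf-16-be"}


class PostDecodeBomStripper:
    """Hypothetical streaming decoder that strips the BOM after decoding."""

    def __init__(self, encoding: str):
        self.encoding = encoding
        self._decoder = codecs.getincrementaldecoder(encoding)()
        self.bom_seen = False

    def decode(self, chunk: bytes, final: bool = False) -> str:
        # Run the decoder first; the BOM, if any, arrives as U+FEFF.
        output = self._decoder.decode(chunk, final)
        # Only the first non-empty output is inspected.
        if not self.bom_seen and output:
            self.bom_seen = True
            if self.encoding in BOM_ENCODINGS and output[0] == "\ufeff":
                output = output[1:]
        return output
```

Because the check happens on code points, a BOM split across two chunks needs no special handling: the first call simply returns an empty string and leaves the flag unset.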

Comment 1 Joshua Bell 2013-09-04 16:35:11 UTC

For symmetry with section 6 it may be better to leave it as-is, but I thought I'd point it out.

Comment 2 Anne 2013-09-04 20:08:02 UTC

That'd require we clearly define the output stream.

Comment 3 Anne 2014-03-26 18:36:37 UTC

Fixed as part of bug 16688.

https://github.com/whatwg/encoding/commit/dc8e4c10c9b4a91f188f3145c2e31ddec4d52a78

This is a massive change, review appreciated!

Comment 4 Joshua Bell 2014-03-26 18:55:01 UTC

In section 6, "Decode and encode", step 3: if buffer doesn't match any line in the byte-order table, buffer should be prepended to stream.

Prior to that revision, this was handled by updating an offset only if the BOM matched.

Comment 5 Anne 2014-03-26 18:59:44 UTC

We no longer have offsets or code point pointers. You can only read from a stream and put stuff back to it. There's no peek or skip, at least in the specification abstractions; you can implement them however you want, of course.
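
A stream limited to those two operations might look like the sketch below. The names ByteStream, read, prepend, and END_OF_STREAM are hypothetical and chosen for illustration; the specification defines these concepts in prose, not as an API.

```python
from collections import deque

# Hypothetical sentinel returned once the stream is exhausted.
END_OF_STREAM = None


class ByteStream:
    """Illustrative stream offering only read and prepend."""

    def __init__(self, data: bytes = b""):
        self._bytes = deque(data)

    def read(self):
        # Remove and return the first byte, or the sentinel when empty.
        return self._bytes.popleft() if self._bytes else END_OF_STREAM

    def prepend(self, data: bytes) -> None:
        # Put bytes back at the front, preserving their order.
        self._bytes.extendleft(reversed(data))
```

With this shape, a peek is just a read followed by a prepend of the same byte, which is why no separate peek or skip operation is needed.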

Comment 6 Joshua Bell 2014-03-26 19:26:51 UTC

(In reply to Anne from comment #5)
> We no longer have offsets or code point pointers. You can only read from a
> stream and put stuff back to it. There's no peek or skip... At least in the
> specification abstractions, you can implement them however you want of
> course.

Right....

So step 2 reads 3 bytes from the stream. Step 3 checks if they match the BOM. If they don't match the BOM, as written those 3 bytes are dropped on the floor and unavailable in step 4.

So step 3 needs to also say the equivalent of "otherwise, prepend buffer to stream"
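
The missing "otherwise" branch can be illustrated with a small sketch. The function name sniff_bom, the BOM_TABLE layout, and the use of a deque of byte values as the stream are assumptions made for the example; the step numbers in the specification are not reproduced here.

```python
from collections import deque

# Byte order marks and the encodings they select (labels are illustrative).
BOM_TABLE = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
)


def sniff_bom(stream: deque, fallback: str) -> str:
    """Consume a leading BOM if present; otherwise leave stream untouched."""
    # Read up to three bytes, stopping early at end-of-stream.
    buffer = bytes(stream.popleft() for _ in range(min(3, len(stream))))
    for bom, encoding in BOM_TABLE:
        if buffer.startswith(bom):
            # Put back whatever was read beyond the BOM itself.
            stream.extendleft(reversed(buffer[len(bom):]))
            return encoding
    # No BOM matched: prepend the whole buffer so no bytes are lost.
    stream.extendleft(reversed(buffer))
    return fallback
```

Without the final prepend, a stream such as b"abc def" would reach the decoder as b" def".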

Comment 7 Anne 2014-03-27 12:24:44 UTC

My bad, fixed.

Comment 8 Joshua Bell 2014-03-27 16:47:13 UTC

(In reply to Anne from comment #7)
> My bad, fixed.

That fix lgtm.

Another edge case: if stream is 0xFF 0xFE (that is, exactly two bytes), then step 3 reads two bytes then end-of-stream so buffer ends up being [ 0xFF, 0xFE ]. Step 4 matches utf-16le and BOM seen is set. Step 6 will then prepend the last byte of buffer to stream - so stream ends up as 0xFE. Oops.

I suppose we say that buffer ends up being [ 0xFF, 0xFE, end-of-stream ], so end-of-stream token gets prepended to stream?!?
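
The two-byte edge case can be reproduced with the sketch below. Both functions are hypothetical: naive_putback mirrors the behaviour described in this comment, and guarded_putback shows one possible guard (only put a byte back when three bytes were actually read). It is an assumption for illustration and not necessarily how the specification resolved the issue.

```python
from collections import deque


def read_three(stream: deque) -> bytes:
    # Read up to three bytes, stopping early at end-of-stream.
    return bytes(stream.popleft() for _ in range(min(3, len(stream))))


def naive_putback(stream: deque) -> None:
    buffer = read_three(stream)
    if buffer.startswith(b"\xff\xfe"):
        # Bug: assumes a third byte was read beyond the two-byte BOM.
        stream.appendleft(buffer[-1])


def guarded_putback(stream: deque) -> None:
    buffer = read_three(stream)
    if buffer.startswith(b"\xff\xfe") and len(buffer) == 3:
        # Only the byte after the BOM goes back, and only if it exists.
        stream.appendleft(buffer[-1])


naive = deque(b"\xff\xfe")
naive_putback(naive)      # stream is left holding 0xFE

guarded = deque(b"\xff\xfe")
guarded_putback(guarded)  # stream is left empty
```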

Comment 9 Anne 2014-03-28 11:21:41 UTC

https://github.com/whatwg/encoding/commit/b891a81b34d17ae4da01eb4cad074d09bf843e09