This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.

Bug 21091 - Clarify "legacy" single-byte encodings
Summary: Clarify "legacy" single-byte encodings
Status: RESOLVED WONTFIX
Alias: None
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: PC Windows NT
Importance: P2 trivial
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+encodingspec
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-02-22 16:34 UTC by Paril
Modified: 2013-02-22 22:09 UTC
CC List: 1 user

See Also:


Attachments

Description Paril 2013-02-22 16:34:47 UTC
I'm not sure exactly how to best explain this, but I'll do my best.

I'm using the Encoding API to transfer binary data to clients. The data is transferred as text/plain; us-ascii, which is essentially what I want, as it is one byte per byte and won't mangle certain special characters. My primary concern is that bytes above 0x7F will actually come out as two bytes rather than one, which doesn't match the expectation that 1 byte is... well, 1 byte.

This can be proven with the following:
new TextEncoder("utf-8").encode("§")

It returns a Uint8Array with two elements: 0xC2 0xA7. Obviously, this is expected for utf-8; so, okay, easy fix: I should be using us-ascii to match the encoding that I'm passing. The problem now is that the document does not explain how to actually -use- these encodings.

Section 9 makes the remark that "Single-byte encodings share the decoder and encoder.", but this makes no sense, as a decoder can only give out a string, and new TextEncoder("us-ascii") throws an error saying it must be one of the three utf types.

The document fails to actually provide details on how users of this standard are supposed to use the legacy single-byte encodings to decode a string, loaded from a single-byte document, into an array that matches the length of the string. That is to say, a string of 256 characters, each holding its own index as its code point, should decode to a Uint8Array of length 256, with each index keeping its value.
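
A minimal sketch of the round trip I have in mind (assuming the string really does contain only code points 0-255; the helper name is just something I made up):

function binaryStringToBytes(str) {
  // One byte per character; assumes every charCodeAt(i) is already 0-255.
  var bytes = new Uint8Array(str.length);
  for (var i = 0; i < str.length; i++) {
    bytes[i] = str.charCodeAt(i);
  }
  return bytes;
}
// A 256-character string holding code points 0..255 should come back as a
// Uint8Array of length 256 with the same values in the same order.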

I might be missing something obvious, but it would be really handy to know exactly how to decode a single-byte string. I managed to jerry-rig a solution based on a TextEncoder implementation on Google Code by removing the utf checks - surprisingly, this actually works and produces the proper output, but it doesn't follow the standard, and I don't want to cheat the standard.
Comment 1 Anne 2013-02-22 16:39:46 UTC
It's not supported, basically. You should be using utf-8 always.
Comment 2 Paril 2013-02-22 16:57:43 UTC
That's what I figured from the way it was written... that seems really lame, though. Binary data is implied to contain bytes anywhere from 0 to 255; what other options do I have for quickly converting a string to a Uint8Array? Preferably an option that is natively supported in browsers (or will be, at some point) so that it's not pure JS.
Comment 3 Anne 2013-02-22 17:00:47 UTC
Could you maybe more completely explain what you want to do? Single-byte encodings only support up to 255 code points. Strings support way more.
Comment 4 Paril 2013-02-22 17:07:42 UTC
Okay, well, I come from a software development background, meaning I am used to dealing with binary data. That's just how I work with my file formats: they are binary, and I really don't want to convert everything to a textual format. For this specific project speed is really important, and there's not much that can beat binary formats for the amount of data streaming I need to do.

All of the file formats I am transferring to the end user are binary. I need a way, in JavaScript, to read these bytes back in the same order they were written to the file. At the moment I'm using that legacy implementation of TextEncoder, which works perfectly for now, but I feel like something is "wrong" about having to edit it to force it to decode us-ascii properly, and that there must be something better in the works for doing this.

A simple example: I write a single byte to a file (0xA7); I send this 1-byte file to the client; they read it back, and the first byte they get out of it is 0xA7.

I started out by doing direct string parsing (charCodeAt()), but that caused problems, as the encoding would return odd values well beyond 255 for certain characters, which had me hacking in an array to convert them back to their "intended" byte values. I realized that the Encoding API was likely a better alternative - and in my case it was, in that parsing the entire binary file is now approximately 14x faster than it used to be for ~400 KB of binary data - but again, I feel like I'm doing something horribly wrong by editing a standard implementation to -force- it to decode a string into us-ascii.
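
Roughly what was going wrong, as far as I can tell - a sketch assuming the response happened to be decoded as windows-1252 (the actual charset in my case may have been different):

// The server sends the single byte 0x80. Read back as *text* under
// windows-1252, that byte turns into U+20AC (the euro sign), so:
var text = "\u20AC";   // what the browser hands me as a "string"
text.charCodeAt(0);    // 8364 - nowhere near the 128 I actually wrote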

I don't know if I'm making it more complicated than it needs to be or something, but I can't figure out any easy way to "push" a remote file into a Uint8Array. That is the end goal - to GET a file from a server and have it go directly into a Uint8Array without having to convert it from a string in the first place.
Comment 5 Anne 2013-02-22 17:11:53 UTC
It sounds like you should not be reading these files in as a string. That is your problem.
Comment 6 Paril 2013-02-22 17:14:18 UTC
Of course, but there is no standard way to read a response as binary, no? Is there a sort of XMLHttpRequest for binary?
Comment 7 Paril 2013-02-22 17:16:48 UTC
Whoa, okay, I just found this: http://www.html5rocks.com/en/tutorials/file/xhr2/
I don't know how the heck I missed this one. I swear I looked it up before posting here, officer, I'm just holding the webpage for a friend.
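
For anyone else who finds this, the pattern from that tutorial looks roughly like this (an untested sketch; the URL is just a placeholder):

var xhr = new XMLHttpRequest();
xhr.open("GET", "/data/level.bin", true);    // placeholder URL
xhr.responseType = "arraybuffer";            // ask for raw bytes, not a string
xhr.onload = function () {
  var bytes = new Uint8Array(xhr.response);  // bytes exactly as written to the file
  // for my 1-byte example file, bytes[0] would be 0xA7 here
};
xhr.send();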

Thanks for putting up with me - I knew I was missing something that would make this much easier on me.
Comment 8 Anne 2013-02-22 17:18:07 UTC
Thanks for explaining your use case, glad we addressed it some time ago :-)
Comment 9 Paril 2013-02-22 17:22:20 UTC
It's a tough move going from software development to web development. I have to say I picked the best year to start working with it, though - the standards have pretty much brought web development up to par with desktop programs, and I am loving every minute of it. 'Nuff for me, though - I still think that the documentation should be updated to reflect that the encoder/decoder cannot be used to decode legacy types. Reading it through entirely, it suggests that it is supported, but that you can only use the Encoder to go both ways... which makes no sense :)
Comment 10 Anne 2013-02-22 17:32:45 UTC
Well, the encoders and decoders are not just exposed through the API. The TextEncoder limitations are quite clear from http://encoding.spec.whatwg.org/#interface-textencoder - everything but utf-8, utf-16le, and utf-16be throws.
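
Roughly, as an illustration of what that means (not quoting the spec itself):

new TextEncoder("windows-1252");                  // throws
new TextEncoder("utf-8").encode("§");             // Uint8Array [0xC2, 0xA7]
new TextDecoder("windows-1252").decode(
  new Uint8Array([0xA7]));                        // "§" - decoding legacy labels works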

The encoders are still used for HTML <form>, URLs, etc. but we don't want to spread that legacy any further.
Comment 11 Paril 2013-02-22 20:04:13 UTC
Yeah, I understand that now, but it's this line that throws me off:
"Single-byte encodings share the decoder and encoder."

It seems to imply that you are still able to encode a string with a single-byte encoding, just by using a different method - but nowhere does it really explain why legacy types are not allowed, or why only the utf types are allowed in the encoder but not the decoder. In theory, if single-byte encodings share a decoder and encoder, should it not be possible to still encode?
Comment 12 Anne 2013-02-22 22:09:26 UTC
Sure, and as I said, certain parts of the platform do just that. But we don't want to expose that more than needed.