Bugzilla – Bug 16157
WebSocket shouldn't throw SyntaxError on unpaired surrogates
Last modified: 2012-05-02 20:06:47 UTC
> If the method's second argument has any unpaired surrogates, then throw a SyntaxError exception and abort these steps.
> If the data argument has any unpaired surrogates, then throw a SyntaxError exception.
Don't throw exceptions on unpaired surrogates. Instead, use the WebIDL "convert a DOMString to a sequence of Unicode characters" algorithm, which converts unpaired surrogates to U+FFFD, as well as defining the conversion itself.
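For readers who want to see what that replacement amounts to, here is a minimal sketch in JavaScript. It is not the WebIDL algorithm text itself, only an illustration of the behavior described above; the function name replaceLoneSurrogates is hypothetical.

```javascript
// Hypothetical helper (not part of any spec or library): returns a copy of
// `str` in which every unpaired surrogate code unit is replaced by U+FFFD.
function replaceLoneSurrogates(str) {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: only valid when immediately followed by a low surrogate.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        out += str[i] + str[i + 1]; // keep the well-formed pair
        i++;                        // skip the low half we just consumed
      } else {
        out += "\uFFFD";            // unpaired high surrogate
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      out += "\uFFFD";              // low surrogate with no preceding high half
    } else {
      out += str[i];                // ordinary BMP code unit
    }
  }
  return out;
}

// A string truncated in the middle of U+1F600 ("\uD83D\uDE00"):
replaceLoneSurrogates("a\uD83D"); // "a\uFFFD"
```

Well-formed pairs pass through unchanged, so only the damaged character is affected.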
Silently scrambling data seems like a bad idea. Why would we do this?
Please see the thread at http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/1589.html, so we don't start the discussion from scratch.
I read that thread before and didn't see any reason to do this.
The only argument I've seen so far is about what happens if a user types in a message with astral characters and the script truncates it naïvely half-way through a surrogate pair and then sends it through the socket. That does seem like a potentially rare case, and one that wouldn't be caught during development. It's not clear that replacing the half-surrogate with U+FFFD is especially nice either, but it seems better than crashing.
How is this different from the "draconian" error handling that XML parsers are required to do, and which many people, you included, have argued strongly against?
The problem with throwing for unpaired surrogates is that easy-to-make, data-dependent mistakes produce very fatal results. For example, if you want to send string data in smaller chunks, a very easy "mistake" to make would be to simply chop up the JS string into 10k-sized chunks and send each separately. This will generally work great; however, in languages which produce a lot of surrogates this will fail 50%-67% of the time.
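The chunking mistake is easy to reproduce. The sketch below is illustrative only: naiveChunks and safeChunks are hypothetical names, and safeChunks assumes a chunk size of at least 2 code units. It shows how cutting by code-unit index can split a surrogate pair, and one way a script could avoid doing so.

```javascript
// Hypothetical helper: splits by UTF-16 code unit count, ignoring pairs.
function naiveChunks(str, size) {
  const chunks = [];
  for (let i = 0; i < str.length; i += size) {
    chunks.push(str.slice(i, i + size));
  }
  return chunks;
}

// Hypothetical helper: same idea, but if a cut would fall between a high
// and a low surrogate, the chunk ends one code unit earlier.
// Assumes size >= 2; with size 1 a pair cannot fit in any chunk.
function safeChunks(str, size) {
  const chunks = [];
  let start = 0;
  while (start < str.length) {
    let end = Math.min(start + size, str.length);
    if (end < str.length && end - start > 1) {
      const last = str.charCodeAt(end - 1);
      const next = str.charCodeAt(end);
      const splitsPair = last >= 0xD800 && last <= 0xDBFF &&
                         next >= 0xDC00 && next <= 0xDFFF;
      if (splitsPair) end--; // leave the whole pair for the next chunk
    }
    chunks.push(str.slice(start, end));
    start = end;
  }
  return chunks;
}

const text = "a\uD83D\uDE00b"; // "a", U+1F600, "b"
naiveChunks(text, 2); // ["a\uD83D", "\uDE00b"]: both chunks hold half a pair
safeChunks(text, 2);  // ["a", "\uD83D\uDE00", "b"]
```

Under the old spec text, passing either of the naive chunks to send() would throw; under the proposed change each half would be sent as U+FFFD instead.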
If we could make it throw consistently, then I agree it would have been a more reasonable strategy. But I can't think of a way to make this not very data-dependent, which means that it's likely to not fail on developers' machines, but fail in the real world.
And yes, putting in a replacement character also results in destroyed data. However, in the example stated above, having one destroyed character every 10k of data should be a low enough error rate that the message is still understandable to a human, just like the layout errors produced by a missing end tag likely produce a page that is still understandable to humans.
Checked in as WHATWG revision r7084.
Check-in comment: Make WebSocket silently convert isolated surrogates to U+FFFD rather than throwing an exception. This will result in data corruption when a user types in astral-plane characters that get truncated by naive script half-way through, rather than crashing the application.