This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 23822 - Should the EventSource, Worker, and SharedWorker constructors resolve the URL using utf-8?
Status: RESOLVED WONTFIX
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML
Version: unspecified
Hardware: Other other
Importance: P4 normal
Target Milestone: Unsorted
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
URL: http://www.whatwg.org/specs/web-apps/...
Whiteboard:
Keywords:
Duplicates: 23823
Depends on:
Blocks: 25090
Reported: 2013-11-14 04:46 UTC by contributor
Modified: 2014-10-10 20:26 UTC (History)
CC List: 5 users

Description contributor 2013-11-14 04:46:30 UTC
Specification: http://www.whatwg.org/specs/web-apps/current-work/
Multipage: http://www.whatwg.org/C#the-eventsource-interface
Complete: http://www.whatwg.org/c#the-eventsource-interface
Referrer: 

Comment:
Should the EventSource constructor resolve the URL using utf-8 (like
WebSocket)? (If yes, also change .url to use utf-8)

Posted from: 59.37.57.226
User agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.26 Safari/537.36 OPR/18.0.1284.11 (Edition Next)
Comment 1 Ian 'Hixie' Hickson 2013-11-21 23:31:29 UTC
Nah. WebSocket does it because it's a new protocol that uses UTF-8 throughout. EventSource is just a regular HTTP GET.

   http://damowmow.com/playground/tests/urls/001.html
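
For illustration, the difference under discussion can be sketched in Python (not browser code). The sketch percent-encodes the same query string once as UTF-8 and once in a legacy document encoding. The helper name `encode_query`, the character "å", and the choice of windows-1252 are assumptions made for the example.

```python
from urllib.parse import quote


def encode_query(query: str, encoding: str) -> str:
    """Percent-encode a query string using the given character encoding.

    Hypothetical helper: a simplified stand-in for what a URL parser does
    with the query component; "=" and "&" are left untouched.
    """
    return quote(query, safe="=&", encoding=encoding)


# The same string, as a UTF-8 API (like WebSocket) would serialize it...
utf8_form = encode_query("q=å", "utf-8")
# ...and as an API honouring a windows-1252 document encoding would.
legacy_form = encode_query("q=å", "windows-1252")

print(utf8_form)    # q=%C3%A5
print(legacy_form)  # q=%E5
```

The server sees different bytes in the two cases, which is why the choice matters only for pages that are not already UTF-8.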
Comment 2 Anne 2013-12-16 23:15:06 UTC
Encoding override is a legacy thing. We shouldn't put it in new APIs. Heck, we don't even use it for XMLHttpRequest.
Comment 3 Ian 'Hixie' Hickson 2014-01-15 22:56:03 UTC
I disagree. Consistency across the platform is good, as it makes the platform less surprising (and surprising APIs are really bad for authors). Having different APIs act subtly differently on the same URL string in the same Document — in the same script, even — is wacked.
Comment 4 Anne 2014-01-16 10:51:56 UTC
You want consistency across EventSource objects. We lost that for <a>. <a> inside a document and its nested document do not necessarily work the same. We should avoid having that for EventSource and EventSource in a worker.
Comment 5 Anne 2014-01-16 10:57:35 UTC
Apart from consistency across similar-purpose objects (such as EventSource, Worker, etc.) there's another reason. Which is that if these simply use utf-8 there's no reason for these objects to keep or have a tie to a document in their implementation (or for there to be a global encoding variable, which would be worse). The implementation of these objects can rely on less external state and be more portable therefore.
Comment 6 Ian 'Hixie' Hickson 2014-01-16 17:14:52 UTC
I think that consistency between how an EventSource object and an <a href=""> parse a given string is more important than how two EventSource objects in two documents parse the same string. You're far more likely to be manipulating a string within a Document and passing it (or parts of it) to different APIs than you are to pass one string (or parts of it) between Documents, especially between Documents with different character encodings.

Implementations have to rely on the Document (actually the script settings object) anyway for base URL resolution. Adding encodings isn't a big deal. Indeed, it's good because it makes all the places that resolve URLs able to use the same pattern, rather than having some parts use one pattern and others use another.
Comment 7 Ian 'Hixie' Hickson 2014-01-16 17:15:47 UTC
*** Bug 23823 has been marked as a duplicate of this bug. ***
Comment 8 Anne 2014-01-17 11:08:44 UTC
I disagree. I think it's far more likely that a given URL is used in the same context (i.e. the same API) than that it is used in different contexts. And if any URL manipulation is to be done, that would be done through the URL API, which always uses utf-8.
Comment 9 Anne 2014-01-17 12:57:44 UTC
The other problem I have with your approach is that I think it is unlikely we will be able to follow it through consistently. If e.g. CSS would always use utf-8 your argument in comment 6 would not hold when you have direct access to the CSSOM from a document. It also does not hold if you pass the URL to a worker or some such.

Therefore I think that consistency within APIs, with anything that does not need to be non-utf-8 being utf-8, is a more viable long-term strategy.
Comment 10 Ian 'Hixie' Hickson 2014-01-18 03:07:22 UTC
It's clear that we'll have inconsistent behaviour if you move from one context (CSS, Worker, page JS) to another. But IMHO that's less bad than inconsistent behaviour within one context, which, if I understand correctly, is what you're arguing for (<a> to EventSource, e.g.).

It's more likely that someone will take a URL within one context and use it in a different way in that context (e.g. taking the query component of one URL and putting it into another) than crossing contexts, IMHO, if only because most pages have just one context.

(Have we actually tested what browsers do in CSS, by the way? Is it really UTF-8?)

Obviously this all becomes moot if someone just uses UTF-8 throughout.
Comment 11 Simon Pieters 2014-01-20 14:09:24 UTC
(In reply to Ian 'Hixie' Hickson from comment #10)
> (Have we actually tested what browsers do in CSS, by the way?

Yes.

https://github.com/w3c/web-platform-tests/tree/master/html/infrastructure/urls/resolving-urls/query-encoding

> Is it really
> UTF-8?)

It's mixed.

Presto uses utf-8 except for @import which uses the document's encoding.
Blink uses the stylesheet's encoding except for @import which uses utf-8.
Gecko uses the document's encoding except for <style> which uses utf-8.
Comment 12 Ian 'Hixie' Hickson 2014-01-21 21:35:35 UTC
And the style sheet encoding in Blink's case is what; the document's?
Comment 13 Simon Pieters 2014-01-23 06:29:49 UTC
No, it can use its own encoding set with HTTP content-type or @charset (for external stylesheets). For <style> it's the document's.
Comment 14 Anne 2014-01-24 18:50:41 UTC
(In reply to Ian 'Hixie' Hickson from comment #10)
> It's more likely that someone will take a URL within one context and use it
> in a different way in that context (e.g. taking the query component of one
> URL and putting it into another) than crossing contexts, IMHO, if only
> because most pages have just one context.

I disagree with this. While URLs can be put anywhere, a URL you put in EventSource does not make sense within <a>. It makes much more sense that you have that URL already and at some point move it to a worker or some such to use it with the same API. I.e. my argument is that URLs are API-bound and that APIs should be as consistent as possible across contexts, which argues for utf-8 as certain contexts are already restricted to that.
Comment 15 Ian 'Hixie' Hickson 2014-01-24 23:11:12 UTC
(In reply to Simon Pieters from comment #13)
> No it can use its own encoding set with HTTP content-type or @charset (for
> external stylesheets). For <style> it's the document's.

I meant in the case where there was no given encoding, sorry. (It'd be very weird if it used the document encoding even though it had its own encoding... is that really what Gecko does? And if so, does it really do that _except for sheets inside the document_? that's wacked, if so.)


(In reply to Anne from comment #14)
> 
> I disagree with this. While URLs can be put anywhere, a URL you put in
> EventSource does not make sense within <a>.

Not the whole URL, but components of it. For example, you might end up with the same text in a link's query component, an EventSource URL's query component, a WebSocket's query component, etc. It's not the URLs themselves that you're copying around, but the string.

If you already _have_ a URL, there's not really a problem. It's almost certainly already gotten converted so that there's no re-encoding to risk.
Comment 16 Anne 2014-01-24 23:59:56 UTC
That depends on when you parse the URL, really. And ideally all URL manipulation down the road will go through the URL API. And from your example the WebSocket API would already be utf-8.

It also seems much simpler to explain that a couple of APIs have a quirk rather than that all things depend on a global flag and context.
Comment 17 Simon Pieters 2014-01-25 19:23:16 UTC
(In reply to Ian 'Hixie' Hickson from comment #15)
> I meant in the case where there was no given encoding, sorry.

OK, then yeah, the stylesheet inherits the document's encoding (assuming <link> as opposed to @import).

> (It'd be very
> weird if it used the document encoding even though it had its own
> encoding... is that really what Gecko does?

Yes.

> And if so, does it really do
> that _except for sheets inside the document_? that's wacked, if so.)
 
Yes.
Comment 18 Ian 'Hixie' Hickson 2014-01-27 19:52:09 UTC
Down the road, all the encodings will be UTF-8, too, so this will be moot.

This whole thing is only relevant for the near term, where people use these APIs but don't use UTF-8 and don't use the new URL parsing APIs.

In practice, I think URL concatenation is something that's going to exist for a long time, by the way. Having a string that you just concatenate to another is always going to be way easier than using an API, however good the API is.
Comment 19 Anne 2014-01-30 00:28:29 UTC
Encodings will be utf-8? Do you mean other encodings will disappear?

I would agree with that.

Given all that, I think this comes down to what is easiest to implement. And I would argue that what I proposed is easier. I'm happy to let implementers decide.
Comment 20 Ian 'Hixie' Hickson 2014-01-30 02:29:46 UTC
I don't think it's down to what's easiest to implement. I think it's down to what's easiest for authors in the meantime.
Comment 21 Anne 2014-01-30 18:38:15 UTC
What's easiest for authors in the meantime is to use utf-8, as implementations all do different things, as zcorpan pointed out already.
Comment 22 Ian 'Hixie' Hickson 2014-01-30 19:47:17 UTC
Using UTF-8 isn't trivial for many authors. I think it'll take years and years for legacy sites to convert. Much longer than it will take for browsers to align.
Comment 23 Anne 2014-01-30 20:04:10 UTC
Is there data on that? Most of the web is utf-8 today per data Google released on their blog.
Comment 24 Ian 'Hixie' Hickson 2014-01-30 22:59:07 UTC
I assume you're referring to this:
   http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html

That seems pretty consistent with most new sites being UTF-8, but most legacy sites not changing.
Comment 25 Simon Pieters 2014-01-31 10:54:40 UTC
FWIW I agree with Anne at this point, we should do what is simpler to implement. My colleagues' experience with this issue in Presto is that it was non-trivial to use the document's encoding because many parts of the code need to keep track of it. The more features that use utf-8, the better.
Comment 26 Ian 'Hixie' Hickson 2014-01-31 18:16:00 UTC
Why would authors be trumped by implementors on this point? I don't understand.
Comment 27 Simon Pieters 2014-02-03 09:51:13 UTC
I think the difference for authors is not so clear, and it's temporary, while the difference for implementors is clearer and is not temporary.
Comment 28 Ian 'Hixie' Hickson 2014-02-03 19:59:41 UTC
It's "temporary", but still on the order of years if not decades...

I still think this is exactly the kind of thing that makes the Web platform suck. Yes, we have quirks. But it's worse to have inconsistent quirks than to just own our quirkiness and be consistent about it everywhere. Having

   protocol + host + path + query

...result in different strings at the end based on what's going on at the start, or based on what API this expression is placed into, is worse than the fact that path and query get encoded differently.
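
A rough Python sketch of the concatenation case described here, under a simplified model in which the path is always percent-encoded as UTF-8 while the query follows the document encoding. The function `build_url`, the example origin, and the windows-1252 encoding are assumptions for illustration only.

```python
from urllib.parse import quote


def build_url(origin: str, path: str, query: str, document_encoding: str) -> str:
    """Hypothetical model of serializing origin + path + query.

    Simplification: real URL parsers handle many more cases; this only
    shows which encoding applies to which component.
    """
    # The path component is encoded as UTF-8 regardless of the document.
    encoded_path = quote(path, safe="/", encoding="utf-8")
    # The query component follows the document encoding in this model.
    encoded_query = quote(query, safe="=&", encoding=document_encoding)
    return f"{origin}{encoded_path}?{encoded_query}"


same_inputs = ("http://example.com", "/å", "q=å")
in_utf8_doc = build_url(*same_inputs, document_encoding="utf-8")
in_legacy_doc = build_url(*same_inputs, document_encoding="windows-1252")

print(in_utf8_doc)    # http://example.com/%C3%A5?q=%C3%A5
print(in_legacy_doc)  # http://example.com/%C3%A5?q=%E5
```

The same three input strings serialize differently only in the query, and only when the document is not UTF-8.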
Comment 29 Anne 2014-02-05 14:19:31 UTC
Say you're maintaining a legacy application and need to add some EventSource code. Would it not be nice if that worked the same way on legacy as on new pages?
Comment 30 Ian 'Hixie' Hickson 2014-02-05 19:34:33 UTC
If you're maintaining a legacy application, new pages will still be in the same encoding as the old pages. Having an application use two different encodings is a giant pain.
Comment 31 Ian 'Hixie' Hickson 2014-02-21 21:06:22 UTC
As far as the EventSource and Worker constructors go, I don't see value in being inconsistent with the Audio() constructor, nor with features in attributes like <a href=""> and so on. Also, it's interoperably implemented today anyway:

   http://damowmow.com/playground/tests/urls/001.html

So I'm WONTFIXing this.

It sounds like the CSS situation is a bit of a mess, but I'll leave that to the CSS working group to figure out.
Comment 32 Simon Pieters 2014-02-26 09:36:26 UTC
It's not interoperably implemented. Firefox uses utf-8 for EventSource.

Fail	EventSource constructor	assert_equals: expected "%E5" but got "%C3%A5"
http://web-platform.test:8000/html/infrastructure/urls/resolving-urls/query-encoding/windows-1252.html

Your test is not so helpful because you use a character that is not representable in the encoding which causes Firefox to use utf-8 even if it would otherwise use the document's encoding.

In Firefox, window.open() also uses utf-8, importScripts() uses the document's encoding, cache manifest uses the document's encoding.

In Blink, <base href> uses utf-8, location.search and <a>/<area>.search use utf-8, history.pushState/replaceState use utf-8, SVG <image> uses utf-8, XHR uses document's encoding, WebSocket uses document's encoding.

So it's not just the CSS situation that is in a mess.
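
The point about the test character can be checked with a small Python sketch: a character only exercises the document-encoding path if the encoding can represent it. The helper `is_representable` and the sample characters are assumptions; the character actually used in the 001 test is not stated in this thread, and the sketch does not model any browser's fallback behaviour.

```python
def is_representable(char: str, encoding: str) -> bool:
    """Return True if `char` can be encoded in `encoding` (hypothetical helper)."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# "å" exists in windows-1252, so a test using it can distinguish
# "%E5" (document encoding) from "%C3%A5" (utf-8).
a_ring_ok = is_representable("å", "windows-1252")
# A snowman (U+2603) does not exist in windows-1252, so a test using it
# cannot show which encoding the browser would normally apply.
snowman_ok = is_representable("\u2603", "windows-1252")

print(a_ring_ok, snowman_ok)
```

A test page therefore needs a non-ASCII character that the declared encoding can represent.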
Comment 33 Ian 'Hixie' Hickson 2014-02-27 23:39:17 UTC
> importScripts() uses the document's encoding

Which document (in shared workers in particular)? importScripts() is only available from a worker, and workers are always UTF-8, per spec.


> So it's not just the CSS situation that is in a mess.

I meant at the spec level. All the cases you describe are well-defined, even if not perfectly well implemented by everyone.


Updated test case with a bit more (though not all those you mentioned):
   http://damowmow.com/playground/tests/urls/002.html

I grant that interop is not as good as 001 made it appear. (Firefox doesn't seem to use UTF-8 for EventSource, though, unless I'm missing something.) But it still seems pretty good to me, at least relative to what I'm used to.
Comment 34 Simon Pieters 2014-02-28 14:41:38 UTC
(In reply to Ian 'Hixie' Hickson from comment #33)
> > importScripts() uses the document's encoding
> 
> Which document (in shared workers in particular)?

I don't know, haven't tested that.

> importScripts() is only
> available from a worker, and workers are always UTF-8, per spec.

Yep.
 
> > So it's not just the CSS situation that is in a mess.
> 
> I meant at the spec level. All the cases you describe are well-defined, even
> if not perfectly well implemented by everyone.

OK, sure.

> Updated test case with a bit more (though not all those you mentioned):
>    http://damowmow.com/playground/tests/urls/002.html
> 
> I grant that interop is not as good as 001 made it appear. (Firefox doesn't
> seem to use UTF-8 for EventSource, though, unless I'm missing something.)

Firefox uses utf-8 for EventSource when actually fetching. (Seems like I forgot to test EventSource.url.)

> But it still seems pretty good to me, at least relative to what I'm used to.

That's fine, but you WONTFIXed on the basis that there was interop, which was false. We can go either way based on the implemented landscape here I think.
Comment 35 Ian 'Hixie' Hickson 2014-02-28 22:04:03 UTC
Firefox being inconsistent about what it fetches and what it reflects in IDL is pretty scary.
Comment 36 Ian 'Hixie' Hickson 2014-03-13 20:35:25 UTC
(In reply to Simon Pieters from comment #34)
> 
> That's fine, but you WONTFIXed on the basis that there was interop, which
> was false. We can go either way based on the implemented landscape here I
> think.

I think the situation is far more interoperable than that suggests. It's not like it's 50:50, or mostly random, or whatever. Most things use the document encoding. There are exceptions in every browser I've tested, but they're the minority.


For authors, long term it won't matter, since they'll just use UTF-8. Short term, using the document encoding consistently everywhere is the least confusing.

For implementors, doing UTF-8 everywhere is better on the long term, but we probably can't do that everywhere due to compat. Once you have to do it somewhere, the benefit of doing it consistently everywhere seems on par to the benefit of having some parts of the codebase be simpler. So it seems like a push.

All in all, I just don't see changing this as very compelling, and I think there's good arguments to be made for the consistency position.
Comment 37 Anne 2014-03-18 13:47:13 UTC
I'm having a hard time buying that argument. If authors use anything but utf-8 they're in a world of hurt, especially right now.

So it seems like we should consult with implementors as to what they prefer. I would expect them to prefer being less dependent on global variables.
Comment 38 Ian 'Hixie' Hickson 2014-09-29 18:31:09 UTC
Here's where I stand on this bug:

 - overall, IMHO this is WONTFIX. I don't want to churn on such things, and I
   doubt we'll ever be able to make it entirely consistent so there's no long-
   term simplification win to be had, we'll always have some special code and
   so on.

 - if there are some APIs where the majority of implementations disagree with
   the spec, I'm happy to fix those. File separate bugs.

 - if there are specific APIs where the majority of implementors want to move
   to a different model, then those implementations should ship the change to
   see if it's compatible, and then file a bug under the second bullet above.