25914 – No definition of parsing blob's scheme data

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Summary: No definition of parsing blob's scheme data

Status:	RESOLVED FIXED

Alias:	None

Product:	WebAppsWG
Classification:	Unclassified
Component:	File API
Version:	unspecified
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Arun
QA Contact:	public-webapps-bugzilla

Blocks:	25987

Reported:	2014-05-29 09:52 UTC by Anne
Modified:	2014-06-25 19:15 UTC
CC List:	2 users

Description Anne 2014-05-29 09:52:25 UTC

We'd need such a definition and it needs to return the origin and blob ID.

Comment 1 Glenn Maynard 2014-05-29 14:13:10 UTC

I don't think there's any reason to ever parse out the UUID.  The "blob ID" can just be the whole URL.

Comment 2 Arun 2014-05-29 14:25:18 UTC

Do you mean, how to extract the identifier for use with the Blob URL store? That's the entire scheme data IMHO -- the extra origin annotation isn't problematic and goes along for the ride (previous editions of this used to have the scheme data just be a UUID).

The spec covers how to serialize a Blob URL. Tokenizing the string to extract parts is straightforward. Do you want an additional definition of how to parse the Blob URL string and extract the origin? If so, that means stripping off the leading "blob:" and everything after the "/" solidus, which leaves the origin. Should this be formalized in the spec?
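
Illustrative sketch (not from the thread): one way to read "the identifier is the entire scheme data" is to drop the "blob:" prefix and cut at the first "?" or "#". The function name blob_scheme_data is hypothetical, and plain string handling stands in for the URL parser here, so treat this as an approximation under those assumptions.

```python
def blob_scheme_data(url: str) -> str:
    """Return the text between "blob:" and any query or fragment.

    Assumption: this mirrors the "scheme data" component only roughly;
    it is plain string handling, not the URL Standard's parser.
    """
    prefix = "blob:"
    if not url.lower().startswith(prefix):
        raise ValueError(f"not a blob URL: {url!r}")
    rest = url[len(prefix):]
    # Drop a fragment first, then a query, so we keep only what
    # precedes whichever delimiter appears first.
    for delimiter in ("#", "?"):
        rest = rest.split(delimiter, 1)[0]
    return rest


identifier = blob_scheme_data(
    "blob:http://example.org:8080/33107ee0-ecbd-11e3-ac10-0800200c9a66#frag"
)
print(identifier)
```

The returned string would then be the key used to look the Blob up, with the origin annotation simply carried along inside it.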

Comment 3 Anne 2014-05-29 15:16:48 UTC

If we use scheme data as identifier, fine.

Yes, I'd like a way to extract an origin out of the scheme data.

Comment 4 Arun 2014-05-30 21:32:48 UTC

(In reply to Anne from comment #3)
> If we use scheme data as identifier, fine.
> 
> Yes, I'd like a way to extract an origin out of the scheme data.

OK, I think http://dev.w3.org/2006/webapi/FileAPI/#extractionOfOriginFromBlobURL will work.

Comment 5 Anne 2014-05-31 07:14:09 UTC

This algorithm should operate on a parsed URL, not a fresh URL. You want to hand this algorithm a scheme data component.

Comment 6 Arun 2014-06-02 20:05:47 UTC

(In reply to Anne from comment #5)
> This algorithm should operate on a parsed URL, not a fresh URL. You want to
> hand this algorithm a scheme data component.


Done! And revisited the original intent of this bug.

So:

1. Specified extracting the identifier from a fresh URL: http://dev.w3.org/2006/webapi/FileAPI/#extractionOfIdentifierFromBlobURL 

2. Specified extracting origin from a scheme data component / identifier: http://dev.w3.org/2006/webapi/FileAPI/#extractionOfOriginFromIdentifier

Clarified what emitting methods do.

Comment 7 Anne 2014-06-05 08:21:01 UTC

This is wrong. The URL parser is http://url.spec.whatwg.org/ That returns you components you can operate on. 'scheme data' seems to be the identifier you want. Or maybe 'scheme data' + "?" + 'query'.

Comment 8 Arun 2014-06-05 13:22:09 UTC

(In reply to Anne from comment #7)
> This is wrong. The URL parser is http://url.spec.whatwg.org/ That returns
> you components you can operate on. 'scheme data' seems to be the identifier
> you want. Or maybe 'scheme data' + "?" + 'query'.


I thought it was the other way around: that URL Parser wanted File API to define a way to extract identifiers from Blob URLs.

If not, seems like what's needed is:

1. Use URL parser to get scheme data. Other URL components have no meaning, and are not emitted by any method (e.g. URL.createFor or URL.createObjectURL do not emit a fragment, a query, etc., although web developers can append a fragment).

2. Get the origin out of the parsed URL from 1.

Comment 9 Anne 2014-06-05 13:33:03 UTC

Yeah, you should define 2 as part of the blob URL definition. Would make sense.

Comment 10 Glenn Maynard 2014-06-05 13:34:22 UTC

It wasn't obvious to me what "scheme data" means.  It looks like, in a URL "http://foo.com/bar?baz#baf", the scheme data is "foo.com/bar", so what you want is the algorithm to parse out "foo.com" from "foo.com/bar"?

This looks like it should be identical to HTTP.  The format is the same, and people will inevitably parse it out using code they have for parsing HTTP URLs.

Comment 11 Anne 2014-06-05 13:46:07 UTC

No, for http URLs there is no scheme data. It follows different rules as it is a relative URL.

Comment 12 Glenn Maynard 2014-06-05 13:46:55 UTC

Oh, forgot about the nested-URL aspect of blob URLs now.  The scheme data (if I read correctly) of "blob:http://foo.com/1111-2222?data" is "http://foo.com/1111-2222".  Should the scheme data just be recursively parsed as a URL, so it can retrieve the origin of any nested scheme?

Speaking of which, the "blob:origin/UUID" encoding assumes that "origin/path" makes sense for every possible scheme that can be an origin for blob URLs.  Is that safe?  I can't think of any where it won't, but if a URL doesn't have a "host/path" formatting at all it could be weird.  (data: is a URL format that doesn't, but I haven't been following the origin-of-data: discussion to know if that matters.)

Comment 13 Anne 2014-06-05 13:53:30 UTC

Recursively parsing it might make sense. If we want to define origin as a tuple.

Comment 14 Arun 2014-06-05 14:29:35 UTC

(In reply to Glenn Maynard from comment #12)
> Oh, forgot about the nested-URL aspect of blob URLs now.  The scheme data
> (if I read correctly) of "blob:http://foo.com/1111-2222?data" is
> "http://foo.com/1111-2222".  Should the scheme data just be recursively
> parsed as a URL, so it can retrieve the origin of any nested scheme?


So yes in blob:http://example.org:8080/33107ee0-ecbd-11e3-ac10-0800200c9a66 the scheme data is http://example.org:8080/33107ee0-ecbd-11e3-ac10-0800200c9a66 and is also the identifier entered in the Blob URL Store.

It *could* be recursively parsed, but Blob URLs are just strings, and recursively parsing the URL doesn't do anything useful (e.g., it isn't handed off to Fetch).

So, origin extraction is currently defined as string manipulation. What follows after the "/" is a UUID.


> 
> Speaking of which, the "blob:origin/UUID" encoding assumes that
> "origin/path" makes sense for every possible scheme that can be an origin
> for blob URLs.  Is that safe?  


Well, the file scheme, along with user-agent proprietary schemes like chrome, are not defined. In fact, the spec. restricts us to what the Web Origin specification (http://tools.ietf.org/html/rfc6454) defines as an origin.


(In reply to Anne from comment #9)
> Yeah, you should define 2 as part of the blob URL definition. Would make
> sense.


It *is* defined as part of the Blob URL definition, which asks for a Unicode-serialization of the origin of the incumbent settings object + "/" + UUID, which is the scheme data, which identifies a Blob resource.
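
Illustrative sketch (not from the thread): the serialization described above, "blob:" followed by the serialized origin, a "/", and a UUID, can be written in a few lines. The function name make_blob_url is hypothetical, and the caller is assumed to pass an already-serialized origin string.

```python
import uuid


def make_blob_url(serialized_origin: str) -> str:
    """Build "blob:" + origin + "/" + UUID.

    Assumption: serialized_origin is already a serialized origin such as
    "http://example.org:8080"; no validation is attempted here.
    """
    return f"blob:{serialized_origin}/{uuid.uuid4()}"


url = make_blob_url("http://example.org:8080")
print(url)
```

Everything after "blob:" in the result is the scheme data, which doubles as the identifier for the Blob.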

Comment 15 Glenn Maynard 2014-06-05 14:45:15 UTC

(In reply to Arun from comment #14)
> So, origin extraction is currently defined as string manipulation. What
> follows after the "/" is a UUID.

> Well, the file scheme, along with user-agent proprietary schemes like
> chrome, are not defined. In fact, the spec. restricts us to what the Web
> Origin specification (http://tools.ietf.org/html/rfc6454) defines as an
> origin.

Pulling it out as a string is fine, I think, but don't define it in a way that fails if the origin contains a slash.  Find the *last* "/" in the URL and strip off everything after it.  For example, blob:file://home/glenn/test.html/10d88a40-b6b1-45ec-8aba-a626037613a0 would parse out "file://home/glenn/test.html" and not "file://home".

This doesn't actually matter for file URLs, since browsers don't expose the origin for file URLs in the first place (for privacy reasons, I assume), but it helps ensure that this will work for all future and unknown schemes that blob URLs might be used in.

(Actually, file URLs are a bit broken with blob URLs in Chrome: it results in a URL like "blob:null/uuid".  Firefox's current blob URLs don't have this problem, since they don't include the origin.)
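
Illustrative sketch (not from the thread): splitting on the last "/" rather than the first keeps slash-containing origins intact. The function name origin_from_scheme_data is hypothetical, and the input is assumed to be scheme data of the form origin + "/" + UUID.

```python
def origin_from_scheme_data(scheme_data: str) -> str:
    """Strip the final "/<uuid>" segment and return what precedes it."""
    origin, separator, _uuid = scheme_data.rpartition("/")
    if not separator or not origin:
        raise ValueError(f"no origin/UUID separator in {scheme_data!r}")
    return origin


print(origin_from_scheme_data(
    "file://home/glenn/test.html/10d88a40-b6b1-45ec-8aba-a626037613a0"
))
print(origin_from_scheme_data("null/10d88a40-b6b1-45ec-8aba-a626037613a0"))
```

This only recovers a string. Whether that string is a usable origin (for example "null") is a separate question.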

Comment 16 Arun 2014-06-05 16:55:32 UTC

(In reply to Glenn Maynard from comment #15)
> (In reply to Arun from comment #14)
> > So, origin extraction is currently defined as string manipulation. What
> > follows after the "/" is a UUID.
> 
> > Well, the file scheme, along with user-agent proprietary schemes like
> > chrome, are not defined. In fact, the spec. restricts us to what the Web
> > Origin specification (http://tools.ietf.org/html/rfc6454) defines as an
> > origin.
> 
> Pulling it out as a string is fine, I think, but don't define it in a way
> that fails if the origin contains a slash.  Find the *last* "/" in the URL
> and strip off everything after it.  For example,
> blob:file://home/glenn/test.html/10d88a40-b6b1-45ec-8aba-a626037613a0 would
> parse out "file://home/glenn/test.html" and not "file://home".


Slash-containing origins really only arise with the file scheme, and right now the spec disallows them, which I think is right.

> 
> This doesn't actually matter for file URLs, since browsers don't expose the
> origin for file URLs in the first place (for privacy reasons, I assume), but
> it helps ensure that this will work for all future and unknown schemes that
> blob URLs might be used in.
> 
> (Actually, file URLs are a bit broken with blob URLs in Chrome: it results
> in a URL like "blob:null/uuid".  Firefox's current blob URLs don't have this
> problem, since they don't include the origin.)


This might be right, actually. If the origin of the incumbent settings object returns null when parsing for tuples in the Unicode Serialization of an Origin algorithm, the Unicode Serialization of a Blob algorithm allows implementations to write out an implementation-defined value. It could be null, or something else.

But file scheme use is disallowed.

Comment 17 Anne 2014-06-09 07:09:22 UTC

(In reply to Arun from comment #14)
> It *could* be recursively parsed, but Blob URLs are just strings, and
> recursively parsing the URL doesn't do anything useful

Well, origins are objects and not strings.


> It *is* defined as part of the Blob URL definition, which asks for a
> Unicode-serialization of the origin of the incumbent settings object + "/" +
> UUID, which is the scheme data, which identifies a Blob resource.

That sounds like the opposite of what this bug is asking for.

Comment 18 Glenn Maynard 2014-06-09 14:08:49 UTC

To recap something from IRC: a problem with the idea of embedding the origin inside the blob: URL and then extracting it later is that it won't work with anything that has a unique origin, which includes sandboxed iframes and data:.  This already works today:

http://zewt.org/~glenn/blob-inside-sandbox.html

We might not care about file:, but we should care about data: and we definitely care about sandboxed iframes.

Unique origins serialize to "null".  I think there's no way for us to embed the origin inside the blob URL and then pull it out later--we need the real origin to do origin checks, not "null" (and not the effective script origin, which is what sandboxed iframes see in document.location.origin).

This seems to point towards Firefox's "implicit origin" approach: don't put the origin in the URL at all, so the blob URL is simply "blob:uuid", and store the origin in the blob URL store.

It seems like we shouldn't expose the origin from URL.origin at all if we do this.  Otherwise, the result of new URL(blobUrl).origin will be different before and after blobUrl is revoked, which is unfortunate.  It would also mean synchronously looking up the URL in the blob URL store to get its origin, which might be IPC.  I don't know if that's critical, but as far as I know blob URLs are completely async entities today, which seems like a good thing.
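
Illustrative sketch (not from the thread): the "implicit origin" approach can be modeled as a store that keeps the real origin next to the Blob and hands out a bare "blob:uuid" URL. The names BlobURLStore, create, resolve, and revoke are hypothetical. Unique origins are modeled here as fresh object() instances, which compare equal only to themselves; that is an assumption of this sketch, not something the thread specifies.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    blob: bytes
    origin: object  # tuple for ordinary origins, object() for unique ones


class BlobURLStore:
    def __init__(self) -> None:
        self._entries: dict[str, Entry] = {}

    def create(self, blob: bytes, origin: object) -> str:
        # The URL carries no origin; the store remembers it instead.
        url = f"blob:{uuid.uuid4()}"
        self._entries[url] = Entry(blob, origin)
        return url

    def resolve(self, url: str, requesting_origin: object):
        entry = self._entries.get(url)
        # Tuples compare by value; object() compares by identity,
        # so a unique origin only ever matches itself.
        if entry is None or entry.origin != requesting_origin:
            return None
        return entry.blob

    def revoke(self, url: str) -> None:
        self._entries.pop(url, None)


store = BlobURLStore()
sandbox_origin = object()  # stands in for a sandboxed iframe's unique origin
url = store.create(b"hello", sandbox_origin)
print(store.resolve(url, sandbox_origin))
print(store.resolve(url, object()))
```

Because the origin never leaves the store, nothing has to be serialized to "null" and parsed back, which is the failure mode described above.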

Comment 19 Anne 2014-06-09 14:10:17 UTC

As I said on IRC, I don't see how that's a problem given how the blob URL store works.

Comment 20 Glenn Maynard 2014-06-09 14:32:31 UTC

I'm not sure what you're saying.  (You said you don't care about file:, which is why I've brought up other cases this approach is wrong for.)

The approach we reached on the list and that we were working on here is the "explicit origin" approach, where the origin is stored in the URL and then extracted as a string.  That doesn't work with unique origins, and "blob URLs just won't work inside data: and sandboxed iframes" is obviously not an option.  So, the approach needs to be reevaluated.

Comment 21 Anne 2014-06-09 14:38:03 UTC

Again, the whole origin check is largely unneeded as I explained due to the blob URL store being scoped to browsing contexts that can reach each other which is document.domain reach. For browsing contexts with unique origins that's obviously only same-origin browsing contexts and therefore the whole null thing is kind of irrelevant.

Comment 22 Glenn Maynard 2014-06-09 15:08:45 UTC

(In reply to Anne from comment #21)
> Again, the whole origin check is largely unneeded as I explained due to the
> blob URL store being scoped to browsing contexts that can reach each other
> which is document.domain reach. For browsing contexts with unique origins
> that's obviously only same-origin browsing contexts and therefore the whole
> null thing is kind of irrelevant.

I don't remember any discussion about "the whole origin check is largely unneeded", but that's the same as what I said: doing the origin check through the blob URL store (implicit) and not by comparing something serialized into the URL (explicit).

Comment 23 Arun 2014-06-09 17:37:56 UTC

(In reply to Glenn Maynard from comment #22)
> (In reply to Anne from comment #21)
> > Again, the whole origin check is largely unneeded as I explained due to the
> > blob URL store being scoped to browsing contexts that can reach each other
> > which is document.domain reach. For browsing contexts with unique origins
> > that's obviously only same-origin browsing contexts and therefore the whole
> > null thing is kind of irrelevant.
> 
> I don't remember any discussion about "the whole origin check is largely
> unneeded", but that's the same as what I said: doing the origin check
> through the blob URL store (implicit) and not by comparing something
> serialized into the URL (explicit).


Well, let's discuss the origin check and properly defining the Blob URL Store in the bug Anne filed for it, namely Bug 25987.

Anne: in order to get *this* bug right, tell me if this is the right approach:

1. Use Basic URL Parser to parse a Blob URL into relevant components. File API is chiefly interested in scheme data, though presently we allow fragment but make no explicit mention of any other URL components.

2. Since the scheme data portion returned by 1. above is itself a URL, run the Basic URL Parser again on the scheme data component returned by 1. Since scheme data is itself a Unicode serialized origin string, the scheme, host, port tuple should suffice for origin extraction without string manipulation heroics that the spec now has. Strictly speaking, if we get Bug 25987 right, this may not be necessary, but good to have around.

OK?
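
Illustrative sketch (not from the thread): the two-step idea above, parse the Blob URL and then parse its scheme data again, approximated with Python's urllib.parse. urlsplit is not the Basic URL Parser from the URL Standard, so this only mimics the shape of the algorithm. The function name origin_tuple_from_blob_url and the DEFAULT_PORTS table are hypothetical.

```python
from urllib.parse import urlsplit

# Assumption: only these schemes get a default port in this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_tuple_from_blob_url(url: str):
    """Return (scheme, host, port) for the nested URL, or None on failure."""
    outer = urlsplit(url)
    if outer.scheme != "blob":
        return None
    # urlsplit leaves the nested URL (minus query/fragment) in .path.
    inner = urlsplit(outer.path)
    if inner.hostname is None:
        return None  # nothing to build a tuple from
    port = inner.port if inner.port is not None else DEFAULT_PORTS.get(inner.scheme)
    return (inner.scheme, inner.hostname, port)


print(origin_tuple_from_blob_url(
    "blob:http://example.org:8080/33107ee0-ecbd-11e3-ac10-0800200c9a66"
))
```

Returning a tuple rather than a substring avoids depending on where the "/" before the UUID falls.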

Comment 24 Anne 2014-06-09 17:44:20 UTC

You don't need to define 1. That is already handled by the generic URL parser.

Comment 25 Glenn Maynard 2014-06-10 00:34:00 UTC

Assuming bug 25987, the only reason to do this at all is to be able to expose the serialized origin through new URL(url).origin without the result changing after the blob has been revoked, right?  (I think that was just a nice side-effect of embedding the origin in the URL, which was put there for the no longer needed same-origin check, so we should make sure it's still worth doing on its own and not just momentum from a decision that no longer applies...)

Comment 26 Anne 2014-06-10 05:46:10 UTC

No, there's also document.domain. (Assuming this store is shared across globals and you don't have to use the same entry settings object's global object to get to it.)

Comment 27 Arun 2014-06-18 21:22:54 UTC

(In reply to Anne from comment #24)
> You don't need to define 1. That is already handled by the generic URL
> parser.

Done per http://dev.w3.org/2006/webapi/FileAPI/#extractionOfOriginFromIdentifier

Comment 28 Anne 2014-06-19 09:11:46 UTC

Don't check for "parse error", see if the algorithm returns failure.

Also, it may not return a host or port. You should probably check if it's a relative scheme.

I think you probably just want to parse it and then say something like "return /parsedURL/'s origin".

But maybe I should just bite the bullet and define origin for all URL schemes we care about.

Comment 29 Arun 2014-06-23 18:38:39 UTC

(In reply to Anne from comment #28)
> Don't check for "parse error", see if the algorithm returns failure.


Done!


> Also, it may not return a host or port. You should probably check if it's a
> relative scheme.


It's true that it could be a relative scheme, so I check for a relative scheme per your advice. But, it will never be a relative URL; that is, the recursive invocation of the basic URL parser will never also be fed a base URL. So, host will always be extractable, since the emitter methods (URL.createFor and URL.createObjectURL) will always emit a full origin string, namely the effective script origin specified by the settings object.

> 
> I think you probably just want to parse it and then say something like
> "return /parsedURL/'s origin".


Done! 

> But maybe I should just bite the bullet and define origin for all URL
> schemes we care about.


If you do this, some portions of this in File API might be made redundant. We've already made the section on request-response informative; other sections may follow suit.

Please give it a check: 
http://dev.w3.org/2006/webapi/FileAPI/#extractionOfOriginFromIdentifier

Comment 30 Anne 2014-06-24 08:25:23 UTC

(In reply to Arun from comment #29)
> It's true that it could be a relative scheme, so I check for a relative
> scheme per your advice. But, it will never be a relative URL; that is, the
> recursive invocation of the basic URL parser will never also be fed a base
> URL. So, host will always be extractable, since the emitter methods
> (URL.createFor and URL.createObjectURL) will always emit a full origin
> string, namely the effective script origin specified by the settings object.

This is not true, e.g.

blob:data:test

would not have a host when recursively parsed.


Anyway, of the algorithm you have now step 3 could be removed and then it seems fine. I'll look into defining the origin of URLs in the URL Standard.
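
Illustrative sketch (not from the thread): the blob:data:test case can be checked with a nested parse. Python's urlsplit again stands in for the URL Standard's parser, which is an assumption of this sketch, and the function name nested_host is hypothetical.

```python
from urllib.parse import urlsplit


def nested_host(blob_url: str):
    """Parse the blob URL, then parse its scheme data, and return the host."""
    scheme_data = urlsplit(blob_url).path
    return urlsplit(scheme_data).hostname  # None when there is no authority


print(nested_host("blob:http://example.org:8080/33107ee0-ecbd-11e3-ac10-0800200c9a66"))
print(nested_host("blob:data:test"))
```

An algorithm that assumes a host is always present therefore needs an explicit failure path for inputs like the second one.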

Comment 31 Arun 2014-06-25 19:15:10 UTC

(In reply to Anne from comment #30)
> (In reply to Arun from comment #29)
> > It's true that it could be a relative scheme, so I check for a relative
> > scheme per your advice. But, it will never be a relative URL; that is, the
> > recursive invocation of the basic URL parser will never also be fed a base
> > URL. So, host will always be extractable, since the emitter methods
> > (URL.createFor and URL.createObjectURL) will always emit a full origin
> > string, namely the effective script origin specified by the settings object.
> 
> This is not true, e.g.
> 
> blob:data:test
> 
> would not have a host when recursively parsed.


True. For http and https URLs, which will always have a host given how the emitting methods work, this won't be an issue. In Chrome, however, I can generate Blob URLs such as blob:chrome://newtab/d1cec0be-a757-4a91-923c-9be1f7fb8ff4, which would be an issue. These schemes aren't defined for use here.

 
> Anyway, of the algorithm you have now step 3 could be removed and then it
> seems fine. I'll look into defining the origin of URLs in the URL Standard.


Done!