Bug 19554 - Needed: An Algorithmic "Hook" For Culling URL.objectURLs with autoRevoke=true
Summary: Needed: An Algorithmic "Hook" For Culling URL.objectURLs with autoRevoke=true
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML (show other bugs)
Version: unspecified
Hardware: Other Windows 3.1
: P1 blocker
Target Milestone: Unsorted
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
URL: http://www.whatwg.org/specs/web-apps/...
Whiteboard:
Keywords:
Depends on:
Blocks: 19558
  Show dependency treegraph
 
Reported: 2012-10-16 15:13 UTC by Arun
Modified: 2012-12-07 22:58 UTC (History)
7 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Arun 2012-10-16 15:13:59 UTC
Currently, URL.createObjectURL is specified to work with autoRevoke set to true, which allows Blob URLs to be valid past a provided stable state.

But stable state is not a sufficient concept, nor is the misdirected Bug 17988 which seeks to have microtask defined as a concept.

What is needed here is: something that's always invoked when exiting the outermost script (including from parser-invoked script) .  If this is provided in HTML5, we can use it to describe autoRevoke semantics.  This is useful for the createObjectURL default pattern, which avoids persisting string-based references to in-memory objects, minimizing leaks.
Comment 1 Masatoshi Kimura 2012-10-16 18:11:10 UTC
Creating a new hook everytime new API is added? How can one understand the difference stable state, microtask, hooks for autoRevoke stuff, hooks for xxx, etc?
Comment 2 Arun 2012-10-16 19:25:06 UTC
(In reply to comment #1)
> Creating a new hook everytime new API is added? How can one understand the
> difference stable state, microtask, hooks for autoRevoke stuff, hooks for
> xxx, etc?

We're not (hopefully) going to do anything more here than have something that's always invoked when existing the outermost unit of script of same-origin script browsing contexts.  To answer your specific question:

1. You can either "await a stable state" or "provide a stable state" which are defined here: http://www.whatwg.org/specs/web-apps/current-work/multipage/webappapis.html#provide-a-stable-state

These aren't good for autoRevoke = true; see for example https://bugzilla.mozilla.org/show_bug.cgi?id=773132#c7

2. Microtask doesn't exist, and we should stop referring to it as a real thing.  Instead, user agents can "perform a microtask checkpoint."  This is defined here:  http://www.whatwg.org/specs/web-apps/current-work/multipage/webappapis.html#perform-a-microtask-checkpoint

But we can't simply "perform a microtask checkpoint" to get to a point to cull Blob URLs created with autoRevoke=true; some discussion of the pros and cons of this have been discussed on list: http://lists.w3.org/Archives/Public/public-webapps/2012JanMar/1316.html

So I think we'll need something more concrete.
Comment 3 Arun 2012-10-16 19:30:51 UTC
(In reply to comment #1)
> Creating a new hook everytime new API is added? How can one understand the
> difference stable state, microtask, hooks for autoRevoke stuff, hooks for
> xxx, etc?

We're not (hopefully) going to do anything more here than have something that's always invoked when exiting the outermost unit of script of same-origin script browsing contexts.  To answer your specific question:

1. You can either "await a stable state" or "provide a stable state" which are defined here: http://www.whatwg.org/specs/web-apps/current-work/multipage/webappapis.html#provide-a-stable-state

These aren't good for autoRevoke = true; see for example https://bugzilla.mozilla.org/show_bug.cgi?id=773132#c7

2. Microtask doesn't exist, and we should stop referring to it as a real thing.  Instead, user agents can "perform a microtask checkpoint."  This is defined here:  http://www.whatwg.org/specs/web-apps/current-work/multipage/webappapis.html#perform-a-microtask-checkpoint

But we can't simply "perform a microtask checkpoint" to get to a point to cull Blob URLs created with autoRevoke=true; some discussion of the pros and cons of this have been discussed on list: http://lists.w3.org/Archives/Public/public-webapps/2012JanMar/1316.html

So I think we'll need something more concrete.
Comment 4 Ian 'Hixie' Hickson 2012-10-17 19:34:23 UTC
Ok, how's this. You just need to define that your spec's "post-script clean-up steps", and then it gets invoked whenever script execution finishes.

Is that sufficient?
Comment 5 contributor 2012-10-17 19:34:33 UTC
Checked in as WHATWG revision r7467.
Check-in comment: Attempt to define a hook for the File API.
http://html5.org/tools/web-apps-tracker?from=7466&to=7467
Comment 6 Arun 2012-10-17 21:24:09 UTC
The sequence of steps for an autoRevoke=true invocation (true by default) of createObjectURL might be:

1. Return a Blob URI, and run the rest of these steps asynchronously.
2. As part of the post-script cleanup steps, revoke the Blob URI returned in 1. above, which is equivalent to calling revokeObjectURL on it, and terminate this algorithm.

Does 2. need to be more specific about the script block itself?
Comment 7 Glenn Maynard 2012-10-17 22:15:40 UTC
(In reply to comment #4)
> Ok, how's this. You just need to define that your spec's "post-script
> clean-up steps", and then it gets invoked whenever script execution finishes.
> 
> Is that sufficient?

Only the outermost call should trigger revocation.  Calling a black-box library function that happens to synchronously dispatch an event shouldn't cause URLs to be mysteriously revoked.  They should only be revoked after the task returns to the event loop (the outermost task, if "spin the event loop" can cause that), to prevent the URL from being revoked too soon.

(They should also be revoked at </script>, but only if it's not coming from document.write(), for the same reason.)
Comment 8 Ian 'Hixie' Hickson 2012-10-17 23:44:15 UTC
Arun: Sounds reasonable.

Glenn: Surely you want blob:s declared in the event handler itself to be revoked before the event handler returns, though, right?

Maybe things are more complicated than it was suggested above, and we need some sort of scoping mechanism. Arun?
Comment 9 Glenn Maynard 2012-10-18 02:55:15 UTC
I think it's fine for URLs created within a callback to still exist to the caller, until the "real" top-level entry point returns.  It doesn't seem harmful for URLs to last a little longer than expected, as long as it's well-defined.  It's still happening synchronously.

(There might even be cases where it's useful, eg. to return a URL from an event handler through the event object to the dispatcher, though I don't have a particular use case off-hand and this is a bit esoteric.)
Comment 10 Arun 2012-10-18 13:28:29 UTC
(In reply to comment #8)
> Arun: Sounds reasonable.
> 
> Glenn: Surely you want blob:s declared in the event handler itself to be
> revoked before the event handler returns, though, right?
> 
> Maybe things are more complicated than it was suggested above, and we need
> some sort of scoping mechanism. Arun?

(In reply to comment #8)
> Arun: Sounds reasonable.
> 
> Glenn: Surely you want blob:s declared in the event handler itself to be
> revoked before the event handler returns, though, right?
> 
> Maybe things are more complicated than it was suggested above, and we need
> some sort of scoping mechanism. Arun?

The "real top level entry-point" (from Comment 9) is what we want, ideally.  We've called that the "outermost script" before (which was once erroneously called a "microtask" in Bugzilla).  I like the idea of post-script cleanup steps. Alongside "jump to a code entry-point" can we have something like "... when the outermost entry-point of the invoking script..." and define some invariants for what "outermost entry-point.." means?  Then, along with post-script cleanup, we have enough fodder for autoRevoke behavior.

Such invariants can include:

1. </script> but not document.write closures of scripts.
2. When callbacks are called, etc.
Comment 11 Ian 'Hixie' Hickson 2012-10-18 18:43:12 UTC
I think it's bad that this:

   var blob = ...;
   a.addEventListener('click', function () {
     window.globalURL = URL.createObjectURL(blob)
   });
   a.addEventListener('click', function () {
     var img = new Image();
     img.src = window.globalURL;
     document.body.appendChild(img);
   });

...should work fine if you use dispatchEvent() but not if the user clicks. It makes no sense.

IMHO we should be scoping blob URLs to the current script execution, and in the case of reentrant script execution, we should thus clean up the inner execution's blobs before returning. Anything else is going to lead to the API appearing completely crazy.
Comment 12 Glenn Maynard 2012-10-18 21:46:03 UTC
It makes sense to me, if you're dispatching both events in the same task.  As a developer I don't think of nested JavaScript invocations as being much different than just calling the function directly and swallowing exceptions; there just happens to be native code lying in the middle of the call stack.

I'm not sure it would be harmful (at least I can't think of any specific use case), but I don't see much benefit for the extra complexity, either.
Comment 13 Arun 2012-10-19 16:52:20 UTC
The scenario in Comment 11 will oblige developers to switch to autoRevoke=false since it's just about "one over the finish line" from where autoRevoke=true would get us.  Hixie is right; it does make no sense, and I'd find it maddening as a developer.

I'd assert that what we want to follow is the 80-20 rule.  The "sweet spot" (80% of use cases) might resemble this:

var blob = ...;
img1.src = URL.createObjectURL(blob); /*autoRevoke=true by default*/

For such cases, I think it isn't worth it to grant a lease on life to the Blob URL till past the outermost script block.  Of course, it makes sense to encourage developers to use autoRevoke="false" as little as possible, but if we stuck to the proposals in Comment 9 and Comment 10, we'd *also* have to specify (or explain) the scenarios that might be unexpected.  That doesn't seem worth it.

So perhaps the original patch is good enough after all.  I've Cc'd enough people that may wish to chime in with a contrary opinion, but I'm happy to rest (and thanks for the rapid turnaround).
Comment 14 Glenn Maynard 2012-10-19 17:56:09 UTC
It's not.  It's brittle, causing URLs to be revoked at seemingly random times--times which will change depending on the internal behavior of black-box library calls.

function f()
{
    var img = document.getElementById("my-image");
    img.src = URL.createObjectURL(blob);
    img.alt = getText();
    var imgThumb = document.getElementById("my-image-thumb");
    imgThumb.src = img.src;
    imgThumb.width = imgThumb.height = 64;
}

This works fine--until getText is changed to use a module framework that implements lazy loading with documentElement.appendChild(script), or to search a document for the desired text using TreeWalker (which uses callbacks), and who knows what else.

(In reply to comment #13)
> The scenario in Comment 11 will oblige developers to switch to
> autoRevoke=false since it's just about "one over the finish line" from where
> autoRevoke=true would get us.  Hixie is right; it does make no sense, and
> I'd find it maddening as a developer.

That doesn't make sense.  There's nothing in #11 that's fixed with autoRevoke: false.  It makes perfect sense for URLs to be released when the current task completes, because nested callbacks are still running in the current task.

(In reply to comment #13)
> For such cases, I think it isn't worth it to grant a lease on life to the
> Blob URL till past the outermost script block.  Of course, it makes sense to
> encourage developers to use autoRevoke="false" as little as possible, but if
> we stuck to the proposals in Comment 9 and Comment 10, we'd *also* have to
> specify (or explain) the scenarios that might be unexpected.  That doesn't
> seem worth it.

In every case you have something unexpected that needs to be explained.  In the current spec, these unexpected cases will cause mysterious code breakage.  By revoking the URL at the outermost script, breakage will be vanishingly rare, and can never be caused by innocuous changes inside third-party library calls.  (Scoping URLs to the invocation would also need to be explained, since that's not obvious, either.)

In the current spec, even if I don't use blob URLs *at all*, I still have extra work: in any library I release, I need to document whether API calls can cause blob URLs to be revoked, and I now have to worry any time I use one of these APIs that I might be breaking third-party code using my library.

There's another big problem: it causes empty event listeners to have side-effects.  The following code should have no side-effects:

element.addEventListener("click", function(e) { }, false);

It looks like in the current spec, blob URLs are only expired if one or more event handlers exist to be invoked.

If you want to deal with the extra work and complexity of scoping blob URLs to the current nesting level, then go ahead.  It means the feature will be that much more complicated (and there's already more complexity coming, for "pinning" blobs when their URLs are used), for little gain, and I worry that layering so much complexity is going to keep this from getting implemented.  It would also make other things more complicated, since polyfills of *other* APIs (like TreeWalker) would need to do the same thing.

But releasing blob URLs due to calls *lower* on the call stack is just broken.
Comment 15 Arun 2012-10-19 18:28:45 UTC
(In reply to comment #14)
> It's not.  It's brittle, causing URLs to be revoked at seemingly random
> times--times which will change depending on the internal behavior of
> black-box library calls.
> 
> function f()
> {
>     var img = document.getElementById("my-image");
>     img.src = URL.createObjectURL(blob);
>     img.alt = getText();
>     var imgThumb = document.getElementById("my-image-thumb");
>     imgThumb.src = img.src;
>     imgThumb.width = imgThumb.height = 64;
> }
> 
> This works fine--until getText is changed to use a module framework that
> implements lazy loading with documentElement.appendChild(script), or to
> search a document for the desired text using TreeWalker (which uses
> callbacks), and who knows what else.

Can you explain why you say the scenario in Comment 11 is NOT fixed by autoRevoke:false?  I agree that switching to a more elaborate module framework will break autoRevoke:true in the example above, but this holds true even in the "click" scenario in Comment 11, no?  Namely, that it is possible to create code that breaks autoRevoke=true.  The goal is to address *most* of the use cases without requiring autoRevoke:false.  Do you think this goal is NOT met with the existing proposal?


> In every case you have something unexpected that needs to be explained.  In
> the current spec, these unexpected cases will cause mysterious code
> breakage.  By revoking the URL at the outermost script, breakage will be
> vanishingly rare, and can never be caused by innocuous changes inside
> third-party library calls.  (Scoping URLs to the invocation would also need
> to be explained, since that's not obvious, either.)

Not necessarily to the invocation, but to the post-script cleanup following that block, which may not happen at the invocation.  But this is still a lower-on-the-stack approach than what you've said.


> If you want to deal with the extra work and complexity of scoping blob URLs
> to the current nesting level, then go ahead.  It means the feature will be
> that much more complicated (and there's already more complexity coming, for
> "pinning" blobs when their URLs are used), for little gain, and I worry that
> layering so much complexity is going to keep this from getting implemented. 
> It would also make other things more complicated, since polyfills of *other*
> APIs (like TreeWalker) would need to do the same thing.
> 
> But releasing blob URLs due to calls *lower* on the call stack is just
> broken.


So what *exactly* is your ask?  Does it resemble Comment 10, namely: 

1. Determining an "outermost entry point" in WHATWG/HTML5 and
2. Harnessing Blob URL revocation to post-script cleanup steps at that point?

Or something else?  Can you strawperson a proposal?
Comment 16 Ian 'Hixie' Hickson 2012-10-19 18:52:44 UTC
Assume we have this helped function:

  function getImage(url, width) {
     var result = new Image();
     result.src = url;
     result.width = width;
  }

The original patch (what the spec now says) is definitely no good, because it means this fails:

  var blob = ...;
  var img = getImage(URL.createObjectURL(blob), getPreferredWidth());

...works, whereas this:

  var blob = ...;
  var img = getImage(URL.createObjectURL(blob), getPreferredWidth());

...fails.

What's the difference? getPreferredWidth() in the first one just returns a constant, whereas getPreferredWidth() in the second one synchronously dispatches an event to get something to update its metrics.

That's a non-starter, as Glenn says.

I think the solution is that instead the blob: URLs have to have a lifetime scoped to their instance of the "jump to a code entry point" algorithm.

I've specced this now.

Instead of a "post-script clean-up steps" algorithm, the File API just "adds the following job to the global script clean-up jobs list" when the autoRevoke behaviour is enabled, that "job" being whatever steps it takes to clean up a blob: URL. It works much like a task. The HTML spec takes care of the scoping.

How's that?
Comment 17 contributor 2012-10-19 18:54:22 UTC
Checked in as WHATWG revision r7469.
Check-in comment: Attempt to define a better hook for the File API
http://html5.org/tools/web-apps-tracker?from=7468&to=7469
Comment 18 Glenn Maynard 2012-10-19 21:44:05 UTC
(In reply to comment #15)
> Can you explain why you say the scenario in Comment 11 is NOT fixed by
> autoRevoke:false?

Sorry, I'll restate.  autoRevoke is not causing a problem that would be fixed by having "execution-scoped" URLs.  That wouldn't make the code work; it'd just make it fail in both cases.  In other words, the "scoped URLs" approach would *not* prevent people from disabling autoRevoke in this case.  (And nothing will, of course, by the nature of the feature.)

> I agree that switching to a more elaborate module
> framework will break autoRevoke:true in the example above, but this holds
> true even in the "click" scenario in Comment 11, no?  Namely, that it is
> possible to create code that breaks autoRevoke=true.  The goal is to address
> *most* of the use cases without requiring autoRevoke:false.  Do you think
> this goal is NOT met with the existing proposal?

All of these approaches meet that goal.  It's the universal "not having brittle or confusing behavior" goal that wasn't being met.

> Not necessarily to the invocation, but to the post-script cleanup following
> that block, which may not happen at the invocation.  But this is still a
> lower-on-the-stack approach than what you've said.

Sorry, I'm not sure what you're saying.  In any case, what Ian just committed effectively scopes URLs to the invocation, by triggering clean-up jobs immediately after the script finishes executing.

> So what *exactly* is your ask?  Does it resemble Comment 10, namely: 
> 
> 1. Determining an "outermost entry point" in WHATWG/HTML5 and
> 2. Harnessing Blob URL revocation to post-script cleanup steps at that point?
> 
> Or something else?  Can you strawperson a proposal?

Most importantly, that blob URLs are not mysteriously revoked as a side-effect of other things.  The commit in comment #17 accomplishes this.  I think it's a bit of overdesign that doesn't improve the feature, but it doesn't seem to cause any new problems aside from making things a little more complicated, so I can live with it.
Comment 19 Jonas Sicking 2012-10-19 21:57:38 UTC
(In reply to comment #11)
> IMHO we should be scoping blob URLs to the current script execution, and in
> the case of reentrant script execution, we should thus clean up the inner
> execution's blobs before returning. Anything else is going to lead to the
> API appearing completely crazy.

"the current script execution" is very ill defined though. Does Array.map create a new script execution for each call? What about in Chrome where Array.map is implemented in JS?
Comment 20 Ian 'Hixie' Hickson 2012-10-19 22:52:31 UTC
It's whenever you go through the "jump to a code entry point" algorithm. Nothing in the JS spec ever does that. It's possible some things in the HTML spec should explicitly say that but don't; those aren't a new problem, they are already broken in that they don't set the "entry script" properly and need to be fixed anyway. (I don't know if there actually are such bugs; if there are please file them.)

Marking FIXED; don't hesitate to reopen if the fix isn't good.
Comment 21 Glenn Maynard 2012-10-20 00:00:31 UTC
If the end result is some APIs which call callbacks synchronously having a "URL scope" and others not--if TreeWalker has this behavior but Array.map don't--then I think that's bad.  To web developers, they're all the same.  If so, it seems like a good reason to define this purely in terms of the top-level task.

But I'll defer to sicking since I don't know if that's what will actually happen, so I haven't reopened.

I like the "top-level task" idiom because it seems to make it easy to make this all very consistent (combined with considering the example in comment #11 to make perfect sense, but maybe I'm alone in that).
Comment 22 Jonas Sicking 2012-10-20 00:31:58 UTC
Yeah, I agree. I think this creates more uncertainty and complexity than using the same checkpoints as for mutation observers.

The result is that DOM APIs and ES APIs would behave very differently. Even though to authors they look exactly the same. And it would likely be a performance overhead if we move towards implementing more of the DOM in JS. (Additionally it'd be easy to introduce bugs when that happens since implementations are likely to forget to run this extra checkpoint in each API that is migrated to JS).

I agree that the scenario in comment 11 can happen and can cause problems. But I think those problems will be smaller than the confusion with regards to DOM APIs and ES APIs behaving differently.

And do note that things are well defined either way. It's not something that will behave differently in different browsers.
Comment 23 Glenn Maynard 2012-10-20 14:35:37 UTC
Hmm.  Actually, Array.map in particular might be a case where you don't want execution-scoped revocation, so you can say things like

blobs = [blob1, blob2, blob3];
urls = blobs.map(function(blob) { return URL.createObjectURL(blob); });

without getting a bunch of revoked URLs back.
Comment 24 Ian 'Hixie' Hickson 2012-10-22 18:19:45 UTC
I agree that callback-based synchronous APIs shouldn't trigger this. I don't think that they do trigger this, but if it's ambiguous then we should fix that.

I think dispatchEvent() is a very different case that _should_ trigger this.
Comment 25 Glenn Maynard 2012-10-22 18:53:57 UTC
How is it different?  To me as a web developer, a function being called synchronously is a function being called synchronously, whether it's done by Array.map or e.dispatchEvent.
Comment 26 Olli Pettay 2012-10-22 18:56:28 UTC
I don't see either why dispatchEvent would be any different than other
sync callback handling.
Comment 27 Ian 'Hixie' Hickson 2012-10-22 21:34:17 UTC
Because events aren't only dispatched by script. See comment 11.
Comment 28 Glenn Maynard 2012-10-22 22:49:20 UTC
They're still just callbacks being invoked.  They're top-level script invocations in some instances and not in others, and that causes different behavior, but the difference is straightforward to me once the criteria is described: "autorevoke URLs are revoked when the topmost script returns to the browser".  Having some callback APIs cause new "URL scopes" and others not seems less clear, unless there's a similar general rule that can describe it (eg. not "this list of APIs will revoke URLs, and this other list of APIs won't").
Comment 29 Jonas Sicking 2012-10-23 00:28:34 UTC
(In reply to comment #27)
> Because events aren't only dispatched by script. See comment 11.

This depends entirely on the event. If a page dispatches custom events then those events are only dispatched from script and from the page's point of view simply looks like any other callback mechanism.

I don't think there are any good solutions here. But I think using the top-level script invocation is the least bad solution because for most practical purposes things behave as expected, and for the others at least it will behave consistently among browsers.
Comment 30 Ian 'Hixie' Hickson 2012-10-24 00:08:10 UTC
Do other callback-style APIs occur both in nested script invokations (like dispatchEvent) _and_ as their own tasks (like user interaction events)?
Comment 31 Glenn Maynard 2012-10-27 03:15:21 UTC
I can't think of any, for what it's worth.  The only path I can think of that might have a similar effect is "spin the event loop".  If it can, having it create a new URL scope might make sense, as if its script invocations are top-level even though they're possibly running underneath another script.
Comment 32 Ian 'Hixie' Hickson 2012-12-07 22:56:47 UTC
Ok, went with comment 29.
Comment 33 contributor 2012-12-07 22:58:00 UTC
Checked in as WHATWG revision r7573.
Check-in comment: Simplify the Blob release mechanism, at the cost of exposing the Web's innards a bit more to script authors.
http://html5.org/tools/web-apps-tracker?from=7572&to=7573