This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.

Bug 20083 - appcache: JS server worker idea
Summary: appcache: JS server worker idea
Status: RESOLVED WONTFIX
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML
Version: unspecified
Hardware: Other
OS: other
Importance: P3 normal
Target Milestone: 2016 Q1
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-11-26 02:44 UTC by Ian 'Hixie' Hickson
Modified: 2014-09-25 22:44 UTC
CC List: 18 users

See Also:


Attachments

Description Ian 'Hixie' Hickson 2012-11-26 02:44:19 UTC
People regularly suggest that we make it possible for an appcache manifest to contain a same-origin reference to a JS file known as its interceptor. If we do this, I propose we do it as follows: when a Document's application cache is complete and has a declared interceptor, the networking model changes to a third model that acts as follows:

 - open a connection to a worker, or create one if none yet exists, that is an
   ApplicationCacheInterceptWorkerGlobalScope for the given application cache,
   using the JS file for the interceptor as mentioned in the manifest.

 - each time there is a network request to the same origin as the manifest,
   send a MessageEvent event to this worker using the event name "request",
   whose payload is an object of the following form:
      {
         method: 'GET', // or POST or whatever, '' for non-HTTP(S) origins
         url: 'http://www.example.com/file.png', // the url being fetched
         headers: {
           'header': ['value', 'value'] // each HTTP request header
         },
         body: '', // the request body (e.g. for POST requests)
         port: a_MessagePort_object,
      }

 - The passed port expects data in the following manner (a sketch of an
   interceptor script follows this list):
    - the first message to be sent has to be one of these:
       - a Blob or File, which is treated as the resource payload.
       - null, which is treated like {} as described below.
       - an object with an attribute named "action", whose value is
         interpreted as follows:
           - "passthrough": fall back to the normal appcache net model.
            - "cache": serve the file from the cache, or act as if it is a
              network error if the file isn't there.
           - "network": do it via the network, ignoring the cache.
           - anything else: act as if the "action" attribute is
             absent, as described next.
       - an object without the "action" attribute, which is then
         treated as meaning the resource had a network error.
       - anything else, which is stringified and then treated as the
         response including headers, but possibly incomplete.
    - the second and subsequent messages, which are only acted upon if
      the first was not an object, are either of these:
       - null, which is treated like {} as described below.
       - an object, in which case the resource is assumed to be
         finished, as if the network connection had closed.
        - anything else, which is stringified and treated as more
          response data.
     - if there's a Content-Length header, and data is transmitted past
      the specified size (as interpreted per HTTP rules), then the
      extraneous data is discarded.

 - swapCache() disconnects from the worker if there is one (so that
   the new cache's worker can kick in if necessary).
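
For illustration, an interceptor script under this proposal might look something like the sketch below.  The "request" event, the payload shape, and the port protocol are the hypothetical ones described above; nothing here is an existing API.

    // Hypothetical interceptor script running in an
    // ApplicationCacheInterceptWorkerGlobalScope.
    self.addEventListener('request', function (event) {
      var req = event.data;             // { method, url, headers, body, port }
      var port = req.port;

      if (req.url.indexOf('/api/') !== -1) {
        // Dynamic data: always go to the network, ignoring the cache.
        port.postMessage({ action: 'network' });
      } else if (req.method === 'GET') {
        // Static resources: serve from the cache, network error if absent.
        port.postMessage({ action: 'cache' });
      } else {
        // Synthesize a response: a non-object first message is stringified
        // and treated as the response including headers...
        port.postMessage('HTTP/1.1 200 OK\r\n' +
                         'Content-Type: text/plain\r\n\r\n' +
                         'hello from the interceptor');
        port.postMessage({});           // ...and an object marks it finished.
      }
    });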

One possible problem with this approach is that it doesn't kick in until a Document exists, which is after the first attempt at fetching a file: the main "master" file thus always comes from the cache (or maybe network, in the case of prefer-online stuff). Is that a problem?

A bigger problem is that it's not clear what the use cases are. Without knowing what they are, I'm probably not going to add anything.
Comment 1 David Barrett-Kahn 2013-01-15 17:14:05 UTC
I think it's worth asking whether this is particularly an appcache feature.  Having such a 'controller' be declared as part of an appcache manifest has disadvantages:
* As Ian points out, the system behaves differently before the cache is established to how it does afterwards. I think this is a real problem, as it makes the behavior of pages on an origin harder to understand and reason about.  Having to understand so many possible client states is one thing that makes our current appcache implementation so brittle.
* If the controller is part of one cache, it can't affect which cache a frame navigation binds to.  The limiting, hard-coded controller rules would still represent an obstacle in cases where an origin had more than one cache.

I'm drawn to an approach where there is an 'origin controller' script.  The browser would recognize one per origin.  It would be declared by the server in a response header, which the site would be expected to send in every response on the origin.  The first time the browser saw such a response header, it would fetch and run the script in parallel to the main request, ensuring that it 'took effect' starting from the very first request to the origin.  The controller would be called whenever a request was issued on its origin, and would be able to influence how that request was handled.

Things an origin controller might do (a sketch follows this list):
* Influence which cache bound to the request, and which items in caches were served in response to it
* Cause existing windows to be focused instead of new ones being created, where the request is best served by doing so.
* Perform redirects, for example in cases where a web application has changed its URL format and the URL presented by the user is in the 'old' format.
* Recognize poorly formed/illegal/abusive URLs before the request is even sent.
* Consult local storage as part of its decision making; this is also why the interface between the browser and the controller would have to be asynchronous.
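
As a hedged sketch of how such an origin controller might be wired up: the "Origin-Controller" response header and the onrequest/decide() hook below are made-up names used only to illustrate the asynchronous shape described above; the only real API involved is IndexedDB.

    // Server response (hypothetical header naming the controller script):
    //   Origin-Controller: /origin-controller.js

    // /origin-controller.js -- hook names are hypothetical.
    self.onrequest = function (request, decide) {
      var open = indexedDB.open('routing', 1);
      open.onupgradeneeded = function () {
        open.result.createObjectStore('routes');      // keyed by URL
      };
      open.onerror = function () { decide({ goToNetwork: true }); };
      open.onsuccess = function () {
        var tx = open.result.transaction('routes', 'readonly');
        var lookup = tx.objectStore('routes').get(request.url);
        lookup.onsuccess = function () {
          if (lookup.result) {
            // Consulting local storage is why the interface must be async.
            decide({ serveFromCache: lookup.result.cacheName });
          } else {
            decide({ goToNetwork: true });
          }
        };
      };
    };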

Later, the responsibilities of an origin controller could expand to cover other cases where cooperation between windows in an origin is desirable, such as responding to window open / close operations, or deciding how an application-level operation should best be performed given the current set of tabs open on that origin.  It would have much in common with the 'event page' feature of Chrome apps, although it probably wouldn't actually have a DOM.  If available, I think the origin controller could supersede shared workers in many cases, in a more resource-efficient way.  The execution environment of the controller could be stateless: state left lying around in local variables would not be kept between event loops.

Here are some other proposals I consider related.  It may not be obvious at first why, as they have little to do with appcache.  In my mind, however, they all adhere to the principle that an application is an origin, not a browsing context.

"We should support an asynchronous locking mechanism"
https://code.google.com/p/chromium/issues/detail?id=161072

"We should support a "broadcast" version of postMessage"
https://code.google.com/p/chromium/issues/detail?id=161072

"Should user gestures follow postMessage?"
https://code.google.com/p/chromium/issues/detail?id=161068
Comment 2 Tobie Langel 2013-01-15 17:15:35 UTC
Yes, yes, yes and yes!
Comment 3 David Barrett-Kahn 2013-01-15 20:30:54 UTC
Ian asked for some more material on use cases, and frustrations with the existing functionality.  Broadly, I want client-side interception of requests whenever the default handling is not going to give me what I want.

1. The request should be served from a local cache, but the rules for the use and selection of that cache are more complex than can be comfortably expressed in an appcache manifest.

Here I draw mainly from the experience of designing docs offline.  This often had the feeling of a constraint problem, with appcache's controller logic as the constraint.  Rather than structuring the application in the most sensible and maintainable way, I structured it in the only way I could come up with that would work with appcache's fixed rules.  The result was hard to reason about, very complicated, brittle, and almost impossible to extend.  Some of the problems we faced:

* I didn't want to use appcache while online, so I used a fallback entry.  But many URLs (some of them used in iframes) must always reach the network, so I had to put whitelist entries in many caches and never make a mistake.
* Which cache should be used depends not only on which docs product is being used, but on which document is being opened.  The local database must be consulted to determine this.  Right now we have something very like the origin controller behind a fallback entry, although it's slow and limited.
* The URLs of the cached versions of the application were different to those of the online one, so I had to rewrite the URLs from within the cached versions of the applications.
* Our URLs often have legal authentication prefixes (eg /a/google.com) but appcache can only do prefix matches.  Expensive and painful revision of our authentication code was required.
* If a request is discovered which isn't being handled properly and you can't find a way to somehow jam it into the existing rules, you can end up very stuck.
* If your existing prefix/url structure doesn't fit appcache very well, your only option is to rearrange it all.  Retrofitting appcache onto an existing application is very painful due to its inflexibility.

Now obviously I found ways to get around enough issues to launch the product, but boy was it painful and time consuming.  If I'd had true control over request fulfillment, the effort and risk involved in this project would have been much, much lower.  The use of regexes for fallback and whitelist entries would have helped tremendously, and might be enough to ease some of this pressure, but I see many other uses for the origin controller too.

2. The user hasn't used the most up to date URL for the resource, but we don't want to impose a latency penalty on them for that.

There are numerous situations where clients send URLs which are 'wrong but acceptable'.  An example from our world is that we recently changed our URL format from docs.google.com?id=xx to docs.google.com/d/xx.  The stock way of dealing with them is to send back a 301 or 302 redirect, but a lower-latency solution is desirable.
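
Purely as an illustration, under the port protocol sketched in the description above, an interceptor could synthesize that redirect locally and skip the server round trip entirely.  The URL shapes are the docs.google.com example from this comment; nothing beyond the hypothetical protocol from the description is assumed.

    self.addEventListener('request', function (event) {
      var req = event.data;
      var m = /[?&]id=([^&#]+)/.exec(req.url);        // old ?id=xx format
      if (m) {
        // Answer with a locally generated redirect to the new format.
        req.port.postMessage('HTTP/1.1 301 Moved Permanently\r\n' +
                             'Location: https://docs.google.com/d/' + m[1] +
                             '\r\n\r\n');
        req.port.postMessage({});                     // finished
      } else {
        req.port.postMessage({ action: 'passthrough' });
      }
    });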

3. The best way to respond to a request isn't to open a new window and fetch the resource.

Users navigate to URLs with some kind of intent in mind: they want to view some particular web content or perform some operation.  They don't necessarily want to view that web content in a new browsing context; it would often be more helpful to focus or re-purpose an existing one.  A controller could perform this operation - discern the intent expressed by a navigation request and fulfill it some other way.  In Docs, for example, rather than opening a second copy of a document, it could focus the existing one.  In Gmail, an attempt to send mail or view a particular mail could be fulfilled by focusing the Gmail window and sending it a postMessage instructing it what to do.  Same thing for composing a social media post.  The Google application navigation bar could behave much more sensibly, actually switching between applications rather than opening a whole new instance of whatever you ask for.
Comment 4 Ian 'Hixie' Hickson 2013-01-15 20:45:12 UTC
Sounds like what you really need here is basically a set of APIs and hooks to let you implement your own caching mechanism, rather than using appcache itself. Basically a way for an origin to intercept every request for resources in that origin (and any others that opt in), and for each request, to decide how to handle it (fetch from local DB, fetch from network and store in local DB, fetch from network and leave as is, generate on the fly, etc).

Would IndexedDB be a sufficient database for storing the files and data, or would we need to provide a dedicated storage system for network resources?
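
A minimal sketch of the storage half of that question, using nothing beyond standard IndexedDB and XMLHttpRequest: it downloads a resource, stores the body as a Blob keyed by URL, and reads it back later, which is roughly what an origin-controlled cache would need IndexedDB to do.

    function withStore(callback) {
      var open = indexedDB.open('resource-store', 1);
      open.onupgradeneeded = function () {
        open.result.createObjectStore('resources');   // keyed by URL
      };
      open.onsuccess = function () { callback(open.result); };
    }

    // Fetch a URL and store its body as a Blob.
    function storeResource(url) {
      var xhr = new XMLHttpRequest();
      xhr.open('GET', url);
      xhr.responseType = 'blob';
      xhr.onload = function () {
        withStore(function (db) {
          db.transaction('resources', 'readwrite')
            .objectStore('resources')
            .put(xhr.response, url);
        });
      };
      xhr.send();
    }

    // Read it back later, e.g. to satisfy an intercepted request.
    function loadResource(url, callback) {
      withStore(function (db) {
        var get = db.transaction('resources', 'readonly')
                    .objectStore('resources').get(url);
        get.onsuccess = function () { callback(get.result); };  // Blob or undefined
      });
    }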
Comment 5 Ian 'Hixie' Hickson 2013-01-15 20:50:16 UTC
(David adds in private communication: "One thing I do like from appcache and which I'd like preserved in some form is the atomic cache update, where it fetches all or nothing. That is valuable.")
Comment 6 Jake Archibald 2013-01-15 23:12:56 UTC
I think a separate request cache would be useful.  The use case is other-domain responses: a way for the developer to cache an other-domain resource and use it within the interceptor/router without ever being able to query its content (eg, CDN resources).
Comment 7 Jake Archibald 2013-01-15 23:18:51 UTC
Agree with David Barrett-Kahn, don't think the interceptor/router should be a child feature of appcache, I see it as a lower-level version of the manifest (with finer control, more features) when combined with a request store.
Comment 8 Chris Wilson 2013-01-15 23:27:52 UTC
(In reply to comment #7)
> Agree with David Barrett-Kahn, don't think the interceptor/router should be
> a child feature of appcache, I see it as a lower-level version of the
> manifest (with finer control, more features) when combined with a request
> store.

Yes.  It's really more that the current HTML5 appcache can be envisioned as one particular piece of code logic for a local request handler/controller/updater that has access to a local cache system.  That same local cache system could be used by other local controllers, with different code logic.
Comment 9 David Barrett-Kahn 2013-01-16 00:03:51 UTC
The controller is also useful for 'horizontal concerns', just as servlet filters are on the server side.
* Security - the controller might know more about the context in which a request was made than the server did, allowing it to make better decisions about XSRF and clickjacking prevention.  If standard libraries became common, even insecure web sites might be strengthened in a 'bolt on' way rather than requiring new functionality.
* Statistics keeping and performance measurement - the controller would know how long resources took to load.
* Authentication - forms of authentication not based on cookies, with all their limitations, could be contemplated.  The challenges Google faced implementing our multilogin scheme show how limited cookies really are.
* Abuse and attack prevention.  A central qps limiter for the whole origin which was under the server's control.

A big problem with this proposal is just how powerful it could be.  Analyzing it from a security and abuse perspective will be difficult.  Clearly capabilities would have to be introduced into it slowly and carefully once the basic structure was established.
Comment 10 David Barrett-Kahn 2013-01-16 19:17:14 UTC
More discussion occurred internally about what kind of storage mechanism this controller might use to cache its files.

Mozilla and Google have pursued different courses: Chrome has its filesystem API, and Mozilla favors an approach where resources are stored in IDB.  I related our experiences with the Chrome filesystem API:

* Getting files into it is much more painful than it should be.  Doing 'download this URL and put it here' takes far too much code.
* No built in ability to atomically fetch a set of files, which is what I'd want if it was to replace appcache.
* The fact that it sniffs the MIME type based on the file extension rather than allowing me to set it in the file metadata at write time makes me nervous that it'll get it wrong or that a file format won't be covered.
* Having to put a MIME-type extension on the name of a file whose name is an ID makes that file hard to retrieve, as I have to know the MIME type in order to fetch the file.  If I don't know it, I have to do a directory list operation to find out its name.
* The API was just kind of 'busy': lots of calls, lots of mucking around to do what seemed like simple tasks.
* The fact that it issues these special URLs had an unexpected downside: when you right click on an image and say 'copy image URL', that URL isn't usable in any context outside the origin.
* No easy way to stream large resources off the network into a filesystem file without staging them in memory.  It may be possible, but would require significant effort.

Verdict: did what we wanted, but our experience was 'just OK'.

Several people voiced support for the atomic update aspect of appcache as being something worth saving.

Concern was raised about the controller's execution context becoming a bottleneck for parallel requests.  It seems likely that this context would need to be completely stateless, and capable of simultaneous execution.  Blocking/serialization might occur when accessing the database, but should not occur in the controller execution itself.
Comment 11 David Barrett-Kahn 2013-01-16 19:19:42 UTC
A design sketch I presented internally for backing storage, for comments:

* Consensus on the various filesystem proposals seems elusive
* Neither the Mozilla nor the Google proposal has the atomic update behavior
* Appcache is already widely implemented, and has atomic update

I'm drawn to the idea of keeping appcache and its manifests as the means by which resources are identified and fetched, and just optionally stripping the system of its controller/router responsibilities.  For each request, the controller would say 'use this resource from this cache to satisfy that request'.  The controller would also be responsible for the cache's freshness and lifecycle.  It would have APIs (sketched after the list below) to:

* enumerate caches
* manually create, refresh, and delete them
* check when they were last refreshed
* examine the contents of caches
* search the whole pool of caches for a particular resource
* discover the amount of storage being consumed by a cache
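
A hedged sketch of what those APIs might look like from the controller's point of view; every name here (appCaches, refresh(), lastRefreshed and so on) is hypothetical and simply restates the list above as code.

    appCaches.list(function (caches) {
      caches.forEach(function (cache) {
        console.log(cache.name, cache.lastRefreshed, cache.sizeInBytes);
      });
    });

    appCaches.create('edition-2013-01-12', '/editions/2013-01-12.manifest');
    appCaches.refresh('edition-2013-01-12');    // atomic: all or nothing
    appCaches.remove('edition-2013-01-05');

    // Search the whole pool of caches for a particular resource.
    appCaches.findResource('/img/cover.jpg', function (match) {
      if (match) {
        // e.g. tell the browser to serve match.response from match.cacheName
      }
    });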

I suspect with a system like this the need to manually add and remove resources from an individual cache would be diminished.  Instead of manually managing files, you'd manually manage caches.  A cache might represent a particular newspaper article or other granular piece of content.  Different caches could be marked with different retention policies for the browser's benefit when it needed to clear space.  If the controller found it wanted to satisfy a request from a cache but that cache was absent, the system would not need to wait for the whole cache to be established to serve the page, just as it doesn't now.

Cache operations would need to be triggered by a request, but that's probably good.  We don't want yet another thing hanging around and waking up on a timer.  If they must do that they can make a background page and it can wake up the controller periodically.

If cache manifests were allowed to contain arbitrary name/value annotations (per-manifest and per-entry) applications could come up with their own sets of cache policies.  These policies could be invoked in the cache manifests and implemented in the controller.  Maybe popular web frameworks and content publishing sites would provide pre-written controllers.  The only part of the manifest meaningful to the browser would be the actual resource names, which it would fetch as written.  These requests, naturally, would bypass the controller.

The cache querying APIs used by the controller would probably need to let you query by annotation.
Comment 12 michaeln 2013-01-16 21:11:54 UTC
Repurposing the existing appcache to be a storage repository with some ability to refresh the set of resources listed in a manifest (really, just a collection of related resources) sounds like a decent idea.  So limit the storage component's responsibility to storage and some amount of unattended updating and expiration of those collections.

We have some prior proposals on APIs to enumerate/create/delete/update the set of manifests in the system; that could be relevant to this effort.  Additional metadata about update timestamps and total sizes might make sense in those proposals, as well as extensions to enumerate the individual resources contained within.
https://bugs.webkit.org/show_bug.cgi?id=67135
Comment 13 Jake Archibald 2013-01-17 03:09:18 UTC
I'd like to investigate solutions that avoid the manifest altogether but preserve the atomic grouping feature.  The JS that adds/removes/updates items in the request store could live in the same worker handling the routing.

This would allow a bit more logic into the process. Eg, I could query video support & cache the supported format.

The API could also allow dependencies in the atomic groups, as in level2 depends on level3, but a failed download in level3 wouldn't stop levels 1 & 2 committing to the cache.
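
A sketch of what such grouping might look like; requestStore and its methods are invented names, and the dependency semantics shown are just one possible reading of the paragraph above.

    requestStore.defineGroup('level1', ['/level1/map.json', '/level1/tiles.png']);
    requestStore.defineGroup('level2', ['/level2/map.json', '/level2/tiles.png'],
                             { dependsOn: ['level3'] });  // fetching level2 also pulls in level3
    requestStore.defineGroup('level3', ['/level3/map.json', '/level3/tiles.png']);

    // Each group commits atomically, but a failed download in level3
    // would not stop levels 1 and 2 from committing to the cache.
    requestStore.update(['level1', 'level2', 'level3'], function (results) {
      console.log(results);  // e.g. { level1: 'ok', level2: 'ok', level3: 'failed' }
    });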

To be clear, I don't think we should deprecate the manifest, but recommend it more for single-page 'apps' that have few requirements beyond "take all these files offline".
Comment 14 Andrew Betts 2013-01-17 15:38:30 UTC
I like the idea of detaching atomic resource caching from offline routing.  A few practical problems we've encountered (Economist and Financial Times web apps) include:

1) The Economist publishes a new edition once a week.  We'd like to create an atomic resource cache for an edition, but be able to add and remove editions without affecting those already cached.  Currently we use WebSQL or IndexedDB for this, because appcache doesn't offer that flexibility.

2) When we update an article or section page, we want the user to see the update straight away if online, but get offline copy if offline.  Automatic caching of master entries means this currently can't make use of the appcache, so instead we cache all content in IndexedDB or WebSQL, and just put the home page and a single fallback into appcache (by listing the manifest attribute on only one document, which is then included in every other document in an IFRAME).  The home page must also contain no content as a result of this.

3) We offer the option to listen to an audio recording of the article you're currently reading.  For this you currently have to be online because there is no storage feature that has sufficient quota to store 5 hours of audio.  To achieve this we have to wrap the site in a native app container and use native code to access larger amounts of persistent storage outside of the browser.

4) When connected to a poorly performing network, visiting a page that is covered by an appcache FALLBACK rule will only deliver the fallback after a non-configurable timeout.  The timeout is typically longer than user patience, rendering the app effectively unusable from the user point of view.

(1) is a flaw in appcache design which is not solved by having an interceptor.  Ditto (2): interceptor would allow us to instruct the browser to skip the appcache for anything we didn't explicitly tell it to cache, but it would still cache any master document so we'd still need the IFRAME hack to avoid that.  (3) remains a problem.  (4) is fully solved by the interceptor, if I understand it correctly.

The idea of a single canonical interceptor per origin is simple but when Alex Russell and I discussed this back in November he pointed out to me that it wouldn't scale (imagine the size of it for an origin like google.com or amazon.com).
Comment 15 Jake Archibald 2013-01-17 15:50:37 UTC
I don't buy the not-scaling thing.  Even massive sites have a routing script of sorts, and they scale pretty well.  Unless you're thinking of directing everything to one HTML page (as in INTERCEPT), then yeah, I agree.

I don't think you'd need one rule per URL; I'd like the routing to look similar to expressjs, or at least be similar in functionality.
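
For illustration, express-style routing in an interceptor worker might look something like this; router, serveFromCache() and passThrough() are invented names, the point being only that pattern-based rules keep even a very large origin's routing script small.

    router.get('/d/:docId', function (req, res) {
      res.serveFromCache('editor-shell');   // one rule covers every document
    });
    router.get('/static/*', function (req, res) {
      res.serveFromCache('static-v42');
    });
    router.all('*', function (req, res) {
      res.passThrough();                    // everything else: normal network
    });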
Comment 16 Tobie Langel 2013-01-17 15:54:04 UTC
(In reply to comment #14)
> The idea of a single canonical interceptor per origin is simple but when
> Alex Russell and I discussed this back in November he pointed out to me that
> it wouldn't scale (imagine the size of it for an origin like google.com or
> amazon.com).

It should be possible to relax this restriction using an approach similar to the one used to match fallback namespaces in the current spec (http://www.whatwg.org/specs/web-apps/current-work/multipage/offline.html#concept-appcache-matches-fallback).
Comment 17 Tobie Langel 2013-01-17 16:04:52 UTC
Deleting my previous comment as it is just silly given the proposal being discussed and will just derail the conversation.
Comment 18 Tobie Langel 2013-01-17 16:07:17 UTC
And of course the top right button collapses the comment instead of deleting it. Oh well. :-/
Comment 19 David Barrett-Kahn 2013-01-17 16:41:39 UTC
Andrew, thanks so much for your contribution.  I was really hoping someone from publishing would provide their perspective.  Your site sounds more ambitious than most, hopefully we can come up with something that provides value lower down the complexity scale too.  Long time subscriber to the Economist btw!

On the snowballing script complexity thing, I share Andrew's concern.  Controllers which contained actual lists of resources would get unwieldy very fast.  This is what I was aiming at when I suggested retaining appcache manifests and allowing annotations to be included in them.  I imagined a 'private language' a site might have, policies which were implemented in the controller script and invoked in the various cache manifests.  What Jake's comment made me see is that there's no need to reuse the appcache manifest for this. Controllers can be data driven in ways unique to them.  The Economist could have a system of manifest files of its own devising - maybe one global one, one per edition, and perhaps even finer grained than that.  After fetching and reading these manifests, the controller could decide whether it wanted to create, delete, or update any of its... let's call them 'resource sets'.  The contents of a resource set would be specified programmatically.  When binding individual resource requests to resource set members, the controller could use a combination of the URLs actually in the sets, policies it had stored in IDB or similar, and perhaps some kind of resource set metadata?  There would be no requirement that all the content on a page come from a single set.  Cache maintenance operations would be performed by the controller after it had already told the browser how to proceed with whatever request triggered it, so as not to increase the latency of the page load.

Sites without the appetite to create a controller and manifest language entirely for their own use could use an off the shelf implementation, perhaps provided by one of the major web application frameworks, web servers, or J2EE containers.  Very simple sites would use controllers with resource names hard coded into them.

It would be possible to provide retention hints to the browser for each resource set.  As an illustration, imagine that each had a priority number.  When it needed to evict sets, the browser would remove them from an origin in reverse priority order.  This makes the caching of media files possible, even on storage constrained platforms.  To inform decisions about whether to establish such caches in the first place, the browser might provide a 'storage pressure' signal.  Browser vendors could come up with their own storage allocation policies and UIs - different policies will be suitable for different hardware and UI situations.

As Michael N has pointed out previously, controller context setup latency is an issue here.  The controller's execution context would need to set up quickly, and to quickly gain access to resources we think it would routinely use, such as IDB and the caching system.  The browser engineers on this thread would be able to say more.

Andrew, your comments on how an interceptor wouldn't fix various problems you've encountered with appcache, you were talking about a cache-specific interceptor rather than a whole-origin one?
Comment 20 Andrew Betts 2013-01-17 17:40:47 UTC
(In reply to comment #19)
> Long time subscriber to the Economist btw!

Splendid.  Carry on.

> Andrew, your comments on how an interceptor wouldn't fix various problems
> you've encountered with appcache, you were talking about a cache-specific
> interceptor rather than a whole-origin one?

Either I guess.  The point I was making is that an interceptor is not going to magically solve all our appcache woes, and that actually this thread seems to also be dealing quite extensively with issues of atomic grouping of cached resources which to me seems quite different to the routing issue.

I actually don't share others' enthusiasm for atomicity.  Provided that some API exists to inspect what is in the cache, isn't it easy enough to ignore partially populated caches if atomicity is important to your application?  If app cache had no atomicity, then (disregarding the separate capacity issue) we could have used it for the multiple-editions use case perfectly well - albeit with some additional code to validate that the stored edition was complete.

In terms of offering a solution for simple cases, I'd be interested to know how many 'simple' offline web apps there are.  Certainly ours would be a lot simpler without all the workarounds that we currently employ, and I feel like if the underlying technology is implemented well enough that complex apps can use it elegantly, framework/shim developers will provide shortcuts for simple cases.

In short: Yes to an interceptor plus a better, larger, more controllable, less atomic (or at least more controllably atomic), offline HTTP request cache storage type thing.
Comment 21 Tab Atkins Jr. 2014-05-07 17:39:37 UTC
Actually, Hixie, you should be able to just close this, as this is precisely ServiceWorker.
Comment 22 Ian 'Hixie' Hickson 2014-07-29 00:13:15 UTC
Keeping this open just to check on how ServiceWorkers turn out.  Assuming they end up adopted, this bug is indeed moot, as Tab says in comment 21.
Comment 23 Ian 'Hixie' Hickson 2014-09-25 22:44:21 UTC
Punting to ServiceWorkers.