Server Proxy Cache Policy

08 Nov 2017


Present: JohnJansen, Chaals, QingqianTao, YoavWeiss, mattkot, XiaoqianWu, AnqiLi(Angel)
Qingqian (Baidu)


<inserted> scribe: chaals

QT: We think there are some issues with caching policies for server proxies.
... want to talk about problems when the website comes out of a server cache, and options


Issues: cache strategy not clear to the developer.
... there is no clear agreement on how to control where the page might be cached.
... And there is no clear specification for the cache time

CMN: Why robots.txt?

QT: Because the scenario is that a proxy also crawled.

(slide showing that... if robots.txt says don't cache, that information gets into the chain and browser is told to get it fresh)

QT: Another problem - if the proxy service caches the page, then you lose statistical information on views.
... there is no way to get a notification of when the user requested the site.
... (Firefox proposes X-Forwarded-For as the basis of a solution)
... use that header to trigger a pingback from the proxy with user data when the page is served out of proxy cache
... or use a meta, and have the browser do the same thing.
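A minimal sketch of the proxy-side pingback idea described above, assuming a caching proxy that records a notification for the origin whenever it answers from its own cache. The class name, the `pingbacks` queue and `fetch_from_origin` are hypothetical stand-ins and were not presented in the meeting; only the X-Forwarded-For header comes from the discussion.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CachingProxy:
    """Toy proxy: serves from cache and queues a pingback on every cache hit."""
    cache: Dict[str, bytes] = field(default_factory=dict)
    # Hypothetical queue of notifications the proxy would later send to the origin.
    pingbacks: List[Dict[str, str]] = field(default_factory=list)

    def serve(self, url: str, client_ip: str,
              forwarded_for: Optional[str] = None) -> bytes:
        # Append the requesting client to any existing X-Forwarded-For chain.
        chain = f"{forwarded_for}, {client_ip}" if forwarded_for else client_ip
        if url in self.cache:
            # Cache hit: the origin never sees this request, so record a
            # pingback carrying the header value it would otherwise have seen.
            self.pingbacks.append({"url": url, "X-Forwarded-For": chain})
            return self.cache[url]
        # Cache miss: the origin sees the request directly, no pingback needed.
        body = self.fetch_from_origin(url, chain)
        self.cache[url] = body
        return body

    def fetch_from_origin(self, url: str, chain: str) -> bytes:
        # Stand-in for a real HTTP request that would forward `chain`.
        return f"<html>{url}</html>".encode()
```

In this sketch the first request for a URL goes to the origin and later requests only produce queued pingbacks; how and when those are delivered is left open, as it was in the discussion.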

@@1: So the browser would send additional information?

CMN: If the page loaded uses the user data to generate something, how does that get into the page?

QT: The page can use a script

CMN: OK - but then you don't need a pingback because the script can directly ping e.g. onload.

QT: So can we standardise this for cache control?

CMN: What data is not available to a script?

QT: If we can do it in a browser then maybe it will work better than page script.
... some websites will not use it correctly.

CMN: Looking to see if there is something more we can do than script. If the website writes the code wrong, we cannot fix that - they may well handle the pingback wrong too.
... And if it is blocked for user privacy, the browser will probably still block the information going out.

QT: Some websites in AMP/MIP do not show the right URL - it is listed as coming from the amp server not the original domain.
... we need to show the correct URL for the browser.

CMN: If you use link rel="canonical" in the page, the browser could use that and show the real URL. But is that a problem elsewhere?
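A rough sketch of the link rel="canonical" suggestion, assuming the displayed URL is simply taken from the first canonical link found in the served page and falls back to the cache URL otherwise. `CanonicalFinder` and `display_url` are hypothetical names; no browser is claimed to behave this way, and trusting a page-declared URL without further checks is exactly the kind of URL fudging the discussion treats as risky.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel is a space-separated token list; compare case-insensitively.
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonical = attr["href"]


def display_url(served_url: str, html: str) -> str:
    """Return the canonical URL if the page declares one, else the served URL."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or served_url
```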

@@1: HTTP/2 allows for connection coalescing, so you can have multiple origins mixed.

QT: shows example of using a query parameter to hold the real URL
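The slide itself is not recorded in the minutes. As an illustration only, the query-parameter approach could look like the following, assuming the cache URL carries the original address percent-encoded in a parameter. The parameter name `url` and the function name are hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def original_url(cache_url: str, param: str = "url") -> Optional[str]:
    """Recover the original page URL carried in a cache URL's query string."""
    # parse_qs percent-decodes the values for us.
    values = parse_qs(urlsplit(cache_url).query).get(param)
    if not values:
        return None
    candidate = values[0]
    # Only accept web URLs, so schemes such as javascript: are never surfaced.
    if urlsplit(candidate).scheme not in ("http", "https"):
        return None
    return candidate
```

Anyone can construct such a cache URL, so the recovered value identifies what the cache claims to be serving, not a verified origin.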

XW: Do you have an implementation in the Baidu browser to show the URL issue?

CMN: Same origin issue. But having everything come from one cache means you can more easily build XSS since everything is now on the same origin...

YW: URL fudging is risky

JohnJansen: so the ask is a way to bypass Same Origin when you are getting stuff from a server cache.

YW: Most of the re-hosted websites have a CDN provider that has the origin certificates and can serve it from the origin with whatever constraints
... the CDN serves the content with the same constraints as the cache without the origin doing anything, and effectively serves it from the real origin

QT: Another problem - we load the page in iframe, and change the URL so it is same origin as the cache.
... If we don't have the same origin we cannot provide our transition effects.
... We need to try to find some solutions about the cache origin policy problem.

YW: CSP on the page?

JJ: Implementing something in the browser to allow magical cross-origin insertion from the CDN seems pretty risky


JJ: So what would the browser do, ideally?

QT: Let the browser pass the iframe URL to the real URL. Or let the browser provide the navigation transition script.
... problem is the cache has the wrong URL. We need to do something to translate that to same origin.

YW: Think this needs to be solved on the server side. I have an elegant proof of concept but it doesn't fit in the margin
... and I don't think it is a standards problem.

CMN: There is also Yandex (and I think others) using a subdomain on the AMP server domain, to be able to do this.
... it feels like a small-scale standards problem

[Big question seems to be finding a forum...]

YW: Just as AMP is not a standard, I can see a solution being developed along similar lines

<angel> https://groups.google.com/forum/#!forum/amphtml-discuss

YW: And solutions that require protocol or browser changes will break the Web
... you need to figure out how to serve the content from the origin while applying the cache constraints.

Summary of Action Items

Summary of Resolutions

[End of minutes]

Minutes formatted by David Booth's scribe.perl version 1.152 (CVS log)
$Date: 2017/11/08 22:57:49 $
