SV_MEETING_TITLE -- 31 Oct 2012

URLs

<SimonSapin> I’m interested to know if I’m gonna need to rewrite the urlparse python module.

<annevk> RRSAgent: make minutes public

<cabanier> URL: http://url.spec.whatwg.org/

<paulc> If anyone knows where Larry Masinter is please invite him to this session.

[missed some discussion at the beginning. Starting logs now.]

annevk: Some characters, such as square brackets, are not percent escaped when sent over the wire.
... It's better to be more conservative in what's escaped for compatibility
... we're limited in what we escape. If we want to match STD 66 model, we have to escape more. That's not happening.

timbl: you want to have a standard that you can round trip.

annevk: with HTML, you try to enforce Postel's law: strict in what you produce, liberal in what you accept.

… the syntax is strict, the parser is liberal

… The parser only stops at fatal errors.

timbl: I don't want URLs written according to the existing standard being interpreted differently.

annevk: I think that is largely not the case. I'm not sure how the current spec deals with relative URLs.

… the way it's implemented in browsers is that there's a certain class of relative URLs. It depends on the scheme.

paulc: When you say the way it's implemented in browsers, is the situation you're dealing with that the browsers are uniformly doing the same thing? Or a large proportion of them?

annevk: all browsers require to not conform to that data model, but there is wiggle room in what they can do.

… at Opera we found that being conservative is better for compat.

timbl: maybe you should do an overview

krit: You say that URL will be based on URI.

annevk: if we can compare it to the RFC, the URL syntax matches IRI.

… the syntax is broad. It escapes characters over U+7F

annevk: fragments are escaped in the parsing step.

… e.g. if you have an ID with a euro character, then the character in the URL will be percent encoded as a byte sequence.
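A minimal sketch of the euro example above, using Python's standard `urllib.parse` rather than the parser under discussion. The ID `price-€` is a made-up value, and `quote` is only a stand-in for the percent-encoding step: it shows the UTF-8 byte sequence that results, not the exact set of characters a browser escapes in fragments.

```python
from urllib.parse import quote, unquote

# Hypothetical element ID containing a euro sign (U+20AC).
fragment_id = "price-€"

# Percent-encode the fragment as UTF-8 bytes; safe="" escapes every reserved character.
encoded = quote(fragment_id, safe="")
print(encoded)  # price-%E2%82%AC

# Decoding the byte sequence as UTF-8 recovers the original ID.
print(unquote(encoded) == fragment_id)  # True
```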

annevk: you have to make it dependent on the document type.

larry: If the fragment contains non-normalised unicode, the matching might need to be a loose matching.

… it might be you have fragments that are written out not in percent encoding, but the matching algorithm does the reverse mapping

… Normally matching from a URI to IRI is a heuristic process.
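To illustrate why non-normalised Unicode can call for loose matching, here is a small Python sketch. The helper `loosely_equal` and the choice of NFC are assumptions made for illustration; the discussion does not specify a matching algorithm.

```python
import unicodedata
from urllib.parse import quote, unquote

composed = "caf\u00e9"     # "é" as a single code point
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent

# The two spellings look identical but percent-encode to different byte sequences.
print(quote(composed))    # caf%C3%A9
print(quote(decomposed))  # cafe%CC%81


def loosely_equal(fragment: str, element_id: str) -> bool:
    """Hypothetical loose match: percent-decode, then compare NFC-normalised forms."""
    decoded = unquote(fragment)
    return (unicodedata.normalize("NFC", decoded)
            == unicodedata.normalize("NFC", element_id))


# An exact comparison fails, the normalised comparison succeeds.
print(unquote("cafe%CC%81") == composed)       # False
print(loosely_equal("cafe%CC%81", composed))   # True
```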

annevk: it's part of the application layer.

… e.g. SVG takes the fragment, turns it into unicode.

larry: Does it turn the fragment into unicode or do a loose matching algorithm?

annevk: HTML turns it into unicode
... do you have some things in general to say about it?

larry: There are 4 specs. There should be 1.

… The IETF documents were the documents of record. If they're no longer, they need to be closed off.

… what do you do with all of the other specs that refer to it?

… The IETF spec says more about comparison.

… The HTML document has something that was in progress, but needs to be replaced or something

paulc: When the URL work moved from HTML to WebApps, people thought it would be removed from the HTML spec. It wasn't.

larry: that will be easier to fix later.

annevk: the web apps document is obsolete.

MikeSmith: the editors draft refers to annevk's spec.
... Julian's tests are great for what they are, but will need updating to state the expected result.

larry: The spec in the HTML spec has an advantage in its style in that it normatively references the IETF spec.

… If you look at the latest version of the IETF document, it was changed ...

http://tools.ietf.org/html/draft-ietf-iri-3987bis-13

larry: see section 7

… "Processing of URIs/IRIs/URLs by Web Browsers"

scribe: it points to annevk's document

larry: Any valid URI is a valid string that can be used in an HTML document if it's in UTF-8

… then the special processing of the query is the same.

annevk: I think that is true, but I have questions.

… If I write http:test

… that is valid URI?

timbl: It's not a valid HTTP uri, but syntactically correct.

annevk: it's considered a relative reference by browsers.

timbl: not in the standard.

larry: does it parse into HTTP scheme with test at the host or path?

annevk: the parser algorithm does both resolving against the base URL. If there's no base URL, you get HTTP scheme and test as the host.

… If there is, then it's resolved as a path.
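As a point of comparison only, Python's `urllib.parse` can be used to try the `http:test` case. This is not the parser being specified. With a base URL of the same scheme, `urljoin` resolves `test` as a relative path. Without a base, `urlsplit` puts `test` in the path rather than the host, which differs from the no-base behaviour described above.

```python
from urllib.parse import urljoin, urlsplit

# Without a base: the text before the colon is taken as the scheme.
parts = urlsplit("http:test")
print(parts.scheme, repr(parts.netloc), parts.path)  # http '' test

# With a base URL of the same scheme: "test" is resolved as a relative path.
resolved = urljoin("http://example.org/", "http:test")
print(resolved)  # http://example.org/test
```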

timbl: I've got old code. It says if there's a colon, it's a scheme.

… If you're going to document this bizarre practice, then you're going to have something incompatible with a bunch of code, or a special section just for browsers.

hsivonen: I think this section on browsers is a distraction. There is other software that is not a browser, that I want to process content in a way compatible with browsers.

… e.g. the HTML validator needs to fetch URLs. It's not a browser, but it doesn't make sense to fetch something different.

… e.g. link checker or spider. I would want http:test to parse the same way as browsers.

… it's apps that want web compat vs. apps that don't

… In the HTML validator, I'm using a jenga (???) IRI library.

… It has different behaviours for URIs, IRIs, etc.

<masinter> the question is scope

<masinter> scope: HTML, HTML on some OS, http: only, file:, other schemes, mailto:, etc.

… there's already 6 different configurations to make it behave in different ways. Not one is that I want it to be compatible with the web, and yet it's the best Java library available.

<masinter> "how browsers do it" may not be the right scope

… for any Java based app that wants to spider the web, the most successful implementation would be compatible with browsers.

masinter: many of the libraries are conditional about whether they're case sensitive, depending on the system, e.g. for file URLs

annevk: file URLs are a bit hard. I'm working on them.

<SimonSapin> s/working on them/ working on them/ ?

<masinter> "What browsers do" might only be "what some browsers do for http: URIs on some file systems"

<annevk> SimonSapin: no, I'm trying to "fix" them

<annevk> SimonSapin: well, define

<SimonSapin> ok

<masinter> I thought I heard that what some browsers do in some circumstances may depend....

[missed a bit]

hsivonen: The URL bar is UI. It needs to accept what's pasted into it, but can have different processing based on what users type.

<masinter> is there any real-world content that depends on http:test being treated as a relative IRI in contexts where there is a base?

dbaron: I wouldn't expect any differences in those different places (HTTP, XML, etc.)

… Putting it in an HTML attribute may require some escaping. But I don't know of any software that's going to interpret URLs differently based on if it's in HTML or XML.

hsivonen: timbl's diagram represents what that java library does. To fix the bug, we should use the HTML stuff for all of it.

annevk: You're confusing syntax and the model.

… HTML syntax escaping disappears once it's represented as a model.

annevk: the spec model is that you get a set of code points - a string - that goes into the parser.

… HTML handles entity references. The string passed to the URL parser excludes that HTML specific syntax.

<masinter> test_resolve('bare host', 'http://example.org/', 'http:test', 'http://example.org/test');

… the DOM level passes it off to the URL level.

<scribe> ScribeNick: Lachy

<masinter> so http:test the test is treated as a path, not as a host name

<masinter> in chrome on windows

hsivonen: what happens is that the URL parser knows the original character encoding.

… the DOM has UTF-16 strings.

… The leakage is that it remembers the original encoding that was parsed, and it is given to the URL parser.

dbaron: that extra parameter is used for less than it used to be, I think?

<MikeSmith> s/the editors draft refers to annevk's spec/we have a WebApps WG FPWD of URL but not a current editor's draft/

… I think there has been change over time more recently than I thought.

… There was a reduction in where IE changed incompatibly around IE7 or 8 and it reduced that leakage.

annevk: basic design: You have a string and a base URL and encoding override. Those go into the parser. The parser outputs an object representing the URL.

hsivonen: In addition to passing null for the encoding, you can also pass null for the base URL.

timbl: then there's a serialiser

annevk: the serialiser doesn't get the encoding override.

<masinter> We have four specs for URL/IRI/URI either around or planned:

<masinter> * IETF 3987bis

<masinter> - draft http://tools.ietf.org/html/draft-ietf-iri-3987bis

<masinter> - wg http://tools.ietf.org/wg/iri/charters, mailto:Public-iri@w3.org

<masinter> - chairs Peter, Chris

<masinter> - editors me, Martin

<masinter> Reason: Group established for 3 years, IETF is "official" spec for URIs, etc, referenced by W3C HTML

… when you serialise it, you just output the string, you don't need to know the original encoding.

<masinter> * W3C HTML WG

<masinter> - Spec http://www.w3.org/TR/html5/urls.html#urls

<masinter> - Wg http://www.w3.org/html/wg/ mailto:public-html@w3.org

<masinter> - chairs Paul, Maciej, Sam

<masinter> - editors team led by Robin Berjon

<masinter> Reason: stable document, no plans for changing unless bug submitted

<masinter> * W3C WebApps URL spec

<masinter> - draft not done, currently pointing to Anne's spec, but plan is to create new spec by copying Anne's draft

<masinter> - wg http://www.w3.org/2008/webapps/ group email public-webapps@w3.org

<masinter> - Chairs Art, Charles

annevk: say we have windows-1252 as the encoding override.

<masinter> Reason: now chartered by W3C to develop spec

<masinter> * WHATWG spec

… this is our query: ?€

<masinter> - draft http://url.spec.whatwg.org

<masinter> - wg http://www.w3.org/community/whatwg/ email whatwg@whatwg.org

<masinter> - editor Anne van Kesteren

<masinter> Reason: WHATWG members defer to Anne

<masinter> No group has plans to drop spec.

<masinter> Other groups that have discussed this

<masinter> - IETF/W3C liaisons (Mark, Thomas, Philippe) and IETF/W3C public-ietf-w3c@w3.org

<masinter> - W3C Advisory Committee w3c-ac-forum@w3.org yesterday

… that goes into the parser and it comes up with: ?%80 in the parsed URL. The serialiser will output this.

<masinter> - W3C TAG www-tag@w3.org various action items

<masinter> - IETF discussion list ietf@ietf.org discussion spilled over

<masinter> - URI mailing list uri@ietf.org also involved

<masinter> Testing should help converge the specifications, so results of testing and plans for them should help (Chris, Simon, Kris)

annevk: eventually everyone should move to UTF-8 to avoid problems. The model we have is ugly.
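The design annevk outlines (string, base URL and encoding override go into the parser; a URL object comes out; the serialiser does not need the override) can be sketched roughly as follows. `ParsedURL`, `parse` and `serialise` are hypothetical names, Python's `urllib.parse` stands in for the real algorithm, and the sketch assumes the override is a single-byte legacy encoding in which the euro sign is byte 0x80 (windows-1252 here) and that it affects only the query.

```python
from typing import NamedTuple, Optional
from urllib.parse import quote, urljoin, urlsplit, urlunsplit


class ParsedURL(NamedTuple):
    """Hypothetical stand-in for the object the parser outputs."""
    scheme: str
    host: str
    path: str
    query: str      # already percent-encoded when stored
    fragment: str


def parse(raw: str, base: Optional[str] = None,
          encoding_override: Optional[str] = None) -> ParsedURL:
    """Resolve raw against base and percent-encode the query while parsing."""
    resolved = urljoin(base, raw) if base is not None else raw
    parts = urlsplit(resolved)
    # Assumption: the override applies to the query only; UTF-8 is the default.
    query_encoding = encoding_override or "utf-8"
    # Leave existing escapes and common query delimiters untouched.
    query = quote(parts.query, safe="%=&+", encoding=query_encoding)
    return ParsedURL(parts.scheme, parts.netloc, parts.path, query, parts.fragment)


def serialise(url: ParsedURL) -> str:
    """Output the stored components; no encoding information is needed."""
    return urlunsplit((url.scheme, url.host, url.path, url.query, url.fragment))


legacy = parse("?€", base="http://example.org/", encoding_override="windows-1252")
print(serialise(legacy))  # http://example.org/?%80

default = parse("?€", base="http://example.org/")
print(serialise(default))  # http://example.org/?%E2%82%AC
```

The same input yields different bytes on the wire depending on the document encoding, which is the ugliness that moving everything to UTF-8 would avoid.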

masinter: we need to deprecate the idea that you can convert a URI with percent encoding into an IRI.

… it's a heuristic process.

… we have no idea whether it ever came from an IRI, so it shouldn't affect any definition of the syntax.

… There's just some funny things going on in the back ends of servers.

… it's a special case having non-ASCII characters in the query string.
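A short Python sketch of why going from percent-encoded bytes back to characters is a guess: the escaped bytes do not record which encoding produced them. The inputs are illustrative values, and `unquote` is Python's standard decoder, not anything defined by the specs discussed here.

```python
from urllib.parse import unquote

# The same character, escaped from two different encodings.
from_utf8 = "%E2%82%AC"   # euro sign encoded as UTF-8
from_legacy = "%80"       # euro sign encoded as windows-1252

# Decoding with the right guess recovers the character in both cases.
print(unquote(from_utf8))                           # €
print(unquote(from_legacy, encoding="windows-1252"))  # €

# Decoding with the wrong guess gives a replacement character or mojibake.
print(repr(unquote(from_legacy)))                   # '\ufffd'
print(unquote(from_utf8, encoding="windows-1252"))  # â‚¬
```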

annevk: it's very common in non-UTF-8 documents.

masinter: processing is in 2 steps. First parse it into components.

… then you translate as necessary.

annevk: no, you translate while parsing.

<dbaron> I think https://bugzilla.mozilla.org/show_bug.cgi?id=261929 is the bug documenting where Mozilla reduced the amount that the origin encoding was used for.

annevk: every browser implements one URL parser.

… except for webkit

annevk: other applications try to converge with web browsers.

masinter: the host name, IDN, the parser translates them to Punycode?

annevk: yes

masinter: even in IE, Windows?

paulc: there's a standard Windows library parser in the OS.
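For reference, the IDN-to-Punycode translation asked about above looks like this with Python's built-in `idna` codec. That codec implements IDNA 2003 and may differ in edge cases from what any given browser or OS library does; the host name is an illustrative value.

```python
# Illustrative internationalised host name.
host = "bücher.example"

# Convert each label to its ASCII-compatible (Punycode) form.
ascii_host = host.encode("idna").decode("ascii")
print(ascii_host)  # xn--bcher-kva.example

# The conversion is reversible.
print(ascii_host.encode("ascii").decode("idna") == host)  # True
```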

timbl: 2 points of order

… 1 it's 12:06 and we've run out of time. 2. Lunch.

RRSAgent: make minutes

timbl: the other thing, when we started the W3C, we did HTML and HTTP URIs. We tried HTML in the IETF, didn't have the publishing people.

… but HTTP, we did in the IETF in order to get the people designing protocols to review it.

… If you're going to do this work, it's important to unify all this, a classic way to do that is to get all the people involved, who have written other URI parsers, who are going to the IETF.

… going to the trouble to go to their place is sensible.

RRSAgent: make minutes
