W3C Technical Architecture Group F2F -- 14 Sep 2011

URI Definition Discovery; Metadata Architecture

http://www.w3.org/2001/tag/2011/09/13-agenda#metadata

Noah: aiming for a major piece of work on this in July time-frame.

JAR: 3 documents - important one is the http://www.w3.org/2001/tag/2011/09/referential-use.html - meant as an introduction to the other two.
... I've come at this HTTP-RANGE14 issue from 2 directions. First - can I use a URI for a reference to a journal article. Is that allowed by the range14 rule?
... Second way - writing statements about licenses. Is there a standard way to refer to the work that's licensed in a way that people will understand? Answer is no, but http-range14 is close...

… also, large numbers of people in linked data / SW community are discussing...

… for a long time ...

… so HTTP-Range14 needs to be fixed.

<masinter> parts of this were issues in the early '90s discussions of URL standards in the URI working group in IETF

JAR: so one option is that [CC] and others could do this privately. But there is enough interest that we should have a shared solution.
... The question is: what is the relationship between what the URI refers to and the representations retrieved by the URI?
... My definition of metadata is that it is information about information, not just information about anything.
... [goes to the whiteboard]

[unminutable discussion on con-neg]

JAR: idea that URIs are used in particular contexts but used in practice referentially. Maybe that's only used in RDF. Premise is that URIs are being used referentially.

Tim: RDF should not be inconsistent with Web Architecture.

JAR: Other people want it to be inconsistent.

… notion that a URI refers to something. In the case of a license statement, one URI would refer to the work, another to the license terms, and a third to the relationship of "being licensed."

… from an engineering tool that could be used in a remix tool. You do a copy-paste and the tool could just check the license.

Larry: in the context of RDF there is some ambiguity in whether you're referring to the document retrieved or the referent, and resolving this ambiguity [depends on the context.]

… the license relationship could be more explicit.

JAR: That's not the way RDF works - RDF has referential transparency.

Tim: you could have one property that says "I like this page" and then "ogp:like" - these say things about the topic of the page.

[discussion on sockets and plugs]

JAR: …being precise about what the subject of the license is… is the question. What is the relationship by convention - the agreement - between the way people are using URI referentially and the way people are using them for retrieval? Is there anything we can say or agree on ahead of time about what that relationship is?
... I'm making a statement and I want the URI to refer to [e.g.] a rabbit.

Noah: And the retrievable is meant to be data about the rabbit?

JAR: That's what the [discussion] is over.

… even if it's zero, there's no reason a-priori to assume any relationship between the two.

HT: Is it or is it not yet legitimate for me to understand retrievable as "200 retrievable"?
... I want to make sure that e.g. 303 are not covered...

JAR: Right.
... I'm talking about RFC-3986. That talks about retrieval.

Noah: 206 is being debated.

JAR: I'm using it in the sense of - would it be correct for a http server to deliver a 200 response?

… RDF makes no connection between the thing the URI is referring to and what gets retrieved. It leaves it up to the context.

Ashok: Earlier you and I reviewed the link header draft - that presents a solution to this, doesn't it?

JAR: There's a description which could be bound to the URI through a variety of methods - a link header, another header, a SPARQL query, etc...

Ashok: the link header would typically...

JAR: Give you a URI.
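
[Editorial sketch of the point above, that a link header hands back a URI for the description. It is not something shown at the meeting. The relation name "describedby", the function name and the example.org URIs are assumptions for illustration, and the parsing is deliberately simplified rather than a full Link header parser.]

```python
import re


def describedby_targets(link_header):
    """Return the target URIs of link-values whose rel includes 'describedby'.

    Simplified parsing: assumes each link-value looks like <uri>; rel="...".
    """
    targets = []
    # Split only on commas that are followed by the "<" of the next link-value.
    for part in re.split(r",\s*(?=<)", link_header):
        match = re.match(r"\s*<([^>]*)>(.*)", part)
        if not match:
            continue
        uri, params = match.groups()
        rel = re.search(r'rel\s*=\s*"?([^";]*)"?', params)
        if rel and "describedby" in rel.group(1).split():
            targets.append(uri)
    return targets


# Hypothetical header value a server might send for a URI naming a rabbit.
example_header = (
    '<http://example.org/rabbit.rdf>; rel="describedby", '
    '<http://example.org/style.css>; rel="stylesheet"'
)
print(describedby_targets(example_header))
```

[The sketch covers only the "give you a URI" step. Fetching the description behind that URI is a second request.]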

Ashok: What else do you require?

JAR: What is required is a way to go from a URI to this description that can be done on a hosted platform where they cannot change http headers or exotic response codes - also one round trip instead of two. Unless we relax these two things then semantic web will be out-competed by other standards [according to Harry Halpin].

<JeniT> +1s JAR's last statement

Tim: There's things you can do - change the response code, add a header, add a fragment, a fragment id syntax...

JAR: [clarifying] some people require … I can't argue with them.

<JeniT> "if you get a 2XX response when you request a URI, that URI refers to a document"
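
[Editorial sketch of the rule JeniT quotes, written out as a lookup on the response code. The 303 and 4xx branches follow the wording of the TAG's httpRange-14 resolution. The function name and the "not covered" fallback are assumptions added for illustration.]

```python
def httprange14_reading(status):
    """Map an HTTP GET status code to what httpRange-14 lets a client conclude."""
    if 200 <= status < 300:
        # 2xx: the URI identifies an information resource (a "document").
        return "information resource"
    if status == 303:
        # 303 See Other: the URI could identify any resource at all.
        return "any resource"
    if 400 <= status < 500:
        # 4xx: the nature of the resource is unknown.
        return "unknown"
    # Other codes are not addressed by the rule (assumption of this sketch).
    return "not covered"


for code in (200, 206, 303, 404):
    print(code, httprange14_reading(code))
```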

Tim: [on OGP] people looked at OGP and said "if Facebook made this mistake then others will make this mistake and so we should make the mistake legal."

<JeniT> *lots* of people make the mistake

JAR: There are other places where range14a has not been observed. FlickR is one of them.
... There is no enforcement point for httprange-14. So people are not going to be aware of it.
... The statement that such a page is licensed with a license is false if the URI refers to the landing page.

<JeniT> they'll say no, because they all require server configuration

JAR: you could sit down with everyone who is doing it this way and try to convince them to use one of the approved range-14 solutions. [but that's not scalable]

Tim: you could make a validator. People will want to know what they said in RDFa. There will be problems e.g. licensing the wrong thing by accident.

JAR: We could go back to FlickR and ask them to change but I don't want to do that if we haven't resolved this issue here.

[discussion on how flickr is using CC licenses]

Noah: let's say the UI says "by posting a photograph here you grant a license…" Now the UI could come back and inform the user allowing them to choose which meaning… Let's say I added comments on the landing page. Did I mean to license the comments as well as the photo?

HT: A bunch of different proposals exist for how to get from a URI for a thing to a URI for a description about the thing...

Ashok: if you use the link header you can get multiple descriptions...

JAR: We can amend range-14a.

… the whole point of the ISSUE-57 document is to sweeten the proposition, [thereby] allowing people to live with range-14a.

… the outcome could be that people say "yes [for example] hash URIs are OK".

… another outcome could be that people are not satisfied by any of the solutions.

… we could ignore them and push something through...

HT: Do you have any interest at all in exploring the opposite outcome? That we should say "yes, you're right, you should go ahead and use 200"?

JAR: Yes.

HT: That would amount to a retraction of 14a.

JAR: I've come to a better appreciation for the alternatives. One alternative is to withdraw 14a and let the community decide. In this case, [e.g.] Creative Commons would develop our own approach.

<JeniT> I made some proposals which got some good feedback at http://www.jenitennison.com/blog/node/159

JAR: You'd have to write in RDF something to disambiguate.

HT: example: Pat is the owner of a domain. He puts up a page about himself. Then he says "use this URI to refer to me" - and now we can't make assertions about the document.

Tim: When my RDF tool takes one graph and merges it with another graph that make different assumptions then we break RDF.

JeniT: talks about different ways of disambiguating statements. The analogy that I draw in that article is between how we think about persistence of URIs and how we deal with the fact that they don't persist over time. We should be aiming for people to use different URIs for people and documents but we need to deal with the fact that they don't all the time.

Larry: you've left out link relations.
... you could have two license relationships.

JAR: yes - that's the Facebook solution...

[discussion on the meaning of meaning]

HT: "larry has five letters" vs. "larry has three children" - you have no issue understanding that but RDF does.

Tim: in the databases out there on the net right now there is no ambiguity - the semantics are well defined.
... [some] say they must use another solution because the solutions [given to them by] web architecture are not acceptable.

<timbl> There are masses of data out there where the people running the system don't have any problem when a table name happens to be the same as a value in a cell, don't have a problem when the same string happens to be used as an ID in one column and as a value in a different column. This data is all waiting to be put on the web. It is clean and unambiguous and to suggest that when it goes onto the web we should necessarily introduce ambiguities because people always

<timbl> to make a major step backwards.

<masinter> so if databases don't have a problem, then the problem is with RDF, no? So fix RDF

JAR: what I want to know - how can I set expectations as I go into conversations with the community about this?

… I would like to say "I want to work through this tree with you" if I get a yes, then great but if I get a no then what expectation can I set with people about this outcome?

<JeniT> iand's said why he won't use hashes http://lists.w3.org/Archives/Public/www-tag/2011Aug/0127.html

<timbl> I have responded to all those points

HT: Should we review and prioritize the candidate amendments? Should we have some preferences in the TAG?

<JeniT> timbl, so your argument is that we go back and say to all the people who have used non-hash URIs for the last 5+ years and say that they were wrong?

<JeniT> timbl, that the httpRange-14 decision should never have encouraged people to do that

<timbl> 303 works and is fine -- it is inefficient

Tim: If there are architectural issues with the current solutions then we should design architecture to address those issues…

<masinter> personally, i really dislike 303

Tim: When I look at Ian's arguments they don't look sound to me. It may be that he's got one.

<timbl> http://lists.w3.org/Archives/Public/www-tag/2011Sep/0004.html

JAR: I think they're as sound as anything else.

<timbl> http://lists.w3.org/Archives/Public/www-tag/2011Sep/0006.html

HT: It's fair to say that the way in which the hash solution works is not consistent with the RFCs.

Tim: I'm really fond of the hash as a piece of punctuation between a global and local identifier. I want to use this in many contexts.
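
[Editorial sketch of the hash as punctuation between a global and a local identifier, using the Python standard library. The function name and the example.org URI are assumptions for illustration.]

```python
from urllib.parse import urldefrag


def split_global_local(uri):
    """Split a URI at '#' into (global part, local identifier)."""
    base, fragment = urldefrag(uri)
    return base, fragment


# Hypothetical URIs: a landing page and a photo named relative to it.
print(split_global_local("http://example.org/landing-page#photo"))
print(split_global_local("http://example.org/landing-page"))
```

[Only the part before the hash is sent in an HTTP request. The local identifier is interpreted on the client side, which is why this approach needs no server configuration.]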

<masinter> personally, I really dislike using # for this disambiguation too

<timbl> The phrase "fragment identifier" is historical and unfortunate. "Local Identifier" is much better

JAR: if I have a flickr page with a photo on it, is the photo part of the page in the case where the license refers to the page? Some might say no but others might say yes.
... we have an evolution where these things are turning into applications… if people are talking about what they see when the page is rendered, that might have very little relation to what is delivered in the 200 response. There's enough vagueness about what the document is that we're going to continue to have ambiguity even if we do get agreement on range-14.

Larry: so you need it to be precisely referring to "what you see when you get all the data and render the page."

JAR: That's one approach.
... [why a:b solves the flickr problem]

Tim: FlickR could put some RDFa in there which says "landing page#photo" has CC license *whatever* and etc...

<ht> HT: So a:b is _necessary_ for a solution to the flickr problem, but not sufficient -- WebArch today doesn't even get us that far

<scribe> Scribe: Ashok

Can publication of hyperlinks constitute copyright infringement?

Noah: Some writing was done. Next step we decided was to get some legal advice.
... perhaps Thinh Nguyen may help
... but no legal advice so far
... quotes from product page: http://www.w3.org/2001/tag/products/PublishingLinking.html

Discusses success criteria

scribe: document should have impact

HT: I want to write a popular press version

Noah: Goal is PR in June.
... FPWD in October. Is that realistic?

DKA: I sent out 3 requests to people for legal review. Not heard back.

<noah> ACTION-541?

<trackbot> ACTION-541 -- Jeni Tennison to helped by DKA to produce a first draft of terminology about (deep-)linking etc. -- due 2011-07-26 -- OPEN

<trackbot> http://www.w3.org/2001/tag/group/track/actions/541

<noah> Jonathan: we need to bump the dates on ACTION-201 and ACTION-282, I think. Right?

DKA: Case in UK touched on this ... remains a current topic

Jeni: We should continue working on the document.

jar: I followed up with my legal contact. He referred me to a Professor of Law.
... In conversation with him

Ashok: If we go the direction of a popular press article what would our messages be?

<Zakim> timbl, you wanted to ask JeniT about "The problem some people (including me) have with this is that hash URIs are primarily used to indicate portions of a web page, and using them"

<Zakim> noah, you wanted to say I'm confused about dealing with Ashok's question now

DKA: Users should have a right to link ... parallel to freedom of speech

Noah: Let's wait until we crisp up the finding before we answer that question

<Zakim> ht, you wanted to cavill at the use of 'link' in our headline

<noah> HT: at the popular level, linking is a confusing concept

<masinter> http://jeffersonsmoose.org/?p=90

HT: Too high-level to be implicated by anything a lawyer says.
... the word "link" is in itself confusing

DKA: Linking vs. Transclusion

HT: Depends on whether you use "image" or "object".

Noah: The document says the terminology is subtle even for experts
... so we need to explain
... we ask legal community what would help them

Discussion of "fair use"

Music students must buy the music. For plays they must rent a copy of the play

<masinter> http://en.wikipedia.org/wiki/Fair_use

Yves: Understanding of free speech is different in different countries

Noah: Discusses material in the cache

Tim: A link can be seen as aiding and abetting
... here is where you can find all your favorite TV series

<masinter> I thought we had gone down the road that the TAG would document common practice and terminology, rather than documenting what should or shouldn't be legal or even legal practice

Noah: Discusses nature of links ...

<masinter> I thought we were going to look at "expert witness" contributions to copyright cases in order to summarize those as community consensus about terminology and concepts

Noah: web depends on linking and network effects

<masinter> i don't want to make recommendations about what kinds of laws should be passed

Noah: Restricting linking makes the Web a less useful place

Larry: We should restrict ourselves to factual technical discussion
... need expert testimony on copyright cases that is representative of community consensus

Noah: Value of web comes from making information resources available to others

Larry: We should not try and assess value

<Zakim> jar, you wanted to expectations

<masinter> or at least separate this into two documents: (a) technical terminology and use cases, for community consensus and Recommendation status (b) a policy document which represents what the TAG would like the W3C to advocate, as a TAG finding

jar: We are talking about technical aspects of web ...
... what are the expectations? How will people use these links
... if you see a link you should be able to follow it
... do you have to read and understand the surrounding text?

<Zakim> masinter, you wanted to advocate two documents

DKA: Could you send the editors some guidance

Larry: Separate policy and technology ... two documents
... Get community consensus on the technical document

DKA: Not sure we need the policy document

<masinter> encourage you to keep policy advocacy section separate

Noah: We should not say much about policy ... stress the architectural/technical aspects

<noah> Noah: what I actually said was that there's a middle ground between pure technology and policy, and that's to explain a bit about how the Web is used, and where people get value from it. That informs people who set policy, so they have the opportunity to support such uses, and to avoid inadvertently breaking things people value.

<noah> AM: I'd go further: I think we need to advocate policy.

Tim: Reasonable to point out the importance of things. Not what laws should be written.

Tim: We can point out the value of things

<noah> http://www.w3.org/2001/tag/doc/deeplinking-20030911

Tim: I can see countries making it illegal to supply links to bomb-making material, etc.
... what got us going was websites that say "do not link to this?"

DKA: Should not have legal standing to say "You cannot link to this website."

<noah> From deep linking finding:

<noah> "The Web is at the risk of damage. The hypertext architecture of the Web has brought substantial benefits to the world at large. The onset of legislation and litigation based on confusion between identification and access has the potential to impair the future development of the Web."

jar: Depends on whether a contract has been made and whether it can be enforced.

HT: Linking is like printing in a paper

Noah: Link is a capability for retrieving the material
... We have published a Deep Linking finding
... we have already made a policy statement
... Don't prohibit linking, put access controls if you want to restrict linking. We say that in the finding.

jar: Terms of use should not be interpreted as entering into a contract

Noah: I wonder if there is a point to be made about fragment identifiers ... can make you miss the terms of use
... if you link, you could link to fragments of a page

<noah> NM: Right, we should point out that fragment identifiers, for good reasons, can cause a user following a link to wind up in the middle of a page or work, which means material like terms of use at the top or bottom might not be seen.

DKA: Is there legal precedent?

<noah> NM: We should show use cases of where fragment references are valuable.

Tim: When I sent messages to the bank I got terms of use at the bottom of the page.

<masinter> What will be helpful to the community? How would a TAG policy statement have effect? Who would refer to it?

<masinter> please be careful about 'prevention' vs 'establish consequences'

DKA: Jonathan can you provide text that further elaborates the terms of use situation.

<jar> yes

<timbl> Tim: Their messages said "by messaging with the sender you accept the sender's terms which are on the web here and may change at any time". I added a similar disclaimer to my own messages to them and so by that measure they would be bound by my terms which are on the web and may change at any time.

<noah> DKA: Jeni and I will meet first week of October to collaborate

<Yves> on the same subject as Tim: having an HTTP header on get DoNotLog: yes, and a link to term of services "by responding to this http request, you agree not to log this interaction"

DKA: Jeni and I will meet first week of October to discuss further work on the document. We can discuss at TPAC.
... We could have a BOF at TPAC to discuss

<Yves> which should be as difficult to defend as "do not link" ToS

Larry: We could use this as a way to bring in more of the community
... W3C management could use this to attract community interest
... Community group could be created around this topic

<masinter> http://people.ischool.berkeley.edu/~pam/

Noah: How about inviting Pam Samuelson to this session?

HT: Her article from 10 years ago is what I recommend people read to understand Copyright on the Web

<noah> ACTION: Appelquist to propose TPAC breakout on copyright and linking Due: 2011-09-27 [recorded in http://www.w3.org/2011/09/14-tagmem-irc]

<trackbot> Created ACTION-604 - Propose TPAC breakout on copyright and linking Due: 2011-09-27 [on Daniel Appelquist - due 2011-09-21].

<noah> ACTION-541?

<trackbot> ACTION-541 -- Jeni Tennison to helped by DKA to produce a first draft of terminology about (deep-)linking etc. -- due 2011-07-26 -- OPEN

<trackbot> http://www.w3.org/2001/tag/group/track/actions/541

<noah> ACTION-541 Due 2011-10-11

<trackbot> ACTION-541 Helped by DKA to produce a first draft of terminology about (deep-)linking etc. due date now 2011-10-11

<noah> http://www.w3.org/2001/tag/products/PublishingLinking.html

<timbl> very cool

<masinter> action-478?

<trackbot> ACTION-478 -- Jonathan Rees to prepare a second draft of a finding on persistence of references, to be based on decision tree from Oct. 2010 F2F -- due 2011-12-06 -- OPEN

<trackbot> http://www.w3.org/2001/tag/group/track/actions/478

<timbl> scribenick: timbl

Persistence

<noah> scribenick: timbl

<masinter> my thought lately has been that it's really important to be more precise about what you want to be 'persistent'

jar: We know persistence is primarily a social question, not really a technical one. To get it you need lots of copies or
... else a trustworthy institution.
... there is no way to test for it: it is a lot about whether someone else believes that a document is persistent enough for their purposes.
... Who do you know, who do you trust? I [think I] have seen policy statements by journals saying they do not accept HTTP URIs as references. Some accept them only from webcitation.org for example.

<masinter> a "confidence" game in both senses of the word

jar: the right document about persistent URIs has not really been written yet.
... A lot of people have been puzzled by this, and there have been a lot of analyses. People grok the social nature of the issue.
... Given that we are the TAG, and that in fact the persistence of a dereference is Somebody Else's Problem, should we have anything to say about it?

<noah> Larry: I think Jonathan's mention of "persistent actionability of a reference" is pretty close to the answer to your challenge about what is persistent

<noah> ...as far as it goes

<masinter> noah: i think it's leaving out the endpoint

HT: There was a crucial point when the DOI people realized that "actionable" URIs (ie dereferencable, i.e. HTTP) were a good idea.

<masinter> actionable URI as a reference to _what_ ?

<noah> You mean 'endpoint' as in endpoint of a connection, or something more like end state in some sense.

jar: There is not a big audience for this document.

<noah> I infer, e.g., reference to some published work.

<masinter> is the work allowed to change, or do you mean the exact representation?

jar: Some people just use DOIs. Some apply to be DOI registrars.

<noah> I assume that whoever establishes the reference string answers that question, but presumably making a reference to an "exact representation" needs to be an option when that's what you intend.

jar: Crossref has developed with time. The social contract with new members of crossref is now designed to support persistence. If you become a member of crossref, that means crossref has the right to keep the metadata. They are backing up out-of-organization. (They are a non-profit).
... Orcid and Datacite were not forced to use DOIs, they are using them because the social contract works for them, and becoming registrars.

masinter: Does the persistence of the DOI depend on these orgs -- or is the DOI in the document itself?

jar: Normally it is in the document, but that is not a requirement [AFAIK].
... The common practice is to publish in the web the DOI hyperlinked to the HTTP URI.

tim: Nice compromise.

jar: the funny thing is that they are not worried [AFAIK] about the doi.org domain name dying - they have this redundant form in the DOI itself.

<noah> NM: I still find <a href="urn:...blah blah">http://...some link to what you meant</a> seems troublesome. Maybe because analogous techniques are used maliciously in phishing attacks

<noah> NM: I think users mostly want to trust that when the link text appears to be an absolute URI, then the link should be to the same URI. Not a disaster, but somewhat troubling.

jar: Then w3.org and things like it are another story. They could make an effort to hang onto the domain name. There are persistent identifiers for for example dated tech reports.

Tim: What about new top-level domains with different properties, where you can buy forever a domain?

jar: There is an important function web(r,u) as to whether r is a valid representation for u. If a proxy is in the way, how do I know this function still works? well, HTTP-bis has a pretty good story.
... Now URN specs often discuss how you can deref them.
... They are often wrong in my view.
... What you want to say is that the only way to determine which, say, IETF RFC is actually a valid one is by going to those in charge, who have the definitive say.
... As to the TLD idea, I'm not sure there is a demand for it.
... note that w3.org's persistence policy is still in draft form.

larry: For persistent references, one way is to embed GUIDs within the object itself.
... Most PDFs have two, a document id and an instance ID (which changes any time the doc is edited).
... The resolution service is you search on Google, and it works. It requires Google instead of a custom service.

Noah: There are wrap date issues with GUIDs in many implementations.

jar: There is a urn:guid: scheme being discussed.

Tim: What about uuid: scheme?

<masinter> xmp.iid and xmp.did differ from uuid: in that there's a specific semantics

jar: The curation community seems to be going in that direction, and Tom Baker, for example, is working there on persistence for ontologies.
... They typically take copies of ontologies they use in RDF.
... I have also been thinking of PIR a lot, the non-profit who runs .org
... They could be involved.
... That would reduce the number of vulnerabilities by one.

ht: Or we could go to ICANN with PIR's support.
... The workshop dcc.ac.uk
... There is no archive anywhere which has an historical record of domain name ownership -- domain registrars are not required to keep archives.
... Maybe in order to get things to happen we need to be alarmist.
... Actually if w3.org goes offline, a lot of people get W3C stuff from the wayback machine.
... In fact what we do is all a question of how we estimate the risk.
... We could get people to talk about that at the workshop. I am happy to write up a proposal at the workshop.

<noah> FWIW, Wayback Machine copyright policy http://www.archive.org/about/faqs.php#20

ht: Business Continuity Management -- what happens if a crisis hits --

Tim: Interested to develop system which allows HTTP to failover to p2p, when there is a fail to get B linked from A, ask A to bootstrap move to p2p. A bit of protocol is necessary to bootstrap this.

peter: I am interested in this, failover from http to p2p

Tim: Important to look separately to short term and long term threats. Short term may be web server breakage, net breakage, crisis damage, or attack by government eg Egypt.

ht: Long term, orgs may be gone. Short term, they are around but can't do their jobs.

[Discussion of workshop]

[Discussion of product page]

<masinter> i think this is a lot less important for the TAG to work on than MIME and the web

<ht> .RESOLVED: The TAG agrees to endorse a workshop proposal on domain persistence for IDCC11 on 4 or 8 December. This probably means no more than that the workshop publicity would include some form of attribution to the TAG

<masinter> I don't endorse a workshop proposal from the TAG because it doesn't raise above my threshold for TAG priorities

<masinter> and i see no evidence of a community that wants to come to W3C

<noah> . RESOLVED: The TAG agrees to endorse a workshop proposal on domain persistence for IDCC11 on 4 or 8 December. This probably means no more than that the workshop publicity would include some form of attribution to the TAG. This is contingent on suitable approval from and coordination with W3C staff.

<masinter> i'm not sure what 'endorse' means if it has no commitment at all for any follow-on work

<masinter> i wouldn't mind a proposal for a breakout session at TPAC, on the other hand

<masinter> I don't mind a proposal for a breakout session at TPAC, on the other hand

<masinter> I don't mind a breakout session at TPAC, on the other hand

<DKA> +1 on noah's proposed resolution.

Votes: 7 yes, 1 no, 1 abstain.

<noah> Chair rules no consensus.

<noah> Larry says OK to do it anyway.

Chair rules no consensus, but Larry graciously agrees to having the group adopt the resolution anyway.

RESOLUTION: The TAG agrees to endorse a workshop proposal on domain persistence for IDCC11 on 4 or 8 December. This probably means no more than that the workshop publicity would include some form of attribution to the TAG. This is contingent on suitable approval from and coordination with W3C staff.

<noah> ACTION: To talk to Ian about whether a 15 min plenary presentation on TAG status would be appropriate at TPAC. [recorded in http://www.w3.org/2011/09/14-tagmem-irc]

<trackbot> Sorry, couldn't find user - To

<noah> ACTION: Noah to talk to Ian about whether a 15 min plenary presentation on TAG status would be appropriate at TPAC. [recorded in http://www.w3.org/2011/09/14-tagmem-irc]

<trackbot> Created ACTION-605 - Talk to Ian about whether a 15 min plenary presentation on TAG status would be appropriate at TPAC. [on Noah Mendelsohn - due 2011-09-21].

Unicode Normalization

<noah> http://lists.w3.org/Archives/Public/www-tag/2011Jun/0188.html

Noah: This started with an email from Addison Phillips

Noah: I missed this, then Peter asked about it in July.. We had a discussion on Sept 1.

We assigned a couple of actions, one for me to follow up with Addison.

<noah> ACTION-592?

<trackbot> ACTION-592 -- Peter Linss to draft possible TAG position statement on Unicode, and alert Addison Phillips of our intention to attempt to get agreement starting in October after the F2F -- due 2011-09-08 -- OPEN

<trackbot> http://www.w3.org/2001/tag/group/track/actions/592

See also ACTION-592 of Peter

Peter: Started, not ready.

<noah> ACTION-590?

<trackbot> ACTION-590 -- Noah Mendelsohn to follow up with Addison Phillips on Unicode normalization http://lists.w3.org/Archives/Public/www-tag/2011Jun/0188.html -- due 2011-08-30 -- PENDINGREVIEW

<trackbot> http://www.w3.org/2001/tag/group/track/actions/590

<noah> close ACTION-590

<trackbot> ACTION-590 Follow up with Addison Phillips on Unicode normalization http://lists.w3.org/Archives/Public/www-tag/2011Jun/0188.html closed

Ashok: Could you please recap the problem?

Peter: Unicode has many ways of representing the same thing. These look identical on the screen. Eg accented character vs accent and character.

The result is users are confused, as the visible glyphs are identical.

This is not a problem if your Unicode is normalized. There are normalization forms.
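
[Editorial sketch of the problem Peter describes, using Python's standard unicodedata module. The helper name canonically_equal is an assumption for illustration. The two strings below render as the same glyph but are different code point sequences until they are normalized.]

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, two code points


def canonically_equal(a, b):
    """Compare two strings after bringing both to Normalization Form C."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == decomposed)                   # code point comparison
print(canonically_equal(precomposed, decomposed))  # comparison after NFC
```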

There are though some ways things can have multiple encodings (eg Vietnamese multiple accents) which are not tackled by those normalization algorithms.

Then you have JS APIs for these things, accessing using say class names etc

Surprisingly this has not been a problem yet. Maybe because people just give up using non-ASCII.

<masinter> Is it the case that Vietnamese doesn't normalize? I don't think that's what Peter said

The I18n group have been pushing for more people to pay more attention to this problem.

Within CSS group, much pushback e.g. from implementers, as there are performance costs.

jar: Why do people think it's ok to not do this?

peter: Because they haven't seen the problem yet. But absence of evidence is not evidence of absence.

Noah: do we have a community of mostly people working in ASCII -- like the TAG?

Maybe we need non-English speaking input.

jar: This can't just be a web issue.

Larry: I worked on a system which had this problem. You can use non-ascii names without normalization.

Noah: Eg someone edits a style sheet in an editor.

masinter: Do any apps normalize?

masinter: The "visual cognates" problem is a lot more than this -- also o and 0, l and 1 for example.

<jar> plinss: Typos are beyond our control

<jar> plinss: Meta-question - should tag be involved at all?

<jar> "Programs should always compare canonical-equivalent Unicode strings as equal" http://unicode.org/faq/normalization.html

Ashok: Call Mark Davis, of Unicode fame.

masinter: The I18n group have been asking people to take this up, with no effect, and have asked the TAG to push.

peter: Questions abound as to whether to check just at input time, or what. There are illegal strings, what happens if you chop a string between a character and its accent then recombine it?
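Editorial sketch, not part of the minuted discussion: Peter's question about chopping and recombining can be illustrated as follows, assuming Python's standard `unicodedata` module. The helper `is_nfc` and the sample strings are hypothetical. Two pieces that are each in Normalization Form C can produce a result that is not in NFC when concatenated, so checking only at input time does not cover strings assembled later.

```python
import unicodedata

def is_nfc(s: str) -> bool:
    """Return True if s is already in Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", s) == s

# A string chopped between a base character and its combining accent.
head = "cafe"        # ends with the base character "e"
tail = "\u0301s"     # starts with COMBINING ACUTE ACCENT

# Each piece is normalized when looked at on its own...
pieces_ok = is_nfc(head) and is_nfc(tail)

# ...but the recombined string is not: NFC would compose "e" + accent into U+00E9.
joined = head + tail
joined_ok = is_nfc(joined)

print(pieces_ok, joined_ok)
```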

<masinter> http://www.w3.org/TR/2004/CR-charmod-resid-20041122/ is Candidate rec from 2004, hasn't progressed?

<masinter> http://www.w3.org/TR/2005/WD-charmod-norm-20051027/ is Working Draft, from 2005

Yves: The languages all have Unicode normalization in their runtime.

ht: I just checked the XML spec: it does not mention normalization at all, except in the section on XML names, which is not normative; it says to use Normalization Form C.

<ht> The XML spec has the following non-normative guidance on XML names: "Characters in names should be expressed using Normalization Form C as defined in [UnicodeNormal]."

ht: In the character model spec, there was no one asking for it, so no one paid attention to it.

masinter: there is a WD dated 2004, and a normalization WD dated 2005

<plinss_> http://www.w3.org/International/wiki/CharmodNormSummary

<masinter> http://www.w3.org/TR/charmod-norm/#sec-NormalizationApplication

ht: [reads] ... characters with multiple possible representations are compared code point by code point.

<masinter> http://www.w3.org/TR/charmod-norm/#sec-IdentityMatching

<masinter> I think we should find that the I18N group should bring charmod-norm to rec, and address this problem by removing theory that doesn't correspond to practice

<ht> XML requires that processors _not_ normalize when comparing e.g. start and end tags (http://www.w3.org/TR/REC-xml/#dt-match): "Two strings or names being compared [*match* if they] are identical. Characters with multiple possible representations in ISO/IEC 10646 (e.g. characters with both precomposed and base+diacritic forms) match only if they have the same representation in both strings."

<noah> TBL: The TAG is at its best when helping multiple groups. This is a slightly unusual case in which the i18n group themselves is in a way TAG-like in their role. I think we can support >them< in resolving this, but not clear we should be getting in at the level of detailed technical analysis.

<noah> TBL: I think one could imagine a string-compare-xxxx, so as not to do it in each API. Do not ask each subsystem to do it.

<noah> NM: What about CSS?

<masinter> reading http://www.w3.org/TR/charmod-norm/#sec-IdentityMatching, it actually looks like this recommends exact string match. That is, why is there a problem?

<noah> TBL: (scribe still struggling)

<ht> Don't change the data models

<ht> Change the way equality is tested for

<noah> TBL: What I'm suggesting is doing the smart check at comparison time.

<jar> I thought we were asked to review http://www.w3.org/International/wiki/CharmodNormSummary ?

<noah> AM: This is a bigger problem due to the IRI situation. It behooves us to think about this.

Tim: Clearly if you just introduce canonicalization into one subsystem, then things break: for example, if the CSS system does and the XML DOM doesn't then the match between CSS and XML will fail where it used to work. Instead, good to introduce a compare function which compares blind to differences in encoding of accents. This will help and will not cause the same damage -- very rare damage.

If I was building a system from scratch, I would probably canonicalize on input.
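Editorial sketch, not part of the minuted discussion: the compare function Tim describes might look roughly like this, assuming Python's standard `unicodedata` module. The name `canonical_equal` and the sample class names are hypothetical. The stored strings are left untouched and only the comparison is made normalization-aware.

```python
import unicodedata

def canonical_equal(a: str, b: str) -> bool:
    """Treat two strings as equal if they are canonically equivalent.

    Hypothetical helper: it normalizes temporary copies at comparison
    time and does not modify the strings held by either subsystem.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

css_class = "r\u00e9sum\u00e9"     # precomposed, e.g. as typed in a style sheet
dom_class = "re\u0301sume\u0301"   # decomposed, e.g. as produced by another tool

print(css_class == dom_class)                 # exact code-point match
print(canonical_equal(css_class, dom_class))  # normalization-aware match
```

This covers only canonical equivalence. Visual cognates such as "o" and "0" still compare as different.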

<ht> I read the above from charmod-norm as at variance with TimBL's suggestion: " In accordance with section 3 Normalization, this step [normalization] must be performed by the _producers_ of the strings to be compared." [emphasis added]

But all this hodulebr [hmm... editor can't figure out what that word was supposed to be] architected by the i18n group not the TAG.

Yves: We could ask the i18n group why their advice did not take off.

<jar> "we would like to request that TAG schedule time in about four weeks to review I18N WG's proposed recommendations concerning Unicode Normalization" - Addison to www-tag on 6/29

Peter: They have talked to CSS group. There are disagreements within the CSS group.

Deal with it/Don't deal with it/worry about performance

jar: It is in our charter to get involved.
... If this is a disagreement between working groups, it is in our charter. It needs to be more clearly laid out -- that os [?] it seems in progress.

<masinter> I don't understand why they have 2004 working draft with no progress.... where are the heartbeats documents, for example?

Peter: I volunteer to champion this within the TAG.

Jar: I'd like to ask various outside people for their opinion.

Peter: I'm not an expert on the unicode side.

<masinter> their working draft disagrees with the wiki page, but without justification for why the change

<jar> right.

<masinter> http://www.w3.org/TR/charmod-norm/#sec-IdentityMatching

Noah: If the i18n stuff were rec track, then their CR phase would involve getting the CSS people etc. to implement it, maybe.

masinter: It may be that the i18n group needs to think about the breadth of applicability of this work.

<jar> i18n core charter seems to be http://www.w3.org/International/core/

Noah: As an individual TAG member, I'm conflicted. Tim says this is in the i18n charter, and JAR says this is in our charter, and Larry says it is not clear that the i18n WG is using the W3C process normally.

<masinter> that is, the distinction between internationalization for presentation forms and protocol elements, and that the broad charter of I18N applies to presentation forms, and that they might want to be more modest in trying to internationalize protocol elements, against other priorities of reliability, implementation, performance

Noah: either their own material, and in their approach to CSS. [???]
... We do think the TAG doesn't scale if it takes over work that is the province of other groups — normally there are processes by which WGs give each other a heads up of direction, and I don't think the TAG should be used as a short cut in this case.
... We have no consensus. Propose: the TAG is not convinced or ready to commit to a deep dive. We would like to do a small 4-8 week exercise -- maybe a telcon with i18n -- and ask the question again after that time.

jar: Not happy. This is not a very big thing. We can't do things until they have done their homework. We can't help now -- they need to get sorted out.

<jar> not a big thing *yet*. What I mean is that there's not much for us to do until it's better prepared for us so that we can help efficiently — seems like a jumble to me.

<noah> ACTION-592?

<trackbot> http://www.w3.org/2001/tag/group/track/actions/592

jar: First step should be they provide a document for us to go from

<jar> i'd like an enumerated set of issues and options, in writing, as prep for a telcon

<jar> a brief should ideally summarize the opposing views

<noah> ACTION: plinss to invite I18N and other concerned groups to provide written technical input as prep to discussion with the TAG regarding unicode normalization [recorded in http://www.w3.org/2011/09/14-tagmem-irc]

<trackbot> Created ACTION-606 - Invite I18N and other concerned groups to provide written technical input as prep to discussion with the TAG regarding unicode normalization [on Peter Linss - due 2011-09-21].

ht: The wiki is actually not consistent with what it recommends -- do you or do you not compare canonicalization-aware?

Minimization

<noah> ACTION-590?

<trackbot> ACTION-590 -- Noah Mendelsohn to follow up with Addison Phillips on Unicode normalization http://lists.w3.org/Archives/Public/www-tag/2011Jun/0188.html -- due 2011-08-30 -- CLOSED

<trackbot> http://www.w3.org/2001/tag/group/track/actions/590

Noah: Commit or cancel?

we have a commitment to have a finding in 1 month, hoped to have an initial draft in July.

Worried about timing.

DanA said he thought he could get it together in time for TPAC and it will be contentious.

DanA: This has always been a really small thing anyway. The device API folks say they are happy to implement it, and we are in some cases already doing it -- but we would like to see more examples of where it has been a good approach applied in the wild.

Where has this been applied and resulted in better privacy?

jar: (But this is a basic tenet of capability design!)

dka: We have had strong geolocation WG participation since it began, and upcoming will be civic address objects, as an enhancement to geo lat long (street/city/region/country etc).
... If you apply this design, you should be able to just ask for say the state.

<masinter> ack

dka: But implementers Google, Moz in geo wg, and Opera push back.

"We did privacy in geopriv, a can of worms .. " They never dealt with minimization -- if users and devs are not asking for this, then why do it... OWTTE.

In Lyon last year I sat in on the geo location group

scribe: supported their approach to get out the document, and not be drawn too much into the privacy issue.

<masinter> I think we have some responsibility to push on policy-based requirements

masinter: I am trying to understand the responsibility of the TAG .. relationship with the privacy stuff going on ... you said you got push back from implementers, but this is not necessarily a market-based requirement, but a policy-based one.

accessibility and privacy fit into areas where there may be other forces beyond the ones the product manager immediately sees, which can't be justified on the basis that this is what the users are demanding.

Noah: I don't have a kind warm feeling about that geopriv history at the moment.

Ashok: I don't think you can find good use cases.

<masinter> i think i understand some of the geopriv / geolocation history

DKA: I can find lots of academics who will talk about this but I need people in the real world, selling in the market

Noah: If I can do a getCity(), getState() call etc, that's one thing ... I can also do a different call getAddress() which may return blanks.

<masinter> i think it's an API design concern that has only a little bit to do with access control

dka: That is not the question -- this is not about access control.

<masinter> the shape of the API treats 'granularity' as an input parameter, and where the result has options for returning multiple granularity

<masinter> i'm disagreeing with Noah that 'access control' is the right way to approach this problem

Dan: The model we are dealing with in this minimization issue is that it is good for a developer to ask for the minimum data that they can use, to avoid having more private data around than necessary. It is not about access control by the user.
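Editorial sketch, not part of the minuted discussion: applied to the civic address example mentioned earlier, the minimization pattern might look roughly like this. The function `get_civic_address`, the field names, and the sample data are all hypothetical and do not come from any geolocation draft. The caller names the fields it needs and receives nothing else.

```python
from typing import Dict, Iterable

# Hypothetical full civic address held by the platform (sample data).
_FULL_ADDRESS: Dict[str, str] = {
    "street": "1 Example Street",
    "city": "Springfield",
    "region": "Massachusetts",
    "country": "US",
}

def get_civic_address(fields: Iterable[str]) -> Dict[str, str]:
    """Return only the requested fields of the address (hypothetical API)."""
    requested = set(fields)
    unknown = requested - set(_FULL_ADDRESS)
    if unknown:
        raise KeyError(f"unknown address fields: {sorted(unknown)}")
    # Fields that were not requested are never handed to the caller.
    return {name: value for name, value in _FULL_ADDRESS.items()
            if name in requested}

# An application that only needs the state asks for just that.
state_only = get_civic_address(["region"])
print(state_only)
```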

Noah: [stuff about user permission access control]

Yves: All users know what an app will access, as it is displayed in big letters. They have seen that.

<masinter> in fact, what i like about 'data minimization' is that it gets around some of the access control problems

<Ashok> In access control someone decides who gets to access what ... in this case the user decides what is disclosed about him

Yves: However, as the apps use it, there are many possibilities of leaks by which the data can get out. So minimization is interesting to just reduce the likelihood of damage by a leak.

<masinter> alternative is a separate part of API to 'fuzz' or 'reduce granularity' ... might be more effective than data minimization

dan: The assumption is that the user should not be given lots of prompts. Minimization as a best practice does not involve prompts.

noah: Two ways of doing this --inspecting the code of an app, or doing it at run time.

[discussion of various forms of attack]

larry: You can get the same effect as minimization by fuzzing the geo data.

dan: This draft supports that sort of approach.
... The document is supposed to just state the arch'l principle, which is why it is good as a TAG doc.
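Editorial sketch, not part of the minuted discussion: one simple way to reduce the granularity of geo data, as Larry suggests, is to round coordinates before handing them to the application. The function `coarsen_position` and the sample coordinates are hypothetical. This is deterministic rounding rather than random noise, and it is only one of several possible meanings of "fuzzing".

```python
from typing import Tuple

def coarsen_position(lat: float, lon: float, decimals: int = 1) -> Tuple[float, float]:
    """Reduce the granularity of a position by rounding both coordinates.

    Hypothetical helper: fewer decimals means a coarser position.
    """
    return (round(lat, decimals), round(lon, decimals))

precise = (45.76404, 4.83566)   # sample coordinates, chosen for illustration
coarse = coarsen_position(*precise)
print(coarse)
```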

<masinter> could the privacy activity take this forward without the TAG owning it?

I would hope we would get agreement with the WGs which would be most connected, such as geolocation, for example.

larry: What part of this could we push onto the privacy activity?

dan: I'd like to complete this, and then hopefully have the privacy group support that document and point to it.

jar: These are not really guidelines, just an article.

dan: Yes.
... I would like this to be a finding.

Larry: I like it enough that I would like you to make it into something which could be a finding.

jar: It is a design pattern.

tim: I like the "pattern" language.

Noah: Do we know [where this fits] in the system? It is API mini'n

dan: Actually Data Minimization.

Tim: Do you want to extend it to Sensitive Data Minimization in Code ?

Ashok: Would like to publish this as a Note, not a Finding.
