Re: Official RDFa Response: ISSUE-90: CURIEorURI Value Space Collisions from Manu Sporny on 2011-06-01 (public-rdfa-wg@w3.org from June 2011)

From: Manu Sporny <msporny@digitalbazaar.com>
Date: Tue, 31 May 2011 22:07:05 -0400
To: Niklas Lindström <lindstream@gmail.com>
CC: RDFa WG <public-rdfa-wg@w3.org>
Message-ID: <4DE59EC9.1050601@digitalbazaar.com>
On 05/31/2011 11:19 AM, Niklas Lindström wrote:
> Hi Manu, all!

Hi Niklas - thanks for the very detailed and thoughtful reply. Responses
below (they're not official, just trying to work out where we go from here).

> I thank the working group for reviewing my issue!
> 
> However, it seems I haven't quite gotten my point through. I didn't
> propose to limit the lexical value space of CURIEs in general. It is
> only the construct SafeCURIEorCURIEorURI I am concerned about. And
> that is a new construct, hitherto used *only* in RDFa 1.1.

Ah, I don't know if that was clear to the rest of the group. It was not
clear to me, so thanks for clearing that up. That said, we have
considered this issue in various guises over the development of RDFa
1.1. More explanation follows...

> See my comments below. (I also elaborated on this in my reply [1]
> during the discussion in April.)
> 
>> We discussed this at length and found the following:
>>
>> 1. Limiting the CURIE to a regex arbitrarily limits the allowable
>>   characters such that other use cases cannot be supported, such as
>>   CURIE references containing "@" or ":" or any internationalized
>>   character in them.
> 
> As said above, I didn't propose to reduce the current lexical space of
> CURIEs everywhere, only in CURIEorSafeCURIEorURI (and not necessarily
> restricted with a regex; but e.g. redefined as QNameOrSafeCURIEorURI).
> If one wants to use complex CURIEs there, e.g. "dpb:resource/Concept",
> SafeCURIEs would work fine, just as before.

Ah, ok - that is different than what we thought you were saying. Three
points:

1. Establishing QName-like behavior would confuse authors further.
2. SafeCURIEs are rarely used in practice.
3. CURIEs are rarely used in @about and @resource.

Point #1: I discussed this with the Editor today and both of us thought
that the QName bit is a non-starter. The reason is that we have spent
many years trying to convince people that CURIEs are not QNames - which
they are not. There is even a section of the RDFa specification that
details the difference:

http://www.w3.org/TR/rdfa-core/#why-curies-and-not-qnames

Introducing anything that is a QName or QName-like would confuse the
issue. That said, we could use something like
RestrictedCURIEOrSafeCURIEorURI, but as Ivan has pointed out, that is
problematic as well. It also wouldn't solve the problem where you have
schemes like "mailto" or "sip". So, we would restrict the potential
input, but not solve the problem.
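
To make the "restrict, but not solve" point concrete, here is a rough
Python sketch. The pattern is only an assumed stand-in for a
"restricted" reference (name-like, ASCII only); it is not taken from any
draft, and the function name is made up for illustration.

```python
import re

# Assumed stand-in for a "restricted" CURIE: a name-like prefix, a colon,
# and a name-like reference. Illustrative only; not from any RDFa draft.
RESTRICTED_CURIE = re.compile(
    r"^[A-Za-z_][A-Za-z0-9_.-]*:[A-Za-z_][A-Za-z0-9_.-]*$"
)


def looks_like_restricted_curie(value: str) -> bool:
    """Return True if the value fits the assumed restricted pattern."""
    return RESTRICTED_CURIE.match(value) is not None


# Values an author means as IRIs, checked against the restricted pattern.
for candidate in (
    "http://example.org/doc",
    "sip:niklas@example.org",
    "sip:example.org",
):
    print(candidate, looks_like_restricted_curie(candidate))
```

Under this assumed pattern the first two values are rejected, but a
host-only address such as "sip:example.org" still passes, so a declared
"sip" prefix would still capture it.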

Point #2: We have enough data now to know that people are not using
SafeCURIEs. We don't know if they are not using them by accident, or if
they are not using them because they don't know they exist. Personally,
I hate safe CURIEs and think that many people don't even know that they
exist. They are unnecessary in almost every use case imaginable and
complicate RDFa implementations. I know others in the community feel the
opposite way, but the bottom line is - the majority of RDFa documents
out there do not use SafeCURIEs, so a rule like
RestrictedCURIEOrSafeCURIEorURI would effectively boil down to
RestrictedCURIEorURI in practice - which wouldn't solve the problem
you're attempting to solve.

That is, most of the markup in the wild would be wrong - the RDFa
specification would deviate from how people use CURIEs.

Point #3: Almost every case of RDFa that I have seen does not use CURIEs
in @about or @resource - named bnodes are the rare exception and those
are very seldom used. Hash-IRIs (relative IRIs) or absolute IRIs are the
norm. I know your point is about when somebody defines an "http" prefix
in their document - well, in that case the RDFa is wrong. It will happen
- we know it will happen, but in almost every case, it will be sorted
out. If it's not sorted out, the data will be invalid and nobody will
use it. That is - one can easily recover from the error.

>> 2. Limiting the character set still doesn't prevent false positives
>>   for very simple schemes like SIP. For example, to prevent a
>>   false positive for "sip:niklas@example.org", one would have to
>>   limit the "@" in all CURIEs. However, there may be some vocabularies
>>   that want to utilize the "@" sign. That is, we may think we know
>>   which characters are important now, but all that must happen for
>>   this approach to fail is that an Internet Scheme would appear that
>>   uses a character in the list of acceptable characters - for example,
>>   "-" or "."
> 
> Again, my issue only concerns the collision-prone *mixing* of CURIEs
> and URIs, where URIs are the norm. I do not find TERMorCURIEorAbsURI
> nearly as problematic (as used for e.g. @rel and @property). That is
> simply because there CURIEs are the norm and AbsURIs the exception
> (since in RDFa 1.0, only CURIEs were allowed).

There is a contingent of people, namely those coming out of the WHATWG and
Microformats communities, who would disagree with you. They would argue
that AbsURIs would be the norm and CURIEs would be the exception (at
least, with the markup that they create). We should expect CURIEs and
AbsURIs, when used in the same datatype, to conflict at some point.
Again - the key here is how often that happens and how recoverable that
error is.

Granted there will be some cases where the author does not have a choice
on which vocabularies are defined in the head of the document. In that
case, they can always redefine the "http" prefix to be "http:" like so:
prefix="http: http:". It is a bit ridiculous, but even in the worst case
scenario authors can still override the mappings they use.
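
A minimal sketch of that behaviour, in Python. It assumes the simple
rule that a value is treated as a CURIE only when its prefix is
declared; SafeCURIEs, terms and relative IRI resolution are left out,
and the vocabulary IRI "http://vocab.example/http#" is invented for
illustration.

```python
def resolve_curie_or_iri(value: str, prefixes: dict[str, str]) -> str:
    """Expand value as a CURIE if its prefix is declared, else return it as-is.

    Simplified sketch: no SafeCURIE handling, no relative IRI resolution.
    """
    prefix, colon, reference = value.partition(":")
    if colon and prefix in prefixes:
        return prefixes[prefix] + reference
    return value


iri = "http://example.org/people#manu"

# No "http" prefix declared: the value is left alone.
print(resolve_curie_or_iri(iri, {}))

# A hypothetical profile declares an "http" prefix: the IRI is mangled.
print(resolve_curie_or_iri(iri, {"http": "http://vocab.example/http#"}))

# The author overrides it with prefix="http: http:": the IRI round-trips.
print(resolve_curie_or_iri(iri, {"http": "http:"}))
```

The last call is the override case: the "http" prefix expands to
"http:", so the original IRI comes back unchanged.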

As for the case where the mapping is inserted into a profile that they
don't control, after they have authored their documents, well - that's a
problem. However, so is if one of the vocabularies that they were using
is deleted from the profile. Profile modification is always a problem.

Keep in mind that this problem /only/ happens when a vocabulary prefix
is defined in a profile without the author's knowledge. We don't feel
that that is going to happen very often, and as I said above, it's
always correctable if it is detected.

>> 3. CURIEs are not allowed in @href and @src, so the likelihood that
>>   this will become a practical concern is lessened.
> 
> I disagree. Since @about and @resource are fundamental to RDFa (indeed
> needed in certain places), I don't see how the collision risk is
> substantially reduced.

I think we're miscommunicating. What I meant was this:

Ignoring the other attributes, if we consider allowing CURIEs in @href,
@src, @about and @resource - there would be four places where CURIEs
could screw things up. Since we only allow CURIEs in @about and
@resource, we halve the places where the issue can occur.

Since people mainly use relative references (hash-URIs in @about), the
potential for the problem to surface is further lessened. However, that
doesn't mean that the problem can't occur. The most prevalent issue may
be the declaration of a 'http' prefix. That could cause issues if
"http:" IRIs are used in @about and @resource, but as I said before -
people typically use relative IRIs in @about, which leaves @resource as
our most common pain point for this problem. @resource is seldom used.

Really, this is an issue when people use full IRIs. Based on the data
that we have seen to date, people typically use IRIs in @about and
@resource, and they prefer relative IRIs in @about. They tend not to use
@resource at all, and use @href instead in the majority of cases.

We have not done a complete study on RDFa usage in the field - this is
just what the group has seen in general. Others may disagree with this
overview of the situation.

>> 4. There is no ambiguity as far as an RDFa Processor is concerned.
>>   For example, if an "http" prefix is defined, then anything that
>>   accepts a CURIE would expand the "http" prefix. That is, if the
>>   prefix is defined, it is a CURIE. If it is not defined, it is an
>>   IRI. Authors will discover this very quickly and vocabulary
>>   maintainers are advised to avoid naming their vocabularies after
>>   Internet Protocol Schemes.
> 
> Certainly. But my issue didn't concern processing ambiguities. There's
> a conflation of value spaces, which as you point out authors have to
> be aware of and avoid. The need for the proposed advice is what
> concerns me, since vocabulary prefix naming and URI schemes evolve
> completely independently of each other!

That is true, but unfortunately I don't think there is any way around
this problem other than "authors have to be aware of the issue". They
will become aware of it if they check the triples that are marked up on
their page - there are good tools to allow them to do that at this
point. If they generate bad triples, people will either complain, or not
use their pages. This down-side is a design trade-off, and that's really
what we're talking about here - the lesser of two evils.

The RestrictedCURIE approach just won't work because of schemes like
"sip" and "isbn". The alternative is to say "Safe CURIEs only!", which
won't work because people don't use them.

We are attempting to make authoring easier for web authors and because
of that, we are knowingly introducing a situation where there might be a
corner case where an author's markup generates triples that they were
not intending to generate.

However, the benefits that we reap are that plain CURIEs can be used in
most cases without error - and this matches the practice of using plain
CURIEs that we've seen out in the field.

> I basically don't see how this shorthand feature can be warranted when
> it leads to this conflation. Especially since SafeCURIEs have been
> there for this use case all along. (Now in RDFa 1.1, it seems
> SafeCURIEs are effectively a legacy.)
> 
> I understand that e.g. the suggested QNameOrSafeCURIEorURI is more
> complex to read though. This is because the tokens in the "local name"
> determine if prefix expansion would be triggered. Given the choice,
> I'd probably prefer to revert to just SafeCURIEorURI for @about and
> @resource!
>
> Requiring users who want to use CURIEs in @about and @resource to
> surround them with "[" and "]" just seems more wise to me than making
> any prefix declared, perhaps in a profile beyond the author's control,
> to expand in every subject and object supplied via these attributes,
> for every RDFa 1.1 document created.

I understand your reasoning and agree with it from a purely logical
standpoint if we were to only consider the evidence you present above.
However, doing that would go against established practice, which is also
evidence that we should take into account. People wouldn't use it in the
way that you intend people to use it.

>> 5. It would create a backwards-incompatible change to RDFa. The
>>   Working Group is not chartered to make this sort of change to
>>   RDFa.
> 
> On the contrary. The SafeCURIEorCURIEorURI construct did not exist in
> RDFa 1.0. In fact, the current situation does actually introduce a
> backwards-incompatible change. In RDFa 1.0, no prefix defined will be
> expanded for values in @about and @resource starting with it.

Hmm, this is an interesting point. I think we mention that if a @version
is specified, the RDFa Processor /MUST/ conform to that version. This
would mean that the backwards-incompatible change you mention never
happens because the CURIE won't be expanded in RDFa 1.0 mode. I don't
know if we state this clearly in the specification. However, we really
should have a test case in the test suite to check this.
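
Such a test could look roughly like the sketch below. It is Python with
a toy resolver, not a real RDFa processor: the rdfa_version argument
stands in for whatever the processor derives from @version, and the rule
encoded is the one discussed in this thread (RDFa 1.0 expands only
SafeCURIEs in @about and @resource, RDFa 1.1 also expands plain CURIEs
whose prefix is declared). The colliding prefix mapping is invented.

```python
def resolve_about(value: str, prefixes: dict[str, str], rdfa_version: str) -> str:
    """Toy resolver for @about/@resource values (hypothetical helper)."""
    # SafeCURIEs ("[prefix:reference]") are expanded in both versions.
    if value.startswith("[") and value.endswith("]"):
        prefix, colon, reference = value[1:-1].partition(":")
        if colon and prefix in prefixes:
            return prefixes[prefix] + reference
        return value  # handling of bad SafeCURIEs is out of scope here
    # Plain CURIEs are only expanded in 1.1 mode.
    if rdfa_version == "1.1":
        prefix, colon, reference = value.partition(":")
        if colon and prefix in prefixes:
            return prefixes[prefix] + reference
    return value


prefixes = {"http": "http://vocab.example/http#"}  # invented colliding prefix
value = "http://example.org/doc"

# 1.0 mode: the value must survive untouched.
assert resolve_about(value, prefixes, "1.0") == value
# 1.1 mode: the declared prefix captures it.
assert resolve_about(value, prefixes, "1.1") == "http://vocab.example/http#//example.org/doc"
# SafeCURIEs behave the same in both modes.
assert resolve_about("[http:Request]", prefixes, "1.0") == "http://vocab.example/http#Request"
```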

>> In the end the group didn't think that limiting the value space of
>> CURIEs would actually solve the problem you are concerned about. It may
>> lessen the problem, theoretically, but nobody has demonstrated where
>> this leads to a critical real-world problem with RDFa. In the worst
>> case, the vocabulary prefix is changed in the RDFa document. In the end
>> the Working Group decided to not place additional limitations on the
>> value-space for CURIEs for the reasons listed above.
> 
> I am quite aware that by just avoiding the definition of a handful of
> common URI schemes (e.g. http, https, possibly ftp, mailto, sip), and
> providing nothing new comes along, this is not much of a problem
> today. But CURIEorSafeCURIEorURI is an issue of conflation (of
> prefixes and schemes), and I want to emphasize that.
> 
> I can only reiterate the risk of potential rise in popularity of some
> other protocol than http(s) amongst linked data users, in combination
> with definition of prefixes in e.g. profiles or publishing systems
> beyond the author's immediate control. It is a small but complex
> problem which could cause a lot of *dynamically published* RDFa to
> become problematic in a year, or 5, or 10. And this might be hard to
> detect, unless publishers monitor all protocols and prefixes used in
> their publishing systems. If RDFa 1.1 is published with
> SafeCURIEorCURIEorURI as it is now, this would be very hard to
> rectify.

As I said above - I don't think the group believes that it would be
difficult to detect and rectify.

> It only takes for *one* prefix in the (decentrally) growing list of
> common prefixes to become a popular protocol for this to become a real
> problem.

I can guarantee you that if that starts happening, the entire community
will jump on the vocabulary author and make sure that they make it clear
that a different prefix should be used. We haven't done this yet for the
'http' scheme/vocabulary because we've never seen it become an issue.
Nobody has actually reported this to be an issue to the community, either.

Yes - it is theoretically possible - but it has yet to materialize.

Perhaps what we should do is take an action as a community to make sure
that services like http://prefix.cc/ clearly warn about the usage of
Internet schemes. Perhaps we could also get many of the RDFa Processor
authors to generate warnings when prefixes that are known Schemes are used.
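
Such a warning could be as simple as the Python sketch below. The
function name is made up, the scheme list is just the handful of schemes
named in this thread, and the "http" mapping is invented; a real
processor would more likely check against the full IANA scheme registry.

```python
# Scheme names taken from this thread; a real check would more likely use
# the full IANA URI scheme registry.
KNOWN_SCHEMES = {"http", "https", "ftp", "mailto", "sip"}


def scheme_like_prefixes(prefixes: dict[str, str]) -> list[str]:
    """Return declared prefixes that collide with a known URI scheme name."""
    # URI scheme names are case-insensitive, so compare in lower case.
    return sorted(p for p in prefixes if p.lower() in KNOWN_SCHEMES)


declared = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "http": "http://vocab.example/http#",  # invented for illustration
}
for prefix in scheme_like_prefixes(declared):
    print(f'warning: prefix "{prefix}" is also a URI scheme name')
```

The check only looks at declared prefix names, so it would flag the
problem even on pages that do not yet contain a colliding IRI.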

> As I also said in [1], I am also worried that this practice may be
> carelessly adopted in other scenarios. Particularly RDF APIs, where
> one may want to define lots of prefixes for authors' convenience, and
> where it may very well be desirable to make statements about resources
> identified with protocols other than http. (And we've already found
> cases where "http" is defined as a prefix in code libraries.
> Furthermore, it's not uncommon for prefixes to be automatically
> generated.)

I see these as bugs in the APIs and prefix-generation code. I know that
you may see this differently, but there is no way to reliably prevent
this problem with the evidence that we have before us.

>> We discussed it during two telecons:
>>
>> http://www.w3.org/2010/02/rdfa/meetings/2011-05-05#CURIEorURI_Value_Space_Collisions
>> http://www.w3.org/2010/02/rdfa/meetings/2011-05-19#ISSUE__2d_90__3a__CURIEorURI_Value_Space_Collisions
>>
>> The decision is recorded here:
>>
>> http://www.w3.org/2010/02/rdfa/meetings/2011-05-19#resolution_2
>>
>> Since this is an official Last Call response, could you please respond
>> as soon as possible and let us know whether or not the Working Group has
>> considered your request and responded accordingly. Please let us know if
>> this is an acceptable outcome and whether you can live with the
>> decision. Thank you for reviewing the RDFa specification and sending in
>> your comments. :)
> 
> I could live with it, if it comes to that. :) But I cannot really agree.
> 
> Have you discussed this combination of CURIEorURI in e.g. the RDF
> working group, or the RDF community in general? I'd be somewhat
> surprised if I'm the only one feeling uneasy about it..

It was a very long discussion in the RDFa 1.0 Working Group. We have not
raised the issue with the RDF Working group. Perhaps we should raise it
as a coordination issue for all the Semantic Web groups. You are not the
only one that felt uneasy about this - we all did at first, but once we
started seeing more RDFa usage patterns, we tended to get a bit less
concerned about the potential issue.

> I know this might seem like an innocent issue with little real world
> problems, but I hope I've made my view clearer of the potential risks
> and difficulties of managing those. I genuinely wish for RDFa 1.1 to
> succeed, and I have the utmost respect for your work on it!

Thank you again for the long and thoughtful response. Please don't take
this e-mail as rejection of your thoughts or input. I think you make
several very good and very valid points. We have weighed the risks
versus the rewards and we think that the rewards far outweigh the risks.
That said, we'll make it a point to discuss this during our upcoming
telecon to make sure that the group still feels this way.

And as always, if people from the RDFa community could weigh in - that
would be great.

-- manu

-- 
Manu Sporny (skype: msporny, twitter: manusporny)
President/CEO - Digital Bazaar, Inc.
blog: PaySwarm Developer Tools and Demo Released
http://digitalbazaar.com/2011/05/05/payswarm-sandbox/
Received on Wednesday, 1 June 2011 02:07:46 UTC