Re: Percent-encoding in DM still broken? from Eric Prud'hommeaux on 2012-02-01 (public-rdb2rdf-wg@w3.org from February 2012)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Tue, 31 Jan 2012 19:37:55 -0500
To: Richard Cyganiak <richard@cyganiak.de>
Cc: W3C RDB2RDF <public-rdb2rdf-wg@w3.org>
Message-ID: <20120201003753.GA8387@w3.org>
* Eric Prud'hommeaux <eric@w3.org> [2012-01-31 11:56-0500]
> * Richard Cyganiak <richard@cyganiak.de> [2012-01-31 15:56+0000]
> > I just reviewed the responses to Last Call comments. It appears that Souri's comment here hasn't been properly handled:
> > http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2011Nov/0000.html
> > http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2011Nov/0002.html
> > 
> > Or am I looking at the wrong document? I'm looking at this:
> > http://www.w3.org/2001/sw/rdb2rdf/directMapping/LC/Overview.html
> 
> That's the correct document. These mods should resolve the issues listed in <http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2011Nov/0002> without making all non-ASCII impossible to read:
> 
>   Definition percent-encode: (a subset of HTML5 form dataset encoding):
> - * Replace each PERCENT SIGN character ('%', U+0025) with the string "%25".
> + * Replace each character in the range U+0000 to U+001F with the percent-encoded form of that character [RFC3986].
> + * Replace each of these characters
>         PLUS SIGN character    ('+', U+002B)
>         PERCENT SIGN character ('%', U+0025)
>         LESS-THAN SIGN         ('<', U+003C)
>         GREATER-THAN SIGN      ('>', U+003E)
>         QUOTATION MARK         ('"', U+0022)
>         LEFT CURLY BRACKET     ('{', U+007B)
>         RIGHT CURLY BRACKET    ('}', U+007D)
>         VERTICAL LINE          ('|', U+007C)
>         CIRCUMFLEX ACCENT      ('^', U+005E)
>         GRAVE ACCENT           ('`', U+0060)
>         REVERSE SOLIDUS        ('\', U+005C)
  +       NUMBER SIGN            ('#', U+0023)
>     with the percent-encoded form of that character [RFC3986].
  - * For table names, replace each NUMBER SIGN character ('#', U+0023) with the string "%23".
>   * For table names, replace each SOLIDUS character ('/', U+002f) with the string "%2f".
>   * For attribute names, replace each HYPHEN-MINUS character ('-', U+002d) with the string "%2D".
>   * For attribute values, replace each FULL STOP character ('.', U+002e) with the string "%2E".
>   * Replace each SPACE character (U+0020) with the PLUS SIGN character (+, U+002B).
> 
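For illustration, the amended definition above can be sketched in Python. This is only a sketch under stated assumptions, not text from the spec: the names percent_encode, pct, SPECIAL and EXTRA are hypothetical, the control range is taken to be the whole of U+0000 to U+001F, hex digits are emitted in upper case, and each delimiter is replaced by the percent-encoded form of its own octet.

```python
# Hypothetical sketch of the proposed percent-encode definition.
# Characters always replaced by their percent-encoded form.
SPECIAL = set('+%<>"{}|^`\\#')

# Extra delimiter escaped depending on what is being encoded.
EXTRA = {"table": "/", "attribute_name": "-", "attribute_value": "."}


def pct(ch: str) -> str:
    """Percent-encode each UTF-8 octet of ch (RFC 3986 style, upper-case hex)."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))


def percent_encode(s: str, kind: str) -> str:
    """Encode s as a table name, attribute name or attribute value."""
    extra = EXTRA[kind]
    out = []
    for ch in s:
        if ch == " ":
            # SPACE becomes PLUS SIGN; a literal '+' is escaped below.
            out.append("+")
        elif ord(ch) <= 0x1F or ch in SPECIAL or ch == extra:
            # Assumption: all C0 controls (U+0000..U+001F) are escaped.
            out.append(pct(ch))
        else:
            out.append(ch)
    return "".join(out)
```

Working character by character avoids any ordering problem between the '+' rule and the SPACE rule: a literal '+' becomes "%2B" while a space becomes '+', regardless of which bullet is read first.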
> 
> > We discussed this on a call here:
> > http://www.w3.org/2011/11/08-RDB2RDF-minutes.html
> > but the minutes don't capture a clear resolution to the issue. At any rate it's still broken and needs to be fixed.
> 
> We had test cases which exemplify the decision around a minimal encoding:
> 
> http://www.w3.org/2001/sw/rdb2rdf/wiki/R2RML_Test_Cases_v1#I18NnoSpecialChars
          ↖ ↗
Ignore that│. I hadn't noticed "any character that is not in the iunreserved production" in the R2RML text. This test case provides the same results regardless of whether you use my definition above or R2RML's IRI-safe function.

> > The simplest fix might be to replace the definition of “percent-encode” in the DM spec with a reference to “IRI-safe” in R2RML:
> > http://www.w3.org/2001/sw/rdb2rdf/r2rml/#dfn-iri-safe

I'm OK with this, though I note that the intent is to provide strings which may be safely embedded in an IRI isegment-nz, which is 1 or more ipchars:
  ipchar = iunreserved | pct-encoded | sub-delims | ":" | "@"
Richard noted that these chars are %-encoded in R2RML, but never in the DM:

  @:!$&'()*,;=

, all of which are allowed in ipchar (":" and "@" explicitly, the rest via sub-delims). If we follow the lead of ipchar, we'd escape "any character that is not ":" or "@" and is not in the production <iunreserved> or <sub-delims>". This would render slightly more readable IRIs when generated from keys such as email addresses:

  <People/email-eric@w3.org> vs. <People/email-eric%40w3.org>

Java class files:

  <Code/file-ConsoleDocument$ConsoleLine.class> vs. <Code/file-ConsoleDocument%24ConsoleLine.class>

biological thingies:

  <Receptors/GABA(A)>   vs. <Receptors/GABA%28A%29>

and silly company names:

  <Company/name-Yahoo!>  vs. <Company/name-Yahoo%21>
  <Company/name-AT&T>    vs. <Company/name-AT%26T>
  <Company/name-E*Trade> vs. <Company/name-E%2ATrade>
  <Company/name-Dooy,+Cheetum,+and+How> vs. <Company/name-Dooy%2C%20Cheetum%2C%20and%20How> # law firm
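The two escaping policies compared above can be sketched side by side. This is a rough illustration, assuming the iunreserved and sub-delims productions of RFC 3987; the function names iri_safe and ipchar_safe are hypothetical, and iri_safe is only my reading of "escape anything not in iunreserved", not the normative R2RML algorithm.

```python
import string

# ASCII part of iunreserved: ALPHA / DIGIT / "-" / "." / "_" / "~"
ASCII_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")

# ucschar ranges from RFC 3987.
UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
UCSCHAR += [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 14)]
UCSCHAR += [(0xE1000, 0xEFFFD)]


def is_iunreserved(ch: str) -> bool:
    cp = ord(ch)
    return ch in ASCII_UNRESERVED or any(lo <= cp <= hi for lo, hi in UCSCHAR)


def pct(ch: str) -> str:
    """Percent-encode each UTF-8 octet of ch with upper-case hex."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))


def iri_safe(s: str) -> str:
    """Escape every character outside iunreserved."""
    return "".join(ch if is_iunreserved(ch) else pct(ch) for ch in s)


def ipchar_safe(s: str) -> str:
    """Escape only characters that ipchar does not allow literally."""
    keep = SUB_DELIMS | {":", "@"}
    return "".join(
        ch if is_iunreserved(ch) or ch in keep else pct(ch) for ch in s
    )
```

With these definitions, ipchar_safe("eric@w3.org") leaves the string unchanged, while iri_safe("eric@w3.org") yields "eric%40w3.org". Neither function maps spaces to '+', so a space comes out as "%20" under both.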


Either way, it makes sense to use the same encoding in R2RML and DM (modulo some extra delimiters needed in DM). Barring objection, I'll change DM to:
[[
Definition percent-encode:

  * Replace the string with its IRI-safe version per R2RML §7.3 [R2RML].
  * For attribute names, replace each HYPHEN-MINUS character ('-', U+002d) with the string "%2D".
  * For attribute values, replace each FULL STOP character ('.', U+002e) with the string "%2E".
]]
(striking "(a subset of HTML5 form dataset encoding)" as it would no longer be true if ' 's don't turn into '+'s)
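A minimal sketch of how the revised definition would compose, assuming IRI-safe means "percent-encode every character outside iunreserved" and that each DM delimiter is then replaced by the percent-encoded form of its own octet ("%2D" for '-', "%2E" for '.'). The names dm_percent_encode, iri_safe and DELIMITER are hypothetical.

```python
import string

ASCII_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# ucschar ranges from RFC 3987.
UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
UCSCHAR += [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 14)]
UCSCHAR += [(0xE1000, 0xEFFFD)]

# Delimiters that DM needs escaped on top of IRI-safe.
# Table names get no extra rule: '/' and '#' are already escaped by iri_safe.
DELIMITER = {"attribute_name": "-", "attribute_value": "."}


def pct(ch: str) -> str:
    """Percent-encode each UTF-8 octet of ch with upper-case hex."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))


def iri_safe(s: str) -> str:
    """Assumed reading of IRI-safe: escape everything outside iunreserved."""
    def ok(ch: str) -> bool:
        cp = ord(ch)
        return ch in ASCII_UNRESERVED or any(lo <= cp <= hi for lo, hi in UCSCHAR)
    return "".join(ch if ok(ch) else pct(ch) for ch in s)


def dm_percent_encode(s: str, kind: str) -> str:
    """IRI-safe first, then escape the delimiter for this kind, if any."""
    safe = iri_safe(s)
    delim = DELIMITER.get(kind)
    if delim is not None:
        # '-' and '.' survive iri_safe, so this step cannot double-escape.
        safe = safe.replace(delim, pct(delim))
    return safe
```

Because '-' and '.' are both in iunreserved, the first step leaves them alone and the second step only ever touches literal delimiters, never the '%' signs introduced by the first step.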



> > Best,
> > Richard
> 
> -- 
> -ericP

-- 
-ericP
Received on Wednesday, 1 February 2012 00:38:25 UTC