Re: review of CSV/TSV (ACTION-594)

Hi Greg,

Thank you for the review.

On 05/03/12 16:42, Gregory Williams wrote:
> Below is my review of the CSV/TSV document. I think there are a few
> issues that need clearing up before publication. The only big issue
> I have is that the document specifically talks about the default
> encoding for these formats being US-ASCII, but then doesn't discuss
> the possible need to escape unicode characters in the serialization.
> This is especially important for the CSV format where we are relying
> directly (and only) on the CSV escaping mechanism which really only
> covers the escaping of quotes and newlines. The rest of the points
> are minor/editorial.

The CSV format is supposed to be a direct and simple use of CSV, with no
assumption of a processing layer at the receiving end.  CSV does not
provide escaping or encoding for arbitrary characters, but the HTTP
charset can be set.

I don't see value in adding a general escaping mechanism, which would no
longer be plain CSV.  If it matters, use the charset or use TSV.
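To make the point concrete, here is an illustrative sketch (not from the
draft) using Python's csv module: the CSV quoting mechanism covers
embedded quotes and newlines and nothing more; every other character
passes through verbatim in whatever charset the stream uses.

```python
import csv
import io

# Sketch: CSV quoting handles quotes and newlines only; other
# characters are emitted as-is in the stream's charset.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
writer.writerow(['String-with-dquote"', "two\nlines", "plain"])
out = buf.getvalue()
# The embedded quote is doubled and the field wrapped in quotes;
# the newline field is quoted, but the newline itself stays verbatim.
```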

If the sender sends characters outside the charset setting, then there is
nothing that can be done. It's no different from sending raw characters
from the wrong character set in HTML or XML (binary escapes don't help
either - what charset is the binary in?).

The current text seems sufficient to me. Do you have suggestions for
making it better?

Sec 2: Transmission issues using CSV and TSV Formats
[[
The charset parameter SHOULD be used in conjunction with SPARQL Results;
UTF-8 is recommended: text/csv; charset=utf-8 and 
text/tab-separated-values; charset=utf-8.
]]

I don't know of any solution to the general NxM problem of charset X and 
charset Y (e.g. ISO-8859-1 and GB2312).
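As a hypothetical sketch of what the quoted recommendation looks like in
practice (the variable names here are illustrative, not from the spec),
a sender would encode the body in UTF-8 and declare that in the media type:

```python
# Sketch: emitting SPARQL CSV results with the charset the draft
# recommends; non-ASCII data survives because it is declared.
rows = [["x", "y"], ["caf\u00e9", "1"]]
body = "\r\n".join(",".join(r) for r in rows).encode("utf-8")
content_type = "text/csv; charset=utf-8"
```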


>
> Abstract still has an @@. I think we agreed in the last telecon to
> drop it.

Removed - just hadn't got round to enacting the TC consensus.

>
> The set of SPARQL 1.1 docs doesn't include the CSV/TSV document.

Good catch.

It's probably wrong in other documents as well, as it was a late addition
as a REC-track doc.  I've updated query.

> Some of the example data used in section 1.1 is confusing. Since
> it's not clear what formatting is being used in the example table,
> it's not immediately clear what literal value this represents:
> "String-with-dquote"". By context I assume it's a literal that
> starts with the character 'S' and ends with a sole double quote. If
> this table is meant to be using a turtle-like encoding (and not a
> CSV-like encoding), then perhaps that double quote should be
> backslash-escaped? Or perhaps there should be some text that
> explains the possibly ambiguous values in the example table.

I wanted to put in an explanation column but then it looks like part of 
the results!

I've added comments.  Please make suggestions for improvements.

> Regarding "Applications reading these formats are advised to cope
> with both CRLF and LF as end of line markers," should this be using
> "SHOULD" normative language?

I don't think so - it's advice to the consuming code and not a 
compliance issue.

> === Section 3 ===
>
> "the results table is serialized as ... one line for each query
> solution." I'm don't think this is true. The CSV spec document does
> say "Each record is located on a separate line," but also indicates
> that a CRLF can appear in a double quoted field value:
>
> "aaa","b CRLF
>  bb","ccc" CRLF
>  zzz,yyy,xxx
>
> Section 3.2 actually notes this case ("Within quote strings, all
> characters except ", including new line characters have their exact
> meaning - newlines do not end a CSV record.")

I've added "(a line may end up split by newlines in the data)." but it 
is just following the style of the CSV RFC
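For illustration, a sketch of the CSV-RFC example quoted above, parsed
with Python's csv module: a quoted field containing CRLF yields one
record spread over two physical lines.

```python
import csv
import io

# Sketch: newlines inside a quoted field do not end a CSV record.
data = '"aaa","b\r\nbb","ccc"\r\nzzz,yyy,xxx\r\n'
rows = list(csv.reader(io.StringIO(data, newline="")))
# Two records, even though the raw text spans three lines.
```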

> "Values in the results are strings, for URIs and literals, together
> with numbers when the literals are of numeric XSD datatype." No
> mention is made of blank nodes.

Changed to "for URIs, literals and blank nodes,"

> === Section 3.1 ===
>
> "Each row has the same number of fields..." Is this meant to say
> that each row 'MUST' have the same number of fields?

Changed (and in TSV)

> === Section 3.2 ===
>
> "The entry in each field is the string corresponding to the RDF term
> value. (c.f. SPARQL STR()) without syntax to denote what kind of
> term it is. The encoding quoting rules of CSV format must be used."
> As it's earlier mentioned that the encoding of the CSV file may be
> US-ASCII, we probably need to mention that simply taking the STR()
> value and applying CSV escaping may not always be enough to produce
> a valid CSV file.

I think this is covered by the transport section.  The mention of STR() 
is indicative.  Do you want to suggest additional text?

(If, at this point, a string has other characters in it, there is
nothing you can do!)


> "((COMMA, code point 44, 0x2C)" has an extra open paren.

Fixed.

> === Section 4.1 ===
>
> "Variables are serialized in SPARQL syntax, using question mark ?
> character followed by the variable name." Is there a reason we chose
> to use the '?' in TSV, but not in CSV?

Yes - TSV encodes using Turtle-style terms; in SPARQL syntax, variables
start with "?" (or "$" - for ease of use we fix on "?" here).

In CSV, the first row is column headings to be displayed directly, so
there is no marker character.
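A small sketch of the contrast (the helper is hypothetical, not from the
spec): TSV carries SPARQL-syntax variables and Turtle-syntax terms joined
by literal TABs, while the CSV header row is just the bare variable names.

```python
# Sketch: TSV has no quoting layer, so TAB and newline must not
# appear inside a term; terms themselves use Turtle syntax.
def tsv_row(terms):
    return "\t".join(terms) + "\n"

tsv_header = tsv_row(["?x", "?literal"])
csv_header = ",".join(["x", "literal"]) + "\r\n"
row = tsv_row(["<http://example/x>", '"String-with-dquote\\""'])
```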

> "Each row has the same number of fields..." Again, I think this
> should probably be using "MUST".

Done.

> === Section 4.2 ===
>
> "The SPARQL Results TSV Results Format serializes RDF terms in the
> results table by using the syntax that SPARQL [RDF-SPARQL-QUERY]
> [SPARQL11-QUERY] and Turtle [TURTLE] use." Do we need references to
> both 1.0 and 1.1 versions of SPARQL Query?

Removed - it's an artifact of the biblio DB not having all the SPARQL
specs in it.

> """ literals are enclosed with single quotes "..." or ' ...' """ The
> use of 'single quotes' here immediately followed by double quotes is
> confusing. I assume 'single' is meant to mean either of the quoting
> forms used, but not the triple-quote form available in turtle?

Changed to:
[[
  are enclosed with double quotes <tt>"</tt>...<tt>"</tt>
           or single quotes <tt>'</tt> ...<tt>'</tt>
]]

> As with CSV, I'm concerned about the use of unicode in terms when
> section 2 specifically talks about these formats defaulting to
> US-ASCII. The TSV encoding at least supports unicode escaping by
> default as it deals with turtle/sparql syntax for terms.

Yes - you can use Turtle escapes - see earlier for the general NxM problem.
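A sketch of that option (the helper name is hypothetical): non-ASCII code
points can be written as Turtle \uXXXX escapes so the TSV body stays
within a US-ASCII charset. Code points above U+FFFF would need the
eight-digit \UXXXXXXXX form, omitted here for brevity.

```python
# Sketch: rewrite non-ASCII characters as Turtle \uXXXX escapes.
def turtle_u_escape(s):
    return "".join(c if ord(c) < 0x80 else "\\u%04X" % ord(c) for c in s)

escaped = turtle_u_escape('"caf\u00e9"')
```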

	Andy

Received on Tuesday, 6 March 2012 12:13:57 UTC