Re: ISSUE-137: Proposal to add sh:langShape from Andy Seaborne on 2016-09-09 (public-data-shapes-wg@w3.org from September 2016)

From: Andy Seaborne <andy.seaborne@topquadrant.com>
Date: Fri, 9 Sep 2016 10:35:05 +0100
To: public-data-shapes-wg@w3.org
Message-ID: <ceaaf311-131a-a09f-16bf-a6db34ed6ff5@topquadrant.com>

>   * Constrain the valid language tags to a provided set, e.g. (@en, @de,
>     @fr)
>
> See my email, sh:langShape [ sh:in ( "en" "de" "fr" ) ]

Do these match? "EN", "en-GB", "en-US", "@de-Latn-DE-1996"

It seems easier to adopt RFC4647 matching (in which case they all 
match). To match "en" exactly, sh:not can be used for "not match en-*" 
or "not match *-*".

In RDF 1.1, language tags compare case insensitively. In the RDF world, 
force-to-lower-case is common and endorsed by the RDF 1.1 specs.

https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal

RFC4647 defines language matching; section 3.3.1 is basic filtering.

SPARQL has LANGMATCHES, which applies RFC4647 basic filtering.

A predicate for "language match" that uses RFC 4647 would be natural.

    sh:langShape [ sh:language ( "en" "de" "fr" ) ]

Matches "en", "en-gb", "EN-GB", "en-uk", "de", "de-de"

    sh:langShape [ sh:language ( "en-*") ]

Matches "en-gb", "en-us" but not "en"

http://www.ietf.org/rfc/rfc4647.txt
https://www.w3.org/TR/sparql11-query/#func-langMatches

The implementation burden of RFC4647 is not high. The algorithm is in 
the RFC if not using SPARQL.
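
To give a feel for the size of it, here is a minimal sketch of basic filtering (RFC4647 section 3.3.1) in Python. The function names are made up for illustration, and treating "*" as "any non-empty tag" is an assumption that mirrors SPARQL's langMatches rather than anything decided for SHACL.

```python
def basic_filter(language_range: str, tag: str) -> bool:
    """Sketch of RFC 4647 basic filtering; comparison is case-insensitive."""
    rng = language_range.lower()
    t = tag.lower()
    if rng == "*":
        # Assumption: "*" matches any literal that has a language tag at all.
        return t != ""
    # A range matches a tag that is equal to it, or that has the range as a
    # prefix immediately followed by "-".
    return t == rng or t.startswith(rng + "-")


def matches_any(ranges, tag: str) -> bool:
    """True if the tag matches at least one range in the list."""
    return any(basic_filter(r, tag) for r in ranges)
```

With the list ( "en" "de" "fr" ), matches_any accepts "EN-GB" and "en-uk" and rejects "es"; "eng" is rejected too, because the prefix must end at a "-".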

(I have no strong opinion on the predicate name)

 >   * Require that all literals have/do not have a language tag
 >
 > Already exists: sh:datatype rdf:langString

True, though it is more natural for users to use a language-match of "*",
which is defined in RFC4647.

    sh:langShape [ sh:language ("*") ]

>   * Check that the language tag is 2-letter | 3-letter | does/does not
>     have hyphens
>
> sh:langShape [ sh:minLength 2 ; sh:maxLength 2 ; or: sh:pattern "...
> regex ..." ]

Is "2-letter, 3-letter" about the primary subtag (the part up to the
first "-")?
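
If it is about the primary subtag, the difference from a length check on the whole tag can be sketched like this (Python, hypothetical helper names, only an illustration of the reading I am assuming):

```python
def primary_subtag(tag: str) -> str:
    """Return the part of the tag up to the first "-"."""
    return tag.partition("-")[0]


def primary_length_ok(tag: str, lengths=(2, 3)) -> bool:
    """Check the length of the primary subtag, not of the whole tag."""
    return len(primary_subtag(tag)) in lengths
```

A sh:minLength 2 / sh:maxLength 2 test on the whole tag would reject "en-GB", whereas the primary-subtag reading accepts it.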

 >   * Check that the 2 or 3-letter tag is valid

(I can't find the original use case for this on the issue log)

This is outside the RFC4647 algorithm and needs a regex.

 > Assuming that the list of valid tags is stored somewhere, e.g. in an
 > rdf:List iso:ValidLanguages:

"in" a list will need to be case insensitive.

In the real world, data can be a bit messy. To pick an example close to 
me, "en-uk" does not officially exist but it is not that uncommon and 
seems to be tolerated.  It would be good to both be able to cause a 
violation for it and also be able to be lax about it.
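
A rough sketch of what "strict or lax" could mean in practice, in Python. Both sets are hypothetical placeholders (a real list would come from somewhere like the IANA registry), and the "Warning" / "Violation" strings are just labels for the two outcomes, not a proposal for SHACL vocabulary.

```python
# Hypothetical allow-list of valid tags, stored lower-cased.
KNOWN_TAGS = {"en", "en-gb", "en-us", "de", "de-de", "fr"}
# Hypothetical list of unofficial tags that are seen in real data.
TOLERATED_TAGS = {"en-uk"}


def check_tag(tag: str, strict: bool):
    """Return None if the tag is fine, otherwise a severity label."""
    t = tag.lower()  # membership test is case-insensitive
    if t in KNOWN_TAGS:
        return None
    if t in TOLERATED_TAGS and not strict:
        return "Warning"
    return "Violation"
```

Here check_tag("en-UK", strict=True) reports a violation, while the lax mode only warns about it.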

 Andy

RFC3066:
https://www.ietf.org/rfc/rfc3066.txt
section 2.1
[[
The syntax of this tag in ABNF [RFC 2234] is:
     Language-Tag = Primary-subtag *( "-" Subtag )
     Primary-subtag = 1*8ALPHA
     Subtag = 1*8(ALPHA / DIGIT)
]]
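
That ABNF translates fairly directly into a regex. A minimal sketch in Python (the names are made up here); note it only checks that a tag is well-formed, not that its subtags are registered:

```python
import re

# Primary-subtag = 1*8ALPHA, then zero or more "-" Subtag,
# where Subtag = 1*8(ALPHA / DIGIT).
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_wellformed_rfc3066(tag: str) -> bool:
    """True if the whole string fits the RFC 3066 Language-Tag syntax."""
    return RFC3066_TAG.fullmatch(tag) is not None
```

So "en-uk" passes this test even though it is not an official tag, which is why a syntax check alone cannot flag it.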

Received on Friday, 9 September 2016 09:35:36 UTC