11.4.12 langMatches

This is a comment on:
http://www.w3.org/TR/2007/WD-rdf-sparql-query-20070326/#func-langMatches

specifically the text:
[[
matches language-tag (first argument) per Matching of Language Tags
[RFC4647] section 2.1.
]]

Contents of comment:
- issue statement
- suggested editorial textual change
- further analysis and options (the bulk of the message, most of which
can be ignored)

Issue
=====
Section 2.1 of RFC 4647 defines basic language ranges, but it neither
gives them any semantics nor defines an algorithm for "matches". Hence
the word "matches" in the quoted text is unbound and has no clear meaning.

Sections 3.3.1, 3.3.2 and 3.4 each describe a different matching
algorithm that can be used with basic language ranges.

Suggested text:
===============
Replace
[[
Returns true if language-range (second argument) matches language-tag
(first argument) per Matching of Language Tags [RFC4647] section 2.1.
]]
with
[[
Returns true if language-range (second argument) matches language-tag
(first argument).
language-range is a basic language range
  per Matching of Language Tags [RFC4647] section 2.1.
'matches' is defined as basic filtering in [RFC4647] section 3.3.1.
]]
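
For illustration only, here is a minimal Python sketch of how I read
basic filtering in section 3.3.1: a case-insensitive comparison in which
the range either equals the tag or is a prefix of it followed by "-",
with "*" matching any tag. The function name is hypothetical, the
argument order follows langMatches (tag first, range second), and the
treatment of an empty tag as "never matches" is my assumption rather
than something RFC 4647 states.

```python
def basic_filter_matches(language_tag: str, language_range: str) -> bool:
    """Sketch of RFC 4647 basic filtering (section 3.3.1)."""
    tag = language_tag.lower()
    rng = language_range.lower()
    # Assumption: an empty string is not a language tag, so nothing matches it.
    if not tag:
        return False
    # The wildcard range matches every tag.
    if rng == "*":
        return True
    # Exact match, or the range is a prefix that ends on a subtag boundary.
    return tag == rng or tag.startswith(rng + "-")


if __name__ == "__main__":
    for tag in ("de-DE", "de-DE-1996", "de-Latf-DE", "deu"):
        print(tag, basic_filter_matches(tag, "de-DE"))
```

Note that the prefix test must stop at a "-" boundary, otherwise the
range "de" would wrongly match the tag "deu".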

Analysis
========
The algorithm of section 3.4 is not suitable, since it is scoped as
[[select[ing] the single language tag that best matches the [...]
request]]. That is, it always gives exactly one result when matching
against any non-empty set of languages. It does not define a boolean
function
  lang-tag x lang-range => boolean
but a selection function
  non-empty-list-of-lang-tags x lang-range => lang-tag
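
To make the difference in shape concrete, here is a rough Python sketch
of a selection function in the style of section 3.4, simplified to a
single language range (no priority list, no handling of "*"). The names
`lookup` and `default` are hypothetical; `default` stands in for
whatever fallback the application chooses when truncation finds nothing.
The point is only that the result is a tag, not a boolean.

```python
from typing import Iterable


def lookup(tags: Iterable[str], language_range: str, default: str) -> str:
    """Simplified sketch of RFC 4647 lookup (section 3.4), one range only."""
    # Index the available tags case-insensitively, keeping original spelling.
    available = {tag.lower(): tag for tag in tags}
    subtags = language_range.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in available:
            return available[candidate]
        # Truncate the range from the end ...
        subtags.pop()
        # ... and drop any single-character subtag left dangling at the end.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    # Nothing matched: fall back to the application-chosen default.
    return default


if __name__ == "__main__":
    print(lookup(["de", "fr"], "de-DE-1996", default="en"))
```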

The algorithm of section 3.3.2 is designed for extended language ranges,
which are more appropriate for the new features of RFC 4646 (such as
script subtags).

The reference to section 2.1 indicates that SPARQL is more interested
in basic language ranges, which were already specified in RFC 3066 and
are suited to matching lang tags that conform with RFC 3066 (and hence
also with RFC 4646). The algorithm of section 3.3.1 is hence (IMO)
currently the closest 'reading' of the SPARQL WD.

Technically, the choice is:
a) use basic language ranges (section 2.1) and basic filtering (3.3.1)
or
b) use extended language ranges (section 2.2) and extended filtering (3.3.2)

FYI, the extended language ranges are like basic language ranges except
that they permit a "*" in any subtag position, e.g.
de-*-DE
de-DE-*
(but not de-DE*)
When used with extended filtering, any -*- is effectively ignored and
treated as -, but note that an initial *- is significant.

Then (simplifying by ignoring private use and other extensions) a lang
range matches a lang tag if both
a) the first subtags match
b) (ignoring the *'s) the "-"-separated sequence of the language range
is a subsequence (allowing arbitrary deletions) of the "-"-separated
sequence of the language tag.
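
The two conditions above can be sketched in Python roughly as follows.
This is my reading of the extended filtering steps in section 3.3.2, not
a definitive implementation; the function name is hypothetical and the
argument order (tag first, range second) follows langMatches. Unlike the
simplified description above, the sketch does stop at single-character
subtags such as "x", which is how I understand the RFC to handle
private use and extensions.

```python
def extended_filter_matches(language_tag: str, language_range: str) -> bool:
    """Sketch of RFC 4647 extended filtering (section 3.3.2)."""
    tag_subtags = language_tag.lower().split("-")
    range_subtags = language_range.lower().split("-")
    # (a) The first subtags must match, unless the range starts with "*".
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False
    r, t = 1, 1
    # (b) The remaining range subtags must appear, in order, in the tag.
    while r < len(range_subtags):
        if range_subtags[r] == "*":
            r += 1                      # a non-initial "*" is ignored
        elif t >= len(tag_subtags):
            return False                # tag exhausted before the range
        elif range_subtags[r] == tag_subtags[t]:
            r += 1
            t += 1
        elif len(tag_subtags[t]) == 1:
            return False                # do not skip past a singleton like "x"
        else:
            t += 1                      # skip an extra tag subtag, e.g. a script
    return True


if __name__ == "__main__":
    for tag in ("de-DE", "de-Latf-DE", "de", "de-x-DE"):
        print(tag, extended_filter_matches(tag, "de-DE"))
```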

The reason this is more appropriate for new RFC 4646 style tags is that
RFC 4646 allows additional information, such as script subtags, to be
inserted in the appropriate place in a tag.

So, the example given in RFC 4647 is that

de-DE basic matches de-DE (i.e. German as spoken in Germany)
de-DE basic matches de-DE-1996 (i.e. German as spoken in Germany,
written with the orthography of 1996)
de-DE does not basic match de-Latf-DE (i.e. German, as spoken in
Germany, written in the Fraktur variant of the Latin script)

whereas
both the basic matches are extended matches (indeed, any basic match is
an extended match), but also
de-DE extended matches de-Latf-DE
which is probably more consistent behaviour from the end user's point of
view when using such new features of RFC 4646 style tags.

It is plausible that some semantic web applications may well have a need
for using extended language ranges like "*-Latn", for example, to
populate some part of a web page, when no content exactly matching the
current language preferences has been found. Many users prefer text in
a script they can read, even if they don't understand the language, over
a word they might understand written in a script they cannot read. This
use case, however, depends on widespread use of RFC 4646 script subtags,
which, while possibly desirable, is not yet the case. Moreover, code
that worked to end user satisfaction would also depend on appropriate
deployment of section 4.1 of RFC 4646 (choice of language tag), either
in the code or in the processes of constructing the semantic web data or
both, so that script codes were used consistently.

Thus, I have suggested the more conservative change, but would be
equally satisfied if the SPARQL WG wanted to embrace extended language
ranges!

Jeremy

-- 
Hewlett-Packard Limited
registered Office: Cain Road, Bracknell, Berks RG12 1HN
Registered No: 690597 England

Received on Thursday, 5 April 2007 09:35:21 UTC