Re: dct:language range WAS: ISSUE-2 (olyerickson): dct:language should be added to DCAT [Best Practices for Publishing Linked Data]

Hi again.

On 12 December 2011 14:22, Richard Cyganiak <richard@cyganiak.de> wrote:
>
> On 9 Dec 2011, at 22:28, Stasinos Konstantopoulos wrote:
>> It's hard to imagine anybody having data that won't fit ISO 639.
>> Besides listing pretty much every documented language there is
>> (including extinct and made-up languages like Klingon) it also lists
>> useful clusters ("macrolanguages"), such as "Arabic" (ara), that allow
>> one to underspecify when a more detailed description is not available
>> ("ara" subsumes 30 varieties of Arabic, all with their own
>> three-letter code). It also includes three letter codes for
>> "undetermined" (und), "multiple and cannot list all" (mul), and "no
>> linguistic content, not applicable" (zxx).
>
> The question is not if the data fits ISO 639. The question is whether the data is already tagged with ISO 639. If it isn't, then someone has to do the tagging – that is, map “English” to “en”, “Irish” to “ga”, “Both English and Irish” to “mul” and so forth. That's not a difficult task, but it has a significant and nonzero cost, and we have to be aware that requiring ISO 639 makes adopting dcat significantly more expensive for data publishers who do not yet have ISO 639 compatible annotations.
>
> In situations like this, such data publishers are likely to either a) not provide the language information at all, b) provide it in whatever form they already have in violation of the standard, or c) even not adopt the standard altogether because it is seen as too complex an undertaking. These concerns apply whenever the use of a controlled vocabulary is demanded in a standard exchange format.
>
> Mapping existing data into controlled vocabularies always comes with a cost. And I would think that often the data consumers are in a better position to do that mapping than the data publishers, in terms of skills, quality and economic incentives.
>
> That being said, every effort should be made to *recommend* standard controlled vocabularies, and highlight their use as best practice.

I fully agree that the step to the First Star is the most important
one to make, so one should never discourage data publishers. At the
same time, providing more structured data should be possible and
encouraged. Lumping everything into the same data property seems to
me to discourage structure where it would have been attainable.

>> If you are thinking of entries such as "15th c. English" and such, I
>> agree that that cannot be easily captured in its most general and
>> unrestricted form. But it would still be interesting, LOD-wise, to
>> have the "English" bit as structured data, possibly qualified in
>> free-text as "13th c. English". So we still need to decide on a
>> controlled vocabulary that includes a representation for "English"
>> even if it does not include one for "15th c. English".
>
> I see this as a job for consumers of the data, not for publishers of the data. Someone who has "13th c. English" in their metadata very likely cares about these fine distinctions, and is likely to be offended by the suggestion that they should dumb down their data to fit into some impoverished ISO scheme…

I agree, and I never even implied that "15th c. English" is not useful
to whoever made the effort to annotate at that level of detail. But I
do not think anybody would be offended by the proposition that "15th
c. English" is related to the controlled-vocabulary entry for
"English". For some applications the latter is enough, and they gain
the benefit of the controlled vocabulary in exchange for giving up the
finer-grained description; those applications that do require the
finer-grained description will have to know how to handle the free
text.

In other words, I find it a good thing if a specification allows a
publisher, if they so choose, to provide both a link to the
closest-fitting entry in the controlled vocabulary and a fuller
free-text description (not to be construed as equivalent to the
former), or either of the two on its own.

That would suggest one of two solutions:
1. defining two properties, one ranging over language URIs and one
ranging over text literals, or
2. defining a single property ranging over resources (not literals);
such resources can be either (a) language URIs or (b) unnamed
resources with an rdfs:label (or similar) property ranging over
arbitrary text and (optionally) some subsumption or relatedness
property ranging over language URIs.

I find (1) easier to explain but (2) more conceptually accurate.
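For illustration, option (2) might look like the following Turtle sketch. The property names here (`ex:language`, `ex:closestLanguage`) are hypothetical placeholders, not actual DCAT or Dublin Core terms, and the Lexvo URIs are just one possible choice of language URIs:

```turtle
@prefix ex:    <http://example.org/vocab#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix lexvo: <http://lexvo.org/id/iso639-3/> .

# Case (a): the value of the language property is a language URI directly.
ex:dataset1 ex:language lexvo:eng .

# Case (b): the value is an unnamed resource carrying a free-text label,
# optionally linked to the closest-fitting controlled-vocabulary entry.
ex:dataset2 ex:language [
    rdfs:label "15th c. English"@en ;
    ex:closestLanguage lexvo:eng
] .
```

Under option (2), consumers that only care about the coarse language can dereference the URI in either case, while finer-grained applications can additionally inspect the free-text label when one is present.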

> As always, it's best to survey some actual catalogs and see how they represent language, otherwise we can go in circles in this kind of discussion forever.

That is useful, but it is also worthwhile, IMHO, to pave the way for
more structure even if the current situation is relatively
unstructured.

Best,
Stasinos

Received on Thursday, 15 December 2011 02:04:16 UTC