RE: ACTION-7 "Check with w3c groups if there are other approches to represent languages as uris"

Dave and Felix,

 

I think I'm getting even more confused than before as regards language
standardization codes and ontologies!

 

First of all, a clarification: ms:linguisticInformation in the
lexical/conceptual resource is not meant for the language of the contents
of a resource. It describes which types of linguistic information the
resource contains (e.g. lemmas, stems, inflectional information etc.).
For language, the ms vocab includes the following elements:

- metadataLanguageName and metadataLanguageId - for the language of the
metadata of a resource (similar to catalog_language of the dcat
vocabulary)

- languageName and languageId - for the language of the contents of a
resource (e.g. a Greek/English lexicon or a Spanish corpus etc.)

- documentLanguageName and documentLanguageId - for the language of an
external publication/document/... that is somehow linked to this resource
(an article describing it, a manual etc.)

- tagsetLanguageName and tagsetLanguageId - for the language of tagsets
used for the annotation of a corpus

 

Going to the sources of my confusion, in the dcat vocabulary, there are two
entries: 

- the catalog_language
(http://www.w3.org/TR/vocab-dcat/#Property:catalog_language) that Dave
refers to, and which, I agree with Dave, refers only to the language of
the metadata

- the dataset language
(http://www.w3.org/TR/vocab-dcat/#Property:dataset_language), which is to be
used for the language of the dataset. I thought this was meant for the
language of the contents of the language resource (e.g. a lexicon of Greek
words which is described in a certain catalogue in English) and would
correspond to ms:languageName and ms:languageId. However, the usage note
says "This overrides the value of the catalog language in case of
conflict." (where "catalog language" links to
http://www.w3.org/TR/vocab-dcat/#Property:catalog_language), which doesn't
make any sense if they refer to two different things.

 

As regards the various codes, at META-SHARE we wanted to use (but never
implemented) BCP 47 (https://tools.ietf.org/html/bcp47), which currently
comprises RFC 5646, the successor of RFC 4646. That document has a note
about using the "shortest ISO 639 code", and its examples consist mainly of
two-letter codes (ISO 639-1), with three-letter codes (ISO 639-3) only when
the language has no two-letter code. Maybe this explains the dcat Range
note?
On the other hand, the lingvoj ontology includes a list of languages
(http://lingvoj.org/languages/all.html) which, as they say, "is providing
the complete list of ISO 639 languages, and their tags as defined by BCP 47"
(https://tools.ietf.org/html/bcp47). However, all the languages on this
page appear as ISO 639-3 codes and I have not been able to find examples
such as "en-US" (English as spoken in the United States). I have also not
been able to find anything in the other ontologies that brings together in
one tag/URI combinations of language+script+country+..., as in BCP 47.
Maybe I'm missing something?
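
To make the language+script+country combination concrete, here is a
minimal sketch that splits a tag such as "en-US" into its subtags. It
assumes a simplified reading of BCP 47 (language, optional script, optional
region only; variants, extensions, private-use and grandfathered tags are
ignored), and the name split_tag is hypothetical:

```python
import re

# Simplified pattern: language[-script][-region].
# ASSUMPTION: this covers only a subset of the BCP 47 grammar.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def split_tag(tag):
    """Split a simple language tag into language, script and region."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError("not a simple language[-script][-region] tag: %r" % tag)
    parts = match.groupdict()
    # Apply the conventional casing: 'en', 'Latn', 'US'.
    return {
        "language": parts["language"].lower(),
        "script": parts["script"].title() if parts["script"] else None,
        "region": parts["region"].upper() if parts["region"] else None,
    }


if __name__ == "__main__":
    for example in ("en-US", "zh-Hant-TW", "ell"):
        print(example, split_tag(example))
```

A list that contains only bare ISO 639-3 codes corresponds to the last case
here, where script and region are both empty.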
 
Best,
Penny
 

 

From: Felix Sasaki [mailto:fsasaki@w3.org] 
Sent: Thursday, July 17, 2014 1:29 PM
To: Dave Lewis
Cc: public-ld4lt@w3.org
Subject: Re: ACTION-7 "Check with w3c groups if there are other approches to
represent languages as uris"

 

Hi Dave,

 

On 17.07.2014 at 11:37, Dave Lewis <dave.lewis@cs.tcd.ie
<mailto:dave.lewis@cs.tcd.ie> > wrote:





Hi Felix,
Thanks for this, I'll include it in the agenda for today.

One point:

http://www.w3.org/TR/vocab-dcat/#Property:catalog_language

defines the language used in the meta-data, and for that purpose is probably
sufficient.

However, the others seem more relevant to specifying the language of the
LanguageResource that is the subject of the meta-data.

For this I'd tend to agree that we need some way of allowing different
schemes to be used by applications that need them, e.g. lexical resources
or resources focussed on language preservation.

But where more specialised language code requirements are not in place, we
should still specify the best practice, e.g. dct:LinguisticSystem as
specified in dcat for catalog_language, in order to promote
interoperability in codes as far as possible.

 

 

That is what I am not sure about. The dcat specification itself is
ambiguous. If you click on the link of "dct:language", it brings you to

http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=terms#language

and that defines languages as an RFC 4646 value, which includes ISO 639-3
and much more. But if you follow the links 1 and 2 of

dct:LinguisticSystem <http://purl.org/dc/terms/LinguisticSystem>  
Resources defined by the Library of Congress (1
<http://id.loc.gov/vocabulary/iso639-1.html> , 2
<http://id.loc.gov/vocabulary/iso639-2.html> ) SHOULD be used.

you are led to the ISO 639-1 and ISO 639-2 codes. So it is a bit difficult
to understand what it actually means to use dct:LinguisticSystem as
specified in dcat.

 

Cheers,

 

Felix






The current ms vocab already supports this specialisation, for example by
having ms:linguisticInformation for the ms:LexicalConceptualResource
subclass, which seems reasonable.

cheers,
Dave

On 04/07/2014 13:06, Felix Sasaki wrote:



I did this and was told that this proposal was rejected both for RDF 1.0
and RDF 1.1; for the latter, see this thread
http://lists.w3.org/Archives/Public/public-rdf-wg/2012Oct/0001.html
which at least Jose Labra and probably Jorge are already aware of, see
http://www.weso.es/MLODPatterns/Linguistic_metadata.html


So now we have at least four different approaches (websites) for the same
purpose:

http://www.w3.org/TR/vocab-dcat/#Property:catalog_language
http://lingvoj.org/
http://www.lexvo.org/
http://glottolog.org/

I am wondering what best practice to derive from this - one suggestion was
to use owl:sameAs between these in appropriate situations. Thoughts?
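
A minimal sketch of what the owl:sameAs suggestion could look like,
emitting N-Triples lines for one language. The URI templates below are
assumptions about how each site mints its language URIs and would need to
be checked against the sites themselves; same_as_triples is a hypothetical
helper name:

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# ASSUMPTION: these URI patterns are illustrative and unverified.
URI_TEMPLATES = {
    "loc": "http://id.loc.gov/vocabulary/iso639-1/{iso639_1}",
    "lexvo": "http://lexvo.org/id/iso639-3/{iso639_3}",
    "lingvoj": "http://www.lingvoj.org/lang/{iso639_1}",
}


def same_as_triples(iso639_1, iso639_3):
    """Link the first URI to every other URI for the same language."""
    uris = [
        template.format(iso639_1=iso639_1, iso639_3=iso639_3)
        for template in URI_TEMPLATES.values()
    ]
    hub = uris[0]
    return [
        "<%s> <%s> <%s> ." % (hub, OWL_SAME_AS, other)
        for other in uris[1:]
    ]


if __name__ == "__main__":
    # Greek: ISO 639-1 'el', ISO 639-3 'ell'.
    for line in same_as_triples("el", "ell"):
        print(line)
```

Note that owl:sameAs asserts full identity, so it only fits where the
linked resources really denote the same language.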

- Felix

 

 

Received on Thursday, 17 July 2014 11:35:42 UTC