More Languages for More Vocabularies

Last month I encouraged the provision of multi-lingual labels for vocabularies hosted at W3C. Tokyo librarian Shuji Kamitsuna has been doing terrific work recently and has translated the specification documents for DCAT (English, Japanese) and ORG (English, Japanese), and is now well into completing his work on the Data Cube Vocabulary. After Shuji had completed his work on the specifications, I wanted to update the schemas to include the Japanese labels too, but doing this threw up some issues.

First up was DCAT. The vocabulary is formally specified in the Recommendation and for each term there is a table showing the definition and a usage note. Immediately before each table, the term itself is given as a section title and it’s these section titles that are the English language labels in the schema. See the entry for dcat:Catalog for example. When Shuji translated the spec, the labels were therefore translated too. Transferring these to the schema was trivial. But that was the easy part.

The definitions in the spec are copied into the schema as the rdfs:comment for each term – except they’re not 100% aligned. Take the definition of the property dcat:dataset. The spec says “A dataset that is part of the catalog” whereas the schema gives just a little more help when it says “Links a catalog to a dataset that is part of the catalog.” The Arabic, Spanish, Greek and French labels, definitions and usage notes in the DCAT schema were all translated from the schema, the Japanese from the spec.
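A misalignment like this is easy to detect mechanically. The following is only a minimal sketch, assuming the spec definitions and the schema comments have already been extracted into two dictionaries keyed by term. The dictionary names and both helper functions are hypothetical, and the only data is the dcat:dataset example quoted above.

```python
# Hypothetical data: definitions keyed by term, copied from the two sources.
spec_definitions = {
    "dcat:dataset": "A dataset that is part of the catalog",
}
schema_comments = {
    "dcat:dataset": "Links a catalog to a dataset that is part of the catalog.",
}


def normalise(text):
    """Lower-case, collapse whitespace and drop a trailing full stop."""
    return " ".join(text.lower().split()).rstrip(".")


def misaligned_terms(spec, schema):
    """Return the terms whose spec definition and schema comment differ."""
    return sorted(
        term
        for term in spec.keys() & schema.keys()
        if normalise(spec[term]) != normalise(schema[term])
    )


print(misaligned_terms(spec_definitions, schema_comments))
```

The normalisation step is deliberately light, so that a stray full stop or a change of case is not reported as a difference while any real rewording still is.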

This raises the question: assuming that there is no difference in semantics, just a difference in the clarity with which the semantics are expressed, how much does it matter that the definitions in the schema and the spec are not 100% aligned?

When Shuji sent us the translation of ORG, a different issue arose. Like DCAT, the specification for ORG has a small table for each term that gives its definition and usage note. Before each table there is a heading but here’s the difference: in the ORG specification, those headings are written as the vocabulary term, such as subOrganizationOf. If ORG followed exactly the same style as DCAT, this would have been written ‘sub organization of’, which is the English language label for the term – i.e. as proper words, not terms written in camel case. Actually it’s even more confusing, as the actual label in the schema for ORG says “subOrganization of” – a sort of halfway house. Again, does this matter?
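To make the difference concrete, here is a small sketch of how a DCAT-style label could be derived from a camel case term name and compared with what the schema actually says. The function name camel_to_label is hypothetical, and lower-casing every word is an assumption made for this illustration rather than a rule taken from either specification.

```python
import re


def camel_to_label(local_name):
    """Split a camel case term name into lower-case words."""
    # Insert a space wherever a lower-case letter or digit meets a capital.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", local_name)
    return spaced.lower()


derived = camel_to_label("subOrganizationOf")
schema_label = "subOrganization of"  # the label quoted above from the ORG schema

print(derived)
print(derived == schema_label)
```

Running a check like this over every term would show which labels are proper words and which are still partly camel case.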

Finally, Shuji’s work threw up an issue around the use of upper and lower case letters in vocabularies. The well-established convention is that RDF class names begin with upper case letters and properties with lower case letters, and both use camel case. Further, where an object property is used for an n-ary relationship between classes, the property is often named in exactly the same way as the class that is the range. For example, in ORG we have org:role that has range org:Role.

You see the problem for Japanese? It is one of many languages that do not have the concept of upper and lower case letters.

I raised this issue in the Web Schemas Task Force and was relieved that there was consensus that, for the purpose of translation, it was safe to advise Shuji that the label for the property org:role could legitimately be ‘has role.’
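Pairs of terms that differ only by case can be listed before a translation starts, so that translators know where a distinct label will be needed. What follows is a rough sketch only: the function name is hypothetical and the list of terms is a small illustrative subset of ORG, not the full vocabulary.

```python
from collections import defaultdict


def case_only_collisions(local_names):
    """Group term names that become identical once case is ignored."""
    groups = defaultdict(set)
    for name in local_names:
        groups[name.casefold()].add(name)
    # Keep only the groups that contain more than one distinct spelling.
    return [sorted(names) for _, names in sorted(groups.items()) if len(names) > 1]


# Illustrative subset of ORG local names (not the complete vocabulary).
terms = ["Role", "role", "Organization", "subOrganizationOf"]

print(case_only_collisions(terms))
```

Each group returned is a candidate for the kind of advice given above, such as labelling the property ‘has role’ and leaving the class as ‘role’.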

In this and other work I’ve done over the years, it’s clear to me that if you really want to check that what you’ve written is consistent and unambiguous, you should see how it comes out of a translation process. On this occasion I think we’ve got some pointers for future work to tighten these things up.
