Best Practices for Vocab Selection

From Government Linked Data (GLD) Working Group Wiki
Revision as of 18:54, 14 March 2012 by Mpendlet

Best Practices: Vocabulary Selection

The group will provide advice on how governments should select RDF vocabulary terms (URIs), including advice as to when they should mint their own. This advice will take into account issues of stability, security, and long-term maintenance commitment, as well as other factors that may arise during the group's work.

@@TODO: distinguish between vocab discovery and vocab creation and management.


Status

  • Feb 22, 2012 - Refining the section - Boris
  • Feb 9, 2012 - Reformulation and update by Ghislain
  • Dec 2011 - Major revisions by Ghislain, Boris

Overview

Modeling is an important phase in any Government Linked Data life cycle. In this phase, governments need to build a vocabulary that models the data sources they want to publish as Linked Data. The most important recommendation in this context is to reuse available vocabularies as much as possible. Reuse speeds up vocabulary development, so governments save time, effort and resources. However, the reuse-based approach raises two main questions: (1) where and how do I find available vocabularies, and (2) how do I select the vocabulary that best fits my needs? Moreover, there may be cases in which governments need to mint their own vocabulary terms, which raises a third question: (3) how do I mint my own vocabulary terms? This section answers these questions with one checklist per question.

Discovery checklist

As stated above, under the reuse-based approach governments should look for available vocabularies to reuse instead of building new vocabularies from scratch.

This checklist provides some considerations for finding existing vocabularies that could best fit the needs of a government or a specialized agency.

Define the scope of the domain
What it means: Developing a common understanding of what is included in, or excluded from, the domain. Defining the scope of the domain narrows the search and helps to quickly find related work in Linked Open Data initiatives, and hence existing vocabularies of the same domain that can be reused. Most of the time, the dataset itself gives some hints about the domain.
Examples of domains: Geography, Environment, Administrations, State Services, Statistics, People, Organisation, etc.
Identify relevant keywords in the dataset
What it means: Identifying words that describe the main ideas or concepts. The relevant keywords or categories of your dataset are the input for searching with a Semantic Web search engine. If you have raw data in CSV, the column names of the tables can be used as search terms.
Examples: commune, county, point, feature, address, etc.
Searching for a vocabulary in one specific language
What it means: Many of the available vocabularies are in English. You may want a vocabulary in your own language.
Consider this issue as it may restrict your search. Sometimes it might be useful to translate some of the keywords into English.
How to find vocabularies
What it means: There are specific search tools (Falcons, Watson, Sindice, Semantic Web Search Engine, Swoogle) that collect, analyse and index vocabularies and semantic data available online for efficient access.
Examples: You can search these tools for a relevant term or category present in your data.
Where to find existing vocabularies in dataset catalogues
What it means: Another way is to search dataset catalogues using the previously identified key terms. Some of these catalogues provide samples of how the underlying data was modelled and what it was used for.
Examples: Some existing catalogues are Data Hub (formerly CKAN), LOV directory, etc.
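
The keyword-identification step above can be sketched in a few lines. This is a minimal illustration, assuming the raw data is a CSV file whose first row holds the column names; the function name `candidate_keywords` and the sample columns are hypothetical and not part of any tool mentioned in this checklist.

```python
import csv
import io
import re


def candidate_keywords(csv_text):
    """Return lowercase search terms derived from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    keywords = []
    for column in header:
        # Split camelCase, then treat underscores, hyphens and spaces as separators.
        spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", column)
        for word in re.split(r"[\s_\-]+", spaced.strip()):
            word = word.lower()
            if word and word not in keywords:
                keywords.append(word)
    return keywords


# Hypothetical sample data, for illustration only.
sample = "commune_name,county,postalAddress,latitude,longitude\nLyon,Rhone,1 rue X,45.76,4.84\n"
print(candidate_keywords(sample))
```

The resulting terms are only candidates: they still need to be reviewed by hand, and possibly translated into English, before being used in a search.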

Vocabulary Selection Criteria checklist

This checklist gives some advice on how to assess the vocabularies found in the Discovery step and select the one that best fits your needs. The final result should be one or two vocabularies that can be reused for your own purpose (mappings, extension, etc.).

Vocabularies should be self-descriptive.
What it means: Each property or term in a vocabulary should have a Label, Definition and Comment defined.
Self-describing data suggests that information about the encodings used for each representation is provided explicitly within the representation. The ability for Linked Data to describe itself, to place itself in context, contributes to the usefulness of the underlying data.
For example, the popular vocabulary DCMI Metadata Terms has a term named Contributor, which has:
  Label: Contributor
  Definition: An entity responsible for making contributions to the resource
  Comment: Examples of a Contributor include a person, an organization, or a service.
Vocabularies should be described in more than one language
What it means: The vocabulary should support multilingualism, i.e., all the elements of the vocabulary should have labels, definitions and comments available in the government's official language, e.g., Spanish, and at least in English.
This also matters for the documentation: each label or comment should carry the appropriate tag for the language it is written in.
For example, for the same term Contributor:
  rdfs:label "Contributor"@en, "Colaborador"@es
  rdfs:comment "Examples of a Contributor include a person, an organization, or a service"@en , "Ejemplos de Colaborador incluyen una persona, una organización o un servicio"@es


Vocabulary reusability
What it means: It is always better to check how the vocabulary is used by other initiatives and how popular it is.
For example: Recent statistics on the use of vocabularies in the LOD cloud show that FOAF is reused by more than 55 other vocabularies.
Vocabularies should be accessible for a long period
What it means: The selected vocabulary should come with a guarantee of long-term maintenance, or at least its editors should be aware of the issue.
This also includes checking the permanence of the URIs and the vocabulary's versioning policy. This is strongly related to the best practices described in the Stability section.
Vocabularies should be published by a trusted group or organization
What it means: Although anyone can create a vocabulary, it is always better to check whether a person, a group or an organization is responsible for publishing and maintaining it.
A well-known organization is generally more trustworthy than a single person.
Vocabularies should have permanent URIs
What it means: Accessing any term of the vocabulary should never return an HTTP 404 error. It also means permanent access to the server hosting the vocabulary, which facilitates reuse and consumption of the data built upon it.
Example: The W3C Geo vocabulary is one of the most used vocabularies for the basic representation of geometry points (latitude/longitude) and has been around since 2003, always available at the same namespace. This is strongly related to the best practices described in the Stability section.
Vocabularies should provide a versioning policy
What it means: The publisher should have a mechanism in place that takes care of backward compatibility between versions and records how changes affect previous versions.
Major changes to the vocabulary should be reflected in the documentation, in both machine-readable and human-readable formats. This is strongly related to the best practices described in the Versioning section.
Vocabularies should provide documentation
What it means: A vocabulary should be well documented for machines (labels and comments, each tagged with its language).
For humans, the publisher should provide additional documentation that explains the classes and properties, if possible with some valuable use cases.
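
The self-description and multilingualism criteria above lend themselves to a simple automated check. The following is a minimal sketch, assuming the labels and comments of a candidate vocabulary have already been extracted into a plain Python dictionary; the function name `missing_annotations`, the dictionary layout and the incomplete "Creator" entry are hypothetical and only serve the illustration.

```python
def missing_annotations(terms, required_langs=("en",)):
    """Map each term name to the (property, language) pairs it lacks."""
    problems = {}
    for name, annotations in terms.items():
        gaps = []
        for prop in ("label", "comment"):
            available = annotations.get(prop, {})
            for lang in required_langs:
                if not available.get(lang):
                    gaps.append((prop, lang))
        if gaps:
            problems[name] = gaps
    return problems


# Hypothetical in-memory description of two terms. The Contributor entry
# follows the example above; "Creator" is deliberately incomplete.
terms = {
    "Contributor": {
        "label": {"en": "Contributor", "es": "Colaborador"},
        "comment": {
            "en": "Examples of a Contributor include a person, an organization, or a service.",
            "es": "Ejemplos de Colaborador incluyen una persona, una organización o un servicio",
        },
    },
    "Creator": {"label": {"en": "Creator"}},
}
print(missing_annotations(terms, required_langs=("en", "es")))
```

A vocabulary for which this check reports many gaps is not necessarily unusable, but the gaps indicate how much documentation work a reuser would have to do.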

Vocabulary management/creation

As mentioned above, there may be cases in which governments need to mint their own vocabulary terms. This section provides a set of considerations to help government stakeholders mint their own vocabulary terms. It repeats some items of the previous section because some recommendations for vocabulary selection also apply to vocabulary creation.

Define the URI of the vocabulary.
What it means: The URI that identifies your vocabulary must be defined. This is strongly related to the best practices described in the URI Construction section.
For example: If we are minting new vocabulary terms for a particular government, we should first define the URI of that government's vocabulary.
Vocabulary should be self-descriptive.
What it means: Each property or term in a vocabulary should have a Label, Definition and Comment defined.
Self-describing data suggests that information about the encodings used for each representation is provided explicitly within the representation. The ability for Linked Data to describe itself, to place itself in context, contributes to the usefulness of the underlying data.
For example, the popular vocabulary DCMI Metadata Terms has a term named Contributor, which has:
  Label: Contributor
  Definition: An entity responsible for making contributions to the resource
  Comment: Examples of a Contributor include a person, an organization, or a service.
Vocabulary should be described in more than one language
What it means: All the elements of the vocabulary should have labels, definitions and comments available in the government's official language, e.g., Spanish, and at least in English.
This also matters for the documentation: each label or comment should carry the appropriate tag for the language it is written in.
For example, for the same term Contributor:
  rdfs:label "Contributor"@en, "Colaborador"@es
  rdfs:comment "Examples of a Contributor include a person, an organization, or a service"@en , "Ejemplos de Colaborador incluyen una persona, una organización o un servicio"@es
Vocabulary should provide a versioning policy
What it means: The publisher should have a mechanism in place that takes care of backward compatibility between versions and records how changes affect previous versions.
Major changes to the vocabulary should be reflected in the documentation, in both machine-readable and human-readable formats. This is strongly related to the best practices described in the Versioning section.
Vocabulary should provide documentation
What it means: A vocabulary should be well documented for machines (labels and comments, each tagged with its language).
For humans, the publisher should provide additional documentation that explains the classes and properties, if possible with some valuable use cases.
Vocabulary should be published following available best practices
What it means: One of the goals is to contribute to the community by sharing the new vocabulary. To this end, it is recommended to follow available recipes for publishing RDF vocabularies, e.g., Best Practice Recipes for Publishing RDF Vocabularies.

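Several of the items above (a defined vocabulary URI, self-description, more than one language) come together when a new term is written down. The following is a minimal sketch that serializes one term as a Turtle snippet; the function name `term_to_turtle` and the namespace `http://example.org/def/` are assumptions made for the illustration, and a real vocabulary would also declare the type of each term and follow the URI Construction guidance.

```python
def term_to_turtle(namespace, local_name, labels, comments):
    """Serialize one vocabulary term as a Turtle snippet with language-tagged text."""

    def literals(values):
        parts = []
        for lang, text in sorted(values.items()):
            # Escape backslashes and double quotes so the literal stays valid Turtle.
            escaped = text.replace("\\", "\\\\").replace('"', '\\"')
            parts.append(f'"{escaped}"@{lang}')
        return ", ".join(parts)

    return (
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n\n"
        f"<{namespace}{local_name}>\n"
        f"    rdfs:label {literals(labels)} ;\n"
        f"    rdfs:comment {literals(comments)} .\n"
    )


# Hypothetical namespace and term, for illustration only.
snippet = term_to_turtle(
    "http://example.org/def/",
    "contributor",
    {"en": "Contributor", "es": "Colaborador"},
    {"en": "Examples of a Contributor include a person, an organization, or a service."},
)
print(snippet)
```

Generating the snippet from structured input, rather than writing it by hand, makes it easier to verify that every term carries the same set of languages.
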
Multilingualism in vocabs

This section provides some considerations for dealing with multilingualism in vocabularies. Multilingualism in vocabularies can currently be found in the following forms:

  • As a set of rdfs:label values, each restricted to one language (@en, @fr...). Currently, this is the most commonly used approach. It is also a best practice to always include an rdfs:label without a language tag. This label corresponds to the "default" language of the vocabulary.
  • As skos:prefLabel (or skosxl:Label), in which the language has also been restricted.
  • As a set of monolingual ontologies (ontologies in which labels are expressed in one natural language) in the same domain mapped or aligned to each other (see the example of EuroWordNet, in which wordnets in different natural languages are mapped to each other through the so-called ILI - inter-lingual-index-, which consists of a set of concepts common to all categorizations).
  • As a combination of ontology and lexicon. This represents the latest trend in the representation of linguistic (multilingual) information associated with ontologies. The idea is that the ontology is associated with an external ontology of linguistic descriptions. One of the best exponents of this approach is the lemon model [1], [2], an ontology of linguistic descriptions that is related to the concepts and properties in an ontology to provide lexical, terminological, morphosyntactic, etc., information. One of the main advantages of this approach is that semantic and linguistic information are kept separate. One can link several lemon models in different natural languages to the same ontology.

The current trend is to follow the first approach, i.e., to use rdfs:label and rdfs:comment for each term in the vocabulary.
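
On the consuming side, the first approach implies a simple lookup rule: use the label in the reader's preferred language if it exists, otherwise fall back to the untagged "default" label. The following is a minimal sketch of that rule, assuming the labels of a term are available as (text, language) pairs with `None` standing for a missing language tag; the function name `pick_label` is hypothetical.

```python
def pick_label(labels, preferred_langs):
    """Pick a label from (text, lang) pairs; lang is None for an untagged label."""
    by_lang = {}
    for text, lang in labels:
        key = lang.lower() if lang else None
        # Keep the first label seen for each language.
        by_lang.setdefault(key, text)
    for lang in preferred_langs:
        if lang.lower() in by_lang:
            return by_lang[lang.lower()]
    # Fall back to the untagged "default" label, if the vocabulary provides one.
    return by_lang.get(None)


labels = [("Contributor", None), ("Contributor", "en"), ("Colaborador", "es")]
print(pick_label(labels, ["es", "en"]))
print(pick_label(labels, ["fr"]))
```

Without the untagged label, the second call would have nothing to return, which is why including it is considered a best practice.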

[1] http://tia2011.crim.fr/Workshop-Proceedings/pdf/TIAW15.pdf

[2] http://lexinfo.net/

From here only general notes

Some additional Notes

One of the most challenging tasks when publishing a data set is to have metadata describing the model used to capture it. The model or ontology gives the semantics of each term used within the data set, or in the LOD cloud once published. The importance of selecting the appropriate vocabulary is threefold:

  • Ease interoperability with existing vocabularies
  • Facilitate integration with data sources from other publishers
  • Speed up the creation of new vocabularies, since they are not created from scratch but based on existing ones.


Publishers should take time to identify the application domain of their data set: finance, statistics, geography, weather, administrative divisions, organisation, etc. Based on the relevant concepts present in the data set, one of the following options can be applied:

  • Searching vocabularies using Semantic Web search engines: The five most used are Swoogle and Watson (ontology-oriented Web engines); SWSE and Sindice (triple-oriented Web engines); and Falcons (a hybrid-oriented Web engine).

One of the difficult tasks in the ontology reuse process is deciding which Semantic Web search engine to use to search for ontologies efficiently. Five engines are well known and typically used in the literature.

  • What are the criteria for choosing a Semantic Web search engine in a particular domain?

In the literature, there are no guidelines that help ontology developers choose among these engines. The guidelines proposed here could potentially help ontology designers make that decision. We can divide Semantic Web search engines into three groups:

  • "Ontology-oriented" Web engines, such as Swoogle and Watson.
  • "Triple-oriented" (or RDF-oriented) Web engines, such as SWSE and Sindice.
  • "Hybrid-oriented" Web engines, such as Falcons.

Also, a quick observation from experimenting with the above-mentioned engines is that there is no clear separation between ontologies and RDF data coming from blogs and other sources like DBpedia.

In practice, using the search engines consists of querying them with the set of relevant concepts of the domain (e.g., tourism, point of interest, organization, etc.). The output of this exercise is a list of candidate ontologies to be assessed for reuse.

The Data Hub (formerly CKAN) maintains the list of shared data sets, which can be accessed through an API or a full JSON dump. The approach here could be to look for data sets in the same or a similar domain of interest, and analyze the metadata describing that data to find out which vocabularies are reused. Another "data market" place worth mentioning is Kasabi.

  • Searching vocabularies using LOV.

The Linked Open Vocabularies (a.k.a. LOV) is a data set expressed in RDF that inventories vocabularies for describing data sets, as well as the semantic relations between those vocabularies. Although it is still in a preliminary state, it already contains more than 100 identified vocabularies. It turns out that some vocabularies are "commonly" used, such as SKOS, FOAF, Dublin Core, Geo and Event.

  • Combination of the methods mentioned above: It consists of combining searches in the existing search engines with searches in data set catalogues.
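
Finding out which vocabularies a catalogued data set reuses can be partly automated once a sample of its term URIs is at hand. The following is a minimal sketch, assuming the property and class URIs have already been collected from the data set or its metadata; how they are obtained from a catalogue is not shown, and the function names `namespace_of` and `vocabularies_used` are hypothetical.

```python
from collections import Counter


def namespace_of(uri):
    """Return the namespace part of a term URI: everything up to the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[: cut + 1] if cut >= 0 else uri


def vocabularies_used(term_uris):
    """Count how often each vocabulary namespace occurs, most frequent first."""
    return Counter(namespace_of(u) for u in term_uris).most_common()


# Sample term URIs, for illustration only.
uris = [
    "http://xmlns.com/foaf/0.1/name",
    "http://xmlns.com/foaf/0.1/homepage",
    "http://purl.org/dc/terms/contributor",
    "http://www.w3.org/2003/01/geo/wgs84_pos#lat",
]
for namespace, count in vocabularies_used(uris):
    print(count, namespace)
```

Splitting at the last "#" or "/" is a heuristic; it works for the common hash and slash namespaces but may misjudge vocabularies with unusual URI patterns.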



Boris

We need to determine the vocabulary to be used for modelling the domain of the government data sources. The most important recommendation in this context is to reuse available vocabularies as much as possible. Reuse speeds up vocabulary development, so governments save time, effort and resources. This activity consists of the following tasks:

  • Search for suitable vocabularies to reuse. Currently there are some useful repositories for finding available vocabularies, such as SchemaCache, Watson, Swoogle, LOV (Linked Open Vocabularies) and SchemaPedia.
  • If we do not find any vocabulary that is suitable for our purposes, we should create one, reusing existing resources as much as possible, e.g., government catalogues, vocabularies available at sites like [1], etc.
  • Finally, if we find neither available vocabularies nor resources for building the vocabulary, we have to create the vocabulary from scratch.
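
The three tasks above form a simple decision sequence, which can be written down as follows. This is only a sketch of the order in which the options are tried; the function name `vocabulary_strategy` and the returned strings are hypothetical.

```python
def vocabulary_strategy(suitable_vocabularies, reusable_resources):
    """Choose how to obtain a vocabulary, preferring reuse over new development."""
    if suitable_vocabularies:
        return "reuse existing vocabularies"
    if reusable_resources:
        return "build a vocabulary from existing resources"
    return "create the vocabulary from scratch"


print(vocabulary_strategy(["FOAF", "Dublin Core"], []))
print(vocabulary_strategy([], ["government catalogue"]))
print(vocabulary_strategy([], []))
```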

The following Figure shows the proposed workflow for creating the vocabulary


Vocabularycreation.PNG


@@TODO@@ Questions to answer:

  • What is the best repository for vocabulary?
  • What are the criteria for using a given vocabulary?
    • Number of LD datasets using it?

TO DO

VocabularySelectionQualityChecklist

MultilingualismOfVocabs