LLD Exploitation

From Best Practices for Multilingual Linked Open Data Community Group

Introduction

This best practice describes how to exploit Linguistic Linked Data (LLD) resources. The suggested steps for exploitation are:

  • searching for and discovering relevant resources
  • verifying the license of the dataset
  • navigating to the distribution of the data (download or SPARQL endpoint)
  • extracting the part of the data that is relevant for a particular purpose or application

Use Case

Consider a company that develops sentiment analysis and opinion mining software. It has a working system for English and wants to port it to German. To do so, the company wants to find a corpus that is annotated at the sentiment level and extract a first seed lexicon of German subjective expressions with their polarity (positive, negative, neutral).

Method

To exploit Linguistic Linked Data resources, the steps listed in the introduction can be implemented as follows:

  • Search and discovery: relevant linguistic resources can be discovered using LingHub, which has been developed by the LIDER project.
  • Licensing: once a relevant dataset has been found in LingHub, clicking on the link of the resource leads to a page containing all the metadata about the resource, where the license of the dataset can be verified.
  • Distribution: from the metadata page in LingHub, one can either download the dataset or discover where the SPARQL endpoint of the data is.
  • Extraction: using W3C standards, in particular SPARQL as the RDF query language, one can extract the portion of the data that is needed for a particular purpose.

If the LIDER guidelines are followed during publication and metadata provision, and if the resource is registered at META-SHARE, CLARIN VLO, LRE Map or DataHub, LingHub will crawl the resource and index it with the appropriate metadata. Further, if the de facto standards and vocabularies recommended by LIDER are used, the same extraction patterns can be applied to different datasets.
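The extraction step can be sketched in Python using only the standard library. This is a minimal sketch, assuming the endpoint follows the SPARQL 1.1 Protocol (the query is passed in the `query` parameter of a GET request) and can return results in the SPARQL JSON format. The endpoint URL `http://example.org/sparql`, the placeholder query, and the helper name `build_sparql_request` are hypothetical and not part of LingHub or any LIDER tool.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build (but do not send) a SPARQL 1.1 Protocol GET request."""
    # The protocol passes the query text in the URL-encoded `query` parameter.
    # Assumption: `endpoint` does not already contain a query string.
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for SPARQL JSON results via content negotiation.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Hypothetical endpoint; a real one would be taken from the dataset's metadata page.
ENDPOINT = "http://example.org/sparql"
# Placeholder query used only to show the mechanics.
QUERY = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"

request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send the query. Because the endpoint is a parameter, the same helper can be reused for any dataset that exposes a SPARQL endpoint, which is the practical benefit of shared vocabularies and extraction patterns.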

Use Case Revisited

The company looking for a German lexicon would follow the methodology sketched above. For the extraction step, a SPARQL query such as the following could be used:

PREFIX nif:  <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>
PREFIX marl: <http://www.gsi.dit.upm.es/ontologies/marl/ns#>

# Select every German phrase together with its annotated polarity
SELECT ?string ?polarity
WHERE {
    ?phrase nif:anchorOf ?string ;
            nif:lang <http://www.lexvo.org/page/iso639-3/deu> ;
            marl:hasPolarity ?polarity .
}

Running this query against a suitable dataset, the company would obtain a seed lexicon of subjective phrases with their polarity.
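To illustrate how the query result could be turned into a lexicon, the following Python sketch converts a response in the SPARQL JSON results format into a dictionary from phrase to polarity. It assumes the result variables are named `string` and `polarity`, as in the query above. The helper name `bindings_to_lexicon` is hypothetical, and the sample response is invented for illustration; it is not real data from any dataset.

```python
import json


def bindings_to_lexicon(results_json: str) -> dict:
    """Map each phrase string to its polarity value from SPARQL JSON results."""
    results = json.loads(results_json)
    lexicon = {}
    for row in results["results"]["bindings"]:
        # Keys match the SELECT variables ?string and ?polarity.
        # If a phrase occurs more than once, the last polarity seen wins.
        lexicon[row["string"]["value"]] = row["polarity"]["value"]
    return lexicon


# Invented sample response (illustration only, not taken from a real corpus).
MARL = "http://www.gsi.dit.upm.es/ontologies/marl/ns#"
SAMPLE = json.dumps({
    "head": {"vars": ["string", "polarity"]},
    "results": {"bindings": [
        {"string": {"type": "literal", "value": "sehr gut"},
         "polarity": {"type": "uri", "value": MARL + "Positive"}},
        {"string": {"type": "literal", "value": "enttäuschend"},
         "polarity": {"type": "uri", "value": MARL + "Negative"}},
    ]},
})

lexicon = bindings_to_lexicon(SAMPLE)
for phrase, polarity in lexicon.items():
    print(phrase, polarity)
```

A plain dictionary is enough for a first seed lexicon. A corpus in which the same phrase carries different polarities in different contexts would need a richer structure, for example a count of polarities per phrase.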