Use Case Language Technology

From Library Linked Data
Revision as of 14:34, 9 May 2011 by Dvila (Talk | contribs)

Name

Language Technology

Owner

Felix Sasaki

Background and Current Practice

Language technology is applied in areas like machine translation, automatic summarization, (web) search, spell checking etc. In machine translation especially, statistical approaches (e.g. as pursued by Google Translate) have made enormous progress in recent years. Such approaches rely on massive amounts of data. However, there are use cases and application areas for which that data is not available. Here, rule-based or hybrid approaches to machine translation and other language technology applications are being considered.

In these approaches, linguistic knowledge is key to successful development. Examples of such knowledge are:

  • knowledge about terms or concepts (or both) in a specific domain, sometimes provided for several languages and/or across languages
  • parts of speech and annotated corpora relying on them. Annotated corpora are sometimes used for training or verifying statistical applications, sometimes for rule-based applications.
  • linguistic grammars

A prominent resource frequently used in language technology is WordNet. WordNet is a lexical database, available in English and with counterparts in many other languages. It is used in applications like word sense disambiguation, anaphora resolution, information retrieval (also across languages), document classification etc. WordNet is also available in various RDF representations, which demonstrates the potential to use it as linked data.

WordNet is not the only candidate: other resources on the Web could be used for similar application scenarios. In particular, the knowledge created through library modeling, expressed e.g. as authority files, has not been used extensively for language technology applications. Recently, and as an ongoing process, more and more authority files have been published as linked data. These resources can be an important input to language technology that is not merely on the Web but in the Web, that is, built on resources that are not hard-coded in one application but distributed, and updated and improved by the Web itself (or by Web users themselves).

Goal

(1) Describe the general requirements of language technology applications on linked data

  • Data representation and services: what requirements do language technology applications have on the availability of linked data?
  • What do language technology services need to look like in order to take linked data into account?

(2) Describe the specific contributions library linked data can make:

  • Which kinds of relations between terms, concepts etc. available in library linked data are useful for which kinds of language technology applications?
  • What contributions can library linked data make to cross-lingual applications that are not feasible with purely statistical language technology approaches?
  • How can library linked data be integrated with other, larger linked data resources (e.g. DBpedia)?

Use Case Scenario

To estimate the usefulness of library linked data for language technology, it is important to concentrate first on one specific use case. This will be named entity recognition (NER) within a single language and potentially across languages.

A traditional approach to NER is the application of a gazetteer, that is, a dictionary with information about places, people, institutions etc. This approach has the drawback that it is hard to keep the gazetteer up to date. Another problem is the sustainable creation of gazetteers across languages.
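As an illustration of the gazetteer approach, the following is a minimal sketch that assumes a small hand-written dictionary and a greedy longest-match lookup over word tokens. The entries, the type names and the function names (`tokenize`, `tag_entities`) are hypothetical and chosen only for this example; real gazetteers are far larger and usually handle inflection and ambiguity as well.

```python
import re

# Hypothetical toy gazetteer: lowercased token sequence -> entity type.
GAZETTEER = {
    ("tokyo",): "PLACE",
    ("new", "york"): "PLACE",
    ("new", "york", "public", "library"): "INSTITUTION",
    ("jane", "austen"): "PERSON",
}

# Longest entry, in tokens; bounds the lookup window.
MAX_LEN = max(len(key) for key in GAZETTEER)


def tokenize(text):
    """Split text into word tokens (a deliberately simple assumption)."""
    return re.findall(r"\w+", text)


def tag_entities(text):
    """Return (surface form, type) pairs using greedy longest-match lookup."""
    tokens = tokenize(text)
    lowered = [t.lower() for t in tokens]
    found = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first so that "New York Public Library"
        # wins over the shorter entry "New York".
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in GAZETTEER:
                found.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return found


if __name__ == "__main__":
    print(tag_entities("Jane Austen never saw the New York Public Library or Tokyo."))
```

The sketch also makes the two drawbacks visible: any entity missing from `GAZETTEER` is silently ignored, and every additional language needs its own set of entries.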

Application of linked data for the given use case

Linked data could help to solve the two problems of NER ("keeping up to date" and "bridging across languages"). The resource DBpedia has already been used in pilot projects for creating gazetteers in one language and/or linking to other languages. However, given the heterogeneous quality of named entities in DBpedia, other resources need to be taken into account as well.

Resources like WordNet, currently used in many language technology applications, have two drawbacks: they are developed in a centralized manner, and they concentrate on general, "concept-like" entities rather than on named, that is, specific "instance-like" entities. A carefully chosen set of linked data resources, including WordNet(s) but not limited to them, could contribute to solving the two problems of NER within and across languages.
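How linked data might address both problems can be sketched as follows. The example assumes that entities are described by triples with language-tagged labels, in the style of SKOS `prefLabel` and `altLabel` as used by many published authority files. The URIs, the `ex:` types, the in-memory `TRIPLES` list and the function names are all hypothetical; a real application would fetch the triples from the published data rather than hard-code them.

```python
from collections import defaultdict

# Hypothetical triples: (subject URI, predicate, object).
# Label objects are (text, language tag) pairs.
TRIPLES = [
    ("http://example.org/place/tokyo", "rdf:type", "ex:Place"),
    ("http://example.org/place/tokyo", "skos:prefLabel", ("Tokyo", "en")),
    ("http://example.org/place/tokyo", "skos:prefLabel", ("東京", "ja")),
    ("http://example.org/place/tokyo", "skos:altLabel", ("Tokio", "de")),
    ("http://example.org/person/soseki", "rdf:type", "ex:Person"),
    ("http://example.org/person/soseki", "skos:prefLabel", ("Natsume Soseki", "en")),
    ("http://example.org/person/soseki", "skos:prefLabel", ("夏目漱石", "ja")),
]

LABEL_PREDICATES = {"skos:prefLabel", "skos:altLabel"}


def build_gazetteers(triples):
    """Build one gazetteer per language: label -> (URI, type)."""
    types = {s: o for s, p, o in triples if p == "rdf:type"}
    gazetteers = defaultdict(dict)
    for s, p, o in triples:
        if p in LABEL_PREDICATES:
            label, lang = o
            gazetteers[lang][label] = (s, types.get(s))
    return dict(gazetteers)


def labels_in(label, source_lang, target_lang, gazetteers):
    """Bridge languages via the shared URI of an entity."""
    entry = gazetteers.get(source_lang, {}).get(label)
    if entry is None:
        return []
    uri = entry[0]
    target = gazetteers.get(target_lang, {})
    return sorted(lbl for lbl, (u, _) in target.items() if u == uri)


if __name__ == "__main__":
    gaz = build_gazetteers(TRIPLES)
    print(gaz["ja"])
    print(labels_in("Tokyo", "en", "ja", gaz))
```

Because the gazetteers are derived from the triples, rebuilding them after the source data changes keeps them up to date, and the shared URI is what links a label in one language to its counterparts in another.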

Existing Work

Muddy uses named entities extracted from DBpedia. Our use case differs in that we want to focus on named entities extracted from library data.

Problems and Limitations

  • Variety of types in linked data useful for NER: what is a person, a place, a region etc.?
  • Alignment of the categories of types within and between languages: what is a street in English and in Japanese (if there is such a category)?
  • Alignment between small, high-quality resources and larger resources of heterogeneous quality, within the same application
  • Scalability of the "up to date" approach, and integration into a language technology processing pipeline, potentially crossing technology stacks (e.g. RDF > XML-based Web Services)
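The last point, crossing from an RDF-based resource to an XML-based service, can be made concrete with a small sketch. It assumes that NER has already produced annotations carrying character offsets, a type and the URI of the matching linked data resource, and serializes them as XML using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`document`, `entities`, `entity`, `ref`) and the annotation layout are hypothetical, not an established exchange format.

```python
import xml.etree.ElementTree as ET


def annotations_to_xml(text, annotations):
    """Serialize text plus entity annotations as an XML string.

    Each annotation is a dict with 'start', 'end', 'type' and 'uri' keys
    (a layout assumed for this sketch).
    """
    root = ET.Element("document")
    ET.SubElement(root, "text").text = text
    entities = ET.SubElement(root, "entities")
    for ann in annotations:
        entity = ET.SubElement(
            entities,
            "entity",
            {
                "start": str(ann["start"]),
                "end": str(ann["end"]),
                "type": ann["type"],
                # The URI keeps the link back to the linked data resource.
                "ref": ann["uri"],
            },
        )
        entity.text = text[ann["start"]:ann["end"]]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "Libraries in Tokyo"
    start = sample.index("Tokyo")
    print(annotations_to_xml(sample, [{
        "start": start,
        "end": start + len("Tokyo"),
        "type": "Place",
        "uri": "http://example.org/place/tokyo",
    }]))
```

Keeping the resource URI in the XML output means that a downstream XML-based component can still dereference the entity, even though it never handles RDF itself.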