Cluster Authority data

From Library Linked Data
Revision as of 17:18, 11 September 2011 by Aisaac (Talk | contribs)


Authors: Alexander Haffner, Jeff Young, Joachim Neubert

Background

Library data, its provision, and the corresponding search interfaces have improved considerably over time. Having moved from catalog cards to networked databases, libraries are now working toward internationally aligned data sets on the Web to best serve user needs. A crucial part of this development came with the establishment of its backbone: authority data.

These authority records represent so-called entities. Entities are described by attributes and by the relationships between them. Typically, authority entities comprise “works”, “persons”, “families”, “corporate bodies”, “objects”, “concepts”, “events” and “places”. Models such as Functional Requirements for Bibliographic Records (FRBR) and Functional Requirements for Authority Data (FRAD) define the attributes that describe individual entities and the relationships between entities. These models are aimed at an increasingly linked environment that meets user needs for identifying, finding, selecting, and obtaining particular entities.

The linking of library data has changed along with the idea of authority itself. In the early years, catalog cards shaped our perspective: controlled access points were used to relate individual cards to each other. With the advent of computers, the conceptualization shifted from cards to records as the primary entity, and the controlled access point turned into a controlled heading. This made it possible to relate authority and bibliographic data via record identifiers or controlled headings.

Currently, authority data primarily fulfils requirements specific to the library domain, particularly in the level of detail that is essential for differentiating persons, places, etc. This level of detail is valuable and necessary for moving toward internationalized library data. The highly detailed information will act as the starting point for aligning upcoming linked data sets. The required match and merge processes can rely on existing literal information to perform equality checks and establish the intended links.

Consequently, linking library data will allow international data harmonization and more intensive reuse of data within the library community. A worldwide connected library data network also enables other memory institutions, such as museums and archives, to join in the maintenance of authorities. Additionally, commercial organizations such as news agencies or booksellers could reuse this information and take part in the maintenance by integrating their domain-specific information (e.g. broadcast information, distributor details) into the upcoming data pool.

The goal of linked authority data is the collaborative use and maintenance of authorities by a variety of memory institutions and interested communities. In a globally connected data network, partners could search across all data and help to avoid and reduce redundancies. To achieve high-quality data, responsibility has to be shared and reconciliation has to be based on shared international principles. However, as the number of maintaining members grows, the restrictions imposed by FRBR or FRAD will sooner or later have to be reconsidered to match the needs of all participants in such a linked authority data cloud.

Topic in the Context of Linked Data

The role of concept schemes in authorities

For the reuse of authorities and the intended matching processes, it is advisable to group authorities into so-called concept schemes. All authorities of an authority file could be aggregated in a single concept scheme, but grouping them into topic-oriented datasets is preferable.

A SKOS concept scheme can aggregate one or more SKOS concepts. Semantic relationships between those concepts may also be viewed as part of a concept scheme. This definition is, however, meant to be suggestive rather than restrictive, and there is some flexibility in the formal data model.

The notion of a concept scheme is useful when dealing with data from an unknown source, and when dealing with data that describes two or more different knowledge organization systems (e.g. authority files).

To identify the context of a concept scheme, it is also helpful to define owner and maintainer information, some versioning information, and any restrictions on access and use.

Real world entities and their conceptualizations

Authorities are often conceptualizations of real world entities (people, organizations, places, events etc.). These conceptualizations may have some meta-information attached, like creation dates or editorial notes. They may be part of a certain concept scheme which was created by an agency curating the authority. On the other hand, there are the real world entities to which the conceptualizations belong. They have attributes which describe only their own characteristics, like the date of birth for people or longitude and latitude for a place. What is true for the conceptualization is normally not true for the real world entity, and vice versa. Therefore, it makes sense to distinguish the real world entity from the conceptualization.

Furthermore, different agencies inevitably coin their own conceptualizations (with different URIs) of the same real world entity and attach their own meta-information to the conceptualizations. Connecting these conceptualizations via owl:sameAs statements would cause confusion (since all metadata attributes would be attached to all URIs). But to be useful in a bigger scope, it is essential to state that they are all related to the same real world entity.

[Figure: Concept-RWO3.JPG]

To solve both of the above-mentioned problems (distinguishing and relating conceptualizations and real world entities), we suggest the following, as visualized in the figure above:

  1. each conceptualization (URI) in an authority file should be connected via foaf:focus (1) to the URI of the corresponding real world entity
  2. it would be correct to conclude that if <URI-C1> foaf:focus <URI-R1> and <URI-C2> foaf:focus <URI-R1>, then <URI-C1> skos:exactMatch <URI-C2>
  3. the same conclusion holds for <URI-C1> foaf:focus <URI-R1>, <URI-C2> foaf:focus <URI-R2> and <URI-R1> owl:sameAs <URI-R2>

This means that authority entities could be interlinked by exploiting the fact that they reference the same real world entity. For this reason, authority agencies should consider coining URIs for the real world entities, too. These could be used by themselves or others for matching and mapping.

Of course, a lower degree of confidence in mappings (especially in cases where fuzzy machine-matching is involved) can be expressed with skos:closeMatch on the concept side or e.g. umbel:isLike on the real world side.

Example

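 @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
 @prefix foaf: <http://xmlns.com/foaf/0.1/> .
 @prefix owl: <http://www.w3.org/2002/07/owl#> .
 # LIBRIS Authority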
 <http://libris.kb.se/resource/auth/207420#concept> a skos:Concept ; # URI-C1
     skos:exactMatch <http://viaf.org/viaf/102333412/#skos:Concept> ; # URI-C2
     foaf:focus <http://libris.kb.se/resource/auth/207420> . # URI-R1
 <http://libris.kb.se/resource/auth/207420> a foaf:Person ; # URI-R1
     foaf:name "Dzjejn Osten" ;
     owl:sameAs <http://viaf.org/viaf/102333412/#foaf:Person> ; # URI-R2
     owl:sameAs <http://dbpedia.org/resource/Jane_Austen> . # URI-R3
 # VIAF Authority  
 <http://viaf.org/viaf/102333412/#skos:Concept> a skos:Concept ; # URI-C2
     skos:exactMatch <http://libris.kb.se/resource/auth/207420#concept> ; # URI-C1
     foaf:focus <http://viaf.org/viaf/102333412/#foaf:Person> . # URI-R2
 <http://viaf.org/viaf/102333412/#foaf:Person> a foaf:Person ; # URI-R2
     foaf:name "Austen, Jane" ;
     owl:sameAs <http://libris.kb.se/resource/auth/207420> ; # URI-R1
     owl:sameAs <http://dbpedia.org/resource/Jane_Austen> . # URI-R3
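The conclusions in the numbered list above can also be derived mechanically. The following is a minimal Python sketch, not a reference implementation: it assumes the data is already available as plain (subject, predicate, object) tuples, uses the placeholder URIs from the list (URI-C3 and URI-R4 are additional hypothetical placeholders), and uses a small union-find structure to cluster real world entity URIs connected by owl:sameAs. The function name infer_exact_matches is an assumption made for illustration.

```python
from collections import defaultdict
from itertools import combinations

FOCUS = "foaf:focus"
SAME_AS = "owl:sameAs"


def infer_exact_matches(triples):
    """Return pairs of concepts whose foaf:focus points to the same real
    world entity, either directly or via owl:sameAs between entities."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Rule 3: owl:sameAs merges real world entity URIs into one cluster.
    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(s)] = find(o)

    # Rules 1 and 2: group concepts by the cluster of the entity they focus on.
    concepts_by_entity = defaultdict(set)
    for s, p, o in triples:
        if p == FOCUS:
            concepts_by_entity[find(o)].add(s)

    matches = set()
    for concepts in concepts_by_entity.values():
        for c1, c2 in combinations(sorted(concepts), 2):
            matches.add((c1, c2))  # candidate skos:exactMatch statement
    return matches


triples = [
    ("URI-C1", FOCUS, "URI-R1"),
    ("URI-C2", FOCUS, "URI-R2"),
    ("URI-R1", SAME_AS, "URI-R2"),
    ("URI-C3", FOCUS, "URI-R4"),  # hypothetical unrelated concept
]
print(infer_exact_matches(triples))  # {('URI-C1', 'URI-C2')}
```

Each returned pair is only a candidate for a skos:exactMatch statement. Where the underlying owl:sameAs links come from fuzzy matching, skos:closeMatch may be the more appropriate property.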

The Role of Labeling

While authorities traditionally deal with names ("controlled access points"), especially in the Anglo-American library world, in the Semantic Web and Linked Data context these names are used as labels to denominate concepts (nomen vs. thema in FRSAD terms).

SKOS has identified patterns for such labels, which are not restricted to SKOS concepts(*). Labels are character strings; they often come with a language tag and thus support multilingual systems.

  • preferred labels are meant as authoritative labels for a resource. Therefore, there is no more than one value of a preferred label per language permitted for a resource. Furthermore, it is recommended that the preferred label unambiguously represents a single resource (within the scope of a single knowledge representation system).
  • alternate labels can be used to express synonyms, but can be used for other purposes, too (near synonyms, upward posting, disambiguation of labels).
  • hidden labels can be introduced as (normally invisible) labels for example to include misspelled variants of a label in search operations.

So alternate and hidden labels can offer the means to support searching and accessing resources, while preferred labels are essential for identifying resources (especially in the context of lists, where a short entry within a larger display must be sufficient). For application builders, it is extremely helpful to have well-known generic properties for these purposes across different sources of Linked Data. Therefore, we suggest adding SKOS lexical properties to authorities, as they are based on a well thought-out and common scheme.

The SKOS-XL extension allows more elaborate labeling, for example attaching an abbreviation to the same label as its unabbreviated form, or attaching a first and a family name, together with a reference to the transcription rules used, to the label for a person.

(*) Note: skos:prefLabel could be handy for every kind of resource. Beware, however, of the case of different skos:prefLabels attached to resources which are "smushed" via owl:sameAs. This could violate the condition of no more than one skos:prefLabel per language!
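The condition mentioned in the note can be checked before resources are merged. The following is a minimal Python sketch under stated assumptions: preferred labels are given as (URI, label, language) tuples, owl:sameAs links as URI pairs, and the URIs ex:a and ex:b as well as the function name pref_label_conflicts are hypothetical. It reports every smushed cluster that would end up with more than one preferred label in the same language.

```python
from collections import defaultdict


def pref_label_conflicts(pref_labels, same_as_pairs):
    """Find clusters of owl:sameAs-connected resources that carry more
    than one distinct skos:prefLabel in the same language."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in same_as_pairs:
        parent[find(a)] = find(b)

    labels = defaultdict(set)   # (cluster root, language) -> distinct labels
    members = defaultdict(set)  # cluster root -> URIs carrying a prefLabel
    for uri, label, lang in pref_labels:
        root = find(uri)
        labels[(root, lang)].add(label)
        members[root].add(uri)

    return [
        (sorted(members[root]), lang, sorted(found))
        for (root, lang), found in labels.items()
        if len(found) > 1
    ]


# Hypothetical data: two resources with different English preferred labels.
pref_labels = [
    ("ex:a", "Austen, Jane", "en"),
    ("ex:b", "Jane Austen", "en"),
    ("ex:b", "Austen, Jane", "sv"),
]
print(pref_label_conflicts(pref_labels, [("ex:a", "ex:b")]))
# [(['ex:a', 'ex:b'], 'en', ['Austen, Jane', 'Jane Austen'])]
```

Without the owl:sameAs pair, the same labels produce no conflict, which is why attaching preferred labels to the conceptualizations rather than to the smushed real world entities avoids the problem.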

Scenarios (Case Studies)

This cluster is based on the following submitted use cases and scenarios:

[1] Use Case AuthorClaim

  • Agents create their own personal profile, including name variants.
  • These names are matched to the names of document creators in a database for the agent to confirm or deny their identity relative to the documents.

[2] Use Case Authority Data Enrichment

  • A librarian compares/matches entities across local/remote datasets and selectively merges the discovered entity/properties or else creates a link to the remote entity.
  • Merging into the local dataset could either involve integrating the remote vocabulary or translating the information into a local vocabulary.

[3] Use Case FAO Authority Description Concept Scheme

  • This is a concept scheme with multilingual preferred and alternate labels and various concept-to-concept relationships.
  • The scheme is designed for the agriculture and related sciences domain.

[4] Use Case International Registry for Authors

  • Agents create their own personal profile, including their preferred and variant name forms. The supported naming structure is designed to accommodate more complex personal names.
  • Agents are then encouraged to use their preferred name form when publishing materials.

[5] Use Case Linked Data Service of the German National Library

  • The German National Library currently publishes their national authority data about persons, corporate bodies, etc.
  • The Linked Data is available by dereferencing URIs, a search interface, or as a database dump.

[6] Use Case Virtual International Authority File (VIAF)

  • VIAF processes authority records from a variety of sources to produce a "cluster".
  • The Linked Data is used to:
    • represent a hub and spoke relationship between the cluster and contributed records.
    • represent the primary entity as a Person, Corporate Body, etc. with appropriate properties.

Scenarios (Extracted Use Cases)

Adding metadata by non-librarians while uploading a working paper

Alice uploads her working paper on corporate taxation to an economics repository. After entering title and abstract, she has to add her own name and the names of her co-workers. When she starts typing, a list of already-known authors from an authority file is presented, augmented with additional information (e.g. year of birth) to make the persons identifiable. When she selects an author, the system stores the author's URI in the background, which facilitates precise retrieval of all papers by a given author, irrespective of the possibly varying literal forms she and her colleagues used when entering his or her name. When she starts adding keywords for the paper, suggestions from a disciplinary thesaurus (e.g. STW Thesaurus for Economics) are presented. Hints lead from alternate forms of keywords (possibly in other languages) to the preferred form and help her select the best-fitting keywords. Storing a URI in the background, again, supports precise retrieval. Additionally, it allows keywords to be displayed in the preferred language of a given user.

Using Authority Data to Extend Search Results

John searches for “FAO” in a document repository (with data from different sources not controlled by a central authority). The system directs him to all the records associated with the authorized form of this corporate body, as indicated in its authority record. The authorized form is “Food and Agriculture Organization of the United Nations”. The authority record thus serves to bring together all forms of the name for this corporate body, authorized and non-authorized, e.g. “FAO, Rome (Italy)”, “F.A.O.”, “FAO”, “Food and Agriculture Organization, F.A.O. of the U.N.” or “FAO of the UN”. Associated with the authority record for FAO are all the bibliographic records of documents issued by FAO. This assures John that his search is exhaustive. The system could also suggest related terms as further possible search terms.
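The lookup in this scenario can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of an existing system: the authority URI ex:fao, the competing URI ex:other, the document identifiers, and the function names are all hypothetical, and name forms are compared after a simple normalization that casefolds the string and drops punctuation and spaces.

```python
import re
from collections import defaultdict


def normalize(name):
    """Casefold and drop punctuation and whitespace, so "F.A.O." equals "FAO"."""
    return re.sub(r"[\W_]+", "", name.casefold())


def build_label_index(authorities):
    """Map every normalized name form to the authority URIs that carry it."""
    index = defaultdict(set)
    for uri, entry in authorities.items():
        for label in [entry["pref"], *entry["alt"]]:
            index[normalize(label)].add(uri)
    return index


def search(query, index, records, authorities):
    """Return (authorized form, documents) for each authority matching the query."""
    hits = []
    for uri in sorted(index.get(normalize(query), ())):
        docs = sorted(doc for doc, issuer in records if issuer == uri)
        hits.append((authorities[uri]["pref"], docs))
    return hits


# Hypothetical authority record using the name forms from the scenario.
authorities = {
    "ex:fao": {
        "pref": "Food and Agriculture Organization of the United Nations",
        "alt": [
            "FAO, Rome (Italy)",
            "F.A.O.",
            "FAO",
            "Food and Agriculture Organization, F.A.O. of the U.N.",
            "FAO of the UN",
        ],
    },
}
# Hypothetical bibliographic records: (document id, URI of the issuing body).
records = [("doc1", "ex:fao"), ("doc2", "ex:fao"), ("doc3", "ex:other")]

index = build_label_index(authorities)
print(search("f.a.o.", index, records, authorities))
# [('Food and Agriculture Organization of the United Nations', ['doc1', 'doc2'])]
```

Because the bibliographic records store the authority URI rather than a literal name, any name form John types resolves to the same set of documents.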

Aggregating Authority Data

VIAF collects authority data from the German National Library and other contributors. Despite differences in the information collected, the records in these systems often refer to the same entity. By comparing information, VIAF can create semantic links between them and publish those relationships from a "cluster" URI. The German National Library can then harvest the clusters it contributes to and ingest the links to relate its entities directly to those of other contributors. From an end-user perspective, VIAF can also aggregate the properties of these entities and include them in the cluster representation to help end-user Alice discover and trace them from a central location.

Relevant technologies

SPARQL (see Authority SPARQL Examples)

Relevant vocabularies

  • SKOS
  • FRBRer
  • FRAD
  • RDA
  • DBpedia
  • FOAF

References

(1) For the history of the discussion about foaf:focus on the LLD mailing list, see especially http://lists.w3.org/Archives/Public/public-lld/2010Aug/0008.html and http://lists.w3.org/Archives/Public/public-lld/2010Oct/0111.html