Library standards and linked data

From Library Linked Data
Revision as of 14:17, 10 March 2011 by Gdunsire

Author: Gordon Dunsire

Note: the original first half of this page has been incorporated in Library_Data_Resources. Issues outlined on this page will be discussed by the LLD XG to inform the Problems and limitations section of the final report (DraftReport#Problems_and_limitations).

Issues for further discussion

General issues arising from discussion within IFLA, the DCMI RDA Task Group, and the LLD Incubator Group include:

Constrained versus unconstrained properties and classes

The FR models (FRBR, FRAD, and FRSAD) assign attributes to, and describe relationships between, specific entities. These are represented in the draft RDF element sets as properties with domains and/or ranges of the classes which represent the entities. The values of FR attributes are deliberately left open to reflect the generality of the models, and their corresponding properties have no ranges. The RDF properties from RDA, as an application of the FRBR and FRAD models, are similarly constrained. Some examples of constrained properties in RDA are:

  • Screenplay for the motion picture (Work)
  • Screenplay for the motion picture (Expression)
  • Critique of (Work)
  • Critique of (Expression)
  • Critique of (Manifestation)
  • Critique of (Item)
  • Name of the corporate body
  • Name of the family
  • Name of the person
  • Identifier for the corporate body
  • Identifier for the family
  • Identifier for the person

In addition, RDA specifies controlled vocabularies for the values of many attributes. Although the vocabularies have been represented in RDF, the constraints on property ranges have not yet been represented.

The FR models impose many other constraints, including class and property disjointness, class cardinality restrictions, and property typing (inverse, symmetric, transitive, etc.).

ISBD imposes few semantic constraints: all properties have the domain Resource and no range (except the controlled vocabularies for content form and media type).

The FRBR Review Group and the JSC (Joint Steering Committee for Development of RDA) consider it important to recommend the use of constrained properties and classes, so that the FR models and RDA are used as intended. However, both groups have acknowledged that other communities may wish to benefit from this work by using unconstrained versions of the properties. It is very likely that corresponding RDF properties without domains or ranges will be made available, probably in separate namespaces. Such properties would, by definition, be super-properties of the recommended constrained versions.
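
The relationship between a constrained property and its unconstrained super-property can be illustrated with a minimal sketch in Python. It assumes only two RDFS entailment rules (rdfs:domain and rdfs:subPropertyOf). The example.org URIs, the property names, and the sample value are hypothetical and are not the published RDA or FR element sets.

```python
# Hypothetical namespace: NOT the published RDA or FR namespaces.
EX = "http://example.org/"
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"
RDFS_SUBPROP = "rdfs:subPropertyOf"

schema = {
    # Constrained property: its domain is the class Person.
    (EX + "nameOfThePerson", RDFS_DOMAIN, EX + "Person"),
    # The constrained property is a sub-property of an unconstrained
    # property that has no domain and no range.
    (EX + "nameOfThePerson", RDFS_SUBPROP, EX + "name"),
}


def infer(schema, data):
    """Apply the rdfs:domain and rdfs:subPropertyOf rules until nothing new appears."""
    triples = set(data)
    while True:
        new = set()
        for s, p, o in triples:
            for prop, rule, target in schema:
                if prop != p:
                    continue
                if rule == RDFS_DOMAIN:
                    new.add((s, RDF_TYPE, target))  # the subject is typed by the domain
                elif rule == RDFS_SUBPROP:
                    new.add((s, target, o))  # the statement also holds for the super-property
        if new <= triples:
            return triples
        triples |= new


constrained = infer(schema, {(EX + "agent1", EX + "nameOfThePerson", "Jane Doe")})
unconstrained = infer(schema, {(EX + "agent2", EX + "name", "Jane Doe")})
```

In this sketch the constrained statement yields both a class membership for agent1 and the equivalent unconstrained statement, whereas the unconstrained statement about agent2 yields nothing further. This is one way to see why the constrained approach retains more information while the unconstrained properties remain usable by other communities.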

Arguments for a constrained approach:

  • Retains information latent in the structure of legacy records.
  • Helps to ensure intended utility.
  • Allows powerful semantic inferencing.
  • Easy for other communities to declare unconstrained versions.

Arguments for an unconstrained approach:

  • Encourages other communities to benefit from the standard.
  • Provides higher-level elements for interoperating different, but related, standards.

Q: Are there other advantages for each approach?

Q: What positive and negative impacts might the unconstrained approach have on the standards?

Application profiles or OWL ontologies

The RDF representations of the FR models use OWL to express richer aspects of their classes and properties. ISBD is developing an application profile to represent the structure of records based on its RDF classes and properties. RDA may also develop one or more application profiles. The LLD Incubator Group is investigating the use of application profiles for subject authority data.

The DCMI RDA Task Group is developing a method for representing aggregated statements (pre-coordinated groupings of elements) in RDA; see RDA vocabularies: process, outcome, use for further information. ISBD hopes to take a similar approach.

There is clearly a need for higher-level frameworks to support the RDF representation of library models, applications and other standards, but there does not seem to be a consensus on when application profiles or ontologies should be used, or, indeed, whether there is any fundamental difference.

Organizations developing and maintaining library standards need advice on what appear to be complex technical issues.

Q: How can clearer advice and recommendations of best practice be given to library organizations?

Use of published properties and classes

None of the RDF representations described above use classes or properties from the namespaces of other communities (although FRAD is investigating the potential use of FOAF properties for parent, child and sibling relationships). Reasons include:

  • Definitions of non-local elements are too broad.
  • Constraints on non-local elements are too loose.
  • Uncertainty about the stability of other namespaces.
  • Uncertainty about the impact of changes in other namespaces.
  • Fear of loss of control.

Q: What can be done to encourage re-use of published properties and classes?

As discussed in the Mappings and alignments section above, it is desirable to review and cross-link the namespaces used in the library domain. It is also desirable to cross-link library domain namespaces with other namespaces, and especially the high-level classes and properties of DC, FOAF, SKOS, OWL, etc., to weave the Semantic Web.

Q: How can this be encouraged and achieved? Is there a framework for inter-community collaboration that library communities can use?

Granularity of library metadata

Levels of granularity

Bibliographic resources can be described at multiple levels of granularity (level numbers are for internal reference only):

  • International collection (Europeana ...) Level 1
    • National collection (national bibliographies ...) Level 2
      • Regional collection (state or regional consortia ...) Level 3
        • Institutional collection (library service ...) Level 4
          • Special and working collection (rare books, short loan items ...) Level 5
            • Manifestation/item (book ...) Level 6
              • Physical unit (one of multi-volume manifestation ...) Level 7
                • Intellectual unit (poem in anthology ...) Level 8
                  • Component (chapter in book ...) Level 9
                    • Sub-component (paragraph in chapter in book ...) Level 10

Levels 1-5 are generally the realm of collection-level description.

Levels 6-7 are the realm of item-level description. Level 6 is often split in practice into separate manifestation and item levels.

Level 8 is the realm of analytic description.

Levels 9-10 are the realm of citation description.

The generic relationship between levels is has-part/is-part-of.

Inheritance

Generally, lower levels inherit some metadata from higher levels. For example, a citation record (levels 9-10) will usually include appropriate metadata from levels 6-8.

One significant inheritance occurs between levels 5 and 6. The user task Obtain (from Functional Requirements for Bibliographic Records) is primarily supported by level 6 (Manifestation/item) metadata, but this rarely includes information about the location and opening hours of the building where a physical item is stored. This metadata is considered to belong to collection-level description levels 4-5. The Obtain task therefore requires metadata from levels 4-6, and is not adequately met in the absence of suitable collection-level metadata.
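
The inheritance needed for the Obtain task can be sketched in Python as a walk up the is-part-of chain. This is an illustration only: the Description class, the gather function, and all field names and values are hypothetical, and the rule that a more specific level overrides an inherited value is an assumption rather than something stated by the standards.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Description:
    level: int  # granularity level number from the list above
    label: str
    metadata: dict = field(default_factory=dict)
    part_of: Optional["Description"] = None  # is-part-of link to the next higher level


def gather(description, highest_level):
    """Merge metadata from `description` up the is-part-of chain.

    Levels numbered below `highest_level` are ignored. A value set at a
    more specific level takes precedence over an inherited one (assumption).
    """
    merged = {}
    node = description
    while node is not None and node.level >= highest_level:
        for key, value in node.metadata.items():
            merged.setdefault(key, value)
        node = node.part_of
    return merged


# Made-up sample descriptions at levels 4, 5, and 6.
library = Description(4, "Institutional collection",
                      {"location": "Main library building",
                       "opening_hours": "Mon-Fri 09:00-17:00"})
short_loan = Description(5, "Short loan collection",
                         {"loan_period": "4 hours"}, part_of=library)
book = Description(6, "Book",
                   {"title": "An example title", "shelfmark": "X 123"},
                   part_of=short_loan)

obtain_metadata = gather(book, highest_level=4)
item_only = gather(book, highest_level=6)
```

Here obtain_metadata carries the location and opening hours from level 4 alongside the level 6 title and shelfmark, while item_only lacks them, which mirrors the gap described above when collection-level metadata is absent.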

Mixed granularity

Most library catalogues contain metadata records for levels 6-7.

Many add records for specific resources at level 8, for example essays, poems, and short stories by an author of interest, although most metadata at this level is supplied by external sources such as abstract and indexing services and is usually not integrated with the main catalogue.

Some libraries include metadata for level 5 special collections, particularly if there are large numbers of such collections with little or no level 6-7 metadata, as is often the case in national and large academic libraries. Metadata from the levels above level 5 (levels 1-4) is usually not inherited or included in library catalogues.

Interoperability

Some bibliographic metadata attributes from different levels have similar semantics. For example, the "collector" attribute is used across levels 1-5. It has the value of the agent responsible for aggregating the members of the collection being described at that level, such as "Europeana Foundation", "National Library of Scotland", "Hochschulbibliothekszentrum", etc. It is semantically equivalent to the "creator" attribute used across levels 6-8: a bibliographic collection is a work created through an act of aggregation by an agent.

Library linked-data and legacy records

There are hundreds of millions of legacy records containing high-quality library metadata. See Initiatives to make standard library metadata models and structures available to the Semantic Web for further discussion (and further information about some of the topics discussed on this page).

Intellectual property right

The issue of who owns legacy catalogue records is complex. There has been a year-on-year increase in the sharing of records since the inception of MARC in the 1960s. The most-shared records are those created by national cataloguing agencies such as the Library of Congress in the USA and the British Library in the UK (as well as national libraries in many other countries). ISBD, UNIMARC, MARC21, and other standards are intended to improve compatibility, encoding and transmission of records between cataloguing agencies. But copies of such records are frequently modified by libraries for use in local catalogues by local users, and subsequently re-aggregated into national and international catalogues. Assigning copyright between relevant agents and agencies is difficult, and has not been definitively tested in law.

IPR is a barrier to creating linked-data from such records:

  • Libraries are extremely cautious about legal matters; they have to deal with censorship, data privacy, and data protection on a daily basis.
  • Larger national and international agencies put a value on past, present and future investment in creating, maintaining, and collecting records.
  • Larger national and international agencies regard records as assets in business plans.

Q: How can libraries be encouraged to "free" their records?

Technical infrastructure

Nearly all current library catalogue systems are record-based; the record is the unit to be processed (stored, indexed, displayed, transmitted, etc.). Most systems use the semi-relational architecture of linked bibliographic and authority records given as Scenario 2 of the RDA database implementation scenarios rather than the relational or object-oriented approach of Scenario 1 which mirrors the FRBR and FRAD models. Scenario 1 itself remains record-based, albeit with the single bibliographic record replaced by FRBR's four-part Work/Expression/Manifestation/Item record.

The vendors of library management systems operate on relatively thin profit margins, and their customers (libraries) generally have shrinking budgets and are cautious and conservative. Vendors usually take a just-in-time approach to development (often only just within the customer's tolerance) rather than a just-in-case approach. Moving to RDA's Scenario 1 is fraught with risk: will RDA be adopted by enough customers who are able to upgrade their systems?

Q: Will a linked-data approach be adopted by enough libraries to justify investment in developing statement/triple-based systems?

Q: Will libraries be able to afford such systems?

Duplication and identifiers

As noted in the Intellectual property right section above, multiple copies of catalogue records with slight or significant differences exist in national and international aggregations. It is extremely difficult to de-duplicate them to produce a single consolidated record. Global identifiers, such as Library of Congress control numbers, are attached to only a minority of records. Algorithms based on matching metadata content in one or more fields are generally inefficient because of variations which appear insignificant to human consumers.
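
The difficulty with content-matching can be illustrated with a minimal Python sketch. The match_key function, the choice of fields, and the three sample records are all hypothetical; real de-duplication services use far more elaborate rules.

```python
import re
import unicodedata


def match_key(record):
    """Build a crude comparison key from the title, author, and year fields."""
    parts = []
    for name in ("title", "author", "year"):
        text = unicodedata.normalize("NFKD", record.get(name, ""))
        # Drop combining marks (accents), then fold case and strip punctuation.
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
        text = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
        parts.append(text)
    return "|".join(parts)


# Three made-up copies of what a human reader would call the same record.
a = {"title": "The Origin of Species.", "author": "Darwin, Charles", "year": "1859"}
b = {"title": "The origin of species", "author": "DARWIN, Charles.", "year": "1859"}
c = {"title": "Origin of species", "author": "Darwin, C.", "year": "1859"}

same_ab = match_key(a) == match_key(b)
same_ac = match_key(a) == match_key(c)
```

Normalising case and punctuation is enough to match records a and b, but record c, with a dropped article and an abbreviated forename, still produces a different key. Each further rule added to catch such variations increases the risk of merging records that are genuinely different.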

Q: What impact will this duplication have on linked-data derived from large aggregated catalogues?

Q: How can libraries assign and use URIs for records to disaggregate them into instance triples?
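
As a starting point for discussing the second question, the following Python sketch shows one naive way to disaggregate a record into instance triples. The example.org base URI, the URI patterns, and the field names are hypothetical. The sketch also uses a single URI for both the record and the resource it describes, which is a simplification that a real service would need to revisit.

```python
BASE = "http://example.org/"  # hypothetical namespace


def record_to_triples(record_id, fields):
    """Turn one record into a list of (subject, predicate, object) triples.

    Every field becomes a statement about the same subject URI; a field
    with several values yields one triple per value.
    """
    subject = BASE + "record/" + record_id
    triples = []
    for name, value in fields.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            triples.append((subject, BASE + "element/" + name, v))
    return triples


triples = record_to_triples(
    "r1",
    {"title": "An example title", "subject": ["Topic A", "Topic B"]},
)
```

If two duplicate records are minted with different URIs in this way, their triples remain unconnected until an equivalence between the two subjects is asserted, which is where the duplication issue above resurfaces.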

Service infrastructure

There are linked-data applications that can be shown to libraries to illustrate the utility of the Semantic Web. But few, if any, go beyond demonstrators and pilots to demonstrate improvement in the efficiency, effectiveness, and trustworthiness of current bibliographic information retrieval and resource discovery services.

Q: Where are the linked-data applications that justify libraries investing in a service infrastructure based on linked-data?

Library culture

Librarians, and cataloguers in particular, tend to be cautious and conservative; they have a long-term view (40-year-old MARC records are still relevant in many catalogues) requiring a high-quality, rules-based approach to their work (rules attempt to ensure that two different cataloguers will describe an information object in the same way). Most cataloguers can cite case after case where low-quality approaches to metadata based on reduced costs and simplified schemas have failed. Cataloguers take intellectual pride in their work; they act as intermediaries between the minds of those who create information and those who consume it. This culture can foster a mine-is-better-than-yours attitude towards other cataloguers; it can be a high point of the day when a mistake is found in a record created by another agency, especially if that agency is national or international. This is in conflict with the general culture of sharing prevalent in cataloguing communities.

Perceived threats to quality (and jobs) from linked-data include:

  • User-generated metadata.
  • Machine-generated metadata, including inferred triples.
  • Loss of control of what metadata is displayed.

Q: How can these fears be addressed?

Library domain as Semantic Web

FRSAD models the "aboutness" of FRBR Works: what is a specific work about? what is its subject? Instance triples are statements about something (the subject of the triple). Is an instance triple a FRBR Work? (Yes, in the sense that it is an item exemplifying a manifestation embodying an expression realizing a work.) So can the FR models assist with the reification of instance triples to model attribution and provenance?
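
For readers unfamiliar with reification, the following Python sketch shows the standard RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object) applied to one instance triple, with a Dublin Core creator statement added for attribution. The example.org URIs are hypothetical, and the sketch does not attempt to express the FR models themselves.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCT = "http://purl.org/dc/terms/"
EX = "http://example.org/"  # hypothetical namespace


def reify(statement_uri, triple, creator):
    """Describe a triple as an rdf:Statement and attribute it to a creator."""
    s, p, o = triple
    return [
        (statement_uri, RDF + "type", RDF + "Statement"),
        (statement_uri, RDF + "subject", s),
        (statement_uri, RDF + "predicate", p),
        (statement_uri, RDF + "object", o),
        # Attribution: the agent responsible for making the statement.
        (statement_uri, DCT + "creator", creator),
    ]


reified = reify(
    EX + "statement/1",
    (EX + "record/r1", EX + "element/title", "An example title"),
    EX + "agency/a",
)
```

One instance triple becomes five triples here, so any use of the FR models for attribution and provenance at this level would need to weigh that overhead.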

The standards discussed on this page are not confined to the library domain. FRBR is aligned with museum domain models via CIDOC CRM. RDA is designed for use in archive and museum catalogues as well as libraries. Digital surrogates of archive and museum items lose the context and uniqueness that characterises the specialised curation of items in those domains; the surrogates may be better curated using library methods.

Are there any "things" which cannot be considered to be archive, library or museum resources, or the subjects of those resources? Is the library domain in some way co-extensive with the Semantic Web?

Q: What is the difference between the library domain represented by RDF and the Semantic Web, if any?