Draft issues page


See also Draft_recommendations_page


Implementation challenges and barriers to adoption

Designed for stability, the library ecosystem resists change

As stable and reliable archives with long-term goals, cultural heritage organizations -- particularly libraries -- are predisposed to traditionalism and conservation. This emphasis on larger goals has led libraries to fall out of step with the faster-moving technology culture of the past few decades. When most information was in print format, libraries were at the forefront of information organization and retrieval. With the introduction of machine-readable catalogs in the 1960s, libraries were early adopters of the computer, though primarily for automating the production of printed catalogs of print materials. As the volume of information in digital format has overtaken print, libraries have struggled both to maintain their function as long-term archives and to extend their missions to include digital information. Decreased budgets for libraries and their parent institutions have greatly hindered libraries' ability to create competitive information services.

Cooperative metadata creation is economical but creates barriers to change

Libraries take advantage of cooperative agreements allowing them to share resources, as well as metadata describing those resources. These cooperative efforts are both a strength and a weakness: while shared data creation has economic benefit, changes to shared data require coordination among the sharing parties.

Consequently, major changes require a strong agent to coordinate the effort. In most countries, the national library provides this type of leadership. Changes that transcend the borders of any single country -- such as adopting data standards like FRBR or moving to linked library data -- require a broad leadership that can take into account the many local needs of the international library community.

Library Data is shareable among libraries, but not yet with the wider world

Linked Data reaches a diverse community far broader than the library community; moving to library Linked Data requires libraries to understand and interact with the entire information community. Much of this information community has emerged around the capabilities provided by new technologies. The library community has not fully engaged with these new information communities, yet the success of Linked Data will require libraries to interact with them as fully as they interact with other libraries today. This will be a huge cultural change that must be addressed.

Libraries are understaffed in the technology area

As libraries have not kept pace with technological change, they also have not provided sufficient educational opportunities for staff. Training within libraries is limited in some countries, and workers are not encouraged to seek training on their own. Technological changes have taken place so quickly that many in library positions today began their careers long before the World Wide Web was a reality, and these workers may not fully understand the import of these changes. Libraries struggle to maintain their technological presence and are often under-staffed in key areas of technology.

  • In-house software developers. An informal survey of Code4Lib participants suggests that there are few software developers in libraries. Although the developers are embedded in library operations, coding is often a small part of their duties. Staff developers tend to be closely bound to working with systems from off-the-shelf software providers. These developers are for the most part maintaining existing systems and do not have much time to explore new technology paradigms and new software systems. They are dependent on a shrinking number of off-the-shelf providers as market players have consolidated over the past two decades (see Marshall Breeding's History of Library Automation).
  • Library workers. Software development skills, including metadata modeling, have often not been a strong part of a library worker's education. Libraries have in essence out-sourced their technology development to a few organizations in the community and to the library systems vendors. These vendors understand library functionality and data, but they need an expectation of development-costs recovery before beginning work on new products.
  • Library leaders. There are many individual Linked Data projects coming out of libraries and related institutions, but no obvious emerging leaders. IFLA has been a thought-leader in this area, but there is still a need to use their work to provide functional systems and software. Many national libraries have an interest in exploring LLD and some have ongoing projects. LLD will be international in scope, and this increases the amount of coordination that will be needed. Because of its strong community ties, however, leadership from within can be expected to have a dramatic effect on the community's ability to move in the direction of Linked Data.

Library technology has largely been implemented by a small set of vendors

Much of the technical expertise in the library community is concentrated in the small number of vendors who provide the systems and software that run library management functions as well as the user discovery service. These vendor systems hold the bibliographic data integrated into library management functions like acquisitions, receipt of materials, user data, and circulation. Other technical expertise exists primarily in large academic libraries where development of independent discovery systems for local materials is not uncommon. These latter systems are more likely to use mainstream technologies for data creation and management, but they do not represent the primary holdings of the library.

Libraries do not adapt well to technological change

Technology has continually evolved since computers were first used for library operations in the 1960s. However, the library community tends to engage only with established technologies that have brought proven benefits to their operations and services. The Linked Data approach is relatively new, with enabling technologies and best practices being developed outside of mainstream library applications. Experimentation with Linked Data in the library community has been limited in part due to lack of developer tools for LD in general but also because there are no tools that specifically address library data. It can be difficult to demonstrate the value of LD to librarians because the few examples of implementations that do exist use unfamiliar data and interfaces.

The long-term view by libraries applies also to standards

While both library and Web communities value preservation and endurance (or permanence) of information, the timescales differ: library preservation is measured in generations and centuries (if not millennia) while Web-native information might be considered old at two decades. Ensuring this long-term life of information promotes a conservative outlook for library organizations, in contrast to the mainstream perspective of the Web community, which values novelty and experimentation over preservation of the past.

Therefore it is not surprising that the library standardization process is slower than comparable Web standards development. Current developments towards a new metadata environment can be traced back more than ten years: The basic groundwork for a shift to a new data format was laid in 1998 with the development of the Functional Requirements for Bibliographic Records (FRBR), which provides an entity-relation view of library catalog data. That model is the basis for a new set of cataloguing rules, Resource Description and Access (RDA), which, although finalized in 2010, is still under review before implementation. RDA is a standard of four Anglo-American library communities, and has not had international acceptance, although it is being studied widely. LLD standards associated with RDA are still in the process of development. Through a joint working group with DCMI, the Joint Steering Committee for RDA approved an RDF implementation of the properties and value vocabularies of RDA. These have not yet been moved to production status and are not integrated with the primary documentation and cataloguer tools in the RDAToolkit.

Library standardization process is cumbersome

A further difference is that Web-related organizations focus on implementations, often hammering out differences with practical examples, and leaving edge cases for later work. This is in contrast to the library standardization approach: Standards such as FRBR and RDA have been created as documents, without the use of test cases, prototype implementations, and iterative development methodologies that characterize modern IT approaches. Library standards have a strong "top-down" direction, and major standards efforts are undertaken by national or international bodies. Development of an international standard takes years, and that development cannot keep up with the increasingly fast pace of technological change. Development cycles are often locked into face-to-face business meetings of the parent organization or group to comply with formal approval procedures. As a result, standards may be technologically out-of-date as soon as they are published.

Bottom-up standards can be successful but garner little recognition

While bottom-up development is common on the Web for all but the largest and most-used standards (e.g. HTML5), it often does not get proper recognition from the library community. Even so, some bottom-up initiatives have led to successful standards adopted by the library community, including OpenURL, METS, OAI, and Dublin Core. LLD will require funding and will need institutional support (though it isn't clear where funding and support will come from), but it will also require an environment where the bottom-up developers can flourish.

Library standards are limited to the library data

While the Web values global interchange between all parties, library cataloguing standards in the past have aimed to address only the exchange of data within the library community; the need to think of broader bibliographic data exchange (e.g. with publishers) is new and not universally accepted. There is fear that library data will need to be "dumbed down" in order to interact with other communities; few see the possibility of "smarting up" bibliographic data using library-produced information.

ROI is difficult to calculate

Some cost issues are known but are unmeasured

It is admittedly difficult to calculate or estimate costs and benefits in a publicly funded service environment. This makes it particularly difficult to create concrete justifications for large-scale changes of the magnitude required for adopting Linked Data in libraries. While there is a general recognition of distinct disadvantages to siloed library data practices, no measurement exists that would compare the resources required to create and manage current library data with those required for linked library data. (Note: there are some studies on the cost of cataloging, but they do not separately study costs related to data technology: Library of Congress Study of the North American MARC Records Marketplace, R2 Consulting LLC, Ruth Fischer, Rick Lugg, October 2009, and Implications of MARC Tag Usage on Library Metadata Practices, OCLC, March 2010.)

"MARC data cannot continue to exist in its own discrete environment, separate from the rest of the information universe. It will need to be leveraged and used in other domains to reach users in their own networked environments. The 200 or so MARC 21 fields in use must be mapped to simpler schema."

Smith-Yoshimura, et al., Implications of MARC Tag Usage on Library Metadata Practices. www.oclc.org/research/publications/library/2010/2010-06.pdf

Library-specific data formats require niche systems solutions

It is possible, however, to observe the consequences of library data practices. Libraries use data technology specific to libraries and library systems. They are therefore dependent on niche software systems tailored to formats that nobody uses outside of the library world. Because the formats used in libraries (notably MARC) are unique to libraries, vendors of library systems cannot use mainstream data modeling systems, programmer tools, and database software to build library systems. Development of library systems also requires personnel specifically trained in library data. This makes it expensive to provide systems for the library community. The common practice of commissioning a new, customized system in every library -- every 5 to 10 years -- is very expensive; the aggregate cost to the library community has not been reliably estimated.

Vocabulary changes in library data are costly

Controlled vocabularies will play an important role in linked data in general. Although controlled vocabularies are used in library data (in particular for names of persons and organizations, and for subjects), they are not managed in a manner that facilitates linked data: changes to vocabularies require that all related records be retrieved and changed. This is a disruptive process, made even more expensive because the library metadata record, designed primarily as a communication format, requires a full record replace for updates to any of its fields.
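The cost difference can be sketched in a few lines of Python. This is an illustrative simplification (all record data and the "sh:12345" identifier are invented): when a heading is stored as a literal string in every record, a vocabulary change means rewriting every affected record; when records point at an identifier, the label changes in one place.

```python
# String-based records: the subject label is copied into each record,
# so renaming a heading means retrieving and replacing every record.
records = [
    {"title": "Film history", "subject": "Moving pictures"},
    {"title": "Cinema studies", "subject": "Moving pictures"},
]
for rec in records:  # every affected record must be rewritten in full
    if rec["subject"] == "Moving pictures":
        rec["subject"] = "Motion pictures"

# Identifier-based data: records carry an opaque identifier and the label
# lives in one vocabulary entry, so the change touches a single value.
vocabulary = {"sh:12345": "Moving pictures"}
linked_records = [
    {"title": "Film history", "subject": "sh:12345"},
    {"title": "Cinema studies", "subject": "sh:12345"},
]
vocabulary["sh:12345"] = "Motion pictures"  # one update, no record rewrites
```

The point is not the data structures themselves but the update pattern: the string-based approach scales with the number of records, the identifier-based approach with the number of vocabulary terms.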

Data may have rights issues that prevent open publication

For a perspective from Europe, see Free library data? by Raymond Bérard.

Some data cannot be published openly

Data related to user identity and use of the library is protected by privacy policies and legislation. Other data, such as that related to purchasing and contracts, is not included in our analysis.

Rights ownership can be unmanageably complex

Some library bibliographic data has unclear and untested rights issues that can hinder the release of open data. Ownership of legacy catalogue records has been complicated by data sharing among libraries over the past 50 years. The records most-shared are those created by national cataloguing agencies such as the Library of Congress in the USA and the British Library in the UK. Records are frequently copied and the copies are modified or enhanced by local cataloguers. These records may be subsequently re-aggregated into the catalogues of regional, national, and international consortia. Assigning legally-sound intellectual property rights between relevant agents and agencies is difficult, and the lack of certainty is a hindrance to data sharing in a community which is necessarily extremely cautious on legal matters such as censorship and data privacy/protection.

Rights have perceived value

On the other hand, some bibliographic data may never have been shared with another party, so rights may be exclusively held by creating agencies, who put a value on past, present and future investment in creating, maintaining, and collecting metadata. Larger agencies are likely to treat records as assets in their business plans, and may be reluctant to publish them as open LD, or may be willing to release them only in a stripped- or dumbed-down form with loss of semantic detail. For example, data about specific types of title such as preferred title and parallel title might be output as a general title, losing the detail required for a formal citation of the resource.


Library data is expressed in library-specific formats that cannot be easily shared outside the library community

Library data is expressed primarily as text strings, not "linkable" URIs

Most information in library data is encoded as display-oriented text strings. There are a few shared identifiers for resources, such as ISBNs for books, but most identification is done with text strings. Some coded data fields are used in MARC records, but there is not a clear incentive to include these in all records, since most coded data fields are not used in library system functions. Some data fields, such as authority controlled names and subjects, do have their own associated records in separate files, which have identifiers that could be used to represent those entities in library metadata. However, the data formats currently used do not support the inclusion of these identifiers in existing library records and consequently neither do current library systems support their use.
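The gap between text strings and linkable identifiers can be illustrated with a hypothetical lookup (the heading, field tag, and URI below are placeholders, not real authority data): even when an authority file holds an identifier for a name, the record format only carries the display string, so linking requires an external match on the string itself.

```python
# An authority file mapping a controlled heading string to an identifier.
# The URI is an invented example, not a real authority identifier.
authority_file = {
    "Twain, Mark, 1835-1910": "http://example.org/authorities/n0001",
}

# A MARC-like field stores only the display-oriented text string.
marc_like_field = {"tag": "100", "value": "Twain, Mark, 1835-1910"}

# The identifier is reachable only via string matching against the
# authority file; the record format has no slot for the URI itself.
creator_uri = authority_file.get(marc_like_field["value"])
```

Any variation in the string (punctuation, spacing, a date added later) breaks the match, which is why string-keyed linking is fragile compared to carrying the identifier in the record.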

Some library data is being expressed in RDF on an experimental basis, but without standardization or best practices

Work has begun to express library data in RDF. Some libraries have experimented with publishing LD from their catalogue records although no standard or best practice has yet emerged. There has been progress in defining value vocabularies currently used in libraries. Transformation of legacy data will require more than the mapping of attributes to RDF properties; where possible, library data should be transformed from text to data with identified values. New approaches for library data, such as the FRBR model which informs RDA, offer an opportunity for incorporating linked data principles into future library data practices, particularly when these new standards are implemented.

The library community and the Semantic Web community have no shared terminology for metadata concepts

Work on LLD can be hampered by the disparity in concepts and terminology between libraries and the Semantic Web community. Few in libraries would use a term like "statement" for metadata, and the Web community does not have concepts equivalent to libraries' "headings" or "authority control." Each community has its own vocabulary and these reflect the differences in their points of view. Mutual understanding must be fostered as both groups bring important expertise to the potential web of data.

Library data must be conceptualized according to the Graph Paradigm

Translators of legacy library standards into Linked Data must recognize that Semantic Web technologies are not merely variants of practices but represent a fundamentally different way to conceptualize and interpret data. Since the introduction of MARC formats in the 1960s, digital data in libraries has been managed predominantly in the form of "records" -- bounded sets of information described in documents with a precisely specified structure -- in accordance with what may be called a Record Paradigm. The Semantic Web and Linked Data, in contrast, are based on a Graph Paradigm. In graphs, information is conceptualized as a boundless "web" of links between resources -- in visual terms as sets of nodes connected by arcs (or "edges"), and in semantic terms as sets of "statements" consisting of subjects and objects connected by predicates. The three-part statements of Linked Data, or "triples", are expressed in the language of the Resource Description Framework (RDF). In the Graph Paradigm, the "statement" is an atomic unit of meaning that stands on its own and can be combined with statements from many different sources to create new graphs -- a notion ideally suited for the task of integrating information from multiple sources into recombinant graphs.
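A minimal sketch of this recombinant property, using plain tuples as triples (all names and URIs are invented placeholders): because each statement stands on its own, merging two graphs is a simple set union, with no record boundary to reconcile.

```python
# Each statement is a self-contained (subject, predicate, object) triple.
library_graph = {
    ("ex:book1", "dc:title", "Moby Dick"),
    ("ex:book1", "dc:creator", "ex:melville"),
}

# Statements from a second, independent source about the same resources.
external_graph = {
    ("ex:melville", "foaf:name", "Herman Melville"),
    ("ex:book1", "dc:subject", "ex:whaling"),
}

# Graphs recombine by set union: statements from different sources
# simply accumulate into a new graph.
merged = library_graph | external_graph
```

Contrast this with the Record Paradigm, where the same merge would require deciding which record "owns" each field and reconciling two fixed document structures.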

Under the Record Paradigm, a data architect can specify with precision the form and expected content of a data record, which can be "validated" for completeness and accuracy. Data sharing among libraries has been based largely on the standardization of fixed record formats, and the consistency of that data has been ensured by adherence to well-defined content rules. Under the Graph Paradigm, in contrast, data is conceptualized according to significantly different assumptions. According to the so-called "open-world assumption", any data at hand may, in principle, be incomplete. It is assumed that data may be supplemented by incorporating information from other, possibly unanticipated, sources, and that information can be added without invalidating information already present.

The notion of "constraints" takes on significantly different meanings under these two paradigms. Under the Record Paradigm, if the format schema for a metadata record says that the description of a book can have only one subject heading and a description with two subject headings is encountered, a validator will report an error in the record. Under the Graph Paradigm, if an OWL ontology says that a book has only one subject heading, and a description with two subject headings (URIs) is encountered, an OWL reasoner will infer that the two subject-heading URIs identify the same subject.
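The contrasting behaviors can be sketched as two small functions (a deliberate simplification with invented data; a real OWL reasoner operates on ontology axioms, not Python dicts): the closed-world validator rejects the extra value, while the open-world inference treats the property as functional and concludes the two URIs co-refer.

```python
# A description with two subject-heading URIs (placeholder values).
description = {"title": "Example", "subjects": ["ex:subjA", "ex:subjB"]}

# Record Paradigm: a closed-world validator flags the second heading
# as an error against the record format's cardinality constraint.
def validate(record, max_subjects=1):
    return len(record["subjects"]) <= max_subjects

# Graph Paradigm: treating the property as functional, a reasoner
# infers that the two URIs identify the same subject instead of failing.
def infer_same_as(record):
    subjects = record["subjects"]
    return [(a, "owl:sameAs", b) for a, b in zip(subjects, subjects[1:])]
```

Run against the same description, `validate` reports a violation while `infer_same_as` produces a new statement equating the two URIs, which is exactly the divergence described above.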

As will be discussed below, the two paradigms may be seen as complementary. The traditional "closed-world" approach is good for flagging data that is inconsistent with the structure of a metadata record as a document, while OWL ontologies are good for flagging logical inconsistencies with respect to a conceptualization of things in the world. The differences between these two approaches mean that the process of translating library standards and datasets into Linked Data cannot be undertaken mechanically, but requires intellectual effort and modeling skill. Translators, in other words, must acquire some fluency in the language of RDF.