Difference between revisions of "Draft issues page"

From Library Linked Data
(Undo revision 5109 by Kcoyle (Talk))
(creating diff for Ray Denenberg's review http://www.loc.gov/standards/sru/w3clld/ , see followup email)
Line 4: Line 4:
 
== Designed for stability, the library ecosystem resists change ==
 
== Designed for stability, the library ecosystem resists change ==
  
As stable and reliable archives with long-term goals, cultural heritage organizations -- particularly libraries -- are predisposed to conservatism.  This emphasis on larger goals has led libraries to fall out of step with the faster-moving technology culture of the past few decades. When most information was in print format, libraries were at the forefront of information organization and retrieval.  With the introduction of machine-readable catalogs in the 1960s, libraries were early adopters of the computer, though primarily for automating the production of printed catalogs of print materials.  As the role of digital technology evolved from a support for print into the native format for information, eventually overtaking print in volume and centrality, libraries have struggled both to continue their function as long-term archives and to extend their missions to include digital information. Decreasing budgets for libraries and their parent institutions have greatly hindered libraries' ability to create competitive information services.
+
As stable and reliable archives with long-term goals, cultural heritage organizations -- particularly libraries -- are predisposed to traditionalism and conservation.  This emphasis on larger goals has led libraries to fall out of step with the faster-moving technology culture of the past few decades. When most information was in print format, libraries were at the forefront of information organization and retrieval.  With the introduction of machine-readable catalogs in the 1960s, libraries were early adopters of the computer, though primarily for automating the production of printed catalogs of print materials.  As the volume of information in digital format has overtaken print, libraries have struggled both to maintain their function as long-term archives and to extend their missions to include digital information. Decreased budgets for libraries and their parent institutions have greatly hindered libraries' ability to create competitive information services.
  
 
=== Cooperative metadata creation is economical but creates barriers to change ===
 
=== Cooperative metadata creation is economical but creates barriers to change ===
  
Libraries take advantage of cooperative agreements that allow them to share resources, including sharing metadata used to describe those resources. The agreements help offset budget problems. These cooperative efforts are both a strength and a weakness: while shared data creation is efficient, changes in areas where libraries share data are complicated because the entire community must make those changes in a coordinated fashion.  
+
Libraries take advantage of cooperative agreements allowing them to share resources, as well as metadata describing those resources. These cooperative efforts are both a strength and a weakness: while shared data creation has economic benefit, changes to shared data require coordination among the sharing parties.
  
 
Consequently, major changes require a strong agent to coordinate the effort. In most countries, the national library provides this type of leadership. Changes that transcend the borders of any single country -- such as adopting data standards like FRBR or moving to linked library data -- require a broad leadership that can take into account the many local needs of the international library community.
 
Consequently, major changes require a strong agent to coordinate the effort. In most countries, the national library provides this type of leadership. Changes that transcend the borders of any single country -- such as adopting data standards like FRBR or moving to linked library data -- require a broad leadership that can take into account the many local needs of the international library community.
  
=== Data is intended for sharing among libraries, not with the wider world ===
+
=== Library Data is shareable among libraries, but not yet with the wider world ===
  
Linked Data is not only global, but also reaches a diverse community far broader than the library community: thus moving to library Linked Data requires libraries to understand and interact with the entire information community. Much of this information community has been engendered by the capabilities provided by new technologies. The library community has not fully engaged with these new information communities, yet the success of Linked Data will require libraries to interact with them as integrally as they interact with other libraries today. This is a huge cultural change that must be addressed.
+
Linked Data reaches a diverse community far broader than the library community; moving to library Linked Data requires libraries to understand and interact with the entire information community. Much of this information community has been engendered by the capabilities provided by new technologies. The library community has not fully engaged with these new information communities, yet the success of Linked Data will require libraries to interact with them as fully as they interact with other libraries today. This will be a huge cultural change that must be addressed.
  
 
=== Libraries are understaffed in the technology area ===
 
=== Libraries are understaffed in the technology area ===
  
Because of their lack of emphasis on keeping pace with technological change, libraries have not had a sufficient interest in providing educational opportunities for staff.  Training within libraries is limited in some countries, and workers are not given encouragement to seek training on their own. Technological changes have taken place so quickly that many in library positions today began their careers long before the World Wide Web was a reality, and these workers may not fully understand the import of the changes that have taken place in the information environment in general.
+
As libraries have not kept pace with technological change, they also have not provided sufficient educational opportunities for staff.  Training within libraries is limited in some countries, and workers are not encouraged to seek training on their own. Technological changes have taken place so quickly that many in library positions today began their careers long before the World Wide Web was a reality, and these workers may not fully understand the import of these changes. Libraries struggle to maintain their technological presence and are often under-staffed in key areas of technology.   
Libraries struggle to maintain their technological presence and are often minimally staffed in key areas of technology.   
+
* '''In-house software developers'''. An [http://go-to-hellman.blogspot.com/2011/02/why-code4-libraries-exist.html informal survey] of [http://www.code4lib.org/ Code4Lib] participants suggests that there are few software developers in libraries.  Although the developers are embedded in library operations, coding is often a small part of their duties. Staff developers tend to be closely bound to working with systems from off-the-shelf software providers.  These developers are for the most part maintaining existing systems and do not have much time to explore new technology paradigms and new software systems.  They are dependent on a shrinking number of off-the-shelf providers as market players have consolidated over the past two decades (see Marshall Breeding's [http://www.librarytechnology.org/automationhistory.pl History of Library Automation]).
* '''In-house software developers'''. An [http://go-to-hellman.blogspot.com/2011/02/why-code4-libraries-exist.html informal survey] of [http://www.code4lib.org/ Code4Lib] participants suggests that there are few software developers in libraries overall.  Although the developers are embedded in library operations, coding is often a small part of their actual duties. Staff developers tend to be closely bound to working with systems from off-the-shelf software providers.  These developers are working as best as they are able to maintain existing systems and do not have a lot of time to devote to exploring new technology paradigms and new software systems.  They are dependent on a shrinking number of off-the-shelf providers as market players have consolidated over the past two decades (see Marshall Breeding's [http://www.librarytechnology.org/automationhistory.pl History of Library Automation]).
+
* '''Library workers'''. Software development skills, including metadata modeling, have often not been a strong part of a library worker's education. Libraries have in essence out-sourced their technology development to a few organizations in the community and to the library systems vendors. These vendors understand library functionality and data, but they need an expectation of development-costs recovery before beginning work on new products.
* '''Library workers'''. Software development skills, including metadata modeling, have often not been a strong part of a library worker's education. Libraries have in essence out-sourced their technology development to a few organizations in the community and to the library systems vendors. These vendors understand library functionality and library data, but need some expectation of recovering their development costs before they can begin work on new products.
+
* '''Library leaders'''. There are many individual Linked Data projects coming out of libraries and related institutions, but no obvious emerging leaders. IFLA has been a thought-leader in this area, but there is still a need to use their work to provide functional systems and software. Many national libraries have an interest in exploring LLD and some have ongoing projects. LLD will be international in scope, and this increases the amount of coordination that will be needed. Because of its strong community ties, however, leadership from within can be expected to have a dramatic effect on the community's ability to move in the direction of Linked Data.  ''"no obvious emerging leaders" still seems overstrong to me[[User:Jschneid4|Jodi Schneider]] 16:52, 8 April 2011 (UTC)''
* '''Library leaders'''. There are many individual Linked Data projects coming out of libraries and related institutions, but no obvious emerging leaders. IFLA has been a thought-leader in this area, but there is still a need to use their work to provide functional systems and software. Many national libraries have an interest in exploring LLD and some have ongoing projects in this area. LLD will be international in scope, and this increases the amount of coordination that will be needed. Because of its strong community ties, however, it can be expected that leadership from within could have a dramatic effect on the community's ability to move in the direction of Linked Data.  ''"no obvious emerging leaders" still seems overstrong to me[[User:Jschneid4|Jodi Schneider]] 16:52, 8 April 2011 (UTC)''
+
  
 
=== Library technology has been driven largely by commercial vendors ===
 
=== Library technology has been driven largely by commercial vendors ===
  
Much of the technical expertise in the library community is concentrated in the small number of vendors who provide the systems and software that run library management functions as well as the user discovery service. These vendor systems hold the bibliographic data that is integrated into library management functions like acquisitions, receipt of materials, user data, and circulation. Other technical expertise exists primarily in large academic libraries where development of independent discovery systems for local materials is not uncommon. These latter systems are more likely to use mainstream technologies for data creation and management, but they do not represent the primary holdings of the library.
+
Much of the technical expertise in the library community is concentrated in the small number of vendors who provide the systems and software that run library management functions as well as the user discovery service. These vendor systems hold the bibliographic data integrated into library management functions like acquisitions, receipt of materials, user data, and circulation. Other technical expertise exists primarily in large academic libraries where development of independent discovery systems for local materials is not uncommon. These latter systems are more likely to use mainstream technologies for data creation and management, but they do not represent the primary holdings of the library.
  
== Libraries are ill-adapted to continual technological change ==
+
== Libraries do not adapt well to technological change ==
  
The library community operates in a technological environment that has continually evolved since computers were first used for library operations in the 1960s. This community tends to engage only with established standards and technologies that have at some point brought proven benefits to their operations and services. The Linked Data approach is relatively new, with enabling technologies and best practices being developed outside of mainstream library applications.  Experimentation with Linked Data in the library community has been limited in part because the number of developer tools for LD in general is relatively small, but also because there are no tools that specifically address library data. It can be difficult to demonstrate the value of LD to librarians because the few examples of implementations use unfamiliar data and interfaces.
+
Technology has continually evolved since computers were first used for library operations in the 1960s. However, the library community tends to engage only with established technologies that have brought proven benefits to their operations and services. The Linked Data approach is relatively new, with enabling technologies and best practices being developed outside of mainstream library applications.  Experimentation with Linked Data in the library community has been limited in part due to lack of developer tools for LD in general but also because there are no tools that specifically address library data. It can be difficult to demonstrate the value of LD to librarians because the few examples of implementations that do exist use unfamiliar data and interfaces.
  
=== The long-term view of libraries applies also to standards ===
+
=== The long-term view by libraries applies also to standards ===
  
While both library and Web communities value preservation and longevity of information, the timescales differ: library preservation is measured in generations and centuries (if not millennia) while Web-native information might be considered old at two decades. Ensuring this long-term continuation of information promotes a conservative outlook for library organizations, which is in contrast to the mainstream perspective of the Web community which values novelty and experimentation over preservation of the past.
+
While both library and Web communities value preservation and endurance (or permanence) of information, the timescales differ: library preservation is measured in generations and centuries (if not millennia) while Web-native information might be considered old at two decades. Ensuring this long-term life of information promotes a conservative outlook for library organizations, which is in contrast to the mainstream perspective of the Web community which values novelty and experimentation over preservation of the past.
  
Therefore it is not surprising that the library standardization process is slower than comparative Web standards development. Current developments towards a new metadata environment can be traced back more than ten years: The basic groundwork for a shift to a new data format was laid in 1998 with the development of the Functional Requirements for Bibliographic Records (FRBR) which provides an entity-relation view of library catalog data. That model is the basis for a new set of cataloguing rules, Resource Description and Access (RDA), which became final in 2010 but are still under review before implementation. RDA is a standard of four Anglo-American library communities, and has not had international acceptance, although it is being studied widely. LLD standards associated with RDA are still in the process of development. Through a joint working group with DCMI, the Joint Steering Committee for RDA approved an RDF implementation of the properties and value vocabularies of RDA. These have not yet been moved to production status and are not integrated with the primary documentation and cataloguer tools in the RDAToolkit.  
+
Therefore it is not surprising that the library standardization process is slower than comparable Web standards development. Current developments towards a new metadata environment can be traced back more than ten years: The basic groundwork for a shift to a new data format was laid in 1998 with the development of the Functional Requirements for Bibliographic Records (FRBR) which provides an entity-relation view of library catalog data. That model is the basis for a new set of cataloguing rules, Resource Description and Access (RDA), which, although they became final in 2010, are still under review before implementation. RDA is a standard of four Anglo-American library communities, and has not had international acceptance, although it is being studied widely. LLD standards associated with RDA are still in the process of development. Through a joint working group with DCMI, the Joint Steering Committee for RDA approved an RDF implementation of the properties and value vocabularies of RDA. These have not yet been moved to production status and are not integrated with the primary documentation and cataloguer tools in the RDAToolkit.
  
=== Library standards are not closely tied to experimentation and implementation ===
+
=== Library standardization process is cumbersome ===
  
A further difference is that Web-related organizations focus on implementations, often hammering out differences with practical examples, and leaving edge cases for later work (i.e. "rough consensus and running code"). This is in contrast to the library standardization approach: Standards such as FRBR and RDA have been created as documents, without the use of test cases, prototype implementations, and iterative development methodologies that characterize modern IT approaches. Library standards have a strong "top-down" direction, and major standards efforts are undertaken by national or international bodies. Standards take many years to be developed at the international level, and are not able to keep up with the increasingly fast pace of technical change. Development cycles are often locked into face-to-face business meetings of the parent organization or group to comply with formal approval procedures. As a result, standards may be out-of-date from a technical point of view as soon as they are published.
+
A further difference is that Web-related organizations focus on implementations, often hammering out differences with practical examples, and leaving edge cases for later work. This is in contrast to the library standardization approach: Standards such as FRBR and RDA have been created as documents, without the use of test cases, prototype implementations, and iterative development methodologies that characterize modern IT approaches. Library standards have a strong "top-down" direction, and major standards efforts are undertaken by national or international bodies. Development of an international standard takes years, and that development cannot keep up with the increasingly fast pace of technological change. Development cycles are often locked into face-to-face business meetings of the parent organization or group to comply with formal approval procedures. As a result, standards may be technologically out-of-date as soon as they are published.
  
 
=== Bottom-up standards can be successful but garner little recognition ===
 
=== Bottom-up standards can be successful but garner little recognition ===
  
While on the Web, bottom-up development is common for all but the largest and most-used standards (e.g. HTML5), bottom-up development often does not get direct recognition from the library community. Even so, some bottom-up initiatives have led to successful standards adopted by the library community, including OpenURL, METS, OAI, and Dublin Core. LLD will require funding and will need institutional support, yet it isn't clear where funding and support will come from, or where the bottom-up developers can flourish.
+
While on the Web, bottom-up development is common for all but the largest and most-used standards (e.g. HTML5), bottom-up development often does not get proper recognition from the library community. Even so, some bottom-up initiatives have led to successful standards adopted by the library community, including OpenURL, METS, OAI, and Dublin Core. LLD will require funding and will need institutional support (though it isn't clear where funding and support will come from) but it will also require an environment where the bottom-up developers can flourish.
  
=== Library standards are limited to metadata in the library data silo ===
+
=== Library standards are limited to the library data ===
  
While the Web values global interchange between all parties, library cataloguing standards in the past have aimed only to address the exchange of data within the library community. Within libraries, the need to think of a broader bibliographic data exchange (e.g. with publishers) is new and is not universally accepted. There is a fear that library data will need to be "dumbed down" in order to interact with other communities; few see the possibility of "smarting up" bibliographic data using library-produced information.
+
While the Web values global interchange between all parties, library cataloguing standards in the past have aimed to address only the exchange of data within the library community, where the need to think of broader bibliographic data exchange (e.g. with publishers) is new and not universally accepted. There is fear that library data will need to be "dumbed down" in order to interact with other communities; few see the possibility of "smarting up" bibliographic data using library-produced information.
  
== Cost of current library practices is unknown; ROI is difficult to calculate ==
+
== ROI is difficult to calculate ==
  
 
=== Some cost issues are known but are unmeasured ===
 
=== Some cost issues are known but are unmeasured ===
  
It is admittedly difficult to estimate costs and calculate benefits in a publicly funded service environment. This makes it particularly difficult to create concrete justifications for large-scale changes of the magnitude required for adopting Linked Data in libraries. While there is a general recognition that there are distinct disadvantages to the silo'd library data practices, no measurement exists that would compare the resources required to create and manage current library data compared to linked library data. (Note: there are some studies on the cost of cataloging, but they do not separately study costs related to data technology:  [http://www.loc.gov/bibliographic-future/news/MARC_Record_Marketplace_2009-10.pdf Library of Congress Study of the North American MARC Records Marketplace], R2 Consulting LLC, Ruth Fischer, Rick Lugg, October 2009
+
It is admittedly difficult to calculate or estimate costs and benefits in a publicly funded service environment. This makes it particularly difficult to create concrete justifications for large-scale changes of the magnitude required for adopting Linked Data in libraries. While there is a general recognition of distinct disadvantages to the silo'd library data practices, no measurement exists that would compare the resources required to create and manage current library data compared to linked library data. (Note: there are some studies on the cost of cataloging, but they do not separately study costs related to data technology:  [http://www.loc.gov/bibliographic-future/news/MARC_Record_Marketplace_2009-10.pdf Library of Congress Study of the North American MARC Records Marketplace], R2 Consulting LLC, Ruth Fischer, Rick Lugg, October 2009
 
) and [http://www.oclc.org/research/publications/library/2010/2010-06.pdf Implications of MARC Tag Usage on Library Metadata Practices], OCLC, March 2010.)
 
) and [http://www.oclc.org/research/publications/library/2010/2010-06.pdf Implications of MARC Tag Usage on Library Metadata Practices], OCLC, March 2010.)
 
<blockquote>
 
<blockquote>
Line 65: Line 64:
 
=== Library-specific data formats require niche systems solutions ===
 
=== Library-specific data formats require niche systems solutions ===
  
It is possible, however, to observe the consequences of the library data practices. Libraries use data technology that is specific to libraries and library systems. They are therefore dependent on niche software systems tailored to formats that nobody uses outside of the library world. Because the formats used in libraries (notably MARC) are unique to libraries, vendors of library systems cannot use mainstream data modeling systems, programmer tools, and database software to build library systems.  Development of library systems also requires personnel specifically trained in library data.  This makes it expensive to provide systems for the library community.  The common practice of commissioning a new, customized system in every library -- every 5 to 10 years -- is very expensive, though in the aggregate nobody has made reliable cost estimates for the library community.
+
It is possible, however, to observe the consequences of library data practices. Libraries use data technology specific to libraries and library systems. They are therefore dependent on niche software systems tailored to formats that nobody uses outside of the library world. Because the formats used in libraries (notably MARC) are unique to libraries, vendors of library systems cannot use mainstream data modeling systems, programmer tools, and database software to build library systems.  Development of library systems also requires personnel specifically trained in library data.  This makes it expensive to provide systems for the library community.  The common practice of commissioning a new, customized system in every library -- every 5 to 10 years -- is very expensive; the aggregate cost to the library community has not been reliably estimated.
  
=== Changes are costly due to full record replace and the need to modify display forms ===
+
=== Vocabulary changes in library data are costly ===
  
The library metadata record is primarily designed as a communication format and requires a full record replace for updates to any of its fields. This is very inefficient from a system point of view. It also encourages catalogers to review the entire record every time a change is made to a single field.
+
Controlled vocabularies will play an important role in linked data in general. Although controlled vocabularies are used in library data (in particular for names of persons and organizations, and for subjects), they are not managed in a manner that facilitates linked data: changes to vocabularies require that all related records be retrieved and changed. This is a disruptive process, made even more expensive because the library metadata record, being designed primarily as a communication format, requires a full record replace for updates to any of its fields.
 
+
Although there are controlled vocabularies in use in library data, in particular for names of persons and organizations, and for subjects, these are only linked to the bibliographic data using display strings. Changes to vocabularies require all related records to be retrieved and changed. Even when done algorithmically, this is a disruptive process.
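The cost difference described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `BibRecord`, the headings, and the `subj:` identifiers are hypothetical and far simpler than a real MARC record or authority file. It contrasts a heading change when records carry the display string with the same change when records carry an identifier for the vocabulary entry.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BibRecord:
    """Hypothetical, heavily simplified bibliographic record."""
    record_id: str
    title: str
    subject: str  # a display string or an authority identifier, depending on the model


def rename_heading_in_strings(records, old_heading, new_heading):
    """String-linked model: every record carrying the old heading is rebuilt."""
    updated, touched = [], 0
    for rec in records:
        if rec.subject == old_heading:
            # Stands in for a full record replace: a whole new record is written.
            updated.append(replace(rec, subject=new_heading))
            touched += 1
        else:
            updated.append(rec)
    return updated, touched


def rename_heading_in_authority(authority, subject_id, new_heading):
    """Identifier-linked model: only the vocabulary entry changes."""
    revised = dict(authority)
    revised[subject_id] = new_heading
    return revised


# Illustrative data only; headings and identifiers are made up.
string_linked = [
    BibRecord("b1", "First example title", "Old heading"),
    BibRecord("b2", "Second example title", "Old heading"),
    BibRecord("b3", "Third example title", "Another heading"),
]
updated_records, records_touched = rename_heading_in_strings(
    string_linked, "Old heading", "New heading"
)

authority = {"subj:1": "Old heading", "subj:2": "Another heading"}
id_linked = [
    BibRecord("b1", "First example title", "subj:1"),
    BibRecord("b2", "Second example title", "subj:1"),
    BibRecord("b3", "Third example title", "subj:2"),
]
revised_authority = rename_heading_in_authority(authority, "subj:1", "New heading")
```

In the string-linked case the number of rewritten records grows with the size of the catalogue; in the identifier-linked case the bibliographic records in `id_linked` are not modified at all, and the new label is picked up when the identifier is resolved.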
+
 
+
=== Library systems require technical staff with specific expertise in library data ===
+
 
+
Maintaining these systems is also expensive and cumbersome, since personnel specifically trained in library data are also required for day-to-day system operations. For an example of the interpretative understanding of MARC required from IT staff, see Jason Thomale, [http://journal.code4lib.org/articles/3832 Interpreting MARC: Where’s the Bibliographic Data?], Code4Lib Journal Issue 11, 2010-09-21.
+
  
 
== Data may have rights issues that prevent open publication ==
 
== Data may have rights issues that prevent open publication ==
Line 83: Line 76:
 
=== Some data cannot be published openly ===
 
=== Some data cannot be published openly ===
  
Some data cannot be published openly for a variety of reasons. In this report we assume that data relating to user identity and use of the library is protected by privacy policies and legislation. Other data, such as that related to purchasing and contracts, is not included in our analysis.
+
Data related to user identity and use of the library is protected by privacy policies and legislation. Other data, such as that related to purchasing and contracts, is not included in our analysis.
  
 
=== Rights ownership can be unmanageably complex ===
 
=== Rights ownership can be unmanageably complex ===
  
Some library bibliographic data has unclear and untested rights issues that can hinder the release of open data. The issue of who owns legacy catalogue records is complex as a result of data sharing between libraries for the past 50 years. The most-shared records are those created by national cataloguing agencies such as the Library of Congress in the USA and the British Library in the UK. Copies of records are frequently modified or enhanced for local cataloguer users. These records may be subsequently re-aggregated into the catalogues of regional, national, and international consortia. Assigning legally-sound intellectual property rights between relevant agents and agencies is difficult, and the lack of certainty is a hindrance to data sharing in a community which is necessarily extremely cautious on legal matters such as censorship, data privacy, and data protection.
+
Some library bibliographic data has unclear and untested rights issues that can hinder the release of open data. Ownership of legacy catalogue records has been complicated by data sharing among libraries over the past 50 years. The records most-shared are those created by national cataloguing agencies such as the Library of Congress in the USA and the British Library in the UK. Records are frequently copied and the copies are modified or enhanced for local cataloguer users. These records may be subsequently re-aggregated into the catalogues of regional, national, and international consortia. Assigning legally-sound intellectual property rights between relevant agents and agencies is difficult, and the lack of certainty is a hindrance to data sharing in a community which is necessarily extremely cautious on legal matters such as censorship, data privacy, and data protection.
 
+
=== Rights ownership can be simple ===
+
 
+
Some bibliographic data may never have been shared with another party, so rights may be exclusively held by the creating agency. Examples of this are when the data is in a non-standard format which is difficult to share, when the data pertains to a collection of unique materials, and when the data pertains to materials in which no-one else is apparently interested.
+
  
 
=== Rights have perceived value ===
 
=== Rights have perceived value ===
  
Agencies put a value on past, present and future investment in creating, maintaining, and collecting metadata, and the larger agencies are likely to treat records as assets in their business plans. Such agencies may be reluctant to publish such data as open LD, or may only be willing to release it in a stripped- or dumbed-down form with loss of semantic detail. For example, data about specific types of title such as preferred title and parallel title might be output as a general title, losing the detail required for a formal citation of the resource.
+
On the other hand, some bibliographic data may never have been shared with another party, so rights may be exclusively held by creating agencies, who put a value on past, present and future investment in creating, maintaining, and collecting metadata. Larger agencies are likely to treat records as assets in their business plans, and may be reluctant to publish them as open LD, or may be willing to release them only in a stripped- or dumbed-down form with loss of semantic detail. For example, data about specific types of title such as preferred title and parallel title might be output as a general title, losing the detail required for a formal citation of the resource.
  
 
''Consider incorporating Simon Spero's comments (on Talk page) for US perspective [[User:Jschneid4|Jodi Schneider]] 10:41, 18 March 2011 (UTC)
 
''Consider incorporating Simon Spero's comments (on Talk page) for US perspective [[User:Jschneid4|Jodi Schneider]] 10:41, 18 March 2011 (UTC)
Line 104: Line 93:
 
=== Library data is expressed primarily as text strings, not "linkable" URIs ===
 
=== Library data is expressed primarily as text strings, not "linkable" URIs ===
  
Most information in library data is encoded as display-oriented text strings. There are a few shared identifiers for resources, such as ISBNs for books, but most identification is done with text strings. Some coded data fields are used in MARC records, but there is not a clear incentive to include these in all records, since most coded data fields are not used in library system functions and therefore. '' therefore what?''Some data fields, such as authority controlled names and subjects, do have their own records in separate files, including identifiers which could be used to represent those entities in library metadata. However, the data formats currently used and current library systems do not support the inclusion of these identifiers in existing library records.  
+
Most information in library data is encoded as display-oriented text strings. There are a few shared identifiers for resources, such as ISBNs for books, but most identification is done with text strings. Some coded data fields are used in MARC records, but there is not a clear incentive to include these in all records, since most coded data fields are not used in library system functions. Some data fields, such as authority controlled names and subjects, do have their own associated records in separate files, which have identifiers that could be used to represent those entities in library metadata. However, the data formats currently used do not support the inclusion of these identifiers in existing library records and consequently neither do current library systems support their use.  
  
 
=== Some library data is being expressed in RDF on an experimental basis, but without standardization or best practices ===
 
=== Some library data is being expressed in RDF on an experimental basis, but without standardization or best practices ===
  
Work has begun to express library data in RDF. Some libraries have experimented with publishing LD from their catalogue records although as yet no standard or best practice has emerged. Progress is being made in the area of defining value vocabularies that are currently in use in libraries. Transformation of legacy data will require more than the mapping of attributes to RDF properties; where possible, library data needs to be transformed from primarily text into data with identified values. New approaches for library data, such as the FRBR model which informs RDA, offer an opportunity for incorporating linked data principles into future library data practices, particularly when these new standards undergo refinement as they are implemented in real world situations.
+
Work has begun to express library data in RDF. Some libraries have experimented with publishing LD from their catalogue records although no standard or best practice has yet emerged. There has been progress in defining value vocabularies currently used in libraries. Transformation of legacy data will require more than the mapping of attributes to RDF properties; where possible, library data should be transformed from text to data with identified values. New approaches for library data, such as the FRBR model which informs RDA, offer an opportunity for incorporating linked data principles into future library data practices, particularly when these new standards are implemented.
  
=== The library community and the Semantic Web community do not have a shared terminology for metadata concepts ===
+
=== The library community and the Semantic Web community have no shared terminology for metadata concepts ===
  
Work on LLD can be hampered by the disparity in concepts and terminology between libraries and the Semantic Web community. Few in libraries would use terms like "statement" for their metadata, and the Web community does not have equivalent concepts to libraries' "headings" or "authority control." Each community has their own vocabulary which reflects the differences in their points of view. Mutual understanding must be fostered as both groups bring important expertise to the potential web of data.
+
Work on LLD can be hampered by the disparity in concepts and terminology between libraries and the Semantic Web community. Few in libraries would use a term like "statement" for metadata, and the Web community does not have concepts equivalent to libraries' "headings" or "authority control." Each community has its own vocabulary and these reflect the differences in their points of view. Mutual understanding must be fostered as both groups bring important expertise to the potential web of data.
  
== The translation of library data into triples must consider the Graph Paradigm ==
+
== Library data must be conceptualized according to the Graph Paradigm ==
  
Translators of legacy library standards into Linked Data terms must recognize that Semantic Web technologies are not merely a variants of practices but represent a fundamentally different way of conceptualizing and interpreting data.  Since the introduction of MARC formats in the 1960s, digital data in libraries has been managed predominantly in the form of "records" -- bounded sets of information described in documents with a precisely specified structure -- in accordance with what may be called a Record Paradigm.  The Semantic Web and Linked Data, in contrast, are based on a Graph Paradigm.  In graphs, information is conceptualized as a boundless "web" of links between resources -- in visual terms as sets of nodes connected by arcs (or "edges"), and in semantic terms as sets of "statements" consisting of subjects and objects connected by predicates.  The three-part statements of Linked Data, or "triples", are expressed in the language of the Resource Description Framework (RDF).  In the Graph Paradigm, the "statement" is an atomic unit of meaning that stands on its own and can be combined with statements from many different sources to create new graphs -- a notion ideally suited for the task of integrating information from multiple sources into recombinant graphs.
+
Translators of legacy library standards into Linked Data must recognize that Semantic Web technologies are not merely variants of practices but represent a fundamentally different way to conceptualize and interpret data.  Since the introduction of MARC formats in the 1960s, digital data in libraries has been managed predominantly in the form of "records" -- bounded sets of information described in documents with a precisely specified structure -- in accordance with what may be called a Record Paradigm.  The Semantic Web and Linked Data, in contrast, are based on a Graph Paradigm.  In graphs, information is conceptualized as a boundless "web" of links between resources -- in visual terms as sets of nodes connected by arcs (or "edges"), and in semantic terms as sets of "statements" consisting of subjects and objects connected by predicates.  The three-part statements of Linked Data, or "triples", are expressed in the language of the Resource Description Framework (RDF).  In the Graph Paradigm, the "statement" is an atomic unit of meaning that stands on its own and can be combined with statements from many different sources to create new graphs -- a notion ideally suited for the task of integrating information from multiple sources into recombinant graphs.
  
Under the Record Paradigm, a data architect can specify with precision the form and expected contents of a data record, and records so designed can be "validated" for completeness and accuracy.  Data sharing among libraries has been based largely on the standardization of fixed record formats, and the consistency of that data has been ensured by adherence to well-defined content rules.  Under the Graph Paradigm, in contrast, data is conceptualized according to significantly different assumptions.  According to the so-called "open-world assumption", any data at hand may, on principle, be incomplete.  It is assumed that data may be supplemented by incorporating information from other, possibly unanticipated, sources, and that information can be added without invalidating information already present.
+
Under the Record Paradigm, a data architect can specify with precision the form and expected content of a data record, which can be "validated" for completeness and accuracy.  Data sharing among libraries has been based largely on the standardization of fixed record formats, and the consistency of that data has been ensured by adherence to well-defined content rules.  Under the Graph Paradigm, in contrast, data is conceptualized according to significantly different assumptions.  According to the so-called "open-world assumption", any data at hand may, in principle, be incomplete.  It is assumed that data may be supplemented by incorporating information from other, possibly unanticipated, sources, and that information can be added without invalidating information already present.
  
The notion of "constraints" takes on significantly different meanings under these two paradigms.  Under the Record Paradigm, if the format schema for a metadata record says that the description of a book can have only one subject heading -- a URI -- and a description with two subject headings is encountered, a validator will report an error in the record.  Under the Graph Paradigm, if an OWL ontology says that a book has only one subject heading, and a description with two subject headings is encountered, an OWL reasoner will infer that the two subject-heading URIs identify the same subject.
+
The notion of "constraints" takes on significantly different meanings under these two paradigms.  Under the Record Paradigm, if the format schema for a metadata record says that the description of a book can have only one subject heading and a description with two subject headings is encountered, a validator will report an error in the record.  Under the Graph Paradigm, if an OWL ontology says that a book has only one subject heading, and a description with two subject headings (URIs) is encountered, an OWL reasoner will infer that the two subject-heading URIs identify the same subject.
  
As will be discussed below, the two paradigms may be seen as complementary. The traditional "closed-world" approach is good for flagging data that is inconsistent with the structure of a metadata record as a document, while OWL ontologies are good for flagging logical inconsistencies with respect to a conceptualization of things the world.  The differences between these two approaches mean that the process translating library standards and datasets into Linked Data cannot be undertaken mechanically, but requires intellectual effort and modeling skill.  Translators, in other words, must acquire some fluency in the language of RDF.
+
As will be discussed below, the two paradigms may be seen as complementary. The traditional "closed-world" approach is good for flagging data that is inconsistent with the structure of a metadata record as a document, while OWL ontologies are good for flagging logical inconsistencies with respect to a conceptualization of things in the world.  The differences between these two approaches mean that the process of translating library standards and datasets into Linked Data cannot be undertaken mechanically, but requires intellectual effort and modeling skill.  Translators, in other words, must acquire some fluency in the language of RDF.

Revision as of 17:45, 21 June 2011

See also Draft_recommendations_page

Contents

Implementation challenges and barriers to adoption

Designed for stability, the library ecosystem resists change

As stable and reliable archives with long-term goals, cultural heritage organizations -- particularly libraries -- are predisposed to traditionalism and conservation. This emphasis on larger goals has led libraries to fall out of step with the faster-moving technology culture of the past few decades. When most information was in print format, libraries were at the forefront of information organization and retrieval. With the introduction of machine-readable catalogs in the 1960s, libraries were early adopters of the computer, though primarily for automating the production of printed catalogs of print materials. As the volume of information in digital format has overtaken print, libraries have struggled both to maintain their function as long-term archives as well as to extend their missions to include digital information. Decreased budgets for libraries and their parent institutions have greatly hindered libraries' ability to create competitive information services.

Cooperative metadata creation is economical but creates barriers to change

Libraries take advantage of cooperative agreements allowing them to share resources, as well as metadata describing those resources. These cooperative efforts are both a strength and a weakness: while shared data creation has economic benefit, changes to shared data require coordination among the sharing parties.

Consequently, major changes require a strong agent to coordinate the effort. In most countries, the national library provides this type of leadership. Changes that transcend the borders of any single country -- such as adopting data standards like FRBR or moving to linked library data -- require a broad leadership that can take into account the many local needs of the international library community.

Library Data is shareable among libraries, but not yet with the wider world

Linked Data reaches a diverse community far broader than the library community; moving to library Linked Data requires libraries to understand and interact with the entire information community. Much of this information community has been engendered by the capabilities provided by new technologies. The library community has not fully engaged with these new information communities, yet the success of Linked Data will require libraries to interact with them as fully as they interact with other libraries today. This will be a huge cultural change that must be addressed.

Libraries are understaffed in the technology area

As libraries have not kept pace with technological change, they also have not provided sufficient educational opportunities for staff. Training within libraries is limited in some countries, and workers are not encouraged to seek training on their own. Technological changes have taken place so quickly that many in library positions today began their careers long before the World Wide Web was a reality, and these workers may not fully understand the import of these changes. Libraries struggle to maintain their technological presence and are often under-staffed in key areas of technology.

  • In-house software developers. An informal survey of Code4Lib participants suggests that there are few software developers in libraries. Although the developers are embedded in library operations, coding is often a small part of their duties. Staff developers tend to be closely bound to working with systems from off-the-shelf software providers. These developers are for the most part maintaining existing systems and do not have much time to explore new technology paradigms and new software systems. They are dependent on a shrinking number of off-the-shelf providers as market players have consolidated over the past two decades (see Marshall Breeding's History of Library Automation).
  • Library workers. Software development skills, including metadata modeling, have often not been a strong part of a library worker's education. Libraries have in essence out-sourced their technology development to a few organizations in the community and to the library systems vendors. These vendors understand library functionality and data, but they need an expectation of development-costs recovery before beginning work on new products.
  • Library leaders. There are many individual Linked Data projects coming out of libraries and related institutions, but no obvious emerging leaders. IFLA has been a thought-leader in this area, but there is still a need to use its work to provide functional systems and software. Many national libraries have an interest in exploring LLD and some have ongoing projects. LLD will be international in scope, and this increases the amount of coordination that will be needed. Because of its strong community ties, however, leadership from within can be expected to have a dramatic effect on the community's ability to move in the direction of Linked Data. "no obvious emerging leaders" still seems overstrong to me -- Jodi Schneider 16:52, 8 April 2011 (UTC)

Library technology has been driven largely by commercial vendors

Much of the technical expertise in the library community is concentrated in the small number of vendors who provide the systems and software that run library management functions as well as the user discovery service. These vendor systems hold the bibliographic data integrated into library management functions like acquisitions, receipt of materials, user data, and circulation. Other technical expertise exists primarily in large academic libraries where development of independent discovery systems for local materials is not uncommon. These latter systems are more likely to use mainstream technologies for data creation and management, but they do not represent the primary holdings of the library.

Libraries do not adapt well to technological change

Technology has continually evolved since computers were first used for library operations in the 1960s. However, the library community tends to engage only with established technologies that have brought proven benefits to their operations and services. The Linked Data approach is relatively new, with enabling technologies and best practices being developed outside of mainstream library applications. Experimentation with Linked Data in the library community has been limited in part due to lack of developer tools for LD in general but also because there are no tools that specifically address library data. It can be difficult to demonstrate the value of LD to librarians because the few examples of implementations that do exist use unfamiliar data and interfaces.

The long-term view by libraries applies also to standards

While both library and Web communities value preservation and endurance (or permanence) of information, the timescales differ: library preservation is measured in generations and centuries (if not millennia) while Web-native information might be considered old at two decades. Ensuring this long-term life of information promotes a conservative outlook for library organizations, in contrast to the mainstream perspective of the Web community, which values novelty and experimentation over preservation of the past.

Therefore it is not surprising that the library standardization process is slower than comparable Web standards development. Current developments towards a new metadata environment can be traced back more than ten years: the basic groundwork for a shift to a new data format was laid in 1998 with the development of the Functional Requirements for Bibliographic Records (FRBR), which provides an entity-relation view of library catalog data. That model is the basis for a new set of cataloguing rules, Resource Description and Access (RDA), which, although finalized in 2010, are still under review before implementation. RDA is a standard of four Anglo-American library communities, and has not yet gained international acceptance, although it is being studied widely. LLD standards associated with RDA are still in the process of development. Through a joint working group with DCMI, the Joint Steering Committee for RDA approved an RDF implementation of the properties and value vocabularies of RDA. These have not yet been moved to production status and are not integrated with the primary documentation and cataloguer tools in the RDA Toolkit.

Library standardization process is cumbersome

A further difference is that Web-related organizations focus on implementations, often hammering out differences with practical examples, and leaving edge cases for later work. This is in contrast to the library standardization approach: standards such as FRBR and RDA have been created as documents, without the test cases, prototype implementations, and iterative development methodologies that characterize modern IT approaches. Library standards have a strong "top-down" direction, and major standards efforts are undertaken by national or international bodies. Development of an international standard takes years, and that development cannot keep up with the increasingly fast pace of technological change. Development cycles are often locked into face-to-face business meetings of the parent organization or group to comply with formal approval procedures. As a result, standards may be technologically out-of-date as soon as they are published.

Bottom-up standards can be successful but garner little recognition

While on the Web, bottom-up development is common for all but the largest and most-used standards (e.g. HTML5), bottom-up development often does not get proper recognition from the library community. Even so, some bottom-up initiatives have led to successful standards adopted by the library community, including OpenURL, METS, OAI, and Dublin Core. LLD will require funding and will need institutional support (though it isn't clear where funding and support will come from) but it will also require an environment where the bottom-up developers can flourish.

Library standards are limited to the library data

While the Web values global interchange between all parties, library cataloguing standards in the past have aimed to address only the exchange of data within the library community. The need to think of broader bibliographic data exchange (e.g. with publishers) is new and not universally accepted. There is fear that library data will need to be "dumbed down" in order to interact with other communities; few see the possibility of "smarting up" bibliographic data using library-produced information.

ROI is difficult to calculate

Some cost issues are known but are unmeasured

It is admittedly difficult to calculate or estimate costs and benefits in a publicly funded service environment. This makes it particularly difficult to create concrete justifications for large-scale changes of the magnitude required for adopting Linked Data in libraries. While there is a general recognition of distinct disadvantages to siloed library data practices, no measurement exists that compares the resources required to create and manage current library data with those required for linked library data. (Note: there are some studies on the cost of cataloging, but they do not separately study costs related to data technology: Library of Congress Study of the North American MARC Records Marketplace, R2 Consulting LLC, Ruth Fischer, Rick Lugg, October 2009; and Implications of MARC Tag Usage on Library Metadata Practices, OCLC, March 2010.)

"MARC data cannot continue to exist in its own discrete environment, separate from the rest of the information universe. It will need to be leveraged and used in other domains to reach users in their own networked environments. The 200 or so MARC 21 fields in use must be mapped to simpler schema."

Smith-Yoshimura, et al., Implications of MARC Tag Usage on Library Metadata Practices. www.oclc.org/research/publications/library/2010/2010-06.pdf

Library-specific data formats require niche systems solutions

It is possible, however, to observe the consequences of library data practices. Libraries use data technology specific to libraries and library systems. They are therefore dependent on niche software systems tailored to formats that nobody uses outside of the library world. Because the formats used in libraries (notably MARC) are unique to libraries, vendors of library systems cannot use mainstream data modeling systems, programmer tools, and database software to build library systems. Development of library systems also requires personnel specifically trained in library data. This makes it expensive to provide systems for the library community. The common practice of commissioning a new, customized system in every library -- every 5 to 10 years -- is very expensive; the aggregate cost to the library community has not been reliably estimated.

Vocabulary changes in library data are costly

Controlled vocabularies will play an important role in linked data in general. Although controlled vocabularies are used in library data (in particular for names of persons and organizations, and for subjects), they are not managed in a manner that facilitates linked data: a change to a vocabulary requires that all related records be retrieved and changed. This is a disruptive process, made even more expensive because the library metadata record, being designed primarily as a communication format, requires a full record replacement for an update to any of its fields.
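
The difference in update cost can be sketched in a few lines. This is an illustration only: the record layout, the heading strings, and the identifier http://example.org/subject/42 are invented, and real MARC records and authority files are considerably more complex.

```python
# Illustration only: record layout, heading strings and URI are invented.

# Pattern 1: the heading text is copied into every record.
records_with_strings = [
    {"id": 1, "subject": "Cookery"},
    {"id": 2, "subject": "Cookery"},
    {"id": 3, "subject": "Gardening"},
]


def rename_heading_in_records(records, old, new):
    """Replace a heading string; return how many records had to be rewritten."""
    rewritten = 0
    for record in records:
        if record["subject"] == old:
            record["subject"] = new
            rewritten += 1
    return rewritten


# Pattern 2: records carry an identifier; the label is stored once.
SUBJECT_URI = "http://example.org/subject/42"  # hypothetical identifier
labels = {SUBJECT_URI: "Cookery"}
records_with_uris = [
    {"id": 1, "subject": SUBJECT_URI},
    {"id": 2, "subject": SUBJECT_URI},
]

records_rewritten = rename_heading_in_records(
    records_with_strings, "Cookery", "Cooking"
)
labels[SUBJECT_URI] = "Cooking"  # one change; no record is touched
displayed = [labels[r["subject"]] for r in records_with_uris]
```

In the first pattern the number of rewritten records grows with the size of the catalogue; in the second, the records are untouched and the new label appears wherever the identifier is resolved.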

Data may have rights issues that prevent open publication

For a perspective from Europe, see Free library data? by Raymond Bérard.

Some data cannot be published openly

Data related to user identity and use of the library is protected by privacy policies and legislation. Other data, such as that related to purchasing and contracts, is not included in our analysis.

Rights ownership can be unmanageably complex

Some library bibliographic data has unclear and untested rights issues that can hinder the release of open data. Ownership of legacy catalogue records has been complicated by data sharing among libraries over the past 50 years. The most-shared records are those created by national cataloguing agencies such as the Library of Congress in the USA and the British Library in the UK. Records are frequently copied and the copies are modified or enhanced for local cataloguer users. These records may be subsequently re-aggregated into the catalogues of regional, national, and international consortia. Assigning legally-sound intellectual property rights between relevant agents and agencies is difficult, and the lack of certainty is a hindrance to data sharing in a community which is necessarily extremely cautious on legal matters such as censorship, data privacy, and data protection.

Rights have perceived value

On the other hand, some bibliographic data may never have been shared with another party, so rights may be exclusively held by creating agencies, who put a value on past, present and future investment in creating, maintaining, and collecting metadata. Larger agencies are likely to treat records as assets in their business plans, and may be reluctant to publish them as open LD, or may be willing to release them only in a stripped- or dumbed-down form with loss of semantic detail. For example, data about specific types of title such as preferred title and parallel title might be output as a general title, losing the detail required for a formal citation of the resource.
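
The loss of detail in the title example can be made concrete with a small sketch. The "ex:" property names and the sample values are hypothetical and do not come from any published vocabulary.

```python
# Illustration only: the "ex:" property names and sample values are hypothetical.
detailed = [
    ("ex:book1", "ex:preferredTitle", "War and peace"),
    ("ex:book1", "ex:parallelTitle", "Voina i mir"),
]
TITLE_PROPERTIES = {"ex:preferredTitle", "ex:parallelTitle"}


def dumb_down(statements):
    """Map every specific title property to one general 'ex:title' property."""
    return [
        (s, "ex:title" if p in TITLE_PROPERTIES else p, o)
        for s, p, o in statements
    ]


simplified = dumb_down(detailed)
distinct_properties = {p for _, p, _ in simplified}
```

After the mapping both values are still present, but nothing in the output says which one is the preferred title, so the information needed for a formal citation cannot be recovered from the simplified data alone.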

Consider incorporating Simon Spero's comments (on Talk page) for US perspective Jodi Schneider 10:41, 18 March 2011 (UTC) "See also notes from UK perspective on Talk page Jodi Schneider 09:02, 22 March 2011 (UTC)"

Library data is expressed in library-specific formats that cannot be easily shared outside the library community

Library data is expressed primarily as text strings, not "linkable" URIs

Most information in library data is encoded as display-oriented text strings. There are a few shared identifiers for resources, such as ISBNs for books, but most identification is done with text strings. Some coded data fields are used in MARC records, but there is no clear incentive to include these in all records, since most coded data fields are not used in library system functions. Some data fields, such as authority controlled names and subjects, do have their own associated records in separate files, which have identifiers that could be used to represent those entities in library metadata. However, the data formats currently used do not support the inclusion of these identifiers in existing library records, and consequently current library systems do not support their use either.
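
A minimal sketch of why this matters for linking, assuming invented records and a made-up authority identifier: when the same author is recorded under two different display strings, grouping by string splits the works apart, while grouping by identifier keeps them together.

```python
# Illustration only: records and the authority URI are invented.
AUTHOR_URI = "http://example.org/authority/names/0001"  # hypothetical

records_as_strings = [
    {"title": "Adventures of Huckleberry Finn", "author": "Twain, Mark, 1835-1910"},
    {"title": "The Innocents Abroad", "author": "Mark Twain"},
]
records_with_uris = [
    {"title": "Adventures of Huckleberry Finn", "author": AUTHOR_URI},
    {"title": "The Innocents Abroad", "author": AUTHOR_URI},
]


def group_by_author(records):
    """Group titles under whatever value identifies the author."""
    groups = {}
    for record in records:
        groups.setdefault(record["author"], []).append(record["title"])
    return groups


string_groups = group_by_author(records_as_strings)  # two separate "authors"
uri_groups = group_by_author(records_with_uris)      # one author, two works
```

String normalization could narrow the gap in this particular case, but it remains heuristic; a shared identifier removes the need to guess.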

Some library data is being expressed in RDF on an experimental basis, but without standardization or best practices

Work has begun to express library data in RDF. Some libraries have experimented with publishing LD from their catalogue records although no standard or best practice has yet emerged. There has been progress in defining value vocabularies currently used in libraries. Transformation of legacy data will require more than the mapping of attributes to RDF properties; where possible, library data should be transformed from text to data with identified values. New approaches for library data, such as the FRBR model which informs RDA, offer an opportunity for incorporating linked data principles into future library data practices, particularly when these new standards are implemented.
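
The two steps mentioned above, mapping attributes to properties and replacing text with identified values, might look roughly as follows. This is a sketch under stated assumptions: the field names, the record, and the authority index with its example.org URI are hypothetical; only the two Dublin Core property URIs are real.

```python
# Illustration only: field names, record and authority index are assumptions.
PROPERTY_MAP = {
    "title": "http://purl.org/dc/terms/title",
    "subject": "http://purl.org/dc/terms/subject",
}
# Hypothetical lookup from heading text to an identifier.
AUTHORITY_INDEX = {"Whales": "http://example.org/subject/whales"}


def record_to_statements(record_uri, record):
    """Step 1: map fields to properties. Step 2: swap known headings for URIs."""
    statements = []
    for field, value in record.items():
        prop = PROPERTY_MAP.get(field)
        if prop is None:
            continue  # unmapped field: skipped in this sketch
        if field == "subject":
            # Fall back to the original string when no identifier is known.
            value = AUTHORITY_INDEX.get(value, value)
        statements.append((record_uri, prop, value))
    return statements


statements = record_to_statements(
    "http://example.org/book/1",
    {"title": "Moby Dick", "subject": "Whales"},
)
```

Stopping after the first step would yield RDF whose values are still plain strings; the second step is what makes the subject value linkable.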

The library community and the Semantic Web community have no shared terminology for metadata concepts

Work on LLD can be hampered by the disparity in concepts and terminology between libraries and the Semantic Web community. Few in libraries would use a term like "statement" for metadata, and the Web community does not have concepts equivalent to libraries' "headings" or "authority control." Each community has its own vocabulary, which reflects the differences in their points of view. Mutual understanding must be fostered as both groups bring important expertise to the potential web of data.

Library data must be conceptualized according to the Graph Paradigm

Translators of legacy library standards into Linked Data must recognize that Semantic Web technologies are not merely variants of existing practices but represent a fundamentally different way to conceptualize and interpret data. Since the introduction of MARC formats in the 1960s, digital data in libraries has been managed predominantly in the form of "records" -- bounded sets of information described in documents with a precisely specified structure -- in accordance with what may be called a Record Paradigm. The Semantic Web and Linked Data, in contrast, are based on a Graph Paradigm. In graphs, information is conceptualized as a boundless "web" of links between resources -- in visual terms as sets of nodes connected by arcs (or "edges"), and in semantic terms as sets of "statements" consisting of subjects and objects connected by predicates. The three-part statements of Linked Data, or "triples", are expressed in the language of the Resource Description Framework (RDF). In the Graph Paradigm, the "statement" is an atomic unit of meaning that stands on its own and can be combined with statements from many different sources to create new graphs -- a notion ideally suited for the task of integrating information from multiple sources into recombinant graphs.
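
The idea of statements as atomic, recombinable units can be sketched without any RDF library by treating a graph as a set of (subject, predicate, object) tuples. The example.org URIs and the sample data are invented; the Dublin Core and FOAF property URIs are real.

```python
# Illustration only: resource URIs and data are invented.
BOOK = "http://example.org/book/1"
AUTHOR = "http://example.org/person/7"
DC_TITLE = "http://purl.org/dc/terms/title"
DC_CREATOR = "http://purl.org/dc/terms/creator"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Statements published by a library.
library_graph = {
    (BOOK, DC_TITLE, "Moby Dick"),
    (BOOK, DC_CREATOR, AUTHOR),
}
# Statements about the same person published by some other source.
other_graph = {
    (AUTHOR, FOAF_NAME, "Herman Melville"),
}

# Merging graphs is a set union: no record structure has to be reconciled.
merged = library_graph | other_graph


def objects(graph, subject, predicate):
    """Return all objects of statements matching subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


# Follow the link from the book to its creator, then to the creator's name.
creator_names = {
    name
    for creator in objects(merged, BOOK, DC_CREATOR)
    for name in objects(merged, creator, FOAF_NAME)
}
```

Neither source alone can answer "what is the name of the book's creator"; the merged graph can, because both sources use the same identifier for the person.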

Under the Record Paradigm, a data architect can specify with precision the form and expected content of a data record, which can be "validated" for completeness and accuracy. Data sharing among libraries has been based largely on the standardization of fixed record formats, and the consistency of that data has been ensured by adherence to well-defined content rules. Under the Graph Paradigm, in contrast, data is conceptualized according to significantly different assumptions. According to the so-called "open-world assumption", any data at hand may, in principle, be incomplete. It is assumed that data may be supplemented by incorporating information from other, possibly unanticipated, sources, and that information can be added without invalidating information already present.
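The contrast between the two assumptions can be sketched roughly as follows. This is a simplified illustration, not a description of how any particular system behaves: the function names, the record layout, and the `example.org` URIs are all hypothetical, and "unknown" is represented here by Python's `None`.

```python
# Hypothetical URIs, for illustration only.
BOOK = "http://example.org/book/1"
HAS_SUBJECT = "http://example.org/vocab/hasSubject"
WHALING = "http://example.org/subject/whaling"
SEA_STORIES = "http://example.org/subject/sea-stories"


def closed_world_holds(record, field, value):
    """Record Paradigm: what the record does not say is taken to be false."""
    return value in record.get(field, [])


def open_world_holds(graph, triple):
    """Graph Paradigm: an absent statement is unknown (None), not false."""
    return True if triple in graph else None


record = {"subject": ["Whaling"]}
graph = {(BOOK, HAS_SUBJECT, WHALING)}

closed_answer = closed_world_holds(record, "subject", "Sea stories")
open_answer_before = open_world_holds(graph, (BOOK, HAS_SUBJECT, SEA_STORIES))

# Data from another source can be added later without
# invalidating the statement that was already present.
graph.add((BOOK, HAS_SUBJECT, SEA_STORIES))
open_answer_after = open_world_holds(graph, (BOOK, HAS_SUBJECT, SEA_STORIES))
```

Under the closed-world reading the missing heading yields a definite "no"; under the open-world reading the same absence only means the statement has not been encountered yet.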

The notion of "constraints" takes on significantly different meanings under these two paradigms. Under the Record Paradigm, if the format schema for a metadata record says that the description of a book can have only one subject heading and a description with two subject headings is encountered, a validator will report an error in the record. Under the Graph Paradigm, if an OWL ontology says that a book has only one subject heading, and a description with two subject headings (URIs) is encountered, an OWL reasoner will infer that the two subject-heading URIs identify the same subject.
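The subject-heading example can be sketched in code to show how the same "only one" rule leads to different outcomes. This is a minimal illustration under stated assumptions: the record layout, the function names, and the `example.org` URIs are hypothetical; only the `owl:sameAs` URI is a real term. The inference function imitates a single consequence of treating a property as single-valued and is not an OWL reasoner; a real reasoner would, for instance, report an inconsistency instead if the two subjects had been declared different.

```python
from itertools import combinations

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical URIs, for illustration only.
BOOK = "http://example.org/book/1"
HAS_SUBJECT = "http://example.org/vocab/hasSubject"
SUBJECT_A = "http://example.org/subject/a"
SUBJECT_B = "http://example.org/subject/b"


def validate_record(record, max_subjects=1):
    """Record Paradigm: a violated constraint is reported as an error."""
    subjects = record.get("subject", [])
    if len(subjects) > max_subjects:
        return [f"too many subject headings: {len(subjects)} > {max_subjects}"]
    return []


def infer_same_as(graph, single_valued_property):
    """Graph Paradigm: two values of a single-valued property
    are inferred to identify the same thing."""
    values = {}
    for s, p, o in graph:
        if p == single_valued_property:
            values.setdefault(s, set()).add(o)
    inferred = set()
    for objects in values.values():
        for a, b in combinations(sorted(objects), 2):
            inferred.add((a, OWL_SAME_AS, b))
    return inferred


record = {"subject": [SUBJECT_A, SUBJECT_B]}
graph = {(BOOK, HAS_SUBJECT, SUBJECT_A), (BOOK, HAS_SUBJECT, SUBJECT_B)}

errors = validate_record(record)
inferences = infer_same_as(graph, HAS_SUBJECT)
```

Here `errors` holds one validation message, while `inferences` holds one new statement asserting that the two subject URIs are the same; the data is rejected in the first case and extended in the second.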

As will be discussed below, the two paradigms may be seen as complementary. The traditional "closed-world" approach is good for flagging data that is inconsistent with the structure of a metadata record as a document, while OWL ontologies are good for flagging logical inconsistencies with respect to a conceptualization of things in the world. The differences between these two approaches mean that the process of translating library standards and datasets into Linked Data cannot be undertaken mechanically, but requires intellectual effort and modeling skill. Translators, in other words, must acquire some fluency in the language of RDF.