
Draft issues page notused


Problems and Limitations: raw notes and materials not used

RAW statements Organized as Issues

Library data is expressed in library-specific formats that cannot be easily shared

  • This data is based on century-old cataloging standards that are still mainly text-based. There is little use of identifiers for things. Controlled vocabularies use text strings that are also the display forms. The emphasis is on the creation of records, with records representing complex bibliographic entities.
  • There are no standards yet for re-formulating that data in RDF, although some work has begun, primarily on value vocabularies. Transformation will take more than a simple mapping: data will have to change to move from text-based to data-based values.
  • The data has the advantage, however, that it has been developed to standards and has been heavily curated over time.
  • The library standards process is also dated. The library community continues to create standards in a top-down fashion that does not include any technical proofs of concept. Standards take many years to develop and cannot keep up with the pace of technical change. (Example: FRBR, developed in 1998 based on "relational database" concepts; by the time it was being implemented there was more interest in modeling metadata around RDF.) These standards are often out of date by the time they are issued. (gd: This is a bit harsh - the library community is no different from many others in these respects. It also implies that bottom-up standards development has been viable for some time, instead of the at-most 10-20 years of the open source/data movement.)

The Case for LLD

  • What's the case for LLD? What problem does *Library* LD help to solve? How will it be of benefit to libraries (in addition to and/or improvements on current practices, current standards, current costs, etc)? How will LLD help libraries better serve their users?
  • What is the benefit of this activity to libraries, which will be asked to pour resources into it? To ask this in a more challenging way: So what? How are libraries losing out now? What's broken that needs fixing?
  • Will LLD only benefit data consumers? Who are these data consumers? How will it benefit my mother, or yours? Will it benefit data creators (catalogers, for example)? Will it benefit those managing library technology?

Standardization

  • Current standards (Gordon's document Library_standards_and_linked_data)
  • Paradigm shift for standards development
  • Role of library committees v. bottom-up development?
  • How will libraries support innovation?

General

  • Libraries need to take advantage of the economies of scale that data sharing affords. This means that libraries will need to apply standards to their data for use within libraries and library systems.
  • No community guidance on which technologies and vocabularies to use
  • No standard way to express bibliographic data (or related library data) as RDF (either agreed via standards bodies, or simply through standard practices)
  • Lack of URIs for relevant metadata components. [1]
  • When ontologies/metadata schemas/vocabularies overlap, which should be used?
  • Provenance is a big, scary unknown
  • No examples in our community domain that we can follow
  • Lack of information on how to create a data model
  • Description to date has been based on the practice of transcribing or describing based on the item in hand. The use of linked data calls for identification in addition to or instead of transcription.


Ontology Questions

  • There is some uncertainty about whether we should be using some IFLA-sanctioned version of the FRBR vocabulary, or if using Ian Davis's vocabulary is good enough.
  • Needs are sometimes specific (the physical state of an original in a preservation context) and sometimes general (vocabularies for preservation data), but there is no vocabulary for the function or data elements
  • No consensus on the components of an optimal reusable package of RDF metadata for applications using library linked data. [12]
  • When to coin new terms, when to re-use, and how to align.
  • Need to carefully articulate library value vocabularies (concepts, terms) with the real-world entities they stand for. Library entities are "proxies" for real things, which linked data has as a core focus. Perhaps this needs some education of the LD community, i.e., convincing them that there is value in having such proxies, and identifying appropriate mechanisms (i.e. efficient in terms of data creation and consumption) to represent this (à la skos:Concept + foaf:focus; a minimal sketch of this pattern follows below).
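A minimal sketch of the skos:Concept + foaf:focus pattern mentioned above, written in Python with rdflib; all URIs are illustrative placeholders, not real authority identifiers.

```python
# Sketch of the skos:Concept + foaf:focus "proxy" pattern (placeholder URIs).
from rdflib import Graph, Literal, Namespace, URIRef

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")

g = Graph()
g.bind("skos", SKOS)
g.bind("foaf", FOAF)

# The library's authority entry acts as a proxy (a skos:Concept)...
concept = URIRef("http://example.org/authorities/names/n123")
# ...while foaf:focus points at the real-world entity the concept stands for.
person = URIRef("http://example.org/people/jane-austen")

g.add((concept, SKOS.prefLabel, Literal("Austen, Jane, 1775-1817", lang="en")))
g.add((concept, FOAF.focus, person))

print(g.serialize(format="turtle"))
```

The design point is that the curated authority entry keeps its own identity and history, while foaf:focus links it to the real-world entity that linked data consumers care about.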

Vocabulary mapping

  • Current mapping vocabularies in RDF reveal some expressivity issues:
    • As more types of value vocabularies are aligned, there can be a gap between the entities to be aligned and what mapping vocabularies can express. In particular, SKOS lacks specific support for:
      • Expressing mapping between compound concepts represented in a pre-coordinated and a post-coordinated vocabulary (a sketch of this case follows at the end of this list),
      • Expressing mapping between concepts from vocabularies that have different structures, for example, a classification system and a thesaurus.
    • More investigation is needed on links between entities of different semantic types, e.g., Concepts vs. RWO (Real World Objects), Concepts vs. (OWL) Classes, RWO to RWO, etc.
    • In addition to the type of link (mapping [property]), other information such as provenance (creation methods, e.g., automatic vs. manual mapping, degree of reliability, etc.) would probably be required to ease sharing and re-use of alignment results.
    • There is a need for generalized mapping rules for the degrees of mapping.
  • Usage of standard patterns for mapping is not yet a reality
    • There are many proposals, not all yet agreed-upon.[1]
    • Although there are new cases of SKOS being used as a pattern in vocabularies beyond KOS, such as MADS and VIAF, a comprehensive study is needed of how effective this approach is for vocabulary mapping.
  • Stability and availability of dereferencing services, triple stores, data maintenance, and synchronisation. [12]
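The pre-/post-coordination gap noted above can be made concrete with a small sketch (Python with rdflib; all URIs are invented placeholders). Plain SKOS offers only pairwise concept-to-concept mapping properties, so the best one can do is relate the compound heading to each facet separately.

```python
# Sketch: mapping a pre-coordinated heading to post-coordinated concepts with plain SKOS.
from rdflib import Graph, Literal, Namespace, URIRef

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")
g = Graph()
g.bind("skos", SKOS)

# Pre-coordinated, LCSH-style compound heading (placeholder URI).
compound = URIRef("http://example.org/lcsh/ChildrenFranceHistory")
g.add((compound, SKOS.prefLabel, Literal("Children--France--History", lang="en")))

# A post-coordinated vocabulary keeps the facets as separate concepts.
children = URIRef("http://example.org/thesaurus/children")
france = URIRef("http://example.org/gazetteer/france")
history = URIRef("http://example.org/thesaurus/history")

# Standard SKOS only allows pairwise links; there is no built-in way to say
# that the *combination* of the three facets corresponds to the compound
# heading -- the expressivity gap noted in the list above.
for facet in (children, france, history):
    g.add((compound, SKOS.broadMatch, facet))

print(g.serialize(format="turtle"))
```

A richer mapping vocabulary would need a way to state that the combination of the facets, not each facet alone, is what matches the compound heading.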

Library Legacy Data

  • resources required to convert these triples?
  • duplication -- big overlap between libraries, which means duplication in triple-space
  • dumbing down -- in order to retain control over full data; or vocabularies not available
  • coordination between legacy and LD over time


  • The library world has an enormous cache of data that is somewhat standardized but uses an antiquated concept of data and data modeling. Transformation of this data will take coordination (since libraries share data and systems for data creation). But before it can be transformed it needs to be analyzed and there must be a plan for converting it to linked data. (There is a need for library systems to be part of this change, and that is very complex.)
  • Vocabularies and element sets in widespread use in library legacy metadata are not available in RDF. [12]
  • Lack of critical mass of linked data for legacy records. [5], [10]
  • Lack of stable community-wide mappings for common legacy metadata formats to RDF classes and properties. [12]
  • For some items, detailed content information is in free text fields within the library MARC record (e.g. the MARC 505 table of contents field)
  • Some material format information is stored in free text fields (e.g. MARC 300)
  • Also, many pieces of data you may wish to standardise as single entities in RDF (e.g. Publisher, Place of Publication) are recorded as free text in MARC, with no 'authority control', and so may require work to map to single entities, or result in accidental duplicate identities being created for the same entity (a rough conversion sketch follows at the end of this list)
  • Current data is free text, but contains quantitative information that needs to be pulled out
  • Data needs to be qualified as "estimated" or "derived" so users know it is not precise (this is possibly a vocabulary issue)
  • Current practice does not include rich relationships, just "related," so there is no source of relationships
  • Linked Data standards and MARC - going back to question 1, we can envision Linked Data as another interoperability framework that will come in addition to source formats (Dublin Core never replaced MARC, but it is still useful for exchanging data with other communities). Is RDF really a candidate to replace MARC? In what timeframe? What does it mean for legacy data: is it possible to convert it without losing information? What tools are needed to do so? Is it possible to share them?
  • Is a new data model needed for LLD? If so, how far must it depart from current models (and why)? If a new data model is to be recommended, what is the scope and purpose of this data model? Only for LLD purposes (which is the scope of the current charter)? Or meant to replace the current model completely (which, while not mutually exclusive, is technically beyond the current charter)? What impact, if any, does a new data model have on legacy data? There's been talk about data modeling and a general understanding that something different is needed. It is beyond the scope of this group to recommend a detailed solution, but this group should be able to talk about how current data models are insufficient to the task and make general recommendations in light of those reservations. It should be clear how proposed models will (positively) impact libraries. (kc: this last part might need to be part of the introduction.)
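As a rough illustration of the free-text problems listed above (see the bullet on Publisher and Place of Publication), the sketch below pulls publisher strings out of MARC 260/264 $b and mints placeholder URIs for them. It assumes the pymarc and rdflib libraries and an invented file name; a real conversion would reconcile strings against an authority file rather than minting URIs from the strings themselves.

```python
# Sketch only: free-text MARC publisher data to placeholder RDF entities.
import re

from pymarc import MARCReader
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS

DCTERMS = Namespace("http://purl.org/dc/terms/")
EX = Namespace("http://example.org/")          # placeholder namespace

g = Graph()
g.bind("dcterms", DCTERMS)

with open("records.mrc", "rb") as fh:          # placeholder file name
    for i, record in enumerate(MARCReader(fh)):
        if record is None:                     # pymarc may yield None for unreadable records
            continue
        work = EX[f"bib/{i}"]
        for field in record.get_fields("260", "264"):
            for publisher in field.get_subfields("b"):
                # Strip trailing ISBD punctuation transcribed into the field ("Penguin,").
                name = re.sub(r"[\s:;,.]+$", "", publisher).strip()
                if not name:
                    continue
                # Naive minting: the same publisher spelled differently in two records
                # yields two URIs -- the accidental-duplicate-identity risk noted above.
                publisher_uri = EX["publisher/" + re.sub(r"\W+", "-", name.lower()).strip("-")]
                g.add((work, DCTERMS.publisher, publisher_uri))
                g.add((publisher_uri, RDFS.label, Literal(name)))

print(g.serialize(format="turtle"))
```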

Incomplete State of Development of LLD

  • The course material recorded in the library catalogue at the Open University is more heterogeneous in reality than is perhaps expressed in the library catalogue records. Specifically, there are questions as to whether it is appropriate to model audio-visual material in the same way as print material – this question applies more generally across the library sector.
  • Some existing work to model library catalogue data seems to be focused on 'bibliographic' material - a specific example is the proposed ISBD ontology (http://metadataregistry.org/vocabulary/list.html). It is not clear it would be appropriate to apply ISBD properties to audio-visual material
  • Diversity of users' needs regarding the alignment quality. It is difficult to reach a consensus about what is a good mapping for any project.
    • There will be different applications of a vocabulary alignment. The same alignment may be performed differently in different scenarios. Users' needs are diverse and can have a direct impact on mapping practices and results.
    • Patterns for vocabulary alignment can be multiple and decisions have to be made based on the assessments. For example, there are direct mapping and backbone (hub) mapping approaches. Within the same project different patterns may be combined.
  • Taking over aligned vocabularies in a local repository remains an option
  • Handling mapping to not-yet LD-published vocabularies
  • Diversity of application environments, e.g., integration into a language technology processing pipeline, potentially crossing technology stacks (RDF & XML based Web Services)
  • Is citation different from identification?
  • What about citing things that aren’t publications, e.g. datasets?
  • Formats for citations: very closely related to Bibliography Cluster
  • There are very few comprehensive data sets based on directories of libraries, including address and other location information, and contact and other agent information. The information tends to be fragmented into consortia, sector, and subject-based groupings.
  • The International Standard Identifier for Libraries and Related Organisations (ISIL) offers a method for assigning an identifier to a library organization. Organizations do not always register individual physical branch libraries. The ISIL Registration Authority coordinates national agencies for assigning identifiers. There is no consolidated list of libraries and codes, and each agency uses a different interface to list or search national identifiers.
  • There are some national services using identifiers for libraries, but they generally do not interoperate.
  • There is no standard typology of memory institution types. Existing typologies are tied to one of the three primary entities in collection-level description: Collection itself, Location (sub-entities physical and electronic), and Agent (sub-entities person and corporate body/family). Typologies may be based on material format of the items in the collection, administration or curation of the collection, architecture and type of building housing the collection, audience level, subject, etc. Developing a single typology that fits all entities is difficult.

Skills and Education

  • Teaching catalogers how to move from a "records" mental model to a "graph" mental model of interrelated statements. This requires a fundamental shift of perspective that bundles up other related issues, among them that text labels for identifiers are not good enough.
  • General education is a must. Answers to questions about how LD benefits libraries and users will make it easier to make case for education and funding for education.
  • Community agreement and leadership - There are many in the community who are either not interested in LLD, don't know about LLD, or who are actually opposed to LLD.
  • Publishing Linked Data requires expertise which is often not available at the institutions that wish to publish Linked Data, including, among others:
    • Best practices
    • Examples
    • Suggested vocabularies and metadata schemas to use
    • Best practices in connecting library material and other types of resources, including courses (e.g., through reading lists, references), A/V material, and available open educational material (which might be repurposed from existing resources) need to be created.
  • Institutions may not have the time themselves to convert their data to Linked Data, but may be willing to allow others to do it. This requires coordination with communities willing to make that effort.
  • It is hard to find personnel well-versed in Semantic Web technology. Job sites (e.g. the Dutch MonsterBoard) do not have/list personnel versed in standards such as RDF.
  • Cost of intellectual work for vocabulary mapping, esp. for complex metadata element sets or large value vocabularies.

Paradigm shift: "record" or "concept" as surrogate?

Jeff: I think our notion of "surrogate" is destined to change from "record" to "concept". I suspect it will be a quiet revolution analogous to how our notion of LCCN changed over the years from "card number" to "control number" and now (for all intents and purposes) to "concept number".

But Emmanuelle points out: A few years ago, when we were first brainstorming about Europeana, we decidedly stated that Europeana would not be another library catalogue, nor another portal. We wanted to do something "more" with the data, we wanted to be able to align our descriptions of objects (which wouldn't be records) with a semantic layer describing "real things": works, creators, events, etc. (All this may seem really familiar to you all, but that was 4 years ago, and quite new at the time.) So, we came up with the idea of "surrogate". The surrogate was something that was meant to express that Europeana was not hosting digital objects themselves, but a representation of them, and this representation had to be something more than just a record. Two years later, the term surrogate failed and we gave it up. Why?

  • because the surrogate was initially meant to be conceptual, but people kept trying to instantiate it and name it in the data, which led to confusion
  • because "surrogate" is a term that has no satisfying translation in some languages (including french) and thus corresponds to no ready-made reality for (at least some) non-native english speakers Maybe there were other reasons that I don't remember.

I know that the world has changed a lot in the meantime, now we have Linked Data, and a great deal of thought on resources and their representations ([1] and its great summary at [2] ;-). But if we are to choose "the" word that will make the shift from the record to the graph, I would avoid surrogate.

Gordon: I think the focus of library metadata maintenance will shift from record to triple (set of statements about a bibliographic entity to a single statement/triple with that entity as subject). But while a triple is great for linking linked data, an isolated triple is not that useful for consumption by human agents. Libraries and their users are somewhat familiar with the idea of "levels of description", equivalent to sets of descriptive attributes that increase in size/coverage. AACR2 notes three such levels, with some indication of which is appropriate for different kinds of libraries. For example, the third level of description includes all attributes relevant to the resource described. This level is used by national agencies, national bibliographies, etc. The first level of description only includes a basic sub-set of all possible attributes, and is suitable for brief record displays, etc. Furthermore, library catalogues currently display different sets of attributes at different points of the resource discovery process: the so-called author/title list as the result of a search, then the "standard" record display for a selected resource, often with an option to display the "full" record if the user wants to see it. So there never has been a fixed "record" in Libraryland.

Other Community Issues: Management, Rights, Complexity, Systems

Management

  • At the moment, there are no centers of leadership to facilitate such a major change to library thinking about its data (although IFLA is probably the most active).
  • Where to start? To convert a dataset of any significant size, we'll need name authorities, subject thesauri, controlled vocabulary terms, etc. If everyone does this in isolation, minting their own URIs, etc., how is this any better than silos of MARC records? How do institutions the size of University of Michigan or Stanford get access to datasets such as VIAF so they don't have to do millions of requests every time they remodel their data? How do they know which dataset to look in for a particular value? What about all of the data that won't be found in centralized datasets (local subject headings, subject headings based on authorities with floating terms, names not in the NAF, etc.)?

+ "Where to stop": libraries should perhaps learn to rely on data produced by others and not try to produce every required data by themselves (gazetteers come to mind). + versioning and updates. When/how to disseminate change notifications, how data consumers should integrate them (problem of deprecation/removal of triples).

  • Who will be first to take the plunge? What is the benefit for them to do so? In the absence of standards, will their experience have any influence on how standards are created (that is, will they go through the work only to have to later retool everything)?
  • It is still quite difficult to convince potential funders that this is an important area to be working in. This is the "chicken/egg" problem, that without something to show funders, you can't get funding.
  • Reliability of data sources that are targets of alignments is critical yet difficult to discover without certain levels of investment. ("See also" under 'Missing Vocabularies' section.)
  • Re-alignments may occur due to the updates of the vocabularies involved. Participants need to be informed about other vocabularies' updating policies, workflow, and frequencies. Such updating needs to be incorporated in the mapping results or routine. (See also under 'Technology availability/questions' section.)
  • Communication/cooperation with wider cultural sector. Problems are quite similar across LAM actors.


Rights

  • Copyright and licensing can influence mapping strategies and their implementation.
    • Selection of sources is largely influenced by their availability.
    • There is a need to differentiate between the licensing and use of metadata about digital asset and the licensing and use of the assets themselves.
  • The ownership of the alignment data is still an unclear area. Where should one publish mappings if multiple vocabularies are involved? Who owns the original data and mapping expression?
  • Lack of open licenses for re-use of linked data. [3]
    • While linked data can be used in an enterprise system, the value for libraries is to encourage open use of bibliographic data. Institutions that "own" bibliographic data may be under constraints, legal or otherwise, that do not allow them to let their data be used openly. We need to overcome this out-dated concept of data ownership.
  • Lots of talk about LLD, but little about Library LOD. The data needs to be freed from restrictions and, as Ross suggested, preferably bulk downloads provided. To echo Ross's sentiment, should the BL ping VIAF or ID 10s of millions of times for information? The inability to share/include/use resources/data with minimal restrictions, from an array of sources, will negatively impact interoperability, the same "interoperability" from the mission statement.

Complexity

  • Resources to be aligned vary in their scopes and granularity levels, modeling principles (e.g. thesauri vs. classification systems), language and culture, and many other aspects.
  • Quality and size of resources involved in the alignment are heterogeneous.
  • Semantic enrichment may be needed for some vocabularies before vocabulary integration.
  • Linking instead of copy-and-merging
    • Local changes may become "second-class"
    • When Links Break...
  • Alignment tools (for metadata element sets and value vocabularies)
    • Research is more focused on ontologies than library general value vocabularies
    • Tools have scalability issues for large vocabularies
  • Provenance of alignment information has been only slightly touched
  • Management of mappings re. vocabulary evolution
    • Concept scheme evolution is challenging considering that even deprecated concepts and relationships must maintain an accessible URL
    • Mappings should be updated to take into account new elements in the mapped vocabularies
  • Persistence of resolvable URIs. In the short term, Linked Data facilitates mash-ups, but for the long term, the use of RDF and URIs holds out the possibility of preserving the meaning of content in a way that will remain accessible twenty years from now -- provided that the URIs on which it is based are not sold, re-purposed, or simply forgotten and remain resolvable to machine-readable documentation. For libraries, this implies not just preservation policies for locally owned URIs and associated content, but an active voice, as a community, in the long-term governance of the global Web's Domain Name System.
  • Provenance of triples. In Linked Data, statements may be merged from many sources, creating a graph whose statements may no longer be traceable to those sources. This problem can be solved in pragmatic, non-standard ways, but as institutions which historically were created to make citations resolvable, libraries have a stake in supporting the standardization of graph identification [1]; a sketch of a named-graph approach follows at the end of this list. One very practical related problem is what MacKenzie Smith has called "attribution stacking" -- how to credit the one hundred creators of a graph created from the merger of one hundred sources [2]. MacKenzie Smith refers to Provenance as one of the "three Ps of Linked Data", the other two being Persistence (see the previous bullet on persistence of resolvable URIs) and Policy (Karen's point #4 [3]).

[1] http://www.w3.org/2011/rdf-wg/wiki/TF-Graphs

[2] http://www.youtube.com/watch?v=uSmG1-hoZfE&t=43m43s

[3] http://lists.w3.org/Archives/Public/public-xg-lld/2011Feb/0044.html

  • Preservation of vocabularies. We can be reasonably certain that the Library of Congress will be around twenty years from now, so the persistence of http://id.loc.gov seems secure, though history shows that ultimately no institution is too big to fail. At the other extreme, useful vocabularies may be created by sponsored projects with a known expiration date. How can memory organizations, including libraries, better collaborate to ensure that ownership and responsibility for persistence of access (and possibly for ongoing maintenance duties) devolves over time to institutions committed to their preservation?
  • How do we keep the original data and linked data in sync? If changes happen to the linked data representation, how do we funnel that back into the original representation? Do we even want to?
  • The richer the data, the more complicated the dependencies: how do we prevent rats' nests of possible licensing issues (Karen raised this, as well)? Similarly, this web also creates an n+1 problem: there is always the potential of new URIs being introduced with each graph; how much is enough? How will a library know?
  • How do we deal with incorrect data that we don't own/manage?
  • As the graph around a particular resource improves in quality, how do these changes propagate around to the various copies of the data? How do libraries deal with the changes (not only regarding conflicts, but how to keep up with changes in the data model, with regard to indexing, etc.)?
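One pragmatic approach to the "provenance of triples" issue above is to keep each source's statements in a separately identified (named) graph, so that merged data remains traceable and attributable. The sketch below uses Python with rdflib's Dataset and TriG serialization; all URIs are placeholders.

```python
# Sketch: keeping merged statements traceable to their sources via named graphs.
from rdflib import Dataset, Literal, Namespace, URIRef

DCTERMS = Namespace("http://purl.org/dc/terms/")
EX = Namespace("http://example.org/")          # placeholder namespace

ds = Dataset()
ds.bind("dcterms", DCTERMS)

book = EX["bib/123"]

# Statements contributed by source A live in graph A...
graph_a = ds.graph(EX["graph/source-a"])
graph_a.add((book, DCTERMS.title, Literal("Pride and Prejudice")))

# ...and statements from source B live in graph B, so an aggregator can still
# attribute (or discard) each statement by the graph it came from.
graph_b = ds.graph(EX["graph/source-b"])
graph_b.add((book, DCTERMS.creator, EX["person/jane-austen"]))

# Provenance about each graph can be attached in the default graph.
ds.add((EX["graph/source-a"], DCTERMS.source, URIRef("http://example.org/catalogue-a")))

print(ds.serialize(format="trig"))
```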

Systems

  • What is the impact of Linked Data on current systems? There are different scenarios, or different steps. Linked Data can first be implemented just as another protocol for exposing the data (like, for instance, OAI-PMH), thus with little impact on systems and production processes. Or, Linked Data can imply completely re-thinking the way we model, produce and use our data: a revolution. Does our group have any recommendations or comments on these different perspectives?
  • Library systems and Linked Data - should all library systems adopt Linked Data? Does it make any sense at all that every library exposes its own data? Does Linked Data imply more centralized systems, where most of the data is pooled (mutualized)?
  • Systems vendors need to be involved. Libraries generally do not do systems development, but purchase systems from vendors who develop for libraries.

Software and Applications

  • Lack of open licenses to support APIs, data standards and client software in the linked data environment. [3]
  • No systems available on the market for linked data creation and use
  • Open source solutions available are in an unfinished state
  • Semantic Web infrastructure and tools (triple stores, editors, ...) are perceived as not very mature and/or spartan. This increases the barrier for adopting SemWeb technology.

State of Semantic Web Development

  • Available semantic mapping properties have sometimes been overused or used inappropriately. For example, 'owl:sameAs' has been used to express many kinds of mappings, beyond its original formal semantics (a sketch contrasting it with a weaker SKOS mapping property follows at the end of this list).
  • Ontologies and metadata element sets, even widely used ones such as FOAF, are by some perceived to be inadequate, missing vital elements, or strangely designed.
    • For example, FOAF does not always have inverse properties, making it harder to design a display that describes an RDF resource by listing its properties and values.
    • When properties are missing this can result in "islands" of unconnected groups of resources
  • There is a general sparseness of linkage in the LOD cloud.
  • Use of RDF/XML as a serialization for RDF has proven to be a bit of a hurdle for web developers not already familiar with semantic web technologies.
  • Universal identifiers are hard to administer and are even harder to get widely adopted
  • Is linked data scalable to the size we need?
  • Is linked data appropriate for highly hierarchical data models?
  • Is the current LD technology stack suitable? If libraries are to begin sharing the very information crucial to bibliographic description (a mere link to a subject heading versus a string's presence in the data), in no small part by relying on data from external sources, do specific technological requirements need to be defined to support look-up services, not only of known resources but of yet-to-be-matched strings? SPARQL endpoints have not been widely implemented in existing LLD implementations.
  • What technological needs will be required, if any, given the potential scope of change that could accompany a new data model? Perhaps for LLD, very little is needed beyond the current technology stack. That would make any new data model an auxiliary model to the current one, no?
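To illustrate the owl:sameAs point in the first bullet of this list, the sketch below (Python with rdflib, placeholder URIs) contrasts owl:sameAs, which asserts full identity of two resources, with skos:closeMatch, which is usually closer to what a vocabulary alignment actually means.

```python
# Sketch: owl:sameAs vs. a weaker SKOS mapping property (placeholder URIs).
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import OWL

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")
g = Graph()
g.bind("owl", OWL)
g.bind("skos", SKOS)

local_heading = URIRef("http://example.org/subjects/world-war-1914-1918")
other_concept = URIRef("http://example.org/other-thesaurus/first-world-war")

# owl:sameAs asserts full identity: under OWL semantics every statement about
# one resource becomes a statement about the other. Often too strong for
# headings that merely cover similar ground.
g.add((local_heading, OWL.sameAs, other_concept))

# skos:closeMatch (or exactMatch/broadMatch/...) asserts only that the two
# concepts can be used interchangeably in some applications -- usually what
# a vocabulary alignment actually intends.
g.add((local_heading, SKOS.closeMatch, other_concept))

print(g.serialize(format="turtle"))
```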



Organized Set Stops Here


to add in

From the Use Cases (RAW)

Missing Vocabularies

  • Vocabularies and element sets in widespread use in library legacy metadata are not available in RDF. [12]
  • Current mapping vocabularies in RDF reveal some expressivity issues:
    • As more types of value vocabularies are aligned, there can be a gap between the entities to be aligned and what mapping vocabularies can express. These include specific support in SKOS for:
      • expressing mapping between compound concepts represented in a pre-coordinated and a post-coordinated vocabulary,
      • expressing mapping between concepts from vocabularies that have different structures, for example, a classification system and a thesaurus.
    • More investigation is needed on links between entities of different semantic types, e.g., Concepts vs. RWO (Real World Objects), Concepts vs. (OWL) Classes, RWO to RWO, etc.
    • In addition to the type of link (mapping [property]), other information such as provenance (creation methods, e.g., automatic vs. manual mapping, degree of reliability, etc.) would probably be required to ease sharing and re-use of alignment results.
    • There is a need for generalized mapping rules for the degrees of mapping.
  • Usage of standard patterns for mapping is not yet a reality
    • There are many proposals, not all yet agreed-upon.[1]
    • Available semantic mapping properties have sometimes been overused or used inappropriately. For example, 'owl:sameAs' has been used to express many kinds of mappings, beyond its original formal semantics.
    • Although there are new cases of SKOS being used as a pattern in vocabularies beyond KOS, such as MADS and VIAF, a comprehensive study is needed of how effective this approach is for vocabulary mapping.
  • sometimes specific (physical state of original in a preservation context), sometimes general (need vocabularies for preservation data), but no vocabulary for the function or data elements
  • There is some uncertainty about whether we should be using some IFLA-sanctioned version of the FRBR vocabulary, or if using Ian Davis's vocabulary is good enough.
  • no standard way to express bibliographic data (or related library data) as RDF (either agreed via standards bodies, or simply through standard practices)
  • Ontologies and metadata element sets, even widely used ones such as FOAF, are by some perceived to be inadequate, missing vital elements, or strangely designed.
    • For example, FOAF does not always have inverse properties, making it harder to design a display that describes an RDF resource by listing its properties and values.
    • When properties are missing this can result in "islands" of unconnected groups of resources

Data Incompatibilities or Lacks

  • For some items, detailed content information is in free text fields within the library MARC record (e.g. MARC 505 table of contents field)
  • Some material format information is stored in free text fields (e.g. MARC 300)
  • The course material recorded in the library catalogue at the Open University is more heterogeneous in reality than is perhaps expressed in the library catalogue records. Specifically, there are questions as to whether it is appropriate to model audio-visual material in the same way as print material – this question applies more generally across the library sector.
  • Some existing work to model library catalogue data seems to be focused on 'bibliographic' material - a specific example is the proposed ISBD ontology (http://metadataregistry.org/vocabulary/list.html). It is not clear it would be appropriate to apply ISBD properties to audio-visual material
  • Also many pieces of data you may wish to standardise as single entities in RDF (e.g. Publisher, Place of Publication) are recorded as free text in MARC, with no 'authority control' and so may require work to map to single entities, or result in accidental duplicate identities being created for the same entity
  • current data is free text, but contains quantitative information that needs to be pulled out
  • data needs to be qualified as "estimated" or "derived" so users know it is not precise (this is possibly a vocabulary issue)
  • current practice does not include rich relationships, just "related," so there is no source of relationships
  • There is a general sparseness of linkage in the LOD cloud.
  • Resources to be aligned vary in their scopes and granularity levels, modeling principles (e.g. thesauri vs. classification systems), language and culture, and many other aspects.
  • Quality and size of resources involved in the alignment are heterogeneous.
  • Semantic enrichment may be needed for some vocabularies before vocabulary integration.
  • Lack of URIs for relevant metadata components. [1]
  • Lack of critical mass of linked data for legacy records. [5], [10]

Community Guidance/Organization Issues

  • It would be useful to be able to document how various vocabularies are being used in the RDF delivered by Chronicling America. Possible ways to do this could be Dublin Core Application Profiles, or VoID (a sketch of a VoID description follows at the end of this list).
  • Use of RDF/XML as a serialization for RDF has proven to be a bit of a hurdle for web developers not already familiar with semantic web technologies.
  • When ontologies/metadata schemas/vocabularies overlap, which should be used?
  • Publishing Linked Data requires expertise which is often not available at the institutions that wish to publish Linked Data, including, among others:
    • Best practices
    • Examples
    • Suggested vocabularies and metadata schemas to use
    • Best practices in connecting library material and other types of resources, including courses (e.g., through reading lists, references), A/V material, and available open educational material (which might be repurposed from existing resources) need to be created.
  • Institutions may not have the time themselves to convert their data to Linked Data, but may be willing to allow others to do it. This requires coordination with communities willing to make that effort.
  • It is hard to find personnel well-versed in Semantic Web technology. Job sites (e.g. the Dutch MonsterBoard) do not have/list personnel versed in standards such as RDF.
  • Universal identifiers are hard to administer and are even harder to get widely adopted
  • Linking instead of copy-and-merging
    • Local changes may become "second-class"
    • When Links Break...
  • Provenance is a big, scary unknown
  • no examples in our community domain that we can follow
  • lack of information on how to create a data model
  • no community guidance on which technologies and vocabularies to use
  • Cost of intellectual work for vocabulary mapping, esp. for complex metadata element sets or large value vocabularies.
  • Diversity of users' needs regarding the alignment quality. It is difficult to reach a consensus about what is a good mapping for any project.
    • There will be different applications of a vocabulary alignment. The same alignment may be performed differently in different scenarios. Users' needs are diverse and can have a direct impact on mapping practices and results.
    • Patterns for vocabulary alignment can be multiple and decisions have to be made based on the assessments. For example, there are direct mapping and backbone (hub) mapping approaches. Within the same project different patterns may be combined.
  • Copyright and licensing can influence mapping strategies and their implementation.
    • Selection of sources is largely influenced by their availability.
    • There is a need to differentiate between the licensing and use of metadata about digital asset and the licensing and use of the assets themselves.
  • The ownership of the alignment data is still an unclear area. Where should one publish mappings if multiple vocabularies are involved? Who owns the original data and mapping expression?
  • Reliability of data sources that are targets of alignments is critical yet difficult to discover without certain levels of investment. ("See also" under 'Missing Vocabularies' section.)
  • Re-alignments may occur due to the updates of the vocabularies involved. Participants need to be informed about other vocabularies' updating policies, workflow, and frequencies. Such updating needs to be incorporated in the mapping results or routine. (See also under 'Technology availability/questions' section.)
  • Lack of open licenses for re-use of linked data. [3]
  • Lack of stable community-wide mappings for common legacy metadata formats to RDF classes and properties. [12]
  • Stability and availability of dereferencing services, triple stores, data maintenance, and synchronisation. [12]
  • No consensus on the components of an optimal reusable package of RDF metadata for applications using library linked data. [12]
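Regarding the first bullet in this list (documenting which vocabularies a dataset uses), a VoID description is one lightweight option. The sketch below, in Python with rdflib, shows the general shape; the namespace URIs are real, but the dataset URI, title, and links are invented placeholders rather than Chronicling America's actual data.

```python
# Sketch: documenting a dataset's vocabularies with a VoID description.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

VOID = Namespace("http://rdfs.org/ns/void#")
DCTERMS = Namespace("http://purl.org/dc/terms/")

g = Graph()
g.bind("void", VOID)
g.bind("dcterms", DCTERMS)

dataset = URIRef("http://example.org/dataset/newspapers")   # placeholder
g.add((dataset, RDF.type, VOID.Dataset))
g.add((dataset, DCTERMS.title, Literal("Example newspaper metadata")))

# Declare the vocabularies whose terms appear in the data.
for vocab in ("http://purl.org/dc/terms/",
              "http://xmlns.com/foaf/0.1/",
              "http://www.w3.org/2004/02/skos/core#"):
    g.add((dataset, VOID.vocabulary, URIRef(vocab)))

# Optional pointers that help consumers: an example resource and a data dump.
g.add((dataset, VOID.exampleResource, URIRef("http://example.org/resource/issue-1")))
g.add((dataset, VOID.dataDump, URIRef("http://example.org/dumps/newspapers.nt.gz")))

print(g.serialize(format="turtle"))
```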

Technology Availability/Questions

  • Lack of open licenses to support APIs, data standards and client software in the linked data environment. [3]
  • Alignment tools (for metadata element sets and value vocabularies)
    • Research is more focused on ontologies than library general value vocabularies
    • Tools have scalability issues for large vocabularies
  • Provenance of alignment information has been only slightly touched
  • Management of mappings re. vocabulary evolution
    • Concept scheme evolution is challenging considering that even deprecated concepts and relationships must maintain an accessible URL
    • Mappings should be updated to take into account new elements in the mapped vocabularies
  • Taking over aligned vocabularies in a local repository remains an option
  • Handling mapping to not-yet LD-published vocabularies
  • Diversity of application environments, e.g., integration into a language technology processing pipeline, potentially crossing technology stacks (RDF & XML based Web Services)
  • is linked data scalable to the size we need?
  • is linked data appropriate for highly hierarchical data models?
  • no systems available on market for linked data creation and use
  • open source solutions available are in an unfinished state
  • Does the institutional context for OpenURL mean that it doesn’t necessarily fit with global identifiers as deployed in Linked Data? Does it call for the need for a global URL space/service that redirects to an institutional context (e.g. like the Shibboleth Where-Are-You-From service)?
  • Is citation different from identification?
  • What about citing things that aren’t publications, e.g. datasets.
  • formats for citations: very closely related to Bibliography Cluster
  • Semantic Web infrastructure and tools (triple stores, editors, ...) are perceived as not very mature and/or spartan. This increases the barrier for adopting SemWeb technology.
  • There are very few comprehensive data sets based on directories of libraries, including address and other location information, and contact and other agent information. The information tends to be fragmented into consortia, sector, and subject-based groupings.
  • The International Standard Identifier for Libraries and Related Organisations (ISIL) offers a method for assigning an identifier to a library organization. Organizations do not always register individual physical branch libraries. The ISIL Registration Authority coordinates national agencies for assigning identifiers. There is no consolidated list of libraries and codes, and each agency uses a different interface to list or search national identifiers.
  • There are some national services using identifiers for libraries, but they generally do not interoperate.
  • There is no standard typology of memory institution types. Existing typologies are tied to one of the three primary entities in collection-level description: Collection itself, Location (sub-entities physical and electronic), and Agent (sub-entities person and corporate body/family). Typologies may be based on material format of the items in the collection, administration or curation of the collection, architecture and type of building housing the collection, audience level, subject, etc. Developing a single typology that fits all entities is difficult.

From Email (RAW)

Karen

1) Community agreement and leadership There are many in the community who are either not interested in LLD, don't know about LLD, or who are actually opposed to LLD. At the moment, there are no centers of leadership to facilitate such a major change to library thinking about its data (although IFLA is probably the most active).

2) Funding It is still quite difficult to convince potential funders that this is an important area to be working in. This is the "chicken/egg" problem, that without something to show funders, you can't get funding.

3) Legacy data The library world has an enormous cache of data that is somewhat standardized but uses an antiquated concept of data and data modeling. Transformation of this data will take coordination (since libraries share data and systems for data creation). But before it can be transformed it needs to be analyzed and there must be a plan for converting it to linked data. (There is a need for library systems to be part of this change, and that is very complex.)

4) Openness and rights issues While linked data can be used in an enterprise system, the value for libraries is to encourage open use of bibliographic data. Institutions that "own" bibliographic data may be under constraints, legal or otherwise, that do not allow them to let their data be used openly. We need to overcome this out-dated concept of data ownership.

5) Standards Libraries need to take advantage of the economies of scale that data sharing affords. This means that libraries will need to apply standards to their data for use within libraries and library systems.

Tom

Persistence of resolvable URIs. In the short term, Linked Data facilitates mash-ups, but for the long term, the use of RDF and URIs holds out the possibility of preserving the meaning of content in a way that will remain accessible twenty years from now -- provided that the URIs on which it is based are not sold, re-purposed, or simply forgotten and remain resolvable to machine-readable documentation. For libraries, this implies not just preservation policies for locally owned URIs and associated content, but an active voice, as a community, in the long-term governance of the global Web's Domain Name System.

2. Provenance of triples. In Linked Data, statements may be merged from many sources, creating a graph the statements of which may no longer be traceable to those sources. This problem can be solved in pragmatic, non-standard ways, but as institutions which historically were created to make citations resolvable, libraries have a stake in supporting the standardization of graph identification [1]. One very practical related problem is that which MacKenzie Smith has called "attribution stacking" -- how to credit the one hundred creators of a graph created from the merger of one hundred sources [2]. MacKenzie Smith refers to Provenance as one of the "three Ps of Linked Data", the other two being Persistence (see #1 above) and Policy (Karen's point #4 [3]).

[1] http://www.w3.org/2011/rdf-wg/wiki/TF-Graphs

[2] http://www.youtube.com/watch?v=uSmG1-hoZfE&t=43m43s

[3] http://lists.w3.org/Archives/Public/public-xg-lld/2011Feb/0044.html

3. Preservation of vocabularies. We can be reasonably certain that the Library of Congress will be around twenty years from now, so the persistence of http://id.loc.gov seems secure, though history shows that ultimately no institution is too big to fail. At the other extreme, useful vocabularies may be created by sponsored projects with a known expiration date. How can memory organizations, including libraries, better collaborate to ensure that ownership and responsibility for persistence of access (and possibly for ongoing maintenance duties) devolves over time to institutions committed to their preservation?

4. When to coin new terms, when to re-use, and how to align.

Peter

  • Teaching catalogers how to move from a "records" mental model to a "graph" mental model of interrelated statements. This requires a fundamental shift of perspective that bundles up other related issues, among them that text labels for identifiers are not good enough.

This is possibly tangential to our charge, but I think our report needs to stake out some territory here for further development because there is a huge group of library professionals that are still trying to figure out how to shoehorn linked data in to the MARC record structure.

Ross

1) Where to start? To convert a dataset of any significant size, we'll need name authorities, subject thesauri, controlled vocabulary terms, etc. If everyone does this in isolation, minting their own URIs, etc., how is this any better than silos of MARC records? How do institutions the size of University of Michigan or Stanford get access to datasets such as VIAF so they don't have to do millions of requests every time they remodel their data? How do they know which dataset to look in for a particular value? What about all of the data that won't be found in centralized datasets (local subject headings, subject headings based on authorities with floating terms, names not in the NAF, etc.)?

2) How do we keep the original data and linked data in sync? If changes happen to the linked data representation, how do we funnel that back into the original representation? Do we even want to?

3) The richer the data, the more complicated the dependencies: how do we prevent rats' nests of possible licensing issues (Karen raised this, as well)? Similarly, this web also creates an n+1 problem: there is always the potential of new URIs being introduced with each graph; how much is enough? How will a library know?

4) How do we deal with incorrect data that we don't own/manage?

5) As the graph around a particular resource improves in quality, how do these changes propagate around to the various copies of the data? How do libraries deal with the changes (not only regarding conflicts, but how to keep up with changes in the data model, with regard to indexing, etc.)?

6) Piggybacking on Karen's "chicken or the egg" problem, who will be first to take the plunge? What is the benefit for them to do so? In the absence of standards, will their experience have any influence on how standards are created (that is, will they go through the work only to have to later retool everything)?

Emmanuelle

1) What's the impact of Linked Data on current systems? There are different scenarios, or different steps. Linked Data can first be implemented just as another protocol for exposing the data (like, for instance, OAI-PMH), thus with little impact on systems and production processes. Or, Linked Data can imply completely re-thinking the way we model, produce and use our data: a revolution. Does our group have any recommendations or comments on these different perspectives?

2) Library systems and Linked Data - should all library systems adopt Linked Data? Does it make any sense at all that every library exposes its own data? Does Linked Data imply more centralized systems, where most of the data is mutualized?

3) Linked Data standards and MARC - going back to question 1, we can envision Linked Data as another interoperability framework that will come in addition to source formats (Dublin Core never replaced MARC, but it is still useful for exchanging data with other communities). Is RDF really a candidate to replace MARC? In what timeframe? What does it mean for legacy data: is it possible to convert it without losing information? What tools are needed to do so? Is it possible to share them?

Antoine

+ Need to carefully articulate library value vocabularies (concepts, terms) with the real-world entities they stand for. Library entities are "proxies" for real things, which linked data has as a core focus. Perhaps this needs some education of the LD community, i.e., convincing them that there is any value in having such proxies, and identifying appropriate mechanisms (i.e. efficient in terms of data creation and consumption) to represent this (à la skos:Concept + foaf:focus).

+ Communication/cooperation with the wider cultural sector. Problems are quite similar across LAM actors.

+ Bouncing back on Ross' "Where to start?", a "Where to stop" complement: libraries should perhaps learn to rely on data produced by others and not try to produce every required data element by themselves (gazetteers come to mind).

+ versioning and updates. When/how to disseminate change notifications, how data consumers should integrate them (problem of deprecation/removal of triples).

Kevin

1. Clear Purpose and Objective with LLD

What's the case for LLD? What problem does *Library* LD help to solve? How will it be of benefit to libraries (in addition to and/or improvements on current practices, current standards, current costs, etc)? How will LLD help libraries better serve their users?

This group's charter notes that "the mission of the Library Linked Data incubator group is to help increase global interoperability of library data on the Web ..." [1]. This is fine, but what is the benefit of this activity to libraries, which will be asked to pour resources into this activity? To ask this in a more challenging way: So what? How are libraries losing out now? What's broken that needs fixing?

Will LLD only benefit data consumers? Who are these data consumers? How will it benefit my mother, or yours? Will it benefit data creators (catalogers, for example)? Will it benefit those managing library technology?

I feel that whatever follows in this document should further illuminate the answers to these types of questions.

FWIW, it's fine that end users (my mother, e.g.) benefit only indirectly. Perhaps even catalogers benefit indirectly. I'm not suggesting there are right or wrong answers here, but that we identify how and to whom LLD will be of benefit (though, generically, it absolutely must be of benefit to libraries).

It might be that these questions are closely related to point 2.


2. Attention to Education and Outreach

With 1, general education will be a must, as Karen rightly noted. Good answers to 1 will make this easier by providing focus to education efforts. It might also help to convince decision makers to direct resources toward LLD projects.


3. Open Data

Lots of talk about LLD, but little about Library LOD. The data needs to be freed from restrictions and, as Ross suggested, preferably bulk downloads provided. To echo Ross's sentiment, should the BL ping VIAF or ID 10s of millions of times for information?

The inability to share/include/use resources/data with minimal restrictions, from an array of sources, will negatively impact interoperability, the same "interoperability" from the mission statement.


4. Data Modeling and Legacy Data

I think these are two sides of the same coin and should be treated simultaneously. Data modeling questions should be asked in light of current and future practices. Is a new data model needed for LLD? If so, how far must it depart from current models (and why)? If a new data model is to be recommended, what is the scope and purpose of this data model? Only for LLD purposes (which is the scope of the current charter)? Or meant to replace the current model completely (which, while not mutually exclusive, is technically beyond the current charter)? What impact, if any, does a new data model have on legacy data?

There's been talk about data modeling and a general understanding that something different is needed. It is beyond the scope of this group to recommend a detailed solution, but this group should be able to talk about how current data models are insufficient to the task and make general recommendations in light of those reservations. It should be clear how proposed models will (positively) impact the audience members identified in 1.


5. Technology systems

Is the current LD technology stack suitable? If libraries are to begin sharing the very information crucial to bibliographic description (a mere link to a subject heading versus a string's presence in the data), in no small part by relying on data from external sources, do specific technological requirements need to be defined to support look-up services, not only of known resources but of yet-to-be-matched strings? SPARQL endpoints have not been widely implemented in existing LLD implementations.

What technological needs will be required, if any, given the potential scope of change that could accompany a new data model? Perhaps for LLD, very little is needed beyond the current technology stack. That would make any new data model an auxiliary model to the current one, no?


Kevin

[1] http://www.w3.org/2005/Incubator/lld/

Jodi

  • Handling legacy data
  • Getting sufficient cataloger trust and buy-in -- which requires education on both sides
  • Learning to rely on others (as Antoine says: "libraries should perhaps learn to rely on data produced by others and not try to produce every required data by themselves")
  • Licensing legacy data: what can we open
  • Economic issues around ongoing data production

Some are ways of ensuring that we *can* rely on others in robust and authoritative ways: Trust & provenance -- to aggregate rich statements from everyone, yet filter to trusted authorities.

    • "Trusted authorites" need not be a direct list, but could also be "everyone X trusts", "everyone anyone in group G trusts" (and more generally "everyone trusted by more than x% of group G"), "everyone not on blacklist B", and combinations of these.
    • "Trust" will also need to take into account "proximity" to relevant information (by expertise, proximity to archival source material, etc)

Finally, I have to emphasize a universal issue -- which I think preservation-oriented orgs like libraries take more seriously than most: Continued resolution of URIs (as Tom has expressed quite well)

Ed

There has been some really good content in this thread so far. I really liked the point that Antoine and Jeff identified regarding what pre-web libraries have traditionally called "surrogates" and the need for such a notion on the web--in particular in the Linked Data space. It is an extremely important point which will largely affect how well library data will fit in with the Linked Data community, and the Web in general.

I think this very specific point ripples out quite a bit, into how vocabularies are used to describe library materials. Perhaps it is too ambitious but I would like the final report to make recommendations about what vocabularies are useful for making library linked data available, and to identify places where new vocabulary is needed.

Kevin and Emmanuelle's point about needing to come up with a compelling elevator pitch is also extremely important. I would like to see some pretty clear language in the report describing a) why library system developers might want to consider using Linked Data, and b) why library professionals should make Linked Data support a requirement when purchasing or developing systems.

Marcia

I believe that LIS professionals have two roles: as the contributor (in most cases) and as the primary user.

Asaf

Just in case Jodi got the impression this didn't resonate -- I think it's an excellent summary of very real concerns for libraries, and furthermore, concerns that are _key_, i.e. take the first few slots in a prioritized list of concerns, for libraries.

I therefore think these should be well covered in our final report.