Cluster Digital Objects

From Library Linked Data
Revision as of 11:49, 26 July 2011 by Aisaac (Talk | contribs)

Authors: Mark van Assem, Asaf Bartov, Jodi Schneider

Background

This cluster concerns use cases where establishing groups of related digital objects is central.

Digital objects can be anything, ranging from PDF files of scientific papers to JPEG images, web documents, and multimedia files such as MPEG4 movie files.

Digital objects may be used individually, but they are often grouped in collections. These groups may be thematic (Use Case Civil War Data 150: digital materials about the Civil War), contextual (Use Case NDNP: newspaper articles published together in one newspaper issue), or format-based (all the movies on my hard drive). Groups may arise in many ways: by digitizing an existing collection (Use Case NLL Digitized Map Archive), by collecting items in the course of creating or remixing them (researchers collecting the papers they have written, along with the associated images, datasets and project partners), or by selecting from commercially or publicly available sources (an iTunes music library). Items may be collected collaboratively by a group for its own use, either explicitly or implicitly (e.g. a research unit tracking relevant literature), or may be professionally curated for others' use (e.g. librarians keeping a record of recent discoveries in a particular field). Collectors may store the full object, but in some cases they store only links to external objects (e.g. bookmarks to web pages published by other parties).

Digital objects may be reformatted from other sources (e.g. scanned version of a 1908 journal paper) and may be representations of non-digital objects (digital photograph of a sculpture). In some cases groups are implicit and defined through machine processing (all files containing the word 'wind' found by searching a computer).

Representing the grouping of objects is necessary here. In some cases there is also a need to indicate the particular reason for grouping them, or other relationships between particular objects. For instance, a book may discuss a topic similar to that of another book, or may have been written by a particular author who is described by an authority file available elsewhere on the Web. Besides representing all this information, users want to be able to search and browse this data conveniently, and to be supported in discovering new relationships between objects. All the information should be published on the Web in a way that allows others to browse and process the objects and groups.

Topic in the Context of Linked Data

Semantic Web technology and the Linked Data networks built with it have proven to be a flexible means to publish and link data on the Web. The ability to link to content elsewhere on the Web is especially important in the context of this use case cluster. The links can be used to integrate disparate data sources, and also to define new relationships between the data. For this cluster, this makes it possible to represent groups of related resources not only within one data source, but also across data sources.

The convergence of several communities of practice towards Semantic Web technology has led to existing standards being specified in RDF and OWL, making them available and reusable for others. For example, the availability of Dublin Core in RDF allows others to use Dublin Core in their own data, which is then immediately readable by any processor that understands the Dublin Core RDF Schema. In the context of this cluster, the OAI-ORE standard for representing groups of resources, now available in RDF, is particularly relevant.
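To make the role of OAI-ORE more concrete, the following is a minimal sketch of how a group of digital objects could be described as an ORE aggregation and written out as N-Triples. The ore: and dcterms: property URIs are the published ones; all example.org and example.net URIs, the title, and the helper function names are hypothetical. The serializer is deliberately simple and only escapes backslashes and double quotes in literals.

```python
# Sketch: describe a group of digital objects as an OAI-ORE aggregation.
# All example.org / example.net URIs are hypothetical placeholders.
ORE = "http://www.openarchives.org/ore/terms/"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def aggregation_triples(aggregation_uri, title, member_uris):
    """Return (subject, predicate, object) triples for one aggregation."""
    triples = [
        (aggregation_uri, RDF_TYPE, ("uri", ORE + "Aggregation")),
        (aggregation_uri, DCTERMS + "title", ("literal", title)),
    ]
    for member in member_uris:
        # Members may live on other servers; only their URIs are recorded.
        triples.append((aggregation_uri, ORE + "aggregates", ("uri", member)))
    return triples


def to_ntriples(triples):
    """Serialize triples as N-Triples lines (escapes only backslash and quote)."""
    lines = []
    for subject, predicate, (kind, value) in triples:
        if kind == "uri":
            obj = "<%s>" % value
        else:
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            obj = '"%s"' % escaped
        lines.append("<%s> <%s> %s ." % (subject, predicate, obj))
    return "\n".join(lines)


issue = aggregation_triples(
    "http://example.org/issue/1908-05-01",
    "Newspaper issue of 1 May 1908",
    ["http://example.org/page/1", "http://other.example.net/page/2"],
)
print(to_ntriples(issue))
```

Because the second member URI points to a different host, the same structure covers groups that span data sources.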

Scenarios (Case Studies)

In this section we should indicate the Goals of each use case

  • Use Case Digital Text Repository
    • SEARCH/BROWSE: Explore the text repository through rich interlinks between works, authors, and topics.
    • REUSE-VOCABS and RELATE (New): Describe and catalog holdings of digital text repositories quickly (i.e. without much original research) and efficiently (i.e. without replication of much metadata)
    • REUSE-VOCABS, URIS, and PUBLISH: Expose and share entities from digital text repositories ensuring maximal discoverability and ease of delivery.
  • Use Case NDNP (National Digital Newspaper Program)
    • URIS: To give each newspaper title, issue and page a unique URL to enable citation.
    • RELATE (new): To contextualize newspaper content by associating it with other content on the web.
    • PUBLISH: To allow digital objects (titles, issues and pages) and their associated bitstreams (pdf, jp2, ocr/xml) to be meaningfully harvested out of the web application, so that the data can be re-purposed, and preserved elsewhere.
    • API: To provide an API for third parties to use the content in their own environments without needing to harvest the actual content.
  • Use Case Publishing 20th Century Press Archives
    • RELATE (aggregate): Folders giving access to multiple documents, and search results accessed via a single URI, are typical aggregation cases.
    • RELATE (new): To provide context from metadata and link to other data relevant to the domain.
    • REUSE-VALUE-VOCABS: VIAF, Geonames, GND are mentioned as targets of enrichment for providing a better context.
    • PUBLISH: Harvesting, and the requirement "To support the use of a standard image and metadata viewer based on METS/MODS", put publication of data at the core.

Scenarios (Extracted Use Cases)

  • Grouping: users should be enabled to define groups of resources on the web that for some reason belong together. The relationship that exists between the resources is often left unspecified. Some of the resources in a group may not be under control of the institution that defines the groups.
  • Enrichment: users should be enabled to link resources together, e.g. related descriptions, persons, topics, etcetera. For example, a poem in a digital text repository may be linked to the poet as defined in an authority file elsewhere on the Web. Fine-grained linkage could even be made at the level of individual terms in a document.
  • Browsing: users should be supported in browsing through groups and resources that belong to the groups. Interlinks should allow the user to explore the connections between resources.
  • Re-use: users should be enabled to re-use all or parts of a collection, with all or part of its metadata, elsewhere on the linked Web.
  • ...
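The grouping, enrichment and browsing scenarios above can be sketched with a small in-memory set of triples. This is an illustration under stated assumptions, not a proposal for an implementation: the property URIs are the published ORE and Dublin Core terms, while every example.org URI and the function names are hypothetical.

```python
# Sketch: grouping, enrichment and browsing over a small set of triples.
# All example.org URIs are hypothetical placeholders.
ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"
DC_CREATOR = "http://purl.org/dc/terms/creator"

triples = set()


def link(subject, predicate, obj):
    """Record one statement about a resource."""
    triples.add((subject, predicate, obj))


def objects(subject, predicate):
    """Follow a link forwards, e.g. from a group to its members."""
    return sorted(o for (s, p, o) in triples if s == subject and p == predicate)


def subjects(predicate, obj):
    """Follow a link backwards, e.g. from a poet to all linked poems."""
    return sorted(s for (s, p, o) in triples if p == predicate and o == obj)


group = "http://example.org/group/sea-poems"
poem1 = "http://example.org/poem/1"
poem2 = "http://example.org/poem/2"
poem3 = "http://example.org/poem/3"
poet = "http://authority.example.org/person/42"  # defined in an external authority file

# Grouping: the reason why the poems belong together is left unspecified.
link(group, ORE_AGGREGATES, poem1)
link(group, ORE_AGGREGATES, poem2)

# Enrichment: poems are linked to the poet's authority record.
link(poem1, DC_CREATOR, poet)
link(poem3, DC_CREATOR, poet)

# Browsing: from a group to a member, to its poet, to the poet's other works.
members = objects(group, ORE_AGGREGATES)
poets = objects(members[0], DC_CREATOR)
other_works = [w for w in subjects(DC_CREATOR, poets[0]) if w != members[0]]
print(members, poets, other_works)
```

Note that the browsing step reaches poem3 even though it is not a member of the group, which is the kind of connection the interlinks are meant to expose.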

Relevant technologies

  • OAI-ORE
  • OAI-PMH
  • Graph/network visualization tools
  • triple stores and SPARQL endpoints
  • TEI and other digital object annotation technologies
  • others?

Relevant vocabularies

Metadata element sets

  • OAI-ORE in RDF
  • Dublin Core's isPartOf property (and its inverse, hasPart)
  • Dublin Core: to describe groups themselves, and parts of groups

Value vocabularies

The number of value vocabularies that are possibly relevant is potentially very broad, so we do not attempt to list them.

Problems and Limitations

Note: Model this section after: Cluster_VocAlign#Problems_and_limitations

EDITING this part; copy of text from the UCs, to be reformulated and categorized

Vocabularies for describing map-specific metadata will need to be identified. New vocabularies may need to be created if some data can not be expressed using existing ones.

Historical maps use place names that may have changed or disappeared since. This will create difficulties relating locations on the map to a geographical name taxonomy. Such a taxonomy would need to be enriched with historical names and their locations. The area covered by a particular place may have changed as well.
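One way to picture the enrichment described here is a gazetteer in which each name carries the period during which it was in use, so that a name found on a map is resolved relative to the map's date. The sketch below assumes such a structure; the place names, years, URIs and the `resolve` function are all invented for illustration and do not come from any existing taxonomy. Changes in the area covered by a place are not modelled.

```python
# Sketch: resolving a place name relative to the year of the map it appears on.
# Names, years and URIs are invented for illustration.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlaceName:
    place_uri: str
    name: str
    valid_from: Optional[int]  # first year the name was in use, None if unknown
    valid_to: Optional[int]    # last year the name was in use, None if still used


GAZETTEER = [
    PlaceName("http://example.org/place/1", "Oldtown", None, 1920),
    PlaceName("http://example.org/place/1", "Newtown", 1921, None),
    PlaceName("http://example.org/place/2", "Newtown", None, 1850),
]


def resolve(name, map_year, gazetteer=GAZETTEER):
    """Return the URIs of places that carried this name in the given year."""
    matches = set()
    for entry in gazetteer:
        if entry.name.lower() != name.lower():
            continue
        if entry.valid_from is not None and map_year < entry.valid_from:
            continue
        if entry.valid_to is not None and map_year > entry.valid_to:
            continue
        matches.add(entry.place_uri)
    return sorted(matches)


print(resolve("Newtown", 1840))  # a map from 1840
print(resolve("Newtown", 1950))  # a map from 1950
```

The same name resolves to different places depending on the map's date, and a name that has disappeared (here "Oldtown" after 1920) still resolves for older maps.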

Depending on applicability of OCR to recognizing place names and notes on maps, significant manual annotation work may be involved.

How should annotations from users be handled? Should users be able to change map metadata? If users are allowed to add annotations, an API for adding annotations would need to be provided.

  • The RDFa pages are generated dynamically from a relational database. Since building a meaningful display for the user requires information from different levels of the aggregation hierarchy and from associated metadata tables, performance is an issue. Besides database means such as materialized views, we try to solve this with caching strategies that leverage standard web technologies. These technologies also have to be applied to external linked data sources in order to guarantee availability and to achieve acceptable overall performance.
  • The granularity of the aggregations presented to the user is also an issue (e.g. the companies collection aggregates some 13,000 companies, which is far too many for display as well as for efficient harvesting).
  • For use in the DFG-Viewer, the aggregations have to be mapped to METS/MODS XML files at different granularities. So far no general mapping methodology from ORE to METS (let alone to MODS elements) exists, so currently we generate the files directly from the database.
  • The order of the documents within a folder, which generally follows the publishing date of the articles, is crucial (especially if a set of documents comes without any metadata). Currently the order is not expressed in RDF, but solely by convention (ascending document numbers). Sometimes it turns out that somebody messed around in a folder 20 or 50 years ago, and that the sequence has to be rearranged to meet users' expectations. Because identifiers (including the document number) are meant to be persistent, a "renumber" command is not an option. ORE provides a solution with ore:Proxy and xyz:hasNext / xyz:hasPrevious, but this comes with all the hassles of doubly linked lists and a large implementation overhead.
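The ordering problem in the last point can be sketched as follows: document identifiers stay fixed, and the sequence is kept in separate next/previous links that play the role of the xyz:hasNext and xyz:hasPrevious statements attached to proxies. The class, its method names and the document identifiers are hypothetical; the bookkeeping needed for a single move also gives an impression of the overhead mentioned above.

```python
# Sketch: keeping folder order apart from persistent document identifiers.
# The next/prev dictionaries stand in for hasNext / hasPrevious statements.
class FolderOrder:
    def __init__(self, document_ids):
        self.next = {}
        self.prev = {}
        self.first = None
        self.last = None
        for doc in document_ids:
            self.append(doc)

    def append(self, doc):
        """Add a document at the end of the sequence."""
        self.prev[doc] = self.last
        self.next[doc] = None
        if self.last is None:
            self.first = doc
        else:
            self.next[self.last] = doc
        self.last = doc

    def _unlink(self, doc):
        before, after = self.prev.pop(doc), self.next.pop(doc)
        if before is None:
            self.first = after
        else:
            self.next[before] = after
        if after is None:
            self.last = before
        else:
            self.prev[after] = before

    def move_after(self, doc, anchor):
        """Place doc directly after anchor; anchor=None moves doc to the front."""
        if doc == anchor:
            raise ValueError("cannot move a document after itself")
        if anchor is not None and anchor not in self.next:
            raise KeyError(anchor)
        self._unlink(doc)
        after = self.first if anchor is None else self.next[anchor]
        self.prev[doc], self.next[doc] = anchor, after
        if anchor is None:
            self.first = doc
        else:
            self.next[anchor] = doc
        if after is None:
            self.last = doc
        else:
            self.prev[after] = doc

    def __iter__(self):
        doc = self.first
        while doc is not None:
            yield doc
            doc = self.next[doc]


order = FolderOrder(["doc-0001", "doc-0002", "doc-0003", "doc-0004"])
order.move_after("doc-0004", "doc-0001")  # identifiers are not renumbered
print(list(order))
```

After the move, the display order no longer follows the ascending document numbers, while every identifier remains unchanged.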



Missing Vocabularies

  • There is some uncertainty about whether we should be using an IFLA-sanctioned version of the FRBR vocabulary, or whether using Ian Davis's vocabulary is good enough.
  • There is no standard way to express bibliographic data (or related library data) as RDF, whether agreed via standards bodies or simply established through common practice.
  • Ontologies and metadata element sets, even widely used ones such as FOAF, are perceived by some to be inadequate, missing vital elements, or strangely designed.
    • For example, FOAF does not always have inverse properties, making it harder to design a display that describes an RDF resource by listing its properties and values.
    • When properties are missing this can result in "islands" of unconnected groups of resources
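Both points can be illustrated on a small set of triples: a display can still list incoming links when a property has no declared inverse, and "islands" can be found by computing connected components while ignoring link direction. This is a sketch under assumptions: the ex: resources and both function names are hypothetical, and the prefixed names are plain strings rather than resolved URIs.

```python
# Sketch: listing incoming links without inverse properties, and finding "islands".
from collections import defaultdict

triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:made", "ex:poem1"),
    ("ex:carol", "foaf:made", "ex:poem2"),
]


def describe(resource, triples):
    """List outgoing properties and, phrased as 'is P of', incoming ones."""
    lines = []
    for s, p, o in sorted(triples):
        if s == resource:
            lines.append("%s: %s" % (p, o))
        if o == resource:
            lines.append("is %s of: %s" % (p, s))
    return lines


def islands(triples):
    """Group resources into connected components, ignoring link direction."""
    neighbours = defaultdict(set)
    for s, _p, o in triples:
        neighbours[s].add(o)
        neighbours[o].add(s)
    seen, components = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


print(describe("ex:bob", triples))
print(islands(triples))
```

In this data ex:carol and ex:poem2 form an island of their own: nothing links them to the other resources, so a user browsing from ex:alice would never reach them.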


Data incompatibilities or lacks

  • For some items, detailed content information is held in free text fields within the library MARC record (e.g. the MARC 505 table of contents field).
  • Some material format information is stored in free text fields (e.g. MARC 300).
  • The course material recorded in the library catalogue at the Open University is more heterogeneous in reality than is perhaps expressed in the library catalogue records. Specifically, there are questions as to whether it is appropriate to model audio-visual material in the same way as print material. This question applies more generally across the library sector.
  • Some existing work to model library catalogue data seems to be focused on 'bibliographic' material. A specific example is the proposed isbd ontology (http://metadataregistry.org/vocabulary/list.html). It is not clear that it would be appropriate to apply isbd properties to audio-visual material.
  • Many pieces of data that you may wish to standardise as single entities in RDF (e.g. Publisher, Place of Publication) are recorded as free text in MARC, with no 'authority control'. They may therefore require work to map to single entities, or may result in accidental duplicate identities being created for the same entity.
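As a rough illustration of the mapping work in the last point, free-text values can be reduced to a normalized key before an identifier is minted, so that trivial variants of the same string end up as one candidate entity. The sketch below is an assumption-laden simplification: the function names and sample strings are invented, and real reconciliation would also have to deal with abbreviations, name changes and manual review.

```python
# Sketch: clustering free-text publisher strings by a normalized key
# before minting one identifier per cluster. Sample strings are invented.
import re
import unicodedata


def normalize(text):
    """Lower-case, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return " ".join(text.split())


def cluster_by_key(values):
    """Group the original strings under their normalized key."""
    clusters = {}
    for value in values:
        clusters.setdefault(normalize(value), []).append(value)
    return clusters


publishers = [
    "Oxford University Press.",
    "  OXFORD  University Press :",
    "Oxford university press",
    "Éditions Exemple",
]
for key, variants in cluster_by_key(publishers).items():
    print(key, "<-", variants)
```

Strings that still differ after normalization (for example an abbreviated form of the same publisher) would remain separate clusters, which is where the accidental duplicate identities mentioned above tend to come from.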


Community guidance/organization issues

  • It would be useful to be able to document how various vocabularies are being used in RDF being delivered by Chronicling America. Possible ways to do this could be Dublin Core Application Profiles, or VoID.
  • Use of RDF/XML as a serialization for RDF has proven to be a bit of a hurdle for web developers not already familiar with Semantic Web technologies.
  • When ontologies/metadata schemas/vocabularies overlap, which should be used?
  • Publishing Linked Data requires expertise which is often not available at the institutions that wish to publish Linked Data. What is needed includes, among other things:
    • Best practices
    • Examples
    • Suggested vocabularies and metadata schemas to use
    • Best practices in connecting library material and other types of resources, including courses (e.g., through reading lists, references), A/V material, and available open educational material (which might be repurposed from existing resources) need to be created.
  • Institutions may not have the time to convert their data to Linked Data themselves, but may be willing to allow others to do it. This requires coordination with communities willing to make that effort.
  • It is hard to find personnel well-versed in Semantic Web technology. Job sites (e.g. the Dutch MonsterBoard) do not have/list personnel versed in standards such as RDF.
  • Universal identifiers are hard to administer and are even harder to get widely adopted
  • Linking instead of copy-and-merging
    • Local changes may become "second-class"
    • When Links Break...
  • Provenance is a big, scary unknown

Technology availability/questions

  • Semantic Web infrastructure and tools (triple stores, editors, ...) are perceived as not very mature and/or spartan. This increases the barrier for adopting SemWeb technology.

Cluster admin

(moved to the Talk page)