This wiki has been archived and is now read-only.

Use Case Digital Text Repository

From Library Linked Data



Digital Text Repository


Asaf Bartov

Background and Current Practice

Digital text repositories, in particular grassroots/volunteer initiatives (e.g. Project Gutenberg[1], Wikisource[2][3], Project Ben-Yehuda[4], Project Runeberg[5]), produce/procure, store, and deliver digital resources containing textual data, with varying amounts and coverage of appropriate metadata. Often, the metadata available amounts to a citation of the printed source the digital resource is based on; sometimes, even that is not made clear.

Wikisource, by far the most dynamic and participatory of the above projects, strives to enrich the metadata by linking to other resources, often other Wikimedia resources (e.g. Wikipedia entries), that are relevant to the text in question, e.g. encyclopedia entries on the author, the subject of the work, the period of composition, etc. Being a multi-lingual project (multiple single-language Wikisource sites are hosted by the Wikimedia Foundation[6]), Wikisource also connects translations in one Wikisource to the originals in the appropriate original-language Wikisource, where possible.

Project Gutenberg, Project Runeberg, and the Perseus Digital Library[7] (a scholarly rather than volunteer initiative) curate the digital texts they host at the level of books/volumes. For example, Walt Whitman's Leaves of Grass would be a single item in Project Gutenberg. This largely matches common practice in traditional (physical) libraries.

Conversely, Wikisource and Project Ben-Yehuda -- a volunteer-run digital repository of public domain Hebrew texts -- curate material as individual works rather than books/volumes. Each poem in Whitman's book, e.g. "Song of Myself", would be a separate resource in Wikisource.

Most importantly: none of these projects is committed to extensive legacy metadata (e.g. MARC21), and all are fairly flexible in choosing implementations.

Essentially, digital text repositories, unlike most libraries, actually produce new Manifestations (in FRBR terms), namely the digital editions of the texts they prepare. Ideally, digital text repositories wouldn't have to define Work and Expression entities at all; they would be able to refer to those entities as defined elsewhere (e.g. by an OpenLibrary-like project). Indeed, even their new Manifestations would need to be linked to the print Manifestations from which the digital editions were prepared. Those print Manifestations, again, can be entities defined remotely and referenced by links.
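This linking pattern can be sketched with plain Python tuples standing in for RDF triples. All URIs below are invented for illustration; frbr:embodimentOf loosely follows the FRBR Core vocabulary, while ex:transcribedFrom is a hypothetical predicate.

```python
# The repository's own new Manifestation: the digital edition it produced.
# (All URIs here are illustrative, not real identifiers.)
digital_edition = "http://example.org/repo/editions/leaves-of-grass-2010"

triples = [
    # Refer to an externally defined Expression (e.g. at an OpenLibrary-like
    # provider) instead of redefining Work/Expression entities locally.
    (digital_edition, "frbr:embodimentOf",
     "http://openlibrary.example/expressions/leaves-of-grass-1855-text"),
    # Link the digital edition to the remote print Manifestation it was
    # prepared from, again referenced purely by URI.
    (digital_edition, "ex:transcribedFrom",
     "http://openlibrary.example/manifestations/leaves-of-grass-1855-printing"),
]

for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

Because the remote entities are only referenced, not copied, the repository's local catalogue stays small and the heavy bibliographic description remains maintained elsewhere.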


Goals

(see Goals for formalized goal descriptions)

  1. SEARCH/BROWSE: Explore the text repository through rich interlinks between works, authors, and topics.
  2. REUSE-VOCABS and RELATE (New): Describe and catalog holdings of digital text repositories quickly (i.e. without much original research) and efficiently (i.e. without much replication of metadata).
  3. REUSE-VOCABS, URIS, and PUBLISH: Expose and share entities from digital text repositories, ensuring maximal discoverability and ease of delivery.

Linked Data can allow digital text repositories to harness large metadata- and authority-providers, as well as other data providers, to describe their holdings.

Target Audience

  1. The general public.
  2. Digital text repository editors/curators

Use Case Scenario

Browsing a Rich Repository

Patrons of a digital text repository browse and query the repository along several vectors -- authors, work titles, topics, genres, periods, etc. Given the query "John Milton" (NB: not "Milton, John"), the patron expects to receive not only works by Milton, but also works about Milton, and perhaps even fictional works in which Milton appears as a character.

A given item offers navigation links to related items, authors, topics, etc.
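A minimal sketch of how such a query could distinguish works by, about, and featuring an author. The item and author URIs are invented; the predicate names loosely follow Dublin Core style, with ex:character as a hypothetical relation.

```python
# Toy triple store: (item, predicate, object).  All URIs are illustrative.
triples = [
    ("ex:paradise_lost",   "dc:creator",   "ex:john_milton"),  # work BY Milton
    ("ex:life_of_milton",  "dc:subject",   "ex:john_milton"),  # work ABOUT Milton
    ("ex:milton_a_novel",  "ex:character", "ex:john_milton"),  # Milton as a character
    ("ex:leaves_of_grass", "dc:creator",   "ex:walt_whitman"),
]

def related_items(entity):
    """Group every item linked to `entity`, keyed by the kind of link."""
    results = {}
    for item, predicate, obj in triples:
        if obj == entity:
            results.setdefault(predicate, []).append(item)
    return results

print(related_items("ex:john_milton"))
```

Each predicate bucket can then drive a separate facet ("by", "about", "featuring") in the browse interface, rather than flattening everything into one undifferentiated result list.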

Catalogue Development

Editors/curators of digital text repositories (in Wikisource -- any user) seek appropriate data and metadata to link to a given item in their holdings. Examples of relevant data/metadata:

  • related items (with semantic context qualifying 'related' -- distinguishing between translations, adaptations, aboutness, etc.)
  • authority data (relying on external authority data to identify persons, places, topics)

Having found appropriate data/metadata by manually or automatically querying local and remote data/metadata providers, the editor/curator adds a stable local link to the other resource (whether local or remote).
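One way to model such stable, semantically qualified links, sketched in Python. The relation names and target URIs are invented; a real repository would draw relation types from a published vocabulary rather than a local set of strings.

```python
from dataclasses import dataclass

# Illustrative relation types qualifying 'related'.
RELATION_TYPES = {"translationOf", "adaptationOf", "about", "sameAs"}

@dataclass(frozen=True)
class Link:
    local_item: str  # URI of the item in this repository
    relation: str    # semantic context qualifying the relation
    target: str      # local or remote URI (e.g. an authority record)

def add_link(catalogue, local_item, relation, target):
    """Record a typed link, rejecting relation types outside the vocabulary."""
    if relation not in RELATION_TYPES:
        raise ValueError(f"unknown relation type: {relation}")
    catalogue.append(Link(local_item, relation, target))

catalogue = []
add_link(catalogue, "ex:iliade_fr", "translationOf", "ex:ilias_grc")
add_link(catalogue, "ex:essay_on_milton", "about", "http://viaf.example/milton")
```

Storing the relation type alongside the link is what lets the interface later distinguish translations from adaptations from aboutness, instead of presenting an undifferentiated "related items" list.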

Automated enrichment processes can dynamically generate suggested links ("possible related items"), again relying on querying local/remote providers. Such enrichment can also play a crucial role in connection with the textual content of documents in a repository. Indeed, full-text indexes often play a major role in search processes. Producing metadata that links relevant terms in the text to appropriate vocabularies or name authorities can enhance search (yielding more relevant results) as well as allow better browsing and exploration of documents.
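A toy sketch of such term-to-authority enrichment over full text. The authority table and its URIs are invented; a real enrichment process would consult remote authority services and handle ambiguity rather than doing exact string matching.

```python
import re

# A toy authority file mapping surface terms to (illustrative) authority URIs.
authority = {
    "milton": "ex:person/john_milton",
    "london": "ex:place/london",
}

def suggest_links(text):
    """Suggest term -> authority links found in a document's full text."""
    suggestions = {}
    for token in re.findall(r"[A-Za-z]+", text.lower()):
        if token in authority:
            suggestions[token] = authority[token]
    return suggestions

print(suggest_links("Milton lived in London before the Restoration."))
# → {'milton': 'ex:person/john_milton', 'london': 'ex:place/london'}
```

The suggestions would be surfaced to an editor/curator for confirmation before being added as stable links, keeping a human in the loop for ambiguous matches.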

  • Examples of data providers: Wikipedia[8], DBPedia[9], other digital text repositories, Open Library
  • Examples of metadata providers: VIAF[10], Open Library, Library of Congress Authorities[11], DBPedia, LibraryThing[12], WorldCat[13]
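For example, a repository could query DBpedia's public SPARQL endpoint for works by an author. The endpoint URL is real, but the query below is only illustrative (it assumes the dbo:/dbr: prefixes commonly predefined on that endpoint) and is built here without being executed.

```python
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query: works whose author is John Milton.
query = """
SELECT ?work WHERE {
  ?work dbo:author dbr:John_Milton .
}
"""

request_url = SPARQL_ENDPOINT + "?" + urlencode(
    {"query": query, "format": "application/sparql-results+json"}
)
# The URL can now be fetched with any HTTP client; the JSON results would be
# matched against local holdings to propose candidate links.
print(request_url[:60])
```

The same pattern applies to any provider exposing a SPARQL endpoint or lookup API: build the query, fetch, and feed the results into the link-suggestion step.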

Note that in work-level digital text repositories (Wikisource, Project Ben-Yehuda), a large amount of additional information can be associated with each item and delivered to the consumer -- from finer-granularity metadata (e.g. individual composition dates for each poem) to deeper relations to authority entities (e.g. linking each article in an anthology to its individual author[s], beyond linking the entire collection to the editor/compiler of the anthology, as book-level repositories would do).

Sharing Local Metadata

Having created links between local items and remote items (or other local items), or having improved/corrected metadata in items, digital text repositories may share their new/updated metadata so that their data (the textual resources they curate) is maximally discoverable, accurately described, and linked to other resources and authorities.

Such export of data depends not only on technically making new/updated metadata available, but also on efficient means of identifying opportunities for data interchange. One such means is universal identifiers, e.g. ISBNs, ISTCs[14]; universal identifiers are probably by far the easiest and simplest in terms of managing complexity.
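Part of what makes universal identifiers cheap to manage is that they can be validated mechanically. As a sketch, the ISBN-13 check digit (alternating weights of 1 and 3, total divisible by 10) can be verified in a few lines; the sample value below is a known-valid ISBN-13.

```python
def isbn13_is_valid(isbn):
    """Validate an ISBN-13 check digit (weights alternate 1 and 3)."""
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0

print(isbn13_is_valid("978-0-306-40615-7"))  # → True
```

A repository can thus reject malformed identifiers at data-entry time, before they ever become dangling keys in an interchange process.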

Application of linked data for the given use case

Linked Data technology (RDF, SPARQL) can facilitate the tasks of discovery and selection required by the use case. Linked Data also offers tools to express complex, semantically-meaningful relationships between items, and allows stable links to remote resources.

Linked Data also facilitates easy use of universal identifiers.

Existing Work

Wikisource enriches its textual items via its native wikilinks[15] syntax, linking items to author pages (in Wikisource), to Wikipedia articles, or (in extreme cases, such as the few "annotated editions" developed on Wikisource[16]) to Wiktionary[17][18] entries on specific words considered difficult or obscure. Aboutness is also expressed in Wikisource, through manual user-contributed links, as at the bottom of the author page for John Milton[19].

Related Vocabularies (optional)

  • Dublin Core terms
  • FOAF

[To be completed]

Problems and Limitations

  • Universal identifiers
    • are hard to administer
    • are even harder to get widely adopted
  • Linking instead of copy-and-merging
    • Local changes may become "second-class" entities
    • When Links Break...
  • Metadata Interchange

Related Use Cases and Unanticipated Uses

[To be completed]

Library Linked Data Dimensions / Topics


  • Social bibliography; Mashups
  • Information lifecycle
  • Authority Data
  • Library and non-library system connections
  • Thesauri and controlled vocabularies
  • Provide new data as LOD
  • Digital objects


  • [To be completed]