Use Case NDNP
NDNP (National Digital Newspaper Program - Harvesting Digital Objects From The Web)
Background and Current Practice
The National Digital Newspaper Program (NDNP) is a partnership between the National Endowment for the Humanities (NEH), the Library of Congress (LC), and state projects to provide enhanced access to United States newspapers published between 1836 and 1922. NEH awards support state projects to select and digitize historically significant titles that are aggregated and permanently maintained by LC.
Chronicling America is a web application that allows users to search and view the 2.5 million (and growing) digitized pages as well as consult a national newspaper directory of bibliographic and holdings information for 140,000 newspapers, to identify newspaper titles in all types of formats. The directory was compiled through an earlier NEH initiative, the United States Newspaper Program.
- URIS: To give each newspaper title, issue and page a unique URL to enable citation.
- RELATE (new): To contextualize newspaper content by associating it with other content on the web.
- PUBLISH: To allow digital objects (titles, issues and pages) and their associated bitstreams (pdf, jp2, ocr/xml) to be meaningfully harvested out of the web application, so that the data can be repurposed and preserved elsewhere.
- API: To provide an API for third parties to use the content in their own environments without needing to harvest the actual content.
Use Case Scenario
A researcher, institution or other party wants to harvest the newspaper content out of the Chronicling America web application to perform their own analysis of the textual content. The user is able to get metadata for newspaper titles, issues, pages, and bitstreams associated with each page (pdf, jp2, ocr/xml). The user should also be able to look for new material on a routine basis.
- Librarians, Archivists, Curators
- Computer Programmers
Application of linked data for the given use case
Linked Data Design Issues and Cool URIs for the Semantic Web provided the foundation for the design of the Chronicling America identifier space. We wanted to enable interested parties to extract data out of the web application to use in their own environments. Specifically, we designed our web application to mint URLs for each and every newspaper title, issue and page. For example:
- The Arizona champion - http://chroniclingamerica.loc.gov/lccn/sn82016246#title
- November 3rd, 1883 Issue - http://chroniclingamerica.loc.gov/lccn/sn82016246/1883-11-03/ed-1#issue
- Page 1 - http://chroniclingamerica.loc.gov/lccn/sn82016246/1883-11-03/ed-1/seq-1#page
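The pattern visible in the three example URLs above can be sketched as a small set of helper functions. This is only an illustration inferred from those examples: the function names and the `edition` and `sequence` parameters are hypothetical, and the sketch assumes that every issue date is written as YYYY-MM-DD, as in the examples.

```python
from datetime import date

# Base of the identifier space, taken from the example URLs above.
BASE = "http://chroniclingamerica.loc.gov"


def title_uri(lccn: str) -> str:
    """URI for a newspaper title, identified by its LCCN."""
    return f"{BASE}/lccn/{lccn}#title"


def issue_uri(lccn: str, issued: date, edition: int = 1) -> str:
    """URI for one issue of a title (hypothetical helper)."""
    return f"{BASE}/lccn/{lccn}/{issued.isoformat()}/ed-{edition}#issue"


def page_uri(lccn: str, issued: date, edition: int = 1, sequence: int = 1) -> str:
    """URI for one page of an issue (hypothetical helper)."""
    return (
        f"{BASE}/lccn/{lccn}/{issued.isoformat()}"
        f"/ed-{edition}/seq-{sequence}#page"
    )


if __name__ == "__main__":
    # Rebuild the three example URLs for The Arizona champion.
    print(title_uri("sn82016246"))
    print(issue_uri("sn82016246", date(1883, 11, 3)))
    print(page_uri("sn82016246", date(1883, 11, 3)))
```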
Each of these URLs identifies the "real world" newspaper title, issue and page. When a user puts one of them in a web browser, they see an HTML view of the resource. When an agent requests application/rdf+xml from the same URL, it gets an RDF representation of the resource. Machine-readable RDF was chosen to allow the resources to be described using existing vocabularies, and for resources to be explicitly linked together. In addition, the page objects are linked to the digital objects that they are composed of: jp2, pdf, ocr/xml files. RDFa was used to create new vocabulary terms for NDNP-specific semantics. RDF has also allowed place names and languages to be meaningfully linked to DBpedia, GeoNames and Lingvoj. Also, selected pages have been linked to Flickr resources, when pages have been uploaded there.
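A client-side view of this content negotiation might look like the following minimal sketch, which uses only the Python standard library. The function names are hypothetical. The fragment (`#title`, `#issue`, `#page`) is removed before the request is made because fragments are never sent to the server. Whether the live service answers exactly as described here is an assumption.

```python
from urllib.parse import urldefrag
from urllib.request import Request, urlopen

RDF_XML = "application/rdf+xml"


def build_rdf_request(uri: str) -> Request:
    """Prepare a GET request that asks for RDF/XML instead of HTML."""
    # Drop the fragment: it identifies the real-world thing, not a document.
    document_url, _fragment = urldefrag(uri)
    return Request(document_url, headers={"Accept": RDF_XML})


def fetch_rdf(uri: str, timeout: float = 30.0) -> bytes:
    """Fetch the RDF/XML representation of a resource (needs network access)."""
    with urlopen(build_rdf_request(uri), timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    req = build_rdf_request("http://chroniclingamerica.loc.gov/lccn/sn82016246#title")
    print(req.full_url, req.get_header("Accept"))
```

Keeping request construction separate from the network call makes the negotiation logic easy to check without contacting the server.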
Linked Data and the OAI-ORE vocabulary allow interested clients to harvest Chronicling America objects from the web. For example, clients that are interested in crawling the content can start with a resource map for all newspapers and follow their nose (inspecting the RDF and resolving URLs of interest) to resource maps for titles, issues, pages and on down to their respective bitstreams (pdf, ocr xml, ocr text, jpeg2000, thumbnail jpg).
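The follow-your-nose harvest can be sketched as a breadth-first walk over `ore:aggregates` links. This sketch rests on several assumptions. It assumes that each resource map lists its parts as `ore:aggregates` elements carrying an `rdf:resource` attribute. It reads the RDF/XML with a plain XML parser, which handles only that one syntactic form; a real harvester would use a full RDF parser. The `fetch` callable is a hypothetical stand-in for an HTTP client that returns RDF/XML text, or `None` for a bitstream.

```python
import xml.etree.ElementTree as ET
from collections import deque

ORE = "http://www.openarchives.org/ore/terms/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def aggregated_uris(rdf_xml: str) -> list:
    """Return targets of ore:aggregates given as rdf:resource attributes."""
    root = ET.fromstring(rdf_xml)
    resource = f"{{{RDF}}}resource"
    return [
        el.attrib[resource]
        for el in root.iter(f"{{{ORE}}}aggregates")
        if resource in el.attrib
    ]


def crawl(start: str, fetch, limit: int = 100) -> list:
    """Visit resources breadth-first, starting from one resource map.

    fetch(uri) must return RDF/XML text, or None when the URI is a
    bitstream (pdf, jp2, ...) that has nothing further to follow.
    """
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < limit:
        uri = queue.popleft()
        visited.append(uri)
        document = fetch(uri)
        if document is None:
            continue
        for target in aggregated_uris(document):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited
```

The `limit` parameter caps the number of resources visited, which matters for a collection of millions of pages. A routine check for new material could rerun the walk and compare the result with the URIs already harvested.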
Problems and Limitations
- Use of RDF/XML as a serialization for RDF has proven to be a bit of a hurdle for web developers not already familiar with semantic web technologies.
- It would be useful to be able to document how various vocabularies are being used in the RDF delivered by Chronicling America. Possible ways to do this could be Dublin Core Application Profiles, or VoID.
- There is some uncertainty about whether we should be using some IFLA-sanctioned version of the FRBR vocabulary, or whether using Ian Davis's vocabulary is good enough.
Library Linked Data Dimensions / Topics
- Users needs > Browse / explore / select
- Users needs > Retrieve / find
- Users needs > Identify
- Users needs > Access / obtain
- Users needs > Integrate / contextualize
- Context > Communication > Online access
- Information assets > Archival materials
- Information lifecycle > interpret / analyze / synthesize: > to enrich existing entities with more data
- Information lifecycle > interpret / analyze / synthesize: > to identify an entity
- Information lifecycle > interpret / analyze / synthesize: > to contextualise the entities by connecting them with other entities
- Information lifecycle > present / publish: > to visualize entities and their relations
- Information lifecycle > present / publish: > to make new entities accessible inside an information system
- Information lifecycle > present / publish: > to provide new data as LOD
- Use of Identifiers
- Linking across datasets
- REST patterns for Linked Data
- linked data management, hosting, and preservation
- Versioning, updates
- Search Engine Optimization for Library Data
- Witt, Michael. Object Reuse and Exchange (OAI-ORE). American Library Association, 2010.
- ORE Specifications and User Guides