Use Case Publishing 20th Century Press Archives
Publishing 20th Century Press Archives
Background and Current Practice
The 20th Century Press Archives of the German National Library of Economics (ZBW) is a large collection of newspaper clippings about persons, companies, subjects and wares, extending from 1826 to 2005 and organized in thematic folders. For parts of the collections, metadata (such as the source and date of an article, or the name and location of a company) is available in German only. Currently, parts of the more than six million documents are accessible as digitized page images (without OCR data) through a web application. Deep links into this application are either impossible or rely on URLs that depend heavily on ColdFusion syntax elements. Harvesting of the data by external applications is not supported either.
- RELATE(aggregate): Folders allowing access to multiple documents, and search results accessed via a single URI, are typical aggregation cases.
- RELATE(new): To provide context from metadata and link to other data relevant to the domain.
- REUSE-VALUE-VOCABS: VIAF, Geonames, GND are mentioned as targets of enrichment for providing a better context.
- PUBLISH: Harvesting and the goal "To support the use of a standard image and metadata viewer based on METS/MODS" put publication of data at the core.
- Scholars and students in Economic and Contemporary History
- The general public
- Service Providers
Use Case Scenario
The user can browse and search the collections by the available metadata. Search should be supported by an autosuggest service that includes alternative names (e.g. from the German Personal Name Authority File, PND). Every item - folder, document, single page of a document - has its own persistent web address which can be cited and linked to. Additional information is provided from other sources on the web, such as a person's Wikipedia abstract or nationality (from the authority file). For non-German users, an English version of the website is prepared with data from the web. Links to places where more information is available, such as VIAF, are offered. And of course the user can comfortably view folders and documents with their page images and some metadata attached.
Additionally, institutional users and providers of value added services (like Europeana) can harvest the data in an efficient way.
OAI-ORE provides the backbone for organizing the large and deeply nested aggregations of data. At every level of aggregation, it provides access to the aggregated resources (which may be aggregations themselves, or image files at the deepest level). Search results (e.g. companies by location) are represented as dynamically built ORE aggregations. The aggregations are described by RDFa resource maps.
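The nesting described above can be sketched as plain triples. This is a minimal illustration, not the P20 implementation: the helper name `aggregation_triples`, the folder URI, and the assumption that the example document has two pages are hypothetical, and the resource map that describes each aggregation is left out for brevity.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://zbw.eu/beta/p20"


def aggregation_triples(aggregation_uri, member_uris):
    """Return (subject, predicate, object) triples for one ORE aggregation."""
    triples = [(aggregation_uri, RDF_TYPE, ORE + "Aggregation")]
    for member in member_uris:
        # ore:aggregates links an aggregation to each aggregated resource,
        # which may itself be an aggregation on the next level down.
        triples.append((aggregation_uri, ORE + "aggregates", member))
    return triples


# Hypothetical example data: a folder aggregating one document,
# and that document aggregating two page resources.
folder = f"{BASE}/person/13476"
document = f"{folder}/0147"
pages = [f"{document}/{n:04d}" for n in (1, 2)]

triples = aggregation_triples(folder, [document]) + aggregation_triples(document, pages)

for s, p, o in triples:
    print(f"<{s}> <{p}> <{o}> .")
```

Because a member of one aggregation can be the subject of another, the same function covers every level of the hierarchy, and a dynamically built search result would only differ in how its member list is computed.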
Metadata provided by the application database and links into the Linked Data cloud (especially DBpedia, Geonames, the German Authorities File, VIAF and Chronicling America) enrich these resource maps. RDFa facilitates building a web application for both humans and machines that follows the REST architectural principles.
Personal name lookup uses SPARQL queries against a SKOS file (skos:prefLabel/altLabel) derived from PND, mediated through a web service.
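A lookup of this kind could be phrased roughly as follows. The sketch only builds the query text; the function name `build_name_lookup_query`, the prefix-matching strategy, and the use of the SPARQL 1.1 `STRSTARTS` function are assumptions, not the query used by the actual service.

```python
def build_name_lookup_query(prefix: str, limit: int = 10) -> str:
    """Build a SPARQL query matching concepts by preferred or alternative label."""
    # Escape characters that would otherwise break out of the string literal.
    escaped = (
        prefix.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f"""PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?concept ?prefLabel
WHERE {{
  ?concept skos:prefLabel ?prefLabel .
  {{ ?concept skos:prefLabel ?label }} UNION {{ ?concept skos:altLabel ?label }}
  FILTER(STRSTARTS(LCASE(STR(?label)), LCASE("{escaped}")))
}}
ORDER BY ?prefLabel
LIMIT {int(limit)}"""


# Hypothetical user input typed into the autosuggest box.
query = build_name_lookup_query("aden")
print(query)
```

Matching on both label types while always returning the preferred label lets a user type an alternative name and still be offered the authoritative form.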
- Beta version: http://zbw.eu/beta/p20 (notice changed URI from the first prototype!)
A voiD description file is available at http://zbw.eu/beta/p20/void.ttl
- http://zbw.eu/beta/p20/person (the biographical collection)
- http://zbw.eu/beta/p20/person/12 (folder about a single person) (with links to DBpedia, PND and VIAF)
- http://zbw.eu/beta/p20/person/13476/0147 (single document) (with links to Chronicling America)
- http://zbw.eu/beta/p20/person/13476/0147/0001 (the first page of this document)
- http://zbw.eu/beta/p20/company/searchresult?q=hamburg (company search result for "hamburg")
- http://zbw.eu/beta/p20/company_by_geoname/2886242 (companies located in Cologne - by geoname location)
- In Chronicling America whole newspapers with their issues and pages are made available on the Semantic Web
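The example URIs for folders, documents and pages listed above suggest a hierarchical path layout of collection, folder, document and page. The following sketch takes such a URI apart under that assumption; the layout is inferred from the examples only, the names `ItemRef`, `parse_item_uri` and `parent_uri` are hypothetical, and search-result or geoname URIs are not covered.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse

BASE_PATH = "/beta/p20"


class ItemRef(NamedTuple):
    collection: str
    folder: Optional[str]
    document: Optional[str]
    page: Optional[str]


def parse_item_uri(uri: str) -> ItemRef:
    """Split a hierarchical item URI into its aggregation levels."""
    path = urlparse(uri).path
    if not path.startswith(BASE_PATH + "/"):
        raise ValueError(f"not an item URI: {uri}")
    segments = [s for s in path[len(BASE_PATH):].split("/") if s]
    if not 1 <= len(segments) <= 4:
        raise ValueError(f"unexpected path depth: {uri}")
    # Missing deeper levels are represented as None.
    padded = segments + [None] * (4 - len(segments))
    return ItemRef(*padded)


def parent_uri(uri: str) -> str:
    """Return the URI one level up, i.e. the enclosing aggregation."""
    return uri.rstrip("/").rsplit("/", 1)[0]


page = "http://zbw.eu/beta/p20/person/13476/0147/0001"
print(parse_item_uri(page))
print(parent_uri(page))
```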
Problems and Limitations
- The RDFa pages are generated dynamically from a relational database. Since information from different levels of the aggregation hierarchy and from associated metadata tables is required to build a meaningful display for the user, performance is an issue. Besides database means such as materialized views, we try to solve this with caching strategies (which leverage standard web technologies). These technologies also have to be applied to external linked data sources in order to guarantee availability and to achieve acceptable overall performance.
- The granularity of the aggregations presented to the user is also an issue to solve (e.g. the companies collection aggregates some 13,000 companies, which is far too many for display as well as for efficient harvesting).
- For use in the DFG-Viewer, the aggregations are to be mapped to METS-MODS XML files in different granularities. Up to now, no general mapping methodology from ORE to METS (let alone to MODS elements) exists, so currently we generate the files directly from the database.
- The order of the documents within a folder, which generally follows the publishing date of the articles, is crucial (especially if a set of documents comes without any metadata). Currently, the order is not expressed in RDF, but solely by convention (ascending document numbers). Sometimes it turns out that somebody messed around in a folder 20 or 50 years ago, and the sequence has to be rearranged to meet users' expectations. Because identifiers (including the document number) are meant to be persistent, a "renumber" command is not an option. ORE provides a solution with ore:Proxy and xyz:hasNext / xyz:hasPrevious, but this comes with all the hassles of doubly linked lists and a large implementation overhead.
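The proxy-based ordering mentioned in the last point can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions: proxies are keyed by the document URI they stand for, the `has_next` / `has_previous` fields mirror the placeholder xyz:hasNext / xyz:hasPrevious properties, and the folder and document URIs are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Proxy:
    proxy_for: str                      # persistent document URI (ore:proxyFor)
    proxy_in: str                       # folder aggregation URI (ore:proxyIn)
    has_next: Optional[str] = None      # next document in reading order
    has_previous: Optional[str] = None  # previous document in reading order


def build_proxies(folder_uri: str, ordered_docs: List[str]) -> Dict[str, Proxy]:
    """Create one proxy per document and link neighbours in both directions."""
    proxies = {doc: Proxy(doc, folder_uri) for doc in ordered_docs}
    for prev_doc, next_doc in zip(ordered_docs, ordered_docs[1:]):
        proxies[prev_doc].has_next = next_doc
        proxies[next_doc].has_previous = prev_doc
    return proxies


def reading_order(proxies: Dict[str, Proxy]) -> List[str]:
    """Recover the sequence by walking has_next links from the single head."""
    if not proxies:
        return []
    heads = [p for p in proxies.values() if p.has_previous is None]
    if len(heads) != 1:
        raise ValueError("expected exactly one first document")
    order = []
    current = heads[0].proxy_for
    while current is not None:
        order.append(current)
        current = proxies[current].has_next
    return order


# Hypothetical folder in which document 0003 belongs between 0001 and 0002.
folder = "http://zbw.eu/beta/p20/person/13476"
corrected = [f"{folder}/0001", f"{folder}/0003", f"{folder}/0002"]
proxies = build_proxies(folder, corrected)
print(reading_order(proxies))
```

The document URIs stay untouched while only the links between proxies change. The overhead noted above is also visible here: every reordering has to keep both link directions consistent.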
Related Use Cases and Unanticipated Uses
- Europeana harvests and aggregates metadata from sites like the 20th Century Press Archives. The Linked Data (OAI-ORE) interface aims to facilitate this.
- NDNP (Chronicling America) provides a large corpus of historic newspapers, down to the page level searchable through OCR text.
- The Linked Data Service of the German National Library and VIAF authority file data provide links to holdings of national libraries around the world and to other linked data sources.
- More details about the P20 application can be found in Joachim Neubert: The 20th Century Press Archives as Linked Data Application, Submission to Semantic Web Challenge 2010
- OAI-ORE applied to classical archival finding aids is outlined in Deborah Kaplan, Anne Sauer, Eliot Wilczek: Archival description in OAI-ORE (OR 2010)
- Scholarly use of OAI-ORE is described in Herbert van de Sompel: Adding eScience Assets to the Data Web (LDOW 2009)
- METS wiki about METS & OAI-ORE