Library of Congress: Leslie Johnston
Zepheira: Kathy MacDougall
Background and Current Practice
The National Digital Information Infrastructure and Preservation Program (NDIIPP) at the Library of Congress is an initiative to develop a national strategy to collect, archive and preserve the burgeoning amounts of digital content for current and future generations. It is based on an understanding that digital stewardship on a national scale depends on active cooperation between communities. The NDIIPP network of partners has collected a diverse array of digital content, including social science data-sets; geospatial information; Web sites and blogs; e-journals; audiovisual materials; and digital government records (http://www.digitalpreservation.gov).
These diverse collections are held in the dispersed repositories and archival systems of over 180 partner institutions where each organization collects, manages, and stores at-risk digital content according to what is most suitable for the industry or domain that it serves. This practice is necessary in a federated network of heterogeneous infrastructures but creates challenges in providing meaningful access across collections. However, it is clear that digital content grows in value exponentially as it is integrated and interconnected. As the Library of Congress and its partners develop a framework for a national digital collection, they have recognized a requirement to share and integrate partner collections in the interest of coherent strategy (Campbell, 2009).
NDIIPP partners understand through experience that aggregating and sharing diverse collections is very challenging. Transfer of data and accompanying metadata from one institution to another for integration into a different system is resource-intensive. The knowledge workers’ understanding of digital preservation content rarely translates to the understanding of systems and computer infrastructure that offer alternative means for sharing and aggregating such data. As such, the aggregation and consolidation of information about NDIIPP collections for Web resources have, to date, been a manual process. This is not a scalable strategy, and information becomes outdated as Partner systems and collections evolve.
Early in 2009, the Library of Congress and Zepheira initiated a pilot project that recognizes the specific characteristics of this community. Working together, they created Project Recollection. Recollection seeks to provide the platform, tools and environment that enable the community of NDIIPP Partners to share their collections and data on an ongoing basis. In addition, NDIIPP collections can be showcased from a central point through the activities of the Partners rather than the manual labor of the Library. This allows NDIIPP to maintain the benefits of a distributed network of partners and also take advantage of the collections speaking to one another (Campbell, 2009).
Recollection achieves the following key goals for the Library and its NDIIPP Partners:
- Rapidly combine private and public information collections for easy sharing.
- Enable new insights into patterns and relationships inherent to data.
- Build a network of trust and participation around common goals.
Recollection ties together the many public collections that are curated by NDIIPP Partners in different systems and formats across the globe; it allows curators of information to enhance their information resources by connecting them to other open data sources available on the Web.
The target audience for this use case is librarians, archivists, museum curators, educators and other LAM (Libraries/Archives/Museums) professionals.
Note: Current users of Recollection are librarians and archivists who work for the Library of Congress and NDIIPP Partner Organizations. For a view of NDIIPP Partners by geographic location and activity, see http://recollection.zepheira.com/views/em/ndiipp-partners/. An alphabetical listing of partners can be found at http://www.digitalpreservation.gov/partners/partners_alpha.html.
Use Case Scenario
Users can either point Recollection at their collections or upload them, and then describe the types of data within each collection via a web interface. They can enhance this data by using Recollection to generate latitude/longitude coordinates, apply consistent date formats, break lists into individual values, and perform other data manipulations useful for analysis. Users can also merge multiple data files in order to build views that let them analyze information across the combined collections.
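The kinds of enhancement described above can be illustrated with a minimal Python sketch. It is not Recollection's implementation (in Recollection these transformations are handled by Akara); the field names `date`, `subjects` and `collection`, the list of accepted date formats, and all function names are assumptions made only for this example.

```python
from datetime import datetime

# Assumed input formats; a real collection would need its own list.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%d %b %Y")


def normalize_date(value):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no format matches."""
    text = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def split_list(value, separator=";"):
    """Break a delimited string into a list of trimmed, non-empty values."""
    return [part.strip() for part in value.split(separator) if part.strip()]


def enhance_record(record):
    """Return a copy of the record with a normalized date and split subjects."""
    enhanced = dict(record)
    enhanced["date"] = normalize_date(record.get("date", ""))
    enhanced["subjects"] = split_list(record.get("subjects", ""))
    return enhanced


def merge_collections(*collections):
    """Combine (name, records) pairs into one list, tagging each record's source."""
    merged = []
    for name, records in collections:
        for record in records:
            merged.append({**enhance_record(record), "collection": name})
    return merged
```

Keeping the source collection name on every merged record is one simple way to preserve where each record came from once several data files are viewed together.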
Once they've described the data, users can quickly create a custom web interface to the data with visualizations that show map plots, timelines, number charts, and other interesting views. They can select facets and tag clouds associated with the data to provide the ability to filter information in ways that are interesting to themselves or targeted audiences. The user can then publish and share the finished page on the Web.
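Faceted filtering of this kind can be sketched in a few lines of Python. In Recollection the faceted interface is provided by Exhibit; the functions below are hypothetical and only illustrate the underlying idea of counting records per facet value and narrowing a record set by a selected value.

```python
from collections import Counter


def _as_values(value):
    """Treat a single value as a one-item list so fields can be multi-valued."""
    return value if isinstance(value, list) else [value]


def facet_counts(records, field):
    """Count how many records carry each value of the given field."""
    counts = Counter()
    for record in records:
        value = record.get(field)
        if value is None:
            continue
        # A set ensures a record is counted once per distinct value.
        counts.update(set(_as_values(value)))
    return counts


def filter_by_facet(records, field, wanted):
    """Keep only the records whose field equals or contains the wanted value."""
    return [
        record
        for record in records
        if wanted in _as_values(record.get(field))
    ]
```

The counts are what a facet list or tag cloud would display next to each value, and the filter is what selecting one of those values would apply.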
It is easiest to get a sense of Recollection by viewing it in action in a screencast (SWF format).
Linked data technology is used in Recollection as a basic platform for librarians and curators exposing collections to the Web, and as a source of data to augment these collections. Potential users of the information can more easily discover and analyze this data in a variety of new ways as a result. Not only do consumers of the information have increased access, but collection curators can begin to connect information across collections and from the World Wide Web to enhance collection value with new resources. These connections create a powerful "Web of Data" for all of the resources curated under the auspices of the NDIIPP Program.
Recollection applies the following linked data principles to support the use case:
- Expose information resources via URIs: URIs are used to identify all information that is exposed to the Web as a new resource. Prior to Recollection, many collections of information were held in "dark archives" that were inaccessible to most.
- Use standard HTTP for ease of access: HTTP URIs are used to allow information to be located by the widest variety of tools and services possible. Collections housed in Recollection are now exposed to all users of the World Wide Web and easily accessible.
- Provide data in common formats to maximize sharing: The data provided when people access Recollection URIs is available in many of the common formats used on the Web today making it more widely useful. Data is made available in RDF/XML, HTML, Semantic wikitext, JSON and a variety of additional formats. Users of Recollection are free to take the information that is exposed and use it in new ways that are meaningful to them.
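The principles above can be illustrated with a small Python sketch that identifies a resource by an HTTP URI and offers it in two of the formats mentioned, JSON and RDF/XML. This is not how Recollection is implemented: the example URI, the use of Dublin Core element names as property keys, and the function names are all assumptions chosen for illustration.

```python
import json
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def to_json(uri, properties):
    """Serialize the resource as a JSON object carrying its URI as 'id'."""
    return json.dumps({"id": uri, **properties}, sort_keys=True)


def to_rdf_xml(uri, properties):
    """Serialize the resource as RDF/XML.

    Assumes every property key is a Dublin Core element name (e.g. 'title').
    """
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": uri}
    )
    for name, value in properties.items():
        ET.SubElement(description, f"{{{DC_NS}}}{name}").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Media type to serializer; a server could consult this per request.
SERIALIZERS = {
    "application/json": to_json,
    "application/rdf+xml": to_rdf_xml,
}


def represent(uri, properties, media_type):
    """Return the resource in the requested format (KeyError if unsupported)."""
    return SERIALIZERS[media_type](uri, properties)
```

For example, `represent("http://example.org/collections/1", {"title": "Sample collection"}, "application/rdf+xml")` yields an `rdf:RDF` document whose `rdf:Description` is about the given URI.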
More information on the specific technologies used in Recollection is provided in the "Existing Work" section.
Existing Work (optional)
Foundation in Open Source
The Recollection Platform leverages Web-based open standards and open source tools, many of which Zepheira is actively leading. We gratefully acknowledge the assistance of the Free/Libre/Open Source Software (FLOSS) community. Using the open source components listed below, Zepheira has created the Recollection Platform to share and use digital resources, and to identify new sources, audiences and collaborators.
Recollection utilizes Django for template-based application development and administration. The user interface is based on the MIT Simile Project's Exhibit tool for faceted navigation and incorporates an Exhibit Server hosted by Zepheira. Back end data transformation and data enhancements are handled by Akara. The storage layer is designed to allow for file-system storage for maximum transparency.
The Recollection Platform includes custom software (the “glue layer”) to integrate these open source components. Zepheira has defined a series of RESTful wrappers and interfaces around these tools that provide the interfaces Recollection requires.
For more details on the open source software used to create the Recollection Platform, see http://recollection.zepheira.com/about/community/.
Examples of Linked Data Technologies
An important aspect of Recollection is its ability to link its collections to other publicly available information on the Web. For example, Recollection uses GeoNames, one of the best sources for geographical information for users and developers, and a shining example of the power of open data. GeoNames is a database, Web service, and destination site for all things geographical. It has a rich, RESTful API and offers Semantic Web features using LOD conventions.
Further, as described on the GeoNames site: "The GeoNames geographical database is available for download free of charge under a creative commons attribution license. It contains over eight million geographical names and consists of 6.5 million unique features whereof 2.2 million populated places and 1.8 million alternate names. All features are categorized into one out of nine feature classes and further subcategorized into one out of 645 feature codes.[...] The data is accessible free of charge through a number of webservices and a daily database export. GeoNames is already serving up to over 11 million web service requests per day.[...] GeoNames is integrating geographical data such as names of places in various languages, elevation, population and others from various sources."
Recollection uses GeoNames to augment preserved materials with geographical context. Similar patterns and models to support contextual linking are anticipated for subjects (e.g. DDC or LCSH) and people (e.g. VIAF). In the spirit of LOD, Recollection stakeholders give back to the original initiative by contributing code, documentation, and evangelism, and by serving on the community team.
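As an illustration of this kind of augmentation, the following Python sketch looks up latitude/longitude coordinates for a place name through the GeoNames `searchJSON` web service. It is a sketch under stated assumptions rather than Recollection's own code: the function names are hypothetical, and the caller must supply their own registered GeoNames username.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GEONAMES_SEARCH = "http://api.geonames.org/searchJSON"


def build_search_url(place_name, username, max_rows=1):
    """Build a GeoNames search URL for a place name."""
    query = urlencode({"q": place_name, "maxRows": max_rows, "username": username})
    return f"{GEONAMES_SEARCH}?{query}"


def first_coordinates(response_text):
    """Extract (lat, lng) of the first match from a searchJSON response.

    Returns None when the response contains no matches.
    """
    matches = json.loads(response_text).get("geonames", [])
    if not matches:
        return None
    top = matches[0]
    return float(top["lat"]), float(top["lng"])


def geocode(place_name, username):
    """Query GeoNames over the network and return (lat, lng) or None."""
    url = build_search_url(place_name, username)
    with urlopen(url, timeout=10) as response:
        return first_coordinates(response.read().decode("utf-8"))
```

Separating URL construction and response parsing from the network call keeps the parsing logic testable without contacting the service. Taking only the first match is a simplification; ambiguous place names would need review by a curator.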
The LOD community is not just about sharing data, but also about sharing the wherewithal to process data. In the process of developing Recollection, the stakeholders have also contributed many data-processing tools and best practices to the community, including enhancements to the open-source tools used in Recollection. Relevant domains of data include date/time formats, statistical formats such as SPSS, Web feeds (i.e. RSS and Atom), XML MODS, and more.
Key members of the Recollection team are also leaders of the MIT Simile project, which focuses on designing tools to facilitate interoperability among digital assets distributed across individual, community, and institutional stores. Not only has Recollection included basic enhancements to the Simile widget set (maps, pie charts, graphs, tag clouds and specialized pick lists), but it has also included fundamental enhancements to the data conduits and views (e.g. OpenLayers support) that underpin Exhibit. Since the Simile project in general, and Exhibit specifically, are such popular components of LOD systems, this work reflects the give-and-take of participating in the LOD community.
Continued Contributions to Open Source and the Linked Data Community
On June 10th and 11th, 2010, a group of Simile Exhibit users, software developers and architects met in Washington, D.C. to discuss the current limitations of the tool and how it could be redesigned to meet the changing needs of its user community. The group reviewed a sample of current uses of Exhibit to understand both its value and its limitations, devised a new architecture for the product, and discussed how to ensure broad participation in and adoption of the new tool. In 2010, the Library is funding a joint project with MIT and Zepheira to develop the next generation of Exhibit, called "Exhibit 2", to meet these architectural and open source community goals.
Related Vocabularies (optional)
The data employs well-established and widely used library community standards whose ubiquity makes them well suited to the linked data paradigm, such as the Library of Congress Subject Headings, the Thesaurus for Graphic Materials, and the Library of Congress Name Authority File. Relationships between individual data records, and between sets of records, can be easily discovered and created, both programmatically and by users.
Other vocabularies created for use in Recollection are, while influenced by external vocabularies, specific to the platform. These vocabularies are reflective of the Recollection primitives and center around contextually relating People, Organizations, Subjects, Collections, Data and Views. Internal vocabularies are designed to be exposed via RESTful endpoints and late-bound to existing standard vocabularies where appropriate.
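Late binding of this kind can be sketched in Python as a lookup that is applied only when data is exposed, so that internal terms stay stable while their mapping to standard vocabularies can change. The internal term names, the `http://example.org/vocab/` base URI, and the particular choice of FOAF, DCMI Type and SKOS terms are assumptions made for illustration; the source does not state which standard vocabularies Recollection binds to.

```python
# Hypothetical bindings from internal terms to published vocabulary terms.
STANDARD_BINDINGS = {
    "Person": "http://xmlns.com/foaf/0.1/Person",
    "Organization": "http://xmlns.com/foaf/0.1/Organization",
    "Collection": "http://purl.org/dc/dcmitype/Collection",
    "Subject": "http://www.w3.org/2004/02/skos/core#Concept",
}

# Hypothetical base URI for terms that have no standard equivalent.
INTERNAL_BASE = "http://example.org/vocab/"


def resolve_term(term, bindings=STANDARD_BINDINGS):
    """Return the standard URI bound to an internal term, if any.

    Terms without a binding fall back to a URI in the internal vocabulary.
    """
    return bindings.get(term, INTERNAL_BASE + term)
```

Because the bindings are passed in at call time, a term such as `View` can be exposed under the internal vocabulary today and bound to a standard term later without changing stored data.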
Problems and Limitations (optional)
- The open source components in use are collectively rather on the bleeding edge, and require a good deal of polish and integration enhancements.
- It is difficult to combine usability needs with policy compliance when organizing information in discrete collections for broader use.
- There are gaps in the linked data framework (e.g. provenance) that are being addressed in ongoing work.