Data Catalog Vocabulary/Use Cases and Requirements

From W3C eGovernment Wiki

Status: This is a working document of the Data Catalog Vocabulary project within the W3C eGovernment Interest Group. Feedback is welcome and should be sent to the public-egov-ig mailing list.

Last significant change: 2010-05-13

Terminology

A dataset is a collection of information in a machine-readable format. It is published by an agency, usually some sort of official government organisation, and thought to be useful to the public.

A catalog record consists of metadata for a dataset. It thus describes the dataset. The actual dataset is not considered part of the catalog record, but the catalog record usually contains a download link or web page link from where the actual dataset can be obtained.

A catalog is a collection of catalog records. It is operated by a catalog operator, which could be a government agency, citizen initiative, …
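The relationship between these three terms can be sketched as a small data model. This is only an illustration: the class names and field names (title, publisher, dataset_url, operator) are assumptions made for the example and are not defined by this document.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogRecord:
    """Metadata describing one dataset; the dataset itself is not stored here."""
    title: str          # hypothetical field name
    publisher: str      # the agency that published the dataset
    dataset_url: str    # download or web page link to the actual dataset


@dataclass
class Catalog:
    """A collection of catalog records, run by a catalog operator."""
    operator: str
    records: List[CatalogRecord] = field(default_factory=list)


# Made-up example data: one catalog holding one record.
catalog = Catalog(operator="Example Citizen Initiative")
catalog.records.append(
    CatalogRecord(
        title="Road accidents 2009",
        publisher="Example Transport Agency",
        dataset_url="http://example.org/data/accidents-2009.csv",
    )
)
print(len(catalog.records), catalog.records[0].dataset_url)
```

Note that the record only points at the dataset through a link, which mirrors the distinction drawn above between a catalog record and the dataset it describes.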

Use cases

Creating a combined catalog from multiple data catalogs (UC1)

An increasing number of government agencies make their data available on-line in the form of data catalogs such as data.gov (see global map of data catalogs at CTIC). Catalogs exist at national, regional and local level; some are operated by official government bodies and others by citizen initiatives; some have general coverage, while others have a specific focus (e.g., statistical data, historical datasets).

Citizens, journalists, researchers and businesses thus may have to spend considerable amounts of time searching a number of catalogs for relevant datasets. Federated catalogs such as the Guardian's World Government Data site and Sunlight Labs' National Data Catalog are emerging as a response to this problem. They present a unified catalog and unified user interface. They may also provide additional advanced features that individual catalog operators will not or cannot supply, such as convenient APIs for mashup developers.

The federated catalog replicates individual catalogs' contents into its local database. A website interface similar to those of current individual catalogs is offered for interacting with the federated catalog. Updates to the individual catalogs (new datasets, modified metadata, deleted datasets) also have to be reflected in the federated catalog.

Creating federated catalogs is challenging for various reasons. First, not all catalogs make their records available in a machine-readable form, forcing the developers of federated catalogs to employ screen scraping. Second, where the catalog is available in machine-processable form, it is usually in a custom one-off format, requiring the development of custom importers for each catalog that is to be federated. Third, the developer of the federated catalog has to undertake the task of mapping and harmonising the metadata fields provided by different catalogs.

A standard format for data catalogs helps with all three problems: First, the existence of a well-documented standard creates an additional incentive towards publishing machine-readable metadata for the catalog operators. Second, a single importer can be used to import all catalogs that support the format. Third, harmonising metadata fields becomes the job of individual catalog operators, who know the contents of their own catalog best.
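Keeping a federated replica in step with a source catalog amounts to detecting new, modified and deleted records. The following is a minimal sketch, assuming that every catalog exposes its records in one shared format as a mapping from a record identifier to a metadata dictionary. The function name diff_catalog and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def diff_catalog(
    local: Dict[str, Record], remote: Dict[str, Record]
) -> Tuple[List[str], List[str], List[str]]:
    """Compare the federated replica (local) with a source catalog (remote).

    Both arguments map a record identifier to that record's metadata.
    Returns the identifiers of added, modified and deleted records.
    """
    added = sorted(remote.keys() - local.keys())
    deleted = sorted(local.keys() - remote.keys())
    modified = sorted(
        key for key in local.keys() & remote.keys() if local[key] != remote[key]
    )
    return added, modified, deleted


# Made-up example data.
local = {
    "a": {"title": "Budget 2009"},
    "b": {"title": "Crime statistics"},
}
remote = {
    "b": {"title": "Crime statistics (revised)"},
    "c": {"title": "School locations"},
}

added, modified, deleted = diff_catalog(local, remote)
print(added, modified, deleted)
```

Because all source catalogs share one format in this sketch, the same function can be reused for each of them, which is the "single importer" benefit described above.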

Including metadata published directly on agency web sites into catalogs (UC2)

The model of most current data catalogs assumes that agencies publish datasets on their own website, and then register the dataset with the central catalog by providing the download location and other metadata to the catalog operator. This model is not always efficient. Individual agencies sometimes have existing dataset publishing workflows and metadata management capabilities (e.g., statistics offices). Also, the amount and nature of metadata that agencies can provide differs widely, and a central catalog with a single, non-extensible metadata schema cannot capture the requirements of a wide range of government institutions.

In a distributed publishing model, on the other hand, agencies manage their own metadata on their own websites, using their own publishing workflows and information systems. Central catalogs such as data.gov play the role of aggregator that collects dataset descriptions from different agency websites and presents them in a unified user interface. The central catalog must somehow be able to discover newly published datasets on an agency's web site, e.g., by crawling or by receiving an automated notification from the agency. There also has to be a way of notifying the central catalog about changes to the metadata.

Note that individual agencies in this scenario may not want to run a full-blown “agency-level data catalog”, but may just want to make metadata available in a more structured form alongside the datasets that are already scattered throughout their web sites. This distinguishes this use case from the catalog federation scenario (UC1), which assumes that the sites to be federated are dedicated data catalog websites.

Advanced queries against catalogs (UC3)

All catalog websites provide some sort of parametric search facility (e.g., search by publishing agency, by data format, or by theme). The available search parameters differ among catalogs, and they are not sufficient for all users' needs. For example, data.gov provides search by department, format and category, but not by keyword, update date, or temporal/geographic coverage.

If catalogs are exposed in a standard machine-readable format, then third parties are able to replicate the contents of a catalog into their own database, and run advanced queries over the catalog, or provide interfaces for performing such queries to the general public.

Queries may rely on information that is not present in the catalog but in external sources. For example, by using the US Government Structure Ontology one can query for datasets published by an agency that directly reports to the Executive Office of the President.
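A query that combines replicated catalog records with an external source might look like the sketch below. It assumes the records have been copied into a local list of dictionaries, and it stands in for the external source with a plain mapping from agency to the body it reports to. The agency names, field names and the function datasets_reporting_to are all made up for illustration and do not reflect the actual contents of any ontology.

```python
from typing import Dict, List

# Replicated catalog records (hypothetical sample data).
records: List[Dict[str, str]] = [
    {"title": "Federal spending", "publisher": "Agency A"},
    {"title": "Air quality", "publisher": "Agency B"},
    {"title": "Grant awards", "publisher": "Agency C"},
]

# External source, not part of the catalog: which body each agency reports to.
reports_to: Dict[str, str] = {
    "Agency A": "Executive Office of the President",
    "Agency B": "Department X",
    "Agency C": "Executive Office of the President",
}


def datasets_reporting_to(
    records: List[Dict[str, str]], reports_to: Dict[str, str], superior: str
) -> List[str]:
    """Return titles of datasets whose publisher reports directly to `superior`."""
    return [
        record["title"]
        for record in records
        if reports_to.get(record["publisher"]) == superior
    ]


titles = datasets_reporting_to(
    records, reports_to, "Executive Office of the President"
)
print(titles)
```

The catalog itself never states the reporting relationship; the query only works because the publisher field can be joined against the external data.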

Bulk download of datasets (UC4)

Data catalogs support the creation of innovative mashups of government data by making it easier for developers to find data sources of interest. Developers may browse or search the catalog until they have found a dataset of interest, and then download the linked file.

However, some mashups and applications may access not just one but a very large number of datasets from a catalog. For example, an application could make all geographic datasets (in ESRI shapefile, GML, KML formats) available for display on a map.

The creation of such applications would become much easier if it were possible to automate the downloading of all datasets that meet certain criteria. Furthermore, the ability to automatically discover new datasets that meet those criteria, and to discover updated datasets, would be useful.
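Automated bulk download could be sketched as a selection step followed by a download step. The example assumes machine-readable records with hypothetical "format" and "url" fields; the function names, URLs and the format labels are illustrative. The download function relies on urlretrieve from the Python standard library and is left uncalled here because it would perform network requests.

```python
import os
from typing import Dict, Iterable, List, Set
from urllib.parse import urlparse
from urllib.request import urlretrieve

# Hypothetical labels for the geographic formats mentioned above.
GEO_FORMATS: Set[str] = {"shapefile", "gml", "kml"}


def select_urls(records: Iterable[Dict[str, str]], formats: Set[str]) -> List[str]:
    """Return download URLs of all records whose format is in `formats`."""
    return [r["url"] for r in records if r["format"].lower() in formats]


def download_all(urls: Iterable[str], target_dir: str) -> None:
    """Download every URL into `target_dir`, named after the last path segment."""
    os.makedirs(target_dir, exist_ok=True)
    for url in urls:
        filename = os.path.basename(urlparse(url).path)
        urlretrieve(url, os.path.join(target_dir, filename))


# Made-up example records.
records = [
    {"format": "KML", "url": "http://example.org/data/parks.kml"},
    {"format": "CSV", "url": "http://example.org/data/budget.csv"},
    {"format": "GML", "url": "http://example.org/data/roads.gml"},
]

urls = select_urls(records, GEO_FORMATS)
print(urls)
# download_all(urls, "downloads")  # uncomment to fetch the files
```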

Requirements

Machine-readable representations of catalog entries

Must allow retrieval of a machine-readable representation of catalog entries.

Required by: UC1, UC2, UC3, UC4

Retrieval of all catalog entries

Must allow retrieval of all entries in a catalog.

Required by: UC1, UC3, UC4

Persistent URIs for catalog entries

Must provide stable, persistent identifiers for individual entries.

Required by: UC1, UC2

Update checks for individual datasets

Must allow checking whether an individual dataset has changed or was updated.

Required by: UC2

Discovery of new and updated catalog entries

Must allow the discovery of new entries in a catalog, and the discovery of entries that have been recently updated.

Required by: UC1, UC4
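One way a consumer could use such a facility is sketched below, assuming each entry carries creation and modification dates in ISO format. The field names "id", "created" and "modified" and the function new_and_updated are assumptions for the example; this document does not prescribe them.

```python
from datetime import date
from typing import Dict, List, Tuple


def new_and_updated(
    entries: List[Dict[str, str]], last_check: date
) -> Tuple[List[str], List[str]]:
    """Split entries changed after `last_check` into new and updated ones."""
    new: List[str] = []
    updated: List[str] = []
    for entry in entries:
        created = date.fromisoformat(entry["created"])
        modified = date.fromisoformat(entry["modified"])
        if created > last_check:
            new.append(entry["id"])          # did not exist at the last check
        elif modified > last_check:
            updated.append(entry["id"])      # existed before, changed since
    return new, updated


# Made-up example entries.
entries = [
    {"id": "r1", "created": "2010-01-10", "modified": "2010-01-10"},
    {"id": "r2", "created": "2010-02-01", "modified": "2010-05-10"},
    {"id": "r3", "created": "2010-05-12", "modified": "2010-05-12"},
]

new, updated = new_and_updated(entries, last_check=date(2010, 5, 1))
print(new, updated)
```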

Tracking of data provenance

Must include pointers/links to the original catalog record when an entry is federated into another catalog.

Required by: UC1, UC2
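As a rough illustration, a federating catalog could attach the link when it copies an entry. The key name "original_record", the function federate and the URL are hypothetical.

```python
from typing import Dict


def federate(entry: Dict[str, str], original_record_url: str) -> Dict[str, str]:
    """Return a copy of `entry` that links back to the original catalog record."""
    federated = dict(entry)  # copy, so the source entry is left untouched
    federated["original_record"] = original_record_url
    return federated


# Made-up example entry and URL.
source_entry = {"title": "School locations", "publisher": "Example Education Agency"}
federated_entry = federate(source_entry, "http://example.org/catalog/records/42")
print(federated_entry["original_record"])
```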

Coverage of typical catalog metadata

Must cover the metadata that is found in typical government data catalogs.

Required by: all use cases

Simple transformation from existing catalog data

Must allow population from existing data catalogs without requiring the production of new metadata, or an expensive (that is, manual) modification of existing metadata. In other words, implementing the standard format for an existing data catalog must not require cleaning up or otherwise modifying the metadata that the catalog collects beyond simple mechanical transformations.

Required by: all use cases

Extensible metadata model

Must be extensible with additional, catalog-specific metadata fields.

Required by: UC2
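A simple way to picture this requirement is a record with a fixed set of core fields plus an open-ended set of additional ones. The sketch below is one possible shape, not a proposed design; the class name, the core fields and the extra field "survey_method" are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ExtensibleRecord:
    """Core metadata shared by all catalogs, plus catalog-specific extras."""
    title: str
    publisher: str
    extra: Dict[str, str] = field(default_factory=dict)  # catalog-specific fields


# A statistics office adds a field that other catalogs do not have.
record = ExtensibleRecord(
    title="Household income survey",
    publisher="Example Statistics Office",
    extra={"survey_method": "household sample"},
)

# Consumers that only know the core fields can ignore `extra` entirely.
print(record.title, sorted(record.extra))
```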

Bandwidth conserving

Must scale to catalogs that contain thousands of datasets without putting unreasonable strain on the bandwidth resources of catalog operator and catalog consumer.

Required by: all use cases


Standard Queries on Entries and Catalog Metadata

Must allow querying of the entries and catalog metadata using a standard mechanism (e.g., SPARQL or XQuery).

Required by: UC3