Difference between revisions of "Data Catalog Vocabulary/Vocabulary Reference"

From W3C eGovernment Wiki
Changes in this revision (Property: update/modification date, line 134):

  • RDF Property: dct:modified (http://purl.org/dc/terms/modified)
  • Range: @@@should we write it this way?@@@ rdfs:Literal typed as xsd:date. The date is encoded as a literal in "YYYY-MM-DD" form (ISO 8601 Date and Time Formats, http://www.w3.org/TR/NOTE-datetime). If the specific day or month is not known, then 01 should be specified.
Revision as of 06:45, 20 August 2010

EDITOR'S DRAFT!!! This is work in progress and not yet ready for public review.

Introduction

  • This document does not prescribe any particular method of deploying data expressed in dcat. There are many options, such as SPARQL endpoints, RDFa, RDF/XML, Turtle. Examples here use Turtle, but that's just because of Turtle's readability.

Vocabulary overview

Example

Encoding of property values

  • Values like "unknown" or "unspecified" should never be used, and if present in the original database they should be filtered out before publishing. Instead, the property should simply be omitted.
  • @@@ policy for using resources vs. literals
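
As a minimal sketch of the omission rule above, the following Turtle contrasts a placeholder value with simply leaving the property out. The ex: namespace, the dataset names, and the choice of dct:modified as the affected property are illustrative assumptions.

```turtle
@prefix dct: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/dataset/> .   # hypothetical namespace

# Discouraged: a placeholder string stands in for a missing value.
# ex:budget-2009 dct:modified "unknown" .

# Preferred: the modification date is not known, so the property is omitted.
ex:budget-2009 dct:title "Budget 2009" .

# When the value is known, it is stated normally.
ex:budget-2010
    dct:title "Budget 2010" ;
    dct:modified "2010-05-07"^^xsd:date .
```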

Class: Catalog

A data catalog is a curated collection of metadata about datasets.

  • Usage note: Typically, a web-based data catalog is represented as a single instance of this class.


Property: homepage

The homepage of the catalog.

  • Usage note: foaf:homepage is an inverse functional property (IFP), which means that it should be unique and precisely identify the catalog. This allows smushing (merging) various descriptions of the catalog when different URIs are used.
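
The effect of the inverse functional property can be sketched as follows. The catalog URIs and homepage are made up, and the dcat: namespace URI is an assumption: it is the one used by the later W3C DCAT Recommendation, which this draft does not state.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .   # assumed namespace URI
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# Two sources describe the same catalog under different URIs.
<http://example.org/a/catalog>
    a dcat:Catalog ;
    dct:title "Example Data Catalog" ;
    foaf:homepage <http://data.example.org/> .

<http://example.com/b/cat1>
    a dcat:Catalog ;
    dct:description "Open data published by the Example agency." ;
    foaf:homepage <http://data.example.org/> .

# Because foaf:homepage is inverse functional, a consumer may conclude
# that both URIs identify one catalog and merge the two descriptions.
```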

Property: publisher

The entity responsible for making the catalog available online.

Property: spatial/geographic coverage

The geographical area covered by the catalog.

Property: themes

The knowledge organization system (KOS) used to classify the catalog's datasets.

Property: title

A name given to the catalog.

Property: description

A free-text account of the catalog.

Property: license

  • Usage note: To allow automatic analysis of datasets, it is important to use canonical identifiers for well-known licenses. See the voiD guide for a list.

@@@ This describes the license under which the catalog itself can be used/reused, not the license of its datasets. Even if the license of the catalog applies to all of its datasets, it should be replicated on each dataset.
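
A small sketch of replicating the license on both the catalog and a dataset, using a canonical license URI. The use of dct:license and dcat:dataset, the dcat: namespace URI, and the ex: resources are assumptions made for illustration; the Creative Commons URI is just one example of a well-known license identifier.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .   # assumed namespace URI
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/> .          # hypothetical namespace

# License of the catalog itself (its metadata).
ex:catalog
    a dcat:Catalog ;
    dct:license <http://creativecommons.org/licenses/by/3.0/> ;
    dcat:dataset ex:dataset1 .

# The same license is stated again on the dataset rather than implied.
ex:dataset1
    a dcat:Dataset ;
    dct:license <http://creativecommons.org/licenses/by/3.0/> .
```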

Property: dataset

A dataset that is part of the catalog.

Property: catalog record

A catalog record that is part of the catalog.

Class: Catalog record

A record in a data catalog, describing a single dataset.

  • Usage note: This class is optional and not all catalogs will use it. It exists for catalogs where a distinction is made between metadata about a dataset and metadata about the dataset's entry in the catalog. For example, the publication date property of the dataset reflects the date when the information was originally made available by the publishing agency, while the publication date of the catalog record is the date when the dataset was added to the catalog. In cases where both dates differ, or where only the latter is known, the publication date should only be specified for the catalog record.

@@@ in web-based catalogs, the URL of the catalog page should be used as URI for the catalog record if it is a permalink.

@@@ if named graphs are used, all RDF triples describing the catalog record, the dataset, and its distributions, should go into a graph named with the catalog record's URI

Property: listing date

The date of listing the corresponding dataset in the catalog.

  • Usage note: This indicates the date of listing the dataset in the catalog and not the publication date of the dataset itself.

Property: update/modification date

Most recent date on which the catalog entry was changed, updated or modified.

  • Usage note: This indicates the date of last change of a catalog entry, i.e. the catalog metadata description of the dataset, and not the date of the dataset itself.

Property: dataset

Links the catalog record to the dcat:Dataset resource described in the record.
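
The distinction between record metadata and dataset metadata might look like the sketch below. The dates and ex: resources are invented. The terms dcat:record, dcat:CatalogRecord, dct:issued for the listing date, and foaf:primaryTopic for the link from record to dataset follow the later W3C DCAT Recommendation and are assumptions here; this draft may use different terms.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .   # assumed namespace URI
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .          # hypothetical namespace

ex:catalog
    a dcat:Catalog ;
    dcat:dataset ex:dataset1 ;
    dcat:record  ex:record1 .

# The record: metadata about the catalog entry.
ex:record1
    a dcat:CatalogRecord ;
    dct:issued   "2010-06-15"^^xsd:date ;   # date the dataset was listed
    dct:modified "2010-08-20"^^xsd:date ;   # last change to the entry
    foaf:primaryTopic ex:dataset1 .

# The dataset: metadata about the data itself.
ex:dataset1
    a dcat:Dataset ;
    dct:issued   "2009-11-02"^^xsd:date ;   # original release by the agency
    dct:modified "2010-05-07"^^xsd:date .   # last change to the data
```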

Class: Dataset

A collection of data, published or curated by a single source, and available for access or download in one or more formats.

  • Usage note: This class represents the actual dataset as published by the dataset publisher. In cases where a distinction between the actual dataset and its entry in the catalog is necessary (because metadata such as modification date and maintainer might differ), the catalog record class can be used for the latter.

Property: update/modification date

Most recent date on which the dataset was changed, updated or modified.

  • Range: xsd:date. The date is encoded as a literal in "YYYY-MM-DD" form. If the specific day or month is not known, then 01 should be specified. Values like "unknown" or "continuous" are not permitted.
  • Usage note: The value of this property indicates a change to the actual dataset, not a change to the catalog record. An absent value may indicate that the dataset has never changed after its initial publication, or that the date of last modification is not known, or that the dataset is continuously updated.
  • Example: 2010-05-07
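
The encoding rule above can be illustrated with a few hypothetical datasets. The ex: names are made up; dct:modified is the property named for the modification date elsewhere in this revision.

```turtle
@prefix dct: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/dataset/> .   # hypothetical namespace

ex:a dct:modified "2010-05-07"^^xsd:date .   # full date known
ex:b dct:modified "2010-05-01"^^xsd:date .   # only the month (May 2010) known
ex:c dct:modified "2010-01-01"^^xsd:date .   # only the year (2010) known

# ex:d carries no dct:modified at all: the date is unknown, the dataset
# never changed, or it is continuously updated.
ex:d dct:title "Dataset D" .
```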

Property: title

A name given to the dataset.

Property: description

free-text account of the dataset.

Property: publisher

An entity responsible for making the dataset available.

Property: release date

Date of formal issuance (e.g., publication) of the dataset.

  • Range: xsd:date. The date is encoded as a literal in "YYYY-MM-DD" form. If the specific day or month is not known, then 01 should be specified. Values like "unknown" or "continuous" are not permitted.
  • Usage note: The value of this property indicates when the actual dataset was originally made available by its publisher, not when the dataset was listed in the catalog. An absent value may indicate that the release date is not known.
  • Example: 2010-05-07

Property: frequency

The frequency with which the dataset is published.

  • Usage note: @@@ values should come from a controlled vocabulary, i.e. a predefined set of resources.
  • Domain: dct:Collection, so a Dataset that uses this property must be a dct:Collection as well.

Property: identifier

A unique identifier of the dataset.

  • Usage note: The identifier might be used to coin a permanent and unique URI for the dataset, but representing it explicitly is still useful.

Property: spatial/geographical coverage

Spatial coverage of the dataset.

  • Usage note: @@@ controlled vocabulary. geonames???

Property: temporal coverage

@@@ The temporal period that the dataset covers.

  • Range: dct:PeriodOfTime (An interval of time that is named or defined by its start and end dates)
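
One possible way to state a period by its start and end dates is sketched below. This draft does not say which properties carry the two dates; dct:temporal, dcat:startDate and dcat:endDate are taken from later versions of DCAT and are assumptions here, as are the dcat: namespace URI and the ex: resource.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .   # assumed namespace URI
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .          # hypothetical namespace

# A dataset covering the calendar year 2009.
ex:dataset1
    a dcat:Dataset ;
    dct:temporal [
        a dct:PeriodOfTime ;
        dcat:startDate "2009-01-01"^^xsd:date ;
        dcat:endDate   "2009-12-31"^^xsd:date
    ] .
```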

Property: license

The license under which the dataset is published and can be reused.

  • Usage note: @@@ copied from void guide @@@ To allow automatic analysis of datasets, it is important to use canonical identifiers for well-known licenses. See the voiD guide for a list.

Property: granularity

Describes the level of granularity of the data. @@@ elaborate more @@@

  • Usage note: This is usually geographical or temporal but can also be another dimension, e.g. Person can be used to describe the granularity of a dataset about average income.

A set of sample values used in data.gov: country, county, longitude/latitude, region, plane, airport.

Property: data dictionary

Provides some sort of description that helps in understanding the data. This usually consists of a table explaining the meaning of columns, the interpretation of values, and the acronyms/codes used in the data.

  • Usage note: @@@ Review @@@ It is rarely provided in current catalogs and has no consistent usage; however, when it is provided, it is either a link to some document or embedded in a document packaged together with the dataset. It is recommended to represent it as a resource having the URL of the online document as its URI. Statistical datasets, as a particular yet common case, can have a more structured description, and the ongoing work on SDMX+RDF can be utilized here.

Property: data quality

Describes the quality of the data.

  • Usage note: @@@Review@@@ This is a very general property, and it is not clear how exactly it will be used, as catalogs currently either do not use it or use it with meaningless values. Catalogs are expected to define more specific sub-properties to describe quality characteristics; e.g. statistical data usually have a lot to describe about the quality of sampling, collection mode, non-response adjustment, etc.

Property: theme/category

The main category of the dataset. A dataset can have multiple themes.

  • Usage note: The set of skos:Concept resources used to categorize the datasets is organized in a skos:ConceptScheme describing all the categories and their relations in the catalog.

Property: keyword/tag

A keyword or tag describing the dataset.

Property: related documents

A related document such as technical documentation, agency program page, citation, etc.

  • Range: Has no defined range
  • Usage note: The value is the URI of the related document.

Property: dataset distribution

Connects a dataset to its available distributions.

Class: Distribution

Represents a specific available form of a dataset. Each dataset might be available in different forms; these forms might represent different formats of the dataset, different endpoints, etc. Examples of a Distribution include a downloadable CSV file, an XLS file representing the dataset, and an RSS feed.

  • Usage note: This represents a general availability of a dataset; it implies no information about the actual access method of the data, i.e. whether it is a direct download, API, or a splash page. Use one of its subclasses when the particular access method is known.

Property: access/download

Points to the location of a distribution. This can be a direct download link, a link to an HTML page containing a link to the actual data, a feed, a web service, etc. The semantics are determined by its domain (Distribution, Feed, WebService, Download).

  • Usage note: The value is a URL.

Property: size

The size of a distribution.

  • Example:
   :distribution dcat:size [ dcat:bytes "5120"^^xsd:integer ; rdfs:label "5KB" ] .

Property: format

The file format of the distribution.

  • Usage note: MIME types are used as values. A list of MIME type URLs can be found at IANA. However, ESRI Shape files have no specific MIME type (a Shape distribution is actually a collection of files); this is still an open question. @@@

Class: Download

Represents a downloadable distribution of a dataset.

  • Usage note: The accessUrl of a Download distribution should be a direct download link (one-click access to the data file).

Class: WebService

Represents a web service that enables access to the data of a dataset.

  • Usage note: Describe the web service using accessUrl, format and size. Further description of the web service is outside the scope of dcat.

Class: Feed

Represents the availability of a dataset as a feed.

  • Usage note: Describe the feed using accessUrl, format and size. Further description of the feed is outside the scope of dcat.
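
The distribution subclasses described above could be used roughly as follows. The class and property names dcat:Download, dcat:Feed and dcat:accessUrl are inferred from the names in the text; dcat:distribution, dct:format as the format property, the dcat: namespace URI, and all ex: resources and file URLs are assumptions for illustration.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .   # assumed namespace URI
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/> .          # hypothetical namespace

ex:dataset1
    a dcat:Dataset ;
    dcat:distribution ex:dataset1-csv , ex:dataset1-feed .

# Direct, one-click download of a CSV file.
ex:dataset1-csv
    a dcat:Download ;
    dcat:accessUrl <http://example.org/files/dataset1.csv> ;
    dct:format <http://www.iana.org/assignments/media-types/text/csv> .

# The same dataset offered as an RSS feed.
ex:dataset1-feed
    a dcat:Feed ;
    dcat:accessUrl <http://example.org/feeds/dataset1.rss> ;
    dct:format <http://www.iana.org/assignments/media-types/application/rss+xml> .
```

When the access method is not known, the same pattern would apply with the general Distribution class in place of a subclass.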

Class: Category and category scheme

The knowledge organization system (KOS) used to represent themes/categories of datasets in the catalog.

  • Usage note: It is necessary to use either skos:inScheme or skos:topConceptOf on every skos:Concept; otherwise it is not clear which concept scheme the concept belongs to.
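
A minimal sketch of a category scheme in which every concept names its scheme is given below. The SKOS terms are standard; dcat:themeTaxonomy and dcat:theme, the dcat: namespace URI, and the ex: resources are assumptions, since this draft does not show the exact terms that link catalogs and datasets to categories.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .   # assumed namespace URI
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/> .          # hypothetical namespace

ex:themes
    a skos:ConceptScheme ;
    skos:prefLabel "Catalog themes"@en .

# A top-level category: skos:topConceptOf ties it to the scheme.
ex:theme-health
    a skos:Concept ;
    skos:prefLabel "Health"@en ;
    skos:topConceptOf ex:themes .

# A narrower category: skos:inScheme ties it to the scheme.
ex:theme-hospitals
    a skos:Concept ;
    skos:prefLabel "Hospitals"@en ;
    skos:broader ex:theme-health ;
    skos:inScheme ex:themes .

ex:catalog  a dcat:Catalog ; dcat:themeTaxonomy ex:themes .
ex:dataset1 a dcat:Dataset ; dcat:theme ex:theme-hospitals .
```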

Class: Organization/Person

  • Usage note: FOAF provides sufficient properties to describe these entities.

Extending the vocabulary

  • As with all RDF models, the model can be extended simply by using additional RDF properties anywhere. Catalog operators may choose from properties in existing vocabularies, or create their own custom vocabulary.
  • Additional classes, from existing or new vocabularies, can also be used.
  • Extensions used in a particular catalog should be documented to make users of the data aware of the additional available properties.
  • Creating new subclasses and subproperties of terms used in dcat, such as new types of distributions, is generally discouraged because it can break SPARQL queries that data consumers use to query the data.
  • As always with RDF, if you need to introduce new classes or properties, do not introduce new terms in existing namespaces owned by someone else, but set up your own namespace and define new terms in that namespace.
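
An extension along these lines might look like the sketch below, where a catalog operator defines a custom property in a namespace of their own instead of adding terms to the dcat namespace. The myc: namespace, the myc:reviewCycle property, and the ex: dataset are all hypothetical, and the dcat: namespace URI is an assumption.

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .                 # assumed namespace URI
@prefix myc:  <http://example.org/my-catalog/terms#> .       # operator's own namespace
@prefix ex:   <http://example.org/> .                        # hypothetical namespace

# A custom property, defined and documented in the operator's namespace.
myc:reviewCycle
    a rdf:Property ;
    rdfs:label "review cycle" ;
    rdfs:comment "How often the publishing agency reviews the dataset." .

# The custom property is used alongside the dcat terms.
ex:dataset1
    a dcat:Dataset ;
    myc:reviewCycle "annual" .
```

The property is introduced as a plain rdf:Property rather than as a subproperty of a dcat term, in keeping with the advice against new subclasses and subproperties.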