TaskForces/CommunityProjects/LinkingOpenData/DataSets/CKANmetainformation

From W3C Wiki

Guidelines for Collecting Metadata on Linked Datasets in the datahub.io Data Catalog

To keep the LOD cloud diagram up to date, the Linking Open Data community effort has started to collect meta-information about Linked datasets on datahub.io, a registry of open data and content packages provided by the Open Knowledge Foundation.

This page explains how dataset publishers, or other people who want a dataset to be added to the LOD cloud, should describe datasets on datahub.io.

The list of datasets about which we have already collected information can be found here:

http://validator.lod-cloud.net/


Which datasets are included in the LOD cloud diagram?

All datasets that fulfill the following requirements are included:

  1. Data items are accessible via dereferenceable URIs. Offering only a SPARQL endpoint without dereferenceable URIs is not sufficient for inclusion.
  2. The dataset sets at least 50 RDF links pointing at other datasets, or at least one other dataset sets at least 50 RDF links pointing at your dataset.
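
The two requirements can be read as a simple predicate. The following is only an illustrative sketch: the function name, its parameters, and the dictionary shapes are assumptions and not part of any datahub.io tooling. It also assumes that outgoing links are counted in total across all target datasets, while incoming links are counted per source dataset.

```python
MIN_LINKS = 50  # threshold stated in requirement 2


def qualifies_for_lod_cloud(has_dereferenceable_uris, outlinks, inlinks):
    """Return True if a dataset meets both inclusion requirements.

    outlinks: dict mapping target dataset ID -> RDF links set by this dataset
    inlinks:  dict mapping source dataset ID -> RDF links pointing at this dataset
    """
    # Requirement 1: a SPARQL endpoint alone is not enough.
    if not has_dereferenceable_uris:
        return False
    # Requirement 2 (assumed reading): total outgoing links, or a single
    # other dataset contributing enough incoming links.
    enough_out = sum(outlinks.values()) >= MIN_LINKS
    enough_in = any(count >= MIN_LINKS for count in inlinks.values())
    return enough_out or enough_in
```

For example, a dataset with dereferenceable URIs and 30 links to one dataset plus 25 to another would pass under this reading, while two incoming sources with 49 links each would not.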

How do I add a data set to datahub.io or edit an existing data set?

  1. Please register with datahub.io before editing or adding any packages.
  2. Please confirm that your data set does not already exist on datahub.io before adding a new data set.
  3. Add or edit your data set and describe it with the following minimum required information:
    • name (a unique id)
    • title
    • URL
    • number of triples
    • links to other data sets.
  4. Please tag newly added data sets with lod.
  5. If you are not aware of any in- or outlinks, tag it with lodcloud.nolinks.
  6. Please provide as much additional information as possible (e.g. SPARQL endpoint, voiD description, license, and the topic of the data set) as described below. This information helps the community to know more about the development state of the Web of Linked Data and is made available via the datahub.io API.

Minimum Information

Please provide the following minimum information about your data set.

Standard CKAN fields

Field name | Description | Format/Examples
Name | Unique ID for your data set on datahub.io | [a-z0-9-]+, e.g. "my-dataset"
Title | Full name of your data set | "My Dataset"
URL | Link to data set homepage | http://example.com/my-ds
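
The Name format can be checked locally before submitting an entry. This is a minimal sketch that assumes the pattern [a-z0-9-]+ must match the whole name; the helper name is hypothetical.

```python
import re

# Pattern taken from the Name row above; matching the full string is an assumption.
NAME_PATTERN = re.compile(r"[a-z0-9-]+")


def is_valid_name(name):
    """Return True if name consists only of lowercase letters, digits, and hyphens."""
    return NAME_PATTERN.fullmatch(name) is not None
```

Under this reading, "my-dataset" is accepted, while "My Dataset" is rejected because of the uppercase letters and the space.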

Custom datahub.io fields

Field name | Description | Format/Examples
triples | Approximate size of your data set in RDF triples | 100000, 62345123
links:xxx | Number of RDF links pointing at the data set with Data Hub ID xxx (http://thedatahub.org/dataset/xxx). Please provide a separate links:xxx statement for each data set you are linking to. | 20000
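
As an illustration of how the custom fields fit together, the sketch below builds the key/value pairs for one data set. The function name and the choice to store values as plain digit strings (mirroring the examples above) are assumptions; the target IDs in the usage note are placeholders.

```python
def build_custom_fields(triples, link_counts):
    """Build the custom field mapping for one data set.

    triples:     approximate number of RDF triples
    link_counts: dict mapping target Data Hub ID -> number of RDF links
    """
    fields = {"triples": str(triples)}
    # One separate links:<id> entry per linked data set.
    for target_id, count in sorted(link_counts.items()):
        fields["links:" + target_id] = str(count)
    return fields
```

Calling build_custom_fields(100000, {"dbpedia": 20000, "geonames": 150}) yields one triples entry plus the entries links:dbpedia and links:geonames.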

datahub.io tags

Please use the following tags to provide meta-information about your data set.

We will use the topic information to color the LOD cloud later.

Tag | Purpose
<topic> | One of:
  • media
  • geographic
  • lifesciences
  • publications (including library and museum data)
  • government
  • ecommerce
  • socialweb (people and their activities)
  • usergeneratedcontent (blog posts, discussions, pictures, ...)
  • schemata (structural resources, including vocabularies, ontologies, classifications, thesauri)
  • crossdomain
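
A tag list can be checked for its topic as follows. This sketch assumes that "One of" means exactly one topic tag per data set; the function name is hypothetical.

```python
# Topic tags copied from the list above.
TOPIC_TAGS = frozenset({
    "media", "geographic", "lifesciences", "publications", "government",
    "ecommerce", "socialweb", "usergeneratedcontent", "schemata", "crossdomain",
})


def topic_of(tags):
    """Return the single topic tag found in tags, or raise ValueError."""
    topics = sorted(set(tags) & TOPIC_TAGS)
    if len(topics) != 1:
        raise ValueError("expected exactly one topic tag, found: %r" % (topics,))
    return topics[0]
```

With the tags lod, government, and format-foaf, the function returns government; a list without any topic tag raises an error.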

Enhanced Information

Please provide the following additional information about your data set. This information helps the community to know more about the development state of the Web of Linked Data and is made available via the datahub.io API.

Standard datahub.io fields

Field name | Description | Format/Examples
Version | Last modification date or version of your data set | "2010-04 (3.5)", "2006", "beta"
Notes | Description of your data set | some free text
Author | Name of publishing org and/or person | "Talis (Leigh Dodds)"
Author email | Contact email | leigh@ldodds.com
License | Standard license drop-down | OSI approved::MIT license

Custom datahub.io fields

Field name | Description | Format/Examples
shortname | Short name for LOD bubble | "NY Times"
license_link | Custom license link | http://example.com/so-sue-me
sparql_graph_name | Named graph in SPARQL store (if used by your SPARQL endpoint) | http://species.geospecies.org
namespace | Instance namespace | http://dbpedia.org/resource/

datahub.io resource links

Links (other than dereferenceable URIs) that enable alternative access to the data set (e.g., via downloads or SPARQL endpoints) should be specified in the Resources section of the CKAN entry form. Please also provide links to the voiD description or Semantic Web Sitemap describing your data set.

Purpose | Format | Description
Download page | | Download
XML Sitemap | meta/sitemap | XML Sitemap
SPARQL endpoint | api/sparql | SPARQL endpoint
voiD file | meta/void | voiD description
RDF/XML download | application/rdf+xml | Download
Turtle download | text/turtle | Download
N-Triples download | application/x-ntriples | Download
N-Quads download | application/x-nquads | Download
RDF Schema | meta/rdf-schema | Download link to RDF/OWL Schema used by your data set (in addition to having dereferenceable vocabulary URIs)
RDF/XML example link | example/rdf+xml | Link to an example data item within your data set (RDF/XML)
Turtle example link | example/turtle | Link to an example data item within your data set (Turtle)
N-Triples example link | example/ntriples | Link to an example data item within your data set (N-Triples)
HTML+RDFa example link | example/rdfa | Link to an example data item within your data set (RDFa)
Vocabulary Mappings, e.g., OWL, RDFS, RIF, R2R | mapping/<format> | If your data set uses proprietary vocabulary terms and you know that these terms also exist in other vocabularies, you should set owl:equivalentClass, owl:equivalentProperty, rdfs:subClassOf, and/or rdfs:subPropertyOf links pointing at these terms, or provide mappings expressed as RIF rules or using the R2R Mapping Language. If your mappings can be downloaded as a single file, please provide the link to the download.
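
Because the Format values are free-text strings, typos are easy to make. The sketch below flags format strings that do not appear in the table above. The resource dictionaries with url and format keys, the function name, and the example value mapping/owl are assumptions for illustration; resources without a format string are skipped.

```python
# Format strings copied from the table above.
TABLE_FORMATS = frozenset({
    "meta/sitemap", "api/sparql", "meta/void", "application/rdf+xml",
    "text/turtle", "application/x-ntriples", "application/x-nquads",
    "meta/rdf-schema", "example/rdf+xml", "example/turtle",
    "example/ntriples", "example/rdfa",
})
MAPPING_PREFIX = "mapping/"  # mapping/<format> accepts any non-empty suffix


def unrecognized_formats(resources):
    """Return the format strings that are not listed in the table."""
    unknown = []
    for resource in resources:
        fmt = resource.get("format", "")
        if not fmt or fmt in TABLE_FORMATS:
            continue
        if fmt.startswith(MAPPING_PREFIX) and len(fmt) > len(MAPPING_PREFIX):
            continue
        unknown.append(fmt)
    return unknown
```

A resource list containing api/sparql, meta/void, mapping/owl, and a misspelled meta/viod would report only the misspelled entry.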

datahub.io tags

Please use the following tags to provide meta-information about your data set.

Please also list the vocabularies used by your data set so that the community can get an overview of which vocabularies are commonly used on the Web of Linked Data.

Linked Data published on the Web should be as self-describing as possible in order to make it easier for clients to understand and use the data. Important aspects of self-descriptiveness are making vocabulary terms dereferenceable according to the best practices described in Publishing RDF Vocabularies, using terms from common vocabularies and providing vocabulary mappings for proprietary vocabulary terms. In order to allow the community to get an overview which data sets implement these best practices, please tag your data set accordingly.

Tag | Purpose
format-<prefix> | A vocabulary used by the data set, e.g., format-skos, format-dc, format-foaf. Use http://prefix.cc/ to find a prefix for a vocabulary. If a vocabulary is not in prefix.cc, then add it there or ignore that vocabulary.
no-proprietary-vocab | Indicates that your data set does not use a proprietary vocabulary (defined within your top-level domain).
deref-vocab / no-deref-vocab | Indicates whether the proprietary vocabulary terms used by your data set (the ones that are defined within your top-level domain) are dereferenceable according to the best practices for Publishing RDF Vocabularies.
vocab-mappings / no-vocab-mappings | Indicates whether you provide mappings for proprietary vocabulary terms (by setting owl:equivalentClass, owl:equivalentProperty, rdfs:subClassOf, and/or rdfs:subPropertyOf links, or by publishing mappings expressed as RIF rules or using the R2R Mapping Language).
provenance-metadata / no-provenance-metadata | Indicates whether the data set provides provenance meta-information (creator of the data set, creation date, maybe creation method) as document meta-information or via a voiD description, for instance using the dc:creator or dc:date properties.
license-metadata / no-license-metadata | Indicates whether the data set provides licensing meta-information as document meta-information or via a voiD description, for instance using the dc:rights property.
published-by-producer / published-by-third-party | Indicates whether the data set is published by the original data producer or a third party.
limited-sparql-endpoint | Indicates that the SPARQL endpoint does not serve the whole data set.
lodcloud.nolinks | Dataset has no external RDF links to other datasets.
lodcloud.unconnected | Dataset has no external RDF links to or from other datasets.
lodcloud.needsinfo | The data provider or dataset homepage does not provide minimum information (and the information can't be determined from the SPARQL endpoint or downloads).
lodcloud.needsfixing | The dataset is currently broken. Provide details in the Notes.
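
The tag conventions above can be derived mechanically from a few facts about a data set. The following sketch is illustrative only: the function name and parameters are hypothetical, it covers just the format-<prefix> tags, the four tag pairs with a no- variant, and the two link-related lodcloud tags, and it assumes that lodcloud.unconnected replaces lodcloud.nolinks when there are no incoming links either.

```python
# Tag pairs from the table whose negative form is the "no-" variant.
PAIRED_TAGS = ("deref-vocab", "vocab-mappings",
               "provenance-metadata", "license-metadata")


def derive_tags(prefixes, flags, has_outlinks, has_inlinks):
    """Derive datahub.io tags for one data set.

    prefixes: prefix.cc prefixes of the vocabularies used, e.g. {"foaf", "dc"}
    flags:    dict mapping a name from PAIRED_TAGS -> True or False
    """
    tags = ["format-" + prefix for prefix in sorted(prefixes)]
    for name, value in flags.items():
        if name not in PAIRED_TAGS:
            raise ValueError("unknown paired tag: %r" % (name,))
        tags.append(name if value else "no-" + name)
    if not has_outlinks:
        # Assumption: use the more specific tag when nothing links in either.
        tags.append("lodcloud.nolinks" if has_inlinks else "lodcloud.unconnected")
    return tags
```

For a data set that uses FOAF and Dublin Core, has dereferenceable vocabulary terms, provides no license metadata, and sets no outgoing links but receives incoming ones, the result contains format-dc, format-foaf, deref-vocab, no-license-metadata, and lodcloud.nolinks.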