Best practices guidelines

From Data on the Web Best Practices
General Guidelines

What is a best practice?

  • A Best Practice implements one or more UC Requirements.
  • A UC Requirement is motivated by one or more Use Cases.
  • A Best Practice has a title, a description and one or more How to Sections.
  • A How to Section specifies one possible way of implementing a Best Practice.
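The relationships above can be sketched as a small data model. This is only an illustration: the class and field names are assumptions, the use case and description texts are placeholders, and the link between BP1 and R-FormatOpen is assumed for the example (the actual mapping is still to be defined).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UseCase:
    name: str


@dataclass
class UCRequirement:
    identifier: str
    description: str
    use_cases: List[UseCase]

    def __post_init__(self) -> None:
        # A UC Requirement is motivated by one or more Use Cases.
        if not self.use_cases:
            raise ValueError("a UC Requirement needs at least one Use Case")


@dataclass
class HowToSection:
    # One possible way of implementing a Best Practice.
    text: str


@dataclass
class BestPractice:
    title: str
    description: str
    requirements: List[UCRequirement]
    how_to_sections: List[HowToSection]

    def __post_init__(self) -> None:
        # A Best Practice implements one or more UC Requirements ...
        if not self.requirements:
            raise ValueError("a Best Practice needs at least one UC Requirement")
        # ... and has one or more How to Sections.
        if not self.how_to_sections:
            raise ValueError("a Best Practice needs at least one How to Section")


# Hypothetical use case; the BP1 <-> R-FormatOpen link is an assumption.
example_use_case = UseCase(name="Hypothetical use case")
format_open = UCRequirement(
    identifier="R-FormatOpen",
    description="Data should be available in an Open format",
    use_cases=[example_use_case],
)
bp1 = BestPractice(
    title="Datasets should be available in an open format",
    description="Placeholder description.",
    requirements=[format_open],
    how_to_sections=[HowToSection("Placeholder: one possible implementation.")],
)
```

Checking the "one or more" constraints at construction time means an incomplete Best Practice fails early instead of being noticed only when the mapping tables are filled in.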

The Use Case Requirements [1] show that licenses and vocabularies, and not just the data, should be considered "main targets" when defining best practices. However, describing best practices for licenses is out of scope for this working group.

Question 1: Should we define best practices for vocabularies or just for datasets?

Terminology
  • Dataset: assuming that best practices definitions should not consider specific standards or technologies, we adopt the Dataset definition provided by DCAT, which considers a dataset as a "collection of data, published or curated by a single agent, and available for access or download in one or more formats".
  • Distribution: according to DCAT, a distribution represents an accessible form of a dataset, for example a downloadable file, an RSS feed or a web service that provides the data. A dataset may have multiple distributions.
  • Vocabulary: according to the Linked Data Glossary, a vocabulary is a collection of "terms" for a particular purpose. Vocabularies play a fundamental role when "publishing and consuming Data on the Web", specifically to help with data integration. The use of this term overlaps with Ontology. Vocabularies may be used to define metadata.
  • Metadata: according to the Linked Data Glossary, metadata is information used to administer, describe, preserve, present, use or link other information held in resources, especially knowledge resources, be they physical or virtual. Metadata may be further subcategorized into several types. Datasets, distributions and vocabularies are described by metadata.
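As a rough illustration of how these four terms fit together, the sketch below builds metadata for one dataset with two distributions, using terms from the DCAT and Dublin Core vocabularies and serializing the result as JSON-LD. The function name, the title and the example.org URLs are assumptions made for the example, and the media type is written as a plain string for brevity.

```python
import json


def dataset_metadata(title, distributions):
    """Describe a dataset and its distributions with DCAT terms.

    `distributions` is a list of (download_url, media_type) pairs.
    """
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Dataset",
        "dct:title": title,
        # A dataset may have multiple distributions.
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dcat:downloadURL": {"@id": url},
                # Simplification: media type given as a plain string.
                "dcat:mediaType": media_type,
            }
            for url, media_type in distributions
        ],
    }


# Hypothetical dataset offered as both CSV and JSON.
metadata = dataset_metadata(
    "Example dataset",
    [
        ("http://example.org/data.csv", "text/csv"),
        ("http://example.org/data.json", "application/json"),
    ],
)
print(json.dumps(metadata, indent=2))
```

Here the vocabulary (DCAT) supplies the terms, the metadata is the resulting description, and the same dataset is reachable through two distributions.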
Best Practices

The example Best Practices listed below were extracted from the use case requirements [2]. Each BP may be implemented in several different ways, depending, for example, on the technology used.

  • BP1. Datasets should be available in an open format
  • BP2. Datasets should be available in a machine-readable format
  • BP3. Datasets should be available in multiple formats
  • BP4. Datasets should be accessible in different ways
  • BP5. Datasets should be available in standard data formats
  • BP6. Datasets should be described by metadata
  • BP7. Metadata should be available in a machine-readable format
  • BP8. Standard vocabularies should be used to define metadata
  • BP9. Datasets should be available at different levels of granularity
  • BP10. Datasets selected for publication should be of high-value
  • BP11. Datasets should be available in an up-to-date manner
  • BP12. Each data resource should be associated with a unique identifier
  • BP13. Datasets should be suitable for industry reuse
  • BP14. Provenance information should be available
  • BP15. Quality information should be available
  • BP16. Usage information should be available
  • BP17. Versioning information should be available
  • BP18. Licensing information should be available
  • BP19. Vocabularies should be well documented
  • BP20. Existing reference vocabularies should be reused where possible
  • BP21. Vocabularies should be shared in an open way
Mapping between Best Practices and Proposed Chapters

The table below shows an attempt to map the General Best Practices to the different groups that are currently working on the Best Practices document. (to be defined)

Section Best Practice
URI Best Practice for Web Data URI (DURI), URI Design and Management for Persistence, URIs versus APIs
Guidance on the Provision of Metadata
Use of core vocabularies to improve interoperability
Data quality vocabulary
Data usage vocabulary
Publishing and accessing versions of datasets

To discuss with the group (how to map?)

  • Making controlled vocabularies accessible as URI sets: Mark, Antoine
  • Technical factors for consideration when choosing data sets for publication: Nathalia, Flavio
  • Technical factors affecting potential use of open data for innovation, efficiency and commercial exploitation: Vagner, Nathalia, Hadley, Yaso
  • Data preservation: Phil, Christophe


Mapping between UC Requirements and Best Practices

The table below shows a map between Use Case Requirements and Best Practices (General and Specific). A Best Practice was created based on one or more requirements. (to be defined)

Requirement Requirement Description Best Practice
R-MetadataAvailable Metadata should be available
R-MetadataMachineRead Metadata should be machine-readable
R-MetadataStandardized Metadata should be standardized
R-MetadataDocum Metadata vocabulary, or values if vocabulary is not standardized, should be well-documented
R-MetadataInteroperable Metadata should be interoperable
R-GranularityLevels Data available at different levels of granularity should be accessible and modelled in a common way
R-FormatMachineRead Data should be available in a machine-readable format
R-FormatStandardized Data should be available in a standardized format
R-FormatOpen Data should be available in an Open format
R-FormatMultiple Data should be available in multiple formats
R-FormatLocalize It should be possible to localize data on the Web
R-VocabReference Existing reference vocabularies should be reused where possible
R-VocabDocum Vocabularies should be clearly documented
R-VocabOpen Vocabularies should be shared in an Open way
R-VocabVersion Vocabularies should include versioning information
R-LicenseAvailable Data should be associated with a license
R-LicenseMachineRead Data licenses should be provided in a machine-readable format
R-LicenseStandardized Standard vocabularies should be used to describe licenses
R-LicenseInteroperable Data licenses should be interoperable
R-LicenseLiability Liability terms associated with usage of Data on the Web should be clearly outlined
R-ProvAvailable Data provenance information should be available
R-SelectHighValue Datasets selected for publication should be of high-value
R-SelectDemand Datasets selected for publication should be in demand by potential users
R-AccessBulk Data should be available for bulk download
R-AccessRealTime Where data is produced in real-time, it should be available on the Web in real-time
R-AccessUptodate Data should be available in an up-to-date manner
R-SensitivePrivacy Data should not infringe on a person's right to privacy
R-SensitiveSecurity Data should not infringe on national security
R-UniqueIdentifier Each data resource should be associated with a unique identifier
R-MultipleRepresentations A data resource may have multiple representations, e.g. xml/html/json/rdf
R-DynamicGeneration Dynamic generation of Data on the Web from non-Web data resources
R-AutomaticUpdate Automatic update of Data on the Web when original data source is updated
R-CoreRegister Core registers should be accessible
R-IndustryReuse Data should be suitable for industry reuse
R-SLAAvailable Service Level Agreements (SLAs) for industry reuse of the data should be available if requested
R-SLAMachineRead SLAs should be provided in a machine-readable format
R-SLAStandardized Standard vocabularies should be used to describe SLAs
R-PotentialRevenue Potential revenue streams from data should be described
R-PersistentIdentification Data should be persistently identifiable
R-Archiving It should be possible to archive data
R-QualityAvailable Quality information should be available
R-UsageAvailable Usage information should be available
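Since the Best Practice column is still to be defined, one possible starting point is to pre-fill candidates wherever a requirement description is worded identically to a Best Practice title. The sketch below does this for a handful of entries copied from the lists above; the function names are assumptions, and a match found this way is only a candidate for the group to confirm, not a decided mapping.

```python
BEST_PRACTICES = {
    "BP1": "Datasets should be available in an open format",
    "BP10": "Datasets selected for publication should be of high-value",
    "BP12": "Each data resource should be associated with a unique identifier",
    "BP15": "Quality information should be available",
    "BP16": "Usage information should be available",
    "BP20": "Existing reference vocabularies should be reused where possible",
}

REQUIREMENTS = {
    "R-FormatOpen": "Data should be available in an Open format",
    "R-SelectHighValue": "Datasets selected for publication should be of high-value",
    "R-UniqueIdentifier": "Each data resource should be associated with a unique identifier",
    "R-QualityAvailable": "Quality information should be available",
    "R-UsageAvailable": "Usage information should be available",
    "R-VocabReference": "Existing reference vocabularies should be reused where possible",
}


def normalize(text):
    # Ignore case and extra whitespace when comparing wording.
    return " ".join(text.lower().split())


def candidate_mapping(requirements, best_practices):
    """Map each requirement to the BP with identical wording, or None."""
    by_text = {normalize(text): bp for bp, text in best_practices.items()}
    return {
        req: by_text.get(normalize(text))
        for req, text in requirements.items()
    }


mapping = candidate_mapping(REQUIREMENTS, BEST_PRACTICES)
for req, bp in mapping.items():
    print(req, "->", bp if bp else "(to be defined)")
```

Requirements whose wording differs from every title, such as R-FormatOpen ("Data ..." versus "Datasets ..." in BP1), stay unmapped and still need a decision by the group.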
References