Data on the Web Best Practices Use Cases & Requirements

Abstract

This document lists some use cases, compiled by the Data on the Web Best Practices Working Group, that represent scenarios of how data is commonly published on the Web and how it is used. This document also provides a set of requirements derived from these use cases that have been used to guide the development of the set of Data on the Web Best Practices and the development of two new vocabularies: Quality and Granularity Description Vocabulary and Data Usage Description Vocabulary.

2. Use Cases

A use case describes a scenario that illustrates an experience of publishing and using Data on the Web. The information gathered from the uses cases should be helpful for the identification of the best practices that will guide the publishing and usage of Data on the Web. In general, a best practice will be described at least by a statement and a how to do it section, i.e., a discussion of techniques and suggestions as how to implement it. Use cases descriptions shows some of the main challenges faced by publishers or developers. Information about challenges will be helpful to identify areas where Best Practices are necessary. According to the challenges, a set of requirements were defined, in such a way that a requirement motivates the creation of one or more best practices.

2.1 Use Case #1 - BuildingEye: SME use of public data

(Contributed by Deirdre Lee)

Buildingeye.com makes building and planning information easier to find and understand by mapping what's happening in your city. In Ireland local authorities handle planning applications and usually provide some customized views of the data (PDFs, maps, etc.) on their own website. However there isn't an easy way to get a nationwide view of the data. BuildingEye, an independent SME, built http://mypp.ie/ to achieve this. However as each local authority didn't have an Open Data portal, BuildingEye had to directly ask each local authority for its data. It was granted access to some authorities, but not all. The data it did receive was in different formats and of varying quality/detail. BuildingEye harmonized this data for its own system. However, if another SME wanted to use this data, they would have to go through the same process and again go to each local authority asking for the data.

Elements:

Domains: Planning data
Obligation/motivation: demand from SME
Usage: Commercial usage
Quality: standardized, interoperable across local authorities
Size: medium
Type/format: structured according to legacy system schema
Rate of change: daily
Potential audience: Business, citizens
“Governance”: local authorities

Challenges:

Access to data is currently a manual process, on a case by case basis
Data is provided in different formats, e.g. database dumps, spreadsheets
Data is structured differently, depending on the legacy system schema, concepts and terms not interoperable
No official Open license associated with the data
Data is not available for further reuse by other parties

Potential Requirements:

Creation of top-down policy on Open Data to ensure common understanding and approach
Top-down guidance on recommended Open license usage
Standardized, non-proprietary formats
Availability of recommended domain-specific vocabularies.

Requires: MetadataAvailable , FormatMachineRead , FormatStandardized , FormatOpen , LicenseAvailable and AccessBulk

Issue 36

R-FormatMachineRead seems to be more specific than the requirement from the two use cases listed as motivation?

2.2 Use Case #2 - The Land Portal

(Contributed by Carlos Iglesias)

The IFAD Land Portal platform it's been completely rebuilt as an Open Data collaborative platform for the Land Governance community. Among the new features the Land Portal will provide access to comprehensive and in-depth 100+ indicators from 25+ different sources on land governance issues for 200+ countries over the world, as well as a repository of land related-content and documentation. Thanks to the new platform people could (1) curate and incorporate new data and metadata by means of different data importers and making use of the underlying common data model; (2) search, explore and compare the data through countries and indicators; and (3) consume and reuse the data by different means (i.e. raw data download at the data catalog; linked data and SPARQL endpoint at RDF triplestore; RESTful API; and built-in graphic visualization framework)

Elements:

Domains: Land Governance; Development
Obligation/motivation: To find reliable data driven indicators on land governance and put all them together to facilitate access, study, analysis, comparison and data gaps detection.
Usage: Research; Policy Making, Journalism; Development; Investments; Governance; Food security; Poverty; Gender issues.
Quality: Every sort of data, from high quality to unverified one.
Size: Varies, but low-medium in general.
Type/format: Varies: APIs; JSON; spreadsheets; CSVs; HTMLs; XMLs; PDFs...
Rate of change: Usually yearly, but also lower rates (monthly, quarterly...)
Data lifespan: Unlimited.
Potential audience: Practitioners; Policy makers; Activists; Researchers; Journalists.

Challenges:

Data coverage.
Quality of data and metadata.
Lack of machine-readable metadata.
Inconsistency between different data sources.
Wide variety of formats and technologies.
Some non machine-readable formats.
Data variability (models, sources, etc.)
Data provenance.
Diversity and (sometimes) complexity of Licenses.
Internationalization issues (e.g. different formats for numbers, dates, etc.) and multilingualism

Potential Requirements:

Availability of general use taxonomies (countries, topics, etc.).
Data interoperability i.e. domain-specific vocabularies for a common data model with reference formats and protocols.
Data persistence.
Versioning mechanisms.

Requires: MetadataMachineRead , GranularityLevels , FormatMachineRead , FormatStandardized , FormatLocalize , VocabReference , VocabVersion , LicenseInteroperable , LicenseStandardized , ProvAvailable , AccessBulk , AccessRealTime , Persistent , QualityCompleteness and QualityMetrics

Requires: R-MetadataStandardized , MetadataInteroperable and GranularityLevels

2.3 Use Case #3 - Recife Open Data Portal

(Contributed by Bernadette Lóscio )

Recife is a city situated in the Northeast of Brazil and it is famous for being one of the Brazil’s biggest tech hubs. Recife is also one of the first Brazilian cities to release data generated by public sector organizations for public use as Open Data. Then Open Data Portal Recife was created to offer access to a repository of governmental machine-readable data about several domains, including: finances, health, education and tourism. Data is available in CSV and GeoJSON format and every dataset has a metadata description, i.e. descriptions of the data, that helps in the understanding and usage of the data. However, the metadata is not described using standard vocabularies or taxonomies. In general, data is created in a static way, where data from relational databases are exported in a CSV format and then published in the data catalog. Currently, they are working to have dynamically generated data from the contents of relational databases, then data will be available as soon as they are created. The main phases of the development of this initiative were: to educate people with appropriate knowledge concerning Open Data, relevant data identification in order to identify the sources of data that their pontential consumers could find useful, data extraction and transformation from the original data sources to the open data format, configuration and installation of the open data catalogue tool, data publication and portal release.

Elements:

Domains: Base registers, Cultural heritage information, Geographic information, Infrastructure information, Social data and Tourism Information
Obligation/motivation: Data that must be provided to the public under a legal obligation (Brazilian Information Acess Act, edited in 2012); Provide public data to the citizens
Usage: Data that supports democracy and transparency; Data used by application developers
Quality: Verified and clean data
Size: in general small to medium CSV files
Type/format: CSV, geojson
Rate of change: different rates of changes depending on the data source
Potential audience: application developers, startups, government organizations

Challenges:

Use common vocabs to facilitate data integration
Provide structural metadata to help data understanding and usage
Automate the data publishing process to keep data up to date and accurate

Requires: MetadataMachineRead , MetadataStandardized , MetadataDocum , VocabReference , VocabDocum , VocabOpen , SelectHighValue , SelectDemand , QualityCompleteness , DynamicGeneration , AutomaticUpdate and QualityComparable

Issue 25

Are R-SelectHighValue and R-SelectHighDemand are workable requirements?

Issue 28

The difference between R-DynamicGeneration and R-AutomaticUpdate is not clear.

2.4 Use Case #4 - Dados.gov.br

(Contributed by Yasodara)

Data.gov.br is the open data portal of the Brazil's Federal Government. The site was built in community, in a network pulled by three technicians from the Ministry of Planning. They managed the WG3 from "INDA" or "National Infrastructure for Open Data". The CKAN was chosen because it is Free Software and present more independent solutions for the placement of data catalog of the Federal Government provided on the internet.

Elements:

Domains: federal budget, addresses, Infrastructure information, e-gov tools usage, social data, geographic information, political information, Transport information
Obligation/motivation: Data that must be provided to the public under a legal obligation, the called LAI or Brazilian Information Acess Act, edited in 2012
Usage: Data that is the basis for services to the public; Data that has commercial reuse potential.
Quality: Authoritative, clean data, vetted and guaranteed;
Lineage/Derivation: Data came from various publishers. As a catalog, the site has faced several challenges, one of them was to integrate the various technologies and formulas used by publishers to provide datasets in the portal.
Type/format: Tabular data, text data
Rate of change: There is fixed data and data with high rate of change

Challenges:

data integration (lack of vocabs)
collaborative construction of the portal: managing online sprints and balancing public expectatives.
Licencing the data of the portal. Most of data that is inn the portal has not a special licence for data. As you can see, there is different types of licences that applied to the datasets.

Requires: R-VocabReference , R-LicenseAvailable , R-LicenseStandardized and R-QualityOpinions

2.5 Use Case #5 - ISO GEO Story

(Contributed by Ghislain Atemezing)

ISO GEO is a company managing catalogs records of geographic information in XML, conformed to ISO-19139. (ISO- 19139 is a French adaptation of the ISO- 19115) An excerpts is here . They export thousands of catalogs like that today, but they need to manage them better. In their platform, they store the information in a more conventional manner, and use this standard for export dataset compliant to Inspire interoperability , or via the CSW protocol. Sometimes, they have to enrich their metadata with other ones, produced by tools like GeoSource and accessed through SDI (Spatial Data Infrastructure), with their own metadata records. A sample containing 402 metadata records in ISO 19139 are in public consultation here . They want to be able to integrate all the different implementations of the ISO 19139 in different tools in a single framework to better understand the thousand of metadata records they use in their day-to-day business. Types of information recorded in each file (see example here ) are the following: Contact info (metadata) [Data issued]; spatial representation ; reference system info [code space ], spatial Resolution ; Geographic Extension of the data, File distribution; Data Quality ; process step, etc.

Challenges:

Achieve interoperability between supporting applications, e.g.:, validation and discovery services built over metada repository
Capture the semantics of the current metadata records with respect to ISO 19139 standard.
Unify way to have access to each record within the catalog at different level e.g.:, local, regional, national or EU level.

2.6 Use Case #6 - Dutch basic registers

(Contributed by Christophe Guéret)

The Netherlands has a set of registers they are looking at opening and exposing as Linked (Open) Data under the context of the project "PiLOD" community of expertise. The registers contain information about buildings, people, businesses and other individuals public bodies may want to refer to for they daily activities. One of them is, for instance, the service of public taxes ("BelastingDienst") which regularly pulls out data from several registers, stores this data in a big Oracle instance and curates it. This costly and time consuming process could be optimized by providing on-demand access to up-to-date descriptions provided by the register owners.

Challenges:

In terms of challenges, linking is for once not much of an issue as registers already cross-reference unique identifiers (see also http://www.wikixl.nl/wiki/gemma/index.php/Ontsluiting_basisgegevens ). A URIs scheme with predicable URIs is being considered for implementation. Actual challenges include:

Capacity: at this point, it can not be asked that every register owner cares for publishing his own data. Some of them export what they have on the national open data portal. This data has been used to do some testing with third-party publication from PiLODers but this is rather sensitive as a long term strategy (governmental data has to be tracable/trustable as such). The middle ground solution currently deployed is the PiLOD platform, a (semi)-official platform for publishing register data.
Privacy: some of the register data is personal or may become so when linked to others (e.g. disambiguate personal data based on adresses). Some registers will require to provide secured access to some of their data to some people only (Linked Data, not Open). Some others can go along with open data as long as they get a precise log of who is using what.
Revenue: institutions working under mixed gov/non-gov funding generate part of their revenue by selling some of the data they curate. Switching to an open data model will generate a direct loss in revenue that has to be backed-up by other means. This does not have to mean closing the data, e.g. a model of open dereferencing + paid dumps can be considered, as well as other indirect revenue streams.

Requires: VocabReference , R-SensitivePrivacy , UniqueIdentifier , MultipleRepresentations and R-CoreRegister

2.7 Use Case #7 - Wind Characterization Scientific Study

(Contributed by Eric Stephan)

This use case describes a data management facility being constructed to support scientific offshore wind energy research for the U.S. Department of Energy’s Office of Energy Efficiency and Renewable Energy (EERE) Wind and Water Power Program. The Reference Facility for Renewable Energy (RFORE) project is responsible collecting wind characterization data from remote sensing and in situ instruments located on an offshore platform. This raw data is collected by the Data Management Facility and processed into a standardized NetCDF format. Both the raw measurements and processed data are archived in the PNNL Institutional Computing (PIC) petascale computing facility. The DMF will record all processing history, quality assurance work, problem reporting, and maintenance activities for both instrumentation and data. All datasets, instrumentation, and activities are cataloged providing a seamless knowledge representation of the scientific study. The DMF catalog relies on linked open vocabularies and domain vocabularies to make the study data searchable. Scientists will be able to use the catalog for faceted browsing, ad-hoc searches, query by example. For accessing individual datasets a REST GET interface to the archive will be provided.

Challenges:

For accessing numerous datasets scientists will be accessing the archive directly using other protocols such as sftp, rsync, scp, access techniques such as: http://www.psc.edu/index.php/hpn-ssh

Requires: FormatStandardized , VocabReference , VocabOpen and AccessRealTime

2.8 Use Case #8 - Digital archiving of Linked Data

(Contributed by Christophe Guéret)

Taking the concrete example of the digital archive "DANS" , digital archives have so far been concerned with the preservation of what could be defined as "frozen" dataset. A frozen dataset is a finished, self-contained, set of data that does not evolve after it has been constituted. The goal of the preserving institution is to ensure this dataset remains available and readable for as many years as possible. This can for example concern an audio record, a digitized image, e-books or database dumps. Consumers of the data are expected to look-up for a specific content based on its associated persistent identifier , download it from the archive and use it. Now comes the question of the preservation of Linked Open Data. In opposition to "frozen" data sets, linked data can be qualified as "live" data. The resources it contains are part of a larger entity to which third parties contribute, one of the design principles indicate that other data producers and consumers should be able to point to data. As LD publishers stop offering their data (e.g. at the end of a project), taking the LD off-line as a dump and putting it in an archive effectively turns it into a frozen dataset, likewise to SQL dumps and other kind of data bases. The question then raises as to which extent this is an issue...

Challenges: The archive has to think about whether serving dereferencing for resources found in preserved datasets is required or not, also think about providing a SPARQL end point or not. If data consumers and publishers are fine with having RDF data dumps to be downloaded from the archive prior to its usage - just like any other digital item so far - the technical challenges could be limited to handling the size of the dumps and taking care of serialisation evolution over time (e.g. from Ntriples to Trig, or from RDF/XML to HDT ) as the preference for these formats evolves. Turning a live dataset into a frozen dump also raises the question of the scope. Considering that LD items are only part of a much larger graph that gives them meaning through context the only valid dump would be a complete snapshot of the entire connected component of the Web of Data graph the target dataset is part of.

Potential Requirements: Decide on the importance of the de-referencability of resources and the potential implications for domain names and naming of resources. Decide on the scope of the step that will turn a connected sub-graph into an isolated data dump.

Requires: VocabReference , UniqueIdentifier , PersistentIdentification and Archiving

2.9 Use Case #9 - LA Times' reporting of Ron Galperin's Infographic

(Contributed by Phil Archer )

On 27 March 2014, the LA Times published a story Women earn 83 cents for every $1 men earn in L.A. city government. It was based on an Infographic released by LA's City Controller, Ron Galperin. The Infographic was based on a dataset published on LA's open data portal, Control Panel LA . That portal uses the Socrata platform which offers a number of spreadhseet-like tools for examining the data, the ability to download it as CSV, embed it in a Web page and see its metadata.

Positive aspects:

The LA Times story makes its sources clear (it also links to a related Pew Research Center article ).
It offers readers a commentary on the particular issue raised and is easy for anyone to digest.
Data sources are cited directly and can be followed up on by (human) readers.

Negative aspects:

The Infographic itself only cites the data portal, not the specific dataset, i.e. https://controllerdata.lacity.org/ not https://controllerdata.lacity.org/Payroll/Gender-Breakdown-of-City-Workers-by-Category/fvfi-5kja/
The metadata provided on the data portal is very sparse with many fields left empty.
The dataset is itself the result of an analysis (there are only 8 lines in the table), the raw data on which it is based is not cited, let alone made available, and the methods used are not described.

Challenges:

Data Citation - how could Ron Galperin have referred to the source data in the Infographic? (the URI is way too long). QR code? Short PURL?
How could the publisher of the data link to the Infographic as a visualization of it?
In this case, the creator of the underlying data is the same as the creator of the Infographic, but if they were different, how could the data creator discover the Infographic, still less the media report about it?
The methodology used is not explained - making it hard to assess trustworthiness. How can provenance be described?
The metadata is incomplete and does not used a recognized standard vocabulary making automated discovery and use by anyone other than the data creator difficult.

Other Data Journalism blogs:

Requires: MetadataAvailable , MetadataStandardized , UniqueIdentifier and Citable

Issue 40

Review R-Citable as a requirement for Data Usage.

2.10 Use Case #10 - Uruguay: open data catalogue

(Contributed by AGESIC )

Uruguay open data site holds 85 datasets containing 114 resources since the first dataset was published in Dec. 2012. Open data initiative prioritizes the “use of data” rather than “quantity of data”, that’s why the catalogue holds 25 applications using datasets resources in some way. It’s important for the project to keep the relation 1/3 between applications and datasets. Most of the resources are CSV and shapefiles; basically we have a 3 stars catalogue and the reason why we can’t go to the next level is the lack of resources (time, human, economic, etc.) at government agencies to implement an open data liberation strategy. So when we are asked about opening data, keep it simple is the answer, and CSV is far the easiest and smart way to start. Uruguay has an Access to public information law but don’t have legislation about open data. The open data initiative is leaded by AGESIC with the support of the open data working group. OD Working group: - Intendencia de Montevideo – www.montevideo.gub.uy - INE – www.ine.gub.uy - AGEV – www.agev.opp.gub.uy - FING – UDELAR – www.fing.edu.uy - D.A.T.A. – www.datauy.org

Elements:

Domains:
- Infrastructure: Most of the datasets are shapefiles.
- Transportation: Shapefiles and CSV, containing information about public transportation (stops and frequency), roads, accidents, etc.
- Tourism: data about regional events, cultural agenda, hotels, campings, statistics.
- Economics: Budget, Consumer price declarations, etc.
- Social development
- Environment
- Health
- Education
- Culture
Obligation/motivation: There is no obligation for the government agencies to publish open data. All initiatives were carried on by agencies that wants to support the initiative.
Usage: Develop applications and new services for citizens, agencies interoperability (exchange of information in open data formats), transparency
Quality: Most of the data is actualized properly, datasets metadata is complete, resources metadata about 70% complete.
Size: Small; most of the datasets size is less than 1Gb.
Type/format: SHAPEFILE (35), CSV (26), TXT (19), ZIP (12), HTML (7), XLS (6),PDF (4), XML (3), RAR (2)
Rate of change: Depends on the dataset.
Data lifespan: Depends on the dataset, some change in real time, other monthly, every 6 month, annual or never change.
Potential audience: Developers, Journalists, Civil society, Entrepreneurs.

Challenges: Consolidate tool to manage datasets, improve visualizations and transform resources to higher level (4 – 5 stars). Automate publication process using harvesting or similar tools. Alerts or control panel to keep data updated.

Requires: VocabReference , DynamicGeneration and AutomaticUpdate

2.11 Use Case #11 - GS1: GS1 Digital

(Contributed by Mark Harrison (University of Cambridge) & Eric Kauz (GS1) )

Retailers and Manufacturers / Brand Owners are beginning to understand that there can be benefits to openly publishing structured data about products and product offerings on the web as Linked Open Data. Some of the initial benefits may be enhanced search listing results (e.g. Google Rich Snippets) that improve the likelihood of consumers choosing such a product or product offer over an alternative product that lacks the enhanced search results. However, the longer term vision is that an ecosystem of new product-related services can be enabled if such data is available. Many of these will be consumer-facing and might be accessed via smartphones and other mobile devices, to help consumers to find the products and product offers that best match their search criteria and personal preferences or needs - and to alert them if a particular product is incompatible with their dietary preferences or other criteria such as ethical / environmental impact considerations - and to suggest an alternative product that may be a more suitable match. A more complete description of this use case is available.

Elements:

Domains:
- Product master data (e.g. technical specifications, ingredients, nutritional information, dimensions, weight, packaging)
- Product offerings (e.g. sales price, availability (online, locally), payment options, delivery/collection options
- Ethical / environmental claims about a product and its production process
Obligation/motivation:
- initially, enhanced search result listings (e.g. Google Rich Snippets)
- vision is to enable an ecosystem of new digital apps around product data
- the food sector in the EU is already obliged under new food labelling legislation (EU 1169 / 2011, Article 14) to provide the same amount of information about a food product that is sold online to consumers as the information that would be available to them from the product packaging if they picked up the product in-store. Although the legislation does not suggest that Linked Open Data technology should be used to make the same information available in a machine-readable format, there is currently significant investment and effort to upgrade websites to provide accurate and detailed information about food products; the GS1 Digital team consider that for a relatively small amount of effort, these companies could gain some tangible benefits (e.g. enhanced search results) from such compliance efforts by using Linked Open Data technology within their web pages.
Usage:
- data providing transparency about product characteristics
- data used to help consumers make informed choices about which products to buy/consume
Quality: Very important to have trustworthy authoritative data from respective organizations
Size: Typically 20+ factual claims per product - probably 40+ RDF triples
Type/format: HTML + RDFa / JSON-LD / Microdata
Rate of change: mostly static data initially - but subject to some variation over time
Data lifespan: data should remain accessible until products are no longer considered to be in circulation; this represents a challenge for deprecated product lines data that is stated authoritatively by one organization might be embedded / referenced in the data asserted by another organization; this raises concerns about whether embedded data becomes stale if it is inadequately synchronized, that referenced data is not dereferenced (and therefore not discovered / gathered) by consumers or the data. From a liability perspective, there also needs to be clarity about which organization asserted which factual information - and also information about which organization has the authority to assert specific factual claims.
Potential audience: machine-readable (search engines, data aggregators, mobile apps etc.)

Challenges:

Linked Open Data about products is likely to be highly distributed in nature and various parties have authority over specific claims
Accreditation agencies have authority over ethical/environmental claims
Brand owners / manufacturers have authority over product master data
Retailers have authority over facts related to product offerings (price, availability etc.)
An organization (e.g. retailer) might embed authoritative data asserted by another organization (e.g. brand owner) and there is the risk that such embedded information becomes stale if it is not continuously synchronized.
An organization (e.g. retailer) might reference a graph of authoritative data that can be retrieved via an HTTP request to a remote HTTP URI. There is a risk that software or search engines consuming Linked Open Data containing such references may fail to dereference such HTTP URIs and in doing so may fail to gather all of the relevant data.
Organizations are currently faced with a choice of whether to embed machine-readable structured data in their web pages using a block approach (e.g. using JSON-LD) or using an inline approach (e.g. using RDFa, RDFa Lite or Microdata). A block approach (JSON-LD) may be simpler and less brittle than inline annotation, especially as it can be easily decoupled from structural changes to the body of the web page that may happen over time in the redesign of a website. At present, tool support for the 3 major markup approaches for embedded Linked Open Data (RDFa, JSON-LD, MIcrodata) is unequal across the three formats and some tools may not export or import / ingest all 3 formats - some tools even fail to extract data from JSON-LD markup created by their corresponding export tool. There are some significant challenges to ensure that the structured data embedded within a web page is correctly linked to form coherent RDF triples, without any dangling nodes that should be connected to the Subject or other nodes.
Only through the provision of best-in-class tool support that recognize all three major formats on a completely equal footing can organizations have any confidence that they can use any of the 3 major markup formats and the ability to verify / validate that their own markup does result in the correct RDF triples.

Potential Requirements:

The ability to determine who asserted various facts - and whether they are the organization that can assert those facts authoritatively.
Where data from other sources is embedded, there is a risk that the embedded data might be stale. It is therefore helpful to indicate which graph of triples is a snapshot in time from data from another source - and to provide a link to the original source, so that the consumer of the data has the opportunity to obtain a fresh version of the live data rather than relying on a potentially stale snapshot graph of data. DWBP could provide guidance about how to indicate which graph of data is a snapshot and where it came from.
Consumers of Linked Open Data about products might rely on it for making decisions - not only about purchase but even consumption. If the data about a product is inaccurate or out-of-date, we might need to provide some guidance about how liability terms and disclaimers can be expressed in Linked Open Data. We’re not suggesting that we define such terms from a legal perspective - but perhaps there is an existing framework in a similar way that there is an existing framework for expressing various licences of the data? If not, perhaps such a framework needs to be developed - but outside of the DWBP group? Licensing generally says what you’re allowed to do with the data - but I don’t think it says anything about liability for using the data or making decisions based on that data. This area probably needs some clarification, particularly if there is a risk of injury or death (due to inaccurate information about allergens in a food product).

Requires: FormatStandardized , FormatMultiple , ProvAvailable , AccessUptodate , LicenseLiability , PersistentIdentification , Citable , AutomaticUpdate and CoreRegister

2.12 Use Case #12 - Tabulae - how to get value out of data

(Contributed by Luis Polo )

Tabul.ae is a framework to publish and visually explore data that can used to deploy powerful and easy-to-exploit open data platforms, so contributing organizations to unleash the potential of their data. The aim is to enable data owners (public organizations) and consumers (citizens and business reusers) to transform the information they manage into added-value knowledge, empowering them to easily create data-centric web applications. These applications are built upon interactive and powerful graphs, and take the shape of interactive charts, dashboards, infographies and reports. Tabulae provides a high degree of assistance to create these apps and also automate several data visualizations tasks (i.e., recognition of geographical entities to automatically generate a map). In addition, the charts and maps are portable outside the platform and can be smartly integrated with any web content, enhancing the reusability of the information.

Elements:

Domains: Quantitative and geographical information: stats, biodiversity, socio-economic indicators, environment, security, etc
Obligation/motivation: to help citizens and companies (especially, consultancy firms) to understand and create value from open data by means of reusable, user-made visualizations.
Usage: Data used by citizens, public employees and companies.
Quality: The information must be at least semi-structured (for instance, an spreadsheet).
Size: Medium and large datasets (hundreds of thousands and millions rows)
Type/format: Tabulae can manage relational databases, GeoJSON, CSV files and spreadsheets, and provides an API for programmatic access.
Rate of change: depending on the original datasets. The platform enables automatic update from original sources.
Data lifespan: depending on the original datasets.
Potential audience: Organizations that want to publish their catalogue of datasets and aim to maximize their impact and consumption.

Challenges:

Quality of data and metadata.
Inconsistency between different data sources.
Wide variety of formats and technologies.
Different data schemas that complicates the integration of data sources.
Diversity and (sometimes) complexity of Licenses.
Data persistence.
Internationalization and format issues (e.g., languages, numbers, dates, etc.)

Potential Requirements:

Dataset versioning and updating mechanisms
Standardization of schemas
Integration with other platforms/services

Requires: FormatStandardized , FormatLocalize , VocabReference , VocabVersion , LicenseStandardized , LicenseInteroperable , ProvAvailable , AutomaticUpdate and QualityCompleteness

2.13 Use Case #13 - Retrato da Violência (Violence Map)

(Contributed by Yasodara )

This is a Data Visualization made in 2012 by Vitor Batista , Léo tartari and Thiago Bueno for a W3C Brazil Office challenge about data from Rio Grande do Sul (a brazilian region). The data was released in a .zip package, the original format was .csv. The code and the documentation of the project are in it's GitHub repository

Elements:

Domains: political information, regional security information.
Obligation/motivation: Data that must be provided to the public under a legal obligation, the called LAI or Brazilian Information Acess Act, edited in 2012
Quality: not guaranteed data
Type/format: Tabular data
Rate of change: There is no new releases of data

Positive Aspects: the decision on transforming CSV in to JSON was based on the necessity to have hierarchical data - the positive point, that CSV structure can be mapped to an XML or JSON was considered. CSV only covers tabular format and JSON can cover more complex structures.

Negative Aspects: the data was in CSV format, but it's now (2014) outdated, and there's no prevision for new releases. There's no metadata in it.

Requires: MetadataAvailable , QualityCompleteness , PersistentIdentification and AutomaticUpdate

2.14 Use Case #14 - Bio2RDF

(Contributed by Carlos Laufer)

Bio2RDF is an open source project that uses Semantic Web technologies to make possible the distributed querying of integrated life sciences data. Since its inception [2], Bio2RDF has made use of the Resource Description Framework (RDF) and the RDF Schema (RDFS) to unify the representation of data obtained from diverse (molecules, enzymes, pathways, diseases, etc.) and heterogeneously formatted biological data (e.g. flat-files, tab-delimited files, SQL, dataset specific formats, XML etc.). Once converted to RDF, this biological data can be queried using the SPARQL Protocol and RDF Query Language (SPARQL), which can be used to federate queries across multiple SPARQL endpoints.

Elements:

Domains:Biological data
Obligation/motivation: Biological researchers are often confronted with the inevitable and unenviable task of having to integrate their experimental results with those of others. This task usually involves a tedious manual search and assimilation of often isolated and diverse collections of life sciences data hosted by multiple independent providers including organizations such as the National Center for Bio-technology Information ( NCBI ) and the European Bioinformatics Institute ( EBI ) ) which provide dozens of user-submitted and curated data, as well as smaller institutions such as the Donaldson group which publishes iRefIndex [3], a database of molecular interactions aggregated from 13 data sources. While these mostly isolated silos of biological information occasionally provide links between their records (e.g. Uni-Prot links its entries to hundreds of other databases ), they are typically serialized in either HTML tags or in flat file data dumps that lack the semantic richness required to serialize the intent of the linkage between data records. With thousands of biological databases and hundreds of thousands if not millions of datasets, the ability to find relevant data is hampered by non-standard database interfaces and an enormous number of haphazard data formats [4]. Moreover, metadata about these biological data providers (dataset source data information, dataset versioning, licensing information, date of creation, etc.) is often difficult to obtain. Taken together, the inability to easily navigate through available data presents an overwhelming barrier to their reuse.
Usage: Biological research
Quality: Provenance Bio2RDF scripts generate provenance records using the W3C Vocabulary of Interlinked Datasets (VoID), the Provenance vocabulary (PROV) and Dublin Core vocabulary. Each data item is linked to a provenance object that indicates the source of the data, the time at which the RDF was generated, licensing (if available from data source provider), the SPARQL endpoint in which the resource can be found, and the downloadable RDF file where the data item is located. Each dataset provenance object has a unique IRI and label based on the dataset name and creation date. The date-specific dataset IRI is linked to a unique dataset IRI using the W3C PROV predicate "wasDerivedFrom" such that one can query the dataset SPARQL endpoint to retrieve all provenance records for datasets created on different dates. Each resource in the dataset is linked the date-unique dataset IRI that is part of the provenance record using the VoID "inDataset" predicate. Other important features of the provenance record include the use of the Dublin Core "creator" term to link a dataset to the script on Github that was used to generate it, the VoID predicate "sparqlEndpoint" to point to the dataset SPARQL endpoint, and VoID predicate "dataDump" to point to the data download URL.
Dataset metrics
1. total number of triples
2. number of unique subjects
3. number of unique predicates
4. number of unique objects
5. number of unique types
6. unique predicate-object links and their frequencies
7. unique predicate-literal links and their frequencies
8. unique subject type-predicate-object type links and their frequencies
9. unique subject type-predicate-literal links and their frequencies
10. total number of references to a namespace
11. total number of inter-namespace references
12. total number of inter-namespace-predicate references
Size:
Nineteen datasets were generated as part of the Bio2RDF 2 release. Several of the datasets are themselves collections of datasets that are now available as one resource. Each dataset has been loaded into a dataset specific SPARQL endpoint using Openlink Virtuoso version 6.1.6. SPARQL endpoints, available at http://[namespace].bio2rdf.org. All updated Bio2RDF linked data and their corresponding Virtuoso DB files are available for download.

Dataset	Namespace	#of triples
Affymetrix	affymetrix	44469611
Biomodels	biomodels	589753
Comparative Tox-icogenomics Data-base	ctd	141845167
DrugBank	drugbank	1121468
NCBI Gene	ncbigene	394026267
Gene Ontology Annotations	goa	80028873
HUGO Gene No-menclature Committee	hgnc	836060
Homologene	homologene	1281881
InterPro	interpro	999031
iProClass	iproclass	211365460
iRefIndex	irefindex	31042135
Medical Subject Headings	mesh	4172230
National Center for Biomedical Ontology	ncbo	15384622
National Drug Code Directory	ndc	17814216
Online Mendelian Inheritance in Man	omim	1848729
Pharmacogenomics Knowledge Base	pharmgkb
SABIO-RK	sabiork	2618288
Saccharomyces Genome Database	sgd	5551009
NCBI Taxonomy	19	17814216
Total	taxon	1010758291

Type/format:RDF
Rate of change:
Data lifespan:
Potential audience: Biological researchers

Challenges:

Lack of human-readable metadata.
Data variability (models, sources, etc.).
RDFizations of Datasets.
Wide variety of formats and technologies.

Potential Requirements:

Dataset versioning and updating mechanisms
Standardization of schemas
Integration with other platforms/services
Data persistence

Requires: Archiving and R-FormatStandardized

2.15 Use Case #15 - Documented Support and Release of Data

(Contributed by Deirdre Lee)

While many cases of Data on the Web may contain meta-data about creation data and last update, the regularity of the release schedule is not always clear. Similarly, how and by whom the dataset is supported should also be made clear in the meta-data. These attributes are necessary to improve the reliability of the data so that third-party users can trust the timely delivery of the data, with a follow-up point should there be any issues.

Challenges:

Describe release schedule in meta-data
Describe support mechanisms in meta-data

Requires: MetadataAvailable , AccessUptodate and SLAAvailable

2.16 Use Case #16 - Feedback Loop for Corrections

(Contributed by Deirdre Lee (based on Pieter Colpaert's paper 'Route planning using Linked Open Data') )

One of the advantages of publishing Open Data is often quoted as improving the quality of the data. Many eyes looking at a dataset helps spot errors and holes quicker than a public body may identify this themselves. For example, in his paper 'Route planning using Linked Open Data' Colpaert looks at how feedback can be incorporated into transport data to improve its data quality. How can this 'improved' data be fed back into the public body,processed an incorporated into the original dataset. Should there be an automated mechanism for this? How can the improvement be described in a machine readable format? What is best practice for reincorporating such improvements?

Technical Challenges:

Should there be an automated mechanism for this?
How can the improvement be described in a machine readable format?
What is best practice for reincorporating such improvements?

Requires: QualityOpinions and IncorporateFeedback

2.17 Use Case #17 - Datasets required for Natural Disaster Management

(Contributed by Deirdre Lee (based on OKF Greece workshop) )

Many of the datasets that are required for Natural Disaster Management, for example critical infrastructure, utility services, road networks, are not available online as they are also deemed to be datasets that could be used for homeland security attacks.

Requires: SensitiveSecurity

2.18 Use Case #18 - OKFN Tranport WG

(Contributed by Deirdre Lee (based on 2012 ePSI Open Transport Data Manifesto) )

The Context: Transportation is an important contemporary issue, which has a direct impact on economic strength, environmental sustainability and social equity. Accordingly, transport data – largely produced or gathered by public sector organisations or semi-private entites, quite often locally – represents one of the most valuable sources of public sector information (PSI, also called ‘Open Data’), a key policy area for many, including the European Commission.

The Challenge: Combined with the advancement of Web 2.0 technologies and the increasing use of smart phones, the demand for high quality machine-readable and openly licensed transport data, allowing for reuse in commercial and non-commercial products and services, is rising rapidly. Unfortunately this demand is not met by current supply: many transport data producers and holders (from the public and private sectors) have not managed to respond adequately to these new challenges set by society and technology.

So what do we need?

Access to any transport data of any operator, of high quality, in real time, against free or at least fair standard conditions.
An inclusive infrastructure, based on common open, non-discriminatory and interoperable standards and APIs, to which operators, service providers, developers and users can connect
An ecosystem wherein universal access and re-usabiliy of transport data is the rule, not the exception

Why is this not happening?

Data that is necessary for integrated personal transportation solutions is rich and encompasses several domains (geospatial data, environmental data, private service provider data), involving a wide array of data holders from the public and private sectors. Because of its very nature, transport data is often held locally.
Legacies create lock-ins that prevent adoption of open standards and hamper interoperability.
Many operators and incumbent service providers, in particular those relying on income from sales of data, still regard selective and exclusive access to transport data as a competitive advantage, restricting access and re-use through the exercise of intellectual property rights.
Perceived liability risks, often associated with data quality issues, prevent operators from opening up their data.
Significant differences between countries, regions and transport modalities in terms of level of development, market maturity and associated business models prevent a ‘one size fits all’ solution.
A lack of leadership in the value chain –either by the industry or from the authorities (whatever the level) –limits governance capabilities as to establishment of access, accessibility and other framework conditions, creating a need for a subtle mix of mostly bottom-up instruments and a dash of top-down measures.
Existing market players with associated interests turn governmental actions into a delicate matter, in particular as to the question of where the role of the government should start and end within the value chain and where the market parties should take over and become the driving factor.
Where market parties need to step in, the lack of a clear and predictable environment prevents businesses from establishing a long-term perspective, whereby fair competition needs to be safeguarded

Requires: AccessBulk, FormatOpen, VocabOpen, QualityMetrics, FormatLocalize,and LicenseLiability

2.19 Use Case #19 - Tracking of data usage

(Contributed by Deirdre Lee)

There are many potential/perceived benefits of Open Data, however in order to publish data, some initial investment/resources are required by public bodies. When justifying these resources and evaluating the impact of the investment, many Open Data providers express the desire to be able to track how the datasets are being used. However Open Data by design often requires no registration, explanation or feedback to enable the access to and usage of the data. How can data usage be tracked in order to inform the Open Data ecosystem and improve data provision?

Challenges:

No registration required by data user
automatic vs. manual solution
solution should not break basic Open Data principles
Most developers may not mind giving feedback if it will improve quality of data/service

Requires: TrackDataUsage

2.20 Use Case #20 - Open City Data Pipeline

(Contributed by Deirdre Lee)

The Open City Data Pipeline aims to to provide an extensible platform to support citizens and city administrators by providing city key performance indicators (KPIs),leveraging Open Data sources. The assumption of Open Data is the “Added value comes from comparable Open datasets being combined”. Open Data needs stronger standards to be useful, in particular for industrial uptake. Industrial usage has different requirements than app hobbyist or civil society, it's important to think how Open Data can be used by industry at time of publication. They have developed a data pipeline to:

(semi-)automatically collect and integrate various Open Data Sources in different formats
compose and calculate complex city KPIs from the collected data

Current Data Summary

Ca. 475 different indicators
Categories: Demography, Geography, Social Aspects, Economy, Environment, etc.
from 32 sources (html, CSV, RDF, ...)
Wikipedia, urbanaudit.org, Statistics from City homepages, country Statistics, iea.org
Covering 350+cities in 28 European countries
District Data for selected cities (Vienna, Berlin)
Mostly snapshots, Partially covering timelines
On average ca. 285 facts per city.

Base assumption (for our use case): Added value comes from comparable Open datasets being combined Challenges & Lessons Learnt:

Incomplete Data: can be partially overcome
- By ontological reasoning (RDF & OWL), by aggregation, or by rules & equations, e.g. :populationDensity = :population / :area , cf. [ESWC2013]
- By statistical methods or Multi-dimensional Matrix Decomposition (unfortunately only partially successful, because these algorithms assume normally-distributed data.)
Incomparable Data:
- dbpedia:populationTotal
- dbpedia:populationCensus
Heterogeneity across Open Government Data efforts:
- Different Indicators, Different Temporal and Spatial Granularity
- Different Licenses of Open Data: e.g. CC-BY, country specific licences, etc.
- Heterogeneous Formats (CSV != CSV) ... Maybe the W3C CSV on the Web WG will solve this issue)

Challenges:

Incomplete data (can be overcome using semantic technologies and/or statistical methods)
Heterogeneity (indicators, licenses, formats)
Open Data needs stronger standards to be useful (in particular for industrial uptake), at a metadata level, and dataset level.
Metadata is not always uniform, not only titles of columns, but standardisation about units, etc.

Requires: FormatStandardized , LicenseInteroperable , IndustryReuse , QualityCompleteness and QualityComparable

2.21 Use Case #21 - Machine-readability and Interoperability of Licenses

(Contributed by Deirdre Lee, based on post by Leigh Dodds)

There are many different licenses available under which data on the web can be published, e.g. Creative Commons, Open Data Commons, national licenses, etc. It is important that the license is available in a machine-readable format. Leigh Dodds has done some work towards this with the Open Data Rights Statement Vocabulary http://schema.theodi.org/odrs/ http://theodi.org/guides/publishers-guide-to-the-open-data-rights-statement-vocabulary http://theodi.org/guides/odrs-reusers-guide Another issue is when data under different licenses are combined, the license terms under which the data is available also have to be merged. This interoperability of licenses is a challenge [may be out of scope of W3C DWBP, as it is more concerned with legal issues]

Challenges:

standard vocabulary for data licenses
machine-readability of data licenses
interoperability of data licenses

Requires: LicenseAvailable , LicenseMachineRead , LicenseStandardized and LicenseInteroperable

2.22 Use Case #22 - Machine-readability of SLAs

Issue 18

Does the WG have the capacity to deliver this? It's a potentially huge piece of work.

(Contributed by Deirdre Lee (based on a number of talks at EDF14) )

A main focus of publishing data on the web is to facilitate industry resuse for commercial purposes. In order for a commercial body to reuse data on the web, the terms of reuse must be clear. The legal terms of reuse are included in the license, but there are other factors that are important for commercial reuse, e.g. reliabiliy, support, incidient recovery, etc. These could be included in an SLA. Is there a standardized, machine-readable approach to SLAs?

Challenges:

Defining common SLA requrirements for industry reuse
Existing standards/vocabularies for SLA requirements
Machine-readable access to SLAs

Requires:SLAAvailable, SLAMachineRead and SLAStandardized

2.23 Use Case #23 - Publication of Data via APIs

(Contributed by Deirdre Lee)

APIs are commonly used to publish data in formats designed for machine-consumption, as opposed to the corresponding HTML pages whose main aim is to deliver content suitable for human-consumption. There remains questions around how APIs can best be designed to publish data, and even if APIs are the most suitable way for publishing data at all . Could use of HTTP and URIs be sufficient? If the goal is to facilitate machine-readable data, what is best-practice?

Challenges:

APIs can be too clunky/rich in their functionality, which may increase the amount of calls necessary and size of data transferred, reducing performance
Collaboration between API providers and users is necessary to agree on 'useful' calls
API key agreements could restrict Openess of Open Data?
Documentation accompanying APIs can be lacking
What is best practice for publishing streams of real-time data (with/without APIs)?
Each resource should have one URI uniquly identifying it. There can then be different representations of the resource (xml/html/json/rdf)

Requires: AccessBulk and AccessRealTime

Challenge	Requirements
Metadata	Requirements for Metadata
Data Granularity	Requirements for Data Granularity
Data Formats	Requirements for Data Formats
Data Vocabularies	Requirements for Data Vocabularies
Licenses	Requirements for Licenses
Provenance	Requirements for Provenance
Data Selection	Requirements for Data Selection
Data Access	Requirements for Data Access
Sensitive Data	Requirements for Sensitive Data
Data Identification	Requirements for Data Identification
Data Publication	Requirements for Data Publication
Industry-reuse	Requirements for Industry reuse
Persistence	Requirements for Persistence
Data Quality	Requirements for Data Quality
Data Usage	Requirements for Data Usage

4. Requirements

4.1 Requirements for Data on the Web Best Practices

4.1.1 Requirements for Metadata

R-MetadataAvailable

Metadata should be available

Motivation: DocumentedSupportandRelease, BuildingEye, LATimesReporting and ViolenceMap

R-MetadataMachineRead

Metadata should be machine-readable

Motivation: RecifeOpenDataPortal, Bio2RDF and TheLandPortal

R-MetadataStandardized

Metadata should be standardized

Motivation: RecifeOpenDataPortal, ISOGEOStory and LATimesReporting

R-MetadataDocum

Metadata vocabulary, or values if vocabulary is not standardized, should be well-documented

Motivation: RecifeOpenDataPortal

R-MetadataInteroperable

Metadata should be interoperable

Motivation: ISOGEOStory

4.1.2 Requirements for Data Granularity

R-GranularityLevels

Data available at different levels of granularity should be accessible and modelled in a common way

Motivation: ISOGEOStory and TheLandPortal

4.1.3 Requirements for Data Formats

R-FormatMachineRead

Data should be availabe in a machine-readable format

Motivation: BuildingEye and TheLandPortal

R-FormatStandardized

Data should be availabe in a standardized format

Motivation: OpenCityDataPipeline , WindCharacterizationScientificStudy , BuildingEye , TheLandPortal , GS1 Digital and Tabulae ,

R-FormatOpen

Data should be availabe in an Open format

Motivation: BuildingEye ,

R-FormatMultiple

Data should be availabe in multiple formats

Motivation: GS1 Digital

R-FormatLocalize

It should be possible to localize data on the Web

Motivation: TheLandPortal and Tabulae

4.1.4 Requirements for Data Vocabularies

R-VocabReference

Existing reference vocabularies should be reused where possible

Motivation: OpenCityDataPipeline , RecifeOpenDataPortal , DadosGovBr , ISOGEOStory , DutchBasicRegisters , DigitalArchivingofLinkedData , TheLandPortal , UruguayOpenDataCatalogue and Tabulae

R-VocabDocum

Vocabularies should be clearly documented

Motivation: RecifeOpenDataPortal

R-VocabOpen

Vocabularies should be shared in an Open way

Motivation: RecifeOpenDataPortal and WindCharacterizationScientificStudy

R-VocabVersion

Vocabularies should include versioning information

Motivation: TheLandPortal and Tabulae

4.1.5 Requirements for Licenses

R-LicenseAvailable

Data should be associated with a license

Motivation: MachineReadabilityandInteroperabilityofLicenses , DadosGovBr and BuildingEye

R-LicenseMachineRead

Data licenses should be provided in a machine-readable format

Motivation: MachineReadabilityandInteroperabilityofLicenses

R-LicenseStandardized

Standard vocabularies should be used to describe licenses

Motivation: MachineReadabilityandInteroperabilityofLicenses , DadosGovBr , TheLandPortal and Tabulae

R-LicenseInteroperable

Data licenses should be interoperable

Motivation: OpenCityDataPipeline , MachineReadabilityandInteroperabilityofLicenses , TheLandPortal and Tabulae

R-LicenseLiability

Liability terms associated with usage of Data on the Web should be clearly outlined

Motivation: GS1 Digital

4.1.6 Requirements for Provenance

R-ProvAvailable

Data provenance information should be available

Motivation: TheLandPortal , GS1 Digital and Tabulae

4.1.7 Requirements for Data Selection

R-SelectHighValue

Datasets selected for publication should be of high-value

Motivation: RecifeOpenDataPortal

R-SelectDemand

Datasets selected for publication should be in demand by potential users

Motivation: RecifeOpenDataPortal

4.1.8 Requirements for Data Access

R-AccessBulk

Data should be available for bulk download

Motivation: PublicationofDataviaAPIs , BuildingEye and TheLandPortal

R-AccessRealTime

Where data is produced in real-time, it should be available on the Web in real-time

Motivation: PublicationofDataviaAPIs , WindCharacterizationScientificStudy and TheLandPortal

R-AccessUptodate

Data should be available in an up-to-date manner

Motivation: DocumentedSupportandRelease and GS1 Digital

4.1.9 Requirements for Sensitive Data

R-SensitivePrivacy

Data should not infringe on a person's right to privacy

Motivation: DutchBasicRegisters

R-SensitiveSecurity

Data should not infringe on national security

Motivation: DatasetsforNaturalDisasterManagement

4.1.10 Requirements for Data Identification

R-UniqueIdentifier

Each data resource should be associated with a unique identifier

Motivation: DutchBasicRegisters , DigitalArchivingofLinkedData , LATimesReporting and UruguayOpenDataCatalogue

R-MultipleRepresentations

A data resource may have multiple representations, e.g. xml/html/json/rdf

Motivation: DutchBasicRegisters

4.1.11 Requirements for Data Publication

R-DynamicGeneration

Dynamic generation of Data on the Web from non-Web data resources

Motivation: RecifeOpenDataPortal and UruguayOpenDataCatalogue

R-AutomaticUpdate

Automatic update of Data on the Web when original data source is updated

Motivation: RecifeOpenDataPortal , UruguayOpenDataCatalogue , GS1 Digital , Tabulae , ViolenceMap

R-CoreRegister

Core registers should be accessible

Motivation: DutchBasicRegisters and GS1 Digital

4.1.12 Requirements for Industry Reuse

R-IndustryReuse

Data should be suitable for industry reuse

Motivation: OpenCityDataPipeline

R-SLAAvailable

Service Level Agreements (SLAs) for industry reuse of the data should be available if requested

Motivation: DocumentedSupportandRelease and MachineReadabilityofSLAs

R-SLAMachineRead

SLAs should be provided in a machine-readable format

Motivation: MachineReadabilityofSLAs

R-SLAStandardized

Standard vocabularies should be used to describe SLAs

Motivation: MachineReadabilityofSLAs

R-PotentialRevenue

Potential revenue streams from data should be described

Motivation: DutchBasicRegisters

4.1.13 Requirements for Persistence

R-PersistentIdentification

Data should be persistently identifiable

Motivation: DigitalArchivingofLinkedData , TheLandPortal , GS1 Digital and ViolenceMap

R-Archiving

It should be possible to archive data

Motivation: DigitalArchivingofLinkedData

4.2 Requirements for Quality and Granularity Description Vocabulary

4.2.1 Requirements for Data Quality

R-QualityCompleteness

Data should be complete

Motivation: OpenCityDataPipeline , RecifeOpenDataPortal , TheLandPortal , Tabulae and ViolenceMap

R-QualityComparable

Data should be comparable with other datasets

Motivation: OpenCityDataPipeline

R-QualityMetrics

Data should be associated with a set of standardized, objective quality metrics

Motivation: TheLandPortal

R-QualityOpinions

Subjective quality opinions on the data should be supported

Motivation: FeedbackLoopforCorrections and DadosGovBr

4.3 Requirements for Data Usage Description Vocabulary

4.3.1 Requirements for Data Usage

R-TrackDataUsage

It should be possible to track the usage of data

Motivation: TrackingofDataUsage

R-IncorporateFeedback

It should be possible to incorporate feedback on the data

Motivation: FeedbackLoopforCorrections ,

R-Citable

It should be possible to cite data on the Web

Motivation: LATimesReporting and GS1 Digital

Abstract

Status of This Document

Table of Contents

1. Introduction

2. Use Cases

2.1 Use Case #1 - BuildingEye: SME use of public data

2.2 Use Case #2 - The Land Portal

2.3 Use Case #3 - Recife Open Data Portal

2.4 Use Case #4 - Dados.gov.br

2.5 Use Case #5 - ISO GEO Story

2.6 Use Case #6 - Dutch basic registers

2.7 Use Case #7 - Wind Characterization Scientific Study

2.8 Use Case #8 - Digital archiving of Linked Data

2.9 Use Case #9 - LA Times' reporting of Ron Galperin's Infographic

2.10 Use Case #10 - Uruguay: open data catalogue

2.11 Use Case #11 - GS1: GS1 Digital

2.12 Use Case #12 - Tabulae - how to get value out of data

2.13 Use Case #13 - Retrato da Violência (Violence Map)

2.14 Use Case #14 - Bio2RDF

2.15 Use Case #15 - Documented Support and Release of Data

2.16 Use Case #16 - Feedback Loop for Corrections

2.17 Use Case #17 - Datasets required for Natural Disaster Management

2.18 Use Case #18 - OKFN Tranport WG

2.19 Use Case #19 - Tracking of data usage

2.20 Use Case #20 - Open City Data Pipeline

2.21 Use Case #21 - Machine-readability and Interoperability of Licenses

2.22 Use Case #22 - Machine-readability of SLAs

2.23 Use Case #23 - Publication of Data via APIs

3. General Challenges

4. Requirements

4.1 Requirements for Data on the Web Best Practices

4.1.1 Requirements for Metadata

4.1.2 Requirements for Data Granularity

4.1.3 Requirements for Data Formats

4.1.4 Requirements for Data Vocabularies

4.1.5 Requirements for Licenses

4.1.6 Requirements for Provenance

4.1.7 Requirements for Data Selection

4.1.8 Requirements for Data Access

4.1.9 Requirements for Sensitive Data

4.1.10 Requirements for Data Identification

4.1.11 Requirements for Data Publication

4.1.12 Requirements for Industry Reuse

4.1.13 Requirements for Persistence

4.2 Requirements for Quality and Granularity Description Vocabulary

4.2.1 Requirements for Data Quality

4.3 Requirements for Data Usage Description Vocabulary

4.3.1 Requirements for Data Usage

5. Reading Material

5.1 General Resources

5.2 Relevant Vocabularies

5.3 Communities of Interest

A. Acknowledgements

B. Change history