Use Cases

From Data on the Web Best Practices

Use Case Notes

A use case describes a scenario that illustrates an experience of publishing and using Data on the Web. Those descriptions may be related to one or more tasks or activities from the Data on the Web Life Cycle.

The information gathered from the use cases should help identify the best practices that will guide the publishing and usage of Data on the Web. In general, a best practice will be described by at least a statement and a "how to do it" section, i.e., a discussion of techniques and suggestions on how to implement it (similar to http://www.w3.org/TR/mobile-bp/).

To help identify possible best practices, it is desirable that a use case presents information about positive aspects/benefits (similar to http://www.w3.org/TR/vocab-data-cube-use-cases/) as well as negative aspects of the experience. These aspects may also be seen as lessons learned, and will help in identifying statements of best practices as well as suggestions on how to implement (or not) a given best practice.

Another important piece of information to include in a use case description concerns the main challenges faced by publishers or developers. Information about challenges helps to identify areas where best practices are necessary. From the challenges, a set of requirements may be defined, where a requirement motivates the creation of one or more best practices.

The general description of a use case is given by:

  • Title:
  • Contributor:
  • Overview:
  • Detailed description: (Each element described in more detail at Use-Case Elements)
    • Domains:
    • Obligation/motivation:
    • Usage:
    • Quality:
    • Size:
    • Type/format:
    • Rate of change:
    • Data lifespan:
    • Potential audience:
  • Positive aspects:
  • Negative aspects:
  • Challenges:
  • Potential Requirements:

To illustrate, consider the following example:

Recife Open Data Portal

Contributor: Bernadette Lóscio

Overview: Recife is a beautiful city in the Northeast of Brazil, famous for being one of Brazil's biggest tech hubs. Recife is also one of the first Brazilian cities to release data generated by public sector organisations for public use as Open Data. The Open Data Portal Recife was created to offer access to a repository of governmental machine-readable data about several domains, including finances, health, education and tourism. Data is available in CSV and GeoJSON formats, and every dataset has a metadata description, i.e. a description of the data that helps in its understanding and usage. However, the metadata is not described using standard vocabularies or taxonomies. In general, data is created in a static way: data from relational databases is exported in CSV format and then published in the data catalog. Currently, they are working to generate data dynamically from the contents of the relational databases, so that data will be available as soon as it is created. The main phases of the development of this initiative were: educating people with appropriate knowledge concerning Open Data; identifying the sources of data that potential consumers could find useful; extracting and transforming data from the original data sources to the open data format; configuring and installing the open data catalogue tool; and publishing the data and releasing the portal.

Positive aspects:

  • All datasets are published together with a metadata description. Metadata is described in a very clear way, which facilitates the understanding of the published data.
  • Depending on its update frequency, some data may be automatically updated every day.

Negative aspects:

  • Metadata is not described in a machine processable format.
  • Data is provided in just one format (csv).

Challenges:

  • Use common vocabs to facilitate data integration
  • How to keep different versions of the same dataset?
  • How to define the "granularity" of the data being published?
  • How to provide information about the quality of the data?
  • How to measure the quality of the data?


The following lesson and requirements may be extracted from this use case:

Lessons:

  • When publishing a dataset, provide metadata in a machine processable format.

Requirements:

  • How to keep different versions of the same dataset?
  • How to define the "granularity" of the data being published?
  • How to provide information about the quality of the data?
  • Which vocabs should be used to improve data integration?

Data on the Web Life Cycle

Use Cases

To add a new use-case, copy the use-case template and complete all of the sections. Use-case elements are optional, depending on information available. If you want to add a challenge or requirement to somebody else's use-case, please add your name in brackets after your update.

Use Case Template: Please Copy

Contributor:

Overview:

Elements: (Each element described in more detail at Use-Case Elements )

  • Domains:
  • Obligation/motivation:
  • Usage:
  • Quality:
  • Size:
  • Type/format:
  • Rate of change:
  • Data lifespan:
  • Potential audience:

Positive aspects:

Negative aspects:

Challenges:

Potential Requirements:


Documented Support and Release of Data

Contributor: Deirdre Lee (based on email by Leigh Dodds)

Overview: While many datasets on the Web may contain metadata about creation date and last update, the regularity of the release schedule is not always clear. Similarly, how and by whom the dataset is supported should also be made clear in the metadata. These attributes are necessary to improve the reliability of the data, so that third-party users can trust its timely delivery and have a follow-up point should there be any issues.

Elements: (Each element described in more detail at Use-Case Elements )

  • Domains:
  • Obligation/motivation:
  • Usage:
  • Quality:
  • Size:
  • Type/format:
  • Rate of change:
  • Data lifespan:
  • Potential audience:

Technical Challenges:

  • Describe release schedule in metadata
  • Describe support mechanisms in metadata

Potential Requirements:

  • Propose use of the DCAT properties dct:accrualPeriodicity and dcat:contactPoint
  • Potentially extend dcat?
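The two DCAT/DCTERMS properties proposed above can be sketched in a JSON-LD-style description. This is a minimal sketch, not a definitive profile; the dataset URI and contact details are invented for illustration:

```python
import json

# Sketch of a DCAT dataset description carrying a release schedule and a
# support contact. Property names follow DCAT/DCTERMS; the dataset URI and
# contact details are made up for illustration.
dataset = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
        "vcard": "http://www.w3.org/2006/vcard/ns#",
    },
    "@id": "http://example.org/dataset/bus-stops",
    "@type": "dcat:Dataset",
    # dct:accrualPeriodicity states how often the dataset is updated
    "dct:accrualPeriodicity": "http://purl.org/cld/freq/monthly",
    # dcat:contactPoint names who supports the dataset
    "dcat:contactPoint": {
        "@type": "vcard:Kind",
        "vcard:fn": "Open Data Support Team",
        "vcard:hasEmail": "mailto:opendata@example.org",
    },
}

print(json.dumps(dataset, indent=2))
```

A consumer could then read the release schedule and support contact mechanically, without scraping prose documentation.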

Feedback Loop for Corrections

Contributor: Deirdre Lee (based on email by Leigh Dodds and OKF Greece workshop)

Overview: One of the often-quoted advantages of publishing Open Data is improved data quality: many eyes looking at a dataset spot errors and holes quicker than a public body may itself. For example, when bus-stop data is published, it may turn out that the official location of a bus-stop is not always accurate; when the data is mashed up with OSM, the mistake is identified. However, how this 'improved' data is fed back to the public body is not clear. Should there be an automated mechanism for this? How can the improvement be described in a machine-readable format? What is best practice for reincorporating such improvements?

Elements: (Each element described in more detail at Use-Case Elements )

  • Domains:
  • Obligation/motivation:
  • Usage:
  • Quality:
  • Size:
  • Type/format:
  • Rate of change:
  • Data lifespan:
  • Potential audience:

Technical Challenges:

  • Should there be an automated mechanism for this?
  • How can the improvement be described in a machine readable format?
  • What is best practice for reincorporating such improvements?
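One conceivable shape for such a machine-readable correction is sketched below. This is purely illustrative, not an existing standard: the field names, the bus-stop record and the audit convention are all assumptions:

```python
# Sketch of a machine-readable correction as a plain dict: it identifies
# the dataset, record and field concerned, the proposed value, and its
# provenance. All field names are illustrative, not a standard format.
def apply_correction(record, correction):
    """Return a copy of `record` with the corrected field applied,
    keeping the original value for audit purposes."""
    fixed = dict(record)
    field = correction["field"]
    fixed["_previous_" + field] = record[field]
    fixed[field] = correction["proposed_value"]
    return fixed

bus_stop = {"id": "stop-42", "lat": 53.3498, "lon": -6.2603}
correction = {
    "dataset": "http://example.org/dataset/bus-stops",
    "record": "stop-42",
    "field": "lat",
    "proposed_value": 53.3501,          # location observed on the ground
    "source": "OSM mash-up comparison", # provenance of the fix
}

print(apply_correction(bus_stop, correction))
```

Keeping the previous value alongside the fix lets the public body review and accept or reject the change before republishing.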

Potential Requirements:


Datasets required for Natural Disaster Management

Contributor: Deirdre Lee (based on OKF Greece workshop)

Overview: Many of the datasets required for natural disaster management, for example critical infrastructure, utility services and road networks, are not available online because they are also deemed to be datasets that could be exploited in attacks on homeland security. (will expand on this use-case once slides are available)

Elements: (Each element described in more detail at Use-Case Elements )

  • Domains:
  • Obligation/motivation:
  • Usage:
  • Quality:
  • Size:
  • Type/format:
  • Rate of change:
  • Data lifespan:
  • Potential audience:

Technical Challenges:

Potential Requirements:

OKFN Transport WG

Contributor: Deirdre Lee (based on OKF Greece workshop)

Overview: The OKFN Transport WG have identified the following shortcomings with transport data on the web... (will expand on this use-case once slides are available)

Elements: (Each element described in more detail at Use-Case Elements )

  • Domains:
  • Obligation/motivation:
  • Usage:
  • Quality:
  • Size:
  • Type/format:
  • Rate of change:
  • Data lifespan:
  • Potential audience:

Technical Challenges:

Potential Requirements:

Tracking of data usage

Contributor: Deirdre Lee

Overview: There are many potential/perceived benefits of Open Data; however, in order to publish data, some initial investment/resources are required from public bodies. When justifying these resources and evaluating the impact of the investment, many Open Data providers express the desire to be able to track how the datasets are being used. However, Open Data by design often requires no registration, explanation or feedback to enable access to and usage of the data. How can data usage be tracked in order to inform the Open Data ecosystem and improve data provision?

Elements: (Each element described in more detail at Use-Case Elements )

  • Domains: all
  • Obligation/motivation: improve Open Data ecosystem through feedback
  • Usage:
  • Quality:
  • Size:
  • Type/format:
  • Rate of change:
  • Data lifespan:
  • Potential audience:

Technical Challenges:

  • No registration required by data users
  • Automatic vs. manual solutions
  • The solution should not break basic Open Data principles
  • Most developers may not mind giving feedback if it will improve the quality of the data/service
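One registration-free approach compatible with the constraints above is to aggregate ordinary web-server access logs. A minimal sketch, with invented log lines and dataset paths:

```python
from collections import Counter

# Sketch: derive usage statistics from ordinary web-server access logs,
# so no registration is required of data users. Log lines and dataset
# paths are made up for illustration; real logs would be parsed the
# same way from the server's log format.
ACCESS_LOG = """\
203.0.113.5 GET /dataset/bus-stops.csv 200
203.0.113.9 GET /dataset/planning.csv 200
198.51.100.2 GET /dataset/bus-stops.csv 200
"""

def downloads_per_dataset(log_text):
    """Count successful GET requests per dataset path."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[1] == "GET" and parts[3] == "200":
            counts[parts[2]] += 1
    return counts

print(downloads_per_dataset(ACCESS_LOG))
```

Counts like these say nothing about who reuses the data or how, so they complement rather than replace the voluntary feedback mentioned above.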

Potential Requirements:

Open City Data Pipeline

Contributor: Deirdre Lee (based on presentation by Axel Polleres at EDF14) http://ai.wu.ac.at/~polleres/presentations/20140319CityDataPipeline_EDF2014_Polleres.pdf

Overview: Axel presented the Open City Data Pipeline, which aims to provide an extensible platform to support citizens and city administrators by providing city key performance indicators (KPIs), leveraging Open Data sources.

The underlying assumption is that "added value comes from comparable Open datasets being combined". Axel highlighted that Open Data needs stronger standards to be useful, in particular for industrial uptake. Industrial usage has different requirements than app hobbyists or civil society; it is important to think about how Open Data can be used by industry at the time of publication.

They have developed a data pipeline to

  1. (semi-)automatically collect and integrate various Open Data Sources in different formats
  2. compose and calculate complex city KPIs from the collected data

Current Data Summary

  • Ca. 475 different indicators
  • Categories: Demography, Geography, Social Aspects, Economy, Environment, etc.
  • from 32 sources (html, CSV, RDF, ...)
  • Wikipedia, urbanaudit.org, Statistics from City homepages, country Statistics, iea.org
  • Covering 350+ cities in 28 European countries
  • District Data for selected cities (Vienna, Berlin)
  • Mostly snapshots, Partially covering timelines
  • On average ca. 285 facts per city.

Base assumption (for our use case): added value comes from comparable Open datasets being combined.

Challenges & Lessons Learnt:

  • Incomplete Data: can be partially overcome
    • by ontological reasoning (RDF & OWL), by aggregation, or by rules & equations, e.g. :populationDensity = :population / :area, cf. [ESWC2013]
    • by statistical methods or multi-dimensional matrix decomposition (unfortunately only partially successful, because these algorithms assume normally-distributed data)
  • Incomparable Data: e.g. dbpedia:populationTotal vs. dbpedia:populationCensus
  • Heterogeneity across Open Government Data efforts:
    • different indicators, different temporal and spatial granularity
    • different licenses of Open Data: e.g. CC-BY, country-specific licences, etc.
    • heterogeneous formats (CSV != CSV) ... maybe the W3C CSV on the Web WG will solve this issue
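The rule-based completion of incomplete data, e.g. :populationDensity = :population / :area, can be sketched as follows. The city figures are illustrative, not the pipeline's real data:

```python
# Sketch of rule-based completion: derive populationDensity = population / area
# wherever both inputs are known, leaving other cities incomplete.
# City figures below are rough illustrative values, not pipeline data.
cities = {
    "Vienna": {"population": 1_900_000, "area": 414.6},
    "Berlin": {"population": 3_600_000, "area": 891.7},
    "Ghent":  {"population": 260_000},   # area missing: cannot derive
}

def derive_density(records):
    """Apply the density equation to every city where it is applicable."""
    for facts in records.values():
        if "population" in facts and "area" in facts:
            facts["populationDensity"] = facts["population"] / facts["area"]
    return records

derive_density(cities)
print({name: facts.get("populationDensity") for name, facts in cities.items()})
```

This mirrors the "rules & equations" approach above: a derived indicator is only materialised where its inputs exist, which is why it can only partially overcome incompleteness.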

Open Data needs stronger standards to be useful.

[ESWC2013] Stefan Bischof and Axel Polleres. RDFS with attribute equations via SPARQL rewriting. In Proc. of the 10th ESWC, vol. 7882 of Lecture Notes in Computer Science (LNCS), p. 335-350, May 2013. Springer.


Elements: (Each element described in more detail at Use-Case Elements )

  • Domains:
  • Obligation/motivation:
  • Usage:
  • Quality:
  • Size:
  • Type/format:
  • Rate of change:
  • Data lifespan:
  • Potential audience:

Technical Challenges:

  • Incomplete data (can be overcome using semantic technologies and/or statistical methods)
  • Incomparable data
  • Heterogeneity (indicators, licenses, formats)
  • Open Data needs stronger standards to be useful (in particular for industrial uptake), at both the metadata level and the dataset level
  • Metadata is not always uniform: not only the titles of columns, but also the standardisation of units, etc.

Potential Requirements:

Machine-readability and Interoperability of Licenses

Contributor: Deirdre Lee, based on post by Leigh Dodds

Overview: There are many different licenses available under which data on the web can be published, e.g. Creative Commons, Open Data Commons, national licenses, etc. (http://opendefinition.org/licenses/). It is important that the license is available in a machine-readable format. Leigh Dodds has done some work towards this with the Open Data Rights Statement Vocabulary: http://schema.theodi.org/odrs/, http://theodi.org/guides/publishers-guide-to-the-open-data-rights-statement-vocabulary, http://theodi.org/guides/odrs-reusers-guide

Another issue arises when data under different licenses is combined: the license terms under which the combined data is available also have to be merged. This interoperability of licenses is a challenge. [may be out of scope of W3C DWBP, as it is more concerned with legal issues]
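A machine-readable rights statement in the ODRS style can be sketched as JSON. The property names follow the ODRS vocabulary linked above; the license URIs and attribution text are invented for illustration:

```python
import json

# Sketch of an ODRS-style rights statement. Property names follow the
# Open Data Rights Statement Vocabulary (http://schema.theodi.org/odrs/);
# the license URIs and attribution details are illustrative.
rights = {
    "@context": {"odrs": "http://schema.theodi.org/odrs#"},
    "@type": "odrs:RightsStatement",
    # separate licenses for the data and for its content (e.g. text fields)
    "odrs:dataLicense": "http://creativecommons.org/licenses/by/4.0/",
    "odrs:contentLicense": "http://creativecommons.org/licenses/by/4.0/",
    # how reusers should attribute the publisher
    "odrs:attributionText": "Example City Council",
    "odrs:attributionURL": "http://example.org/opendata",
}

print(json.dumps(rights, indent=2))
```

With the license expressed as a URI rather than prose, a consumer combining several datasets can at least detect mechanically when their license terms differ, even if deciding compatibility remains a legal question.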

Elements: (Each element described in more detail at Use-Case Elements )

  • Domains:
  • Obligation/motivation:
  • Usage:
  • Quality:
  • Size:
  • Type/format:
  • Rate of change:
  • Data lifespan:
  • Potential audience:

Technical Challenges:

  • standard vocabulary for data licenses
  • machine-readability of data licenses
  • interoperability of data licenses

Potential Requirements:

Machine-readability of SLAs

Contributor: Deirdre Lee (based on a number of talks at EDF14)

Overview: A main focus of publishing data on the web is to facilitate industry reuse for commercial purposes. In order for a commercial body to reuse data on the web, the terms of reuse must be clear. The legal terms of reuse are included in the license, but there are other factors that are important for commercial reuse, e.g. reliability, support, incident recovery, etc. These could be included in an SLA. Is there a standardised, machine-readable approach to SLAs?
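Absent a standard, the kind of terms an SLA would carry can be sketched as a plain data structure. Everything below is hypothetical, purely to show what "machine-readable SLA" could mean in practice:

```python
# There is no widely adopted standard for machine-readable SLAs; the dict
# below is purely a sketch of the kinds of terms the overview lists
# (reliability, support, incident recovery). All names and values are
# hypothetical.
sla = {
    "dataset": "http://example.org/dataset/bus-stops",
    "availability": 0.995,            # target uptime fraction
    "supportHours": "Mon-Fri 09:00-17:00 UTC",
    "incidentResponseHours": 24,      # maximum first-response time
    "releaseSchedule": "monthly",
}

def meets_availability(sla_terms, observed_uptime):
    """Check an observed uptime fraction against the SLA target."""
    return observed_uptime >= sla_terms["availability"]

print(meets_availability(sla, 0.997))
```

Once such terms are structured rather than buried in contract prose, a commercial reuser can compare providers or monitor compliance automatically.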

Elements: (Each element described in more detail at Use-Case Elements )

  • Domains:
  • Obligation/motivation:
  • Usage:
  • Quality:
  • Size:
  • Type/format:
  • Rate of change:
  • Data lifespan:
  • Potential audience:

Technical Challenges:

  • Defining common SLA requirements for industry reuse
  • Existing standards/vocabularies for SLA requirements
  • Machine-readable access to SLAs

Potential Requirements:

Publication of Data via APIs

Contributor: Deirdre Lee

Overview: APIs are commonly used to publish data in formats designed for machine consumption, as opposed to the corresponding HTML pages, whose main aim is to deliver content suitable for human consumption. Questions remain around how APIs can best be designed to publish data, and even whether APIs are the most suitable way of publishing data at all (http://ruben.verborgh.org/blog/2013/11/29/the-lie-of-the-api/). Could use of HTTP and URIs be sufficient? If the goal is to facilitate machine-readable data, what is best practice?

Elements: (Each element described in more detail at Use-Case Elements )

  • Domains:
  • Obligation/motivation:
  • Usage: Developer
  • Quality:
  • Size: Use of APIs can serve to increase size of data transfer
  • Type/format: html/xml/json/rdf
  • Rate of change: static/real-time
  • Data lifespan:
  • Potential audience: machine-readable

Technical Challenges:

  • APIs can be too clunky/rich in their functionality, which may increase the number of calls necessary and the size of data transferred, reducing performance
  • Collaboration between API providers and users is necessary to agree on 'useful' calls
  • API key agreements could restrict the Openness of Open Data?
  • Documentation accompanying APIs can be lacking
  • What is best practice for publishing streams of real-time data (with/without APIs)?
  • Each resource should have one URI uniquely identifying it. There can then be different representations of the resource (xml/html/json/rdf)
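The last challenge, one URI with several representations, is essentially HTTP content negotiation. A minimal sketch with an invented resource and a simplified Accept-header check:

```python
import json

# Sketch of content negotiation over one resource URI: the same resource
# is served in different representations depending on the Accept header.
# The resource data is invented, and the header matching is deliberately
# simplified (a real server would parse quality values etc.).
RESOURCE = {"id": "stop-42", "name": "O'Connell Street", "lat": 53.3498}

def represent(resource, accept_header):
    """Return (content_type, body) for a single resource URI."""
    if "application/json" in accept_header:
        return "application/json", json.dumps(resource)
    if "text/csv" in accept_header:
        header = ",".join(resource)
        row = ",".join(str(v) for v in resource.values())
        return "text/csv", header + "\n" + row
    # default to an HTML representation for browsers
    return "text/html", "<h1>%s</h1>" % resource["name"]

ctype, body = represent(RESOURCE, "application/json")
print(ctype, body)
```

The point is that the URI identifies the resource, not a format: clients choose a representation per request, so no separate "API endpoint" per format is needed.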

Potential Requirements:

NYC Open Data Program

Contributor: Steven Adler

Overview: Carole Post was appointed by Mayor Bloomberg as Commissioner of the NYC Department of IT (DOITT) in 2010 and was the first woman in the city's history to be CIO. She was the architect of NYC's Open Data program, sponsored the Open Data Portal and helped pass the city's Open Data legislation. On March 11 she gave a presentation to the W3C on her experiences changing the city culture and building the Open Data Portal. A recording of her presentation is provided here: Carole Post Webinar - NYC. A copy of her presentation in PDF can be found here: Carole Post Presentation on NYC Open Data


Elements:

Recife Open Data Portal

Contributor: Bernadette Lóscio

Overview: Recife is a beautiful city in the Northeast of Brazil, famous for being one of Brazil's biggest tech hubs. Recife is also one of the first Brazilian cities to release data generated by public sector organisations for public use as Open Data. The Open Data Portal Recife was created to offer access to a repository of governmental machine-readable data about several domains, including finances, health, education and tourism. Data is available in CSV and GeoJSON formats, and every dataset has a metadata description, i.e. a description of the data that helps in its understanding and usage. However, the metadata is not described using standard vocabularies or taxonomies. In general, data is created in a static way: data from relational databases is exported in CSV format and then published in the data catalog. Currently, they are working to generate data dynamically from the contents of the relational databases, so that data will be available as soon as it is created. The main phases of the development of this initiative were: educating people with appropriate knowledge concerning Open Data; identifying the sources of data that potential consumers could find useful; extracting and transforming data from the original data sources to the open data format; configuring and installing the open data catalogue tool; and publishing the data and releasing the portal.

Elements:

  • Domains: Base registers, Cultural heritage information, Geographic information, Infrastructure information, Social data and Tourism Information
  • Obligation/motivation: Data that must be provided to the public under a legal obligation (the Brazilian Information Access Act, enacted in 2012); provide public data to the citizens
  • Usage: Data that supports democracy and transparency; Data used by application developers
  • Quality: Verified and clean data
  • Size: in general small to medium CSV files
  • Type/format: CSV, geojson
  • Rate of change: different rates of changes depending on the data source
  • Data lifespan:
  • Potential audience: application developers, startups, government organizations

Technical Challenges:

  • Use common vocabs to facilitate data integration
  • Provide structural metadata to help data understanding and usage
  • Automate the data publishing process to keep data up to date and accurate

Retrato da Violência (Violence Map)

Contributor: Yaso

Overview: This is a data visualization made in 2012 by Vitor Batista, Léo Tartari and Thiago Bueno for a W3C Brazil Office challenge, using data from Rio Grande do Sul (a Brazilian state). The data was released in a .zip package; the original format was .csv. The code and documentation of the project are in its GitHub repository.

Elements: (Each element described in more detail at Use-Case Elements )

  • Domains: political information, regional security information.
  • Obligation/motivation: Data that must be provided to the public under a legal obligation, the so-called LAI or Brazilian Information Access Act, enacted in 2012
  • Usage:
  • Quality: not guaranteed data
  • Size:
  • Type/format: Tabular data
  • Rate of change: There are no new releases of the data
  • Data lifespan:
  • Potential audience:

Positive aspects: the data was in CSV format. However, it is now (2014) outdated, there is no forecast of new releases, and there is no metadata in it.

Negative aspects: the decision to transform the CSV into JSON was based on the need for hierarchical data; on the positive side, it was considered that the CSV structure can be mapped to XML or JSON. CSV only covers tabular data, while JSON can cover more complex structures.

Challenges: this was not guaranteed data and there was no metadata. A sample of the .csv files:

ACEGUA;2007;10;22;segunda-feira;Manhã;09:30:00;Consumado;Residência;012;032
ACEGUA;2011;01;18;terca-feira;Madrugada;00:30:00;Consumado;Residência;015;027
AGUA SANTA;2006;01;01;domingo;Madrugada;00:01:00;Consumado;Residência;009;019
AGUA SANTA;2008;01;30;quarta-feira;Madrugada;00:01:00;Consumado;Residência;014;034
AGUAS CLARAS (VIAMAO);2010;04;03;sabado;Tarde;14:00:00;Consumado;;010;063
AGUDO;2006;05;10;quarta-feira;Tarde;12:00:00;Consumado;Outros;017;022
AGUDO;2006;12;30;sabado;Manhã;06:30:00;Tentado;Residência;013;028
AGUDO;2007;03;01;quinta-feira;Manhã;08:00:00;Consumado;Residência;010;023
AGUDO;2007;04;12;quinta-feira;Madrugada;00:15:00;Tentado;Via Publica;015;030
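The rows above are semicolon-delimited with no header, which is where the CSV-to-JSON transformation mentioned in the negative aspects starts. A minimal sketch of that mapping; since the files carried no metadata, the column names below are guesses for illustration only:

```python
import csv
import io
import json

# Two of the sample rows above, embedded for a self-contained example.
SAMPLE = """\
ACEGUA;2007;10;22;segunda-feira;Manhã;09:30:00;Consumado;Residência;012;032
AGUDO;2006;12;30;sabado;Manhã;06:30:00;Tentado;Residência;013;028
"""

# The files have no header and no metadata, so these names are guesses
# inferred from the values; the last two columns remain opaque.
FIELDS = ["municipality", "year", "month", "day", "weekday",
          "period", "time", "status", "place", "col10", "col11"]

def rows_to_json(text):
    """Map semicolon-delimited rows to a list of JSON-ready dicts."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return [dict(zip(FIELDS, row)) for row in reader]

records = rows_to_json(SAMPLE)
print(json.dumps(records[0], ensure_ascii=False))
```

Having to guess the column names is exactly the cost of missing metadata that this use case highlights.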

Potential Requirements:

Dados.gov.br

Contributor: Yaso

Overview: Dados.gov.br is the open data portal of Brazil's Federal Government. The site was built collaboratively, by a community network led by three technicians from the Ministry of Planning, who managed WG3 of INDA, the National Infrastructure for Open Data. CKAN was chosen because it is free software and offers more independent solutions for publishing the Federal Government's data catalog on the internet.

Elements: (Each element described in more detail at Use-Case Elements )

  • Domains: federal budget, addresses, Infrastructure information, e-gov tools usage, social data, geographic information, political information, Transport information
  • Obligation/motivation: Data that must be provided to the public under a legal obligation, the so-called LAI or Brazilian Information Access Act, enacted in 2012
  • Usage:
    • Data that is the basis for services to the public;
    • Data that has commercial re-use potential.
  • Quality: Authoritative, clean data, vetted and guaranteed;
  • Lineage/Derivation: Data came from various publishers. As a catalog, the site has faced several challenges, one of which was to integrate the various technologies and methods used by publishers to provide datasets in the portal.
  • Size:
  • Type/format: Tabular data, text data
  • Rate of change: There is fixed data and data with high rate of change
  • Data lifespan:
  • Potential audience:

Technical Challenges:

  • Data integration (lack of vocabularies)
  • Collaborative construction of the portal: managing online sprints and balancing public expectations
  • Licensing the data in the portal: most of the data in the portal does not have a specific data licence. As can be seen, different types of licences are applied to the datasets.

ISO GEO Story

Contributor: Ghislain Atemezing

Overview: ISO GEO is a company managing catalogue records of geographic information in XML, conforming to ISO 19139 (the French adaptation of ISO 19115). An excerpt is here: http://cl.ly/3A1p0g2U0A2z. They export thousands of catalogues like that today, but they need to manage them better. In their platform they store the information in a more conventional manner, and use this standard to export datasets compliant with INSPIRE interoperability, or via the CSW protocol. Sometimes they have to enrich their metadata with other records, produced by tools like GeoSource and accessed through an SDI (Spatial Data Infrastructure), alongside their own metadata records.

A sample containing 402 metadata records in ISO 19139 is available for public consultation at http://geobretagne.fr/geonetwork/srv/fr/main.home. They want to be able to integrate all the different implementations of ISO 19139 in different tools into a single framework, to better understand the thousands of metadata records they use in their day-to-day business. The types of information recorded in each file (see the example at http://www.eurecom.fr/~atemezin/datalift/isogeo/5cb5cbeb-fiche1.xml) are the following: contact info (metadata) [data issued]; spatial representation; reference system info [code space]; spatial resolution; geographic extent of the data; file distribution; data quality; process step, etc.
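The kind of extraction such a unifying framework would perform can be sketched with a highly simplified record. Note the assumptions: real ISO 19139 uses the gmd/gco namespaces and a much deeper structure; every element name below is invented for illustration:

```python
import xml.etree.ElementTree as ET

# Highly simplified stand-in for an ISO 19139 record: real records use the
# gmd/gco namespaces and a much deeper structure. Element names here are
# illustrative only, to show the kind of field extraction a single
# framework over the catalogue would perform.
SAMPLE_RECORD = """
<metadata>
  <contact><organisation>ISO GEO</organisation></contact>
  <dateStamp>2013-05-01</dateStamp>
  <referenceSystem>EPSG:2154</referenceSystem>
  <extent west="-5.2" east="-1.0" south="47.2" north="48.9"/>
</metadata>
"""

def summarise(record_xml):
    """Extract a flat summary from one (simplified) metadata record."""
    root = ET.fromstring(record_xml)
    return {
        "contact": root.findtext("contact/organisation"),
        "issued": root.findtext("dateStamp"),
        "crs": root.findtext("referenceSystem"),
        "bbox": dict(root.find("extent").attrib),
    }

print(summarise(SAMPLE_RECORD))
```

Applied across thousands of records, flat summaries like this are what make unified validation and discovery services feasible.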

Elements: (Each element described in more detail at Use-Case Elements )

  • Domains: Geographic information,
  • Obligation/motivation:
  • Usage:
  • Quality:
  • Size: hundreds of records (~500) at regional level
  • Type/format: XML, CSW API
  • Rate of change: Low rate of change
  • Data lifespan:
  • Potential audience:

Technical Challenges:

  • Achieve interoperability between supporting applications, e.g. validation and discovery services built over the metadata repository
  • Capture the semantics of the current metadata records with respect to the ISO 19139 standard
  • Unify the way each record within the catalogue is accessed at different levels, e.g. local, regional, national or EU level

Potential Requirements:

Dutch basic registers

Contributor: Christophe Guéret

Overview: The Netherlands has a set of registers it is looking at opening and exposing as Linked (Open) Data in the context of the project "PiLOD". The registers contain information about buildings, people, businesses and other individuals that public bodies may want to refer to for their daily activities. One of them is, for instance, the service of public taxes ("Belastingdienst"), which regularly pulls data from several registers, stores it in a big Oracle instance and curates it. This costly and time-consuming process could be optimised by providing on-demand access to up-to-date descriptions provided by the register owners.

Elements: (Each element described in more detail at Use-Case Elements )

  • Domains:
  • Obligation/motivation:
  • Usage:
  • Quality:
  • Size:
  • Type/format:
  • Rate of change:
  • Data lifespan:
  • Potential audience:

Technical Challenges: In terms of challenges, linking is for once not much of an issue, as registers already cross-reference unique identifiers (see also http://www.wikixl.nl/wiki/gemma/index.php/Ontsluiting_basisgegevens). A URI scheme with predictable URIs is being considered for implementation. Actual challenges include:

  • Capacity: at this point it cannot be expected that every register owner takes care of publishing his own data. Some of them export what they have to the national open data portal. This data has been used for some testing of third-party publication by PiLODers, but this is rather sensitive as a long-term strategy (governmental data has to be traceable/trustable as such). The middle-ground solution currently deployed is the PiLOD platform, a (semi-)official platform for publishing register data.
  • Privacy: some of the register data is personal, or may become so when linked to other data (e.g. disambiguating personal data based on addresses). Some registers will need to provide secured access to some of their data to some people only (Linked Data, not Open). Others can go along with open data as long as they get a precise log of who is using what.
  • Revenue: institutions working under mixed gov/non-gov funding generate part of their revenue by selling some of the data they curate. Switching to an open data model will generate a direct loss in revenue that has to be backed up by other means. This does not have to mean closing the data; e.g. a model of open dereferencing + paid dumps can be considered, as well as other indirect revenue streams.
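The predictable URI scheme under consideration can be sketched as a single stable pattern per register, so that other registers can cross-reference identifiers without lookup tables. The domain and register names below are invented; the real scheme is still being designed:

```python
# Sketch of a predictable URI scheme for register entries: one stable
# pattern per register, built only from the register name and the entry's
# own identifier. The base domain and register names are made up; the
# actual PiLOD scheme is still under consideration.
BASE = "http://data.example.nl/id"

def register_uri(register, identifier):
    """Build a dereferenceable URI for an entry in a basic register."""
    return "%s/%s/%s" % (BASE, register, identifier)

print(register_uri("building", "0363100012345678"))
```

Because the URI is a pure function of the register and identifier, any consumer holding an identifier can construct the URI, and thus link to the entry, without contacting the register first.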

Potential Requirements:

Wind Characterization Scientific Study

Contributor: [1]

Overview: This use case describes a data management facility being constructed to support scientific offshore wind energy research for the U.S. Department of Energy’s Office of Energy Efficiency and Renewable Energy (EERE) Wind and Water Power Program. The Reference Facility for Renewable Energy (RFORE) project is responsible for collecting wind characterization data from remote sensing and in situ instruments located on an offshore platform. This raw data is collected by the Data Management Facility (DMF) and processed into a standardized NetCDF format. Both the raw measurements and the processed data are archived in the PNNL Institutional Computing (PIC) petascale computing facility. The DMF will record all processing history, quality assurance work, problem reporting, and maintenance activities for both instrumentation and data.

All datasets, instrumentation, and activities are cataloged providing a seamless knowledge representation of the scientific study. The DMF catalog relies on linked open vocabularies and domain vocabularies to make the study data searchable.

Scientists will be able to use the catalog for faceted browsing, ad-hoc searches, and query by example. For accessing individual datasets, a REST GET interface to the archive will be provided.
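The faceted browsing the catalog is meant to support amounts to filtering catalogued entries by attribute values. A minimal sketch; the catalogue entries, instruments and quality flags are invented for illustration:

```python
# Sketch of faceted search over a dataset catalogue: filter entries by
# any combination of facet=value pairs. The entries, instrument names
# and QC flags below are invented for illustration.
CATALOG = [
    {"dataset": "lidar-2014-03", "instrument": "lidar", "qc": "passed"},
    {"dataset": "sodar-2014-03", "instrument": "sodar", "qc": "passed"},
    {"dataset": "lidar-2014-04", "instrument": "lidar", "qc": "flagged"},
]

def facet_search(catalog, **facets):
    """Return entries matching every given facet=value pair."""
    return [entry for entry in catalog
            if all(entry.get(k) == v for k, v in facets.items())]

print(facet_search(CATALOG, instrument="lidar", qc="passed"))
```

The linked open vocabularies mentioned above would supply the controlled values for each facet, so that "lidar" means the same thing across all catalogued studies.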

Elements: (Each element described in more detail at Use-Case Elements )

  • Domains:
  • Obligation/motivation:
  • Usage:
  • Quality:
  • Size:
  • Type/format:
  • Rate of change:
  • Data lifespan:
  • Potential audience:

Technical Challenges: For accessing numerous datasets, scientists will access the archive directly using other protocols such as sftp, rsync and scp, and access techniques such as HPN-SSH: http://www.psc.edu/index.php/hpn-ssh

Potential Requirements:

BuildingEye: SME use of public data

Contributor: Deirdre Lee

Overview: Buildingeye.com makes building and planning information easier to find and understand by mapping what's happening in your city. In Ireland, local authorities handle planning applications and usually provide some customised views of the data (PDFs, maps, etc.) on their own websites. However, there isn't an easy way to get a nationwide view of the data. BuildingEye, an independent SME, built http://mypp.ie/ to achieve this. However, as the local authorities didn't have Open Data portals, BuildingEye had to ask each local authority directly for its data. It was granted access by some authorities, but not all. The data it did receive was in different formats and of varying quality/detail. BuildingEye harmonised this data for its own system. However, if another SME wanted to use this data, it would have to go through the same process and again ask each local authority for the data.
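The harmonisation step described above, mapping each authority's legacy schema onto one common shape, can be sketched as follows. The field names on all sides are invented; real local-authority schemas would differ:

```python
# Sketch of harmonising planning data from two local authorities that
# expose the same information under different legacy schemas. All field
# names, on both the source and target sides, are invented for
# illustration.
def normalise_authority_a(rec):
    return {"reference": rec["app_no"], "address": rec["site_addr"],
            "status": rec["decision"]}

def normalise_authority_b(rec):
    return {"reference": rec["PlanningRef"], "address": rec["Location"],
            "status": rec["CurrentStatus"]}

a_record = {"app_no": "24/0101", "site_addr": "1 Main St", "decision": "granted"}
b_record = {"PlanningRef": "P-2024-77", "Location": "5 High Rd",
            "CurrentStatus": "pending"}

# After normalisation both records share one schema and can be combined
# into the nationwide view.
combined = [normalise_authority_a(a_record), normalise_authority_b(b_record)]
print(combined)
```

Every SME repeating this mapping for itself is precisely the duplicated effort that a shared vocabulary and standard formats would remove.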

Elements: (Each element described in more detail at Use-Case Elements )

  • Domains: Planning data
  • Obligation/motivation: demand from SME
  • Usage: Commercial usage
  • Quality: standardised, interoperable across local authorities
  • Size: medium
  • Type/format: structured according to legacy system schema
  • Rate of change: daily
  • Data lifespan:
  • Potential audience: Business, citizens
  • “Governance”: local authorities

Technical Challenges:

  • Access to data is currently a manual process, on a case by case basis
  • Data is provided in different formats, e.g. database dumps, spreadsheets
  • Data is structured differently, depending on the legacy system schema, concepts and terms not interoperable
  • No official Open license associated with the data
  • Data is not available for further reuse by other parties

Potential Requirements:

  • Creation of top-down policy on Open Data to ensure common understanding and approach
  • Top-down guidance on recommended Open license usage
  • Standardised, non-proprietary formats
  • Availability of recommended domain-specific vocabularies.

Digital archiving of Linked Data

Contributor: Christophe Guéret

Overview: Taking the concrete example of the digital archive "DANS", digital archives have so far been concerned with the preservation of what could be defined as "frozen" datasets. A frozen dataset is a finished, self-contained set of data that does not evolve after it has been constituted. The goal of the preserving institution is to ensure this dataset remains available and readable for as many years as possible. This can, for example, concern an audio record, a digitized image, e-books or database dumps. Consumers of the data are expected to look up specific content based on its associated persistent identifier, download it from the archive and use it. Now comes the question of the preservation of Linked Open Data. In opposition to "frozen" datasets, linked data can be qualified as "live" data. The resources it contains are part of a larger entity to which third parties contribute, and one of the design principles indicates that other data producers and consumers should be able to point to the data. As LD publishers stop offering their data (e.g. at the end of a project), taking the LD off-line as a dump and putting it in an archive effectively turns it into a frozen dataset, much like SQL dumps and other kinds of databases. The question then arises as to what extent this is an issue.

Elements: (Each element described in more detail at Use-Case Elements )

  • Domains: DANS is concerned with research related datasets - preferably from the computational humanities and social science communities
  • Obligation/motivation: DANS preserves and serves data given to it under some privacy conditions (not all datasets are open)
  • Usage: research
  • Quality: varied
  • Size: varied
  • Type/format: RDF in various serialisation formats (RDF/XML, JSON-LD, TTL, ...)
  • Rate of change: frozen
  • Data lifespan: unlimited
  • Potential audience: future generations of researchers

Technical Challenges: The archive has to decide whether serving dereferenceable resources found in preserved datasets is required, and whether to provide a SPARQL endpoint. If data consumers and publishers are fine with having RDF data dumps downloaded from the archive prior to usage - just like any other digital item so far - the technical challenges could be limited to handling the size of the dumps and taking care of serialisation evolution over time (e.g. from N-Triples to TriG, or from RDF/XML to HDT) as the preference for these formats evolves. Turning a live dataset into a frozen dump also raises the question of scope. Considering that LD items are only part of a much larger graph that gives them meaning through context, the only valid dump would be a complete snapshot of the entire connected component of the Web of Data graph that the target dataset is part of.

Potential Requirements: Decide on the importance of the dereferenceability of resources and the potential implications for domain names and naming of resources. Decide on the scope of the step that will turn a connected sub-graph into an isolated data dump.

LA Times' reporting of Ron Galperin's Infographic

Contributor: Phil Archer

Overview:

On 27 March 2014, the LA Times published a story, "Women earn 83 cents for every $1 men earn in L.A. city government". It was based on an Infographic released by LA's City Controller, Ron Galperin. The Infographic was based on a dataset published on LA's open data portal, Control Panel LA. That portal uses the Socrata platform, which offers a number of spreadsheet-like tools for examining the data, the ability to download it as CSV, embed it in a Web page and see its metadata.

Positive aspects:

  • The LA Times story makes its sources clear (it also links to a related Pew Research Center article).
  • It offers readers a commentary on the particular issue raised and is easy for anyone to digest.
  • Data sources are cited directly and can be followed up on by (human) readers.

Negative aspects:

Challenges:

  • Data Citation - how could Ron Galperin have referred to the source data in the Infographic? (the URI is way too long). QR code? Short PURL?
  • How could the publisher of the data link to the Infographic as a visualization of it?
  • In this case, the creator of the underlying data is the same as the creator of the Infographic, but if they were different, how could the data creator discover the Infographic, still less the media report about it?
  • The methodology used is not explained - making it hard to assess trustworthiness. How can provenance be described?
  • The metadata is incomplete and does not use a recognized standard vocabulary, making automated discovery and use by anyone other than the data creator difficult.

Other Data Journalism blogs:

FiveThirtyEight

Wall Street Journal’s Number Guy column

Guardian’s data blog

The Land Portal

Contributor: Carlos Iglesias

Overview: The IFAD Land Portal platform has been completely rebuilt as an Open Data collaborative platform for the Land Governance community. Among the new features, the Land Portal will provide access to more than 100 comprehensive, in-depth indicators from over 25 different sources on land governance issues for more than 200 countries, as well as a repository of land-related content and documentation. Thanks to the new platform, people can (1) curate and incorporate new data and metadata by means of different data importers, making use of the underlying common data model; (2) search, explore and compare the data across countries and indicators; and (3) consume and reuse the data by different means (i.e. raw data download at the data catalog; linked data and a SPARQL endpoint at the RDF triplestore; a RESTful API; and a built-in graphic visualization framework).

Elements: (Each element described in more detail at Use-Case Elements )

  • Domains: Land Governance; Development
  • Obligation/motivation: To find reliable data-driven indicators on land governance and put them all together to facilitate access, study, analysis, comparison and detection of data gaps.
  • Usage: Research; Policy Making, Journalism; Development; Investments; Governance; Food security; Poverty; Gender issues.
  • Quality: Every sort of data, from high-quality to unverified.
  • Size: Varies, but low-medium in general.
  • Type/format: Varies: APIs; JSON; spreadsheets; CSVs; HTMLs; XMLs; PDFs...
  • Rate of change: Usually yearly, but also other periodicities (monthly, quarterly, ...)
  • Data lifespan: Unlimited.
  • Potential audience: Practitioners; Policy makers; Activists; Researchers; Journalists.

Technical Challenges:

  • Data coverage.
  • Quality of data and metadata.
  • Lack of machine-readable metadata.
  • Inconsistency between different data sources.
  • Wide variety of formats and technologies.
  • Some non machine-readable formats.
  • Data variability (models, sources, etc.)
  • Data provenance.
  • Diversity and (sometimes) complexity of Licenses.
  • Internationalization issues (e.g. different formats for numbers, dates, etc.) and multilingualism

Potential Requirements:

  • Availability of general use taxonomies (countries, topics, etc.).
  • Data interoperability i.e. domain-specific vocabularies for a common data model with reference formats and protocols.
  • Data persistence.
  • Versioning mechanisms.

Radar Parlamentar

Contributor: Nathalia

Overview: Radar Parlamentar is a web application that illustrates the similarities between political parties based on analysis of vote data from the Brazilian congress. The similarities are presented in two-dimensional graphics, in which circles represent parties or parliamentarians, and the distance between these circles represents how similarly they vote. There is also a section dedicated to gender issues: how many women are in each party over the years, which themes are most handled by each gender and party, etc.

Elements:

  • Domains: Political information, voting records
  • Obligation/motivation: The Brazilian government began to provide their data in an open format through the Dados.gov.br portal.
  • Usage: Re-use and exploration of data available in the Dados.gov.br portal in other kinds of visualisation.
  • Quality: Every sort of data, from high-quality to unverified (depends on the data provided by the parliamentary houses).
  • Size: Varies (depends on the data provided by the parliamentary houses).
  • Type/format: Tag clouds, 2D graphic, matrix display, treemap.
  • Rate of change: No defined periodicity.
  • Data lifespan: Not defined.
  • Potential audience: Brazilian citizens

Technical Challenges:

  • There are significant differences between data from different parliamentary houses, i.e., they don't use a standard ontology
  • There is a lack of data about votes in the National Assembly
  • Data are being released bit by bit
  • The data release frequency has not been established
  • Few voting data are available for certain time periods
  • Data quality from the City Council is not good; some data are visibly wrong

Potential Requirements:

  • Feed to notify developers when there are new data
  • Good filters/searches to avoid many unnecessary requests
  • Definition of update data frequency
  • Standard ontology for all parliamentary houses
  • Documentation: there is a page in the web application explaining the methodology used.

Uruguay: open data catalogue

Contributor: AGESIC

Overview: The Uruguay open data site holds 85 datasets containing 114 resources since the first dataset was published in Dec. 2012. The open data initiative prioritizes the "use of data" rather than the "quantity of data", which is why the catalogue holds 25 applications using dataset resources in some way. It's important for the project to keep a 1:3 ratio between applications and datasets. Most of the resources are CSV and shapefiles; basically we have a 3-star catalogue, and the reason why we can't go to the next level is the lack of resources (time, human, economic, etc.) at government agencies to implement an open data liberation strategy. So when we are asked about opening data, keeping it simple is the answer, and CSV is by far the easiest and smartest way to start. Uruguay has an access to public information law, but does not have legislation about open data. The open data initiative is led by AGESIC with the support of the open data working group. OD Working group: Intendencia de Montevideo (www.montevideo.gub.uy), INE (www.ine.gub.uy), AGEV (www.agev.opp.gub.uy), FING - UDELAR (www.fing.edu.uy), D.A.T.A. (www.datauy.org)


Elements: (Each element described in more detail at Use-Case Elements )

  • Domains:

    • Infrastructure: Most of the datasets are shapefiles.
    • Transportation: Shapefiles and CSV, containing information about public transportation (stops and frequency), roads, accidents, etc.
    • Tourism: Data about regional events, cultural agenda, hotels, campings, statistics.
    • Economics: Budget, consumer price declarations, etc.
    • Social development
    • Environment
    • Health
    • Education
    • Culture

  • Obligation/motivation:

There is no obligation for government agencies to publish open data. All initiatives were carried out by agencies that want to support the initiative.

  • Usage:

Develop applications and new services for citizens, agency interoperability (exchange of information in open data formats), transparency.

  • Quality:

Most of the data is updated properly; dataset metadata is complete, and resource metadata is about 70% complete.

  • Size:

Small; most datasets are smaller than 1 GB.

  • Type/format:

    • SHAPEFILE (35)
    • CSV (26)
    • TXT (19)
    • ZIP (12)
    • HTML (7)
    • XLS (6)
    • PDF (4)
    • XML (3)
    • RAR (2)

  • Rate of change:

Depends on the dataset.

  • Data lifespan:

Depends on the dataset; some change in real time, others monthly, every 6 months, annually, or never change.

  • Potential audience:

Developers, Journalists, Civil society, Entrepreneurs.

Technical Challenges:

  • Consolidate the tools used to manage datasets.
  • Improve visualizations and transform resources to a higher level (4-5 stars).
  • Automate the publication process using harvesting or similar tools.
  • Provide alerts or a control panel to keep data updated.

Potential Requirements:

GS1: GS1 Digital

DWBP Use Case for GS1 Digital / GTIN+ on the Web (Linked Open Data for products)

Contributor: Mark Harrison (University of Cambridge) & Eric Kauz (GS1)

Overview: Retailers and Manufacturers / Brand Owners are beginning to understand that there can be benefits to openly publishing structured data about products and product offerings on the web as Linked Open Data. Some of the initial benefits may be enhanced search listing results (e.g. Google Rich Snippets) that improve the likelihood of consumers choosing such a product or product offer over an alternative product that lacks the enhanced search results. However, the longer term vision is that an ecosystem of new product-related services can be enabled if such data is available. Many of these will be consumer-facing and might be accessed via smartphones and other mobile devices, to help consumers to find the products and product offers that best match their search criteria and personal preferences or needs - and to alert them if a particular product is incompatible with their dietary preferences or other criteria such as ethical / environmental impact considerations - and to suggest an alternative product that may be a more suitable match.

There are at least five main actors in this use case:

  • Manufacturers / Brand Owners
  • Retailers
  • GS1
  • Search engines, data aggregators and developers of smartphone apps
  • Accreditation agencies

The figure below provides an overview of some of the kinds of factual claims that might be asserted about a product or product offering and the corresponding parties that have the authority to assert such claims.

Figure: Overview of some of the kinds of factual claims that might be asserted about a product or product offering, and the corresponding parties that have the authority to assert such claims.


1) Manufacturers / Brand Owners

They publish authoritative master data about their products - data that is intrinsic to the product itself. This includes technical specifications, lists of ingredients, allergens, the results of various accreditations (e.g. environmental, ethical), as well as the product category and various attribute-value pairs (about qualitative and quantitative characteristics of the product). Many of the quantitative values will consist of a quantity and a unit of measurement - and for some of these (e.g. nutritional information), it is essential to unambiguously specify the reference quantity - e.g. per product pack, per serving size, per 100g or 100ml of product. Some values should be selected from standardized code lists and expressed using URIs rather than literal text strings, in order to better support multi-lingual applications as well as comparisons between products that share some characteristics in common (it is more reliable to check for exact URI matches of codified values than to check for fuzzy matches of text strings). Each product carries a globally unambiguous identifier, the Global Trade Item Number (GTIN), which is typically represented as an EAN-13 or UPC-12 linear barcode on the product packaging. A GTIN should point to at most one product; products that are distinct should have distinct GTINs. The brand owner assigns a GTIN to each product they produce. An HTTP URI representation of a GTIN issued under the registered domain of a brand owner can serve as the Subject in a graph of factual claims about the product, for which the brand owner has authority to make such claims. It can also serve (via HTTP 303 redirection) to retrieve a graph of such data in a preferred representation (e.g. via HTTP Content Negotiation with the Accept: header).
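GTINs such as EAN-13 and UPC-12 end in a mod-10 check digit computed with the standard GS1 algorithm: weight the digits alternately 3, 1, 3, 1, ... starting from the rightmost data digit, sum, and take the value that rounds the sum up to a multiple of 10. A small validation sketch, useful when ingesting product identifiers (the sample GTINs are illustrative):

```python
def gtin_check_digit(body: str) -> int:
    """Standard GS1 mod-10 check digit: weight digits 3,1,3,1,... starting
    from the rightmost digit of the body (the GTIN minus its check digit)."""
    total = sum(int(d) * (3 if i % 2 == 0 else 1)
                for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10

def is_valid_gtin(gtin: str) -> bool:
    """True if the final digit matches the computed check digit."""
    return gtin.isdigit() and len(gtin) > 1 and \
        gtin_check_digit(gtin[:-1]) == int(gtin[-1])

print(is_valid_gtin("4006381333931"))  # a well-formed EAN-13
```

The same function covers all GTIN lengths (GTIN-8 through GTIN-14), since the alternating weights are always anchored at the rightmost digit.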

2) Retailers

A retailer has the authority to assert factual information about an offer it makes for a product. This includes information such as price, availability, payment options, delivery/collection options and store locations, and should include a reference to the product (identified via its GTIN). Typically, an online retailer's website will replicate or embed some or all of the authoritative product data from the brand owner or manufacturer. However, this data needs to be accurate and synchronized so that it is up-to-date (taking into account any recent changes to the product information). Existing data synchronization mechanisms, such as GDSN (Global Data Synchronization Network), exist for synchronizing master data about products and organizations within a business-to-business / supply chain context - but these mechanisms currently do not make use of Linked Data technology, nor publish such data openly on the web. If the retailer instead only references the graph of authoritative master data about the product published by the brand owner, this in turn relies upon (1) open publishing of that information by the brand owner using Linked Open Data techniques, such that the brand owner's HTTP URI correctly redirects to a graph of authoritative structured data, and (2) confidence that search engines, data aggregators and other consumers of the data will actually follow such HTTP URI references to import that externally referenced data, without disadvantaging a retailer who chooses to reference (rather than embed) product data for which they are not authoritative (with the exception of 'own brand' products for which they are the authority).

3) GS1

GS1 (http://gs1.org) is a global not-for-profit standards development organisation that develops user-driven open standards for improving the efficiency of supply chains. GS1 brings together a community of over 1 million companies who work together to develop a common language for exchanging information about products and supply chain operations. Some of the results of this work include the data model for the Global Data Synchronization Network (GDSN) for synchronising details on product, party and price, the GS1 Global Data Dictionary (GDD) and its code lists, and the Global Product Classification (GPC) system. The GPC is a product classification developed by the GS1 community that enables trading partners to communicate more efficiently and accurately throughout their supply chain activities.

Within the GS1 Digital initiative, the GTIN+ on the Web project is supporting brand owners, manufacturers and retailers as they begin to adopt Linked Open Data technology for sharing structured data about products openly on the web. Although GS1 does not have the product data, nor is it authoritative about either the product master data or the product offers made by retailers, it does have the authority to publish its existing data models, definitions and code lists as a GS1 Linked Data Vocabulary and guidelines that can be used by anyone for describing product details with greater precision and expressive power than can currently be achieved using some of the existing broad web vocabularies (such as schema.org). Work is already in progress to convert many of these from existing open data in formats such as XML to RDF datasets and vocabularies.

4) Search engines, data aggregators and developers of smartphone apps

These are the consumers of product data. They rely on being able to make comparisons between multiple retail offers for the same product (correlated through the GTIN of the product) and to find similar products (e.g. using the Global Product Classification (GPC) and attribute-value pairs). They rely on the available data being correct and up-to-date, especially since they often present the primary user interface to consumers who will make decisions (to buy, to consume) based on the information presented to them.

5) Accreditation agencies

These are independent, neutral third-party organizations who verify claims (e.g. ethical or environmental claims) about the product and its production. Examples include organizations such as the Marine Stewardship Council, the Soil Association, etc. Each of these organizations has the sole authority to certify whether a product or its production process conforms to a particular claim - and to award the corresponding accreditation to the product. The brand owner / manufacturer and retailer may in turn embed or reference such claims, although the relevant accreditation agency is the authoritative source for such claims.

Elements: (Each element described in more detail at Use-Case Elements )

  • Domains:
    • Product master data (e.g. technical specifications, ingredients, nutritional information, dimensions, weight, packaging)
    • Product offerings (e.g. sales price, availability (online, locally), payment options, delivery/collection options)
    • Ethical / environmental claims about a product and its production process
  • Obligation/motivation:
    • initially, enhanced search result listings (e.g. Google Rich Snippets)
    • vision is to enable an ecosystem of new digital apps around product data
    • the food sector in the EU is already obliged under new food labelling legislation (EU 1169 / 2011, Article 14) to provide the same amount of information about a food product that is sold online to consumers as the information that would be available to them from the product packaging if they picked up the product in-store. Although the legislation does not suggest that Linked Open Data technology should be used to make the same information available in a machine-readable format, there is currently significant investment and effort to upgrade websites to provide accurate and detailed information about food products; the GS1 Digital team consider that for a relatively small amount of effort, these companies could gain some tangible benefits (e.g. enhanced search results) from such compliance efforts by using Linked Open Data technology within their web pages.
  • Usage:
    • data providing transparency about product characteristics
    • data used to help consumers make informed choices about which products to buy/consume
  • Quality: Very important to have trustworthy authoritative data from respective organizations
  • Size: Typically 20+ factual claims per product - probably 40+ RDF triples
  • Type/format: HTML + RDFa / JSON-LD / Microdata
  • Rate of change: mostly static data initially - but subject to some variation over time
  • Data lifespan: data should remain accessible until products are no longer considered to be in circulation; this represents a challenge for deprecated product lines. Data that is stated authoritatively by one organization might be embedded / referenced in the data asserted by another organization; this raises concerns that embedded data becomes stale if it is inadequately synchronized, and that referenced data is not dereferenced (and therefore not discovered / gathered) by consumers of the data. From a liability perspective, there also needs to be clarity about which organization asserted which factual information - and also about which organization has the authority to assert specific factual claims.
  • Potential audience: machine-readable (search engines, data aggregators, mobile apps etc.)

Technical Challenges:

  • Linked Open Data about products is likely to be highly distributed in nature and various parties have authority over specific claims:
  • Accreditation agencies have authority over ethical/environmental claims
  • Brand owners / manufacturers have authority over product master data
  • Retailers have authority over facts related to product offerings (price, availability etc.)
  • An organization (e.g. retailer) might embed authoritative data asserted by another organization (e.g. brand owner) and there is the risk that such embedded information becomes stale if it is not continuously synchronized.
  • An organization (e.g. retailer) might reference a graph of authoritative data that can be retrieved via an HTTP request to a remote HTTP URI. There is a risk that software or search engines consuming Linked Open Data containing such references may fail to dereference such HTTP URIs and in doing so may fail to gather all of the relevant data.
  • Organizations are currently faced with a choice of whether to embed machine-readable structured data in their web pages using a block approach (e.g. JSON-LD) or an inline approach (e.g. RDFa, RDFa Lite or Microdata). A block approach (JSON-LD) may be simpler and less brittle than inline annotation, especially as it can be easily decoupled from structural changes to the body of the web page that may happen over time as a website is redesigned. At present, tool support for the three major markup approaches for embedded Linked Open Data (RDFa, JSON-LD, Microdata) is unequal, and some tools may not export or import / ingest all three formats - some tools even fail to extract data from JSON-LD markup created by their own export tool. There are also significant challenges in ensuring that the structured data embedded within a web page is correctly linked to form coherent RDF triples, without any dangling nodes that should be connected to the Subject or other nodes.
  • Only with best-in-class tool support that recognizes all three major formats on a completely equal footing can organizations be confident that they can use any of the three markup formats, and verify / validate that their own markup results in the correct RDF triples.
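As a concrete illustration of the block approach, the sketch below assembles a minimal schema.org product description and wraps it in the `<script type="application/ld+json">` element that would be embedded in a page. The product values and GTIN are invented for illustration; a GS1 vocabulary could supply more precise, codified properties.

```python
import json

# Invented example product; schema.org's Product type provides gtin13.
product = {
    "@context": "http://schema.org/",
    "@type": "Product",
    "gtin13": "4006381333931",
    "name": "Example Highlighter",
    "brand": {"@type": "Brand", "name": "ExampleBrand"},
}

# The JSON-LD block is decoupled from the page body: redesigning the HTML
# around it does not disturb the machine-readable data.
block = '<script type="application/ld+json">\n%s\n</script>' % json.dumps(product, indent=2)
print(block)
```

The same graph could instead be spread through the page as RDFa or Microdata attributes; the trade-off described above is that inline markup is coupled to the page structure, while the block is not.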

Potential Requirements:

  • The ability to determine who asserted various facts - and whether they are the organization that can assert those facts authoritatively.
  • Where data from other sources is embedded, there is a risk that the embedded data might be stale. It is therefore helpful to indicate which graph of triples is a snapshot in time from data from another source - and to provide a link to the original source, so that the consumer of the data has the opportunity to obtain a fresh version of the live data rather than relying on a potentially stale snapshot graph of data. DWBP could provide guidance about how to indicate which graph of data is a snapshot and where it came from.
  • Consumers of Linked Open Data about products might rely on it for making decisions - not only about purchase but even consumption. If the data about a product is inaccurate or out-of-date, we might need to provide some guidance about how liability terms and disclaimers can be expressed in Linked Open Data. We’re not suggesting that we define such terms from a legal perspective - but perhaps there is an existing framework in a similar way that there is an existing framework for expressing various licences of the data? If not, perhaps such a framework needs to be developed - but outside of the DWBP group? Licensing generally says what you’re allowed to do with the data - but I don’t think it says anything about liability for using the data or making decisions based on that data. This area probably needs some clarification, particularly if there is a risk of injury or death (due to inaccurate information about allergens in a food product).

Tabulae - how to get value out of data

Contributor: Luis Polo

Overview: Tabul.ae is a framework to publish and visually explore data that can be used to deploy powerful and easy-to-exploit open data platforms, helping organizations unleash the potential of their data. The aim is to enable data owners (public organizations) and consumers (citizens and business re-users) to transform the information they manage into added-value knowledge, empowering them to easily create data-centric web applications. These applications are built upon interactive and powerful graphs, and take the shape of interactive charts, dashboards, infographics and reports. Tabulae provides a high degree of assistance in creating these apps and also automates several data visualization tasks (e.g., recognition of geographical entities to automatically generate a map). In addition, the charts and maps are portable outside the platform and can be smartly integrated with any web content, enhancing the reusability of the information.

Elements: (Each element described in more detail at Use-Case Elements )

  • Domains: Quantitative and geographical information: stats, biodiversity, socio-economic indicators, environment, security, etc
  • Obligation/motivation: to help citizens and companies (especially, consultancy firms) to understand and create value from open data by means of reusable, user-made visualizations.
  • Usage: Data used by citizens, public employees and companies.
  • Quality: The information must be at least semi-structured (for instance, a spreadsheet).
  • Size: Medium and large datasets (hundreds of thousands to millions of rows)
  • Type/format: Tabulae can manage relational databases, geojson, csv files and spreadsheets, and provides an API for programmatic access.
  • Rate of change: depending on the original datasets. The platform enables automatic update from original sources.
  • Data lifespan: depending on the original datasets.
  • Potential audience: Organizations that want to publish their catalogue of datasets and aim to maximize their impact and consumption.

Technical Challenges:

  • Quality of data and metadata.
  • Inconsistency between different data sources.
  • Wide variety of formats and technologies.
  • Different data schemas that complicate the integration of data sources.
  • Diversity and (sometimes) complexity of Licenses.
  • Data persistence.
  • Internationalization and format issues (e.g., languages, numbers, dates, etc.)

Potential Requirements:

  • Dataset versioning and updating mechanisms
  • Standardization of schemas
  • Integration with other platforms/services

Bio2RDF

Contributor: Carlos Laufer

Overview: Bio2RDF [1] is an open source project that uses Semantic Web technologies to make possible the distributed querying of integrated life sciences data. Since its inception [2], Bio2RDF has made use of the Resource Description Framework (RDF) and the RDF Schema (RDFS) to unify the representation of data obtained from diverse (molecules, enzymes, pathways, diseases, etc.) and heterogeneously formatted biological data (e.g. flat-files, tab-delimited files, SQL, dataset specific formats, XML etc.). Once converted to RDF, this biological data can be queried using the SPARQL Protocol and RDF Query Language (SPARQL), which can be used to federate queries across multiple SPARQL endpoints.
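The SPARQL Protocol that such federation relies on is plain HTTP: a query is sent to an endpoint as a url-encoded `query` parameter. A minimal sketch of building such a request URL, with a hypothetical endpoint and class IRI:

```python
from urllib.parse import urlencode

ENDPOINT = "https://example.bio2rdf.org/sparql"  # hypothetical endpoint

def sparql_get_url(query: str) -> str:
    """Encode a SPARQL query as the 'query' parameter of an HTTP GET URL,
    per the SPARQL Protocol."""
    return ENDPOINT + "?" + urlencode({"query": query})

url = sparql_get_url("SELECT ?s WHERE { ?s a <http://example.org/Gene> } LIMIT 10")
print(url)
```

A federated query would add SERVICE clauses naming other endpoints, letting one endpoint combine results from several Bio2RDF datasets in a single query.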

Elements:

  • Domains: Biological data
  • Obligation/motivation:

Biological researchers are often confronted with the inevitable and unenviable task of having to integrate their experimental results with those of others. This task usually involves a tedious manual search and assimilation of often isolated and diverse collections of life sciences data hosted by multiple independent providers, including organizations such as the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI), which provide dozens of user-submitted and curated databases, as well as smaller institutions such as the Donaldson group, which publishes iRefIndex [3], a database of molecular interactions aggregated from 13 data sources. While these mostly isolated silos of biological information occasionally provide links between their records (e.g. UniProt links its entries to hundreds of other databases), they are typically serialized either in HTML tags or in flat-file data dumps that lack the semantic richness required to serialize the intent of the linkage between data records. With thousands of biological databases and hundreds of thousands if not millions of datasets, the ability to find relevant data is hampered by non-standard database interfaces and an enormous number of haphazard data formats [4]. Moreover, metadata about these biological data providers (dataset source information, dataset versioning, licensing information, date of creation, etc.) is often difficult to obtain. Taken together, the inability to easily navigate through available data presents an overwhelming barrier to their reuse.

  • Usage: Biological research
  • Quality:

Provenance: Bio2RDF scripts generate provenance records using the W3C Vocabulary of Interlinked Datasets (VoID), the W3C Provenance vocabulary (PROV) and the Dublin Core vocabulary. Each data item is linked to a provenance object that indicates the source of the data, the time at which the RDF was generated, licensing (if available from the data source provider), the SPARQL endpoint in which the resource can be found, and the downloadable RDF file where the data item is located. Each dataset provenance object has a unique IRI and label based on the dataset name and creation date. The date-specific dataset IRI is linked to a unique dataset IRI using the W3C PROV predicate "wasDerivedFrom", so that one can query the dataset SPARQL endpoint to retrieve all provenance records for datasets created on different dates. Each resource in the dataset is linked to the date-unique dataset IRI that is part of the provenance record using the VoID "inDataset" predicate. Other important features of the provenance record include the use of the Dublin Core "creator" term to link a dataset to the script on GitHub that was used to generate it, the VoID predicate "sparqlEndpoint" to point to the dataset's SPARQL endpoint, and the VoID predicate "dataDump" to point to the data download URL.
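The linkage pattern just described can be sketched as plain subject-predicate-object tuples. Every IRI, date, and script identifier below is an invented stand-in for illustration, not a real Bio2RDF identifier.

```python
# Sketch of the VoID/PROV provenance linkage (illustrative IRIs only).
dataset = "bio2rdf:example-dataset"               # version-independent IRI
dated = "bio2rdf:example-dataset-2012-07-25"      # date-specific IRI
record = "example:record-1"                        # one data item

provenance = {
    # the dated dataset is derived from the abstract dataset
    (dated, "prov:wasDerivedFrom", dataset),
    # every resource points back to the dated dataset it belongs to
    (record, "void:inDataset", dated),
    # conversion script, endpoint and dump (hypothetical values)
    (dated, "dcterms:creator", "example:conversion-script"),
    (dated, "void:sparqlEndpoint", "http://example.org/sparql"),
    (dated, "void:dataDump", "http://example.org/dumps/example.nq.gz"),
}

# Which dated dataset does the record belong to?
home = next(o for s, p, o in provenance
            if s == record and p == "void:inDataset")
```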

Dataset metrics: the following metrics are computed for each dataset:

  1. total number of triples
  2. number of unique subjects
  3. number of unique predicates
  4. number of unique objects
  5. number of unique types
  6. unique predicate-object links and their frequencies
  7. unique predicate-literal links and their frequencies
  8. unique subject type-predicate-object type links and their frequencies
  9. unique subject type-predicate-literal links and their frequencies
  10. total number of references to a namespace
  11. total number of inter-namespace references
  12. total number of inter-namespace-predicate references
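The simpler of the metrics above can be computed directly over a set of triples. A stdlib-only sketch, using a toy triple set with illustrative IRIs rather than real Bio2RDF data:

```python
# Metrics 1-5 from the list above, over a toy set of triples.
triples = [
    ("ex:a", "rdf:type", "ex:Protein"),
    ("ex:a", "ex:name", '"p53"'),
    ("ex:b", "rdf:type", "ex:Pathway"),
]

total = len(triples)                                   # 1. total triples
subjects = len({s for s, _, _ in triples})             # 2. unique subjects
predicates = len({p for _, p, _ in triples})           # 3. unique predicates
objects = len({o for _, _, o in triples})              # 4. unique objects
types = len({o for _, p, o in triples                  # 5. unique types
              if p == "rdf:type"})
```

In practice each of these would be a SPARQL aggregate query (COUNT with DISTINCT) run against the dataset's endpoint; the set comprehensions above compute the same quantities in miniature.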
  • Size:

Nineteen datasets were generated as part of the Bio2RDF 2 release. Several of the datasets are themselves collections of datasets that are now available as one resource. Each dataset has been loaded into a dataset-specific SPARQL endpoint using OpenLink Virtuoso version 6.1.6; the endpoints are available at http://[namespace].bio2rdf.org. All updated Bio2RDF linked data and their corresponding Virtuoso DB files are available for download.

Dataset                                   Namespace    # of triples
Affymetrix                                affymetrix   44469611
Biomodels                                 biomodels    589753
Comparative Toxicogenomics Database       ctd          141845167
DrugBank                                  drugbank     1121468
NCBI Gene                                 ncbigene     394026267
Gene Ontology Annotations                 goa          80028873
HUGO Gene Nomenclature Committee          hgnc         836060
Homologene                                homologene   1281881
InterPro                                  interpro     999031
iProClass                                 iproclass    211365460
iRefIndex                                 irefindex    31042135
Medical Subject Headings                  mesh         4172230
National Center for Biomedical Ontology   ncbo         15384622
National Drug Code Directory              ndc          17814216
Online Mendelian Inheritance in Man       omim         1848729
Pharmacogenomics Knowledge Base           pharmgkb
SABIO-RK                                  sabiork      2618288
Saccharomyces Genome Database             sgd          5551009
NCBI Taxonomy                             taxon        17814216
Total (19 datasets)                                    1010758291
  • Type/format: RDF
  • Rate of change:
  • Data lifespan:
  • Potential audience: Biological researchers

Technical Challenges:

  • Lack of human-readable metadata
  • Data variability (models, sources, etc.)
  • RDFization of datasets
  • Wide variety of formats and technologies

Potential Requirements:

  • Dataset versioning and updating mechanisms
  • Standardization of schemas
  • Integration with other platforms/services
  • Data persistence

References:

  • [1] Callahan A, Cruz-Toledo J, Ansell P, Klassen D, Tumarello G, Dumontier M: Improved dataset coverage and interoperability with Bio2RDF Release 2. SWAT4LS 2012, Proceedings of the 5th International Workshop on Semantic Web Applications and Tools for Life Sciences, Paris, France, November 28-30, 2012.
  • [2] Belleau F, Nolin MA, Tourigny N, Rigault P, Morissette J: Bio2RDF: towards a mashup to build bioinformatics knowledge systems. J Biomed Inform 2008, 41(5):706-716.
  • [3] Razick S, Magklaras G, Donaldson IM: iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinformatics 2008, 9:405.
  • [4] Goble C, Stevens R: State of the nation in data integration for bioinformatics. J Biomed Inform 2008, 41(5):687-693.

Challenges

Challenges are recorded in a table with the columns: Category | Challenge | Related use-case(s) | Relevant | In-scope

Considerations

Use-Case Elements

  • Domains, e.g.
    • Base registers, e.g. addresses, vehicles, buildings;
    • Business information, e.g. patent and trademark information, public tender databases;
    • Cultural heritage information, e.g. library, museum, archive collections;
    • Geographic information, e.g. maps, aerial photos, geology;
    • Infrastructure information, e.g. electricity grid, telecommunications, water supply, garbage collection;
    • Legal information, e.g. supranational (e.g. EU) and national legislation and treaties, court decisions;
    • Meteorological information, e.g. real-time weather information and forecasts, climate data and models;
    • Political information, e.g. parliamentary proceedings, voting records, budget data, election results;
    • Social data, e.g. various types of statistics (economic, employment, health, population, public administration, social);
    • Tourism information, e.g. events, festivals and guided tours;
    • Transport information, e.g. information on traffic flows, work on roads and public transport.
  • Obligation/motivation, e.g.
    • Data that must be provided to the public under a legal obligation, e.g. legislation, parliamentary and local council proceedings (dependent on specific jurisdiction);
    • Data that is a (by-)product of the public task, e.g. base registers, crime records
  • Usage, e.g.
    • Data that supports democracy and transparency;
    • Data that is the basis for services to the public;
    • Data that has commercial re-use potential.
    • Data that the public provides;
    • Utilization Rates once the Data is published;
  • Quality, e.g.
    • Authoritative, clean data, vetted and guaranteed;
    • Unverified or dirty data.
  • Lineage/Derivation, e.g.
    • Where the data came from;
    • What formulas were used to process the data;
    • How long the controlling authority had the data;
  • Size (ranging from small CSV files of less than a megabyte to potentially tera- or petabytes of sensor or image data)
  • Type/format, e.g.
    • Text, e.g. legislation, public announcements, public procurement;
    • Image, e.g. aerial photos, satellite images;
    • Video, e.g. traffic and security cameras;
    • Tabular data, e.g. statistics, spending data, sensor data (such as traffic, weather, air quality).
    • Data Classification
  • Rate of change, e.g.
    • Fixed data, e.g. laws and regulations, geography, results from a particular census or election;
    • Low rate of change, e.g. road maps, info on buildings, climate data;
    • Medium rate of change, e.g. timetables, statistics;
    • High rate of change, e.g. real-time traffic flows and airplane location, weather data
  • Data lifespan
  • Potential audience
  • Certification/Governance
    • Individuals or systems that certified the data for publication
    • Processes and steps documented to publish
    • public access and redress for data quality
    • Indication of FOIA Status

Linked Data Glossary

A glossary of terms defined and used to describe Linked Data, its associated vocabularies and Best Practices: http://www.w3.org/TR/ld-glossary/

Common Questions to consider for Open-Data Use-Cases

  1. Did you have a legislative or regulatory mandate to publish Open Data?
  2. What were the political obstacles you faced to publish Open Data?
  3. Did your citizens expect Open Data?
  4. Did your citizens understand the uses of Open Data?
  5. Did you publish data and information available in other forms (print, web, etc) first?
  6. How did you inventory your data prior to publishing?
  7. Did you classify your data as part of the inventory?
  8. How did you transform printed materials into Open Data?
  9. Does your city certify the quality of the data published, and what steps are involved in certification?
  10. Do you have data traceability and lineage, i.e., do you know where your data came from and who has transformed it?
  11. Can you provide an audit trail of data usage and security prior to publication?
  12. Can you track the utility of the data published?
  13. Are you using URIs to identify data elements?
  14. Do you have a Data Architecture?
  15. What is your Data Governance structure and program?
  16. Do you have a Chief Data Officer and Data Governance Council who make decisions about what to publish and how?
  17. Do you have an Open Data Policy?
  18. Do you do any Open Data Risk Assessments?
  19. Can you compare your Open Data to neighboring cities and regions?
  20. Do you provide any Open Data visualization and analytics on top of your publication portal?
  21. Do you have a common application development framework and cloud hosting environment to maintain Open Data apps?
  22. What legal agreements and frameworks have you developed to protect your citizens and your city from the abuse and misuse of Open Data?

Stories

NYC Council needs modern and inexpensive member services and constituent services tools

Date: Monday, 23 Feb 2014
From: Noel Hidalgo, Executive Director of BetaNYC
To: NY City Council’s Committee on Rules, Privileges and Elections
Subject: For a modern 21st Century City, the NY Council needs modern and inexpensive member services and constituent services tools.

Dear Chairman and Committee Member,

Good afternoon. It is a great honor to address you and represent New York City’s technology community, particularly a rather active group of technologists: the civic hackers. I am Noel Hidalgo, the Executive Director and co-founder of BetaNYC [1]. With over 1,500 members, BetaNYC’s mission is to build a city powered by the people, for the people, for the 21st Century. Last fall, we published a “People’s Roadmap to a Digital New York City,” where we outline our civic technology values and 30 policy ideas for a progressive digital city [2]. We are a member-driven organization and a member of the New York City Transparency Working Group [3], a coalition of good government groups that supported the City’s transformative Open Data Law.

In 2008, BetaNYC got its start by building a small app on top of Twitter. This tool, Twitter Vote Report, was built over the course of several of what were then called developer days (now hacknights), and enabled over 11,300 individuals to use a digital and social tool for election protection reporting. [4]

Around the world, apps like this catalyzed our current civic hacking moment. Today, hundreds of thousands of developers, designers, mappers, hackers, and yackers (the policy wonks) volunteer their time to analyze data, build public engagement applications, and use their skills to improve the quality of life of their neighbors. This past weekend, we had Manhattan Borough President Gale Brewer, Councilmember Ben Kallos, Councilmember Mark Levine, a representative from Councilmember Rosie Mendez, and representatives from five Community Boards challenge over 100 civic hackers to prototype 21st Century interfaces to NYC’s open data. [15]

Through this conversation on rules reform, you have an opportunity to continue the pioneering work that a small, talented team of civic hackers and I did WITHIN the New York State Senate.

In 2004, I moved from Boston to work for then-Senator Patterson’s Minority Information Services department. In 2009, I rejoined the NY State Senate as part of its first Chief Information Officer’s office. Our team’s mission was to move the State Senate from zero to hero, depoliticize technology, and build open, reusable tools for all.

In the course of four months, we modernized the Senate’s public information portal, leading the way for two years of digital transparency, efficiency, and participation. These initiatives were award-winning and carried out under the banner of “Open Senate.” From the blog of Andrew Hoppin, the former NY State Senate CIO [5]:

Open Senate is an online “Gov 2.0” program intended to make the Senate one of the most transparent, efficient, and participatory legislative bodies in the nation. Open Senate comprises multiple sub-projects led by the Office of the Chief Information Officer (CIO) in the New York State Senate, ranging from migrating to cost-effective, open-source software solutions, to developing and sharing original web services providing access to government transparency data, to promoting the use of social networks and online citizen engagement.

We did this because we all know how New Yorkers are getting their information. I don’t need to sit here and spout off academic numbers on digital connectivity; one just has to hop into a subway station to see just about everyone on some sort of digital device. For a modern NY City Council with 21st-century member services, the Council needs a Chief Information Officer and dedicated staff. The role of this office would be similar to the NY Senate CIO’s: empowered to do everything from migrating to cost-effective, open-source software solutions, to developing and sharing original web services providing access to government transparency data, to promoting the use of social networks and online citizen engagement.

Through this office, the Council would gain an empowered digital and information officer to coordinate the development and enhancement of member and constituent services.

Member services could be improved with the following:

  • Online and modern digital information tools.
    • Imagine a council website that you can call your own, including official videos, photos, hearings, press releases, petitions, interactive maps of your district, online forms, event notifications, and online townhalls.

  • Usable and updateable committee websites.
  • Constituent tracking & relationship management tools.
    • Imagine being able to take a constituent issue and automatically file a 311 complaint and monitor the status of the complaint to completion. Imagine being able to send targeted constituent messages and reduce your paper mailings.
    • Imagine being able to survey your constituents via a mobile app or sms.
  • Better business and internal technology practices
    • No matter where you are, from desktops to mobile devices, you could always have access to council's internal systems while on the go.
  • More usable interfaces to legislation
    • Imagine a simpler interface to legistar that integrates constituent comments and public feedback.
  • Real Time dashboards of 311 call and ticket status, municipal agency performance tracking, and budget expenditure tracking.
    • Imagine a monitor in your office and a website you could send to your constituents that demonstrates government performance in your district.
  • A universal participatory budgeting tool that works for all council districts.
    • Imagine a tool that cuts across the digital divide and empowers all to participate in participatory budgeting.

In our “People’s Roadmap to a Digital New York City,” we specifically call on the Council to adopt the following programs. This is a brief summary of them:

  • Create "We the People of NYC," a petition tool for any elected representative. [6]
  • Update and centralize NYC’s Freedom of Information Law [7] [8]
  • Publish the City Record Online [9]
  • Expand the 311 system by implementing and growing the Open311 standard [10]
  • Release government content under a Creative Commons license [11]
  • Equip Community Boards with better tools [12]
  • Expand Participatory Budgeting [13]
  • Put the NYC Charter, Rules, and Code online [14]

Hidalgo, Noel Monday, 24 Feb 2014 BetaNYC’s testimony in favor of better member services

URLs Referenced:



Palo Alto Open Data Story

On February 17th we heard a use case presentation from Jonathan Reichental, CIO of the City of Palo Alto.
A recording of the use case presentation can be found here: Palo Alto - Open by Default
1. We can explore the use of URIs for Open Data elements and for physical things in a city that have multiple data elements.
2. Cities are not yet tagging their data with metadata to allow comparability.
3. There are not yet mechanisms to allow citizens to improve data completeness.
4. Cities have internal processes for assuring data quality, including sign-offs from IT and public officials, but these activities are not recorded in metadata and provided with the datasets.
5. Cities are not tracing data origin and lineage.
6. Tuples would be a good way to identify relationships between things and data elements, allowing machine comparability of datasets in an internet of things that open data describes.

Palo Alto pledged to be a partner with the W3C in our WG, which is a great outcome.

ISO GEO Story

ISO GEO is a company managing catalog records of geographic information in XML, conforming to ISO 19139 (a French adaptation of ISO 19115). An excerpt is available here: http://cl.ly/3A1p0g2U0A2z. They export thousands of such catalogs today, but they need to manage them better. In their platform, they store the information in a more conventional manner and use this standard to export datasets compliant with INSPIRE interoperability requirements, or via the CSW protocol. Sometimes they have to enrich their own metadata records with metadata produced by tools like GeoSource and accessed through an SDI (Spatial Data Infrastructure).

A sample containing 402 metadata records in ISO 19139 is available for public consultation at http://geobretagne.fr/geonetwork/srv/fr/main.home. They want to be able to integrate all the different implementations of ISO 19139 found in different tools into a single framework, to better understand the thousands of metadata records they use in their day-to-day business. The types of information recorded in each file (see the example at http://www.eurecom.fr/~atemezin/datalift/isogeo/5cb5cbeb-fiche1.xml) are the following: contact info (metadata) [data issued]; spatial representation; reference system info [code space]; spatial resolution; geographic extension of the data; file distribution; data quality; process step, etc.
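Reading individual fields out of an ISO 19139 record needs no special tooling; a hedged sketch using Python's standard library, with a minimal made-up record rather than one of ISO GEO's actual files:

```python
# Extract one field from a (minimal, invented) ISO 19139 record using
# the real gmd/gco namespace URIs defined by the standard.
import xml.etree.ElementTree as ET

NS = {"gmd": "http://www.isotc211.org/2005/gmd",
      "gco": "http://www.isotc211.org/2005/gco"}

XML = """<gmd:MD_Metadata
    xmlns:gmd="http://www.isotc211.org/2005/gmd"
    xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:dateStamp><gco:Date>2013-05-01</gco:Date></gmd:dateStamp>
</gmd:MD_Metadata>"""

root = ET.fromstring(XML)
# "data issued" in the list above corresponds to gmd:dateStamp
date_stamp = root.find("gmd:dateStamp/gco:Date", NS).text
```

Collecting a handful of such fields per record into a single index is one way to get a unified view across the differing ISO 19139 implementations mentioned above.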

BuildingEye: SME use of public data

Buildingeye.com makes building and planning information easier to find and understand by mapping what's happening in your city. In Ireland, local authorities handle planning applications and usually provide some customised views of the data (PDFs, maps, etc.) on their own websites. However, there isn't an easy way to get a nationwide view of the data. BuildingEye, an independent SME, built http://mypp.ie/ to achieve this. However, as no local authority had an Open Data portal, BuildingEye had to ask each local authority directly for its data. It was granted access by some authorities, but not all. The data it did receive was in different formats and of varying quality/detail. BuildingEye harmonised this data for its own system. However, if another SME wanted to use this data, it would have to go through the same process and again ask each local authority for the data.
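The harmonisation step can be sketched as a simple field-mapping exercise: each authority's records are mapped onto one shared schema. The field names below are invented for illustration, not BuildingEye's actual schema.

```python
# Map differently-shaped planning records onto one common schema.
def harmonise(record, mapping):
    """Rename fields of one authority's record to the common schema."""
    return {common: record[local] for common, local in mapping.items()}

# Two authorities, two layouts for the same kind of data (invented):
authority_a = {"ref": "PA-2014-001", "desc": "Extension",
               "lodged": "2014-01-10"}
authority_b = {"AppNumber": "14/0002", "Proposal": "New dwelling",
               "DateReceived": "2014-02-03"}

MAP_A = {"reference": "ref", "description": "desc", "received": "lodged"}
MAP_B = {"reference": "AppNumber", "description": "Proposal",
         "received": "DateReceived"}

applications = [harmonise(authority_a, MAP_A),
                harmonise(authority_b, MAP_B)]
```

The point of the use case is that every SME currently has to build and maintain such mappings itself; a shared Open Data portal with a common schema would do this work once.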

Recife Open Data Story

Recife is a beautiful city in the Northeast of Brazil, famous for being one of Brazil’s biggest tech hubs. Recife is also one of the first Brazilian cities to release data generated by public sector organisations for public use as Open Data. An Open Data Portal was created to offer access to a repository of governmental machine-readable data about several domains, including finances, health, education and tourism. Data is available in CSV and GeoJSON formats, and every dataset has a metadata description, i.e. a description of the data that helps in understanding and using it. However, the metadata is not described using standard vocabularies or taxonomies. In general, data is created in a static way: data from relational databases is exported to CSV and then published in the data catalog. Currently, they are working on generating data dynamically from the contents of the relational databases, so that data becomes available as soon as it is created. The main phases of the development of this initiative were: educating people with appropriate knowledge concerning Open Data; identifying the sources of data that potential consumers could find useful; extracting and transforming data from the original sources to the open data format; configuring and installing the open data catalogue tool; and publishing the data and releasing the portal.
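The static publication flow described above (relational database to CSV to catalog) can be sketched with the standard library alone. Table and column names are illustrative, not Recife's actual schema.

```python
# Export a relational table to CSV, the static publication step
# described in the story (illustrative schema, in-memory database).
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE health_units (name TEXT, district TEXT)")
conn.execute("INSERT INTO health_units VALUES ('Unit A', 'Boa Viagem')")

buf = io.StringIO()
writer = csv.writer(buf)
cur = conn.execute("SELECT name, district FROM health_units")
writer.writerow([col[0] for col in cur.description])  # header row
writer.writerows(cur)                                 # data rows

csv_text = buf.getvalue()  # what would be uploaded to the portal
```

The dynamic approach they are moving toward would instead serve such a query's results on demand, so the published data never lags behind the database.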

Dutch basic registers

Author: Christophe Guéret

Story: The Netherlands has a set of registers it is looking at opening and exposing as Linked (Open) Data in the context of the project "PiLOD". The registers contain information about buildings, people, businesses and other entities that public bodies may want to refer to in their daily activities. One of them is, for instance, the public tax service ("BelastingDienst"), which regularly pulls data from several registers, stores this data in a big Oracle instance and curates it. This costly and time-consuming process could be optimised by providing on-demand access to up-to-date descriptions provided by the register owners.

Challenges: In terms of challenges, linking is for once not much of an issue, as registers already cross-reference unique identifiers (see also http://www.wikixl.nl/wiki/gemma/index.php/Ontsluiting_basisgegevens). A URI scheme with predictable URIs is being considered for implementation. Actual challenges include:

  • Capacity: at this point, it cannot be expected that every register owner takes care of publishing its own data. Some of them export what they have to the national open data portal. This data has been used for some testing with third-party publication by PiLODers, but this is rather sensitive as a long-term strategy (governmental data has to be traceable/trustable as such). The middle-ground solution currently deployed is the PiLOD platform, a (semi-)official platform for publishing register data.
  • Privacy: some of the register data is personal, or may become so when linked to other data (e.g. disambiguating personal data based on addresses). Some registers will need to provide secured access to some of their data to some people only (Linked Data, not Open Data). Others can go along with open data as long as they get a precise log of who is using what.
  • Revenue: institutions working under mixed governmental/non-governmental funding generate part of their revenue by selling some of the data they curate. Switching to an open data model will generate a direct loss in revenue that has to be compensated by other means. This does not have to mean closing the data; e.g. a model of open dereferencing plus paid dumps can be considered, as well as other indirect revenue streams.
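To make the predictable-URI idea mentioned above concrete, here is a minimal sketch; the domain, path pattern, and identifiers are invented for illustration and are not the actual PiLOD design.

```python
# Hypothetical predictable URI scheme for register records: given the
# register name and a record identifier, anyone can construct the URI
# without consulting a lookup service.
def register_uri(register: str, ident: str) -> str:
    """Mint a stable, guessable URI for a record in a named register."""
    return f"http://data.example.nl/id/{register}/{ident}"

building = register_uri("buildings", "12345")
```

Predictability is what lets registers cross-reference each other cheaply: a register that stores only the identifier can reconstruct the link target on demand.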


Wind Characterization Scientific Study

Author: Eric Stephan

Story: This use case describes a data management facility being constructed to support scientific offshore wind energy research for the U.S. Department of Energy’s Office of Energy Efficiency and Renewable Energy (EERE) Wind and Water Power Program. The Reference Facility for Renewable Energy (RFORE) project is responsible for collecting wind characterization data from remote sensing and in situ instruments located on an offshore platform. This raw data is collected by the Data Management Facility (DMF) and processed into a standardized NetCDF format. Both the raw measurements and processed data are archived in the PNNL Institutional Computing (PIC) petascale computing facility. The DMF will record all processing history, quality assurance work, problem reporting, and maintenance activities for both instrumentation and data.

All datasets, instrumentation, and activities are cataloged providing a seamless knowledge representation of the scientific study. The DMF catalog relies on linked open vocabularies and domain vocabularies to make the study data searchable.

Scientists will be able to use the catalog for faceted browsing, ad hoc searches, and query-by-example. For accessing individual datasets, a REST GET interface to the archive will be provided.

Challenges: For accessing numerous datasets, scientists will access the archive directly using protocols such as sftp, rsync and scp, and access techniques such as those described at http://www.psc.edu/index.php/hpn-ssh

Use Cases Document Outline
