CSV on the Web: Metadata Vocabulary for Tabular Data and other updates

The CSV on the Web Working Group has published a First Public Working Draft of a Metadata Vocabulary for Tabular Data. This is accompanied by an update to the Model for Tabular Data and Metadata on the Web document, alongside the group’s recently updated Use Cases and Requirements document.

Validation, conversion, display and search of tabular data on the web requires additional metadata that describes how the data should be interpreted. The “Metadata vocabulary” document defines a vocabulary for metadata that annotates tabular data, at the cell, table or collection level, while the “Model” document describes a basic data model for such tabular data.
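
As a rough illustration of why such metadata matters, here is a minimal Python sketch in which column-level metadata tells a reader how to interpret each cell of a CSV file. The metadata dictionary, its keys (`columns`, `name`, `datatype`) and the sample data are assumptions made up for this example; they are not the terms defined in the Working Draft.

```python
import csv
import io

# Hypothetical metadata: illustrative keys only, not the vocabulary's actual terms.
METADATA = {
    "columns": [
        {"name": "country", "datatype": "string"},
        {"name": "population", "datatype": "integer"},
    ]
}

# Map the illustrative datatype names to Python converters.
CONVERTERS = {"string": str, "integer": int, "number": float}


def read_with_metadata(text, metadata):
    """Parse CSV text and convert each cell using the column metadata."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = {}
        for column in metadata["columns"]:
            convert = CONVERTERS[column["datatype"]]
            row[column["name"]] = convert(record[column["name"]])
        rows.append(row)
    return rows


# Made-up sample data for the sketch.
SAMPLE = "country,population\nIceland,320000\nMalta,420000\n"
rows = read_with_metadata(SAMPLE, METADATA)
print(rows[0])
```

Without the metadata every cell would stay a string; with it, a validator or converter can tell that `population` is meant to be a number.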

A large percentage of the data published on the Web is tabular data, commonly published as comma separated values (CSV) files. The Working Group welcomes comments on these documents and on their motivating use cases. The next phase of this work will involve exploring mappings from CSV into other popular representations. See the Working Group home page for more details or to get involved.

Linking Geospatial Data on the Web

It was a year ago that Alex Coley and I first started discussing the idea of a workshop around geospatial data and how it can link with other data sources on the Web. Alex is the person at the UK’s Department for Environment, Food and Rural Affairs (DEFRA) who is behind things like the Bathing Water Data explorer (developed by Epimorphics’ Stuart Williams) and the recent release of flood-related data. It didn’t take us long to bring in John Goodwin from Ordnance Survey, Ed Parsons from Google and the Open Geospatial Consortium’s Bart De Lathouwer and Athina Trakas. That was the team that, on behalf of the Smart Open Data project, I worked with to organize the Linking Geospatial Data workshop that took place in London in early March.

For various reasons it has taken until now to write and publish the report from the event (mostly my fault, mea culpa), but a lot has been going on in the background, only some of which is evident from the report, which focuses on the workshop itself.

The workshop was really about two worlds: the geospatial information system world, effectively represented by OGC, and the Web world, represented by W3C. Both organizations operate in similar ways, have similar aims, and have more than 20 members in common. But we also both have 20 years of history and ‘ways of doing things.’ That has created a gap that we really want to fill in – not a huge one – but a gap nonetheless.

I hope the report gives a good flavor of the event – we were honored with contributors from places as distant as the Woods Hole Oceanographic Institution on the US East Coast, Natural Resources Canada, the National Institute of Advanced Industrial Science and Technology in Japan and the Australian government plus, of course, many European experts.

End result? I’m delighted to say that W3C and OGC are in advanced and very positive discussions towards an MoU that will allow us to operate a joint working group to tackle a number of issues that came up during the workshop. At the time of writing, the charter for that joint WG is very much in its draft state but we’re keen to gather opinions, especially, of course, from:

  • OGC and W3C members who plan to join the working group;
  • developers expecting to implement the WG’s recommendations;
  • the closely related communities around the Geolocation Working Group and Web application developers who will want to access sources of richer data;
  • members of the wider community able to review and comment on the work as it evolves.

If you have comments on the charter, please send them to public-gdw-comments@w3.org [subscribe] [archive].

Better yet, if you’re going to the INSPIRE conference in Aalborg next week, please join us for the session reviewing the workshop and the charter on Tuesday 17th at 14:00.

Data on the Web Best Practices UCR Published

The Data on the Web Best Practices WG is faced with a substantial challenge in assessing the scope of its work, which could be vast. What problems should it prioritize and what level of advice is most appropriate for it to develop in order to fulfill the mission of fostering a vibrant and sustainable data ecosystem on the Web? A significant amount of work has gone into collecting use cases from which requirements can be derived for all the WG’s planned deliverables. The Use Case & Requirements document, a first draft of which is published today, is expected to evolve significantly in future but already it provides a strong indication of the direction the WG is taking. Further use cases and comments are very welcome.

Congratulations and thanks in particular to the editors, Deirdre Lee and Bernadette Farias Lóscio, both first time W3C document editors, on getting this document out of the door.

Uses of Open Data Within Government for Innovation and Efficiency

You are warmly invited to participate in the first of a series of workshops being organized during this year and next by the Share-PSI 2.0 Thematic Network. Partners from 25 countries are working on issues surrounding the implementation of the European Commission's revised PSI Directive and this will feed into the Data on the Web Best Practices Working Group.

We're beginning with "Uses of Open Data Within Government for Innovation and Efficiency" – i.e. we're looking for cases where opening data has made it easier for government departments (local or national) to do their job better. What worked? What didn't work? What lessons can you share with others? What would most help you benefit from other people's work?

The workshop is taking place as part of the 5th Samos Summit on ICT-Enabled Governance, which means participants can look forward to spending time on a beautiful island in the Aegean Sea, formerly the home of Pythagoras.

Entry is by position paper. This should not be a full academic paper but rather a short description of what you'd like to talk about.

Deadline for submissions: 13 April
Notification of acceptance: 1 May
Workshop: 30 June – 1 July

Full details at http://www.w3.org/2013/share-psi/workshop/samos/.

Join us!

Phil Archer, W3C Data Activity Lead
On behalf of the Share-PSI partners

Share-PSI 2.0 is co-funded by the European Commission under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme.

RDF 1.1 has been published as Recommendation

The RDF Working Group has today published a set of eight Resource Description Framework (RDF) Recommendations:

  • “RDF 1.1 Concepts and Abstract Syntax” defines an abstract syntax (a data model) which serves to link all RDF-based languages and specifications. The abstract syntax has two key data structures: RDF graphs are sets of subject-predicate-object triples, where the elements may be IRIs, blank nodes, or datatyped literals. They are used to express descriptions of resources. RDF datasets are used to organize collections of RDF graphs, and comprise a default graph and zero or more named graphs.
  • “RDF 1.1 Semantics” describes a precise semantics for the Resource Description Framework 1.1 and RDF Schema, and defines a number of distinct entailment regimes and corresponding patterns of entailment.
  • “RDF Schema 1.1” provides a data-modelling vocabulary for RDF data. RDF Schema is an extension of the basic RDF vocabulary.
  • “RDF 1.1 Turtle” defines a textual syntax for RDF called Turtle that allows an RDF graph to be completely written in a compact and natural text form, with abbreviations for common usage patterns and datatypes. Turtle provides levels of compatibility with the N-Triples format as well as the triple pattern syntax of the SPARQL W3C Recommendation.
  • “RDF 1.1 TriG RDF Dataset Language” defines a textual syntax for RDF called TriG that allows an RDF dataset to be completely written in a compact and natural text form, with abbreviations for common usage patterns and datatypes. TriG is an extension of the Turtle format.
  • “RDF 1.1 N-Triples” is a line-based, plain text format for encoding an RDF graph.
  • “RDF 1.1 N-Quads” is a line-based, plain text format for encoding an RDF dataset.
  • “RDF 1.1 XML Syntax” defines an XML syntax for RDF called RDF/XML in terms of Namespaces in XML, the XML Information Set and XML Base.

Furthermore, the Working Group has also published four Working Group Notes:

  • “RDF 1.1 Primer” provides a tutorial level introduction to RDF 1.1.
  • The RDF 1.1 Concepts, Semantics, Schema, and XML Syntax documents supersede the RDF family of Recommendations as published in 2004. “What’s New in RDF 1.1” provides a summary of the changes between the two versions of RDF.
  • “RDF 1.1: On Semantics of RDF Datasets” presents some issues to be addressed when defining a formal semantics for datasets, as they have been discussed in the RDF 1.1 Working Group.
  • “RDF 1.1 Test Cases” lists the test suites and implementation reports for RDF 1.1 Semantics as well as the various serialization formats.
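
To make the data model described above a little more tangible, here is a minimal Python sketch of an RDF graph as a set of triples and an RDF dataset as a default graph plus named graphs, serialized in the line-based N-Triples and N-Quads styles. It is a simplification under stated assumptions: the `http://example.org/` IRIs are made up, blank nodes and datatyped or language-tagged literals are left out, and only backslashes and double quotes are escaped in literals.

```python
from dataclasses import dataclass

EX = "http://example.org/"  # hypothetical namespace for this sketch


@dataclass(frozen=True)
class Literal:
    """A plain string literal; IRIs are represented as ordinary str values."""
    value: str


def term(t):
    """Serialize one term: literals in quotes, IRIs in angle brackets."""
    if isinstance(t, Literal):
        escaped = t.value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return f"<{t}>"


def to_ntriples(graph):
    """One line per triple; a graph is a set, so order carries no meaning."""
    return "\n".join(sorted(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in graph))


def to_nquads(dataset):
    """Default-graph triples have three terms; named-graph triples add the graph name."""
    lines = [f"{term(s)} {term(p)} {term(o)} ." for s, p, o in dataset["default"]]
    for name, graph in dataset["named"].items():
        lines += [f"{term(s)} {term(p)} {term(o)} {term(name)} ." for s, p, o in graph]
    return "\n".join(sorted(lines))


default_graph = {(EX + "alice", EX + "knows", EX + "bob")}
dataset = {
    "default": default_graph,
    "named": {EX + "g1": {(EX + "alice", EX + "name", Literal("Alice"))}},
}

print(to_ntriples(default_graph))
print(to_nquads(dataset))
```

Because a graph is a set, adding the same triple twice changes nothing, which is one reason a plain `set` is a reasonable stand-in here.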

More Languages for More Vocabularies

Last month I encouraged the provision of multi-lingual labels for vocabularies hosted at W3C. Tokyo librarian Shuji Kamitsuna has been doing terrific work recently and has translated the specification documents for DCAT (English, Japanese) and ORG (English, Japanese), and is now well into completing his work on the Data Cube Vocabulary. After Shuji had completed his work on the specifications, I wanted to update the schemas to include the Japanese labels too, but doing this threw up some issues.

First up was DCAT. The vocabulary is formally specified in the Recommendation and for each term there is a table showing the definition and a usage note. Immediately before each table, the term itself is given as a section title and it’s these section titles that are the English language labels in the schema. See the entry for dcat:Catalog for example. When Shuji translated the spec, the labels were therefore translated too. Transferring these to the schema was trivial. But that was the easy part.

The definitions in the spec are copied into the schema as the rdfs:comment for each term – except they’re not 100% aligned. Take the definition of the property dcat:dataset. The spec says “A dataset that is part of the catalog” whereas the schema gives just a little more help when it says “Links a catalog to a dataset that is part of the catalog.” The Arabic, Spanish, Greek and French labels, definitions and usage notes in the DCAT schema were all translated from the schema, the Japanese from the spec.

This raises the question: assuming that there is no difference in semantics, just a difference in the clarity with which the semantics are expressed, how much does it matter that the definitions in the schema and the spec are not 100% aligned?

When Shuji sent us the translation of ORG, a different issue arose. Like DCAT, the specification for ORG has a small table for each term that gives its definition and usage note. Before each table there is a heading but here’s the difference: in the ORG specification, those headings are written as the vocabulary term such as subOrganizationOf. If ORG followed exactly the same style as DCAT, this would have been written ‘sub organization of’ which is the English language label for the term – i.e. as proper words, not terms written in camel case. Actually it’s even more confusing as the actual label in the schema for ORG says “subOrganization of” – a sort of half way house. Again, does this matter?
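
The relationship between a camel-case term and its plain-words label is simple enough to express in code. This is a small Python sketch, not anything the ORG or DCAT editors actually used; the function name `camel_to_label` is my own.

```python
import re


def camel_to_label(term):
    """Turn a camel-case term into lower-case words separated by spaces."""
    # Insert a space wherever a lower-case letter or digit is followed by an upper-case letter.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", term)
    return spaced.lower()


print(camel_to_label("subOrganizationOf"))
```

Applied to subOrganizationOf this gives ‘sub organization of’, the DCAT-style label, rather than the half way house found in the ORG schema.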

Finally Shuji’s work threw up an issue around the use of upper and lower case letters in vocabularies. The well established convention is that RDF class names begin with upper case letters, properties with lower case letters, both use camel case. Further, where an object property is used for an n-ary relationship between classes, the property is often named in exactly the same way as the class that is the range. For example, in ORG we have org:role that has range org:Role.

You see the problem for Japanese? It is one of many languages that do not have the concept of upper and lower case letters.

I raised this issue in the Web Schemas Task Force and was relieved that there was consensus that for the purpose of translation, it was safe to advise Shuji that the label for the property org:role could legitimately be ‘has role.’

In this and other work I’ve done over the years it’s clear to me that if you really want to check that what you’ve written is consistent and unambiguous – see how it comes out of a translation process. On this occasion I think we’ve got some pointers for future work to tighten these things up.

Final Publications from GLD

In the short time since the beginning of the year, the Government Linked Data Working Group has successfully published its final documents. The Best Practices for Publishing Linked Data Note was published last week providing advice and insights into how linked data publishing differs from other formats; and this week has seen three vocabularies published as Recommendations. Each of these will enhance data interoperability, especially, but not exclusively, in government data. Each one specifies an RDF vocabulary (a set of properties and classes) for conveying a particular kind of information:

  • The Data Catalog (DCAT) Vocabulary is used to provide information about available data sources. When data sources are described using DCAT, it becomes much easier to create high-quality integrated and customized catalogs including entries from many different providers. Many national data portals are already using DCAT.
  • The Data Cube Vocabulary brings the cube model underlying SDMX (Statistical Data and Metadata eXchange, a popular ISO standard) to Linked Data. This vocabulary enables statistical and other regular data, such as measurements, to be published and then integrated and analyzed with RDF-based tools.
  • The Organization Ontology provides a powerful and flexible vocabulary for expressing the official relationships and roles within an organization. This allows for interoperation of personnel tools and will support emerging socially-aware software.

Many members of the GLD deserve specific thanks, in particular Dave Reynolds for his work on Data Cube and ORG, Fadi Maali for his work on DCAT, Richard Cyganiak for his work on all those, Boris Villazón-Terrazas and Ghislain Atemezing for their work on the LD-BP document, and Hadley Beeman who has ensured that the WG kept up the pace to the end, all under the expert guidance of Sandro Hawke. There were many other members of the WG who remained active right to the end and without whom the work could not have been completed and who also deserve sincere thanks. I’d like to end by expressing my particular thanks to Bernadette Hyland who has chaired the Government Linked Data working group since its initial charter, giving up huge amounts of time to the group. I believe Bernadette will be recording her own thoughts here imminently.

JSON-LD Has Been Published as a W3C Recommendation

The RDF Working Group has published two Recommendations today:

  • JSON-LD 1.0. JSON is a useful data serialization and messaging format. This specification defines JSON-LD, a JSON-based format to serialize Linked Data. The syntax is designed to easily integrate into deployed systems that already use JSON, and provides a smooth upgrade path from JSON to JSON-LD. It is primarily intended to be a way to use Linked Data in Web-based programming environments, to build interoperable Web services, and to store Linked Data in JSON-based storage engines.
  • JSON-LD 1.0 Processing Algorithms and API. This specification defines a set of algorithms for programmatic transformations of JSON-LD documents. Restructuring data according to the defined transformations often dramatically simplifies its usage. Furthermore, this document proposes an Application Programming Interface (API) for developers implementing the specified algorithms.
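
The “smooth upgrade path” is easiest to see in an example. In the Python sketch below, the document is ordinary JSON with an added `@context` that maps its keys to IRIs. The `expand_keys` function is a drastically simplified illustration of what a context does; it is not the expansion algorithm defined in the Processing Algorithms and API Recommendation, and the `http://example.org/` IRIs and the sample data are made up.

```python
import json

DOC = json.loads("""
{
  "@context": {
    "name": "http://schema.org/name",
    "homepage": {"@id": "http://schema.org/url", "@type": "@id"}
  },
  "@id": "http://example.org/people/alice",
  "name": "Alice",
  "homepage": "http://example.org/alice/"
}
""")


def expand_keys(doc):
    """Replace short keys with the IRIs given in the document's own @context.

    Simplified sketch: handles only a flat document with an inline context.
    """
    context = doc.get("@context", {})
    out = {}
    for key, value in doc.items():
        if key == "@context":
            continue
        if key.startswith("@"):
            out[key] = value  # keywords such as @id are kept as they are
            continue
        definition = context.get(key, key)
        iri = definition["@id"] if isinstance(definition, dict) else definition
        out[iri] = value
    return out


expanded = expand_keys(DOC)
print(DOC["name"])                         # a plain JSON consumer still works
print(expanded["http://schema.org/name"])  # a Linked Data consumer sees an IRI
```

Existing JSON consumers can ignore the `@context` entirely, which is why adding it to a deployed system need not break anything.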

RDF 1.1 document suite on its way to Recommendation

The RDF Working Group has published the documents of the RDF 1.1 document suite as Proposed (Edited) Recommendation. Together, these documents provide significant updates and extensions of the 2004 RDF specification. For example:

  • Multiple graphs are now part of the RDF data model.
  • Turtle is included in the standard and is as much as possible aligned with SPARQL.
  • TriG is an extension of Turtle and provides a syntax for multiple graphs. Any Turtle document is also a valid TriG document.
  • N-Triples and N-Quads are corresponding line-based exchange formats.
  • JSON-LD provides an exciting new connection between the RDF and JSON worlds.

In “What’s New in RDF 1.1” you can find a detailed description of the new and updated features. The Working Group has also published the first version of a new RDF Primer and a note on the semantics of multiple graphs. Comments very welcome!

Vocabularies at W3C

In my opening post on this blog I hinted that another would follow concerning vocabularies. Here it is.

When the Semantic Web first began, the expectation was that people would create their own vocabularies/schemas as required – it was all part of the open world (free love, do what you feel, dude) Zeitgeist. Over time, however, and with the benefit of a large measure of hindsight, it’s become clear that this is not what’s required.

The success of Linked Open Vocabularies as a central information point about vocabularies is symptomatic of a need, or at least a desire, for an authoritative reference point to aid the encoding and publication of data. This need/desire is expressed even more forcefully in the rapid success and adoption of schema.org. The large and growing set of terms in the schema.org namespace includes many established terms defined elsewhere, such as in vCard, FOAF, Good Relations and rNews. I’m delighted that Dan Brickley has indicated that schema.org will reference what one might call ‘source vocabularies’ in the near future, I hope with assertions like owl:equivalentClass, owl:equivalentProperty etc.
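
To show what such assertions would make possible, here is a small Python sketch that looks up the terms declared equivalent to a given term. The single mapping listed (schema.org’s Person to FOAF’s Person) is an assumption chosen for illustration; it is not a statement of what schema.org has published or will publish.

```python
OWL = "http://www.w3.org/2002/07/owl#"
EQUIVALENCE_PREDICATES = {OWL + "equivalentClass", OWL + "equivalentProperty"}

# Hypothetical mapping triples, for illustration only.
MAPPINGS = [
    ("http://schema.org/Person", OWL + "equivalentClass", "http://xmlns.com/foaf/0.1/Person"),
]


def equivalents(term, triples):
    """Return the terms directly declared equivalent to `term`, in either direction."""
    found = set()
    for s, p, o in triples:
        if p not in EQUIVALENCE_PREDICATES:
            continue
        if s == term:
            found.add(o)
        elif o == term:
            found.add(s)
    return found


print(equivalents("http://xmlns.com/foaf/0.1/Person", MAPPINGS))
```

Equivalence is symmetric, so the lookup checks both the subject and the object position of each mapping.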

Designed and promoted as a means of helping search engines make sense of unstructured data (i.e. text), schema.org terms are being adopted in other contexts, for example in the ADMS. The Data Activity supports the schema.org effort as an important component and we’re delighted that the partners (Google, Microsoft, Yahoo! and Yandex) develop the vocabulary through the Web Schemas Task Force, part of the W3C Semantic Web Interest Group of which Dan Brickley is chair.

But there’s a lot more to vocabularies at W3C than supporting schema.org.

First of all, we want to promote the use of our Community Group infrastructure as a place to develop and maintain vocabularies. Anyone can propose a Community Group, anyone can join. Moreover, it’s really easy for us to allocate a namespace for your vocabulary, i.e. http://www.w3.org/ns/yourVocab. That gives the outside world a promise of persistence of your terms that you can add to, clarify and, if needs be, deprecate – but not delete.

As an example, one Community Group that has recently become very active in its discussion of a vocabulary is the Locations and Addresses CG which is looking after http://www.w3.org/ns/locn, originally developed by the European Commission’s ISA Programme.

Another aspect of vocabulary development and maintenance I’m very keen to promote at W3C is the provision of multilingual labels and comments. We’ve got some good examples of this to shout about: the Data Catalog Vocabulary, DCAT, has labels in English, French, Spanish, Greek and Arabic. The Organization Ontology has long had labels in both English and French and just last week, I was able to add Italian, thanks to Antonio Maccioni and Giorgia Lodi at the Italian Digital Agency.

If you use a vocabulary hosted by W3C, whether you’re involved in its development or not, and you’re able to offer a translation of the labels, comments and usage notes, please let us know – we’ll add them.

We’re still developing our ideas on how we can best support the development and maintenance of vocabularies at W3C but the direction of travel is clear – we’re very much here to help.