Gov Open Data -- 31 Oct 2012

Hadley Beeman

Ruben

Sarah, INRIA

Olivier Berger

Bernadette

<JeniT> Florent

Sylvie

Shigeo

Armin

Jason, Dept of Internal Affairs, NZ Gov't

Bart, netage.nl

David

<oberger> hi Zakim

Ralph

Jeni

Martin, CTIC

<Ruben> this is opendata

<Ruben> list attendees

<Ruben> *not by phone, I was hoping zakim could help with attendance and queuing*

Hadley: what's open data?

Florent: making data accessible to the public
... two major targets: administration and private companies to communicate better

Hadley: other reasons gov'ts should publish?

Olivier: public money is spent on producing data
... so important that such money is used for openness

Hadley: interesting philosophical debate in the UK there

<oberger> thx Ruben

Hadley: perhaps if we can make money from the data we can minimize what gov't needs to collect via taxes

David: when we started publishing gov't data we found that data had been collected over many years
... was expensive for the gov't to collect
... was available already on the Web but in an unusable form
... e.g. CSV files, sometimes with descriptions of what the columns were
... when we started making LD out of it they found the data was dirtier than they realized

<Ruben> "dirty" => could openness also lead to clearer data?

David: but people could start making applications on top of the data because it was accessible in ways CSV files are not
... we had conversations with the agencies on the kinds of change this produces
... they don't like to hear their data is dirty

<oberger> Ruben, http://en.wikipedia.org/wiki/Linus%27_Law "given enough eyeballs, all bugs are shallow"

David: but if we can republish the data in a more useable manner that permits applications to be built that weren't practical before this allows for more use

<Ruben> +1 on enough eyeballs

Jeni: I love it when there is dirty data; this is a great opportunity for people who can recognize the dirtiness to contribute to its maintenance
... by contributing to maintenance they start to have an investment in the data
... it becomes co-owned
... improving the quality for everyone
... and you can see who your users are; they're the ones contributing back

David: I was concerned about a backlash when people saw how dirty the data was. So far that has not happened.
... people don't like to be embarrassed
... but the data has been dirty for a long time

Hadley: hospital episode statistics ...
... someone went through the data and discovered 60k males admitted for midwife services
... perhaps the data entry person was in such a hurry that the wrong key was pressed
... but the press attention caused them to look more closely at the data

Shigeo: is the problem of dirty data with the format or the accuracy?

David: accuracy

Bern: we found nuclear power plants in the middle of the ocean
... we found that people were more pleased that data they had poured heart and soul into collecting was being used
... so rather than embarrassment they started to think about lots of ways their data could be combined with other datasets
... so they became great resources to us

Olivier: if people are too enthusiastic they might publish anything, creating new problems

Bern: we didn't encounter that problem
... the folks we dealt with are in the information quality area and worked on publishing dictionaries
... the programmers are generally contractors
... the gov't people feel more in control of a LD project

David: many times data is dirty just because they can't see the problems

LarryMasinter: I went to records management conferences
... one of the principles of useable records for a long-term document is context
... archival records may be less embarrassing
... is more access to archival records a fruitful pursuit?

Jeni: UK legislation is managed by the UK National Archives
... legislation.gov.uk is one of our big sites
... all legislation back to 1267 that is still in force is available in many forms

David: the US archives believe they are write-only

Jeni: the legislation data is dirty data
... and it's precisely the opportunity to get people involved that will clean it up

@@: some agencies are afraid of the methods used to collect the data

... it might not be wrong but they don't know how they got it and can't assure its correctness
... would be good to have a process to annotate data about how it was collected and how good it might be

<oberger> traceability ?

Hadley,Bern: we agree!

scribe: it's useful to take time to explain why LD matters because it carries the context of the data

Hadley: there's risk to publish stuff they're not sure of or don't know the consequences of
... but there's also benefit to publishing with lots of caveats
... they're still sharing the data in a way it can be used in a broader ecosystem

Martin: beyond embarrassment to civil servants, the problem is politicians who may be above these civil servants
... I know one case where the IT Dashboard has been used for public procurements
... the system has been running for at least a year
... internal only; they're scared to make the data public

<oberger> "the problem is politicians"... no... won't tweet that... too easy ;)

Martin: because they have seen mistakes
... the data may be accurate but there are problems in the management of some contracts

Hadley: transparency is another motivation for LD
... some countries want to root out corruption or demonstrate lack of corruption

LD == LinkedData

[apologies to Sylvie]

Bart: the City of Amsterdam has contributed to OpenStreetMap
... Open Data makes for a more efficient gov't

Hadley: the official edict from our center is "publish what you have, no matter what form"

Larry: define the headings?

Hadley: that would help

Bern: this is a ten-year project
... gov't agencies won't all immediately start publishing LinkedData next year
... getting the data out with a variety of converters is a start
... if we begin to circulate the meme of getting more complex data out there -- too complex for CSV
... the complex datasets are good candidates for LinkedData
... we sign up for APIs, get the CSVs, convert to LinkedData
... we also talk with the agencies to tell them the benefits of publishing in RDF form themselves at an authoritative domain
... showing them their data in a killer app that combines data from multiple datasets gets them very interested
... allows us a foot in the door to describe a value proposition
... we're trying very hard to replace web portals whose backend is a relational database

David: looking at data in commercial enterprises, 5 to 15% of the data is in an RDB where it is well structured
... the rest is elsewhere in spreadsheets, email, etc.
... over the last 30 years this percentage has remained remarkably fixed
... if we can shoot for a goal to have RDF represent this amount of our total data and have this be the most useful 15%, that's probably appropriate
... focus on the bits that are the most useful

Hadley: good point
... our CSV files are useful because they have up-to-date lists of names of schools

Ruben: having representations ...
... one of the cases we should think about is that RDF is a good representation from which to generate other representations

David: it's not the format that is important in RDF -- it's the data model

Olivier: the consumer and producer need to agree on the data model

Bart: and I tell people that RDF allows people to disagree formally on the model
... as a fireman I know that there are multiple definitions for "victim"
... each service defines "victim" differently; fire, ambulance, police
... you can't put these three services in a room and get them to agree on a single definition
... so with RDF we have three types of "victim"
... and we can build on this

Hadley: so 20 years in the future when we have massive amounts of gov't data being published, will RDF still be useful?
... will we progress beyond Linked Data?

Bern: the presumption is that we're at a tipping point where data does flood the web
... once we find data the next question is how do we use it?
... right now agencies have no way to exchange data in multiple data models
... to be relevant you must be able to publish your data on the web in a way people can find and use it

Hadley: back in the GeoCities era, before Google, people made lists of web sites
... when Google came along there was no need to build catalogs anymore
... I feel LOD is currently in the 'build catalog' era

Ruben: Sindice handles some of this
... perhaps [google] indexing is an answer
... I don't think so

Larry: data rots unless it is used
... backups fail
... if people publish data and no one uses it it will get dirty
... rather than counting the data let's count the uses
... focus on the patterns where the workflow somehow validates the data
... if people are mainly using documents and you want data to travel along with them it's not clear whether we want Linked Data or embedded data
... perhaps RDFa so the data doesn't get separated
... unless you have a pattern for using the data the document and the data may flow different ways
... people update the document but forget to update the data
... so I'd consider embedded data patterns

Bern: should we use LD to describe workflows?

Larry: when thinking about gov't use of data, think about its origin, its distribution, and its use
... the single point of publication is not the important thing; it's the workflows behind it

Hadley: do we need standards for publishing data with workflows?

Larry: we have to enhance the document workflow; agencies won't hire people to create a parallel data workflow to existing document workflows

David: there is context loss
... sometimes data used within a particular gov't office is meaningless to someone outside
... what's a specific "program number"?
... the cost of publishing LD is to change the data from a very specific program-centric model to a model that has a public description and is appropriate for reuse

Olivier: thinking in an evolutionary way, perhaps if there is lots of data but only a part of it serves to provide better gov't to citizens, the parts that are only useful to one agency could be avoided
... avoid "too much data" through open feedback systems

Bart: I support that; there's a lot of duplicate administration of permits

Hadley: @@
... thanks for your insights

[adjourned]

Ralph: I'd like to have this problem of too much Linked Data
... I'm confident that when we have this problem we'll find a way to deal with it
... and Linked Data has a property that CSV lacks: people will be able to figure out what it meant rather than having to guess
