See also: IRC log
Hadley Beeman
Ruben
Sarah
scribe: INRIA
Olivier Berger
Bernadette
<JeniT> Florent
Sylvie
Shigeo
Armin
Jason
scribe: Dept of Internal Affairs, NZ Gov't
Bart
scribe: netage.nl
David
<oberger> hi Zakim
Ralph
Jeni
Martin
scribe: CTIC
<Ruben> this is opendata
<Ruben> list attendees
<Ruben> *not by phone, I was hoping zakim could help with attendance and queuing*
Hadley: what's open data?
Florent: making data accessible
to the public
... two major targets: administration and private companies to
communicate better
Hadley: other reasons gov'ts should publish?
Olivier: public money is spent on
producing data
... so important that such money is used for openness
Hadley: interesting philosophical debate in the UK there
<oberger> thx Ruben
Hadley: perhaps if we can make money from the data we can minimize what gov't needs to collect via taxes
David: when we started publishing
gov't data we found that data had been collected over many
years
... was expensive for the gov't to collect
... was available already on the Web but in an unusable
form
... e.g. CSV files, sometimes with descriptions of what the
columns were
... when we started making LD out of it they found the data was
dirtier than they realized
<Ruben> "dirty" => could openness also lead to clearer data?
David: but people could start
making applications on top of the data because it was
accessible in ways CSV files are not
... we had conversations with the agencies on the kinds of
change this produces
... they don't like to hear their data is dirty
<oberger> Ruben, http://en.wikipedia.org/wiki/Linus%27_Law "given enough eyeballs, all bugs are shallow"
David: but if we can republish the data in a more usable manner, one that permits applications to be built that weren't practical before, this allows for more use
<Ruben> +1 on enough eyeballs
Jeni: I love it when there is
dirty data; this is a great opportunity for people who can
recognize the dirtiness to contribute to its maintenance
... by contributing to maintenance they start to have an
investment in the data
... it becomes co-owned
... improving the quality for everyone
... and you can see who your users are; they're the ones
contributing back
David: I was concerned about a
backlash when people saw how dirty the data was. So far that
has not happened.
... people don't like to be embarrassed
... but the data has been dirty for a long time
Hadley: hospital episode
statistics ...
... someone went through the data and discovered 60k males
admitted for midwife services
... perhaps the data entry person was in such a hurry that the
wrong key was pressed
... but the press attention caused them to look more closely at
the data
Shigeo: is the problem of dirty data with the format or the accuracy?
David: accuracy
Bern: we found nuclear power
plants in the middle of the ocean
... we found that people rather enjoyed the fact that data they
poured heart and soul into collecting was being used
... so rather than embarrassment they started to think about
lots of ways their data could be combined with other
datasets
... so they became great resources to us
Olivier: if people are too enthusiastic they might publish anything, creating new problems
Bern: we didn't encounter that
problem
... the folks we dealt with are in the information quality area
and worked on publishing dictionaries
... the programmers are generally contractors
... the gov't people feel more in control of a LD project
David: many times data is dirty just because they can't see the problems
LarryMasinter: I went to records
management conferences
... one of the principles of usable records for a long-term
document is context
... archival records may be less embarrassing
... is more access to archival records a fruitful pursuit?
Jeni: Legislation UK managed by
UK National Archives
... legislation.gov.uk is one of our big sites
... all legislation back to 1267 that is still in force is
available in many forms
David: the US archives believe they are write-only
Jeni: the legislation data is
dirty data
... and it's precisely the opportunity to get people involved
that will clean it up
@@: some agencies are afraid of the methods used to collect the data
scribe: it might not be wrong but
they don't know how they got it and can't assure its
correctness
... would be good to have a process to annotate data about how
it was collected and how good it might be
<oberger> traceability ?
Hadley,Bern: we agree!
scribe: it's useful to take time to explain why LD matters because it carries the context of the data
Hadley: there's risk to publish
stuff they're not sure of or don't know the consequences
of
... but there's also benefit to publishing with lots of
caveats
... they're still sharing the data in a way it can be used in a
broader ecosystem
Martin: beyond embarrassment to
civil servants, the problem is politicians who may be above
these civil servants
... I know one case where the IT Dashboard has been used for
public procurements
... the system has been running for at least a year
... internal only; they're scared to make the data public
<oberger> "the problem is politicians"... no... won't tweet that... too easy ;)
Martin: because they have seen
mistakes
... the data may be accurate but there are problems in the
management of some contracts
Hadley: transparency is another
motivation for LD
... some countries who want to root out corruption or
demonstrate lack of corruption
LD == LinkedData
[apologies to Sylvie]
Bart: the City of Amsterdam has
contributed to OpenStreetMap
... Open Data makes for a more efficient gov't
Hadley: the official edict from our center is "publish what you have, no matter what form"
Larry: define the headings?
Hadley: that would help
Bern: this is a ten-year
project
... gov't agencies won't all immediately start publishing
LinkedData next year
... getting the data out with a variety of converters is a
start
... if we begin to circulate the meme of getting more complex
data out there -- too complex for CSV
... the complex datasets are good candidates for
LinkedData
... we sign up for APIs, get the CSVs, convert to
LinkedData
... we also talk with the agencies to tell them the benefits of
publishing in RDF form themselves at an authoritative
domain
... showing them their data in a killer app that combines data
from multiple datasets gets them very interested
... allows us a foot in the door to describe a value
proposition
... we're trying very hard to replace web portals whose backend
is a relational database
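A minimal sketch of the "get the CSVs, convert to LinkedData" step mentioned above, assuming a simple one-row-per-resource CSV. The base URI, vocabulary URI, column names, and sample rows are all hypothetical and chosen only for illustration; a real conversion would reuse an agreed vocabulary and an authoritative domain.

```python
import csv
import io

# Hypothetical URIs for illustration only.
BASE = "http://example.org/school/"
VOCAB = "http://example.org/vocab#"

# Invented sample data standing in for a downloaded CSV file.
SAMPLE_CSV = (
    "id,name,town\n"
    "1,Example Primary,Exampleton\n"
    "2,Sample High,Sampleford\n"
)


def csv_to_triples(text, key="id"):
    """Turn each CSV row into (subject, predicate, object) triples.

    The key column mints the subject URI; every other non-empty
    column becomes one predicate/value pair.
    """
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = BASE + row[key]
        for column, value in row.items():
            if column != key and value:
                triples.append((subject, VOCAB + column, value))
    return triples


for s, p, o in csv_to_triples(SAMPLE_CSV):
    # Print in an N-Triples-like form with plain string literals.
    print(f'<{s}> <{p}> "{o}" .')
```

The point of the sketch is only that the column headings become named properties, so the meaning of each value travels with the data instead of living in a separate description.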
David: looking at data in
commercial enterprises, 5 to 15% of the data is in an RDB where
it is well structured
... the rest is elsewhere in spreadsheets, email, etc.
... over the last 30 years this percentage has remained
remarkably fixed
... if we can shoot for a goal to have RDF represent this
amount of our total data and have this be the most useful 15%
that's probably appropriate
... focus on the bits that are the most useful
Hadley: good point
... our CSV files are useful because they have up-to-date lists
of names of schools
Ruben: having representations
...
... one of the cases we should think about is that RDF is a
good representation from which to generate other
representations
David: it's not the format that is important in RDF -- it's the data model
Olivier: the consumer and producer need to agree on the data model
Bart: and I tell people that RDF
allows people to disagree formally on the model
... as a fireman I know that there are multiple definitions for
"victim"
... each service defines "victim" differently; fire, ambulance,
police
... you can't put these three services in a room and get them
to agree on a single definition
... so with RDF we have three types of "victim"
... and we can build on this
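One way to picture the three types of "victim" is as three separate classes, each in its own namespace with its own definition. The sketch below assumes hypothetical namespaces and invented definition texts; nothing in it comes from the services' actual vocabularies. It uses plain tuples rather than an RDF library so that it stays self-contained.

```python
# Hypothetical namespaces, one per service.
FIRE = "http://example.org/fire#"
AMBULANCE = "http://example.org/ambulance#"
POLICE = "http://example.org/police#"

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"

# Each service keeps its own Victim class; the definitions are invented.
definitions = {
    FIRE + "Victim": "Person who had to be rescued at the incident.",
    AMBULANCE + "Victim": "Person who received medical treatment.",
    POLICE + "Victim": "Person harmed by a reported offence.",
}


def to_ntriples(triples):
    """Serialize triples, treating http:// objects as URIs, others as literals."""
    lines = []
    for s, p, o in triples:
        if o.startswith("http://"):
            obj = f"<{o}>"
        else:
            escaped = o.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


triples = [(cls, RDFS_COMMENT, text) for cls, text in definitions.items()]
# A person can be a victim under one definition without implying the others.
person = "http://example.org/incident/1#person-a"
triples.append((person, RDF_TYPE, FIRE + "Victim"))

print(to_ntriples(triples))
```

Because the three classes have distinct URIs, data from all three services can sit in one dataset without forcing a single shared definition.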
Hadley: so 20 years in the future
when we have massive amounts of gov't data being published,
will RDF still be useful?
... will we progress beyond Linked Data?
Bern: the presumption is that
we're at a tipping point where data does flood the web
... once we find data the next question is how do we use
it?
... right now agencies have no way to exchange data in multiple
data models
... to be relevant you must be able to publish your data on the
web in a way people can find and use it
Hadley: back in the GeoCities era
before Google people made lists of web sites
... when Google came along there was no need to build catalogs
anymore
... I feel LOD is currently in the 'build catalog' era
Ruben: Sindice handles some of
this
... perhaps [google] indexing is an answer
... I don't think so
Larry: data rots unless it is
used
... backups fail
... if people publish data and no one uses it it will get
dirty
... rather than counting the data let's count the uses
... focus on the patterns where the workflow validates somehow
the data
... if people are mainly using documents and you want data to
travel along with them it's not clear whether we want Linked
Data or embedded data
... perhaps RDFa so the data doesn't get separated
... unless you have a pattern for using the data the document
and the data may flow different ways
... people update the document but forget to update the
data
... so I'd consider embedded data patterns
Bern: should we use LD to describe workflows?
Larry: when thinking about gov't
use of data, think about its origin, its distribution, and its
use
... the single point of publication is not the important thing;
it's the workflows behind it
Hadley: do we need standards for publishing data with workflows?
Larry: we have to enhance the document workflow; agencies won't hire people to create a parallel data workflow to existing document workflows
David: there is context
loss
... sometimes data used within a particular gov't office is
meaningless to someone outside
... what's a specific "program number"?
... the cost of publishing LD is to change the data from a very
specific program-centric model to a model that has a public
description and is appropriate for reuse
Olivier: thinking in an evolutionary way, perhaps if there is lots of data but only a part of it serves to provide better gov't to citizens, the parts that are only useful to one agency could be avoided
scribe: avoid "too much data" through open feedback systems
Bart: I support that; there's a lot of duplicate administration of permits
Hadley: @@
... thanks for your insights
[adjourned]
Ralph: I'd like to have this
problem of too much Linked Data
... I'm confident that when we have this problem we'll find a
way to deal with it
... and Linked Data has the property that CSV doesn't have that
people will be able to figure out what it meant rather than
having to guess
Minutes formatted by scribe.perl Revision 1.137. Scribe (inferred): Ralph. Date from IRC log name: 31 Oct 2012. Minutes URL (guessed by scribe.perl): http://www.w3.org/2012/10/31-opendata-minutes.html