Open data priorities and engagement — identifying data sets for publication: Report
Executive Summary
The third Share-PSI workshop was held in Romania, hosted by the West University of Timişoara. The event showed an evolution from previous workshops in the series:
- Samos was largely a traditional paper-presentation event.
- Lisbon was much more interactive with fewer presentations and more discussions.
- Timişoara, outwardly, was similar to Lisbon, but there was a much stronger focus on eliciting best practices that the project partners could codify and link to the PSI Directive.
Almost all sessions included discussion of best practices that followed from the shared experiences. These were:
- Engage a broad community, including technical and non-technical people, in planning and executing open data policies.
- Publishers should encourage and facilitate consumers' reporting of usage of data to encourage its continued provision.
- Publishers should encourage and facilitate consumers' corrections to data, using tools such as GitHub, gamification techniques etc.
- Publishers should clearly define their role in providing PSI whilst encouraging others to build on it.
- Publishers should provide a knowledge base as part of a data portal.
- Publishers should focus on providing services such as mixing and visualisation as much as data.
- Machine translation technologies should be harnessed to offer data in multiple languages.
- As a minimum, data should be available for bulk download.
- Publishers should be explicit about the rights that consumers have in accessing and re-using data.
- Publishers should unambiguously express and communicate the quality level of their data.
- Datasets must refer to locations consistently, following international standards.
- Public authorities should use common criteria for assessing the impact of their PSI provision.
- Inventories of available information, whether open or closed, should be generated through scraping of public authorities' websites.
- Public authorities should follow the techniques developed in research publishing to link reports and studies with the underlying data.
Introduction
The third Share-PSI workshop took the theme of “Open Data Priorities and Engagement — Identifying data sets for publication” and was hosted by West University of Timişoara. Following on from the success of events in Samos and Lisbon, the Timişoara workshop comprised a series of facilitated discussions with only a small number of presentation-based sessions. The aim of the project overall is to identify what works and what doesn't work as the public sector across Europe implements open data policies in the context of the revised PSI Directive. Recurring topics were the impact of comparative studies and indices, the importance of gathering user feedback, publishers' desire to know who is using their data and for what, and the observation that the demand from citizens is not for data but for services that may be built on that data.
A total of 83 people registered for the workshop, including participants from Serbia, Poland, Croatia, Bulgaria, the Czech Republic and, of course, Romania – countries that had been absent or under-represented at previous Share-PSI events. The sessions in the main hall were streamed and recorded and the event generated a good amount of buzz on Twitter.
2.1 Plenary Talks
After a welcome from Dr Marilen Pirtea, Rector of West University of Timişoara, the Share-PSI partners were honoured to be joined by the Secretary of State from the Chancellery of the Romanian Prime Minister, Radu Puchiu. During his remarks, he described the hackathons organised to publicise the datasets available on the national portal (data.gov.ro) and to engage the wider community - something that is seen as crucial. Mr Puchiu used the workshop to announce that a new platform for public procurement will soon be established and its data will be exported and made available in standard formats. This and his comments on the Global Open Data Index proved highly relevant to later sessions.
The Czech Ministry of Finance's Benedikt Kotmel presented the situation in his country. His ministry is the first to establish an open data portal and five other ministries are following suit. This is in addition to portal.gov.cz, which is run by the Ministry of the Interior. Interoperability of the various catalogues therefore depends on inter-departmental communication, a situation made more complicated by government ministers being drawn from more than one political party. An internal directive was essential to begin the conversation. Mr Kotmel described the demand analysis carried out prior to releasing any data. This drew on several sources:
- FOI requests;
- Universities that were approached directly;
- Non-profit organisations;
- Private sector companies.
The latter proved hard to engage but the other sources showed which data sets were most wanted. Further work was then done to see how feasible it was – technically and legally – to release those datasets, before seeking sign-off by the minister. As with Romania, the importance of gathering feedback was emphasised. The result is that the Ministry of Finance's data catalogue is very well used, even though the number of datasets is a modest 25.
There was agreement between the Romanian and Czech speakers that a mechanism is needed to ensure continuous publication of data, including updates, and that this requires a change in culture. But there is some understandable frustration. First of all it is hard to know who is using published data and what for – not knowing this makes it hard to see an end result and to develop that culture change. Promoting open data to the tech community doesn't reach the citizens who have no interest in data but who might be interested in services. It is also frustrating to receive requests for data from people who are unaware that what they have asked for is already freely available.
The second day of the workshop began with a presentation of the situation in Poland from Jacek Wolszczak of the Ministry of Administration and Digitization. An important distinction for that country is between access and re-use. Individuals need to identify themselves when requesting data and access is free of charge, but the re-use may or may not be free of charge. This puts Poland outside the usual definition of open data but it is within the PSI Directive. One eye-catching idea was that of a knowledge base associated with the data portal, that is, documents designed for humans to read rather than data for machines to process. The workshop felt that the provision of such a knowledge base could be regarded as a best practice. Indeed, participants emphasised that Public Sector Information includes such documents, not all of which would be thought of as 'open data' – there is an overlap, but there is also a difference. As with other speakers, Mr Wolszczak was keen to highlight the importance of receiving and acting on community feedback. The feedback received will be published in full.
Branislav Dobrosavljevic presented the work of the Serbian Business Registers Agency. It operates as a one-stop shop for many different and long-established registers in Serbia and offers a range of services for internal and external users. The focus is very much on services rather than data. Such provision of eServices in Serbia is new as, until recently, the law demanded stamps on paper as part of the processes. Core services are now available from apr.gov.rs, which includes free access to information about registered companies, although this is mediated via a search box with no option to download the data in bulk. The site is also not amenable to scraping and so the Serbian register is not available, for example, via OpenCorporates.
Mr Dobrosavljevic extended an invitation to collaborate internationally to ensure better interoperability, and ended by emphasising some key points:
- keep it simple;
- focus on front end services;
- have complete control over services, selling enriched data, not raw data;
- create a consistent market of data so that third party companies are able to run profitable services of their own;
- activities need to be covered by legislation, which is more important than technology.
The European Commission's Szymon Lewandowski gave an update on developments at DG CONNECT. The main issues arising from the revised PSI Directive concern requests for clarification of details, including details of the original Directive, and the effect of the revision on running contracts. The launch of the Open Data Incubator (ODINE) and the new contract for the publicdata.eu portal are seen as important components of Europe's big data infrastructure. The latter is foreseen as a single gateway to reusable information with the aim of enabling the combination and visualisation of information held by various open data portals at various levels throughout the EU. The new portal will be a focus for services around open data and include a dedicated service infrastructure for language resources in order to facilitate multilingual access. This point was picked up by the LIDER project, which took part in a Share-PSI workshop for the second time. Mr Lewandowski reported that machine translation technologies will be included in the first version of the new portal at its launch in November 2015 at the European Data Forum although, of course, they will need testing and further development.
A hope for the new portal is that it will include wizards to guide publishers through licensing issues and allow users to combine different datasets with different licences. This implies that licences need to be at least partially machine readable.
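As a purely illustrative sketch of the kind of logic such a wizard might encode – the licence identifiers and compatibility rules below are hypothetical simplifications, not an official matrix – partially machine-readable licence descriptions could be combined along these lines:

```python
# Toy licence-compatibility check. The identifiers and rules are illustrative
# examples only; a real wizard would rely on an authoritative, curated matrix.

# Minimal machine-readable description of a few licences: whether they permit
# commercial use and whether derived works must be shared under the same terms.
LICENCES = {
    "CC0-1.0":      {"commercial": True,  "share_alike": False},
    "CC-BY-4.0":    {"commercial": True,  "share_alike": False},
    "CC-BY-SA-4.0": {"commercial": True,  "share_alike": True},
    "CC-BY-NC-4.0": {"commercial": False, "share_alike": False},
}

def combined_terms(licence_ids):
    """Return the most restrictive terms applying when datasets under the given
    licences are combined, or None if the combination is problematic."""
    # Two different share-alike licences generally cannot be combined.
    share_alike_licences = {l for l in licence_ids if LICENCES[l]["share_alike"]}
    if len(share_alike_licences) > 1:
        return None
    terms = [LICENCES[l] for l in licence_ids]
    return {
        "commercial": all(t["commercial"] for t in terms),
        "share_alike": any(t["share_alike"] for t in terms),
    }

if __name__ == "__main__":
    print(combined_terms(["CC-BY-4.0", "CC-BY-NC-4.0"]))
    # -> {'commercial': False, 'share_alike': False}
```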
The final plenary presentation was by Nicolas Loozen of PwC, who carried out some work under the ISA Programme looking at the prioritisation of datasets for publication. That work suggested a series of factors to take into account.
| The data owner's perspective | |
|---|---|
| Transparency | Does the publication of the dataset increase transparency and openness of the government towards its citizens? |
| Legal obligation | Is there a law that makes open publication mandatory, or is there no specific obligation? |
| Relation to the public task | Is the data the direct result of the primary public task of government, or is it a product of a non-essential activity? |
| Cost reduction | The availability and re-use of a dataset eliminates the need for duplication of data and effort, which reduces costs and increases interoperability. |

| A re-user's perspective | |
|---|---|
| Target audience | In terms of size and dynamics. |
| Systems & services | The number of new and existing uses of the data. |
These factors were applied to the European Commission's Tenders Electronic Daily service (TED), which meets all the criteria, and followed up by interviews and a questionnaire with TED users.
That work reinforced the comments made by other plenary speakers and throughout the workshop, that user engagement is an essential aspect of PSI provision. Mr Loozen further raised the issue of collaborative tools – a recurring theme in other sessions.
Data Quality
One of the best-attended sessions of the workshop was led by Makx Dekkers. His work with the ISA Programme in the Open Data Support project identified 9 dimensions of quality that might be applied to data.
- Accuracy: Does the data correctly represent the real-world entity or event?
- Consistency: Is the data free of contradictions?
- Availability: Can the data be accessed now and over time?
- Completeness: Does the data include all data items representing the entity or event?
- Conformance: Does the data follow accepted standards?
- Credibility: Is the data based on trustworthy sources?
- Processability: Is the data machine-readable?
- Relevance: Does the data include an appropriate amount of data for the purpose at hand?
- Timeliness: Does the data represent the actual situation, and is it published soon enough?
This sparked a good deal of debate (captured more or less fully in the raw notes for this session). One suggestion that found favour was the addition of context, that is, the reason the data was collected in the first place. Other topics for discussion were:
- the usefulness of the 5 Stars of Linked Open Data scheme as a measure of processability (general agreement that it is useful);
- whether the method by which the data was collected is relevant as a measure of quality (again, yes);
- the usefulness of schemes like the ODI's Certificates (useful);
- whether an indication that access is open or restricted, or of differences between access and re-use, are part of a quality assessment (dubious).
The topics discussed in this session could perhaps be the basis of a full two day workshop but the interim conclusion was that publishers should unambiguously express and communicate the quality level of their data. This allows potential users to make informed decisions on whether and how to re-use the data. A standard set of terms should be developed… which is being done in the W3C Data on the Web Best Practices Working Group of which Mr Dekkers is an active member.
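As an illustration only – the field names below simply mirror the nine dimensions discussed above and are not the standard vocabulary being developed by the working group – a publisher could attach a minimal, structured quality statement to each dataset, for example:

```python
# A minimal sketch, assuming a publisher wants to ship a simple, structured
# quality statement alongside a dataset's metadata. The vocabulary and example
# values are hypothetical, not the W3C terms under development.
from dataclasses import dataclass, asdict
import json

@dataclass
class QualityStatement:
    accuracy: str        # e.g. how the data was validated against its source
    consistency: str
    availability: str
    completeness: str
    conformance: str     # standards followed, e.g. "dates in ISO 8601"
    credibility: str
    processability: str  # e.g. "machine-readable CSV bulk download"
    relevance: str
    timeliness: str      # e.g. update frequency and date of last update

statement = QualityStatement(
    accuracy="spot-checked against paper records",
    consistency="no contradictory entries detected by automated checks",
    availability="bulk download at a stable URL",
    completeness="covers all contracts above EUR 1 million",
    conformance="CSV per RFC 4180, dates in ISO 8601",
    credibility="published directly by the contracting authority",
    processability="machine-readable CSV",
    relevance="contract-level records only, no aggregates",
    timeliness="updated monthly",
)

# Publish alongside the dataset, e.g. as quality.json in the same package.
print(json.dumps(asdict(statement), indent=2))
```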
A specific aspect of quality and re-usability – how to identify locations – was discussed in the session Free our Maps, led by Vasile Crăciunescu and Codrina Maria Ilie of the Technical University of Civil Engineering Bucharest. Licensing is an issue for OpenStreetMap usage, and few national mapping agencies make their data available for free. Many of the issues raised, such as the choice of spatial vocabulary and how best to represent spatial objects in RDF, are being addressed in the Spatial Data on the Web Working Group, in which W3C is collaborating with the Open Geospatial Consortium to produce joint standards, including a best practice guide.
The session concluded that there is a need for a reference data set that covers at least the whole of Europe so that locations can be referred to consistently. The suggestion was that this should be created by the INSPIRE community as an authoritative dataset. The alternative is to use OSM and/or Geonames but this raises issues of quality, authority and licensing.
Collaboration
Valentina Dimulescu of the Romanian Academic Society led a session discussing the Romanian Electronic Public Procurement System (SEAP). As announced by Radu Puchiu, this is being replaced by a new system, but the discussion raised several issues. As part of a research project conducted under the FP7-funded ANTICORRP project, the Romanian Academic Society (SAR) gathered contract-level procurement data by downloading the needed information directly from the SEAP server (contracts over 1 million Euro in the construction sector) and compiling it into a consistent database. The current system has a series of accessibility barriers, such as the need to enter a CAPTCHA code at every query step (searches, switching from one page to another or viewing notices), search functions based on one NACE code at a time (so that broader procurement sectors cannot be investigated by comparing across codes), and the ability to view each procurement award notice only separately (so that statistics, rankings or comparisons cannot be compiled). A bulk download is available from the government open data portal but the downloadable data is corrupted and, therefore, largely unusable. The two main problems identified are the lack of uniformity of the CSV files available for download and the fact that the errata or corrections made to the award notices (containing important information such as contract number, company name, contract value, etc.) were not applied to the machine-readable dataset.
In addition to the electronic procurement database, SAR created a smaller database containing contract-level information at the regional and local level which it retrieved via FOIA requests sent to local contracting authorities active in the road infrastructure sector so as to gather information not present in SEAP. Also, information was gathered from the Official Gazette of Romania on whether winning firms were political party donors or whether they received favourable treatment according to national and local media. Furthermore, statements of interests and assets of individuals in charge of procurement boards were examined, as well as Trade Register data (profit, turnover) on the companies belonging to Top 100 businessmen and on the Top 60 construction sector companies.
Some of the results point to the fact that a little more than a quarter of single-bidding contracts went to "favoured" firms and that privileged Romanian firms tend to be the main winners of contracts financed by the state budget. The investigation of the statements of interests and assets showed that some procurement board presidents were also members or presidents of the shareholders' general assembly of the so-called Roads and Bridges Companies (local construction companies which are fully or partially owned by county councils). One should not automatically consider the presence of board members in these companies' general assemblies as an indication of corrupt procurement procedures. Awarding contracts to such companies can also have a social underpinning since local jobs and economies depend on their existence. The data gathered points to the fact that direct links between companies and award givers present in procurement boards are not pervasive and apparent, but that companies more frequently rely on informal networks with the procurement board members or the county council president or prefect, who receives kickbacks from the winning companies.
By contrast in Albania, this information is available and those correlations are clear to see. It's also possible to correlate procurement contracts with companies that make donations to Albanian political parties.
It is important to note that Romanian civil society has underlined that corruption also takes place after the contract has been awarded. There are monitoring instruments and overlapping supervisory state bodies that are very active in the pre-contracting phase. A much more complete monitoring mechanism, and datasets covering the whole process, would be needed to assess a contract's correct implementation in real time.
One of the most important challenges was the data correctness aspect. During the research phase, the SAR team had to make many corrections to the database but it wasn't possible to feed those corrections back to the government open data portal. The reason was that many corrections were made by hand because of their complexity and time constraints. Therefore, the whole database would have to have been fed back to the portal. The two main difficulties identified were that the extracted text fields had a diverse structure depending on the type of notice (the same type of notice may have additional text fields) and that the errata/corrections were not standardized (different punctuation and formatting). Additionally, there is a structural difference between the download service SEAP offers and the public interface since some information is only present on the latter.
The CKAN software used by the majority of data portals doesn't have a mechanism for providing cleaned-up versions of datasets. It was suggested that contributing a software module to CKAN to support this might be a good future project.
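As a rough sketch of what such a module might build on, pending a proper CKAN extension, a cleaned-up file can already be attached to an existing dataset through the CKAN Action API, shown here via the ckanapi client; the portal URL, API key, dataset id and file name are placeholders:

```python
# Illustrative sketch: publish a community-cleaned CSV as an additional
# resource on an existing CKAN dataset. All identifiers below are placeholders.
from ckanapi import RemoteCKAN

ckan = RemoteCKAN("https://example-data-portal.example", apikey="YOUR-API-KEY")

with open("procurement-awards-cleaned.csv", "rb") as f:
    ckan.action.resource_create(
        package_id="public-procurement-awards",   # existing dataset on the portal
        name="Cleaned award notices (community-contributed)",
        format="CSV",
        description="Errata applied and text fields normalised; see change log.",
        upload=f,
    )
```

A dedicated module would go further than this, for example linking the cleaned resource to the original, recording provenance and surfacing the corrections for review by the publisher.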
An alternative approach is adopted by the City of Chicago where some datasets are published on GitHub. This was presented by Peter Krantz in his session on crowd sourcing and was seen as an excellent method of gathering corrections and engaging the community.
Crowd sourcing is a prime example of community engagement – the community that wants the data helps to create and manage the data. The problems are usually legal; for example, the person behind a project to crowd source Swedish post codes quickly received a cease and desist notice. However, Chicago provides an example where crowd sourcing complements official sources to the benefit of all.
The discussion around crowd sourcing led to some concrete proposals for best practices.
- Identify the need first and then seek groups able to support solving that need via crowd sourcing.
- Think of crowd sourcing as another tool to create/improve data sets and think about the phases of your data collection project and where crowd sourcing could best fit in.
- Involve stakeholders who could benefit from a free source of certain data sets and have them provide funding in order to sustain crowd sourcing efforts.
- Minimise the size of each task.
- Use a gamification approach.
- Consider using crowdsourcing without the users' knowledge, for example by using CAPTCHA systems.
The last of these is well known in the cultural heritage community where CAPTCHAs can be used to gather human reading of text from scanned documents that OCR software cannot read.
Many of these ideas were reflected and emphasised in the session Raising awareness and engaging citizens in re-using PSI, led by Daniel Pop of West University of Timişoara and Yannis Charalabidis of the University of the Aegean. Engaging end users is essential to ensure that the data made available is the data people want and that it is worth the effort of publishing. The point was made again that end users are not interested in data – but they might be interested in data-driven services, and public authorities need to know if someone is going to do something with the data to justify the effort made.
Engaging citizens requires effort – it is a job in itself to reach out to different members of the community and to respond to requests. One method of doing this that was highlighted is the Karlsruhe City Wiki, which is run entirely by the community. Drawing on work done under the Engage Project, Professor Charalabidis offered 5 ways for a community manager to engage users (these have been implemented in the Engage Platform):
- Provide a home - offer citizens/users the ability to create a profile and log in via social media.
- Create an open data marketplace – citizens can put in a request that is public for everyone to see (this draws on gamification principles).
- Allow users to be publishers - allow for upload of datasets by users.
- Allow working on datasets, i.e. make users curators.
- Provide incentives such as:
- publishing the popularity of the users;
- free tickets to community events, free parking etc.
- 'Datathons' (longer competitions);
- data journalism competitions.
It was noted during the session in Timişoara that data journalism should be seen as another 'tool' to raise awareness among citizens; storytelling is far more interesting than numbers or charts.
Indices
Crowd sourcing is the technique used both by Open Knowledge to create the Global Open Data Index and by the ePSI Platform to create the PSI Scoreboard. As Emma Beer from Open Knowledge and Martin Alvarez from the ePSI Platform described in their session How benchmarking tools can stimulate government departments to open up their data, data submitted by volunteers is curated and reviewed in a documented process. Common problems were helping contributors understand the questions they are tasked to answer and, subsequently, generating publicity. One aspect of the former problem is multilingualism: some of the data received is translated using Google Translate – a very manual process. On the second point, 'UK still top of the table' is not a news story, although France's position as the 'most improved' did generate a lot of coverage.
Most of the discussion focussed on the impact that the Index and Scoreboard have. In his opening remarks, Radu Puchiu said that Romania had been pleased to be ranked joint 16th alongside the Netherlands and Iceland and had a target to be in the top 10 in 2015. That's a clear case where the Index is having a positive effect. Silviu Vert of Open Knowledge Romania said that being able to show that the openness of budget and spending data is internationally benchmarked helps make the case against the usual excuses for not publishing. Anne Kauhanen-Simanainen from the Finnish Ministry of Finance said that they were considering what indicators they should use to measure the impact of their open data policies. It was suggested that they look carefully at the indicators used in the Global Index, the PSI Scoreboard and the Web Foundation's Barometer so that comparisons could be made easily.
If success in the Global Index is helpful, and if the EC uses the PSI Scoreboard to measure progress across Europe, what of countries at the bottom of the list? Martin Alvarez reported that his attempts to contact governments at the bottom of the Scoreboard had been unsuccessful; with the exception of some individuals, the governments seem simply not to care. The low score given to Belgium threatened the continuation of the excellent work done in Flanders since regional efforts are, perhaps unfairly, not reflected in the indices.
One way to tackle this, and to address the 'no change is not news' story, would be to increase the number of available comparisons. In particular, countries often judge themselves against their geographical or cultural neighbours more than more distant territories. It's also worth highlighting specific areas in which countries do well. Greece, for example, is one of only a handful of countries to make its spending data available at transaction level and is a clear leader in this regard.
New Discoveries
There were several sessions in Timişoara that, in one way or another, tackled the issue of data discovery.
In his session, Site scraping techniques to identify and showcase information in closed formats - How do organisations find out what they already publish?, Peter Winstanley of the Scottish Government considered the large amounts of data published within documents and websites designed for human readership. In the same way that search engines are able to make sense of unstructured web pages (to a greater or lesser degree), scraping can be used to create at least an inventory of what an organisation has. The Scottish Government Data Labs provides an example of this. It dynamically scrapes the organisation's website to generate lists of various types of document, including keywords etc.
Similar exercises have been carried out elsewhere but the participants agreed that this was only a first step. Such lists don't include licence data, for example, and the manual effort may still be substantial. Martin Alvarez used FOCA to create an inventory of data although that inventory isn't available publicly. It was a way to show the public authorities what they already had. It also showed the value in publishing structured metadata for published documents. In all cases, generating lists through site scraping must be seen as a first step or staging area and not as a substitute for publishing datasets explicitly.
The session on scraping concluded with some concrete proposals for how to proceed:
- Identify the information assets that are already published on the website by the institution, e.g. by scraping, harvesting, crawling.
- Identify how the information assets are published (closed formats, open formats), e.g. extracting information from the header, extracting information from RDF representations.
- Surface the retrieved information in a user interface to create a "staging area".
- Use the staging area to pre-fill the production-ready catalogue, and to identify and monitor progress on information assets that need improvement before they are added to it.
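A minimal sketch of the first step above, assuming a single page is crawled with the common requests and BeautifulSoup libraries; the URL is a placeholder, and a real inventory would crawl the whole site, respect robots.txt and capture far richer metadata:

```python
# Illustrative sketch: build an inventory of documents linked from one page of
# a public authority's website, grouped by file format.
from collections import defaultdict
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

DOCUMENT_EXTENSIONS = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".csv", ".ods"}

def inventory(page_url):
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    found = defaultdict(set)
    for link in soup.find_all("a", href=True):
        url = urljoin(page_url, link["href"])
        path = urlparse(url).path.lower()
        for ext in DOCUMENT_EXTENSIONS:
            if path.endswith(ext):
                found[ext].add(url)
    return found

if __name__ == "__main__":
    # Placeholder URL for illustration only.
    for ext, urls in inventory("https://www.example.gov/publications").items():
        print(f"{ext}: {len(urls)} documents")
```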
Two sessions looked at storage and discovery of scientific research data. The session led by Tamás Gyulai of the Regional Innovation Agency in Szeged, The Role of Open Data in Research Institutions with International Significance, and Robert Ulrich's bar camp on re3data.org both considered similar issues. Researchers are being encouraged, in some cases forced, to publish the data that underpins their work. In some disciplines, such as astronomy and biochemistry, this is already part of the culture but in others, such as the social sciences, publishing data goes against the prevailing culture. There is no separate finance available for publishing data and careers are built on published papers, not published datasets – at least, that is the situation today. Changing the landscape so that the incentives and rewards for publishing data are equal to those for publishing papers will be an important change in the culture among researchers in all disciplines.
One engine of change is that funders are increasingly asking for descriptions of how researchers plan to publish their data to be included in the proposal/grant application. Hungary's SZTAKI is experimenting with journal publications that include the paper, the data and the algorithm used so that experiments can be re-run and results reproduced.
Increasing the number of people with the skills necessary to publish research data in a re-usable manner must be an important target within education and policy. Alongside this, infrastructures for publishing and archiving research data need to be established.
re3data.org is an effort to provide information about the rapidly growing number of data repositories. It publishes information about more than 1,000 such repositories, making it easier for researchers to identify a repository suitable for their own work. Initially established through a collaboration of several German institutions, re3data.org will be managed as part of DataCite by the end of 2015. That organisation (DataCite) is part of the ecosystem around journals, researchers and citations that is already established among large sections of the academic community.
Legal Matters
The Open Science Link project is also working on new models for publishing scientific information including papers and their associated data. As part of this project, Freyja van den Boom of KU Leuven led a session discussing the European Database Directive. There is no copyright on facts and so collections of facts – databases – are not protected. However, the 1996 Database Directive recognises the investment necessary to create databases; these are the sui generis rights. The problem is that the Directive is applied inconsistently across the EU. In one case in Germany, the fact that 40 people were employed to maintain the database was sufficient evidence that the investment was substantial; in another case, 500 workers was not.
The situation is very unclear, especially in relation to access versus bulk download, individual versus institutional re-use, state-funded but privately created data and so on. The Database Directive makes no distinction between publicly owned and privately owned data so there are some cases where the PSI Directive overrides the Database Directive (this is true if the public authority owns the sui generis rights). Georg Hittmair of Compass described how his company had created an electronic business register from paper records over many years. When the Austrian government created its own electronic register in 1999, Compass copied that data (which it already had; this was simply a different way of accessing it). The Database Directive meant that Compass was no longer able to resell the data. The case went all the way to the European Court and was decided in the Austrian government's favour.
In Finland, if a public body owns a database and makes it available for free, without requiring registration and even anonymously, then it has effectively given permission to re-use the data. This is seen by many as an effective 'right to scrape' in Finland.
The conclusion of the session was that harmonisation is necessary across the EU, both in how the Database Directive is implemented and the relationship with the PSI Directive. In the meantime, agreement on licensing would be helpful, and in that regard, Creative Commons licences are a good option. Machine readable licences and rights statements would also help.
The topic of machine readable licences came up yet again in the session on multilingual data. An aspect of data is the language in which it is expressed and multilingualism must be a part of any European data infrastructure. Felix Sasaki presented his work in the LIDER project and advocated the use of Linguistic Linked Data as a bridge to reach a global audience. Agreed vocabularies, standardised APIs and links to other resources are all important building blocks but an outcome of LIDER will be Linguistic Linked Licensed Data (3LD). This is designed to produce language resources using standard data models along with something called for repeatedly throughout the Timişoara workshop – machine readable licences.
Taken together, these techniques can lead to making data available in the language in which the potential users want it, even if it is not the original language.
Conclusion
The event in Timişoara was successful in engaging stakeholders in the PSI and open data landscape from across Europe, including countries that are often under-represented in these discussions. The city of Timişoara is a leader in this field within Romania, and the participation of the country's top civil servant responsible for open data policy validated the choice of location.
The main conclusions of the workshop were:
- Citizen engagement is essential in creating an ecosystem around PSI publication and use.
- The technical community is important, but not the only important community with which to engage.
- For end users, services are what counts, not data.
- The release of further data is best incentivised by seeing use of what's already available.
- The legal landscape around databases, crowd sourcing, access and re-use is far from clear.
- There is a need for machine readable licences/rights statements.
- Users of data should be empowered to curate and correct datasets to the benefit of all.
- There is a need to describe the quality of data in a consistent manner if potential consumers are to make informed choices.