Timisoara/Paper Reviews

From Share-PSI EC Project
Good practices for identifying high value datasets and engaging with re-users: the case of public tendering data

This paper proposes good practices for prioritising datasets for publication and engaging with re-users, and is highly relevant to the project's aim to ‘define best practices for assessing the priority order for releasing datasets and receiving and acting on feedback from the user community’. The paper also suggests working towards a common understanding of what ‘high quality’ means in this context from the point of view of both data owners and data re-users. Understanding both perspectives is essential for prioritising the release of datasets. This paper will significantly contribute to our knowledge of good practices in this area.

The presentation proposal on identifying high-value datasets and engaging with re-users fits the workshop very well. The authors have experience in both identifying important datasets and engaging data re-users.


  1. Measuring the value of open data for re-users is a challenge, since the data is open for everyone to use and hence we might not know who these re-users are or what they might want. The authors address this in the section on engaging with re-users, but it remains a difficult challenge.
  2. The Reusability section refers to the 5-star criteria by Tim Berners-Lee. The authors advocate at least 3-star datasets, but what about the remaining levels (4-star, 5-star), which are not mentioned here? While the datasets we are looking at might not be published in RDF as Linked Data, the ability to link and relate a dataset to other datasets is important for reusability. It would be good to have dataset linkability as one of the reusability criteria.
  3. The abstract is too long and contains information that would be more appropriate for the introductory sections.
  4. The paper (in PDF) contains a number of hyperlinks with text such as "social media", "5-star schema" and "thresholds". Relevant hyperlink URLs are not visible when people print out the PDF. Please make URLs visible by adding them as footnotes or including these resources as references.
  5. The paper mentions "14 recommendations and the functional requirements for the future TED service". If this information is publicly available, it would be good to have a link or reference to it.

File:Share-PSI Submission Paper-PwC v0.03.pdf

Crowd sourcing alternatives to government data – how should governments respond?

The paper is aligned with the objectives of the WS: its goal is to initiate discussion about the relations between governments, companies that offer services on top of open data, and end users. Essentially, the session discussion will answer the question ‘How can/should public authorities respond to community efforts to crowd source data that replicates official data that is not open?’ The paper will raise questions about the completeness and truthfulness of the crowdsourced data and the reaction of the government to emerging commercial services.

The paper addresses the re-use of PSI in the PSI Directive documents, so the discussion notes can be used for writing SHARE-PSI Best practices.

This paper is clearly very much on target for the workshop. It's phrased in a way that we can hope will elicit good conversation and ideas. The ePSI Platform article referred to cites three other cases where legal means have been used to prevent address information being crowd sourced. What is not immediately obvious therefore – and this is no fault of the paper – is what could the best practices possibly be? Don't send cease and desist letters? Can something like a Lat/Long pair be used instead of an address and would that be useful? If there really is no way to solve this, might best practice be for governments to make a statement about crowd sourcing of protected datasets?


Site scraping techniques to identify and showcase information in closed formats - How do organisations find out what they already publish?

Relevance to the Timisoara Workshop: The paper is relevant to the workshop


The session proposal addresses a problem that many public sector bodies face when identifying datasets to open up: where to start with the initiative, and what data and information they already publish on their websites. It is proposed to discuss the scraping of websites as a technique that might help to quickly identify what information assets are currently published. I can confirm from my own experience that an analysis of the contents of the website is one of the first steps that we recommend public sector bodies take when initiating an Open Data initiative. We have never used a scraper, but it sounds like a useful technique that can really help when a public sector body (or any kind of entity) starts identifying and prioritising datasets for opening up. Therefore the proposal fits the aims of the workshop.
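As a purely illustrative sketch (not from the paper; the class name, sample HTML and file types are invented), such a scraper could be as simple as tallying the file extensions of the links on a page to see which formats an organisation already publishes:

```python
# Illustrative sketch: count the file formats an organisation links to
# on a page, as a first step towards finding out what it already
# publishes. Uses only the Python standard library.
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse
import os

class LinkFormatCounter(HTMLParser):
    """Collects the file extensions of all <a href="..."> targets."""
    def __init__(self):
        super().__init__()
        self.formats = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        ext = os.path.splitext(urlparse(href).path)[1].lower()
        if ext:
            self.formats[ext] += 1

# Hypothetical sample page standing in for a fetched public-sector site.
sample_html = """
<a href="/reports/budget-2014.pdf">Budget</a>
<a href="/data/contracts.csv">Contracts</a>
<a href="/data/contracts-2013.csv">Contracts 2013</a>
<a href="/about">About</a>
"""
parser = LinkFormatCounter()
parser.feed(sample_html)
print(dict(parser.formats))  # e.g. {'.pdf': 1, '.csv': 2}
```

A real crawl would of course fetch many pages and, as the reviewers note below, the bottleneck then lies in analysing the results rather than in the scraping itself.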

Comments: There are three questions proposed for the session, each of which can be turned into a practice. To help develop these practices, the authors should consider adding indicative answers to the three workshop questions.

Overall impression: great idea for a workshop. The approach is a valid contribution to the process of identifying datasets which are suitable for publication as open data. Additions to the issues to be discussed:

  • The suggested site scraping poses issues which should be further elaborated. The paper asks whether the scraping is generalizable; I would also ask where the bottleneck in this approach lies: not so much in the scraping itself as in the analysis of the results.
  • What other quantitative or qualitative approaches could complement the site scraping? (Think: what could a data identification strategy look like?)
  • Scraping is a crutch, as the author himself says. How could publication processes and CMSs be improved to make dataset identification easier?
  • What kind of know-how do public bodies need to engage in the described exercise?
  • How can the scraped datasets be further ranked in terms of an open data release plan?
  • Could the scraping be undertaken by the community?

File:SCOT timisoara.pdf

How benchmarking tools can stimulate government departments to open up their data

The Open Knowledge Index is an important metric for a lot of governments and activists so the topic will be of significant interest. However, from the point of view of the workshop it's important that the Index is presented as an outcome of following best practice, an incentive to do the right thing. It would be interesting to see if there are any relatively easy to follow best practices that have a significant effect on an index score. Are there any silly mistakes that trip governments up, i.e. simple things to avoid doing? If it's all 'big stuff' then it may appear daunting.

The session proposal is closely related to the objectives of the workshop. The presentation of the results of the Index 2014 may be a good input to help to define what data sets have been more successful and the reason, which ones have been less successful and why, etc.

The title itself refers to benchmarking tools. What kind of tools do you mean? Are there well-defined indicators that show the success of open data initiatives? If they exist, they should be discussed in this session.

Several questions raised in the workshop can have clear answers in this session.

Google Doc

How good is good enough

The proposal is well on target, addresses our three focus questions (the Xs), and reads well in general. Go for it.

The topic is very useful for developing standards, but doesn't seem to focus on either dataset priorities or engagement, which can be seen as the beginning and end points of publishing open data. As the workshop is described around these two topics, is there scope to add elements of them to the paper? For example, the paper discusses quality as quite an internal process; perhaps it could suggest a discussion of how re-users could feed their own input back into datasets and crowd-source quality improvements?

The paper doesn't reference either the five stars of open data or the open data certificates which can be seen as useful resources and tools to guide open data publishers into providing higher quality information about their dataset. I think that the information on a catalogue of quality dimensions provides helpful guidance as to potential quality metrics, but the majority of these do feature in ODCs. If they don't, perhaps the group could think about what questions are missing from the certification process to address any gaps.

I'd suggest that the session teases out whether the quality measures would vary depending on the datasets. The publishers? Whether the dataset is from the private or public sector? Do these quality metrics fit all? Do we need to consider things differently for, say, geographic data?

Is there something missing here about the impact/benefits of quality? I can predict a discussion forming in this group about 'why' a publisher should put in lots of effort to produce high-quality, freely available data. What's the payoff for them? Can we think of any practical examples where releasing data as open data has led to an improvement in the quality of a dataset? Can we think of any examples where crowd sourcing / community involvement has led to the improvement of a dataset?

I think it's a great session, and it will be beneficial to the work being done to consider and develop standards in this area.

File:AMI proposal Share-PSI Timisoara How good is good enough.pdf

Raising awareness and engaging citizens in re-using PSI

Citizens in Europe are not impacted by PSI as such; they are impacted through the services resulting from the use/re-use of PSI. The paper is correct in stating that efforts are duplicated and we don’t see a lot of use/re-use of PSI, but pre-existing investments have been made mostly for internal use, not necessarily for citizen-focused service deployment.

Many reports see worldwide investment in publishing data as insufficient at best. Again, this kind of investment focuses primarily on a poor effort to digitalise archaic internal processes, thereby mostly ignoring the customers: citizens and businesses. The question posed in the document is nevertheless valid, certainly with regard to the use/re-use of PSI: what is the impact on citizens or, more generally speaking, of reusing available electronic data and services?

An overview of the known actions, initiatives and platforms that can be used to raise citizens' awareness of existing PSI should be the starting point of this session. However, presenting hackathons as a worthwhile awareness-raising initiative was discussed at length in Lisbon and was seen as insufficient and obsolete.

Asking session participants what they use or see as alternative ways and methods for raising awareness is obviously a good idea, provided that the resulting ideas are well documented and distributed as possible “good practices”. If not, this is going to be a useless discussion.

Care should be taken not to emphasise or go into too much technical detail (software packages, Cloud-based or Web-enabled platforms, etc.). Defining ideas and methods should be the focus of the session.


The focus of the session should/must be on use/re-use of PSI, with an effort to document existing ideas/methods/practices gathered in this session together with results from a possible brainstorm w.r.t. these issues.

The questions to be addressed in the discussion make sense; however, the examples given (the Seed and Engage projects) are not very praiseworthy use cases of Open Data.

Firstly, these projects create solutions which provide unclear value for citizens. The displays are an outdated form of data presentation, with exceptions in some niche cases such as senior citizens or children. Generally, today the data belongs on the web and should be used in web and mobile applications.

Secondly, from the perspective of engaging private businesses to reuse PSI data, the solutions of the Seed and Engage projects are of little use, because the data is not accessible in appropriate formats.

The problem is that public sector projects are trying to create solutions themselves, instead of opening data in proper formats such as RDF. Therefore, we believe, more emphasis should be put on the availability of the data itself for the business sector. Subsequently, the business sector will generate plenty of applications for citizens to use. But where is the data?


Interlinking of PSI data

The proposed session (on using the potential for interlinking as an important criterion when evaluating PSI datasets) would be valuable to have at the workshop. By interlinking datasets with other data out there (or by taking care to ensure the potential for this when publishing data) we can re-use a number of datasets together and answer questions that each dataset alone might not be able to answer. An important question, then, is how to evaluate a dataset's potential for interlinking. I wish the authors covered this in more detail in the proposal.

The Rationale section finishes somewhat abruptly, focusing on the comparability of datasets. Comparability is important, but is it crucial to interlinking? The authors give an example from the statistics domain, but this issue has been around for ages, unrelated to interlinking. I am not saying that comparability is not important, but there are other types of interlinking that might be even more important.

An important question for the proposed session is from the policy and best practices point of view: how can we promote and implement the interlinking criteria as a part of dataset evaluation criteria so that decision-makers would adopt and use it? People putting the data out there would need to think about interlinking before publishing the data and perhaps some improvements would be needed to increase the potential for interlinking. That's additional work. If we want people to do it, we need to make it as simple as possible for them to evaluate and improve the interlinking potential of datasets they are working with.

The goal of the paper is to raise discussion related to the WS topic of making high-quality reference datasets available. In my opinion the Rationale part of the submission is well explained, but the focus of the discussion, especially the first sentence, is questionable: data published by the government should already be interlinked with high-quality data, and the question of quality is of interest to those who re-use the data. Therefore, either delete or reformulate the first sentence.

The proposers would like to discuss technical issues of interlinking and the results of the discussion can be used for elaborating SHARE-PSI Best practices.

Google Docs

Free our maps

The proposal is based on experiences made in Romania but falls a bit short on addressing the topic of geodata on a broader scale. At the same time, it proposes to address an enormous range of discussion items. I recommend sharpening the focus a bit more. As it currently stands, the session would address an overly stretched range of topics. Maybe split into two sessions?

It is aligned with the topics of the workshop, focusing on a particular type of data: geodata.

The proposal includes:

  1. good questions that can be answered by the audience and can stimulate its participation
  2. one example on experience that can be considered in the category of best practices

However, I suggest that the authors:

  1. enhance the list of questions to be discussed during the session
  2. identify and mention also other European initiatives for open geodata
  3. eliminate the Romanian specificities from the proposal (e.g. "we do not have substantial knowledge on other European countries’ legislation related to data and geodata")

In conclusion, I think the proposal can be accepted (if it does not overlap with other similar ones).

File:Abstract free our maps.pdf

The Electronic Public Procurement System, open data and story telling in Romania

The paper, "The Electronic Public Procurement System, open data and story telling in Romania", is a small case study on the usability of the Romanian Electronic Public Procurement System (SEAP) from the perspective of anyone in an investigatory role who is trying to undertake their own systematic review of the SEAP data. It is also a cri de coeur for good design in open data portals. The paper presents in some detail the difficulties encountered when trying to access SEAP data either through the website or through the administration (including the private contractor responsible for maintaining the critical infrastructure of the SEAP system). As such, it provides material from which a number of good practices can be identified, since these would represent the remedies required to make the SEAP system, together with its ancillary support framework, fit for purpose.

The paper unfortunately only reports the negative aspects of the SEAP system and leaves it up to the reader to infer the good practices that would remediate the SEAP system. The paper could be strengthened by providing a short list of these good practices.

I think that 'as is' the paper would certainly provide the workshop with material for a very good discussion on the requirements of an 'open data' portal. However, the key theme of the paper is perhaps only on the margins of the key themes of the workshop. It might fit in the areas of discussion relating to overall quality and also to the engagement with users and the development of improvement loops that involve users.

This paper does not analyse categories of valuable datasets. It focuses only on public procurement information, so it is assumed that this dataset is a priority for the organisation. The document does not analyse good practices; instead, it collects all the bad practices followed by the data publishers (Romanian public bodies) when releasing the information, which is useful because we can identify best practices from their opposites.

The closest workshop topic for this paper is "making available data that may be incomplete or imperfect in such a way that it is valuable to reusers but that does not put a heavy burden of liability on providers". In this particular case the authors, who find the incomplete and imperfect information released by the government useful, face many problems in reusing it.

The authors found that the procurement information published by the government is incomplete and inaccurate.

Although the approach of the paper is contrary to the aim of the workshop (identifying good practices), the authors could quickly derive some good practices from the counterexamples cited:

  • Identification of accessibility barriers (CAPTCHA required when searching and filtering; limits when querying notices).
  • Imperfect information (lack of flexibility adding CPV codes, name of countries, incorrect data, etc.)
  • Isolated documents/datasets (no possibility of combining data from various contracts).
  • Opacity in public procurement contract winners.
  • Lack of a homogeneous data scheme (notices have different structures depending on the category, which makes data interpretation hard).
  • Opacity in the internal proceedings (A private company controls the whole process of gathering data, stats or values).

One important aspect of the document is that the public authority itself is inefficient, and this could be remedied with these good practices: the entity that manages public contracts cannot retrieve essential information about modifications from errata notices, due to the lack of a homogeneous format and automatic retrieval.

The authors also identified bad practices in formats: for example, data.gov.ro provides the same information in CSV files (good: an open format), but with errors within the files.

This paper does NOT answer any of the questions raised in the CFP. Nevertheless, it could be useful to illustrate which practices to avoid. To be acceptable, the paper should include a new section of conclusions with a set of good practices to solve the challenges and difficulties that Romanians face when trying to reuse procurement data (examples listed above). To fit the theme of the workshop, the document should also explain why public contracts are important to reusers.

These best practices should answer these questions:

  1. What X is the thing that should be done to publish or reuse PSI? (e.g., publication in CSV following a homogeneous, understandable scheme)
  2. Why does X facilitate the publication or reuse of PSI? (e.g., It facilitates the reuse because intermediate cleansing is not needed and it allows automatic processing).
  3. How can one achieve X and how can you measure or test it? (e.g. Creating reliable ETLs and test suites…).
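For the first example above (publication in CSV following a homogeneous scheme), a minimal illustrative check might verify that all exports share one header row so that automatic processing is possible without intermediate cleansing. The column names and file contents below are invented for illustration:

```python
# Illustrative sketch (names and data invented): verify that a set of
# CSV exports share one homogeneous header, so re-users can process
# them automatically without intermediate cleansing.
import csv
import io

def headers_match(csv_texts):
    """Return (ok, headers): ok is True when every file has the same header row."""
    headers = [next(csv.reader(io.StringIO(text))) for text in csv_texts]
    return all(h == headers[0] for h in headers), headers

good = "notice_id,cpv_code,winner\n1,45000000,ACME\n"
bad = "id,cpv,contractor\n2,45000000,Other\n"

print(headers_match([good, good])[0])  # True: identical headers
print(headers_match([good, bad])[0])   # False: schemes diverge
```

A check of this kind could also serve as the "measure or test" part of the third question: a simple test suite run before publication.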

File:Concept Note PSI.pdf

Role of Open Data in Research Institutions with International Significance

This is an interesting topic, although a little far from the Share-PSI focus. The proposal addresses open scientific data and their reuse, which is very important for the scientific community. For a general audience this is less interesting because, generally speaking, scientific (experimental) data require a significant and complex description (e.g. the circumstances of the experiment in which the data were obtained) to enable reproduction of the results. I am not sure that this topic fits into Share-PSI, but generally speaking this is an important area.

The paper describes an activity where researchers jointly produce and analyse big data for innovation and better public service.

It is quite unclear what the ELI project does; it would be good to have at least a paragraph about it, so that people understand the whole story.

This is a good example of jointly funded regional activity, where the output is partly open data produced in regional cooperation. In this sense, I think the suggested session might nicely complement the current workshop topics.

I think you could suggest which details of the initiative are most interesting for the workshop (for example, IPR or data management). Overall, the paper is not suitable as a session invitation as it stands, so it needs some extensions and clarifications.


Freyja's Proposal

The session proposed for the Share-PSI workshop will clearly enable participants to tackle sensitive aspects of the implementation of the PSI Directive and benefit from your experience in the field, as evidenced by your contributions to the LAPSI project. Although the context is well described in the session proposal, I feel the discussion needs more focus and closer alignment with the workshop questions. My suggestion is to try to drive the session around 1-3 questions from the workshop description. For example:

  • What could be the role of standard disclaimers or positive statements (sometimes known as 'proclaimers') to communicate the extent to which data can be guaranteed to be complete?
  • Should producers of the same category (e.g. municipalities or local governments) publish a common set of datasets (e.g. budgetary data, public transport data, etc.)? Under the same licence? How do national differences affect the ability to re-use PSI across borders?

Do not forget that we are looking for answers to 3 fundamental questions:

  1. What X is the thing that should be done to publish or reuse PSI?
  2. Why does X facilitate the publication or reuse of PSI?
  3. How can one achieve X and how can you measure or test it?

and therefore we should be able to define best practices at European level.

Considering the fact that Share-PSI consortium includes over 15 academic and research organizations, I believe that an equally interesting session would be one around the publication, sharing, linking, review and evaluation of research results. That exactly matches the OpenScienceLink project objectives. A session like this, addressing the issue of opening data in the European Research ecosystem and enabling participants to exchange experience and ideas, will help identify practices across Europe and, eventually, transform them into best practices.

This paper introduces aspects that non-legal minds rarely think of, and is therefore an interesting angle to bring to the workshop. The difficulty will be to distil best practices from the discussion, but this is what we need to do. How do legal considerations affect the selection of data for publication? What is the best way to make licensing and/or rights information available? This will be an interesting session.