Timisoara/Scribe/Site scraping techniques to identify and showcase information in closed formats
From Share-PSI EC Project
- Facilitator: Peter Winstanley, Scottish Government
- Scribe: Benedikt Kämpgen
Best practice: Identifying what you already publish
Includes:
1) Identify the information assets that are already published on the website by the institution, e.g., by scraping, harvesting, crawling (see the example given by Benedikt Kotmel).
2) Identify how the information assets are published (closed formats, open formats), e.g., by extracting information from the header or from RDF representations.
3) Establish a user interface over the retrieved information to create a "staging area" (e.g., Exhibit, have a website, also part of the evidence).
4) Use the staging area to pre-fill the production-ready catalogue. Also use the staging area to identify, and to monitor the progress of work on, information assets that need improvement before they can be added to a production-ready catalogue.
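Steps 1, 2 and 4 can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the tooling used in the Scottish Government experiments. The split into "open" and "closed" file extensions, the function names (`inventory`, `classify`, `progress`) and the example URLs are assumptions made for the example; the session did not define them.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Assumed classification, for illustration only.
OPEN_FORMATS = {".csv", ".json", ".xml", ".rdf", ".ttl"}
CLOSED_FORMATS = {".pdf", ".doc", ".docx", ".xls", ".xlsx"}


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def classify(url):
    """Return 'open', 'closed' or 'other' based on the file extension."""
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    if ext in OPEN_FORMATS:
        return "open"
    if ext in CLOSED_FORMATS:
        return "closed"
    return "other"


def inventory(html, base_url):
    """Steps 1 and 2: list linked assets together with their format class."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [{"url": u, "format": classify(u)} for u in collector.links]


def progress(records):
    """Step 4: share of document-like assets already in an open format."""
    assets = [r for r in records if r["format"] != "other"]
    if not assets:
        return 0.0
    return sum(r["format"] == "open" for r in assets) / len(assets)
```

Feeding the HTML of a crawled page to `inventory` yields records that could pre-fill a staging area, and re-running `progress` after each scrape gives a rough measure of how much has moved to open formats. File extensions are only a heuristic; the `Content-Type` response header mentioned in step 2 would be a more reliable signal where it is available.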
- Peter starts with describing some experiments with automatically scraping websites for PDF documents etc.
- Publishing pipeline
- Identifying published information
- Metadata and textual description only accessible for the content management system
- Institutions make data available
- That's great!! But...
- How can we get the web view?
- Scraping tools, e.g., Scrapy...
- Faceted search...
- Scottish Government Approach
- Loops of scraping.
- Technical difficulties.
- Irregular structure of web pages
- See tag clouds
- Links to the document (csv, excel spreadsheet, PDF documents)
- Do you think it is useful?
- Does anyone have experience with this or know of better approaches?
- Could the output be better deployed?
- Takes ownership
- Business side and technical side are complemented.
- It has a negative connotation, at least in Belgium.
- How do you respond to that?
- It is not different from what a search engine does. It is a perception thing.
- Authentic representation
- Public websites are not up-to-date.
- Is that not a risk?
- It is a way of taking inventory.
- People ask me: what are we already publishing?
- It also is a way to measure the progress.
- Good idea.
- Did the same exercise with a couple of governments in Spain (Government of Andalusia and City of Barcelona). Unfortunately the results are not openly available, because the purpose was for them to understand what they had already published.
- For that exercise, I used FOCA (Fingerprinting Organizations with Collected Archives), a forensic tool that collects metadata from PDFs, DOCs, and other documents already published on the Web under specific domains.
- A guideline about the methodology of publishing everything in PDF was downloaded 10 times.
- Personal experience with air quality (air pollution) data for my city, using Python. [App using that information at http://gijonair.es]. I expected that the government was hiding information in its datasets. I scraped data from HTML pages and realised that the data is different in each source. Another benefit of your approach, Peter.
- Agree with your approach, Peter.
- Used a similar approach for identifying datasets.
- 400 cities in Germany.
- The benefit from this exercise should not be overestimated.
- For instance, you do not know about the license.
- The manual effort may still be large.
- For one publisher we found that they would mainly publish tables in PDF.
- Scraping as a first step.
- The next step would be to pull out more information, automatically.
- For re-users it is a very nice approach.
- Ask for data in another format.
- Exactly, this approach should encourage publishing in open formats.
- Great idea for two reasons
- It will not happen that data will be overlooked (see example given by Benedikt Kotmel).
- And it shows people the benefit of adding structured information.
- Question: What is the best practice that we talk about now? Inventory taking. Bootstrapping catalogue.
- Does anyone look at the data beforehand?
- No, but why should someone look at it? It is already published on the website.
- There is no sense in publishing data if no one is using it.
- Good idea.
- Reluctant to look behind links because of malware.
- How can we make sure that there is no malware?
- Question: Is Exhibit sufficiently useful and performant?
- For metadata about around 1000 documents, it works pretty well.
- Trying to identify a best practice to find: We need to identify all datasets, also those with closed formats.
- We have experience in publishing statistical datasets.
- Wondering about the time dimension.
- How do you know which document is the most current one?
- It is a static view; updating means scraping again.
- Could think about extracting time information.
- Also a matter of display of information.
- We are probably speaking about a "staging area"
- So that production-ready would be the next step.
- Important remark: Publishing institutions can use the extended dataset search functionality of Google.
- Here is how to use Google for inventorying your data
- We had a dataset survey added to scraping of datasets.
- What is there.
- Peter, who do you think should do the scraping?
- The organisation itself or external people?
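On the time-dimension question raised in the discussion (how to know which document is the most current one), one option is to record the HTTP `Last-Modified` response header while scraping and to compare the parsed dates. The sketch below is only an illustration under that assumption: the helper name `most_current` and the input shape (pairs of URL and header value) are hypothetical, and many servers do not send a reliable `Last-Modified` value.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def most_current(documents):
    """Return the URL with the latest Last-Modified date, or None.

    `documents` is a list of (url, last_modified_header) pairs, where the
    header value may be None or unparseable (such entries are skipped).
    """
    dated = []
    for url, header in documents:
        if not header:
            continue
        try:
            when = parsedate_to_datetime(header)
        except (TypeError, ValueError):
            # Raised for malformed dates, depending on the Python version.
            continue
        if when.tzinfo is None:
            # Treat dates without zone information as UTC so they compare.
            when = when.replace(tzinfo=timezone.utc)
        dated.append((when, url))
    if not dated:
        return None
    return max(dated, key=lambda pair: pair[0])[1]
```

Because the scraped view is static, such dates are only as fresh as the last crawl; they help with ordering and display, but re-scraping is still needed to detect updates.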