Best Practices/Identifying what you already publish
This best practice came out of the session Timisoara/Scribe/Site scraping techniques to identify and showcase information in closed formats
Title
Scrape your website to find out what you already publish.
Alternative title: Inventorying Your Data
Short description
Public sector websites already publish large volumes of information, often too large to be catalogued manually. Automated scraping techniques should therefore be applied to create inventories of already published information. Such inventories can be used as an input to the prioritization of datasets for sharing as Open Data.
Why
Public sector websites already publish large volumes of information, too much to be catalogued manually. Automated scraping techniques can be used to gather details of information assets that are already published on a website. By automating the process, website owners can periodically audit their website to assess what information assets they publish and in what form (open, closed, etc.).
Intended Outcome
Public sector bodies (data publishers) should be able to automatically audit their websites and create inventories of already published information.
Relationship to PSI Directive
Article 9 - Practical arrangements
Possible Approach
The following scraping software/libraries can be used:
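No particular tool is mandated by this practice; as an illustrative sketch only, the snippet below uses the Python requests and BeautifulSoup libraries (an assumed choice; the crawler4j tool mentioned under Evidence is a comparable Java alternative) to walk the pages of a site and write every discovered link, together with its file extension, to a CSV inventory. The start URL and output file name are placeholders.

```python
# Minimal same-site crawler: walks pages under a start URL and records every
# outgoing link together with its file extension in a CSV inventory.
import csv
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START_URL = "https://www.example.gov/"   # placeholder start URL
OUTPUT_FILE = "inventory.csv"            # placeholder output file


def crawl(start_url, max_pages=500):
    domain = urlparse(start_url).netloc
    to_visit, seen, rows = [start_url], set(), []
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        if "text/html" not in response.headers.get("Content-Type", ""):
            continue  # only HTML pages are parsed for further links
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            filename = urlparse(link).path.rsplit("/", 1)[-1]
            extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
            rows.append({"page": url, "link": link, "extension": extension})
            if urlparse(link).netloc == domain:
                to_visit.append(link)  # only follow links on the same site
    return rows


if __name__ == "__main__":
    with open(OUTPUT_FILE, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["page", "link", "extension"])
        writer.writeheader()
        writer.writerows(crawl(START_URL))
```

The max_pages limit is a simple guard against scraping loops (see the challenges table below).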
Metadata gathered using the scraping software can then be used as facets for sorting and grouping the links. The following applications can provide faceted browsing features:
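Even before loading the inventory into a faceted browsing application, the gathered metadata can be grouped directly. A minimal sketch, assuming a CSV inventory like the one produced above (the file name inventory.csv is a placeholder):

```python
# Group the inventory by file extension: a first, rough facet showing how
# many links point to PDFs, spreadsheets, plain HTML pages, and so on.
import csv
from collections import Counter

with open("inventory.csv", newline="", encoding="utf-8") as f:  # placeholder file name
    counts = Counter(row["extension"] or "(none)" for row in csv.DictReader(f))

for extension, count in counts.most_common():
    print(f"{extension}: {count}")
```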
The following techniques can be applied to increase the security of the scraping and to reduce the risk of a malware infection:
- Process only the headers (see the sketch after this list).
- The scraper should be run on an isolated machine.
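A minimal sketch of the headers-only technique, assuming the Python requests library: an HTTP HEAD request returns only the response headers (such as Content-Type and Content-Length), so the format and size of a resource can be recorded without downloading its body.

```python
# Record the format and size of a resource from its response headers only;
# the body is never downloaded, which limits exposure to malware.
import requests


def inspect(url):
    response = requests.head(url, allow_redirects=True, timeout=10)
    return {
        "url": url,
        "content_type": response.headers.get("Content-Type", "unknown"),
        "content_length": response.headers.get("Content-Length", "unknown"),
    }


print(inspect("https://www.example.gov/report.pdf"))  # placeholder URL
```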
The following challenges might affect the scraping:
Challenge | Possible solution |
---|---|
Loops of scraping | TBD |
Technical difficulties | TBD |
Irregular structure of web pages | TBD |
It should be noted that automatic site scraping is only the first step in the development of a catalogue of datasets; manual effort might still be needed. However, the created metadata (inventory) can be used together with an analysis of the web server logs to prioritize datasets for sharing as Open Data, i.e. the web server logs might show which of the inventoried data sources are accessed often, which may indicate good candidates for publishing in open formats.
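As a sketch of this prioritization step, the snippet below counts how often each inventoried link appears in an Apache-style combined access log. The file names and the log format are assumptions, not part of this practice.

```python
# Count how often each inventoried URL appears in a web server access log.
# Frequently requested resources that are only available in closed formats
# are good candidates for publication as Open Data.
import csv
import re

# The request path is the quoted token after the HTTP method in an Apache
# combined log line, e.g.:
# 127.0.0.1 - - [10/Oct/2015:13:55:36 +0000] "GET /report.pdf HTTP/1.1" 200 2326
REQUEST_PATTERN = re.compile(r'"[A-Z]+ (\S+) HTTP/[^"]*"')

with open("inventory.csv", newline="", encoding="utf-8") as f:  # placeholder
    inventory = {row["link"] for row in csv.DictReader(f)}

hits = {url: 0 for url in inventory}
with open("access.log", encoding="utf-8") as log:               # placeholder
    for line in log:
        match = REQUEST_PATTERN.search(line)
        if not match:
            continue
        path = match.group(1)
        for url in inventory:
            # Compare on the path so relative log entries line up with the
            # absolute URLs stored in the inventory.
            if url.endswith(path):
                hits[url] += 1

for url, count in sorted(hits.items(), key=lambda item: item[1], reverse=True)[:20]:
    print(f"{count:6d}  {url}")
```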
How to Test
An automatically generated inventory of the information provided on a given website is available (at least to the personnel of the particular publisher).
Evidence
The Scottish Government uses site scraping techniques - see the Timisoara Workshop session "Site scraping techniques to identify and showcase information in closed formats - How do organisations find out what they already publish?"
Fraunhofer FOKUS has also gained experience with crawling data:
- For the crawling activities in Germany they used this tool:
- https://github.com/yasserg/crawler4j
- You can find the results of the crawling activities in the study they undertook (unfortunately only available in German) in this document starting on page 401
- https://www.w3.org/2013/share-psi/wiki/Localised_Guides#Country.2FCity:_Germany (the German long version)
Tags
Site scraping, web, published information, dataset listing, inventory
Status
Draft
Intended Audience
Data managers, data owners
Related Best Practices
Analyse web server logs to find out what people are interested in