Best Practices/Identifying what you already publish

From Share-PSI EC Project

This is a best practice that has come out of the session Timisoara/Scribe/Site scraping techniques to identify and showcase information in closed formats

Title

Scrape your website to find out what you already publish.

Alternative title: Inventorying Your Data

Short description

Public sector websites already publish large volumes of information, often too much to be catalogued manually. Automated scraping techniques should therefore be applied to create inventories of the information that is already published. Such inventories can be used as an input into the prioritization of datasets for sharing as Open Data.

Why

Public sector websites already publish large volumes of information, too much to be catalogued manually. Automated scraping techniques can be used to gather details of the information assets that are already published on a website. Because the process is automated, website owners can easily repeat it to audit their website periodically and assess what information assets they publish and in what form (open, closed, etc.).

Intended Outcome

Public sector bodies (data publishers) should be able to audit their websites automatically and to create inventories of the information they already publish.

Relationship to PSI Directive

Article 9 - Practical arrangements

Possible Approach

The following scraping software/libraries can be used:


Metadata gathered using the scraping software can then be used as facets for sorting and grouping the links. The following applications can provide faceted browsing features:


The following techniques can be applied to increase the security of the scraping and to reduce the risk of a malware infection:

  • Process only the headers.
  • The scraper should be run on an isolated machine.
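
A minimal sketch of the "process only the headers" technique, assuming that "headers" refers to HTTP response headers. It sends a HEAD request, so the body of a potentially unsafe file is never downloaded, and builds one inventory record from the response headers. The function names and the record fields are hypothetical and chosen only for illustration.

```python
import urllib.request


def metadata_from_headers(url, headers):
    """Build one inventory record from response headers only (no body needed)."""
    return {
        "url": url,
        # Media type without parameters, e.g. "application/pdf".
        "content_type": headers.get_content_type(),
        # Header values are kept as strings; missing headers become None.
        "content_length": headers.get("Content-Length"),
        "last_modified": headers.get("Last-Modified"),
    }


def head_metadata(url, timeout=10):
    """Send a HEAD request so that the server returns headers but no body."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return metadata_from_headers(url, response.headers)
```

The content type collected this way could later serve as one of the facets for grouping links, for example to separate documents published in open formats from those in closed formats. Some servers do not answer HEAD requests correctly, so the results may need to be checked.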


The following challenges might affect the scraping:

Challenge                         | Possible solution
Loops of scraping                 | TBD
Technical difficulties            | TBD
Irregular structure of web pages  | TBD
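
No solution for scraping loops was recorded in the session. As an assumption rather than an agreed solution, one common approach is to normalize every URL, remember which addresses have already been visited, and cap the number of pages. The sketch below illustrates this idea; `get_links` is a hypothetical callable that returns the links found on a page, so the example does not depend on any particular scraping library.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit


def normalize(url):
    """Drop the #fragment and lowercase the host so equivalent URLs compare equal."""
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    return parts._replace(netloc=parts.netloc.lower()).geturl()


def crawl(start_url, get_links, max_pages=1000):
    """Breadth-first crawl of one site that never visits the same URL twice."""
    start = normalize(start_url)
    site = urlsplit(start).netloc
    visited = set()
    queue = deque([start])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue  # Already processed: this is what breaks the loop.
        visited.add(url)
        for link in get_links(url):
            target = normalize(urljoin(url, link))
            # Stay on the audited site and skip pages seen before.
            if urlsplit(target).netloc == site and target not in visited:
                queue.append(target)
    return visited
```

The `max_pages` limit is a second safeguard for sites that generate an unbounded number of distinct URLs, such as calendars or session identifiers in query strings.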


Note that automatic site scraping is only the first step in developing a catalogue of datasets; manual effort might still be needed. However, the resulting metadata (inventory) can be used together with an analysis of the web server logs to prioritize datasets for sharing as Open Data. That is, the web server logs might show which of the inventoried data sources are accessed often, which might indicate good candidates for publishing in open formats.
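
The combination of inventory and web server logs can be sketched as follows. The example assumes that the logs use the Common Log Format (or the Combined Log Format built on it) and that the inventory is a list of URL paths; both are assumptions, and the function names are hypothetical. It counts successful requests per path and ranks the inventoried items by that count.

```python
import re
from collections import Counter

# Matches the request and status fields of a Common Log Format line,
# e.g. "GET /docs/report.pdf HTTP/1.1" 200
REQUEST_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
)


def count_hits(log_lines):
    """Count successful (2xx) requests per path, ignoring query strings."""
    hits = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match and match.group("status").startswith("2"):
            path = match.group("path").split("?", 1)[0]
            hits[path] += 1
    return hits


def rank_inventory(inventory_paths, log_lines):
    """Return (path, hits) pairs for inventoried items, most requested first."""
    hits = count_hits(log_lines)
    return sorted(
        ((path, hits[path]) for path in inventory_paths),
        key=lambda item: item[1],
        reverse=True,
    )
```

Items at the top of the ranking are accessed most often and might be candidates for publishing in open formats first. Raw hit counts include crawlers and repeated downloads, so they are only a rough indicator.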

How to Test

An automatically generated inventory of the information provided on a given website is available (at least to the personnel of the particular publisher).

Evidence

The Scottish Government uses site scraping techniques; see the Timisoara Workshop session "Site scraping techniques to identify and showcase information in closed formats - How do organisations find out what they already publish?"

Fraunhofer FOKUS also has experience with crawling data:

Tags

Site scraping, web, published information, dataset listing, inventory

Status

Draft

Intended Audience

Data managers, data owners

Related Best Practices

Analyse web server logs to find out what people are interested in