Best Practices/Identifying what you already publish
This best practice came out of the session Timisoara/Scribe/Site scraping techniques to identify and showcase information in closed formats
Title
Scrape your website to find out what you already publish.
Alternative title: Inventorying Your Data
Short description
Public sector websites already publish large volumes of information, often too large to be catalogued manually. Automated scraping techniques should therefore be applied to create inventories of already published information. Such inventories can be used as an input to the prioritization of datasets for sharing as Open Data.
Why
Public sector websites already publish large volumes of information, too much to be catalogued manually. Automated scraping techniques can be used to gather details of information assets that are already published on a website. By automating the process, website owners can periodically audit their website to assess what information assets they publish and in what form (open, closed, etc.).
Intended Outcome
Public sector bodies (data publishers) should be able to automatically audit their websites and create inventories of already published information.
Relationship to PSI Directive
Article 9 - Practical arrangements
Possible Approach
The following scraping software/libraries can be used:
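No particular tool is mandated by this practice; as an illustrative sketch only, the snippet below uses the Python requests and BeautifulSoup libraries (an assumed choice; the crawler4j tool mentioned under Evidence is a comparable Java alternative) to walk the pages of a site and write every discovered link, together with its file extension, to a CSV inventory. The start URL and output file name are placeholders.

```python
# Minimal same-site crawler: walks pages under a start URL and records every
# outgoing link together with its file extension in a CSV inventory.
import csv
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START_URL = "https://www.example.gov/"   # placeholder start URL
OUTPUT_FILE = "inventory.csv"            # placeholder output file


def crawl(start_url, max_pages=500):
    domain = urlparse(start_url).netloc
    to_visit, seen, rows = [start_url], set(), []
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        if "text/html" not in response.headers.get("Content-Type", ""):
            continue  # only HTML pages are parsed for further links
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            filename = urlparse(link).path.rsplit("/", 1)[-1]
            extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
            rows.append({"page": url, "link": link, "extension": extension})
            if urlparse(link).netloc == domain:
                to_visit.append(link)  # only follow links on the same site
    return rows


if __name__ == "__main__":
    with open(OUTPUT_FILE, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["page", "link", "extension"])
        writer.writeheader()
        writer.writerows(crawl(START_URL))
```

The max_pages limit is a simple guard against scraping loops (see the challenges table below).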
Metadata gathered using the scraping software can then be used as facets for sorting and grouping the links. The following applications can provide faceted browsing features:
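Even before loading the inventory into a faceted browsing application, the gathered metadata can be grouped directly. A minimal sketch, assuming a CSV inventory like the one produced above (the file name inventory.csv is a placeholder):

```python
# Group the inventory by file extension: a first, rough facet showing how
# many links point to PDFs, spreadsheets, plain HTML pages, and so on.
import csv
from collections import Counter

with open("inventory.csv", newline="", encoding="utf-8") as f:  # placeholder file name
    counts = Counter(row["extension"] or "(none)" for row in csv.DictReader(f))

for extension, count in counts.most_common():
    print(f"{extension}: {count}")
```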
The following techniques can be applied to increase the security of the scraping and to reduce the risk of a malware infection:
- Process only the headers (see the sketch after this list).
- The scraper should be run on an isolated machine.
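A minimal sketch of the headers-only technique, assuming the Python requests library: an HTTP HEAD request returns only the response headers (such as Content-Type and Content-Length), so the format and size of a resource can be recorded without downloading its body.

```python
# Record the format and size of a resource from its response headers only;
# the body is never downloaded, which limits exposure to malware.
import requests


def inspect(url):
    response = requests.head(url, allow_redirects=True, timeout=10)
    return {
        "url": url,
        "content_type": response.headers.get("Content-Type", "unknown"),
        "content_length": response.headers.get("Content-Length", "unknown"),
    }


print(inspect("https://www.example.gov/report.pdf"))  # placeholder URL
```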
The following challenges might affect the scraping:
Challenge | Possible solution |
---|---|
Loops of scraping | TBD |
Technical difficulties | TBD |
Irregular structure of web pages | TBD |
It should be noted that automatic site scraping is only the first step in the development of a catalogue of datasets; manual effort might still be needed. However, the created metadata (inventory) can be used together with an analysis of the web server logs to prioritize datasets for sharing as Open Data, i.e. the web server logs might show which of the inventoried data sources are accessed often, which may indicate good candidates for publishing in open formats.
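As a sketch of this prioritization step, the snippet below counts how often each inventoried link appears in an Apache-style combined access log. The file names and the log format are assumptions, not part of this practice.

```python
# Count how often each inventoried URL appears in a web server access log.
# Frequently requested resources that are only available in closed formats
# are good candidates for publication as Open Data.
import csv
import re

# The request path is the quoted token after the HTTP method in an Apache
# combined log line, e.g.:
# 127.0.0.1 - - [10/Oct/2015:13:55:36 +0000] "GET /report.pdf HTTP/1.1" 200 2326
REQUEST_PATTERN = re.compile(r'"[A-Z]+ (\S+) HTTP/[^"]*"')

with open("inventory.csv", newline="", encoding="utf-8") as f:  # placeholder
    inventory = {row["link"] for row in csv.DictReader(f)}

hits = {url: 0 for url in inventory}
with open("access.log", encoding="utf-8") as log:               # placeholder
    for line in log:
        match = REQUEST_PATTERN.search(line)
        if not match:
            continue
        path = match.group(1)
        for url in inventory:
            # Compare on the path so relative log entries line up with the
            # absolute URLs stored in the inventory.
            if url.endswith(path):
                hits[url] += 1

for url, count in sorted(hits.items(), key=lambda item: item[1], reverse=True)[:20]:
    print(f"{count:6d}  {url}")
```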
How to Test
An automatically generated inventory of the information provided on a given website is available (at least to the personnel of the particular publisher).
Evidence
The Scottish Government uses site scraping techniques - see the Timisoara Workshop session "Site scraping techniques to identify and showcase information in closed formats - How do organisations find out what they already publish?"
Fraunhofer FOKUS has also gained experience with crawling data:
- For the crawling activities in Germany they used this tool:
- https://github.com/yasserg/crawler4j
- You can find the results of the crawling activities in the study they undertook (unfortunately only available in German) in this document starting on page 401
- https://www.w3.org/2013/share-psi/wiki/Localised_Guides#Country.2FCity:_Germany (the German long version)
Tags
Site scraping, web, published information, dataset listing, inventory
Status
Draft
Intended Audience
Data managers, data owners
Related Best Practices
Analyse web server logs to find out what people are interested in