Best Practices/Discover published information by site scraping

From Share-PSI EC Project

Name of the Share-PSI workshop

Title of the Best Practice: Discover published information by site scraping

Outline of the best practice

Automate the discovery of what is already published on your organisation's website/s by using site-scraping tools.

Management summary

Challenge

Organisations often publish very large amounts of information on their websites but do not know what is published or in what formats. How can they discover and catalogue what they already publish without committing large staff resources or taking a long time over the job?

Solution

  • Decide to implement a site-scraping element as part of the organisation's approach to mapping the information assets that it already publishes on its website/s.
  • Use sophisticated software libraries such as Scrapy (http://scrapy.org/) or Nutch (http://nutch.apache.org/) to develop bespoke site scraping tools and use these to discover and catalogue information published on websites. Develop specific tools to discover and catalogue information published in closed formats such as PDF and Excel. A minimal sketch of such a tool is shown after this list.
  • Use faceted browsing tools such as Simile Widgets “Exhibit” (http://www.simile-widgets.org/exhibit/) to publish the results of this site scraping, and incorporate the output into processes that display the diversity of published information to potential re-users or that help prioritise which sets of information to convert from closed to open formats. Examples are shown at http://labs.data.scotland.gov.uk
  • If in-house skills for developing site scraping tools from code libraries are not available, there are other resources such as https://scraperwiki.com/
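
The following is a minimal sketch of such a bespoke discovery spider built with Scrapy. It is illustrative only: the domain www.example.gov.uk, the spider name and the list of closed formats are assumptions to be replaced with the organisation's own website/s and priorities, not part of any existing deployment.

import scrapy


class AssetDiscoverySpider(scrapy.Spider):
    """Crawl an organisational website and catalogue what it publishes."""

    name = "asset_discovery"
    allowed_domains = ["www.example.gov.uk"]      # hypothetical domain
    start_urls = ["https://www.example.gov.uk/"]  # hypothetical start page

    # File extensions treated as closed formats, catalogued without downloading.
    CLOSED_FORMATS = (".pdf", ".xls", ".xlsx", ".doc", ".docx")

    def parse(self, response):
        # Record the page itself: its URL, title and declared content type.
        yield {
            "url": response.url,
            "title": (response.css("title::text").get() or "").strip(),
            "content_type": response.headers.get("Content-Type", b"").decode("utf-8", "ignore"),
        }
        # Record links to documents in closed formats, and follow ordinary links.
        for href in response.css("a::attr(href)").getall():
            target = response.urljoin(href)
            if target.lower().endswith(self.CLOSED_FORMATS):
                yield {"url": target, "format": target.rsplit(".", 1)[-1].lower()}
            else:
                yield response.follow(href, callback=self.parse)

Saved as, for example, asset_discovery.py, the spider can be run with Scrapy's runspider command and its output written to a machine-readable file (scrapy runspider asset_discovery.py -o catalogue.json). The resulting catalogue can then be reshaped into the data format expected by a faceted browsing tool such as Exhibit.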

Best Practice Identification

Why is this a Best Practice? What's the impact of the Best Practice?

Site scraping provides an automated and scalable route to discovering what information is already published and in what formats. It can speed up the discovery process for organisations that already publish significant quantities of Public Sector Information, and it can help automate the initial classification of these information assets, delivering machine-readable results.

Link to the PSI Directive


  • Techniques w.r.t. opening up of data / Technical requirements and tools
  • Organisational structures and skills
  • Encouraging (commercial) re-use
  • Selection of information/data to be published according to various criteria


Why is there a need for this Best Practice?

Organisations with restricted time and resources need to implement the 2013 PSI Directive in an economical and efficient way. Automating the process of discovering what is already published will facilitate this. Frequently, existing websites are managed in a content management system (CMS) in which the information and its metadata can only easily be seen together once the final web page has been composed. Scraping websites provides an economical route to gathering information about these web pages and their content efficiently and accurately. The alternative manual route is not only time-consuming but also prone to error.
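
As an illustration, the following sketch gathers, for a single hypothetical page URL, the title, meta tags and Last-Modified header that a CMS typically only exposes in the composed page. It assumes the widely used requests and BeautifulSoup libraries; a crawler such as the Scrapy sketch above would apply the same kind of extraction across an entire site.

import requests
from bs4 import BeautifulSoup

url = "https://www.example.gov.uk/some-publication"   # hypothetical page URL
response = requests.get(url, timeout=30)
soup = BeautifulSoup(response.text, "html.parser")

record = {
    "url": url,
    "title": soup.title.string.strip() if soup.title and soup.title.string else "",
    "last_modified": response.headers.get("Last-Modified", ""),
    # Any <meta name="..."> tags the CMS emits, e.g. description or DC.* elements.
    "metadata": {
        tag["name"]: tag.get("content", "")
        for tag in soup.find_all("meta")
        if tag.get("name")
    },
}
print(record)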

What do you need for this Best Practice?

Some IT infrastructure: a server, internet access, and open source software such as Scrapy (http://scrapy.org/) or Nutch (http://nutch.apache.org/). Also needed are modest programming skills in Python or Java and an understanding of web page structure and HTML.

Applicability by other member states?

UK (Scotland): http://labs.data.scotland.gov.uk

Contact info - record of the person to be contacted for additional information or advice.

Dr P Winstanley peter.winstanley@scotland.gsi.gov.uk