How to help build the UK’s open planning database: writing scrapers

This post is by Andrew Speakman, who’s coordinating OpenlyLocal’s planning application work.

As Chris wrote in his last post announcing OpenlyLocal’s progress in building an open database of planning applications, we can handle the importing from the main planning systems ourselves, but if we’re really going to cover the whole country we’re going to need the community’s help. I’m going to be coordinating this effort, so I thought it would be useful to explain how we’re going to do it (you can contact me at planning@openlylocal.com).

First, we’re going to use the excellent ScraperWiki as the main platform for writing external scrapers. It supports Python, Ruby and PHP, and has worked well for similar schemes. It also means each scraper is openly available and we can see it in action. We will then use the ScraperWiki API to pull the data regularly into OpenlyLocal.
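For illustration, here’s a minimal sketch of pulling rows out of a scraper’s datastore over the ScraperWiki API. The scraper short name is hypothetical, and the endpoint details reflect our understanding of the datastore API, so check them against the ScraperWiki docs:

    import json
    import urllib.parse
    import urllib.request

    # Assumed endpoint shape for the ScraperWiki datastore API.
    BASE = "https://api.scraperwiki.com/api/1.0/datastore/sqlite"
    params = {
        "format": "jsondict",
        "name": "example_planning_scraper",  # hypothetical scraper short name
        "query": "select uid, url, address from swdata limit 10",
    }

    with urllib.request.urlopen(BASE + "?" + urllib.parse.urlencode(params)) as resp:
        rows = json.loads(resp.read().decode("utf-8"))

    for row in rows:
        print(row["uid"], row["url"])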

Second, we’re going to break the job into manageable chunks by focusing on target groups of councils, and, to sweeten things (as if building a national open database of planning applications wasn’t enough 😉), we’re going to offer small bounties (£75) for successful scrapers for these councils.

We have some particular requirements designed to keep the system maintainable and do things the right way, but few of them are set in stone, so feel free to respond with suggestions if you want to do things differently.

For example, a scraper should keep itself current (running on a daily basis), but also behave nicely (not putting an excessive load on ScraperWiki or the target website by trying to get too much data in one go). In addition, we propose that scrapers update current applications daily and also make inroads into the backlog by gathering a batch of older applications on each run.
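As a rough illustration of what “behaving nicely” might look like, a scraper could cap the size of each backlog batch and pause between requests. The numbers below are placeholders, not requirements:

    import time
    import urllib.request

    BATCH_SIZE = 50      # placeholder: how many backlog pages to tackle per run
    DELAY_SECONDS = 2    # placeholder: pause between requests to the council site

    def polite_get(url):
        """Fetch one page, then pause so we don't hammer the target site."""
        with urllib.request.urlopen(url) as resp:
            html = resp.read()
        time.sleep(DELAY_SECONDS)
        return html

    def scrape_backlog(urls):
        """Work through at most BATCH_SIZE backlog pages in one run."""
        for url in urls[:BATCH_SIZE]:
            page = polite_get(url)
            # ... parse the page and save the application here ...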

We have set up three example scrapers that operate in the way we expect: Brent (Ruby), Nuneaton and Bedworth (Python) and East Sussex (Python). Each scraper performs four operations (a skeleton sketch follows the list):

  1. Create new database records for any new applications that have appeared on the site since the last run and store the identifiers (uid and url).
  2. Create new database records for a batch of missing older applications and store the identifiers (uid and url). Currently the scrapers are set up to work backwards from the earliest stored application towards a target date in the past.
  3. Update the most current applications by collecting and saving the full application details. At the moment the scrapers update the details of all applications from the past 60 days.
  4. Update the full application details of a batch of older applications where the uid and url have been collected (as above) but the application details are missing. At the moment the scrapers work backwards from the earliest “empty” application towards a target date in the past.
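The skeleton below shows how those four operations might hang together in a single daily run. The function bodies are stand-ins for the site-specific scraping logic; only the overall shape is meant to match the example scrapers:

    import datetime

    BACKLOG_TARGET = datetime.date(2005, 1, 1)  # hypothetical target date
    CURRENT_WINDOW_DAYS = 60                    # the 60-day window mentioned above

    def save_new_identifiers():
        """1. Store uid/url for applications new since the last run."""
        pass  # site-specific scraping goes here

    def save_backlog_identifiers(target):
        """2. Store uid/url for a batch of older applications, working
        backwards from the earliest stored record towards `target`."""
        pass

    def update_current_details(since):
        """3. Collect and save full details for recent applications."""
        pass

    def fill_empty_details(target):
        """4. Fill in details for a batch of older "empty" records, working
        backwards towards `target`."""
        pass

    def run():
        save_new_identifiers()
        save_backlog_identifiers(BACKLOG_TARGET)
        update_current_details(
            datetime.date.today() - datetime.timedelta(days=CURRENT_WINDOW_DAYS))
        fill_empty_details(BACKLOG_TARGET)

    if __name__ == "__main__":
        run()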

The data fields to be gathered for each planning application are defined in this shared Google spreadsheet: https://docs.google.com/spreadsheet/ccc?key=0AhOqra7su40fdGdVbDRWYkxGbnhsTkFMTjBBSi1oTHc (see the “Scraper field names” tab at the bottom). Not all the fields will be available on every site, but we want all of those that are.

Note the following (a minimal save sketch follows the list):

  • The minimal valid set of fields for an application is: ‘uid’, ‘description’, ‘address’, ‘start_date’ and ‘date_scraped’
  • The ‘uid’ is the database primary key field
  • All dates (except ‘date_scraped’) should be stored in ISO 8601 format
  • The ‘start_date’ field is set to the earlier of ‘date_received’ and ‘date_validated’, or to whichever of the two is available
  • The ‘date_scraped’ field is a date/time (RFC 3339) set to the current time when the full application details are updated. It should be indexed.
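Putting those rules together, a minimal save might look something like the sketch below. It uses the ScraperWiki Python library’s datastore calls as we understand them; the field values are invented, and the index statement assumes the default ‘swdata’ table:

    import datetime
    import scraperwiki

    # Invented example values for a single application.
    record = {
        "uid": "12/00123/FUL",                         # primary key
        "description": "Single storey rear extension",
        "address": "1 High Street, Anytown",
        "date_received": "2012-03-01",                 # ISO 8601
        "date_validated": "2012-03-05",                # ISO 8601
        "start_date": "2012-03-01",                    # earlier of the two above
        # RFC 3339 date/time of this update.
        "date_scraped": datetime.datetime.utcnow().isoformat() + "Z",
    }

    # Using 'uid' as the unique key makes repeated saves update, not duplicate.
    scraperwiki.sqlite.save(unique_keys=["uid"], data=record)

    # 'date_scraped' should be indexed (assuming the default 'swdata' table).
    scraperwiki.sqlite.execute(
        "CREATE INDEX IF NOT EXISTS idx_date_scraped ON swdata (date_scraped)"
    )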

So how do you get started? Here’s a list of 10 non-standard authorities that you can choose from: Aberdeen, Aberdeenshire, Ashfield, Bath, Calderdale, Carmarthenshire, Consett, Crawley, Elmbridge, Flintshire. Have a look at the sites and then let me know if you want to reserve one and how long you think it will take to write your scraper.

Happy scraping.


4 Comments on “How to help build the UK’s open planning database: writing scrapers”

  1. Tom Hughes says:

    I’ll have a look at converting my existing scraper for the (distinctly non-standard and annoying) site of Broxbourne Council to this standard. The current scraper is here:

    https://scraperwiki.com/scrapers/broxbourne_planning_applications/

    • Andrew says:

      Tom

I would be very interested to see how you get on, as there are about 8 other sites like this (built by Civica?) which I have already had a go at and found equally annoying and impenetrable. I did manage to crack two of them (Harrow and Wrexham) so let me know when you are ready and we can swap notes.

      Andrew

      • Tom Hughes says:

        Think I’ve got it all running now, and repopulated with all the data from the start of 2007 using the new schema.

        The biggest pain of course is the lack of any direct URLs for applications…

  2. Andrew says:

    Hi All

    The missing link to the spreadsheet in the main article is:

    https://docs.google.com/spreadsheet/ccc?key=0AhOqra7su40fdGdVbDRWYkxGbnhsTkFMTjBBSi1oTHc

    Look for the “Scraper field names” tab at the bottom

