Introducing OpenCharities: Opening up the Charities Register

A couple of weeks ago I needed a list of all the charities in the UK and their registration numbers so that I could try to match them up to the local council spending data OpenlyLocal is aggregating and trying to make sense of. A fairly simple request, you’d think, especially in this new world of transparency and open data, and for a dataset that’s uncontentious.

Well, you’d be wrong. There’s nothing at data.gov.uk, nothing at CKAN and nothing on the Charity Commission website, and in fact you can’t even see the whole register on the website, just the first 500 results of any search/category. Here’s what the Charities Commission says on their website (NB: extract below is truncated):

The Commission can provide an electronic copy in discharge of its duty to provide a legible copy of publicly available information if the person requesting the copy is happy to receive it in that form. There is no obligation on the Commission to provide a copy in this form…

The Commission will not provide an electronic copy of any material subject to Crown copyright or to Crown database right unless it is satisfied… that the Requestor intends to re-use the information in an appropriate manner.

Hmmm. Time for Twitter to come to the rescue to check that some other independently minded person hasn’t already solved the problem. Nothing, but I did get pointed to this request for the data to be unlocked, with the very recent response by the Charity Commission, essentially saying, “Nope, we ain’t going to release it”:

For resource reasons we are not able to display the entire Register of Charities. Searches are therefore limited to 500 results… We cannot allow full access to all the data, held on the register, as there are limitations on the use of data extracted from the Register… However, we are happy to consider granting access to our records on receipt of a written request to the Departmental Record Officer

OK, so it seems as though they have no intention of making this data available anytime soon (I actually don’t buy that there are Intellectual Property or Data Privacy issues with making basic information about charities available, and if there really are this needs to be changed, pronto), so time for some screen-scraping. Turns out it’s a pretty difficult website to scrape, because it requires both cookies and javascript to work properly.

Try turning off both in your browser, and see how far you get, and then you’ll also get an idea of how difficult it is to use if you have accessibility issues – and check out their poor excuse for accessibility statement, i.e. tough luck.

Still, there’s usually a way, even if it does mean some pretty tortuous routes, and like the similarly inaccessible Birmingham City Council website, this is just the sort of challenge that stubborn so-and-so’s like me won’t give up on.

And the way to get the info seems to be through the geographical search (other routes relied upon Javascript), and although it was still problematic, it was doable. So, now we have an open data register of charities, incorporated into OpenlyLocal, and tied in to the spending data being published by councils.

Charity supplier to Local authority

And because this sort of thing is so easy, once you’ve got it in a database (Charity Commission take note), there are a couple of bonuses.

First, it was relatively easy to knock up a quick and very simple Sinatra application, OpenCharities:

Open Charities :: Opening up the UK Charities Register

If there’s any interest, I’ll add more features to it, but for now, it’s just a the simplest of things, a web application with a unique URL for every charity based on its charity number, and with the  basic information for each charity is available as data (XML, JSON and RDF). It’s also searchable, and sortable by most recent income and spending, and for linked data people there are dereferenceable Resource URIs.

This is very much an alpha application: the design is very basic and it’s possible that there are a few charities missing – for two reasons. One: the Charity Commission kept timing out (think I managed to pick up all of those, and they should get picked up when I periodically run the scraper); and two: there appears to be a bug in the Charity Commission website, so that when there’s between 10 and 13 entries, only 10 are shown, but there is no way of seeing the additional ones. As a benchmark, there are currently 150,422 charities in the OpenCharities database.

It’s also worth mentioning that due to inconsistencies with the page structure, the income/spending data for some of the biggest charities is not yet in the system. I’ve worked out a fix, and the entries will be gradually updated, but only as they are re-scraped.

The second bonus is that the entire database is available to download and reuse (under an open, share-alike attribution licence). It’s a compressed CSV file, weighing in at just under 20MB for the compressed version, and should probably only attempted by those familiar with manipulating large datasets (don’t try opening it up in your spreadsheet, for example). I’m also in the process of importing it into Google Fusion Tables (it’s still churning away in the background) and will post a link when it’s done.

Now, back to that spending data.


Local spending data in OpenlyLocal, and some thoughts on standards

A couple of weeks ago Will Perrin and I, along with some feedback from the Local Public Data Panel on which we sit, came up with some guidelines for publishing local spending data. They were a first draft, based on a request by Camden council for some guidance, in light of the announcement that councils will have to start publishing details of spending over £500.

Now I’ve got strong opinions about standards: they should be developed from real world problems, by the people using them and should make life easier, not more difficult. It slightly concerned me that in this case I wasn’t actually using any of the spending data – mainly because I hadn’t got around to adding it in to OpenlyLocal yet.

This week, I remedied this, and pulled in the data from those authorities that had published their local spending data – Windsor & Maidenhead, the GLA and the London Borough of Richmond upon Thames. Now there’s a couple of sites (including Adrian Short’s Armchair Auditor, which focuses on spending categories) already pulling the Windsor & Maidenhead data but as far as I’m aware they don’t include the other two authorities, and this adds a different dimension to things, as you want to be able to compare the suppliers across authorities.

First, a few pages from OpenlyLocal showing how I’ve approached it (bear in mind they’re a very rough first draft, and I’m concentrating on the data rather than the presentation). You can see the biggest suppliers to a council right there on the council’s main page (e.g. Windsor & Maidenhead, GLA, Richmond):

Click through to more info gets you a pagination view of all suppliers (in Windsor & Maidenhead’s case there are over 2800 so far):

Clicking any of these will give you the details for that supplier, including all the payments to them:

And clicking on the amount will give you a page just with the transaction details, so it can be emailed to others

But we’re getting ahead of ourselves. The first job is to import the data from the CSV files into a database and this was where the first problems occurred. Not in the CSV format – which is not a problem, but in the consistency of data.

Take Windsor & Maidenhead (you should just be able to open these files an any spreadsheet program). Looking at each data set in turn and you find that there’s very little consistency – the earliest sets don’t have any dates and aggregate across a whole quarter (but do helpfully have the internal Supplier ID as well as the supplier name). Later sets have the transaction date (although in one the US date format is used, which could catch out those not looking at them manually), but omit supplier ID and cost centre.

On the GLA figures, there’ a similar story, with the type of data and the names used to describe changing seemingly randomly between data sets. Some of the 2009 ones do have transaction dates, but the 2010 one generally don’t, and the supplier field has different names, from Supplier to Supplier Name to Vendor.

This is not to criticise those bodies – it’s difficult to produce consistent data if you’re making the rules up as you go along (and given there weren’t any established precedents that’s what they were doing), and doing much of it by hand. Also, they are doing it first and helping us understand where the problems lie (and where they don’t). In short they are failing forward –getting on with it so they can make mistakes from which they (and crucially others) can learn.

But who are these suppliers?

The bigger problem, as I’ve said before, is being able to identify the suppliers, and this becomes particularly acute when you want to compare across bodies (who may name the same company or body slightly differently). Ideally (as we put in the first draft of the proposals), we would have the company number (when we’re talking about a company, at any rate), but we recognised that many accounts systems simply won’t have this information, and so we do need some information that helps use identify them.

Why do we want to know this information? For the same reason we want any ID (you might as well ask why Companies House issues Company Numbers and requires all companies to put that number on their correspondence) – to positively identify something without messing around with how someone has decided to write the name.

With the help of the excellent Companies Open House I’ve had a go at matching the names to company numbers, but it’s only been partially successful. When it is, you can do things like this (showing spend with other councils on a suppliers’ page):

It’s also going to allow me to pull in other information about the company, from Companies House and elsewhere. For other bodies (i.e. those without a company number), we’re going to have to find another way of identifying them, and that’s next on the list to tackle.

Thoughts on those spending data guidelines

In general I still think they’re fairly good, and most of the shortcomings have been identified in the comments, or emailed to us (we didn’t explicitly state that the data should be available under an open licence such as the one at data.gov.uk, and be definitely should have done). However, adding this data to OpenlyLocal (as well as providing a useful database for the community) has crystalised some thoughts:

  • Identification of the bodies is essential, and it think we were right to make this a key point, but it’s likely we will need to have the government provide a lookup table between VAT numbers and Company Numbers.
  • Speaking of Government datasets, there’s no way of finding out the ancestry of a company – what its parent company is, what its subsidiaries are, and that’s essential if we’re to properly make use of this information, and similar information released by the government. Companies House bizarrely doesn’t hold this information, but the Office For National Statistics does, and it’s called the Inter Departmental Business Register. Although this contains a lot of information provided in confidence for statistical reasons, the relationships between companies isn’t confidential (it just isn’t stored in one place), so it would be perfectly feasible to release this information.
  • We should probably be explicit whether the figures should include VAT (I think the Windsor & Maidenhead ones don’t include it, but the GLA imply that theirs might).
  • Categorisation is going to be a tricky one to solve, as can be seen from the raw data for Windsor & Maidenhead – for example the Children’s Services Directorate is written as both Childrens Services & Children’s Services, and it’s not clear how this, or the subcateogries, ties into standard classifications for government spending, making comparison across authorities tricky.
  • I wonder what would be the downside to publishing the description details, even, potentially, the invoice itself. It’s probably FOI-able, after all.

As ever, comments welcome, and of course all the data is available through the API under an open licence.

C