Adding charity details using - Part I

Part I - add charity numbers

See part II is a site I've created to help access data about charities. One of the ways you can use it is to add details about a list of charities if you only know their names.

This is often called "reconciliation" - turning a list of names into something that unambiguously identifies the charity. Charities are often known by different names, or have variations like "Ltd" rather than "Limited", so by adding charity numbers we can make sure we're talking about the same charity whether we say "Comic Relief" or "Charity Projects".

In this post I'm going to show you how to use OpenRefine and to turn charity names into numbers. In the next part, I'll show how to use charity numbers to fetch extra information about a charity - like their web address or latest income.

1. Get Open Refine

OpenRefine is a great data cleaning/wrangling tool. It was originally started by Google (and so is sometimes still called Google Refine), but is now run by a team of volunteer developers.

You can find the download for your platform on the download page. Note that if you're a Windows user you won't "install" it in the same way you might for other programs - instead you extract the zip file into a directory and then run refine.bat from that directory.

Again, OpenRefine might be a bit different from other programs you've seen in that it doesn't open in it's own window - instead it has a "server" that runs in the background, and you interact with it through a web browser. The address you visit (normally http://localhost:3333) is only accessible through your computer though - other parts of the internet can't see it.

2. Load your data

When you first open OpenRefine you'll see a page that looks a bit like this:

Web page screenshot showing the OpenRefine screen for importing data

You can follow the instructions for importing your data. OpenRefine has some very flexible options for importing data. The important thing about your data is that it should be a "flat file" - a simple table with rows and columns. You might need to use another program (like Excel) to get your data in this format first.

I've chosen to use some data published by the Cheshire Community Foundation [xlsx - 57Kb] on the grants they've published. This data is published to the 360 Giving grant data standard and I found it in the 360 Giving data registry. It already includes charity numbers so we wouldn't need to do this reconciliation process, but it's useful to have them there to check how good the reconiliation process is.

I can point OpenRefine at this URL and it will download and load it itself. Here's what the data looks like once it's been imported. I've hidden the description field to get more rows on the screen.

Data loaded into OpenRefine

3. Reconcile charity names

We're going to be using the Recipient Org:Name field to reconcile - this contains the name of the organisation that received the grant. Click on the arrow next to the column title to bring up the menu, and click Reconcile > Start reconciling....

Location of the reconcile menu

Now you need to add the Find that Charity reconciliation service. This is a web service that will recieve data from OpenRefine (in this case the name of a charity) and tries to match it to an entity. It returns a list of possible matches, and allows us to extract the ID of each match (in this case the charity number).

Click Add Standard Service... and then paste in the textbox, and click Add Service. This will then (possibly after a short delay) add a service called "charitysearch".

Dialog box for adding a reconciliation service

You'll see a list with just "charity" on the left hand side (you might need to hide the sidebar). You can leave the default options (at the moment findthatcharity doesn't take any additional properties to help matching, but this might be something we add later).

Click Start Reconciling to make the magic happen. This might take a while if you have lots of records. OpenRefine will batch up charities and send a few at a time to findthatcharity, but the server can only cope with so many requests at a time. If you need to do lots of requests (>5,000 in one go), speak to me directly.

Completed options for running reconcilation

You'll see a counter with the percentage complete at the top, and when finished each field will show the matches that have been made.

Reconciliation results, showing a series of checkboxes for each entry with possible matches

4. Check the matches

You now need to check the matches. Depending on your time and the number of records, there's a few options for this.

Manually check every record

You can manually check each record - just click on the single tick next to the correct match. You can also click the box with two ticks if you'd like every instance of this record to be matched to this entity.

Match all records to the best result

If you trust the algorithm, or have done some manual checking and think they're all correct, you can just take the best match for each record. To do this, click the arrow menu next to the Recipient Org:Name column heading, and select Reconcile > Actions > Match each cell to its best candidate.

Filter and match

You can also combine these two approaches using filters. On the left hand side of the screen, OpenRefine has added two "facets" based on the results of the reconciliation.

The judgement facet at the top shows the overall result - either (blank) where no matches have been found, none where there are some matches or matched where a single match has been selected (some records are marked as matched already if findthatcharity thinks they are an exact match).

The best candidate's score facet shows the score of the best match candidate, and allows you to select different numbers by dragging the range sliders. You can also remove Error values: these are the ones with no match.

You can drag the score fact to select only those with high matching scores - this will filter to show only those rows with the selected scores. The Match each cell to its best candidate action also respects this filter, so you can use that to match all the records found by the filter, with the remainder remaining unmatched.

Results of combining the score filter with matching each cell to best candidate

5. Add charity number

Once you're happy with the matches you can add a new column with the matched charity number in. To do this, use the arrow menu next to the column heading, and click Edit column > Add column based on this column...

The box that pops up allows you to specify lots of different columns. Add a name for your column at the top ("Charity Number" would make sense), and then add in the "Expression" box. This will look at the cell reconcilation object, and find the id of the matched candidate.

Entering an expression for getting the matched cell id

Press OK to add this column.

That's it for part one - hopefully you've now got a charity number for all the charities in your data. In the next part I'll take a look at using the charity number to add more details about the charity, like it's web address or latest income.

Part II continues here...