Recently I obtained a portion of address layer 2 for Southwark and surrounding boroughs in order to georeference my patient data by household. Currently I have a postcode match with over 99% match between known postcodes and patient reported postcodes. Being able to locate patients by their address, rather than their postcode will allow me to begin to think about patients in terms of ‘households’ rather than as I have been currently doing as (somewhat atomised) individuals. A recent experiment I conducted on choice characteristics of patients which was an evolution of my CASA working paper showed that seeing people as individuals was faulty logic in the context of primary care uptake. Reason tells me that families and people living in the same household are more likely to go to the same GP than different ones, due to factors that might include social network effects, thus in order to test this I need to address geocode the patient list data to a finer standard than postcodes- households.
Addresslayer2 is a rich Ordnance Survey dataset that pinpoints the location of houses, commercial buildings and features of the built environment (such as post boxes). I have an extract of the national dataset covering Lambeth, Lewisham and Southwark which constitutes over 400,000 points of interest which are given an explicit location in space.
The difficulty inherent is to match up the reported patient address with the record in addresslayer2 in order to derive a location for each patient. Patients that overlay each other in space can then be aggregated to a ‘household’ for that location. Often the addresses of patients living in the same house are subtly different so I cannot simply group the list by the addresses given, such a method also says little about the unque location of each household. Thus I have chosen to address geocode allt he aptients first and then derive household information from the spatial component of the data.
Initially I tried using pre-existing geocoding software in both ArcGIS and Manifold GIS, but neither was able to provide a satisfactory result. So i set about doing it myself using the Python Programming language.
One of the main things that has helped me so far is Fuzzy String Matching using Levenshtein Distance. It is rare that a recorded address will exactly match (i.e. match by equality, x == y) an address in addresslayer2. Often differences between two address strings amount to very little, such as capitalisation, abbreviations, punctuation etc. So a fuzzy match can be obtained by comparing the similarity of 2 strings.
The Levenshtein distance computes a value that represents the number of edits (by way of insertion, deletion or substitution of characters) required to turn one string into another. I am using the following algorithm written in python to calculate this value:
def levenshtein(s1, s2): if len(s1) < len(s2): return levenshtein(s2, s1) if not s1: return len(s2) previous_row = xrange(len(s2) + 1) for i, c1 in enumerate(s1): current_row = [i + 1] for j, c2 in enumerate(s2): insertions = previous_row[j - 1] + 1 deletions = current_row[j] + 1 substitutions = previous_row[j] + (c1 != c2) current_row.append(min(insertions, deletions, substitutions)) previous_row = current_row
return previous_row[-1]
This algorithm is available, in this form and in many other languages, from wikibooks.
For 2 strings, calling the levenshtein function returns an integer which represents the number of edits required to make the 2 strings the same, however this value is meaningless without reference to the length of the string in the first place – 5 edits on a string 5 characters long suggests a completely different string, whereas 5 edits on a string 50 characters long suggests that 90% of the string was the same and that changes were actually minimal. Thus to get a relative ratio from the levenshtein distance I use the following code:
def ratio(s1,s2):
edit = levenshtein(s1,s2)
output = float(edit)/max(len(s1),len(s2))
return output
This is a simple comparison that I found here and is implemented in a python Levenshtein library here. This library is written in c and hence should be faster than the python implementation, unfortunately I couldn’t get it to compile properly in windows 7. Works really well in Linux though!
Having established the levenshtein and ratio functions, it is simply a case of matching candidate strings with known address strings. I’m using a threshold for similarity so the string has to been similar to a specified degree before it can be considered a match.
Using this algorithm in conjunction with some other basic string operations, strip(), isdigit(), is alnum() and find()/replace() etc. gives me a match rate of around 90%. This is reasonable, but because the dataset of patients is very large it still leaves around 30,000 people unmatched. My next move will be to start subsetting addresses and matching elements of them, and checking which pieces are not matched. This is particularly important with students and people living in social housing where a lot of the address information given is specific to particular subsets of social housing estates, but given to me as a single string. Dissaggregating this data will allow me to match each bit with individual fields in addresslayer2 relevant to social housing.
~ End Article and Begin Conversation ~
There are no comments yet...
~ Now It's Your Turn ~