Computing the geometric median in Python

.........................................................

I noticed in a beta of ArcGIS 10 (then called 9.4) that there was a ‘new’ option for computing a Geometric Median which didn’t exist in my copy of ArcGIS 9.3. This is an interesting concept, as in 1d statistics, the geometric (2d) mean is easy to calculate, being the average of all the X coords and all the Y coords. From stats we know that the Mean and Median value of a distribution will coincide if the data is perfectly normally distributed; however in the real world data usually will only approximate a normal distribution, leading to a mean value that is different from the midpoint, or median.  Therefore for a skewed distribution on the plane, we encounter a situation in which the mean is not necessarily the best representation of the ‘centre’ of the data, thus we may wish to calculate the median; doing so will also give us a good idea of the direction of the skew of the point pattern we are investigating. In calculating the median of a 2d point pattern we can express the problem as a need to:

minimise the sum of squared distances from all points in a distribution to a centre.

Thus it is reasonably clear that we are dealing with an ‘optimisation problem’, something that I have experimented with before in work I conducted using the ‘transportation problem’, a classic linear programming problem.

In terms of application, I though that finding the median of a distribution of people around a service would be a useful, albeit basic, indication of whether all people were making a similar trip to a service, or whether there were other effects at work (this would be evidenced by a median centre that was not close to the actual service location). I though I would be able to code the optimisation routine in Python using pre-existing insight. Notably, the wikipedia page on this details the Weiszfeld Algorithm as the acknowledged computational solution to the geometric median problem, it takes the form:

However, actually writing the algorithm proved somewhat tough. Essentially the answer is to start with a candidate data point (I started with the mean centre) and calculate the objective function – in this case the sum of the euclidian distances of all points from the candidate centre. Then pass the candidate point through the Weiszfeld Algortihm and reassess the objective function, at such a point as the objective function converges a median has been found. There is no guarantee that the median found is the optimal median though, and depending of the data there may be more than 1 optimal solution. Below is a solution for some of my data (the data has been randomly offset by 75m to preserve anonymity) on patient registrations to a doctor.

Here we can see that the mean and median centres are slightly different, suggesting that the patient population is skewed slightly northwards, most likely as a result of discontinuous urban infrastructure.

The scatterplot was achieved using the matplotlib Python plotting library. This was just a test, but I imagine more complex outputs can be achieved reasonably easily.

Notably, this technique is using euclidian distance, which in a dense urban environment may be misleading, I note that there is a relatively simple execution of the Dijkstra algorithm for shortest paths in Python, and I am curious whether this could be integrated to find a geometric median on the network, although I suspect that it may be unworkable due to computational time constraints, although for smaller problems it might be ok.

Naturally there are algorithms that can calculate a solution to the above for p-medians (i.e. several service centres in a population- commonly known as location-allocation), it is something that Paul Densham at UCL has worked on, and his code is making a return to service in ArcGIS version 10. I’m looking forward to seeing it, as it is a very difficult problem to solve (and in fact already has been ‘solved’), and not one I intend to investigate!

My code for the geometric median is here.

~ End Article and Begin Conversation ~

There are no comments yet...

~ Now It's Your Turn ~

Feel free to use <strong>, <em>, and <a href="">

[]

The Blogroll

Search this Site


[]