The Jenks Optimal, or Jenks’ Natural Breaks, Algorithm is a common method for classifying data presented in a choropleth map. It aims to present a series of break values that best represent the actual breaks observed in the data as opposed to some arbitrary classificatory scheme (i.e. equal interval), in this way the actual clustering of data values is preserved (subject to the arbitrary specification of k classes). The method was originally published in George Jenks’ (1977) Optimal Data Classification for Choropleth Maps and reportedly represented the culmination of 15 years research on the topic, the method primarily derived from Walter Fisher’s work ‘On grouping for maximum homogeneity‘. The specifics of the algorithm aim to create k classes so that the variance within groups is minimised, as such it is a problem of numerical optimisation.
A paper by Michael Coulson (1987) entitled In The Matter Of Class Intervals For Choropleth Maps: With Particular Reference To The Work Of George F Jenks details a method that Jenks apparently authored, but never published, to derive how optimum the number of classes chosen was, the method Goodness of Variance Fit (GVF) works by taking the difference between the squared deviations from the array mean (SDAM) and the squared deviations from the class means (SDCM), and dividing by the SDAM. Thus:
GVF = (SDAM – SDCM)/SDAM
However, it is likely this was never published as the GVF improves as the number of classes increases, until at such a points as there are the same number of classes as data points, the GVF reaches unity. Nonetheless, I have included a rudimentary example for calculating this statistic. In reality, this method is used to generalise data into a few classes for visualisation, so you are unlikely to be using more than 7 (+/- 2) classes; number of classes can be loosely assigned by looking at the distribution histogram, but often this is difficult.
The script is here.
Acknowledgement: The initial script I used for the Python conversion can be found (in JAVA and Fortran) here: https://stat.ethz.ch/pipermail/r-sig-geo/2006-March/000811.html
June 17th, 2010
by Drew
Good work! I work for ESRI software development on the core geoprocessing team and was investigating goodness of variance fit classification using Python and this is right on the money! Your Python port is much faster than the version I scripted, so congratulations to the original programmer and you for your excellent port!
July 8th, 2010
by Dan
If it is of interest, I think this could be much optimised using Numpy arrays and some more intelligent programming – it would likely use less memory too. Might not get round to it though.
September 4th, 2010
by PROF. ALI AL-GAHMDI
GOOD WORK
I AM IN THE PROCESS OF WRITING UP A PAPER ABOUT DETERMINING THE OPTIMAL NUMBER OF CLASSES THROUGH A SIMPLE EMTHOD USING THE NATURAL BREAKS SCHEME.
THE QUESTION IS: HOW MANY CLASSES? IS AN ISSUE THAT HAS NOT BEEN ADDRESSED, EXCEPT BY MATHEW NORTH (2009) BUT NEEDS TO BE EMPLEMENTED. MY WORK IS YET MORE SIMPLER AND EFFECTIVE.
September 23rd, 2010
by Carson Farmer » Blog Archive » Playing around with classification algorithms: Python and QGIS
[...] function in the classInt package (ah open source!), as well as the handy Python script from here. Once I had the code in hand, it didn’t take long to have a nice Python script ready to be [...]
November 15th, 2011
by Doug Curl
This is really great. I was able to adapt your Python code for Javascript to make ArcGIS Server Javascript API chloropleth maps on-the-fly…I think successfully. The code is published here and, hope you don’t mind, but I also cited this page and your code: http://www.arcgis.com/home/item.html?id=0b633ff2f40d412995b8be377211c47b