June 7th, 2010

*by* Daniel Lewis

## Jenks’ Natural Breaks Algorithm in Python

.........................................................

¶
The Jenks Optimal, or Jenks’ Natural Breaks, Algorithm is a common method for classifying data presented in a choropleth map. It aims to present a series of break values that best represent the actual breaks observed in the data as opposed to some arbitrary classificatory scheme (i.e. equal interval), in this way the actual clustering of data values is preserved (subject to the arbitrary specification of *k *classes). The method was originally published in George Jenks’ (1977) *Optimal Data Classification for Choropleth Maps* and reportedly represented the culmination of 15 years research on the topic, the method primarily derived from Walter Fisher’s work ‘*On grouping for maximum homogeneity*‘. The specifics of the algorithm aim to create *k *classes so that the variance within groups is minimised, as such it is a problem of numerical optimisation.

A paper by Michael Coulson (1987) entitled *In The Matter Of Class Intervals For Choropleth Maps: With Particular Reference To The Work Of George F Jenks *details a method that Jenks apparently authored, but never published, to derive how optimum the number of classes chosen was, the method Goodness of Variance Fit (GVF) works by taking the difference between the squared deviations from the array mean (SDAM) and the squared deviations from the class means (SDCM), and dividing by the SDAM. Thus:

GVF = (SDAM – SDCM)/SDAM

However, it is likely this was never published as the GVF improves as the number of classes increases, until at such a points as there are the same number of classes as data points, the GVF reaches unity. Nonetheless, I have included a rudimentary example for calculating this statistic. In reality, this method is used to generalise data into a few classes for visualisation, so you are unlikely to be using more than 7 (+/- 2) classes; number of classes can be loosely assigned by looking at the distribution histogram, but often this is difficult.

The script is here.

Acknowledgement: The initial script I used for the Python conversion can be found (in JAVA and Fortran) here: https://stat.ethz.ch/pipermail/r-sig-geo/2006-March/000811.html

June 17th, 2010

byDrewGood work! I work for ESRI software development on the core geoprocessing team and was investigating goodness of variance fit classification using Python and this is right on the money! Your Python port is much faster than the version I scripted, so congratulations to the original programmer and you for your excellent port!

July 8th, 2010

byDanIf it is of interest, I think this could be much optimised using Numpy arrays and some more intelligent programming – it would likely use less memory too. Might not get round to it though.

September 4th, 2010

byPROF. ALI AL-GAHMDIGOOD WORK

I AM IN THE PROCESS OF WRITING UP A PAPER ABOUT DETERMINING THE OPTIMAL NUMBER OF CLASSES THROUGH A SIMPLE EMTHOD USING THE NATURAL BREAKS SCHEME.

THE QUESTION IS: HOW MANY CLASSES? IS AN ISSUE THAT HAS NOT BEEN ADDRESSED, EXCEPT BY MATHEW NORTH (2009) BUT NEEDS TO BE EMPLEMENTED. MY WORK IS YET MORE SIMPLER AND EFFECTIVE.

September 23rd, 2010

byCarson Farmer » Blog Archive » Playing around with classification algorithms: Python and QGIS[...] function in the classInt package (ah open source!), as well as the handy Python script from here. Once I had the code in hand, it didn’t take long to have a nice Python script ready to be [...]

November 15th, 2011

byDoug CurlThis is really great. I was able to adapt your Python code for Javascript to make ArcGIS Server Javascript API chloropleth maps on-the-fly…I think successfully. The code is published here and, hope you don’t mind, but I also cited this page and your code: http://www.arcgis.com/home/item.html?id=0b633ff2f40d412995b8be377211c47b

November 29th, 2012

byMatt PerryDan,

I’ve found the code to work very well but performance becomes an issue with large data lists. Have you experimented any further with optimizing this code using numpy?

January 15th, 2013

byMaarten HilferinkThe algorithm presented here takes O(n^2*k) time with n = the number of values and k = the number of requested classes. I recently implemented an algorithm that takes O(n*log(n)*k) time. See: http://wiki.objectvision.nl/index.php/Jenks_Natural_Breaks_Classification

May 7th, 2013

byrame tarekalso can use least square method even he considered all value are equal weight .. just sugqest

August 1st, 2013

byMaarten HilferinkThe presented O(k*n^2) algorithm can be made faster by realising that mat1 is monotonously increasing in each row (a no crossing back-path property) and therefore the search space of left and right can be halved when a middle mat1 value is determined, which can be applied recursively to arrive at a O(k*n*log(n)) algorithm, see for code:

http://wiki.objectvision.nl/index.php/CalcNaturalBreaksCode

and for a proof of its validity:

http://wiki.objectvision.nl/index.php/Fisher%27s_Natural_Breaks_Classification

.

March 19th, 2014

byNoah HuntingtonHi Daniel! Great stuff here! I believe this will work perfectly with a standalone script I am working on. I am admittedly somewhat novice with python but can generally follow the usage and flow of a script. With that said I have a question regarding obtaining “datalist” variable. I am working with a slope raster where I am interested in classifying the “Value” from the cells. Any suggestions on obtaining the datalist? Cursor or otherwise?