<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Volunteered Geographic Information &#187; Modeling</title>
	<atom:link href="http://danieljlewis.org/category/modeling/feed/" rel="self" type="application/rss+xml" />
	<link>http://danieljlewis.org</link>
	<description>A Geography/GIS blog by Daniel J Lewis</description>
	<lastBuildDate>Tue, 20 Dec 2011 17:15:30 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.4-alpha-20124</generator>
		<item>
		<title>Weighted Mean Direction Surfaces in Python</title>
		<link>http://danieljlewis.org/2011/08/31/weighted-mean-direction-surfaces-in-python/</link>
		<comments>http://danieljlewis.org/2011/08/31/weighted-mean-direction-surfaces-in-python/#comments</comments>
		<pubDate>Wed, 31 Aug 2011 13:18:18 +0000</pubDate>
		<dc:creator>Daniel Lewis</dc:creator>
				<category><![CDATA[GIS]]></category>
		<category><![CDATA[Modeling]]></category>
		<category><![CDATA[Representation]]></category>
		<category><![CDATA[Southwark]]></category>
		<category><![CDATA[Brunsdon]]></category>
		<category><![CDATA[Charlton]]></category>
		<category><![CDATA[circular statistics]]></category>
		<category><![CDATA[mean direction]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[weighting]]></category>

		<guid isPermaLink="false">http://danieljlewis.org.blogs.splintdev.geog.ucl.ac.uk/?p=537</guid>
		<description><![CDATA[I work a lot with flows and spatial interactions, and one thing that I&#8217;ve wanted to do for a while is compute a mean flow direction surface. Unfortunately, arithmetic means don&#8217;t work for angular data, because they cannot account for the circular nature of angular measurements. For instance the angles 5 [...]]]></description>
			<content:encoded><![CDATA[
<p>I work a lot with flows and spatial interactions, and one thing that I&#8217;ve wanted to do for a while is compute a mean flow direction surface. Unfortunately, arithmetic means don&#8217;t work for angular data, because they cannot account for the circular nature of angular measurements. For instance, the angles 5 degrees and 355 degrees are separated by only 10 degrees, but their arithmetic mean is 180 degrees, which is way off: it should be 0 degrees!</p>
<p>Luckily, <a title="Local trend Statistics for Direction Data" href="http://leicester.academia.edu/ChrisBrunsdon/Papers/534394/Local_trend_statistics_for_directional_data--A_moving_window_approach">Brunsdon and Charlton</a> have published on this very subject, so I took it upon myself to implement a weighted circular mean function in Python. The key obstacle was learning about complex numbers, which, up until this point, I knew nothing about!</p>
<p>The first thing to do is calculate the angle between each candidate point (such as a person) and its allocated service (such as a medical centre). This is simple enough to do using the atan2 function in Python&#8217;s math module, and would look something like:</p>
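<p>As a quick check of that claim, the sketch below compares the arithmetic mean of 5 and 355 degrees with a mean taken from averaged unit-vector components. The helper name circular_mean_deg is made up for this illustration and is not part of any library.</p>
<pre>import math

def circular_mean_deg(angles_deg):
    """Mean direction in degrees, from summed unit-vector components."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c))

angles = [5.0, 355.0]
arithmetic = sum(angles) / len(angles)   # 180.0, pointing the wrong way
circular = circular_mean_deg(angles)     # 0 degrees, up to rounding error
</pre>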
<pre>import math</pre>
<pre>math.atan2((y2-y1),(x2-x1))</pre>
<p>Here the pair (x1,y1) is the location of the candidate point, and (x2,y2) is the location of the service allocated to that candidate point. The line linking these two points defines a flow from the candidate point to the service, and vice versa. Note that atan2 returns the angle in radians.</p>
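<p>For a whole table of flows, the same calculation can be done in one go with NumPy&#8217;s arctan2. This is only a sketch: the coordinates and the array names origins and services are invented for the example.</p>
<pre>import numpy as np

# Hypothetical coordinates: row i pairs a candidate point with its service
origins = np.array([[0.0, 0.0], [10.0, 10.0], [5.0, 0.0]])
services = np.array([[1.0, 1.0], [10.0, 20.0], [0.0, 0.0]])

dx = services[:, 0] - origins[:, 0]
dy = services[:, 1] - origins[:, 1]

# One angle per flow, in radians, measured anticlockwise from east
angles = np.arctan2(dy, dx)
</pre>
<p>The three example flows point north-east, north and west, so the angles come out as a quarter, a half and one times pi.</p>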
<p>Having calculated all of the angles, I used ArcGIS to create an output grid, at the extent of the candidate points, using the &#8220;fishnet&#8221; function which creates a vector grid of prespecified dimensions.</p>
<p>The beauty of Brunsdon and Charlton&#8217;s method is that it is a local method of approximation: for each cell in the output grid, a mean direction can be calculated from the values of nearby points, and applying a weighting means that more distant points have less effect on the mean direction.</p>
<p>Firstly, I read all the candidate points into a KDTree structure, which allows me to search for nearby points. At the same time I also create an array of the angles for those candidate points.</p>
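<pre>from scipy.spatial import cKDTree
import numpy as np

# treepoints: numpy array of (x, y) pairs for the candidate points
tree = cKDTree(treepoints)
# testpoint: one grid cell location, passed as a 2D array with a single row
res, idx = tree.query(testpoint, k=300000, eps=0, p=2, distance_upper_bound=100)
res = res[0][np.where(res[0] &lt; np.inf)[0]]
idx = idx[0][:len(res)]</pre>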
<p>The tree takes a numpy array of coordinate pairs, and the query method returns an array of distances to points (res) and their index values in the original array of coordinates (idx). The testpoint is a cell in the vector grid; 300000 is k, the number of nearest neighbours to find, which I have simply set arbitrarily high in the context of my dataset; 0 is the tolerance for approximate nearest neighbours, so here I&#8217;ve specified an exact search; 2 indicates the use of Euclidean distance; and 100 is the distance threshold, so neighbours further than 100 metres away won&#8217;t be returned. Neighbour slots that cannot be filled within 100m come back with a distance of Inf, so the penultimate line simply shortens the array to just those distances which are finite (i.e. within 100m).</p>
<p>The next step is to actually compute the mean direction, which requires a special approach using complex numbers. Brunsdon and Charlton show that a direction can be stated as a complex number <em>z</em> in which <em>z = exp(iθ)</em>, which is equivalent to <em>z = cos(θ) + i sin(θ)</em>, where <em>i</em> is the imaginary unit. We can restate our directions in Python using:</p>
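<p>If only the neighbours within a fixed radius are wanted, cKDTree also offers query_ball_point, which avoids having to pick an arbitrarily large k. The sketch below is an alternative under that assumption, not the approach used in my script, and the toy coordinates are invented.</p>
<pre>import numpy as np
from scipy.spatial import cKDTree

# Made-up candidate points and a single grid cell location
treepoints = np.array([[0.0, 0.0], [30.0, 40.0], [300.0, 0.0], [0.0, 99.0]])
testpoint = np.array([0.0, 0.0])

tree = cKDTree(treepoints)

# Indices of every point within 100 m of the test point
idx = np.array(sorted(tree.query_ball_point(testpoint, r=100.0)))

# query_ball_point returns no distances, so compute them directly
res = np.linalg.norm(treepoints[idx] - testpoint, axis=1)
</pre>
<p>Unlike query, this returns indices in no particular distance order, which does not matter for a weighted mean.</p>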
<pre># angles holds one direction (in radians) per tree point, in tree order
thetas = angles[idx]
cThetas = []
for i in range(len(thetas)):
    # z = cos(theta) + i sin(theta)
    cThetas.append(complex(np.cos(thetas[i]), np.sin(thetas[i])))
cThetas = np.array(cThetas)</pre>
<p>Here, the complex function builds the complex number representing each angle, which is stored in a list that I then convert (lazily) to a numpy array. The first term, thetas, uses the idx array from the cKDTree to pick out the relevant records from the angles array, which stores the angle values in the same order as the points in the cKDTree.</p>
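<p>A less lazy route, sketched here with made-up angles, is to let numpy apply <em>z = exp(iθ)</em> to the whole array at once. The variable looped exists only to show that the two routes agree.</p>
<pre>import numpy as np

thetas = np.array([0.0, np.pi / 2, np.pi])   # example angles in radians

# Vectorised: one complex unit vector per angle
cThetas = np.exp(1j * thetas)

# The element-by-element construction, for comparison
looped = np.array([complex(np.cos(t), np.sin(t)) for t in thetas])
</pre>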
<p>Next a temporary variable is created which calculates the mean direction:</p>
<pre>temp = np.sum(cThetas)/np.absolute(np.sum(cThetas))
MeanDir = np.angle(temp, deg = True)</pre>
<p>The mean direction is given by the argument (Arg) of the resultant complex number. NumPy implements this with the np.angle function, where deg = True returns the angle in degrees and deg = False in radians.</p>
<p>So far this is the unweighted mean, aggregating directional observations within a 100m disk (see also: uniform disk smoothing). To introduce weighting we must first define a weighting scheme. I&#8217;ve used the one suggested by Brunsdon and Charlton, which is Gaussian, and might look a bit like this:</p>
<pre>def gaussW(dists,band):
    # Gaussian kernel weights: exp(-d^2 / (2 * band^2)) for each distance d
    out = np.zeros(dists.shape)
    for i in range(len(out)):
        temp = np.power(dists[i],2)/(2.0*np.power(float(band),2))
        out[i] = np.exp(-1.0 * temp)
    return out

weight = gaussW(res,100)</pre>
<p>Quite simply, I pass the distance array res to the gaussW function and it gives me back an array of weights for that ordering of distances. Using this I can redo the mean direction thus:</p>
<pre>temp = np.sum(weight*cThetas)/np.absolute(np.sum(weight*cThetas))
MeanDir = np.angle(temp, deg = True)</pre>
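<p>Putting the pieces together, the whole calculation for one grid cell fits in a small function. This is a sketch rather than the attached script: the name weighted_mean_direction and the example values are invented, and it skips the division by the absolute value because the argument of a complex number does not depend on its length.</p>
<pre>import numpy as np

def weighted_mean_direction(thetas, dists, band):
    """Gaussian-weighted circular mean of thetas (radians), in degrees."""
    weights = np.exp(-np.square(dists) / (2.0 * float(band) ** 2))
    resultant = np.sum(weights * np.exp(1j * thetas))
    return np.angle(resultant, deg=True)

thetas = np.radians([5.0, 355.0, 90.0])

# All three observations equally close: the northward flow pulls the mean up
near = weighted_mean_direction(thetas, np.array([10.0, 10.0, 10.0]), band=100)

# Northward flow far away: its weight is negligible and the mean points east
far = weighted_mean_direction(thetas, np.array([10.0, 10.0, 1000.0]), band=100)
</pre>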
<p>There you have it! Attached is the script I used. Obviously, Brunsdon and Charlton implement a variance and a couple of visualisation devices, but these should be simple enough to implement now!</p>
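<p>As a starting point for the variance, here is a sketch of the usual circular variance, one minus the length of the weighted mean resultant vector. I am assuming this textbook definition; the version in Brunsdon and Charlton&#8217;s paper may differ in detail, and the function name is my own.</p>
<pre>import numpy as np

def circular_variance(thetas, weights):
    """1 - R, where R is the length of the weighted mean resultant vector."""
    z = np.exp(1j * np.asarray(thetas, dtype=float))
    w = np.asarray(weights, dtype=float)
    R = np.abs(np.sum(w * z)) / np.sum(w)
    return 1.0 - R

aligned = circular_variance([0.3, 0.3, 0.3], [1, 2, 3])   # all flows agree
opposed = circular_variance([0.0, np.pi], [1, 1])         # flows cancel out
</pre>
<p>Values near 0 mean the nearby flows all point the same way, and values near 1 mean they have no common direction.</p>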
<p>I created an output for flows of patients to GPs in Southwark, visualised using one of ESRI&#8217;s circular/direction colour ramps from <a title="Mapping Resources" href="http://mappingcenter.esri.com/index.cfm?fa=arcgisResources.gateway">colour ramp pack 2</a>. I&#8217;m not sure how best to present the legend at this point though. NB: 90 is North, -90 is South, 0 is East and 180/-180 is West. The map is classified to show the 4 cardinal directions, but the output is in fact continuous.</p>
<p style="text-align: left"><a href="http://danieljlewis.org/files/2011/08/MeanDirectionFlows.png"><img class="aligncenter size-large wp-image-538" src="http://danieljlewis.org/files/2011/08/MeanDirectionFlows-724x1024.png" alt="" width="434" height="614" /></a>My example script is <a href="http://danieljlewis.org/files/2011/08/meanDirection.txt">here</a>. Note that I am using dbfpy to read and write to shapefile DBF tables directly.</p>
]]></content:encoded>
			<wfw:commentRss>http://danieljlewis.org/2011/08/31/weighted-mean-direction-surfaces-in-python/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Network Population Density for Southwark</title>
		<link>http://danieljlewis.org/2011/05/03/network-population-density-for-southwark/</link>
		<comments>http://danieljlewis.org/2011/05/03/network-population-density-for-southwark/#comments</comments>
		<pubDate>Tue, 03 May 2011 02:29:45 +0000</pubDate>
		<dc:creator>Daniel Lewis</dc:creator>
				<category><![CDATA[Modeling]]></category>
		<category><![CDATA[Representation]]></category>
		<category><![CDATA[density]]></category>
		<category><![CDATA[network]]></category>
		<category><![CDATA[population]]></category>
		<category><![CDATA[sanet]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://danieljlewis.org.blogs.splintdev.geog.ucl.ac.uk/?p=517</guid>
		<description><![CDATA[Using the excellent SANET extension for ArcGIS 9.3 I was able to take some of my data for Southwark that I had geocoded to address level, and estimate the population density using the OS Mastermap ITN product. The procedure is essentially a Kernel Density Estimation that takes place on a given network rather than across 2D space, this effectively controls [...]]]></description>
			<content:encoded><![CDATA[
<p>Using the excellent <a title="SANET Website" href="http://sanet.csis.u-tokyo.ac.jp/">SANET</a> extension for ArcGIS 9.3 I was able to take some of my data for Southwark that I had geocoded to address level, and estimate the population density using the OS MasterMap ITN product. The procedure is essentially a Kernel Density Estimation that takes place on a given network rather than across 2D space. This effectively controls for the effect of spatial structure, such as urban form, on data which relate to residential locations. The estimation is made for c.300,000 people in Southwark on a network with around 30,000 road segments, so it is to be expected that the calculation takes several hours to run. The KDE process is parameterised in much the same way as the straightforward density estimation procedures in the ArcGIS Spatial Analyst toolboxes: bandwidth and cell size are specified. In this case, though, cell size relates to the length of the segments into which the network has to be cut in order to represent the output. Additionally, SANET allows you to control how road intersections are handled, by using either a continuous or a discontinuous approach to the bifurcation. I arbitrarily chose the continuous approach, essentially meaning that the density estimation can turn corners. A straightforward representation can be made in 2D as below.</p>
<p style="text-align: center"><a href="http://danieljlewis.org/files/2011/05/SouthwarkNetworkDensitySanet.png"><img class="aligncenter size-large wp-image-518" src="http://danieljlewis.org/files/2011/05/SouthwarkNetworkDensitySanet-791x1024.png" alt="" width="428" height="553" /></a></p>
<p style="text-align: left">The interesting aspect of this image, which is obscured in 2D smoothed representations, is the relative usage of different streets. Clearly visible are the residential streets, as distinct from the more commercial area on Southwark&#8217;s Bankside and along major roads, and the effect of open space and water features in reducing network density (i.e. where only one side of a road has residences on it). I&#8217;ve attempted to explore this further by using ArcScene&#8217;s 3D visualisation capabilities, but the complexity of the data makes this an incredibly arduous process. The result I was able to obtain, on the occasions when ArcScene didn&#8217;t simply crash, is below.</p>
<p style="text-align: center"><a href="http://danieljlewis.org/files/2011/05/testhigherres.png"><img class="aligncenter size-large wp-image-521" src="http://danieljlewis.org/files/2011/05/testhigherres-1024x610.png" alt="" width="553" height="329" /></a></p>
<p style="text-align: left">In this example, Southwark is presented in a kind of 2.5D perspective in which the streets have been extruded so that their height represents the population density at that point. I&#8217;ve included some contextual elements: the Thames, parks, wooded areas, and other water features. Whether or not this image is in any way an improvement over a simple 2D representation is open to debate, but the selections below do present an interesting cross section of the data.</p>
<p style="text-align: left"><a href="http://danieljlewis.org/files/2011/05/SelectionSanetSwk.png"><img class="aligncenter size-medium wp-image-522" src="http://danieljlewis.org/files/2011/05/SelectionSanetSwk-300x178.png" alt="" width="300" height="178" /></a><a href="http://danieljlewis.org/files/2011/05/SelectionSanetSwk2.png"><img class="aligncenter size-medium wp-image-523" src="http://danieljlewis.org/files/2011/05/SelectionSanetSwk2-300x178.png" alt="" width="300" height="178" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://danieljlewis.org/2011/05/03/network-population-density-for-southwark/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>&#8216;Compactness&#8217; in Zoning: the circle as the ideal.</title>
		<link>http://danieljlewis.org/2011/02/26/compactness-in-zoning-the-circle-as-the-ideal/</link>
		<comments>http://danieljlewis.org/2011/02/26/compactness-in-zoning-the-circle-as-the-ideal/#comments</comments>
		<pubDate>Sat, 26 Feb 2011 06:17:26 +0000</pubDate>
		<dc:creator>Daniel Lewis</dc:creator>
				<category><![CDATA[GIS]]></category>
		<category><![CDATA[Modeling]]></category>
		<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[algorithm]]></category>
		<category><![CDATA[anisotropy]]></category>
		<category><![CDATA[christaller]]></category>
		<category><![CDATA[circles]]></category>
		<category><![CDATA[isotropy]]></category>
		<category><![CDATA[Representation]]></category>
		<category><![CDATA[zones]]></category>

		<guid isPermaLink="false">http://danieljlewis.org.blogs.splintdev.geog.ucl.ac.uk/?p=509</guid>
		<description><![CDATA[I saw a thought provoking presentation recently, given by Wenwen Li of the University of California Santa Barbara, the talk was a wide ranging insight into Cyber Infrastructure, its uses for geospatial information, and some of the computational techniques that underpinned the project. One element of the project involved zone design for the greater Los [...]]]></description>
			<content:encoded><![CDATA[
<p>I saw a thought-provoking presentation recently, given by <a title="Wenwen Li" href="http://www.geog.ucsb.edu/~wenwen/" target="_blank">Wenwen Li</a> of the University of California, Santa Barbara. The talk was a wide-ranging insight into Cyber Infrastructure, its uses for geospatial information, and some of the computational techniques that underpinned the project. One element of the project involved zone design for the greater Los Angeles region, implementing an algorithm intended to aggregate small areal units into larger zones whilst meeting a number of conditions; principal among these conditions was &#8216;compactness&#8217;. The output looked very much like a single hierarchy of Christaller hexagons, and this got me thinking about the nature of space and compactness.</p>
<p style="text-align: center"><a href="http://danieljlewis.org/files/2011/02/centralplace.gif"><img class="aligncenter size-full wp-image-510" src="http://danieljlewis.org/files/2011/02/centralplace.gif" alt="From: http://watd.wuthering-heights.co.uk/mainpages/sustainability.html" width="440" height="307" /></a></p>
<p style="text-align: left">Christaller&#8217;s hexagons are the defining illustration of &#8216;central place theory&#8217;, a geographical abstraction that idealises settlement patterns based upon an underlying space which is assumed to be isotropic. The assumption of spatial isotropy is the big leap in this model: it assumes that the &#8216;friction of distance&#8217; from any given point increases at an equal rate whichever way you go from that point. Clearly such an assumption is not applicable to Los Angeles, where huge freeways and interchanges can make adjacent parcels of land remote neighbours, and increase the connection between advantageously placed non-adjacent sites. Surely a city characterised by sprawl and ribbon development, as well as segregated communities, should be modeled differently? Why then do many of our zoning algorithms favour compact &#8216;circular&#8217; shapes, very much in the Christaller mould, and why do we reject uncompact areal features as ugly slivers? In short, how did the circle come to be the ideal shape of zone in regional studies? Certainly it is easier, both in implementation and conceptually, to model circles than to optimise a zone system over an <em>n</em> zone by <em>n</em> zone similarity matrix of the variables which may be important in aggregating a set of areal units. However, as we explore more and more the complex systems defined by cities and regions, surely there is a need to start integrating a more realistic, anisotropic view of space, one in which the friction of distance from any given point in any given direction is defined by the underlying demography, built environment and/or infrastructure.</p>
<p style="text-align: left">One such attempt, <a title="Amoeba publication" href="http://onlinelibrary.wiley.com/doi/10.1111/j.1538-4632.2006.00689.x/full" target="_blank">AMOEBA</a> (A Multidirectional Optimum Ecotope-Based Algorithm), developed by Aldstadt and Getis, is worth noting. In this algorithm, zones are defined via the Getis-Ord Gi* statistic, a local statistic for identifying clustering. Zones are thus defined by local conditions, which are free to vary anisotropically across space, rather than by a predefined preference for circles. Even better, this algorithm is implemented in the superb <a title="clusterpy" href="http://www.rise-group.org/risem/clusterpy/clusterPy-module.html" target="_blank">clusterpy</a> Python module for spatially constrained clustering.</p>
]]></content:encoded>
			<wfw:commentRss>http://danieljlewis.org/2011/02/26/compactness-in-zoning-the-circle-as-the-ideal/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Mapping Spatial Entropy in Southwark</title>
		<link>http://danieljlewis.org/2011/02/03/mapping-spatial-entropy-in-southwark/</link>
		<comments>http://danieljlewis.org/2011/02/03/mapping-spatial-entropy-in-southwark/#comments</comments>
		<pubDate>Thu, 03 Feb 2011 01:32:16 +0000</pubDate>
		<dc:creator>Daniel Lewis</dc:creator>
				<category><![CDATA[GIS]]></category>
		<category><![CDATA[Modeling]]></category>
		<category><![CDATA[Southwark]]></category>
		<category><![CDATA[ethnicity]]></category>
		<category><![CDATA[mapping]]></category>
		<category><![CDATA[rasters]]></category>
		<category><![CDATA[segregation]]></category>
		<category><![CDATA[spatially weighted entropy]]></category>

		<guid isPermaLink="false">http://danieljlewis.org.blogs.splintdev.geog.ucl.ac.uk/?p=496</guid>
		<description><![CDATA[I&#8217;ve been doing a bit of work recently on segregation with Pablo Mateos, and having gone through the motions with aspatial indices of segregation (the classics): dissimilarity, exposure and so on, I decided to investigate the more explicitly spatial ones. Taking a lead from Reardon and O&#8217;Sullivan&#8217;s (2004) paper &#8220;Measures of Spatial Segregation&#8221; in sociological methodology, I [...]]]></description>
			<content:encoded><![CDATA[
<p>I&#8217;ve been doing a bit of work recently on segregation with <a title="Pablo Mateos" href="http://www.geog.ucl.ac.uk/about-the-department/people/academics/pablo-mateos" target="_blank">Pablo Mateos</a>, and having gone through the motions with the classic aspatial indices of segregation (dissimilarity, exposure and so on), I decided to investigate the more explicitly spatial ones. Taking a lead from Reardon and O&#8217;Sullivan&#8217;s (2004) paper &#8220;Measures of Spatial Segregation&#8221; in <em>Sociological Methodology</em>, I got in touch with <a title="David O'Sullivan" href="http://web.env.auckland.ac.nz/people_profiles/osullivan_d/" target="_blank">David O&#8217;Sullivan</a>, and he and his student Seong-Yun Hong helped me with the implementation of some spatial measures of segregation. This post specifically concerns spatially weighted entropy &#8211; a measure of population diversity. Reardon and O&#8217;Sullivan define spatially weighted entropy as:</p>
<p style="text-align: center"><a href="http://danieljlewis.org/files/2011/02/SpatiallyWeightedEntropy.gif"><img class="aligncenter size-full wp-image-497" src="http://danieljlewis.org/files/2011/02/SpatiallyWeightedEntropy.gif" alt="" width="350" height="94" /></a></p>
<p style="text-align: left">This equation describes the &#8216;entropy&#8217;, derived from Shannon&#8217;s information theory, for each grid cell in the image (below), in which each cell value results from the entropy computed for a 1km &#8216;neighbourhood&#8217; <em>p</em> around each cell (essentially a circular buffer). The ethnic group in question is given by <em>m</em>, with π representing the proportion of a given group in a given neighbourhood. The ethnic groups are defined from the Southwark patient register using Onomap, and are: African, East Asian and Pacific, European, Muslim, South Asian, British, Eastern European, Hispanic, and Unclassified or Other. The Onomap software applies this classification by looking at the forename and surname combination of patients registered to use Southwark GPs, or patients living in Southwark but using GPs outside of Southwark. The cells in the image relate directly to the residential locations of patients, who were geocoded to their household using the Ordnance Survey&#8217;s Address Layer 2 product. Empty cells are therefore areas within which no recorded patients were found, such as parks and transport infrastructure. As the underlying data come from patient registrations with GPs, we have to accept that they are likely to be partial, with potentially systematic biases in who has registered: young men, and people from countries where GPs do not exist as a method of primary care, may have been omitted.</p>
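<p>To make the calculation concrete, here is a rough sketch for a single cell. It rests on assumptions of mine rather than on the exact equation in the image: every cell inside the 1km buffer counts equally, cells outside count for nothing, and the entropy is scaled by the log of the number of groups so that it runs from 0 to 1. The function name and the toy counts are invented.</p>
<pre>import numpy as np

def local_entropy(counts, dists, radius=1000.0):
    """Entropy of group proportions among cells within radius of one cell.

    counts: (n_cells, n_groups) population counts per cell and group
    dists:  (n_cells,) distance in metres from the focal cell to each cell
    Assumes at least one person lives inside the buffer.
    """
    counts = np.asarray(counts, dtype=float)
    inside = np.less_equal(np.asarray(dists, dtype=float), radius)
    totals = counts[inside].sum(axis=0)
    props = totals / totals.sum()
    props = props[props != 0]              # treat 0 * log(0) as 0
    n_groups = counts.shape[1]
    return float(-np.sum(props * np.log(props)) / np.log(n_groups))

counts = [[10, 10], [5, 5], [100, 0]]      # made-up counts for two groups
dists = [0.0, 500.0, 2000.0]               # third cell lies outside the buffer
mixed = local_entropy(counts, dists)                          # 1.0
single = local_entropy([[7, 0], [3, 0]], [0.0, 100.0])        # 0.0
</pre>
<p>An evenly mixed neighbourhood scores 1 and a neighbourhood containing a single group scores 0.</p>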
<p style="text-align: center"><a href="http://danieljlewis.org/files/2011/02/SwkEntropyMap.png"><img class="aligncenter size-large wp-image-500" src="http://danieljlewis.org/files/2011/02/SwkEntropyMap-791x1024.png" alt="" width="475" height="614" /></a></p>
<p>In the image, higher values of entropy indicate greater diversity of population by ethnic group. The resultant image is unsurprising in terms of Southwark, with the Dulwich Village area showing as the least diverse place, home as it is to more affluent, generally &#8216;British&#8217; groups. Likewise, historical factors regarding access to housing have shaped the lower entropy scores in the middle of the borough, home to African populations, and in the north-east, home to the British working classes who were rehoused from the now more African areas in the middle of the borough. Finally, the greater Waterloo to Elephant and Castle region in the north-west shows up as the ethnic melting pot of the borough.</p>
<p>In the image above, the 1km neighbourhood defined in the spatially weighted entropy score has a smoothing effect. I experimented with smaller values for the neighbourhood size, and found that the resultant output did not change dramatically from that obtained above. At the end of the day, the selection of neighbourhood size is largely arbitrary and will depend on sociocultural factors of the area and its people. Similarly, as there is no data for the regions outside of Southwark, we are more uncertain of the values at the edges than in the middle of the borough, as we are only sampling from within Southwark itself. Nonetheless, this representation of Southwark goes somewhat beyond what is possible using the commonly used output zones defined by the census.</p>
]]></content:encoded>
			<wfw:commentRss>http://danieljlewis.org/2011/02/03/mapping-spatial-entropy-in-southwark/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Finnish Municipalities: A case for zone design?</title>
		<link>http://danieljlewis.org/2010/11/30/finnish-municipalities-a-case-for-zone-design/</link>
		<comments>http://danieljlewis.org/2010/11/30/finnish-municipalities-a-case-for-zone-design/#comments</comments>
		<pubDate>Tue, 30 Nov 2010 20:05:39 +0000</pubDate>
		<dc:creator>Daniel Lewis</dc:creator>
				<category><![CDATA[Geography]]></category>
		<category><![CDATA[GIS]]></category>
		<category><![CDATA[Modeling]]></category>
		<category><![CDATA[algorithm]]></category>
		<category><![CDATA[finland]]></category>
		<category><![CDATA[Maps]]></category>
		<category><![CDATA[municipal]]></category>
		<category><![CDATA[pysal]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[zone design]]></category>

		<guid isPermaLink="false">http://danieljlewis.org.blogs.splintdev.geog.ucl.ac.uk/?p=438</guid>
		<description><![CDATA[In Finland, municipalities are incredibly powerful; like local authorities in the UK, municipalities are responsible for local administration, but they also levy an income tax and are responsible for providing most public services. Municipalities were founded on the assumption of equality, which forms the basis for the reform considerations currently ongoing in the Finnish government. [...]]]></description>
			<content:encoded><![CDATA[
<p>In Finland, municipalities are incredibly powerful; like local authorities in the UK, municipalities are responsible for local administration, but they also levy an income tax and are responsible for providing most public services. Municipalities were founded on the assumption of equality, which forms the basis for the reform considerations currently ongoing in the Finnish government. The fact is, municipalities simply aren&#8217;t equal: they vary wildly in population size and area. Numbering 342, they are of rough numerical equivalence to UK local authorities, despite Finland itself having around 1/11th of the UK&#8217;s population. The population skew is greatly emphasised by the presence of cities such as Helsinki which cross municipal boundaries, and more remote municipalities whose geographical extent was set out by horse-and-cart distance. It is therefore understandable that the Finnish government would be interested in the possibility of mergers to reduce the number of municipalities, and create a system in which municipalities serve a similar number, or at least a threshold, population. It is believed that this would make administration more efficient, as services can be centralised to a greater degree, and small municipalities which already share services can formalise this.</p>
<p>Some mergers have already taken place, and there are governmental incentives for merging. However, given the model specified by the Association of Finnish Local and Regional Authorities (<em>Kuntaliitto</em>), that municipalities should be reformed such that they have a base population of 20-30,000 people, we can apply automated zone design scenarios to test the &#8216;what if&#8217; aspect of creating a new Finnish municipal system based on preserving different characteristics.</p>
<p>The zone design tool that I use to test a few basic scenarios is the regionalization library in PySAL, a spatial analysis module for Python. This implements the max-p algorithm for spatially constrained clustering subject to a similarity matrix and a threshold value. I specified the research design so that I was testing for the optimal new aggregation of the pre-existing municipalities, and tested 3 different scenarios: 1) no similarity measure (all municipalities assumed equal, but for population); 2) similarity based on municipal tax regime; and 3) similarity based upon municipal tax regime and % non-Finnish speakers, which accounts culturally for the Sami people of Lapland and ethnically Swedish Finns.</p>
<p>The regionalisation requires that you create a contiguity matrix for the zones. I arbitrarily chose the queen case, and added bespoke contiguity for Finnish islands based upon proximity. This is easy to do in GeoDa, which outputs a .gal file that you can read into Python. Then all you really need to do is the following:</p>
<pre>import pysal
import numpy as np #required as pysal uses numpy arrays

#Read in your population and similarity data in some way,
#I tend to create a python list from a csv.

#convert the population and similarity data into
#numpy arrays, from lists called pop and sim
pop = np.asarray(pop)
sim = np.asarray(sim)

# Read in your precomputed Weights matrix
w = pysal.open("...\\QueenWeights.gal").read()

# Create an (optional) array of 1s to represent equality
# (replace sim in Maxp function call)
nosim = np.ones((342,1))

# Run solutions for maxp algorithms with specific parameters
r = pysal.Maxp(w, sim, floor=20000, floor_variable=pop, initial=100)

# Write r.regions to outfile to get regional assignments,
#this can be joined to shp in ArcGIS.
</pre>
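<p>The last step in the listing, writing the regional assignments out for a join, can be sketched as follows. This is a minimal sketch that assumes <em>r.regions</em> is a list of lists of area IDs, one inner list per region; the helper names and the example data are made up for illustration and are not part of pySAL.</p>
<pre>import csv
import io

def region_assignments(regions):
    """Map each area ID to the index of the region that contains it."""
    return {area: idx for idx, members in enumerate(regions) for area in members}

def write_assignments(regions, handle):
    """Write one 'area_id,region' row per area to an open text file handle."""
    writer = csv.writer(handle)
    writer.writerow(["area_id", "region"])
    for area, idx in sorted(region_assignments(regions).items()):
        writer.writerow([area, idx])

# Made-up stand-in for r.regions: three regions covering six areas.
example_regions = [[0, 3], [1, 2, 5], [4]]
buffer = io.StringIO()
write_assignments(example_regions, buffer)
print(buffer.getvalue())
</pre>
<p>In practice the handle would be a file opened for writing rather than an in-memory buffer, and the resulting table can be joined to the shapefile on the area ID.</p>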
<p>The max-P algorithm works by first randomly creating a set of possible zoning configurations; it then chooses the best of these and seeks to refine it using a computationally expensive zone-swapping method. Optimality in this case is defined by minimising dissimilarity whilst obeying the threshold population constraint. The <a title="pySAL Documentation" href="http://www.pysal.org/users/tutorials/region.html" target="_blank">API reference</a> for regionalisation in pySAL is very good.</p>
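<p>To make the notion of optimality concrete, here is a small sketch that scores two candidate zonings. It assumes dissimilarity is measured as the within-region sum of squared deviations, and it ignores the contiguity requirement that the real algorithm also enforces. The function names and the six-area data set are invented for illustration.</p>
<pre>def within_region_ss(regions, values):
    """Sum of squared deviations of each area's value from its region mean."""
    total = 0.0
    for members in regions:
        mean = sum(values[a] for a in members) / len(members)
        total += sum((values[a] - mean) ** 2 for a in members)
    return total

def meets_floor(regions, population, floor):
    """True when every region's total population reaches the floor."""
    return all(sum(population[a] for a in members) >= floor for members in regions)

# Made-up data: a tax rate and a population for six areas.
tax = [18.0, 19.5, 19.0, 18.5, 21.0, 20.0]
pop = [12000, 9000, 15000, 8500, 30000, 7000]

candidate_a = [[0, 3], [1, 2, 5], [4]]
candidate_b = [[0, 1], [2, 3, 5], [4]]

for name, regions in (("a", candidate_a), ("b", candidate_b)):
    print(name, meets_floor(regions, pop, 20000), within_region_ss(regions, tax))
</pre>
<p>Both candidates satisfy the 20,000 floor, so the one with the lower within-region score would be preferred.</p>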
<p>Here are some of the results I produced for this basic approach:</p>
<div id="attachment_453" class="wp-caption aligncenter" style="width: 271px"><a href="http://danieljlewis.org/files/2010/11/NoH20000.png"><img class="size-large wp-image-453 " src="http://danieljlewis.org/files/2010/11/NoH20000-725x1024.png" alt="" width="261" height="368" /></a><p class="wp-caption-text">Zone Design - Pop &gt; 20,000 with assumption of Municipal Equality</p></div>
<div id="attachment_454" class="wp-caption aligncenter" style="width: 271px"><a href="http://danieljlewis.org/files/2010/11/TaxLang20000.png"><img class="size-large wp-image-454 " src="http://danieljlewis.org/files/2010/11/TaxLang20000-725x1024.png" alt="" width="261" height="368" /></a><p class="wp-caption-text">Zone Design - Pop &gt; 20,000 with similarity of tax regime and language</p></div>
<p>Unlike some of the more advanced zone-design algorithms, pySAL doesn&#8217;t yet provide a way of preserving or optimising area shape characteristics, so you can get sliver-like polygons forming. Nonetheless it presents an interesting insight and a set of functions from which a more advanced/bespoke algorithm could be built. As it turns out, the islands that I arbitrarily allocated to the contiguity matrix are actually quite a contentious topic and, given their strategic significance to Sweden, are in fact neutral territories which would be untouched by any redesign of municipal structure.</p>
<p>As ever, local knowledge is important, and for the economists at <a title="VATT" href="http://www.vatt.fi/en/" target="_blank">VATT</a> this is a real task to undertake. They won&#8217;t be doing anything quite so crude; they do, however, have a curious spatial problem to deal with.</p>
]]></content:encoded>
			<wfw:commentRss>http://danieljlewis.org/2010/11/30/finnish-municipalities-a-case-for-zone-design/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Household Types, Combinatorial Problems and Pure Maths</title>
		<link>http://danieljlewis.org/2010/07/15/household-types-combinatorial-problems-and-pure-maths/</link>
		<comments>http://danieljlewis.org/2010/07/15/household-types-combinatorial-problems-and-pure-maths/#comments</comments>
		<pubDate>Thu, 15 Jul 2010 18:17:07 +0000</pubDate>
		<dc:creator>Daniel Lewis</dc:creator>
				<category><![CDATA[Modeling]]></category>
		<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[combinatorics]]></category>
		<category><![CDATA[functions]]></category>
		<category><![CDATA[households]]></category>

		<guid isPermaLink="false">http://danieljlewis.org/?p=369</guid>
		<description><![CDATA[In some of the work I&#8217;m currently doing looking at households as derived from the Southwark patient register I wanted to go beyond a quantification of how many people lived in a households &#8211; a rudimentary household size, to looking at the composition of a household and hence what type of household it represented. In [...]]]></description>
			<content:encoded><![CDATA[
<p>In some of the work I&#8217;m currently doing looking at households as derived from the Southwark patient register, I wanted to go beyond a quantification of how many people lived in a household (a rudimentary household size) to looking at the composition of a household, and hence what type of household it represented. In order to do this I looked at how types of household were generally reported in the UK Census, in European statistics, and in terms of social research on the life course, as well as in health literature itself. In terms of defining households, I found that although complex household typologies do exist, there is a general set of likely household forms: as expected these revolve around the single, co-habiting, family, single parenthood and extended family models. As I have data on individuals, I first decided to classify individuals into 5 broad categories that seem important in the literature and then look at the composition of these categories within households. The categories were:</p>
<p>1) Dependent Children (&lt;18 yrs old)</p>
<p>2) Adult Male (18-65 yrs old)</p>
<p>3) Adult Female (18-60 yrs old)</p>
<p>4) Male Pensioner (65+ yrs old)</p>
<p>5) Female Pensioner (60+ yrs old)</p>
<p>Evidence suggests that these represent the coarsest categories that could usefully represent significant periods within the life course, as well as being relevant to changes in health status. In a sense, different types of household structure can be described by different combinations of these person classes for different household sizes.</p>
<p>I decided to test this by calculating all the possible combinations of these 5 classes for a 2 person household and then looking at their uptake in the actual household data I had derived from the Southwark patient register. It turned out that for a two person household there were 15 different ways in which you could combine the 5 person classes in order to create a unique household:</p>
<p><em>Children Only (Parents Unregistered); Single Parent Male and Child; Co-Habiting Men; Single Parent Female and Child; Single Parent Male Pensioner and Child; Co-Habiting Man and Woman; Co-Habiting Man and Male Pensioner; Co-Habiting Women; Single Parent Female Pensioner and Child; Cohabiting Woman and Male Pensioner; Cohabiting Man and Female Pensioner; Cohabiting Male Pensioners; Cohabiting Woman and Female Pensioner; Cohabiting Male and Female Pensioner; Cohabiting Female Pensioners.</em></p>
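<p>The count of 15 can be reproduced by enumerating every unordered pair of person classes, allowing the same class to appear twice. The sketch below uses shortened class labels of my own choosing, not the names from the list above.</p>
<pre>from itertools import combinations_with_replacement

# Shortened labels for the five person classes (labels are illustrative).
PERSON_CLASSES = ("Child", "Man", "Woman", "Male Pensioner", "Female Pensioner")

def household_types(size):
    """All unordered combinations, with repetition, of person classes for a household."""
    return list(combinations_with_replacement(PERSON_CLASSES, size))

two_person = household_types(2)
print(len(two_person))
for household in two_person:
    print(" + ".join(household))
</pre>
<p>Each tuple corresponds to one of the household types listed above; for example, the pair of Man and Male Pensioner is the category that turns out to be empty in the data.</p>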
<p>Using this typology of 15 possible household types, I extracted the two person households from the larger dataset and wrote a Python script to classify these households. The result for 27,124 households was as follows:</p>
<p style="text-align: left"><a href="http://danieljlewis.org/files/2010/07/2personHHtype.png"><img class="aligncenter size-full wp-image-370" title="2personHHtype" src="http://danieljlewis.org/files/2010/07/2personHHtype.png" alt="" width="594" height="330" /></a>What this graph seems to demonstrate is that roughly half of all 2 person households consist of a man and a woman (either adult or pensioner) cohabiting, and roughly a further 22% consist of same-sex cohabitation. In this dataset, single parents make up only around 15% of two person households, almost 13% being a single female parent (adult or pensioner) and a child. All other groups make up only around 13% of households, but crucially the only category in which no households were found to exist was the adult man cohabiting with a male pensioner category. Indeed many of the smaller categories can be interpreted as having inherently important social roles, the adult woman looking after a male or female pensioner for instance.</p>
<p style="text-align: left">Essentially, the terrain of household type was a lot more nuanced and tricky than I&#8217;d at first thought, made even more so by my realisation that as household size increases, the number of possible combinations of the person types within a household increases dramatically. I wrote a Python script to calculate the number of possible different sets of people for the household sizes 1 to 10:</p>
<p style="text-align: left"><a href="http://danieljlewis.org/files/2010/07/possibles.png"><img class="aligncenter size-full wp-image-373" title="possibles" src="http://danieljlewis.org/files/2010/07/possibles.png" alt="" width="564" height="334" /></a>This presents a difficult situation, even for reasonably small households. This is a problem in what is known as &#8220;combinatorial mathematics&#8221; or &#8220;<a title="Wiki - combinatorics" href="http://en.wikipedia.org/wiki/Combinatorics" target="_blank">combinatorics</a>&#8221;. I decided to see what I could learn about this distribution, so I looked for patterns in the sequence, as you are taught in pre-GCSE maths, and soon found that the sequence had a constant fourth difference:</p>
<p style="text-align: left"><a href="http://danieljlewis.org/files/2010/07/difference-table.png"><img class="aligncenter size-full wp-image-375" title="difference table" src="http://danieljlewis.org/files/2010/07/difference-table.png" alt="" width="622" height="226" /></a>This constant fourth difference indicated that the sequence can be explained by a quartic function, whose form it was then easy to calculate:</p>
<p style="text-align: left"><a href="http://danieljlewis.org/files/2010/07/CodeCogsEqn4.gif"><img class="aligncenter size-full wp-image-376" title="CodeCogsEqn(4)" src="http://danieljlewis.org/files/2010/07/CodeCogsEqn4.gif" alt="" width="556" height="22" /></a></p>
<p style="text-align: left">Sadly not one of those classically beautiful equations.</p>
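<p style="text-align: left">The difference table can be checked with a few lines of Python. This is a sketch under one assumption: that the counts are the number of multisets of size <em>n</em> drawn from the 5 person classes, which gives the 15 found for two person households. The closed form used below, (n + 1)(n + 2)(n + 3)(n + 4) / 24, follows from that assumption and is not transcribed from the equation image.</p>
<pre>from math import comb

def count_types(size, classes=5):
    """Number of multisets of the given size drawn from the person classes."""
    return comb(size + classes - 1, classes - 1)

def differences(seq):
    """First differences of a sequence."""
    return [b - a for a, b in zip(seq, seq[1:])]

def quartic(n):
    """Expanded form of (n + 1)(n + 2)(n + 3)(n + 4) / 24."""
    return (n**4 + 10 * n**3 + 35 * n**2 + 50 * n + 24) // 24

counts = [count_types(n) for n in range(1, 11)]
print(counts)

# Build the difference table row by row; the fourth row is constant.
row = counts
for order in range(1, 5):
    row = differences(row)
    print(order, row)

print(all(quartic(n) == count_types(n) for n in range(1, 11)))
</pre>
<p style="text-align: left">A constant fourth difference is exactly what a degree-4 polynomial produces, which is why the quartic reproduces the counts.</p>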
<p style="text-align: left">This all leads to the issue of how I now classify households; clearly the number of possible sets makes anything above around 4 people per household fairly intractable. I&#8217;ll experiment with 3 person households and see whether I can account for most household types with a few set patterns, and then look at households that fall outside of this remit.</p>
<p style="text-align: left">Interesting nonetheless; I hadn&#8217;t expected to be doing much of this kind of maths!</p>
]]></content:encoded>
			<wfw:commentRss>http://danieljlewis.org/2010/07/15/household-types-combinatorial-problems-and-pure-maths/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Computing the geometric median in Python</title>
		<link>http://danieljlewis.org/2010/07/09/computing-the-geometric-median-in-python/</link>
		<comments>http://danieljlewis.org/2010/07/09/computing-the-geometric-median-in-python/#comments</comments>
		<pubDate>Fri, 09 Jul 2010 10:17:56 +0000</pubDate>
		<dc:creator>Daniel Lewis</dc:creator>
				<category><![CDATA[Modeling]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[allocation]]></category>
		<category><![CDATA[dijkstra]]></category>
		<category><![CDATA[geometric]]></category>
		<category><![CDATA[location]]></category>
		<category><![CDATA[matplotlib]]></category>
		<category><![CDATA[median]]></category>
		<category><![CDATA[optimisation]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[service]]></category>

		<guid isPermaLink="false">http://danieljlewis.org/?p=362</guid>
		<description><![CDATA[I noticed in a beta of ArcGIS 10 (then called 9.4) that there was a &#8216;new&#8217; option for computing a Geometric Median which didn&#8217;t exist in my copy of ArcGIS 9.3. This is an interesting concept, as in 1d statistics, the geometric (2d) mean is easy to calculate, being the average of all the X [...]]]></description>
			<content:encoded><![CDATA[
<p>I noticed in a beta of ArcGIS 10 (then called 9.4) that there was a &#8216;new&#8217; option for computing a Geometric Median which didn&#8217;t exist in my copy of ArcGIS 9.3. This is an interesting concept. As in 1d statistics, the geometric (2d) mean is easy to calculate, being the average of all the X coords and all the Y coords. From stats we know that the mean and median value of a distribution will coincide if the data is perfectly normally distributed; however in the real world data usually will only approximate a normal distribution, leading to a mean value that is different from the midpoint, or median. Therefore for a skewed distribution on the plane, we encounter a situation in which the mean is not necessarily the best representation of the &#8216;centre&#8217; of the data, thus we may wish to calculate the median; doing so will also give us a good idea of the direction of the skew of the point pattern we are investigating. In calculating the median of a 2d point pattern we can express the problem as a need to:</p>
<p><em> minimise the sum of distances from all points in a distribution to a centre.</em></p>
<p>Thus it is reasonably clear that we are dealing with an &#8216;optimisation problem&#8217;, something that I have experimented with before in work I conducted using the &#8216;transportation problem&#8217;, a classic linear programming problem.</p>
<p>In terms of application, I thought that finding the median of a distribution of people around a service would be a useful, albeit basic, indication of whether all people were making a similar trip to a service, or whether there were other effects at work (this would be evidenced by a median centre that was not close to the actual service location). I thought I would be able to code the optimisation routine in Python using pre-existing insight. Notably, the <a title="Geometric Median" href="http://en.wikipedia.org/wiki/Geometric_median" target="_blank">Wikipedia page</a> on this details the Weiszfeld Algorithm as the acknowledged computational solution to the geometric median problem. It takes the form:</p>
<p><a href="http://danieljlewis.org/files/2010/07/weiszfeld.png"><img class="aligncenter size-full wp-image-363" title="weiszfeld" src="http://danieljlewis.org/files/2010/07/weiszfeld.png" alt="" width="368" height="61" /></a>However, actually writing the algorithm proved somewhat tough. Essentially the answer is to start with a candidate point (I started with the mean centre) and calculate the objective function &#8211; in this case the sum of the Euclidean distances of all points from the candidate centre. Then pass the candidate point through the Weiszfeld Algorithm and reassess the objective function; at the point where the objective function converges, a median has been found. Because the objective function is convex, the median is unique unless all of the points lie on a single line, although the iteration can stall if a candidate lands exactly on one of the data points. Below is a solution for some of my data (the data has been randomly offset by 75m to preserve anonymity) on patient registrations to a doctor.</p>
<p style="text-align: center"><a href="http://danieljlewis.org/files/2010/07/geomedian.png"><img class="aligncenter size-large wp-image-365" title="geomedian" src="http://danieljlewis.org/files/2010/07/geomedian-1024x742.png" alt="" width="574" height="415" /></a>Here we can see that the mean and median centres are slightly different, suggesting that the patient population is skewed slightly northwards, most likely as a result of discontinuous urban infrastructure.</p>
<p style="text-align: left">The scatterplot was achieved using the <a title="MatPlotLib @ Sourceforge" href="http://matplotlib.sourceforge.net/index.html" target="_blank">matplotlib</a> Python plotting library. This was just a test, but I imagine more complex outputs can be achieved reasonably easily.</p>
<p style="text-align: left">Notably, this technique uses Euclidean distance, which in a dense urban environment may be misleading. I note that there is a relatively simple execution of the <a title="Python Dijkstra" href="http://code.activestate.com/recipes/119466-dijkstras-algorithm-for-shortest-paths/" target="_blank">Dijkstra algorithm for shortest paths in Python</a>, and I am curious whether this could be integrated to find a geometric median on the network. I suspect that it may be unworkable due to computational time constraints, although for smaller problems it might be ok.</p>
<p style="text-align: left">Naturally there are algorithms that can calculate a solution to the above for <em>p</em>-medians (i.e. several service centres in a population, commonly known as location-allocation). It is something that <a title="Paul Densham" href="http://www.geog.ucl.ac.uk/~pdensham/s_t_paper.html" target="_blank">Paul Densham</a> at UCL has worked on, and his code is making a return to service in ArcGIS version 10. I&#8217;m looking forward to seeing it, as it is a very difficult problem to solve (and in fact already has been &#8216;solved&#8217;), and not one I intend to investigate!</p>
<p style="text-align: left">My code for the geometric median is <a href="http://danieljlewis.org/files/2010/07/geomedian.pdf">here.</a></p>
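<p style="text-align: left">For readers who just want the shape of the iteration, here is a minimal sketch of the Weiszfeld update as described above. It is not the code in the linked PDF: the function names, the tolerance and the handling of a candidate that coincides with a data point (that point is simply skipped) are my own simplifying assumptions.</p>
<pre>import math

def total_distance(points, centre):
    """Objective function: sum of Euclidean distances from the points to centre."""
    return sum(math.dist(p, centre) for p in points)

def geometric_median(points, tol=1e-9, max_iter=1000):
    """Approximate the geometric median of 2d points with Weiszfeld's iteration."""
    # Start from the mean centre.
    n = len(points)
    centre = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
    for _ in range(max_iter):
        weight_sum = x_sum = y_sum = 0.0
        for p in points:
            d = math.dist(p, centre)
            if d == 0.0:
                # Candidate sits on a data point; skip it to avoid dividing by zero.
                continue
            weight_sum += 1.0 / d
            x_sum += p[0] / d
            y_sum += p[1] / d
        if weight_sum == 0.0:
            break
        new_centre = (x_sum / weight_sum, y_sum / weight_sum)
        moved = math.dist(new_centre, centre)
        centre = new_centre
        if tol > moved:
            # The candidate has stopped moving, so treat it as converged.
            break
    return centre

# Made-up points: three close together and one distant outlier.
points = [(0, 0), (1, 0), (0, 1), (10, 10)]
median = geometric_median(points)
mean = (sum(p[0] for p in points) / 4, sum(p[1] for p in points) / 4)
print(median, total_distance(points, median))
print(mean, total_distance(points, mean))
</pre>
<p style="text-align: left">The outlier drags the mean centre well away from the cluster, while the median stays close to it and gives a smaller total distance.</p>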
]]></content:encoded>
			<wfw:commentRss>http://danieljlewis.org/2010/07/09/computing-the-geometric-median-in-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Jenks&#8217; Natural Breaks Algorithm in Python</title>
		<link>http://danieljlewis.org/2010/06/07/jenks-natural-breaks-algorithm-in-python/</link>
		<comments>http://danieljlewis.org/2010/06/07/jenks-natural-breaks-algorithm-in-python/#comments</comments>
		<pubDate>Mon, 07 Jun 2010 15:53:27 +0000</pubDate>
		<dc:creator>Daniel Lewis</dc:creator>
				<category><![CDATA[Modeling]]></category>
		<category><![CDATA[choropleth]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[Jenks]]></category>
		<category><![CDATA[optimisation]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://danieljlewis.org/?p=347</guid>
		<description><![CDATA[The Jenks Optimal, or Jenks&#8217; Natural Breaks, Algorithm is a common method for classifying data presented in a choropleth map. It aims to present a series of break values that best represent the actual breaks observed in the data as opposed to some arbitrary classificatory scheme (i.e. equal interval), in this way the actual clustering [...]]]></description>
			<content:encoded><![CDATA[
<p>The Jenks Optimal, or Jenks&#8217; Natural Breaks, Algorithm is a common method for classifying data presented in a choropleth map. It aims to present a series of break values that best represent the actual breaks observed in the data, as opposed to some arbitrary classificatory scheme (e.g. equal interval); in this way the actual clustering of data values is preserved (subject to the arbitrary specification of <em>k </em>classes). The method was originally published in George Jenks&#8217; (1977) <em>Optimal Data Classification for Choropleth Maps</em> and reportedly represented the culmination of 15 years&#8217; research on the topic; the method derives primarily from Walter Fisher&#8217;s work &#8216;<em>On grouping for maximum homogeneity</em>&#8217;. The algorithm aims to create <em>k </em>classes so that the variance within groups is minimised; as such it is a problem of numerical optimisation.</p>
<p>A paper by Michael Coulson (1987) entitled <em>In The Matter Of Class Intervals For Choropleth Maps: With Particular Reference To The Work Of George F Jenks </em>details a method that Jenks apparently authored, but never published, to assess how good the chosen number of classes is. The method, Goodness of Variance Fit (GVF), works by taking the difference between the sum of squared deviations from the array mean (SDAM) and the sum of squared deviations from the class means (SDCM), and dividing by the SDAM. Thus:</p>
<p style="text-align: center">GVF = (SDAM &#8211; SDCM)/SDAM</p>
<p style="text-align: left">However, it is likely this was never published because the GVF improves as the number of classes increases until, at the point where there are as many classes as data points, the GVF reaches unity. Nonetheless, I have included a rudimentary example for calculating this statistic. In reality, this method is used to generalise data into a few classes for visualisation, so you are unlikely to be using more than 7 (+/- 2) classes; the number of classes can be loosely assigned by looking at the distribution histogram, but often this is difficult.</p>
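<p style="text-align: left">The GVF formula above can be sketched as follows. This is not the linked script: the function names are mine, the breaks are assumed to be the inclusive upper bound of each class with the last break equal to the maximum value, and the data are made up.</p>
<pre>def sum_sq_dev(values):
    """Sum of squared deviations of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def gvf(values, breaks):
    """Goodness of Variance Fit: (SDAM - SDCM) / SDAM."""
    ordered = sorted(values)
    sdam = sum_sq_dev(ordered)
    sdcm = 0.0
    lower = None
    for upper in breaks:
        # A class holds the values above the previous break, up to this break.
        members = [v for v in ordered if upper >= v and (lower is None or v > lower)]
        if members:
            sdcm += sum_sq_dev(members)
        lower = upper
    return (sdam - sdcm) / sdam

# Made-up data with three obvious clusters.
values = [1, 2, 3, 10, 11, 12, 20, 21, 22]
print(gvf(values, [22]))            # one class
print(gvf(values, [3, 12, 22]))     # three classes
print(gvf(values, sorted(values)))  # one class per value
</pre>
<p style="text-align: left">A single class scores zero and one class per value scores one, which illustrates why the statistic on its own cannot choose the number of classes.</p>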
<p style="text-align: left">The script is <a href="http://danieljlewis.org/files/2010/06/Jenks.pdf">here.</a></p>
<p style="text-align: left">Acknowledgement: The initial script I used for the Python conversion can be found (in Java and Fortran) here: https://stat.ethz.ch/pipermail/r-sig-geo/2006-March/000811.html</p>
]]></content:encoded>
			<wfw:commentRss>http://danieljlewis.org/2010/06/07/jenks-natural-breaks-algorithm-in-python/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>MDS for Inner London</title>
		<link>http://danieljlewis.org/2010/05/11/mds-for-inner-london/</link>
		<comments>http://danieljlewis.org/2010/05/11/mds-for-inner-london/#comments</comments>
		<pubDate>Tue, 11 May 2010 15:07:59 +0000</pubDate>
		<dc:creator>Daniel Lewis</dc:creator>
				<category><![CDATA[Cartography]]></category>
		<category><![CDATA[Modeling]]></category>
		<category><![CDATA[Representation]]></category>
		<category><![CDATA[distance]]></category>
		<category><![CDATA[Inner]]></category>
		<category><![CDATA[london]]></category>
		<category><![CDATA[Maps]]></category>
		<category><![CDATA[MDS]]></category>

		<guid isPermaLink="false">http://danieljlewis.org/?p=299</guid>
		<description><![CDATA[As an extension of the previous post, I created MDS mappings for Inner London using the OAC variables. I am using the ONS definition of Inner London, as opposed the commonly accepted definition, thus Inner London includes Haringey, Newham and the City, but excludes Greenwich. The nature of MDS is to scale a pairwise distance [...]]]></description>
			<content:encoded><![CDATA[
<p>As an extension of the previous post, I created MDS mappings for Inner London using the OAC variables. I am using the ONS definition of Inner London, as opposed to the commonly accepted definition, thus Inner London includes Haringey, Newham and the City, but excludes Greenwich.</p>
<p>The nature of MDS is to scale a pairwise distance matrix, thus the number of entries in the matrix increases as the square of the number of observations, rapidly becoming very large. There are 9759 OAs in Inner London, requiring a distance matrix of 95,238,081 cells. I would like to compute the MDS mappings for Greater London as well, however there are 24,140 OAs giving a distance matrix of 582,739,600 cells; that is over 1/2 billion (American bn) entries and cannot be dimensioned on a 32-bit machine. I&#8217;ve looked into 64-bit Python, but have yet to find a solution that doesn&#8217;t completely fill my computer&#8217;s memory and create an enormous pagefile.</p>
<p>In the previous example I used Canberra distance, but found that in the larger data space of Inner London I was having issues with very small fractions skewing the outcome, so I calculated the distance matrix using Bray-Curtis distance from SciPy&#8217;s spatial.distance library. The formula for Bray-Curtis is as follows (you&#8217;ll note it is a data normalisation method too):</p>
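<p>The arithmetic behind those matrix sizes is easy to reproduce. The sketch below also estimates memory, assuming each cell is stored as an 8-byte float; the actual footprint depends on the data type and on any working copies the MDS routine makes, so treat the figures as a lower bound.</p>
<pre>def matrix_cells(n_obs):
    """Cells in a full n x n pairwise distance matrix."""
    return n_obs ** 2

def condensed_cells(n_obs):
    """Cells if each pair of observations is stored only once."""
    return n_obs * (n_obs - 1) // 2

def gibibytes(cells, bytes_per_cell=8):
    """Memory needed for the cells, assuming 8-byte floats by default."""
    return cells * bytes_per_cell / 2 ** 30

for name, n in (("Inner London", 9759), ("Greater London", 24140)):
    full = matrix_cells(n)
    half = condensed_cells(n)
    print(name, full, round(gibibytes(full), 2), round(gibibytes(half), 2))
</pre>
<p>Under that assumption the full Greater London matrix needs more than 4 GiB, which is beyond what a 32-bit process can address at all.</p>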
<p style="text-align: left"><a href="http://danieljlewis.org/files/2010/05/braycurtis.png"><img class="aligncenter size-full wp-image-300" title="braycurtis" src="http://danieljlewis.org/files/2010/05/braycurtis.png" alt="" width="200" height="56" /></a></p>
<p style="text-align: left">where <em>k</em> indexes the variables, x<sub>i</sub> and x<sub>j</sub> are the pair of observations being compared, and d<sub>ij</sub> is the resulting entry in the distance matrix.</p>
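<p style="text-align: left">As a plain-Python illustration of the calculation, here is a sketch of the Bray-Curtis distance in its standard form, the sum over <em>k</em> of |x<sub>ik</sub> - x<sub>jk</sub>| divided by the sum over <em>k</em> of |x<sub>ik</sub> + x<sub>jk</sub>|. The helper names and the three example rows are made up; for real data SciPy&#8217;s spatial.distance functions are the better choice.</p>
<pre>def bray_curtis(x_i, x_j):
    """Bray-Curtis distance between two equal-length rows of variables."""
    numerator = sum(abs(a - b) for a, b in zip(x_i, x_j))
    denominator = sum(abs(a + b) for a, b in zip(x_i, x_j))
    return numerator / denominator

def distance_matrix(rows):
    """Full symmetric matrix of pairwise Bray-Curtis distances."""
    return [[bray_curtis(r, s) for s in rows] for r in rows]

# Made-up observations with three variables each.
rows = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
for line in distance_matrix(rows):
    print([round(d, 3) for d in line])
</pre>
<p style="text-align: left">Because the differences are divided by the sums, each distance is scaled by the magnitude of the pair being compared, which is the normalising behaviour noted above.</p>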
<p style="text-align: left">The mappings were, as before, produced in greyscale and RGB.</p>
<p style="text-align: center">
<div id="attachment_301" class="wp-caption aligncenter" style="width: 563px"><a href="http://danieljlewis.org/files/2010/05/InnerLondonMDSBW.jpg"><img class="size-large wp-image-301  " title="InnerLondonMDSBW" src="http://danieljlewis.org/files/2010/05/InnerLondonMDSBW-1024x718.jpg" alt="" width="553" height="388" /></a><p class="wp-caption-text">Greyscale MDS mapping of Inner London using 41 OAC Variables</p></div>
<p style="text-align: left">
<div id="attachment_304" class="wp-caption aligncenter" style="width: 563px"><a href="http://danieljlewis.org/files/2010/05/InnerLondonMDSRGB.jpg"><img class="size-large wp-image-304  " title="InnerLondonMDSRGB" src="http://danieljlewis.org/files/2010/05/InnerLondonMDSRGB-1024x723.jpg" alt="" width="553" height="391" /></a><p class="wp-caption-text">RGB colour mapping for MDS of 41 OAC Variables</p></div>
<p style="text-align: left">I feel that the representations produced are quite effective in demonstrating the wide mix of social environments in Inner London. The greyscale is particularly good at picking up the acknowledged patterns of deprivation, particularly east/west and north/south. The colour representation does this too, but with an additional layer of complexity that seems to give a more nuanced reading of social stratification in Inner London, with specific colour groups apparently marking out notional neighbourhoods &#8211; complementing the reading of London as a &#8216;city of villages&#8217;.</p>
<p style="text-align: left">Acknowledgement</p>
<p style="text-align: left">Census data is Crown Copyright 2010 from CasWeb, boundaries are Crown Copyright 2010 from UKBorders, an Edina/JISC supplied service.</p>
]]></content:encoded>
			<wfw:commentRss>http://danieljlewis.org/2010/05/11/mds-for-inner-london/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The 10% that change everything.</title>
		<link>http://danieljlewis.org/2010/03/17/the-10-that-change-everything/</link>
		<comments>http://danieljlewis.org/2010/03/17/the-10-that-change-everything/#comments</comments>
		<pubDate>Wed, 17 Mar 2010 18:24:48 +0000</pubDate>
		<dc:creator>Daniel Lewis</dc:creator>
				<category><![CDATA[Modeling]]></category>
		<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[deviants]]></category>
		<category><![CDATA[local]]></category>
		<category><![CDATA[population]]></category>
		<category><![CDATA[regional]]></category>
		<category><![CDATA[small numbers]]></category>

		<guid isPermaLink="false">http://danieljlewis.org/?p=277</guid>
		<description><![CDATA[I was caused to consider the problem of generalisable human behaviour by a presentation on evacuation modelling. In my eyes the ability to model or predict anything, and make the implementation transferable across different contexts is contingent on the assumption that you know how people will act. The fallacy here is that you know how [...]]]></description>
			<content:encoded><![CDATA[
<p>I was prompted to consider the problem of generalisable human behaviour by a presentation on evacuation modelling. In my eyes the ability to model or predict anything, and make the implementation transferable across different contexts, is contingent on the assumption that you know how people will act. The fallacy here is that you know how people will act, but you don&#8217;t necessarily know how individuals will act. It is easy (relatively) to aggregate across social characteristics and say: young people move at x speed, but old people move at y speed, and use it as a way of building socially stratified and perhaps logically more realistic models (this is often known as disaggregation). This approach seems to work well for regional systems and large groups of people; however, as the system of interest becomes more and more localised, individuals whose observable characteristics diverge enough from a generalisable norm can actually have an important role in the outcome of the model. I tend to think of this group as the 10% who change everything, although the actual percentage is likely to vary contextually.</p>
<p>In my work I have been looking at uptake and registration with GPs (doctors&#8217; surgeries) with a view to isolating the demographic qualifiers of choice that create these different spatial patternings of uptake, essentially attempting to find interesting social disaggregations within the data. In this work, which is at the level of the surgery, but for a PCT system (i.e. administrative health area), it is clear that some people, a small minority, act unexpectedly and differently to others with the exact same socio-economic characteristics, and that the problem is exacerbated when you look at smaller and smaller problems. For example, the largest surgeries in Southwark have between 10,000 and 20,000 registered patients; for these it is far easier to model patient registration as a function of distance than it is for a surgery of only 2,000 people. This is because larger surgeries can aggregate out the small number of &#8216;deviants&#8217; better than a small surgery can. This is a de facto small numbers problem &#8211; a small number of outlying cases has a larger effect on smaller units of aggregation (surgeries) because they make up a larger proportion of the total population.</p>
<p>What does this mean for small-scale agent based simulations then? Well, as far as I can see it is very difficult to predict who out of a population is likely to diverge from their socially-stratified peers and be the outlying individual, and since the scale of simulation is so localised this uncertainty is liable to dramatically change the predicted outcomes. Thus estimating the likely outcome within a margin for error is plausible &#8211; x people of y population were subject to some disadvantageous outcome (e.g. death/injury) &#8211; but assessing where deaths/injuries occurred, or the characteristics of who died/was injured, may be a bit ambitious, or open to a quantitatively unacceptable uncertainty.</p>
<p>Naturally, understanding local level social systems should be a priority, and deriving generalisable rule-bases to give the best possible outcome for the incidence of a given phenomenon is important. However, I think we always have to accept that in these circumstances a small number of people can have a large effect on the outcome in a way which is largely incalculable.</p>
]]></content:encoded>
			<wfw:commentRss>http://danieljlewis.org/2010/03/17/the-10-that-change-everything/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

