<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Volunteered Geographic Information &#187; Daniel Lewis</title>
	<atom:link href="http://danieljlewis.org/author/dan/feed/" rel="self" type="application/rss+xml" />
	<link>http://danieljlewis.org</link>
	<description>A Geography/GIS blog by Daniel J Lewis</description>
	<lastBuildDate>Fri, 27 Aug 2010 15:13:01 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>k-nearest neighbour weights using PySAL</title>
		<link>http://danieljlewis.org/2010/08/27/k-nearest-neighbour-weights-using-pysal/</link>
		<comments>http://danieljlewis.org/2010/08/27/k-nearest-neighbour-weights-using-pysal/#comments</comments>
		<pubDate>Fri, 27 Aug 2010 15:13:01 +0000</pubDate>
		<dc:creator>Daniel Lewis</dc:creator>
				<category><![CDATA[PhD Work]]></category>
		<category><![CDATA[pysal]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[weights]]></category>

		<guid isPermaLink="false">http://danieljlewis.org.blogs.splintdev.geog.ucl.ac.uk/?p=397</guid>
		<description><![CDATA[I found a nice little time-saving device when testing different numbers of nearest neighbour weights in the excellent PySAL library in Python, so I thought I&#8217;d share it. Basically I wanted to test a number of different values of k when choosing a weighting scheme for spatial smoothing using nearest neighbours, but I didn&#8217;t [...]]]></description>
			<content:encoded><![CDATA[
<p>I found a nice little time-saving device when testing different numbers of nearest neighbour weights in the excellent <a title="PySAL home" href="http://geodacenter.asu.edu/pysal" target="_blank">PySAL</a> library in Python, so I thought I&#8217;d share it.</p>
<p>Basically I wanted to test a number of different values of k when choosing a weighting scheme for spatial smoothing using nearest neighbours, but I didn&#8217;t want to have to continually recalculate the weights matrix for different values of k. Here&#8217;s what I did:</p>
<ol>
<li>Calculate a k nearest neighbour vector for a high value of k. This should be the same size as, or larger than, what you anticipate as being your maximum value of k.</li>
<li>Store this table in a database. The database is useful because a large value of k on a large set of data creates k x n rows. Databases are optimised to store and query this amount of data in a way that text files aren&#8217;t!</li>
<li>Create the weights matrix in PySAL from first principles: grab all the data from the database and order it by the &#8216;from&#8217; neighbour id, and then by the weights of the &#8216;to&#8217; neighbours. Create the weights matrix as specified in the <a title="PySAL Weights Docs" href="http://pysal.org/library/weights/weights.html" target="_blank">PySAL docs on weighting</a>, but only use as many observations as you want to test by slicing the input lists.</li>
</ol>
<p>Here is my code showing how this works:</p>
<pre>import _mysql    # Low-level MySQL client module from MySQLdb
from pysal import W    # Weights class from PySAL

K_MAX = 50    # Number of neighbours stored per point in the database
k = 20        # Number of neighbours to test; must not exceed K_MAX

# Important - Database connection!
db = _mysql.connect(host="localhost",user="root",passwd="",db="data")

# Now create a spatial weights object

# These 4 lines pull in the weights data from my MySQL database
# and store it as a sequence of tuples called 'resultWeight'
getWeights = "select * from `spatialweight` order by `polyID`,`weight`"
db.query(getWeights)
r = db.store_result()
resultWeight = r.fetch_row(maxrows=0)

nList = []    # Empty list for neighbours
wList = []    # Empty list for weights
neighbours = {}    # Empty dictionary to hold neighbours
weights = {}    # Empty dictionary to hold weights

# Now iterate through the results to form dictionaries with lists of
# neighbours and weights for each relevant point in the dataset.
# This assumes every point has exactly K_MAX rows in the table.
for recNum, row in enumerate(resultWeight, 1):
    # Collect the neighbour id and weight from each row
    nList.append(int(row[1]))
    wList.append(float(row[2]))

    if recNum % K_MAX == 0:    # Reached the last stored row for this point
        # Slice nList and wList down to the value of k being tested
        neighbours[int(row[0])] = nList[:k]
        weights[int(row[0])] = wList[:k]
        # Reset nList and wList for the next point in the weights data.
        nList = []
        wList = []

# Now simply create the weights matrix for the value of k specified.
w = W(neighbours,weights)

# Set the order of the matrix for use later.
n = len(neighbours)    # Number of observations in the dataset
if not w.id_order_set:
    w.id_order = list(range(1, n + 1))
# This assumes point IDs are sequential integers starting at 1.
</pre>
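<p>The same slicing idea can be tried without a database or PySAL installed. What follows is only a sketch: the rows are invented, the function name <code>knn_dicts</code> is hypothetical, and it groups rows by point ID instead of counting a fixed number of rows per point, so it does not rely on every point having the same number of stored neighbours.</p>
<pre>from itertools import groupby
from operator import itemgetter

def knn_dicts(rows, k):
    """Build neighbour and weight dictionaries keeping the first k rows per point.

    rows: (point_id, neighbour_id, weight) tuples, sorted by point_id and
    then by weight, as the ORDER BY clause would return them.
    """
    neighbours = {}
    weights = {}
    for point_id, group in groupby(rows, key=itemgetter(0)):
        nearest = list(group)[:k]    # Keep only the k nearest rows
        neighbours[point_id] = [r[1] for r in nearest]
        weights[point_id] = [r[2] for r in nearest]
    return neighbours, weights

# Invented example: 2 points, each stored with 3 neighbours
rows = [
    (1, 2, 1.0), (1, 3, 2.5), (1, 4, 4.0),
    (2, 1, 1.0), (2, 3, 1.5), (2, 4, 3.0),
]
neighbours, weights = knn_dicts(rows, 2)
</pre>
<p>The two dictionaries have the same shape as those passed to <code>W</code>, so testing another value of k only means calling the function again on rows that are already in memory.</p>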
]]></content:encoded>
			<wfw:commentRss>http://danieljlewis.org/2010/08/27/k-nearest-neighbour-weights-using-pysal/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>P ≠ NP : Relevance to Computational Geography</title>
		<link>http://danieljlewis.org/2010/08/13/p-%e2%89%a0-np-relevance-to-computational-geography/</link>
		<comments>http://danieljlewis.org/2010/08/13/p-%e2%89%a0-np-relevance-to-computational-geography/#comments</comments>
		<pubDate>Fri, 13 Aug 2010 18:15:31 +0000</pubDate>
		<dc:creator>Daniel Lewis</dc:creator>
				<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[computers]]></category>
		<category><![CDATA[maths]]></category>
		<category><![CDATA[NP]]></category>
		<category><![CDATA[optimisation]]></category>
		<category><![CDATA[P]]></category>
		<category><![CDATA[problems]]></category>
		<category><![CDATA[solutions]]></category>

		<guid isPermaLink="false">http://danieljlewis.org/?p=394</guid>
		<description><![CDATA[In computational complexity theory the biggest unanswered question is P ≟ NP, that is, whether or not every problem in the complexity class NP also falls into the class P. Basically, this question asks: can all decision problems of the class NP be solved by a computer in a feasible length of [...]]]></description>
			<content:encoded><![CDATA[
<p>In computational complexity theory the biggest unanswered question is P ≟ NP, that is, whether or not every problem in the complexity class NP also falls into the class P. Basically, this question asks: can all decision problems of the class NP be solved by a computer in a feasible length of time? This is a timely topic because a mathematician named <a title="Vinay Deolalikar - HP Labs" href="http://www.hpl.hp.com/personal/Vinay_Deolalikar/" target="_blank">Vinay Deolalikar</a> claims to have found a proof that <a title="P not NP" href="http://www.hpl.hp.com/personal/Vinay_Deolalikar/Papers/pnp_synopsis.pdf" target="_blank">P ≠ NP</a>, meaning that there are problems whose solutions can be checked quickly but cannot be found quickly.</p>
<p>First, let&#8217;s consider what the terms P and NP mean:</p>
<p>The complexity class NP contains all decision problems for which a proposed solution can be verified in a feasible amount of time. This means that if you have a decision problem and you are handed evidence that the answer is yes, you can write a program to check quickly that that is the case. What membership of NP does not guarantee is that, lacking an answer, a solution can be generated in a feasible period of time.</p>
<p>A good geographical example of an NP problem, albeit an optimisation problem, is the &#8216;travelling salesman problem&#8217; (TSP). The TSP says, given a list of cities and the pairwise distances between them, what is the shortest route that a travelling salesman could take so that he visits each city only once? Strictly, it is the decision version of the question (is there a route shorter than some given length?) that belongs to NP. This is a complex problem because the number of possible routes grows factorially with the number of cities; after only a few cities the set of possible solutions becomes so large as to make an exhaustive search infeasible in a realistic time frame. Thus for large numbers of cities the solutions used in practice are generally created by heuristics, in which a solution is assumed to be near-optimal but is not proven to be the &#8216;best&#8217; solution. This is an example of an NP problem which it is believed does not belong to the set of problems known as P, because no known algorithm is guaranteed to compute the &#8216;best&#8217; solution in a feasible amount of time.</p>
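<p>To give a feel for how quickly the TSP outgrows exhaustive search, here is a minimal Python sketch. It is an illustration only: the distance matrix is made up, the function names are hypothetical, and it assumes the salesman returns to his starting city, which the description above does not require.</p>
<pre>from itertools import permutations
from math import factorial

def tour_length(tour, dist):
    """Length of a closed tour that returns to its starting city."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

def brute_force_tsp(dist):
    """Try every tour that starts at city 0 and return the shortest one."""
    n = len(dist)
    candidates = ((0,) + rest for rest in permutations(range(1, n)))
    best = min(candidates, key=lambda tour: tour_length(tour, dist))
    return best, tour_length(best, dist)

# Invented symmetric distances between 4 cities
dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
best_tour, best_length = brute_force_tsp(dist)

# Number of tours an exhaustive search would need to check for 20 cities
tours_for_20_cities = factorial(19)
</pre>
<p>With 4 cities there are only 6 tours to check from a fixed start, but with 20 cities the same approach would have to check 19! of them, which is about 1.2 &#215; 10<sup>17</sup>. Checking the length of any single proposed tour, by contrast, takes one pass over its cities, which is the sense in which a solution is easy to verify but hard to find.</p>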
<p>The complexity class P is a subset of NP and contains all decision problems that can be solved by a computer in a feasible amount of time. A decision problem is essentially a yes-no question that can be solved by an algorithm on a single computer running in a sequential manner (i.e. step-by-step). Decision problems, and the complexity class P, are linked to other types of problems, such as search problems (i.e. find a particular element/structure in some set of data), function problems (like decision problems, but returning a more complex answer than yes/no), and optimisation problems (for given starting conditions, find the best solution to a question from a set of feasible solutions).</p>
<p>An optimisation problem that falls into the P complexity class is the transportation problem of linear programming. This is an optimisation problem that says: for a given set of supply and demand sites, with a set of associated costs which differ for each pair of demand and supply sites, what is the best way of allocating resources from the supply sites to the demand sites so that the cost is minimised? As a geographer I find this a useful problem: I&#8217;ve used it to find an optimal set of GP service areas, where local areas &#8216;supply&#8217; a certain number of people who want to enrol with a doctor to a set of GPs who meet the &#8216;demand&#8217;, based on the fact that people see GPs as a location-based service and want to travel as short a distance as possible to get to a GP, thus reducing the &#8216;cost&#8217; to them. The virtues of Linear Programming mean that this problem can be solved reasonably efficiently; however, whether a particular formulation stays in P depends on the constraints you use in the problem. One constraint that I used was that each GP can only serve a certain number of people, and when it is full no one else can go there; adding constraints increases the complexity of a problem and may lead to the definition of a problem that can no longer be solved efficiently.</p>
<p>So essentially what we have are NP problems, some of which are known to be in P. If P = NP were found to be true, it would mean that for every NP problem there exists an algorithm that could calculate an exact, best answer in a feasible amount of time. This would mean we could effectively solve problems like the TSP for large numbers of cities. However, if, as suggested by Vinay Deolalikar, P ≠ NP, then we are resigned to the fact that for large instances we will only be able to approximate the answer to problems such as the TSP.</p>
<p>Computational Geography uses numerous optimisation methods and looks for solutions to problems that are likely to be in NP but not in P. In all of these cases we are left in the situation of being able to approximate, but not know for sure, and this limits the certainty with which researchers can claim any particular result to be socially useful, or policy relevant. NP problems that are relevant to computational geography but may not be efficiently solvable include problems in network analysis, including routing, flows and spanning trees; data storage; scheduling and optimisation; automata; geometry and mathematical programming.</p>
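<p>For reference, a capacity-constrained transportation problem of this kind is commonly written as the linear programme below, given here in LaTeX notation. The symbols are labels chosen for this sketch and not taken from the original analysis: x<sub>ij</sub> is the number of people in local area i allocated to GP j, c<sub>ij</sub> is the travel cost between them, s<sub>i</sub> is the number of people in area i, and q<sub>j</sub> is the capacity of GP j.</p>
<pre>\min_{x} \; \sum_{i} \sum_{j} c_{ij} \, x_{ij}

\text{subject to} \quad
\sum_{j} x_{ij} = s_i \quad \forall i, \qquad
\sum_{i} x_{ij} \le q_j \quad \forall j, \qquad
x_{ij} \ge 0 \quad \forall i, j
</pre>
<p>The first constraint allocates everyone in each area, the second stops any GP from exceeding its capacity, and the problem only has a feasible solution if total capacity is at least as large as the total population.</p>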
<p>In a sense if P ≠ NP it only goes to further enhance the mystery of the spatial and the geographical.</p>
]]></content:encoded>
			<wfw:commentRss>http://danieljlewis.org/2010/08/13/p-%e2%89%a0-np-relevance-to-computational-geography/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Community Informatics: Better Websites for the Health of Local Areas</title>
		<link>http://danieljlewis.org/2010/08/12/community-informatics-better-websites-for-the-health-of-local-areas/</link>
		<comments>http://danieljlewis.org/2010/08/12/community-informatics-better-websites-for-the-health-of-local-areas/#comments</comments>
		<pubDate>Thu, 12 Aug 2010 16:09:35 +0000</pubDate>
		<dc:creator>Daniel Lewis</dc:creator>
				<category><![CDATA[Health Geography]]></category>
		<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[community informatics]]></category>
		<category><![CDATA[health]]></category>
		<category><![CDATA[local]]></category>
		<category><![CDATA[NHS]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[White Paper]]></category>

		<guid isPermaLink="false">http://danieljlewis.org/?p=390</guid>
		<description><![CDATA[A comment I received from a chap called Bob Stott, on a previous post, got me thinking. I want to pick up this part of the comment in particular: &#8220;It also, as far as IT initiatives are concerned, reflects the need for more thought about ‘Community Informatics’ to feed realistic data regarding NHS Policy and [...]]]></description>
			<content:encoded><![CDATA[
<p>A comment I received from a chap called Bob Stott, on a <a title="Previous Post of New NHS White Paper" href="http://danieljlewis.org/2010/06/01/locally-led-nhs-service-changes-dubious/" target="_blank">previous post</a>, got me thinking. I want to pick up this part of the comment in particular:</p>
<address>&#8220;It also, as far as IT initiatives are concerned, reflects the need for more thought about ‘Community Informatics’ to feed realistic data regarding NHS Policy and Strategy.&#8221;</address>
<p>This set off a couple of neurons firing: firstly, I was reminded that the Guardian recently had a piece <a title="Guardian- NHS Websites" href="http://www.guardian.co.uk/society/2010/aug/04/nhs-websites-failing-patients" target="_blank">lamenting the state of NHS websites</a>, and secondly, I remembered some critique I wrote a while ago suggesting that <a title="My Blog - NHS Choices limited" href="http://danieljlewis.org/2009/10/16/pathways-to-choice-in-the-nhs-the-limitations-of-nhs-choices-for-primary-care/" target="_blank">NHS Choices wasn&#8217;t up to scratch</a>.</p>
<p>What I thought was: community informatics! What a great term! Here is a concept that might actually work under the new NHS structure! However, rather than Bob&#8217;s truly ambitious idea about communicating policy and strategy, what if we kept it simple at first and thought about communicating effectively with local communities about their care choices?</p>
<p>Now, the suggestion that the Guardian makes is that the NHS is wasting money on hundreds of websites, many of which are out-of-date, misleading or just wrong. In fact many of these websites actually relate to primary care doctors&#8217; surgeries, who, it could be argued, have better things to do than maintain a website. Indeed, there are numerous GPs who do not even have a web presence outside of the NHS Choices search page. Likewise, NHS Choices is an improving website &#8211; it has added several search filters and patient feedback methods since I last cast a critical eye over it &#8211; but it still acts as a centralised information portal. This is fine on the one hand, because the NHS is a national system of care, and such a system needs a centralised presence to some extent; however, it may be limited when dealing with local issues. This is largely the thinking behind proposed changes to the NHS: the Conservative-Liberal government believes that previously too much power was centralised within the NHS system through explicit hierarchies. This, they claim, meant that central government had too much control over health spending, despite the fact that around 80% of funding was left to the lowest-level authority, the primary care trust, to spend. The Conservative-Liberal system removes the explicit national-regional-local linkage in favour of local consortia, groups of GPs, and installs a shadowy national body &#8211; the NHS Commissioning Board &#8211; about which we do not know too much at the moment, to oversee the consortia. Whilst there are numerous critiques one might make, upon reflection this seems like a potentially advantageous position from the vantage of &#8216;community informatics&#8217;.</p>
<p>Clearly a well-maintained website for individual surgeries, or GP consortia, will be highly advantageous to the local users of the service, and as well as providing general information it could provide highly personalised insight that is tailored to the specific issues faced by either the communities, or the individuals themselves. These websites were traditionally the responsibility of GPs, who may not have kept them updated, as opposed to the PCTs, who had more important things to do, and perhaps were somewhat inefficient with respect to information dissemination and web media. However, a consortium, which is responsible for a group of local GPs, and which has a more marketised responsibility to provide tailored care, may gain an advantage from the potential for several GPs to bring together resources and collaborate on providing community-based information and online services. This is simply because the shifting situation will mean that it is increasingly in the interest of the GPs and the consortia to advertise access to care and provide effective local solutions. Of course, whether this is a realistic possibility remains to be seen; I certainly hope that it could be a positive upshot of the NHS plans, but again there seems potential for the system to become increasingly inequitable for patients across the social scale.</p>
]]></content:encoded>
			<wfw:commentRss>http://danieljlewis.org/2010/08/12/community-informatics-better-websites-for-the-health-of-local-areas/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Carl Steinitz, Symap and Place</title>
		<link>http://danieljlewis.org/2010/08/11/carl-steinitz-symap-and-place/</link>
		<comments>http://danieljlewis.org/2010/08/11/carl-steinitz-symap-and-place/#comments</comments>
		<pubDate>Wed, 11 Aug 2010 16:54:14 +0000</pubDate>
		<dc:creator>Daniel Lewis</dc:creator>
				<category><![CDATA[Cartography]]></category>
		<category><![CDATA[Representation]]></category>
		<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[Boston]]></category>
		<category><![CDATA[LSE]]></category>
		<category><![CDATA[MIT]]></category>
		<category><![CDATA[place]]></category>
		<category><![CDATA[Steinitz]]></category>
		<category><![CDATA[Symap]]></category>

		<guid isPermaLink="false">http://danieljlewis.org/?p=387</guid>
		<description><![CDATA[Recently, all and sundry had the chance to rummage through LSE Geography&#8217;s map library and liberate any maps of their choosing. Naturally some got overexcited (cf. James Cheshire) and took numerous maps of all sorts. I was slightly more selective, and whilst being mostly on the lookout for maps that represented social areas [...]]]></description>
			<content:encoded><![CDATA[
<p>Recently, all and sundry had the chance to rummage through LSE Geography&#8217;s map library and liberate any maps of their choosing. Naturally some got overexcited (cf. <a title="James Cheshire's Blog" href="http://spatialanalysis.co.uk/">James Cheshire</a>) and took numerous maps of all sorts. I was slightly more selective, and whilst being mostly on the lookout for maps that represented social areas (cf. <a title="LSE Booth Map Portal" href="http://booth.lse.ac.uk/" target="_blank">Booth Maps</a>) I did find one particularly interesting map.</p>
<p style="text-align: center"><a href="http://danieljlewis.org/files/2010/08/Steinitz.jpg"><img class="aligncenter size-full wp-image-388" title="Steinitz" src="http://danieljlewis.org/files/2010/08/Steinitz.jpg" alt="" width="436" height="645" /></a></p>
<p>The map is by Carl Steinitz, from a time when he was at the MIT Department of City and Regional Planning, and it appears to have been made using Symap. The map is entitled &#8220;The Principle Local Activity of a Place&#8221;. I think this title is both fascinating and, in terms of the development of spatial analysis, quite telling. First, however, some background.</p>
<p>Carl Steinitz is a Professor at Harvard Graduate School of Design, and has been a regular visitor at CASA for as long as I&#8217;ve been at UCL. He trained as an architect and planner, but became known as an early evangelist of Geographic Information Systems (GIS). His ongoing work concerns the design of environments, often urban, and the use of GIS to describe possible development trajectories. I respect him most for his impassioned stance against needless 3D visualisations, particularly if those visualisations have a musical backing.</p>
<p>Symap, aka the synagraphic mapping system, is one of the first software packages that could create outputs that closely resemble current desktop GIS outputs. It was developed in the mid-1960s and carried with it a distinctive style of using ASCII characters in order to draw map elements. Andrew Crooks has a couple of interesting examples and some background on his <a title="Symap info from GIS Agents" href="http://gisagents.blogspot.com/2009/10/symap-movie.html" target="_blank">blog</a>.</p>
<p>Now the map in question here doesn&#8217;t seem to have a date, which is a shame, and it does not give a specific location for the mapped area, although given that the map was made at MIT it becomes apparent that the area in question is Boston, Massachusetts, with the Charles River Basin in the north of the map, and Boston Harbour to the east. The legend denotes different kinds of &#8216;principle local activity&#8217;, using different ASCII characters to create a colour gradation. Unfortunately some of the particular legend categories are lost due to the low quality of reproduction on this particular map; nevertheless we see that Boston exhibits a distinct spatial patterning with respect to principle activity. This kind of map is not unusual: land use mapping is still an actively researched area that continues to generate copious debate. What interests me actually seems rather minor: it is the description of the map as presenting &#8220;The principle local activity of a <em>place</em>&#8221; (emphasis added). Initially I wondered whether this phrasing was simply standard boilerplate, but a Google search couldn&#8217;t find the exact phrase, or variations on it, anywhere else on the web (which is not to say that it isn&#8217;t standard, simply that it doesn&#8217;t exist on Google; I imagine it would have appeared had it been related to statistical reporting at some time or another). What it may mean then is that it marks the way in which the author, Carl Steinitz, saw the representation at the time: as a representation of the local activity of a place.</p>
<p>This is interesting. First, it is easy to assume that by place he meant &#8216;Boston&#8217;; Boston is after all a place. However, scale has a very interesting role to play in how we think about place: we can conceive of many places within Boston centred around communities of all kinds, and these places will be at least partially defined by the &#8216;local activities&#8217; that occur there. As such the gridded representation of this map hints at the possibility of lots of places within Boston, each with particular autobiographies, and each engaging people in different ways and offering different opportunities. Subsequent advances in GIS formalised the discourse of &#8216;space&#8217; and spatial analysis; after all, GIS does fundamentally hinge on the Euclidean system of representation, and as such the vast, expansive idea of space sits much better than a nuanced, specific, local concept such as place. It would be easy to disregard Steinitz&#8217;s map and say that of course it simply assesses land use in Boston by a grid of systematically defined areas, but that designation of &#8216;place&#8217; &#8211; purposeful or not &#8211; adds another layer of interpretation. Fundamentally it gives a different sense to what is being represented here.</p>
]]></content:encoded>
			<wfw:commentRss>http://danieljlewis.org/2010/08/11/carl-steinitz-symap-and-place/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hospital Outpatients in Southwark 08/09</title>
		<link>http://danieljlewis.org/2010/07/16/hospital-outpatients-in-southwark-0809/</link>
		<comments>http://danieljlewis.org/2010/07/16/hospital-outpatients-in-southwark-0809/#comments</comments>
		<pubDate>Fri, 16 Jul 2010 17:44:09 +0000</pubDate>
		<dc:creator>Daniel Lewis</dc:creator>
				<category><![CDATA[Health GIS]]></category>
		<category><![CDATA[Health Geography]]></category>
		<category><![CDATA[Southwark]]></category>
		<category><![CDATA[admissions]]></category>
		<category><![CDATA[HES]]></category>
		<category><![CDATA[ONS]]></category>
		<category><![CDATA[population]]></category>

		<guid isPermaLink="false">http://danieljlewis.org/?p=380</guid>
		<description><![CDATA[Amongst other things, I&#8217;m beginning to tap into a data source I have acquired for my research known as Hospital Episode Statistics (HES). These are datasets which record the particulars of hospital service use by patients. Generally they have a bit of a learning curve, and require the gathering of several additional datasets in order to [...]]]></description>
			<content:encoded><![CDATA[
<p>Amongst other things, I&#8217;m beginning to tap into a data source I have acquired for my research known as Hospital Episode Statistics (HES). These are datasets which record the particulars of hospital service use by patients. Generally they have a bit of a learning curve, and require the gathering of several additional datasets in order to make them useful. Having gathered all this data and put it all within a MySQL database I decided to conduct a basic analysis, using my study site of Southwark as a guinea pig. Essentially I wanted to know whether more people from Southwark were using hospitals for outpatient appointments than we would expect from national (England) figures. There are many reasons why any given area might be using health care services at a greater or lesser rate than other areas, but for the moment I simply wanted to see whether there were any interesting patterns.</p>
<p>In the HES data it is simple to calculate the total number of people using outpatient care; what is more complex is deriving an expected score from the national data. I went about it in the following way:</p>
<p>Firstly, I took the ONS experimental population projections from mid-2008 and calculated the number of people in each Southwark LSOA, and at the national (England) level, for each of the available age bands by men and women. The population projection age bands are quite coarse, giving totals for 5 population groups: 0-15, 16-29, 30-44, 45-64 (for men) or 45-59 (for women), and 65+ (for men) or 60+ (for women). This isn&#8217;t ideal, but the age bands do roughly correlate with the different groups of mortality causes in the Grim Reaper&#8217;s road map (Shaw, Thomas, Smith and Dorling, 2008). Then I calculated the admission totals for all of the age-sex bands nationally (England); with this I could create a ratio of admissions to population nationally for each band. By applying these ratios to the Southwark LSOA population projections I could create an expected value for the number of admissions per area. Finally it is simply a case of dividing the observed admissions by the expected and multiplying by 100 to get a score.</p>
<p>I mapped the results as follows. A score of 100 suggests that the area is not different from the national picture, whereas a value higher than 100 suggests that the area has more people using hospitals than we would expect and a value lower than 100 suggests the converse.</p>
<p style="text-align: left"><a href="http://danieljlewis.org/files/2010/07/Outpatient0809a.jpg"><img class="aligncenter size-large wp-image-384" title="Outpatient0809a" src="http://danieljlewis.org/files/2010/07/Outpatient0809a-724x1024.jpg" alt="" width="579" height="819" /></a>In the case of Southwark, the pattern seems to follow those that are often observed in my work on Southwark, in that the Bankside areas and the southern part of the borough, along with the north-eastern former docklands area, have levels of admissions that are equivalent to, or lower than, what we would expect nationally, whereas the central areas have admission numbers higher than the national level.</p>
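<p>The calculation can be sketched in a few lines of Python. The figures below are invented purely for illustration (they are not HES or ONS values), only two age bands are shown for a single hypothetical area, and the function names are hypothetical.</p>
<pre>def expected_admissions(national_admissions, national_population, local_population):
    """Apply the national admission rate in each band to the local population."""
    return sum(
        national_admissions[band] * local_population[band] / national_population[band]
        for band in local_population
    )

def admission_score(observed, expected):
    """Observed admissions as a percentage of expected admissions."""
    return 100.0 * observed / expected

# Invented figures for two age bands
national_admissions = {"0-15": 200, "16-29": 300}
national_population = {"0-15": 1000, "16-29": 2000}
local_population = {"0-15": 100, "16-29": 200}
observed = 60    # Invented count of admissions observed in the area

expected = expected_admissions(national_admissions, national_population, local_population)
score = admission_score(observed, expected)
</pre>
<p>Here the area would be expected to generate 50 admissions if it behaved like the national population, so 60 observed admissions gives a score of 120.</p>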
]]></content:encoded>
			<wfw:commentRss>http://danieljlewis.org/2010/07/16/hospital-outpatients-in-southwark-0809/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Household Types, Combinatorial Problems and Pure Maths</title>
		<link>http://danieljlewis.org/2010/07/15/household-types-combinatorial-problems-and-pure-maths/</link>
		<comments>http://danieljlewis.org/2010/07/15/household-types-combinatorial-problems-and-pure-maths/#comments</comments>
		<pubDate>Thu, 15 Jul 2010 18:17:07 +0000</pubDate>
		<dc:creator>Daniel Lewis</dc:creator>
				<category><![CDATA[Modeling]]></category>
		<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[combinatorics]]></category>
		<category><![CDATA[functions]]></category>
		<category><![CDATA[households]]></category>

		<guid isPermaLink="false">http://danieljlewis.org/?p=369</guid>
		<description><![CDATA[In some of the work I&#8217;m currently doing looking at households as derived from the Southwark patient register I wanted to go beyond a quantification of how many people lived in a household &#8211; a rudimentary household size, to looking at the composition of a household and hence what type of household it represented. In [...]]]></description>
			<content:encoded><![CDATA[
<p>In some of the work I&#8217;m currently doing looking at households as derived from the Southwark patient register I wanted to go beyond a quantification of how many people lived in a household &#8211; a rudimentary household size &#8211; to looking at the composition of a household and hence what type of household it represented. In order to do this I looked at how types of household were generally reported in the UK Census, in European statistics, and in terms of social research on the life course, as well as in the health literature itself. In terms of defining households, I found that although complex household typologies do exist, there is a general set of likely household forms: as expected these revolve around the single, co-habiting, family, single parenthood and extended family models. As I have data on individuals, I first decided to classify individuals into 5 broad categories that seem important in the literature and then look at the composition of these categories within households. The categories were:</p>
<p>1) Dependent Children (&lt;18 yrs old)</p>
<p>2) Adult Male (18-65 yrs old)</p>
<p>3) Adult Female (18-60 yrs old)</p>
<p>4) Male Pensioner (65+ yrs old)</p>
<p>5) Female Pensioner (60+ yrs old)</p>
<p>Evidence suggests that these represent the coarsest categories that could usefully represent significant periods within the life course, as well as being relevant to changes in health status. In a sense, different types of household structure can be described by different combinations of these person classes for different household sizes.</p>
<p>I decided to test this by calculating all the possible combinations of these 5 classes for a 2 person household and then looking at their uptake in the actual household data I had derived from the Southwark patient register. It turned out that for a two person household there were 15 different ways in which you could combine the 5 person classes in order to create a unique household:</p>
<p><em>Children Only (Parents Unregistered); Single Parent Male and Child; Co-Habiting Men; Single Parent Female and Child; Single Parent Male Pensioner and Child; Co-Habiting Man and Woman; Co-Habiting Man and Male Pensioner; Co-Habiting Women; Single Parent Female Pensioner and Child; Cohabiting Woman and Male Pensioner; Cohabiting Man and Female Pensioner; Cohabiting Male Pensioners; Cohabiting Woman and Female Pensioner; Cohabiting Male and Female Pensioner; Cohabiting Female Pensioners.</em></p>
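<p>The count of 15 can be reproduced by enumerating every unordered pair of person classes, allowing the same class to appear twice. The sketch below is not the script I used; the names <code>PERSON_CLASSES</code>, <code>household_types</code> and <code>household_type</code> are just illustrative.</p>

```python
from itertools import combinations_with_replacement

# The 5 person classes listed above, in the same order.
PERSON_CLASSES = [
    "Dependent Child",
    "Adult Male",
    "Adult Female",
    "Male Pensioner",
    "Female Pensioner",
]


def household_types(size):
    """Every unordered combination of person classes for a household of the given size."""
    return list(combinations_with_replacement(PERSON_CLASSES, size))


def household_type(members):
    """Put the classes of a household's members into a canonical order so they can be matched."""
    return tuple(sorted(members, key=PERSON_CLASSES.index))


two_person = household_types(2)
print(len(two_person))
print(household_type(["Adult Female", "Dependent Child"]))
```

<p>Sorting the members into a fixed order means that a household recorded as (woman, child) and one recorded as (child, woman) are counted as the same type.</p>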
<p>Using this typology of 15 possible household types, I extracted the two person households from the larger dataset and wrote a Python script to classify these households. The result for 27,124 households was as follows:</p>
<p style="text-align: left"><a href="http://danieljlewis.org/files/2010/07/2personHHtype.png"><img class="aligncenter size-full wp-image-370" title="2personHHtype" src="http://danieljlewis.org/files/2010/07/2personHHtype.png" alt="" width="594" height="330" /></a>What this graph seems to demonstrate is that roughly half of all 2 person households consist of a man and a woman (either adult or pensioner) cohabiting, with roughly a further 22% made up of same sex cohabitation. In this dataset of two person households, single parents only make up around 15% of households, with almost 13% being a single female parent (adult or pensioner) and a child. All other groups only make up around 13% of households, but crucially the only category in which no households were found to exist was the adult man cohabiting with a male pensioner category. Indeed, many of the smaller categories can be interpreted as having inherently important social roles, the adult woman looking after a male or female pensioner for instance.</p>
<p style="text-align: left">Essentially, the terrain of household type was a lot more nuanced and tricky than I&#8217;d at first thought, made even more so by my realisation that as household size increases, the number of possible combinations of the person types within a household increases dramatically. I wrote a Python script to calculate the number of possible different sets of people for the household sizes 1 to 10:</p>
<p style="text-align: left"><a href="http://danieljlewis.org/files/2010/07/possibles.png"><img class="aligncenter size-full wp-image-373" title="possibles" src="http://danieljlewis.org/files/2010/07/possibles.png" alt="" width="564" height="334" /></a>This presents a difficult situation, even for reasonably small households. Counting problems of this kind belong to the field known as &#8220;combinatorial mathematics&#8221; or &#8220;<a title="Wiki - combinatorics" href="http://en.wikipedia.org/wiki/Combinatorics" target="_blank">combinatorics</a>&#8221;. I decided to see what I could learn about this distribution, so I looked for patterns in the sequence, as you are taught in pre-GCSE maths, and soon found that the sequence had a constant fourth difference:</p>
<p style="text-align: left"><a href="http://danieljlewis.org/files/2010/07/difference-table.png"><img class="aligncenter size-full wp-image-375" title="difference table" src="http://danieljlewis.org/files/2010/07/difference-table.png" alt="" width="622" height="226" /></a>This constant fourth difference indicated that the sequence can be explained by a quartic function, the form of which it was then easy to calculate:</p>
<p style="text-align: left"><a href="http://danieljlewis.org/files/2010/07/CodeCogsEqn4.gif"><img class="aligncenter size-full wp-image-376" title="CodeCogsEqn(4)" src="http://danieljlewis.org/files/2010/07/CodeCogsEqn4.gif" alt="" width="556" height="22" /></a></p>
<p style="text-align: left">Sadly not one of those classically beautiful equations.</p>
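<p style="text-align: left">For anyone wanting to check the sequence and its difference table, here is a short sketch. It assumes that the number of possible households is the number of multisets of a given size drawn from the 5 classes, which is the binomial coefficient C(n + 4, 4); that is a standard result rather than something taken from my script, and the function names are only illustrative.</p>

```python
from math import comb


def n_household_types(size, n_classes=5):
    """Number of unordered combinations (multisets) of n_classes person classes for a household."""
    return comb(size + n_classes - 1, n_classes - 1)


def differences(sequence):
    """Differences between consecutive terms of a sequence."""
    return [b - a for a, b in zip(sequence, sequence[1:])]


def quartic(n):
    """C(n + 4, 4) expanded as a polynomial in n."""
    return (n**4 + 10 * n**3 + 35 * n**2 + 50 * n + 24) // 24


counts = [n_household_types(n) for n in range(1, 11)]
print(counts)

# Take differences four times; a quartic sequence leaves a constant row.
row = counts
for _ in range(4):
    row = differences(row)
print(row)
```

<p style="text-align: left">Written out, C(n + 4, 4) = (n + 1)(n + 2)(n + 3)(n + 4) / 24, which is one way of expressing a quartic with a constant fourth difference of 1.</p>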
<p style="text-align: left">This all leads to the issue of how I now classify households. Clearly the number of possible sets makes anything above around 4 people per household fairly intractable. I&#8217;ll experiment with 3 person households and see whether I can account for most household types with a few set patterns, and then look at households that fall outside of this remit.</p>
<p style="text-align: left">Interesting nonetheless; I hadn&#8217;t expected to be doing much of this kind of maths!</p>
]]></content:encoded>
			<wfw:commentRss>http://danieljlewis.org/2010/07/15/household-types-combinatorial-problems-and-pure-maths/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Computing the geometric median in Python</title>
		<link>http://danieljlewis.org/2010/07/09/computing-the-geometric-median-in-python/</link>
		<comments>http://danieljlewis.org/2010/07/09/computing-the-geometric-median-in-python/#comments</comments>
		<pubDate>Fri, 09 Jul 2010 10:17:56 +0000</pubDate>
		<dc:creator>Daniel Lewis</dc:creator>
				<category><![CDATA[Modeling]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[allocation]]></category>
		<category><![CDATA[dijkstra]]></category>
		<category><![CDATA[geometric]]></category>
		<category><![CDATA[location]]></category>
		<category><![CDATA[matplotlib]]></category>
		<category><![CDATA[median]]></category>
		<category><![CDATA[optimisation]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[service]]></category>

		<guid isPermaLink="false">http://danieljlewis.org/?p=362</guid>
		<description><![CDATA[I noticed in a beta of ArcGIS 10 (then called 9.4) that there was a &#8216;new&#8217; option for computing a Geometric Median which didn&#8217;t exist in my copy of ArcGIS 9.3. This is an interesting concept, as in 1d statistics, the geometric (2d) mean is easy to calculate, being the average of all the X [...]]]></description>
			<content:encoded><![CDATA[
<p>I noticed in a beta of ArcGIS 10 (then called 9.4) that there was a &#8216;new&#8217; option for computing a Geometric Median which didn&#8217;t exist in my copy of ArcGIS 9.3. This is an interesting concept. As in 1d statistics, the geometric (2d) mean is easy to calculate, being the average of all the X coords and all the Y coords. From stats we know that the mean and median of a distribution will coincide if the data is perfectly normally distributed; however, real world data usually only approximates a normal distribution, leading to a mean value that is different from the midpoint, or median. Therefore, for a skewed distribution on the plane, the mean is not necessarily the best representation of the &#8216;centre&#8217; of the data, so we may wish to calculate the median; doing so will also give us a good idea of the direction of the skew of the point pattern we are investigating. In calculating the median of a 2d point pattern we can express the problem as a need to:</p>
<p><em>minimise the sum of distances from all points in a distribution to a centre.</em> (Minimising the sum of squared distances would instead give the mean centre.)</p>
<p>Thus it is reasonably clear that we are dealing with an &#8216;optimisation problem&#8217;, something that I have experimented with before in work I conducted using the &#8216;transportation problem&#8217;, a classic linear programming problem.</p>
<p>In terms of application, I thought that finding the median of a distribution of people around a service would be a useful, albeit basic, indication of whether all people were making a similar trip to a service, or whether there were other effects at work (this would be evidenced by a median centre that was not close to the actual service location). I thought I would be able to code the optimisation routine in Python using pre-existing insight. Notably, the <a title="Geometric Median" href="http://en.wikipedia.org/wiki/Geometric_median" target="_blank">Wikipedia page</a> on this details the Weiszfeld algorithm as the acknowledged computational solution to the geometric median problem. It takes the form:</p>
<p><a href="http://danieljlewis.org/files/2010/07/weiszfeld.png"><img class="aligncenter size-full wp-image-363" title="weiszfeld" src="http://danieljlewis.org/files/2010/07/weiszfeld.png" alt="" width="368" height="61" /></a>However, actually writing the algorithm proved somewhat tough. Essentially the answer is to start with a candidate point (I started with the mean centre) and calculate the objective function &#8211; in this case the sum of the Euclidean distances of all points from the candidate centre. Then pass the candidate point through the Weiszfeld algorithm and reassess the objective function; once the objective function converges, a median has been found. Because the objective function is convex, the point found is the optimal median provided the iteration does not stall on one of the data points, although if the points are all collinear there may be more than 1 optimal solution. Below is a solution for some of my data (the data has been randomly offset by 75m to preserve anonymity) on patient registrations to a doctor.</p>
<p style="text-align: center"><a href="http://danieljlewis.org/files/2010/07/geomedian.png"><img class="aligncenter size-large wp-image-365" title="geomedian" src="http://danieljlewis.org/files/2010/07/geomedian-1024x742.png" alt="" width="574" height="415" /></a>Here we can see that the mean and median centres are slightly different, suggesting that the patient population is skewed slightly northwards, most likely as a result of discontinuous urban infrastructure.</p>
<p style="text-align: left">The scatterplot was achieved using the <a title="MatPlotLib @ Sourceforge" href="http://matplotlib.sourceforge.net/index.html" target="_blank">matplotlib</a> Python plotting library. This was just a test, but I imagine more complex outputs can be achieved reasonably easily.</p>
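<p style="text-align: left">The iteration described above can be sketched as follows. This is a minimal version written for illustration rather than the code linked from this post: the function names, the tolerance and the iteration cap are arbitrary choices, and skipping any data point that coincides exactly with the candidate centre is a simplification to avoid dividing by zero.</p>

```python
import math


def total_distance(points, centre):
    """Objective function: sum of Euclidean distances from every point to the centre."""
    return sum(math.dist(p, centre) for p in points)


def geometric_median(points, tol=1e-9, max_iter=1000):
    """Weiszfeld iteration for 2d points, starting from the mean centre."""
    n = len(points)
    centre = (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)
    objective = total_distance(points, centre)
    for _ in range(max_iter):
        wx = wy = wsum = 0.0
        for x, y in points:
            d = math.dist((x, y), centre)
            if d == 0.0:
                continue  # simplification: ignore a point sitting exactly on the candidate
            wx += x / d
            wy += y / d
            wsum += 1.0 / d
        if wsum == 0.0:
            break  # every point coincides with the candidate
        centre = (wx / wsum, wy / wsum)
        new_objective = total_distance(points, centre)
        if abs(objective - new_objective) < tol:
            break  # objective has converged
        objective = new_objective
    return centre


# Made-up points with one outlier pulling the mean away from the bulk of the data.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
mean_centre = (2.75, 2.75)
median_centre = geometric_median(points)
print(median_centre, total_distance(points, median_centre))
```

<p style="text-align: left">With the made-up points above, the median centre stays much closer to the three clustered points than the mean centre does, which is the kind of skew the scatterplot is showing.</p>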
<p style="text-align: left">Notably, this technique uses Euclidean distance, which in a dense urban environment may be misleading. I note that there is a relatively simple implementation of the <a title="Python Dijkstra" href="http://code.activestate.com/recipes/119466-dijkstras-algorithm-for-shortest-paths/" target="_blank">Dijkstra algorithm for shortest paths in Python</a>, and I am curious whether this could be integrated to find a geometric median on the network. I suspect that it may be unworkable due to computational time constraints, although for smaller problems it might be ok.</p>
<p style="text-align: left">Naturally there are algorithms that can calculate a solution to the above for <em>p</em>-medians (i.e. several service centres in a population, commonly known as location-allocation). It is something that <a title="Paul Densham" href="http://www.geog.ucl.ac.uk/~pdensham/s_t_paper.html" target="_blank">Paul Densham</a> at UCL has worked on, and his code is making a return to service in ArcGIS version 10. I&#8217;m looking forward to seeing it, as it is a very difficult problem to solve (and in fact already has been &#8216;solved&#8217;), and not one I intend to investigate!</p>
<p style="text-align: left">My code for the geometric median is <a href="http://danieljlewis.org/files/2010/07/geomedian.pdf">here.</a></p>
]]></content:encoded>
			<wfw:commentRss>http://danieljlewis.org/2010/07/09/computing-the-geometric-median-in-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Distribution of Household Occupancy in Southwark</title>
		<link>http://danieljlewis.org/2010/06/09/distribution-of-household-occupancy-in-southwark/</link>
		<comments>http://danieljlewis.org/2010/06/09/distribution-of-household-occupancy-in-southwark/#comments</comments>
		<pubDate>Wed, 09 Jun 2010 14:19:05 +0000</pubDate>
		<dc:creator>Daniel Lewis</dc:creator>
				<category><![CDATA[Geography]]></category>
		<category><![CDATA[Southwark]]></category>
		<category><![CDATA[distribution]]></category>
		<category><![CDATA[exponential decay]]></category>
		<category><![CDATA[households]]></category>
		<category><![CDATA[log]]></category>
		<category><![CDATA[social]]></category>

		<guid isPermaLink="false">http://danieljlewis.org/?p=355</guid>
		<description><![CDATA[I&#8217;ve been doing some more analysis on the Southwark GP patient register at the household level. After a fair amount of cleaning and interpretation I&#8217;ve arrived at the following distribution of households. There are a number of interesting things to say about this data, not least in the section that I&#8217;ve marked &#8216;larger social groupings&#8217; [...]]]></description>
			<content:encoded><![CDATA[
<p>I&#8217;ve been doing some more analysis on the Southwark GP patient register at the household level. After a fair amount of cleaning and interpretation I&#8217;ve arrived at the following distribution of households.</p>
<p style="text-align: center"><a href="http://danieljlewis.org/files/2010/06/HHDistAnnotate.png"><img class="aligncenter size-full wp-image-356" title="HHDistAnnotate" src="http://danieljlewis.org/files/2010/06/HHDistAnnotate.png" alt="" width="578" height="380" /></a></p>
<p style="text-align: left">There are a number of interesting things to say about this data, not least in the section that I&#8217;ve marked &#8216;larger social groupings&#8217;, which seems to suggest a possible migrant social network effect. The larger household groupings tend to be of minority ethnic groups, including Nigerians and other Africans, Hispanics and South-East Asians, who are perhaps using cross-country social ties as help in getting established when first arriving in the UK. However, visually the shape of the distribution of household occupancy is very distinctive, and is actually very close to an exponential. Here I&#8217;ve taken the log of the frequency of occurrence and plotted the best-fit line through the plot:</p>
<p style="text-align: left"><a href="http://danieljlewis.org/files/2010/06/LogHHDist.png"><img class="aligncenter size-large wp-image-358" title="LogHHDist" src="http://danieljlewis.org/files/2010/06/LogHHDist-1024x682.png" alt="" width="574" height="382" /></a>This linear trend means that the model <strong>log(y) = -0.1635x + 4.602 </strong>is a good predictor of the log of the number of households, y, we can expect to exist in Southwark for a given value of x, or occupancy.</p>
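<p style="text-align: left">A best-fit line of this kind can be obtained with ordinary least squares on the logged frequencies. The sketch below is illustrative only: the frequencies are synthetic values generated from the model itself, not the Southwark data, and it assumes a base-10 log, which may not be the base used in the plot.</p>

```python
import math


def fit_log_linear(occupancy, frequency):
    """Least-squares line through (x, log10(y)); returns (slope, intercept)."""
    logs = [math.log10(f) for f in frequency]
    n = len(occupancy)
    mean_x = sum(occupancy) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in occupancy)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(occupancy, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic frequencies that follow the fitted model exactly (assumption: log base 10).
occupancy = list(range(1, 11))
frequency = [10 ** (-0.1635 * x + 4.602) for x in occupancy]

slope, intercept = fit_log_linear(occupancy, frequency)
print(round(slope, 4), round(intercept, 3))
```

<p style="text-align: left">Because the synthetic data lie exactly on the line, the fit simply recovers the slope and intercept; real frequencies would scatter around it.</p>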
<p style="text-align: left">It is not entirely clear, however, why this is the case. Firstly, it may just be an artifact of the data: of the matching process that has occurred between the patient register and OS AddressLayer2, of the way that GPs encode patient addresses in the first place, or of the fact that the patient register is only a sample of the total population of Southwark, i.e. those people who register with a doctor. Secondly, it may simply be a reflection of the structure of the built environment in Southwark &#8211; i.e. what kind of housing is actually available. However, the distribution is also subject to the choices of individuals or groups.</p>
<p style="text-align: left">Currently, I am in the process of disaggregating the above characteristics and looking at trends by different population groups.</p>
]]></content:encoded>
			<wfw:commentRss>http://danieljlewis.org/2010/06/09/distribution-of-household-occupancy-in-southwark/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Review of Elementary Statistics for Geographers - Burt et al.</title>
		<link>http://danieljlewis.org/2010/06/07/review-of-elementary-statistics-for-geographers-bert-et-al/</link>
		<comments>http://danieljlewis.org/2010/06/07/review-of-elementary-statistics-for-geographers-bert-et-al/#comments</comments>
		<pubDate>Mon, 07 Jun 2010 16:07:13 +0000</pubDate>
		<dc:creator>Daniel Lewis</dc:creator>
				<category><![CDATA[Geography]]></category>
		<category><![CDATA[review]]></category>
		<category><![CDATA[RSS]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://danieljlewis.org/?p=351</guid>
		<description><![CDATA[A review I authored of Burt, Barber and Rigby&#8217;s &#8220;Elementary Statistics for Geographers&#8221; third edition, has made it into the Journal of the Royal Statistical Society Series A: Statistics in Society. The book is a truly excellent collection of statistical methods themed explicitly for use by geographers and spatial scientists, moreover the explanation and presentation [...]]]></description>
			<content:encoded><![CDATA[
<p style="text-align: left"><a href="http://danieljlewis.org/files/2010/06/geogstats.jpg"><img class="size-full wp-image-352 alignleft" title="geogstats" src="http://danieljlewis.org/files/2010/06/geogstats.jpg" alt="" width="185" height="240" /></a>A review I authored of Burt, Barber and Rigby&#8217;s &#8220;<em>Elementary Statistics for Geographers</em>&#8221; third edition has made it into the Journal of the Royal Statistical Society Series A: Statistics in Society. The book is a truly excellent collection of statistical methods themed explicitly for use by geographers and spatial scientists; moreover, the explanation and presentation are superb. This has become a core book for me and my colleague <a title="James' blog" href="http://spatialanalysis.co.uk/" target="_blank">James Cheshire</a> as we continue along the route of our PhD studies. I have said much the same thing in my review, accessible <a title="RSS A: Statistics in Society" href="http://www3.interscience.wiley.com/journal/123305751/abstract" target="_blank">here.</a></p>
]]></content:encoded>
			<wfw:commentRss>http://danieljlewis.org/2010/06/07/review-of-elementary-statistics-for-geographers-bert-et-al/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Jenks&#8217; Natural Breaks Algorithm in Python</title>
		<link>http://danieljlewis.org/2010/06/07/jenks-natural-breaks-algorithm-in-python/</link>
		<comments>http://danieljlewis.org/2010/06/07/jenks-natural-breaks-algorithm-in-python/#comments</comments>
		<pubDate>Mon, 07 Jun 2010 15:53:27 +0000</pubDate>
		<dc:creator>Daniel Lewis</dc:creator>
				<category><![CDATA[Modeling]]></category>
		<category><![CDATA[choropleth]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[Jenks]]></category>
		<category><![CDATA[optimisation]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://danieljlewis.org/?p=347</guid>
		<description><![CDATA[The Jenks Optimal, or Jenks&#8217; Natural Breaks, Algorithm is a common method for classifying data presented in a choropleth map. It aims to present a series of break values that best represent the actual breaks observed in the data as opposed to some arbitrary classificatory scheme (i.e. equal interval), in this way the actual clustering [...]]]></description>
			<content:encoded><![CDATA[
<p>The Jenks Optimal, or Jenks&#8217; Natural Breaks, Algorithm is a common method for classifying data presented in a choropleth map. It aims to present a series of break values that best represent the actual breaks observed in the data, as opposed to some arbitrary classificatory scheme (e.g. equal interval); in this way the actual clustering of data values is preserved (subject to the arbitrary specification of <em>k </em>classes). The method was originally published in George Jenks&#8217; (1977) <em>Optimal Data Classification for Choropleth Maps</em> and reportedly represented the culmination of 15 years of research on the topic, the method being primarily derived from Walter Fisher&#8217;s work &#8216;<em>On grouping for maximum homogeneity</em>&#8217;. The algorithm aims to create <em>k </em>classes so that the variance within groups is minimised; as such it is a problem of numerical optimisation.</p>
<p>A paper by Michael Coulson (1987) entitled <em>In The Matter Of Class Intervals For Choropleth Maps: With Particular Reference To The Work Of George F Jenks </em>details a method that Jenks apparently authored, but never published, to assess how well a chosen number of classes fits the data. The method, Goodness of Variance Fit (GVF), works by taking the difference between the sum of squared deviations from the array mean (SDAM) and the sum of squared deviations from the class means (SDCM), and dividing by the SDAM. Thus:</p>
<p style="text-align: center">GVF = (SDAM &#8211; SDCM)/SDAM</p>
<p style="text-align: left">However, it is likely this was never published because the GVF improves as the number of classes increases, until, at the point where there are as many classes as data points, the GVF reaches unity. Nonetheless, I have included a rudimentary example for calculating this statistic. In reality, this method is used to generalise data into a few classes for visualisation, so you are unlikely to be using more than 7 (+/- 2) classes. The number of classes can be loosely assigned by looking at the distribution histogram, but often this is difficult.</p>
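<p style="text-align: left">As a quick illustration of the GVF formula, the sketch below computes it for a given set of class breaks. It is separate from the linked script and does not find the breaks itself; the function names, the convention that each break is the inclusive upper bound of a class, and the data values are all assumptions made for the example.</p>

```python
def sum_squared_deviations(values):
    """Sum of squared deviations of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def goodness_of_variance_fit(values, upper_bounds):
    """GVF = (SDAM - SDCM) / SDAM for classes defined by ascending inclusive upper bounds."""
    sdam = sum_squared_deviations(values)
    sdcm = 0.0
    lower = float("-inf")
    for upper in upper_bounds:
        members = [v for v in values if lower < v <= upper]
        if members:
            sdcm += sum_squared_deviations(members)
        lower = upper
    return (sdam - sdcm) / sdam


# Made-up data with two obvious clusters.
data = [1, 2, 3, 10, 11, 12]
print(goodness_of_variance_fit(data, [12]))      # one class
print(goodness_of_variance_fit(data, [3, 12]))   # two classes
print(goodness_of_variance_fit(data, data))      # one class per value
```

<p style="text-align: left">Running it on the made-up data shows the behaviour described above: the GVF rises as classes are added and reaches unity once every value has its own class.</p>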
<p style="text-align: left">The script is <a href="http://danieljlewis.org/files/2010/06/Jenks.pdf">here.</a></p>
<p style="text-align: left">Acknowledgement: The initial script I used for the Python conversion can be found (in JAVA and Fortran) here: https://stat.ethz.ch/pipermail/r-sig-geo/2006-March/000811.html</p>
]]></content:encoded>
			<wfw:commentRss>http://danieljlewis.org/2010/06/07/jenks-natural-breaks-algorithm-in-python/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
