Hot-Spots and Geocoding

Greetings from the Westside Community Alliance public safety team! Over the past few weeks we spent a lot of time identifying potential use cases among our stakeholders. After all, we should only build something if two things are true: we can do it, and someone has a use for it. A common thread among the stakeholders is the need for a tool that identifies locations where they should focus their crime prevention strategies. To find such a tool we began by researching the strategies law enforcement uses. After all, they have direct access to crime data, a vested interest in preventing crime, and a lot of cash. Their go-to tool is the heatmap, essentially a 2-dimensional histogram of crime data taken over some area and color-coded by crimes per population. They identify hot-spots by flagging outlier bins in the heatmap as areas of high crime. Police also use a variety of clustering algorithms to identify hot-spots; one method is to fit a number of ellipses to the heatmap, and another is to use the distance from each crime to its nearest neighbor (clusters contain an overabundance of points that are near one another).
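To make the heatmap idea concrete, here is a minimal sketch of binning points into a grid and flagging outlier cells. It is only an illustration under stated assumptions: the coordinates are made up, the function names `bin_counts` and `flag_hot_cells` are ours, the "mean plus k standard deviations" threshold is one simple way to define an outlier, and the per-population normalization is left out.

```python
from collections import Counter
from statistics import mean, pstdev

def bin_counts(points, cell_size):
    """Count points per square grid cell; a cell is named by its (row, col) index."""
    counts = Counter()
    for y, x in points:
        counts[(int(y // cell_size), int(x // cell_size))] += 1
    return counts

def flag_hot_cells(counts, n_cells, k=2.0):
    """Return cells whose count exceeds mean + k * std, taken over all n_cells
    cells of the grid (cells with no points count as 0)."""
    values = list(counts.values()) + [0] * (n_cells - len(counts))
    threshold = mean(values) + k * pstdev(values)
    return sorted(cell for cell, c in counts.items() if c > threshold)

# Made-up points on a 10 x 10 grid of unit cells: one dense cell plus a
# scattering of single incidents along the diagonal.
points = [(2.5, 3.5)] * 30 + [(i + 0.5, i + 0.5) for i in range(10)]
counts = bin_counts(points, cell_size=1.0)
hot = flag_hot_cells(counts, n_cells=100)
print(hot)  # [(2, 3)]
```

Only the dense cell is flagged here. With real data the choice of cell size and threshold matters a great deal, so treat both as parameters to experiment with.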

A map of hot-spots for Portland, Oregon.

If the data is 100% accurate, maps of hot-spots are a great method for locating areas prone to crime. However, if the geocoding has errors (which may be from partial street-name recognition, streets with different names and/or numbers, mistakes in data entry, etc.) heavy biases may be introduced. For instance, if a street had a different name in the past, and the geocoding for the entire street failed because of the name-change, then the bins in the heatmap along the street will be artificially low, causing the hot-spot location method to work incorrectly. Our stakeholders could then concentrate their efforts in the wrong area!

If we are going to use a map of hot-spots, then we need to know how accurate our geocoding is. This is difficult, as there are over 2 million records spanning a long period, from 1997 to the present. To begin, we build a list of the unique addresses in our data, with the number of crimes recorded at each:

3393 PEACHTREE RD NE        1172
1801 HOWELL MILL RD NW       947
590 CASCADE AVE SW           642
2841 GREENBRIAR PKWY SW      535
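A tally like the one above can be sketched with the standard library. The short `addresses` list below is a hypothetical stand-in for the address column of the real data, so the counts it prints are not the real ones.

```python
from collections import Counter

# Hypothetical sample of the address column; the real data has over 2 million rows.
addresses = [
    "3393 PEACHTREE RD NE",
    "1801 HOWELL MILL RD NW",
    "3393 PEACHTREE RD NE",
    "590 CASCADE AVE SW",
    "3393 PEACHTREE RD NE",
    "1801 HOWELL MILL RD NW",
]

# Count how many records carry each exact address string.
address_counts = Counter(addresses)

# Print addresses from most to least frequent.
for address, n in address_counts.most_common():
    print(f"{address:<30}{n:>6}")
```

Note that the tally is over exact strings, so two spellings of the same place are counted as two different addresses.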

To test the consistency of the geocoding (provided by the Atlanta Police Dept.) we calculate how many unique latitudes there are for 3393 PEACHTREE RD NE, or Lenox mall:

33.84676    1111
33.84892      42
33.84600       2
33.84914       2
33.84746       2
33.76651       1

Out of 1172 records, 61 have a latitude other than 33.84676, for a failure rate of about 5%. Most of the stray values are close to one another, though, with a standard deviation of ~1 arcminute, roughly a mile.
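The failure rate is simple arithmetic on the counts quoted above. The sketch below redoes it and wraps the same idea in a small helper, `modal_mismatch_rate` (our name, not part of any library), that treats the most common value as the "correct" one. That is an assumption: the most common latitude is not guaranteed to be the right one.

```python
from collections import Counter

def modal_mismatch_rate(values):
    """Fraction of values that differ from the most common value."""
    counts = Counter(values)
    _, modal_count = counts.most_common(1)[0]
    return (len(values) - modal_count) / len(values)

# Numbers quoted in the post for 3393 PEACHTREE RD NE.
total_records = 1172  # records at this address
modal_count = 1111    # records with the most common latitude, 33.84676

mismatched = total_records - modal_count
rate = mismatched / total_records
print(mismatched, f"{rate:.1%}")  # 61 5.2%
```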

To tackle this problem from the other side, we count the unique latitude/longitude pairs. To do this we compute each record's distance to a reference point (Google’s answer to Atlanta’s latitude/longitude: 33.7490° N, 84.3880° W) and make a list of the unique distances, in arcminutes, with the number of records at each:

6.067656    1803
7.360060     970
0.999120     910
3.612908     906
0.395376     884
7.361331     883
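The distance formula is not spelled out here, so the sketch below assumes the simplest option: a plain Euclidean distance in latitude/longitude space, converted from degrees to arcminutes, with no correction for longitude lines converging away from the equator. The function names and the sample coordinates are illustrative only.

```python
import math
from collections import Counter

# Reference point quoted in the post; 84.3880° W is a negative longitude.
REF_LAT, REF_LON = 33.7490, -84.3880

def distance_arcmin(lat, lon, ref_lat=REF_LAT, ref_lon=REF_LON):
    """Euclidean distance in degree space, in arcminutes (assumed formula)."""
    return 60.0 * math.hypot(lat - ref_lat, lon - ref_lon)

def distance_counts(coords, ndigits=6):
    """Count records per distance, rounded so identical coordinates compare equal."""
    return Counter(round(distance_arcmin(lat, lon), ndigits) for lat, lon in coords)

# Made-up coordinates: three records 0.1° north of the reference point and
# one record 0.05° east of it.
coords = [(33.8490, -84.3880)] * 3 + [(33.7490, -84.3380)]
counts = distance_counts(coords)
for dist, n in counts.most_common():
    print(f"{dist:<12}{n}")
```

One caveat with using a single distance as a stand-in for a coordinate pair: two different locations that happen to lie equally far from the reference point would be merged. Counting `(lat, lon)` tuples directly avoids that.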

Note that there are significantly more records at a distance of 6.067656 arcminutes (1803) than there are at 3393 PEACHTREE RD NE (1172). What is going on? When we select the rows whose distance equals 6.067656 arcminutes and look at the location column, we find:

3393 PEACHTREE RD NE                1111
3393 PEACHTREE RD NE @LENOX MALL     278
3393 PEACHTREE RD                    120
3393 PEACHTREE ROAD                   51
3393 PEACHTREE RD.                    29
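This kind of lookup, listing every address string recorded at one location, can be sketched as follows. As a simplification it groups on the coordinate pair itself rather than on the distance. The records are hypothetical: the longitudes and the `100 EXAMPLE ST` row are placeholders, not values from the data.

```python
from collections import Counter, defaultdict

# Hypothetical records: (address, latitude, longitude). The longitudes and the
# last row are placeholders for illustration, not values from the data.
records = [
    ("3393 PEACHTREE RD NE", 33.84676, -84.36200),
    ("3393 PEACHTREE RD NE @LENOX MALL", 33.84676, -84.36200),
    ("3393 PEACHTREE ROAD", 33.84676, -84.36200),
    ("3393 PEACHTREE RD NE", 33.84676, -84.36200),
    ("100 EXAMPLE ST", 33.75000, -84.40000),
]

# Map each coordinate pair to a tally of the address strings recorded there.
addresses_by_coord = defaultdict(Counter)
for address, lat, lon in records:
    addresses_by_coord[(lat, lon)][address] += 1

# Keep only coordinate pairs that carry more than one distinct address string.
ambiguous = {
    coord: tally for coord, tally in addresses_by_coord.items() if len(tally) > 1
}

for coord, tally in ambiguous.items():
    print(coord)
    for address, n in tally.most_common():
        print(f"  {address:<40}{n:>5}")
```

A coordinate pair with several address strings is not necessarily an error. It can be a harmless spelling variant or a genuinely different place, and telling the two apart still takes a closer look.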

There are numerous synonyms for the Lenox mall. While many of them differ only in formatting, some specify which store the crime was committed in:

3393 PEACHTREE RD NE ; MACY’S                                              1

and thus some information is lost, as these separate places are geocoded with the same latitude and longitude. More severe errors are also present in the geocoding. For example, at the 20th most common distance, 1.13059361399 arcminutes, we find:

550 PEACHTREE ST         1

these are completely separate addresses, yet they share the same latitude/longitude pair! We will have to dive deeper to solve these problems, or at the very least to quantify their effect.