Predicting and Plotting Crime in Seattle

I have recently been watching “The Wire”, and along with making my Amazon Prime membership look better and better, it has actually given me some things to think about. Besides making me an expert police detective, it has steadily changed how I view a city. It sounds kind of cheesy, but I never really appreciated how nice Seattle is until I looked at the kind of crime that a city like Baltimore can experience on a day-to-day basis.

There is also a more quantitative way to determine this than having me describe how the bleak images of Baltimore’s projects are far worse than those of Seattle. Socrata is a company based in Pioneer Square in downtown Seattle that has worked with governments to release open data to the public. Their most popular data-set on the Seattle city page, however, is the “Seattle Police Department 911 Incident Response”, currently with over 45,000 views.

So there are a couple of interesting things to look at in this data-set. Notably, with over a million rows covering more than four years of data, can we find something worth telling? I initially tried reading in the data-set using the API but only got 1,000 rows back. I didn’t really want to sample it, so I downloaded the CSV and got a massive 250 MB file. Luckily enough, Pandas can destroy this.
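As a rough sketch of that loading step (the file name here is assumed, not the portal’s exact export name):

import pandas as pd

# Read the full ~250 MB CSV export from the Socrata portal
police = pd.read_csv('Seattle_Police_911_Incident_Response.csv')
print(len(police))  # over a million rows spanning four-plus years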

Snapshot of the Seattle Police Dataset

After dropping all the rows where the Seattle Police did not record the time or date of arrival on scene, I was left with around 160,000 rows. I initially tried plotting the Seattle police “At Scene Time” values on a histogram.
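A minimal sketch of that filtering and plotting step (the “At Scene Time” column name comes from the data-set; the bin count is inferred from the 12.5-minute breaks mentioned below):

import matplotlib.pyplot as plt

# Keep only rows where an arrival time was actually recorded
atScene = police.dropna(subset=['At Scene Time'])

# Convert to datetime and plot arrival times as fractional hours of the day
times = pd.to_datetime(atScene['At Scene Time'])
hours = times.dt.hour + times.dt.minute / 60.0
hours.hist(bins=115)  # 115 bins over 24 hours is roughly one per 12.5 minutes
plt.xlabel('Hour of arrival on scene')
plt.show()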

Time of Seattle Police’s Arrival on Scene

The X-axis represents hours, with each bin covering 12.5 minutes. You can see some clear dips around lunch and dinner time, as well as one at around 3:30-4:00 am. Pretty interesting. Is this from cops having to eat on the job, or do people commit less crime during meal times? Maybe the one in the dead of night is a midnight snack or a change in shifts. But because of the spike in calls right after the dips at around noon and seven pm, it seems like this might be police squad cars catching up on calls that came in during their break.

Time Series of Police Event Clearance Date

Plotting the dates of all incident responses, you can see a spike in cases in the summertime as well as some downward spikes during the winters. Because the police process these cases later than when they actually occurred, we might see some low spikes due to holidays or busy days. The highest spike on the graph was the day before the Super Bowl in 2014, which was won by the Seahawks. #Represent #2016Revenge

Case Descriptions and False Alarms

As I said before, Seattle as a whole is a relatively safe city. We might see the occasional drunkard and grimy bus assaulter, but other than that it’s not bad. In the data-set, we can grab the most-frequent values of the column labeled “Event Clearance Group”, which I shortened to “EvClearGroup” for convenience in Pandas.
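Grabbing those most-frequent values is a one-liner in Pandas (this assumes the 2014 slice has been read into police2014, the same frame used later in the post):

# Top event clearance groups in 2014, by number of cases
police2014 = pd.read_csv('police2014.csv')
police2014['EvClearGroup'].value_counts().head(10)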

List of crimes in 2014 in Seattle by frequency

Essentially these are the descriptions that the police classify the cases under after they have been cleared. False alarms are the sixth most frequent case type in Seattle in 2014, which can be a good or bad thing depending on how you look at it. They comprise about 5% of all incident response calls that come in. Let’s map all of the “False Alarms” cases according to their coordinates in the data-set.

Mapping of all False Alarm Incident Responses. Done in ggmap and ggplot2, overlaid with Stamen map tiles.

library(ggmap)

# Grab a toner-style Stamen basemap of Seattle and darken it slightly
Seattle = qmap("Seattle", zoom = 11,
    source = "stamen", maptype = "toner", darken = c(.3, "#BBBBBB"))

# "hard" holds the False Alarms subset with Longitude/Latitude columns
Seattle + geom_point(data = hard, aes(x = Longitude, y = Latitude),
    color = "dark green", alpha = .15, size = 2)

UPDATE: Check here for an updated crime density map based on population density.

Downtown would naturally have a higher concentration of incident response cases anyway, so this doesn’t do a good job of showing which neighborhoods have the most “False Alarms” calls. With some more data wrangling, we can see what percentage of each district’s cases are “False Alarms”. Each police district has a precinct, with a certain “Beat” patrolling each zone. We can normalize the values by finding the false alarm percentage in each police beat zone using Pandas.

police2014 = pd.read_csv('police2014.csv')

# Count false-alarm cases and all cases in each police beat zone
temp1 = police2014[police2014['EvClearGroup'] == "FALSE ALARMS"]["ZoneBeat"].value_counts()
temp2 = police2014["ZoneBeat"].value_counts()

# Join the two counts on the beat index and compute each beat's false-alarm share
false = pd.concat([temp2, temp1], axis=1, join='inner', keys=['all', 'falseAlarms'])
false['percent'] = false['falseAlarms'] / false['all']
false.sort_values('percent', ascending=False)

Police beats with the highest percentage of cases that are “False Alarms”
Map of the Seattle Police Beats and their locations

The index on the table above displays the order of police beat districts by number of cases. I re-ordered the list by the percentage of “False Alarm” cases in each beat, or general neighborhood. If you look at the map, the precincts do a pretty good job of dividing the areas into overall neighborhoods and zip codes.

L3 leads all of the groups with the highest percentage of “False Alarms”, with over 16% of its 911 calls being not too important. It is 34th out of 50 in number of general 911 calls into the city. The area in L3 is largely a portion of Sand Point, a rather nice neighborhood in Seattle, and Lake City, which is a shabbier place to live. Not dangerous, but more of a boring area with a lot of dreary homes.

For personal interest, I plotted the cases labeled “Car Prowl” in the University District/Fremont area. My own car was broken into on a street called Pasadena, as it’s a popular free parking street for criminals, with no real houses on the actual curb for any oversight. But nothing really jumps out on the map except for the area around 45th Street. Also, a really ratchet club called “Fusion” is right where those three red dots are to the right of I-5. It’s getting its liquor license pulled because of all the shootings.

Mapping of Car Prowls in the University District of Seattle by UW

Predicting “Higher” or “Lower” Crime Urgency

Besides just looking at “False Alarms” in the data-set, it might be more interesting to re-classify some values based on how urgent an incident response would be. How can we accomplish this? Taking another look at the data-set, there are multiple columns covering not only the event clearance grouping but also the initial type grouping. The “Initial Type Group” classifies the initial phone call into one of a set of categories similar to the “Event Clearance Group”; I believe this grouping is where the 911 operators would place the call description. After the incident has been recorded and cleared, the police officers assigned to the case place it under the correct clearance subject. Most of the time, the “Initial Type Group” and the “Event Clearance Group” are at the same urgency level. It becomes slightly more interesting when they aren’t.

I decided to manually classify all of the groups into six different urgency levels (a sketch of this mapping follows the list):

0 – False Alarms

1 – Fraud Calls, Liquor Violations, Prowler, Suspicious Circumstances, etc…

2 – Property Damage, Theft, Threats, Car Prowl, etc…

3 – Commercial and Residential Burglary, Road Rage, Sex Offense (No Rape), etc…

4 – Arrest, Assaults, Robbery, Drive By, Weapon Calls, etc..

5 – Homicide, Person Down, Casualties
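As a rough sketch of how such a hand-labeled mapping might look in Pandas (the group spellings below are illustrative, not the data-set’s exact values; “ClearUrgency” is a column name I am assuming for the cleared case’s level):

# Hand-labeled urgency levels; only a few groups shown here, and the
# spellings are illustrative. The full mapping covers every group value.
urgency = {
    'FALSE ALARMS': 0,
    'LIQUOR VIOLATIONS': 1,
    'SUSPICIOUS CIRCUMSTANCES': 1,
    'CAR PROWL': 2,
    'THREATS': 2,
    'RESIDENTIAL BURGLARIES': 3,
    'ASSAULTS': 4,
    'ROBBERY': 4,
    'HOMICIDE': 5,
}

# Urgency of the cleared case and of the initial call
SPscene['ClearUrgency'] = SPscene['EvClearGroup'].map(urgency)
SPscene['InitialUrgency'] = SPscene['InitialTypeGroup'].map(urgency)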

I mapped out all crime that had an urgency level of four and higher. You can compare it to the “False Alarm” map above and see that this one is far more concentrated in the downtown region. Many dots very visibly follow roads, such as the one down by the Rainier Beach neighborhood. Martin Luther King Jr. Way is literally a string of dark red dots, as is what seems to be Aurora Ave north of Green Lake.
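The subset behind this map is just a filter on the urgency column defined above (a sketch; the map itself was drawn in ggmap like the false-alarm one earlier):

# Serious cases only: urgency level four and higher, exported for mapping
serious = SPscene[SPscene['ClearUrgency'] >= 4]
serious[['Longitude', 'Latitude']].to_csv('serious.csv', index=False)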

Map of Serious Offenses with Urgency Levels of Four and Higher

Running each “Initial Type” against the “Event Clearance Type”, we get three classified values of “Higher”, “Lower”, and “Same” for each case. For example, if the “Initial Type” description is Assault but the “Event Clearance” turns out to be Liquor Violations, then we classify the case as Lower, since it ended at a lower urgency level than Assault. I plotted each subset of “Lower” and “Higher” on a map.
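A minimal sketch of that comparison, using the urgency columns assumed above:

import numpy as np

# Positive difference: the cleared case was more urgent than the call suggested
diff = SPscene['ClearUrgency'] - SPscene['InitialUrgency']
SPscene['UrgentLevel'] = np.where(diff > 0, 'Higher',
                         np.where(diff < 0, 'Lower', 'Same'))
# Rows missing either urgency stay NaN and get dropped before modeling
SPscene.loc[diff.isnull(), 'UrgentLevel'] = np.nan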

The left map displays crimes that ended at a higher urgency level than initially described; the right map displays those that ended at a lower urgency level than initially described.

In [114]: SPscene['UrgentLevel'].value_counts()
Out[114]:
Same      135036
Lower      21094
Higher     10215
dtype: int64

Because there are more than twice as many “Lower” urgency calls as “Higher” urgency calls, I had to re-sample the data down to equivalent amounts before mapping them to get a fair comparison. You can see the calls are a bit more spread out on the lower urgency map on the right.
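A sketch of that down-sampling step (drawing a “Lower” sample the same size as the “Higher” group):

# Down-sample 'Lower' so both maps draw the same number of points
nHigher = (SPscene['UrgentLevel'] == 'Higher').sum()
lowerSample = SPscene[SPscene['UrgentLevel'] == 'Lower'].sample(n=nHigher)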

Now, is it possible to actually predict whether the eventual urgency level of an incident response will be higher or lower than the initial phone call? Let’s try it with scikit-learn!

from sklearn.preprocessing import LabelEncoder

# Drop unclassified rows and encode the outcome labels as integers
SPml = SPscene.dropna(subset=['UrgentLevel'])
SPml = SPml.reset_index(drop=True)
outcome = SPml['UrgentLevel'].values
enc = LabelEncoder()
label_encoder = enc.fit(outcome)
outcome = label_encoder.transform(outcome) + 1

# One-hot encode each categorical feature column into dummy variables
features = SPml[['ZoneBeat', 'InitialUrgency', 'InitialTypeDesc', 'InitialTypeGroup', 'hour', 'month']]
features = pd.concat([pd.get_dummies(features[col]) for col in features], axis=1, keys=features.columns)

Because Pandas doesn’t convert categorical data over to numpy for scikit-learn very well, we have to encode the variables into binary labels of 0 or 1. Essentially, one-hot encoding turns the single “Zone Beat” column into roughly 50 columns with headers like “W1, W2, W3, K1, K2, K3, etc…”.

The features I am using to determine urgency level are the beat/neighborhood, the initial urgency level (0-5), the initial type description (more detailed than the group), the initial type group, and the hour and month of the “At Scene Time”. There are plenty of other factors that could be used with more data parsing, such as time of day, weekends, or prices of surrounding homes via the Zillow API.

What’s our benchmark for success? Since around 80% of the cases evaluate to “Same” in urgency level, we have to improve upon that 80% to be successful. Otherwise, anyone who just predicts “Same” for every case will be right 80% of the time.
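That baseline is easy to sanity-check directly:

# Share of cases whose urgency level did not change;
# a constant "Same" predictor scores exactly this
print((SPml['UrgentLevel'] == 'Same').mean())  # roughly 0.81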

In [236]: from sklearn.naive_bayes import MultinomialNB

In [237]: clf = MultinomialNB()

In [238]: pred = clf.fit(features, outcome).predict(features)

In [239]: len(outcome)
Out[239]: 166345

In [240]: summary(outcome, pred)
Number of Misclassified: 32539
Mean Outcome Value: 2.67500075145

In [241]: pd.crosstab(outcome, pred)
Out[241]:

Real Values   Higher   Lower    Same
Predictions
Higher          3491    1435    5289
Lower            357   18851    1886
Same            2508   21064  111464

3 rows × 3 columns

On the confusion matrix, the rows are the predicted values and the columns are the actual values. For example, the NB classifier predicted “Higher” correctly for 3,491 cases, predicted “Higher” when the case was actually “Lower” 1,435 times, and predicted “Higher” when the case was actually “Same” 5,289 times, which is more often than it predicted “Higher” correctly. The classifier was a lot more successful on “Lower” cases, predicting a much higher percentage of them correctly. Overall, it misclassified 32,539 out of 166,345 cases, which evaluates to around 80.44 percent correct. But since “Same” alone accounts for 81.2 percent of all cases, this doesn’t even beat the benchmark. Let’s try another classifier.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()

# Randomly assign ~90% of rows to training and the rest to testing
split = np.random.rand(len(SPml)) < .9
xtrain, ytrain = features[split], outcome[split]
xtest, ytest = features[~split], outcome[~split]

# Fit on the training set and evaluate on the held-out test set
fitTrain = rf.fit(xtrain, ytrain)
predTest = fitTrain.predict(xtest)
pd.crosstab(ytest, predTest, rownames=['actual'], colnames=['preds'])

Real Values   Higher   Lower    Same
Predictions
Higher           291     102     686
Lower             43    1388     745
Same             236     541   12684

3 rows × 3 columns

By randomly splitting the data-set into 90% for training and 10% for testing, we can add up the misclassified values. It turns out the forest misclassified 2,353 cases out of 16,716, for an accuracy of 85.92%. Awesome, we beat the benchmark by almost 6 percentage points! Random Forest works better. Surprisingly enough, it doesn’t even look that much better at first glance: “Lower” and “Higher” were both classified at lower rates than with the Naive Bayes method. However, it classified “Same” much more accurately, and since “Same” makes up the largest share of all cases, that makes for a better classifier overall.
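For reference, that accuracy falls straight out of the confusion matrix; a quick sketch of the arithmetic:

import numpy as np

# Correct predictions sit on the diagonal of the confusion matrix
confusion = pd.crosstab(ytest, predTest)
accuracy = np.diag(confusion).sum() / float(confusion.values.sum())
print(accuracy)  # about 0.8592 for the split above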

Trying to Find Corruption

I tried to do some analysis on Baltimore’s Police data-set as well, mostly to see if I could find any indication of the police juking the statistics by turning Aggravated Assaults into Common Assaults and Armed Robberies into Common Robberies. In the end, I couldn’t find anything substantial. When graphing both of the assault categories, I was looking for spikes in the difference between the two crimes, which would suggest officers were switching up the stats. I was also looking for spikes at the end of the month, or at consistent date intervals right before possible meetings. Yet the time series plot is extremely noisy, and there is no more correlation between these two crimes than there is between assaults, larcenies, or rapes. Juking the stats is probably also less prevalent since Baltimore disbanded CompStat in 2009.

Aggravated Assaults and Common Assaults in Baltimore over a period of 6 months. The noise is incredible.

I also looked into the Seattle Police data-set to specifically isolate the “Lower Urgency” labeled crimes. If there were some stats meddling, there would be spikes in the “Lower Urgency” crimes at specified intervals before meetings as well, as police would intentionally lower the urgency level in the event clearance description from the initial type description. Fortunately, I can’t find a pattern in these graphs either.

“Lower Urgency” Crime in Seattle in 2014. Spikes are during major holidays, with no clear pattern.
“Lower Urgency” Crime in Seattle from January 2014 to March 2014. Spikes are during weekends and huge events, i.e. the Super Bowl and MLK weekend.

Summary

Overall, I am pretty interested in how the Seattle police department approaches data analysis, given the depth of information that can be covered. If Socrata hasn’t done so already, it would be cool if there were a team providing data analysis solutions along with just opening the data to the public. On top of the Seattle police data-set, there seem to be hundreds more data-sets, graphs, and maps on the Socrata government pages, yet not much in the way of summarized information. The Socrata maps also take a long time to load, but check out my friend Daniel’s live-stream crime map plotter.

My classification of police case levels is also probably not the most official. I did it at my personal discretion, and if anyone wants to recreate the experiment with different levels and already-formatted data, check out my github for the data cleaning code, or email me and I can share the data-set.

But on top of this, hopefully more police departments will start filling in missing data. By dropping rows without an “At Scene Time”, the data-set shrank to only about a tenth of its actual size. The “Initial Type Group” and the “Event Clearance Group” are each filled in for only around 250 thousand of the rows as well. Considering that there are over a million rows across four years of data in the Seattle Police data-set, improvements in data completion could open up endless opportunities for better crime prevention and prediction.

Of course, I also can’t suggest that data science could possibly substitute for a 911 operator hearing the frantic panicking of a caller, or a prankster kid who’s obviously lying into the phone. The number of attributes that could go into an incident response data-set is actually quite large, but collecting them would probably also mean revealing more private information than is appropriate.

Email me at jayfeng1@uw.edu as well for any comments, interests, or feedback!

11 Comments


  1. Awesome post! I’m trying to do something similar with Chicago’s Socrata data. When I looked at your GitHub profile, I couldn’t find your Seattle scraping code (just Craigslist). Could you point me to the right repo?

  2. My only comment – is this really predicting ‘reported’ crime? Minor nuance, but hey..

  3. Cool stuff, very nicely presented! Bill Rawls would be happy to see the stats are not being juked! 🙂

    @Will, I haven’t looked at the Chicago data (is Socrata different than the City’s open data portal?) but was curious as I live in Chicago as well. Check out scrapy, it’s a great python package for scraping, I use it to grab ESPN sports data.

  4. this is so sick, Jay! i saw an article on this, i think you’re famous!

  5. Reblogged this on Seattle Property Investment and commented:
    Cool research!

  6. I appreciate you for doing this. As someone who’s lived in downtown Seattle for a decade (and been physically assaulted by vagrants twice), the relatively low crime in a neighborhood like Magnolia is interesting.

  7. Interesting, but it ends up being very close to a population density map. See, eg:

    http://www.seattle.gov/dpd/cityplanning/populationdemographics/geographicfilesmaps/2010census/default.htm

    and compare it to the crime maps. They’re almost identical.

    The differences would be interesting to discuss though. It looks like the primary difference I see is that more incidents occur along major arterial roads, such as Rainer Ave S. That seems worthy of further discussion and analysis. You could do a more formal analysis of the differences by charting crime per capita, among other things.

    On a lighter note, no discussion of this issue of heatmaps that look like population density would be complete without a reference to this Xkcd:

    http://xkcd.com/1138/

  8. Excellent analysis. It will work, provided that the key assumption is correct: but is the past really prologue?

  9. Thanks for the link Jay!

  10. @Brandon: Yeah, I’m just referring to the open data portal. Specifically, I’m using this: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2. I’m working on it as a term project for a machine learning course. You can see what we’re up to here: https://github.com/chi-learn/chi-learn . There’s not much to see yet (save some janky ipython notebooks where I’m playing with pandas), but if you check back at the end of April, we should have some more interesting work on display.
