-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathcrime_markdown.rmd
261 lines (240 loc) · 11.9 KB
/
crime_markdown.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
---
title: "Boston Crime Incident Reports (2012-2015)"
author: "Ke Li"
date: "02/15/2015"
output: html_document
---
## Data Pre-processing
The crime indicent reports data is downloaded from [City of Boston](https://data.cityofboston.gov/api/views/7cdf-6fgx/rows.csv?accessType=DOWNLOAD). It contains crime recorded from 2012 to now in the City of Boston. The size of the data is about 47 MB. The detailed description of the data variables can be seen at file [Crime_Indicent_Field_Explanation.xlsx](https://data.cityofboston.gov/api/assets/F443C87A-9722-4032-B0A3-E823BA469DF6?download=true). To make the data for suitable for this analysis, preprocessing is performed, including:
- Taking of unuseful variables;
- Expanding the geolocation into lat and lng variables
```{r, cache=T}
#url = 'https://data.cityofboston.gov/api/views/7cdf-6fgx/rows.csv?accessType=DOWNLOAD'
#download.file(url, './Crime_Incident_Reports.csv', method='curl')
#crime = read.csv('Crime_Incident_Reports.csv')
###Taking of unuseful variables;
#crime$COMPNOS =NULL
#crime$NatureCode =NULL
#crime$X =NULL
#crime$Y =NULL
#crime$REPORTINGAREA = NULL
####Expand the geolocation into lat and lng variables
#crime$Location = as.character(crime$Location)
#tmp = strsplit(gsub(".*\\((.*)\\).*", "\\1", crime$Location), ',')
#tmp = matrix(unlist(tmp), ncol = 2, byrow=T)
#crime$lat = as.numeric(tmp[, 1])
#crime$lng = as.numeric(tmp[, 2])
#crime$DOMESTIC = NULL
#crime$Location = NULL
```
- Replacing the date time with R-formatted date string;
- Converting the Year and Month variables from numeric to factor;
- Deriving the time of the day the crime took place from variable 'FROMDATE';
- Deriving whether the crime took place during the day or at night (7:00pm-7:00am);
- Cutting a day into four chuncks of interval with 6 hours each;
- Relevel the order of DAY_WEEK, following 'Sunday', 'Monday',... 'Saturday' sequence.
```{r,}
##library(lubridate)
#crime$DateFormatted = mdy_hms(crime$FROMDATE, tz='EST')
#crime$FROMDATE= NULL
#### convert to factor
#crime$Year = as.factor(crime$Year)
#crime$Month = as.factor(crime$Month)
#### Derive the hour of the day the crime takes place
#crime$hr = strftime(crime$DateFormatted, format='%H')
#### Derive whether the crime took place during the day or at night
#crime$AtNight=(strftime(crime$DateFormatted, format='%H:%M:%S') > "19:00:00")|
#(strftime(crime$DateFormatted, format='%H:%M:%S') < "07:00:00")
#### Cut a day into Four chuncks with 6 hour each.
#cutpoints = c(0, 6, 12, 18, 24)
#crime$timeDec = cut(as.numeric(crime$hr), cutpoints)
#### relevel days of a week
#print levels(crime$DAY_WEEK)
#crime$DAY_WEEK = factor(crime$DAY_WEEK, levels(crime$DAY_WEEK)[c(4, 2, 6, 7, 5, 1, 3)])
```
- Adjusting the letter case confusion (ex, 'Three' and 'THREE' are same thing.);
```{r}
## UCRPART was found to have upper/lower case confusion
#crime$UCRPART = as.factor(toupper(crime$UCRPART))
```
- Converting the reported districted code for Boston Police Department into readable Boston district names;
```{r}
## check the safety at different district
## replace district code with names
#distrName = c(
#A1 = 'Downtown',
#A15= 'Charlestown',
#A7= 'East Boston',
#B2= 'Roxbury',
#B3= 'Mattapan',
#C6= 'South Boston',
#C11= 'Dorchester',
#D4= 'South End',
#D14= 'Brighton',
#E5= 'West Roxbury',
#E13= 'Jamaica Plain',
#E18= 'Hyde Park',
#HTU= 'Human Traffic Unit'
#)
#crime$ReptDistrName = as.factor(distrName[as.character(crime$REPTDISTRICT)])
#crime$REPTDISTRICT = NULL
```
## Variables and Data Structure
First of all, the data has 269849 observations and 18 variables.
```{r}
str(crime)
```
The variables given in the data are shown in the following:
```{r}
names(crime)
```
where,
- "INCIDENT_TYPE_DESCRIPTION": Boston Police Department classification;
- "MAIN_CRIMECODE": BRIC classification of the crime code for analysis purposes;
- "UCRPART": Universal Crime Reporting Part number (1,2, 3);
- "WEAPONTYPE":Detailed info on the weapon type (free form field);
- "Shooting": Whether police shooting is involved (??);
- "SHIFT": What shift (Day, First, Last) the incident took place on;
- "timeDec": the time interval of every 6 hours of a day, the cutting points are (0, 6, 12, 18, 24) as the hours of a day;
- "Year", "Month", "hr", and "DateFormatted": the time variables when the crime took place; and,
- "lat" and "lng" are latitude and longitude, where the crime record took place.
More information is recorded in the file 'Crime_Incident_Field_Explanation.xlsx'.
## Analysis
Couple of exploratory analytical topics are explored to investigate the crime rate distribution in districts, time of day, over the years.
### Geolocations
Latitude and Longitude are used to plot each record in the data. This roughly depicts the map of Boston, if you are familiar with Boston area.
```{r, fig.width=14, fig.height=10}
library(ggplot2)
# geographically, where are the different districts?
qplot(lng, lat, data=crime, color=ReptDistrName, geom='point')+
theme_bw(base_size=20)+
geom_point(size = 5)
#dev.copy(png, file='distrLoc.png', height=1100, width= 1100)
#dev.off()
```
To make it clearer. The crime locations depict each district as:
```{r, fig.width=7, fig.width=15}
library(lattice)
xyplot(lat~lng|ReptDistrName, data=crime)
#dev.copy(png, file='distrLoc_1.png', height=1100, width= 1100)
#dev.off()
```
With each district, Roxbury, South End, and Dorchester have the highest crime rates over all. Charlestown, West Roxbury, and East Boston are the safest area. East Boston contains Logan airport.
```{r}
histogram(~ReptDistrName, data=crime, scales=list(x=list(rot=45)))
#dev.copy(png, file='hist_districts.png')
#dev.off()
```
### Streets of Crime
What is the street that has the highest crime rate? To answer this question, a word cloud is used to showcase the distribution of crime rate on each street. Colors in the word cloud is randomly distributed, doesn't bear any analytical meaning. Washington street has the highest crime rate. Boylston St, Blue Hill Ave, and Mass Ave etc are listed to have high crime rate too. Boylston street is the street where the Boston Marathon Bombings took place on April 15 2013. When looking at map of Boston, Washington St is the longest street in Boston, meandering from Downtown Boston, extending southwestward to Massachussetts-Rhode Island border. It may be quite unfair to address that Washington Street is the most dangerous street in Boston, considering its length.
```{r}
# collect streets names, find a display of streets with most crime
street = data.frame(table(crime$STREETNAME))
colnames(street)=c('Name', 'Freq')
street$Name = as.character(street$Name)
# remove street names that are blank
street = street[-which(nchar(street$Name)==0), ]
# make a word cloud of street with highest crime rate since 2012
library(wordcloud)
wordcloud(street$Name, street$Freq, min.freq = 200, random.order=F,
random.color=T, colors=c('black', 'red', 'steelblue'))
#dev.copy(png, file='wordcloud_crime.png')
#dev.off()
```
Therefore, two more analyses are made to followup with the word cloud observation: Boston Marathon Bombings and Street of Crime in each district.
### Boston Marathon Bombings
Location, BOYLSTON ST; Time, April 15 2013, around 14:00-15:00 EST. Most of the records during Boston Marathon Bombings appear to be "Property Lost" and "Property Found", as in any other big events. Thus, records with "Property Lost" or "Property Found" filtered are listed. The incident type descriptions are "aggravated assault", "medical assistance", and "death investigation", which reflect the bombing scene to some extent.
```{r}
options(width = 2000)
# find out Boston Bombing records
bostonBomb = crime[which((crime$DateFormatted<"2013-04-15 16:00:00 EST"
&crime$DateFormatted>"2013-04-15 10:00:00 EST")
&crime$INCIDENT_TYPE_DESCRIPTION != 'PropLost'
&crime$INCIDENT_TYPE_DESCRIPTION != 'PropFound'
&crime$STREETNAME=='BOYLSTON ST'),]
print(bostonBomb[ , c('INCIDENT_TYPE_DESCRIPTION', 'STREETNAME', 'ReptDistrName', 'DateFormatted')])
```
###Street of Crime of Each District
In each district, the streets with top 3 crime rate is listed out.
```{r, fig.height = 20}
# each district, what is the most dangerous street, top3? How dangeous are they?
DataStreetDistr = data.frame(with(crime, table(STREETNAME, ReptDistrName)))
topstreet = data.frame()
for (distrName in unique(DataStreetDistr$ReptDistrName)) {
subDataStreetDistr = subset(DataStreetDistr, ReptDistrName==distrName)
tmp = subDataStreetDistr[with(subDataStreetDistr, order(Freq, decreasing=T))[1:3], ]
topstreet = rbind(topstreet, tmp)
}
topstreet$STREETNAME = as.factor(topstreet$STREETNAME)
topstreet$Rank = rep(1:3, 13)
library(ggplot2)
ggplot(topstreet, aes(x= Rank, y = Freq, label=STREETNAME))+
ylim(0, 5000)+
theme_bw(base_size=20)+
geom_bar(stat='identity')+
geom_text(size=4, hjust=0, vjust=0)+
facet_wrap(~ReptDistrName, nrow=13)+
coord_flip()
#dev.copy(png, file='top3streets.png', height=1100, width= 600)
#dev.off()
```
### Crime rate over a week
The histogram shows that Sunday has the lowest crime rate since 2012, and Friday has the highest. This observation fits well with the common sense of Friay being the day of partying and Sunday being the hang-over day or church day. The meancrime rate for each day is roughly 38549.9 since 2012, with standard deviation of 2394.2. This shows that on average there are roughly 50.7 crimes happening each day in Boston with 6.3 crimes more or less between Fridays and Sundays.
```{r}
# check days of a week crime rate
histogram(~DAY_WEEK, data=crime)
# Average crime rate of a day is
mean(table(crime$DAY_WEEK))
sd(table(crime$DAY_WEEK))
```
### Weapon Distribution
Throughout a day, the weapon usage for crime is plotted. Over all four panels, the crime rate drop to minimum at 7:00am of a day and reach the highest around mid night.
Interestingly, Firearms and knife appear to stay high in level or even increase in the late night of a day. Unarmed and Other drop in crime rate after sunset. It raises a good hypotheses behind the criminal behavior, choosing weapons to act crime. It is hypothesized that early before dawn, people tend to be asleep, while after sunset, people tend to attend bars and get drunk, and behave outrageously.
```{r}
# weapontype distribution across a day24 hours.
library(lattice)
crime$hr = as.factor(crime$hr)
histogram(~hr|WEAPONTYPE, data=crime)
#dev.copy(png, file='hist_hr_weapontype.png')
#dev.off()
```
### Police Shooting
A histogram of Police Shooting is plotted in each town in Boston. Roxbury, Dorchester, and Mattapan rank the highest. Over the years since 2012, Roxbury and Dorchester, Mattapan stay high in police shooting, while in 2015, Jamaica Pain catches up.
```{r}
with(subset(crime, Shooting=='Yes'), table(ReptDistrName))
#
histogram(~ReptDistrName, data=subset(crime, Shooting=='Yes'),scales=list(x=list(rot=45)))
#dev.copy(png, file='hist_district_shooting.png')
#dev.off()
#
#plot the shooting cases histograms in districts from 2012 to 2015
histogram(~ReptDistrName|Year, data=subset(crime, Shooting=='Yes'),scales=list(x=list(rot=45)))
#dev.copy(png, file='hist_years_shooting.png')
#dev.off()
```
### Crime Types
```{r}
sort(with(crime, table(INCIDENT_TYPE_DESCRIPTION)), decreasing=T)
```
```{r, fig.height = 20, warning=F}
# each district, what is the top 3 crime types
DataCrimeDistr = data.frame(with(crime, table(INCIDENT_TYPE_DESCRIPTION, ReptDistrName)))
topcrime = data.frame()
for (distrName in unique(DataCrimeDistr$ReptDistrName)) {
subDataCrimeDistr = subset(DataCrimeDistr, ReptDistrName==distrName)
tmp = subDataCrimeDistr[with(subDataCrimeDistr, order(Freq, decreasing=T))[1:3], ]
topcrime = rbind(topcrime, tmp)
}
topcrime$INCIDENT_TYPE_DESCRIPTION = as.factor(topcrime$INCIDENT_TYPE_DESCRIPTION)
topcrime$Rank = rep(1:3, 13)
library(ggplot2)
ggplot(topcrime, aes(x= Rank, y = Freq, label=INCIDENT_TYPE_DESCRIPTION))+
ylim(0, 5000)+
theme_bw(base_size=20)+
geom_bar(stat='identity')+
geom_text(size=4, hjust=0, vjust=0)+
facet_wrap(~ReptDistrName, nrow=13)+
coord_flip()
#dev.copy(png, file='top3crimes.png', height=1100, width= 600)
#dev.off()
```