The Understanding of Criminal Rate and Crime Prediction (2022-US-30MP-1165)

Grant Lackey, Student, Oklahoma State University
Karanveer Singh, Student, Oklahoma State University

 

The crime rate in the United States is increasing every year, and this is something that needs to be addressed. A key objective of this paper is to identify factors that statistically impact the crime rate in each state and to leverage that information in order to reduce crime rates. For our analysis, we use two datasets: U.S. Census data and Uniform Crime Reporting Program data collected by the Federal Bureau of Investigation in 2014. We were able to get crime statistics for all 50 US states along with a detailed breakdown of crimes and input variables such as income and literacy to test their impact. Additionally, we intend to identify correlations between the types of crimes so that we can understand the core issues and identify crimes that may influence others. As a result, the allocation of resources and optimization of results for crime reduction would be improved. The findings of this paper will help us to understand the various socio-economic and locational factors that influence crime and, possibly, break certain stereotypes. This could be valuable to government bodies in constructing rules to combat crime in their areas. We found that weapon and drug-based crimes had a high correlation with the other crimes. After testing the various factors in determining the crime rate in any given state, the top three were weapons owned, literacy rate, and the percentage of people who follow a religion.

 

 

Good  afternoon.

Today  we're  going  to  be  talking  about the  understanding

of  crime  rate  and  crime  prediction.

Before we go into the data, let me introduce the team.

We  have  Karanveer, our data  scientist   and  data  modeling  expert,

and  myself Grant Lackey  as  a data  researcher  and

data  visualization  specialist.

So  before  we  get  into  the  data, let's  do  an  overview

of  the  entire  presentation.

We're  going  to  begin  with  the  background

which  will  be  the  initial  data  sets and  why  we  chose  our  data.

The  data  overview, which  again  is  going  into  the  reason

why  we  chose  our  data  and  what we're  going  to  be  trying  to  answer.

The  business  problems, which  is  the  problems  that

we  had  with  our  data.

Why  we're  trying  to  answer  certain

questions  and  the  overall idea  of  the  entire  project.

Next  is  our  methods  and  plans, which  is  our  procedure  of  answering

our  business  problem and  then  our  results,

which are the results of our methods and plans.

Our applications, which are real-life implications

from  our  results,  and   post-analysis,

which  is  what  we  could  include or  add  on  to  our  results.

What we could add on to improve upon this in years to come.

Beginning  with  background, why  should  we  care  about  crime  rate?

Well, crime is just important to everyone, and it's everywhere in the United States,

and  so  what  is  crime  rate and  how  can  we  define  it?

How  we  define  crime  rate  is  our  initial criminal  activity,

divided  by  the  population  density per  county  or  per  state.

We're  mainly  going  to  be looking  at  per  state.

So  how  are  we  going  to  identify  factors

which  can  reduce  crime  rates throughout  our  entire  project.

Here  we're  going  to  be  speaking about  how  certain  crime

is  going  to  be  more  influential in  certain  states  than  others,

and  do  certain  crimes  influence other  crimes?

So  for  example,

if  there  was  a  murder  crime, would  guns  or  theft  be  more

influential  in  that  murder  crime  or  would other  crimes  be  influential  in  that?

So  looking  at  our  data  overview.

We  started  off  with  our  initial  data  set which  is  our  crime  statistics  data,

and  we  added  other  data  variables  later on  throughout  this  initial  data  set.

Beginning  with  our  initial  data  set, we  started  with  2014  data,

and this  initial  data  set  was  given  to  us  from

the Federal Bureau of Investigation: the FBI.

We looked at 42 criminal activities, which range widely

from murder and theft to drug possession and drug activities.

We  looked  at  about  3,200  counties, and  within  all  these  counties

or  within  these  states would  be  all  those  counties.

We  had  to  look  at  48  states.

We  had  to  exclude  Florida  and  Illinois,

because  Florida  and  Illinois  did  not provide  data  to  the  FBI

for  the  criminal  activities.

If  you  look  at  future  2018  data or  past  2012  data,  it's  the  same  issue.

They just don't provide data to the FBI, it seems.

With  all  of  this  for  2014  data, there's  180,000  data  points

talking  about  the  FIPS  codes.

This  is  how  we  identified  certain criminal  activity  in  certain  counties.

For  example,  we  have  our  state  codes, which  would  be 01  for  Alabama.

These  states  are  represented alphabetically.

Alabama  would  be  the  first  one, and  then  each  county  within  that  state

would  have  numbers  to  them.

For  example,  Baldwin  would  be  003.

If  you  looked  at  Baldwin,  Alabama,

it'd  be  01003,  so  on  and  so  forth for  every  county  detailed  in  the  state.
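
The FIPS scheme described above can be sketched in a few lines of Python. This is only an illustration, not code from the project: the function names `make_fips` and `split_fips` are hypothetical, and the only codes used are the ones mentioned in the talk (01 for Alabama, 003 for Baldwin).

```python
from typing import Tuple


def make_fips(state_code: int, county_code: int) -> str:
    """Join a state code and a county code into a 5-digit county FIPS string."""
    if not 0 < state_code < 100:
        raise ValueError("state code must have at most two digits")
    if not 0 <= county_code < 1000:
        raise ValueError("county code must have at most three digits")
    # Zero-pad each part: two digits for the state, three for the county.
    return f"{state_code:02d}{county_code:03d}"


def split_fips(fips: str) -> Tuple[str, str]:
    """Split a 5-digit county FIPS string into (state, county) parts."""
    if len(fips) != 5 or not fips.isdigit():
        raise ValueError("expected a 5-digit FIPS string")
    return fips[:2], fips[2:]


print(make_fips(1, 3))       # Baldwin, Alabama -> 01003
print(split_fips("01003"))   # -> ('01', '003')
```

Keeping the code as a string matters here, because reading it as a number would drop the leading zero.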

Looking  at  our  extra  variables, we  looked  at  the census  data.

Census  data  is  always  great  for  checking out  the  age,  population,

income  per  county  or  per  state,

and  we  had  to  look  at  other  data  sets  like

gender,  immigration,  religion,  marriage, unemployment  and  literacy  rates.

These  other  data  sets  looked  more or  so  at  the  statewide  rates,

and  this  isn't  really  related to  criminal  activity,

but  we  wanted  to  involve  it

within  our  initial  data  set  to  see  if there's  any  correlation  with  them.

Going  into  our  business  problem,

we  want  to  answer  what  states in  the  United  States  specifically

have  the  highest  and  lowest  crime rates  and  why  is  that  so?

To  answer  our  business  problem,

we  have  to  answer  these  business questions  going  into  that.

How  can  we  identify  variables that  influence  crime?

Which  are  the  most  important  factors?

Are  there  crimes  that influence  other  crimes?

I'm  going  to  hand  it  off  to  Karanveer to  talk  about  plans and  methods.

Thank  you,  Grant.

Our  approach  to  solve  this business  problem,

was  to  come  up with  a  regression  model.

We  have  used  JMP  to  make  it.

First,  as  Grant  mentioned,

we  have  connected  the  various  databases, that  is  the  crime  data  set,

along  with  those  extra  variables such  as  religion,  income,  etc.

We have made sure that the data looks clean.

After that, we have run our regression model,

which  is  able  to  predict the  crime  rate  for  us.

With  this,  we  are  able  to  know the  various  variables

and  their  importance  in determining  this  crime  rate,

and  we  are  able  to  list them  by  their  importance.

At the end we'll also be showing you visualizations based on it.

As  Grant  mentioned, we  had  42  criminal  activity  variables.

Some  of  these  variables  were  very  small,

such  as  drug  possession, drug  consumption,  drug  sales.

In  that  case,  we  have  simply  grouped them  to  make  sure  that

we can come to a conclusion on that

since  the  data  was  otherwise too  small  for  the  subgroups.
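
The grouping step described above might look roughly like the following Python sketch. The talk only names the three drug categories, so the category labels, the `GROUPS` mapping, the `group_counts` helper, and the numbers are all assumptions made for illustration; they are not taken from the FBI data or from the JMP workflow.

```python
from collections import defaultdict

# Hypothetical mapping from detailed categories to a broader group.
# Only the three drug categories are named in the talk.
GROUPS = {
    "drug_possession": "drug",
    "drug_consumption": "drug",
    "drug_sales": "drug",
}


def group_counts(counts):
    """Sum arrest counts by group; unmapped categories keep their own name."""
    grouped = defaultdict(int)
    for category, n in counts.items():
        grouped[GROUPS.get(category, category)] += n
    return dict(grouped)


# Placeholder numbers, not values from the FBI data.
example = {
    "drug_possession": 12,
    "drug_consumption": 3,
    "drug_sales": 5,
    "robbery": 7,
}
print(group_counts(example))  # -> {'drug': 20, 'robbery': 7}
```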

We'll be looking at them state-wise

as  we  didn't  have  the  extra variables  on  a  county  basis.

But  I  feel  that  this  is  great for  starting  this  project.

Our  target  variable would  be  the  crime  rate.

We have defined the crime rate as the number of arrests

in  that  certain  population.
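
As a rough sketch of this target variable, the rate can be computed as arrests divided by population and scaled to a fixed number of people. The scale of 100,000 is an assumption here (it is the unit the FBI uses for its published rates); the talk does not state the scale, and the function name `arrest_rate` and the example numbers are hypothetical.

```python
def arrest_rate(arrests: int, population: int, per: int = 100_000) -> float:
    """Arrests per `per` people (assumed scale: 100,000)."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply before dividing to limit floating-point rounding.
    return arrests * per / population


# Placeholder numbers: 500 arrests in a population of 250,000.
print(arrest_rate(500, 250_000))  # -> 200.0
```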

Now,  coming  down  to  the  variables that  we  are  using.

Most of these variables have been normalized and we have used

a percentage for them, such as immigration. For gender,

we will be using two types, that is, male and female,

and then religion, unemployment, marriage, literacy.

Most  of  these  are  normalized  so  that

we  don't  have  an  analysis which  could  be  misleading.

Coming  down  to  the  final  equation of  our  regression  model.

This  is  the  equation  of  a  model.

We  have  rounded  off  the  samples,

and  as  we  can  see  there  are  a  lot  of  variables

that  have  a  positive  influence, as  in,  that  they  increase  the  crime  rate,

and  there  are  certain  variables which  have  a  negative  sign  with  them.

They  basically  decrease  the  crime  rate.

Using  this  we  can  see  how  we  can  define a  crime  rate  in  any  state  or  county.

Coming  down  to  the  results.

The  finding  number  one.

We  really  wanted  to  see  which  states have  the  highest  crime  rate.

These  are  the  following  five  states.

Tennessee,  Wyoming,  Mississippi, Wisconsin,  New  Mexico.

Then  we  have  the  following  states with  the  lowest  crime  rate,

that  are  New  York,  Alabama,  Vermont, Massachusetts  and  Michigan.

Here is a visualization explaining how the crime rate

varies across the United States.

As  we  see,  there  is  no  certain pattern  and  it's  all  over  the  place.

Finding  number  two.

Using  JMP  and  doing  a   log [inaudible 00:07:57]

on  the  variables,  we  could  basically  see which  variables  have  more  importance.

The number one was weapons owned, followed by literacy rate,

then  religion  percentage,  immigration, population  density,

and  the  unemployment  rate.

I  think  this  is  a  great  finding, while  any  government  body

or  any  organization  wants to  allocate  resources

whenever  they  are  trying  to  reduce the  crime  rate  or  trying  to  analyze  it.

The  finding  number  three  is something  really  interesting.

Our  goal  was  to  see  whether there  are  certain  crimes

which  could  help  us  solve not  just  that  crime,

but  maybe  other  crimes  as  well.

Which  these  crimes  are trying  to  influence.

Drug  and  weapon  was  one  of  them.

We  could  see  drug  and  weapons have  a  very  high  correlation

with  say,  theft,  robbery,  murder.

Using a chi-square test, we saw that the correlation is very high.

So  in  case  any  organization would  want  to  focus  on  and  start  with,

I  think  drug  and  weapon is  a  great  category  where

they can focus on reducing crime rate in any state or county.

This  is  the  following  map  showing

the  religion  rate,  weapons owned, and  literacy  rate,

and the variation across the United States.

If  we  put  it  with  the  crime  rate, we  can  see  a  certain  pattern

which  is  actually  explained by  our  regression  model.

Now  coming  down  to  the  implications,

how we can apply our analysis to a real-world solution.

Like  the  data  set  we  have  used, and  we  have  connected  to  variables,

we  would  definitely  want  to  work  with governments,  towns  and  communities

because  crime  is  a  universal  problem

and  this  is  something everybody  wants  to  reduce.

The resource allocation can be done according to this,

and  further,  this  would result  in  a  decrease  in  crime  rate

and  increase in  happiness in  the  community.

Post-analysis.

There  are  a  lot  of  things  that  we  would want  to  include  in  our  project,

and  this  is  a  great  future  scope  as  well.

First  thing,  we  would  want  to  include more  variables

such  as  weather,  ethnicity and  the  list  goes  on.

We  could  definitely  even  listen to  the  government  bodies  and  take  inputs

for  these  variables  from  them.

County  detailed  or  at  least  city  detail.

I feel it's great to start with state-wise data,

but  we  would  definitely  want  to  focus on  a  more  detailed  level  of  analysis,

so  that  we  can  use  these conclusions  to  the  real  world

more  clearly,  more  precisely and  we  would  have  a  better  impact  as  well.

The  data  time  frame.

Right  now  we  have  used the  data  from  the  year  2014.

I  feel  this  is  an  eight  year  old  data  set.

We would definitely want to use a more recent data set,

and  something  that  is spanning  over  a  couple  of  years,

so  that  it  gives  us  clarity.

Since  COVID  has  impacted  us in  a  lot  of  ways,

and  it  has  changed  how  basically lives  are  working  around  us,

and  so  has  crime  rate  and  the  way crime  happened  has  been  changed.

We would definitely want to focus post-COVID,

and over the last two to three years, for a post-analysis.

That's  all  and  thank  you.

Comments
calking

Hey @GrantLackey!

 

Interesting talk here! I'd be interested in hearing more about the modeling process, particularly in how you performed the regression. Since the response is a rate, I assumed you used a log-linear regression model? It wasn't clear from the slides. 

 

Also, while the topic is certainly an interesting one, it's also a very complex one. I've played around with crime rate data before (see here) and even I didn't fully realize all of the nuances involved. So I hope you don't mind a few suggestions for your next go round:

  • Arrest rate and crime rate are not one and the same I'm afraid. First, an arrest does not imply guilt; that is determined in a court trial. A person arrested for committing a crime may end up being released after trial. Furthermore, a person may be arrested and charged with more than one crime. A crime rate should measure the rate at which criminal activities occur per unit population (the FBI uses a unit of 100,000 people). The FBI typically measures two rates: violent crime rate and property crime rate. Both are available in the FBI crime reports, freely available online. 
  • Another important yet often overlooked fact regarding crime rates is that they are in essence lower bound estimates. Not every location sends the FBI their crime reports (as you noted for some of the states in your data) and even for those that do report, they can only give counts of reported crimes. It's safe to say that not every crime gets reported, so these are actually lower bound estimates. You may not be able to directly account for this in the model, but it's good to at least note it. 
  • I noticed you had both Male and Female terms in your model. Based on the data you provided, these are the only two possible categories. In that case, you should only have one or the other in your model. Having both in there means that two of your input variables are completely correlated (you can easily compute one from the other), which will cause problems in your model. Plus, there's the practical interpretation. Note that both terms have a positive coefficient, which means just having a gender increases your crime rate. A bit concerning, no?
  • Speaking of coefficients, I see that they are quite large in this model; on the order of several orders of magnitude. My guess would be that this is because you're using the raw variables, most of which are percentages. In modeling, it's good practice to transform/code variables so that everything is on the same scale (e.g. log-transform positive variables, logit-transform percentages, etc.). This will lead to better-behaved coefficients at the minor cost of some interpretability (i.e. your coefficients now express the change in terms of the coded/transformed variable rather than the variable itself). 
  • Some of your findings seem inconsistent with known relations in crime analysis. For example, your model indicates high unemployment leads to low crime rates, when in fact the opposite has been shown to be the case in multiple studies. Is there a reason you think this might be the case?
  • Since your model is intended to be predictive, did you compare your predicted crime rates to the actual crime rates (or arrest rates) in 2014? How about 2015? A good practice in modeling (and predictive modeling especially) is to compare your predictions not only to the data itself, but also to data not used in the modeling. It's a good way to make sure you're not overfitting. 
  • One last thing. Predicting crime is an especially tricky topic. People's lives are at stake. And while you're not necessarily trying to predict individual crimes, even trying to determine the factors that lead to higher crime rates can have unforeseen impacts. I'd highly recommend you check out some recent videos from the YouTube channel VSauce2 (see here and here). While he focuses on models that try to predict locations of criminal activity, I still think it does an excellent job showing just how complex crime prediction in general can be. It's important to always understand the context in which your data analysis occurs. What on one level seems like a straightforward analysis can have far-reaching impacts. 
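
Two of the points above, the Male/Female collinearity and the logit transform for percentages, can be illustrated with a short Python sketch. This is not the model from the talk: the helper names `pearson` and `logit` are hypothetical and the percentages are placeholder values, chosen only to show that a Female percentage computed as 100 minus Male carries no extra information.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def logit(percent):
    """Logit of a percentage strictly between 0 and 100."""
    p = percent / 100
    if not 0 < p < 1:
        raise ValueError("percentage must be strictly between 0 and 100")
    return math.log(p / (1 - p))


# Placeholder percentages, not values from the census data.
male = [48.5, 49.2, 50.1, 49.7]
female = [100 - m for m in male]

# With only two categories, the two columns are perfectly (negatively) correlated.
print(round(pearson(male, female), 6))  # -> -1.0

# The logit maps 50% to 0 and spreads out values near 0% and 100%.
print(logit(50.0))  # -> 0.0
```

Dropping one of the two gender columns before fitting removes the redundancy without losing any information.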

Ok, so that was more than a few I admit. I do find this topic very interesting though, which is why I feel compelled to share so much with you. If you're online next week for Discovery, I'd love to chat more! Thanks again for sharing!