Car Accident Severity

Applied Data Science Capstone Project

Rohan Lewis

2021.02.11

I. Introduction

The dataset provided is a collaborative effort by the Seattle Police Department and the Seattle Department of Transportation. It contains approximately 195,000 collisions of various types from January 2004 to May 2020.

For a full description of the variables, see the ArcGIS MetaData.

II. Data

1. Packages

2. Raw Data
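A minimal loading sketch, assuming the records were exported to a CSV named Data-Collisions.csv (the actual file name and path may differ):

```python
import pandas as pd

# Load the raw collision records provided by SPD/SDOT
# (the file name here is an assumption; adjust to the actual export).
df = pd.read_csv("Data-Collisions.csv", low_memory=False)
print(df.shape)
df.head()
```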

3. Date and Time

Separate columns for the date and time were created from the timestamp.
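A sketch of this split, assuming the raw timestamp column keeps its original name INCDTTM:

```python
# Parse the timestamp and split it into separate date and time columns.
df["INCDTTM"] = pd.to_datetime(df["INCDTTM"])
df["DATE"] = df["INCDTTM"].dt.date
df["TIME"] = df["INCDTTM"].dt.time
```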

4. Predictors

In order to predict accident severity, I eliminated variables that are a result of the accident and kept variables related to the setting.

These are location (neighborhood and address type), date and time, light conditions, road conditions, and weather conditions.

5. Remove NAs and Unknowns

i. Count

ii. Drop 'NA' Values

iii. Drop 'Unknown' Values
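A minimal sketch of these three steps, assuming the condition variables keep their original column names (LIGHTCOND, ROADCOND, WEATHER):

```python
# Count missing values, then drop rows with NAs or an 'Unknown' label
# in any of the condition variables.
condition_cols = ["LIGHTCOND", "ROADCOND", "WEATHER"]

print(df[condition_cols].isna().sum())

df = df.dropna(subset=condition_cols)
df = df[~df[condition_cols].isin(["Unknown"]).any(axis=1)]
```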

6. Data Cleaning Summary

III. Geolocate

Find the neighborhoods from the coordinates. Note that all collisions with coordinates were included.
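One way to do this, assuming a polygon layer of Seattle neighborhoods is available as a GeoJSON file (the file and column names here are hypothetical):

```python
import geopandas as gpd

# Build point geometries from the collision coordinates (X = longitude,
# Y = latitude) and spatially join them to the neighborhood polygons.
neighborhoods = gpd.read_file("seattle_neighborhoods.geojson")
points = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df["X"], df["Y"]), crs=neighborhoods.crs
)
joined = gpd.sjoin(points, neighborhoods, how="left", predicate="within")
```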

1. Functions

2. Accidents per Neighborhood

3. Choropleth
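A folium sketch of the choropleth, reusing the hypothetical neighborhood file and assuming a NEIGHBORHOOD name column from the join above:

```python
import folium

# Accident counts per neighborhood, shaded on a map of Seattle.
counts = joined.groupby("NEIGHBORHOOD").size().reset_index(name="ACCIDENTS")

m = folium.Map(location=[47.61, -122.33], zoom_start=11)
folium.Choropleth(
    geo_data="seattle_neighborhoods.geojson",
    data=counts,
    columns=["NEIGHBORHOOD", "ACCIDENTS"],
    key_on="feature.properties.NEIGHBORHOOD",
    fill_color="YlOrRd",
).add_to(m)
m
```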

IV. Exploratory Data Analysis

1. Functions

2. Severity Type

Note that more than twice as many accidents result in property damage as in injury.

3. Date

i. Dates with the greatest number of accidents

ii. Dates with the fewest number of accidents

There seem to be slightly more accidents from 2005-2008 and slightly fewer from 2011-2013, but overall the variance seems to be within reason.

iii. Day of Week

I explored whether the Day of Week seemed to affect the Severity Code Ratio.

While there are definitely more accidents on Friday (perhaps stress from the work week, or more and longer travel heading into the weekend) and fewer on Sunday (perhaps more relaxing at home), the ratio of Property Damage to Injury by day reflects the overall 2:1 ratio.
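A sketch of how this ratio can be tabulated, assuming severity codes 1 (Property Damage) and 2 (Injury) as in the source data:

```python
# Accident counts and the Property Damage : Injury ratio by day of week.
df["DAY_OF_WEEK"] = pd.to_datetime(df["INCDTTM"]).dt.day_name()
by_day = df.groupby("DAY_OF_WEEK")["SEVERITYCODE"].value_counts().unstack()
by_day["RATIO"] = by_day[1] / by_day[2]
print(by_day)
```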

4. Time

i. Times on the hour

ii. Times on the half hour

iii. Times on the quarter hour

iv. Times every ten minutes

v. Times every five minutes

vi. Times with the fewest number of accidents

vii. 00:01:00!

It is clear that the majority of accidents happen in the business, lunch, and daylight hours of 8:00 - 18:00.

There seems to be a disproportionate number of accidents reported at round clock times; in decreasing order of frequency, these fall on the hour, on the half hour, on the quarter hour, every ten minutes, and every five minutes.

Lastly, there was a disproportionate spike at 00:01:00, which suggests it was not uncommon for accidents to be deliberately documented as occurring on the following day. I found this extremely fascinating; it may happen for bureaucratic reasons, quotas, night shifts, etc. Upon further investigation, I found that these times did not coincide with specific dates such as the 1st of a month.
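A quick check of the over-represented times, using the TIME column created earlier:

```python
# Most common reported clock times; the spike at 00:01:00 stands out,
# alongside the round times on the hour, half hour, and quarter hour.
print(df["TIME"].value_counts().head(20))
```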

5. Address Type

The majority of property damage occurs on blocks, but injury occurs about equally between blocks and intersections.

Intersections are usually larger and have more road space and fewer businesses and buildings compared to blocks.

This variable is the only one where the ratio of Property Damage to Injury did not reflect the overall 2:1 ratio.

6. Light Conditions

The vast majority of accidents occur under Daylight or Dark - Street Lights On conditions. The values of Property Damage and Injury reflect the overall ratio of 2:1.

7. Road Conditions

The vast majority of accidents occur on Dry or Wet roads. The values of Property Damage and Injury reflect the overall ratio of 2:1.

8. Weather Conditions

The overwhelming majority of accidents occur in Clear, Raining, or Overcast weather. The values of Property Damage and Injury reflect the overall ratio of 2:1.

V. Final Feature Selection and Setup

1. Convert Date and Time

Date and Time are converted to integers for regression.
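One possible encoding (the report does not specify the exact mapping, so this is an assumption):

```python
# Encode Date as a day count (proleptic Gregorian ordinal) and Time as
# minutes after midnight, so both can enter the models as integers.
dt = pd.to_datetime(df["INCDTTM"])
df["DATE_INT"] = dt.dt.date.map(lambda d: d.toordinal())
df["TIME_INT"] = dt.dt.hour * 60 + dt.dt.minute
```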

2. Specify Other label

The following predictors have an 'Other' label. These labels will be appropriately differentiated by variable.

This differentiation will be important for the upcoming dummy variables.
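A sketch of one way to do this, assuming the original column names:

```python
# Make each generic 'Other' label specific to its variable so the resulting
# dummy columns remain distinguishable.
for col in ["LIGHTCOND", "ROADCOND", "WEATHER"]:
    df[col] = df[col].replace("Other", f"Other - {col}")
```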

3. Create Dummy Variables for:
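A sketch of this step, assuming the dummies are created for the categorical predictors kept earlier (address type, light, road, and weather conditions, plus neighborhood; the column names are assumptions):

```python
# One-hot encode the categorical predictors.
categorical = ["ADDRTYPE", "LIGHTCOND", "ROADCOND", "WEATHER", "NEIGHBORHOOD"]
df = pd.get_dummies(df, columns=categorical)
```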

4. Time and Light Condition Correlation

It occurred to me that Time and Light Condition are very likely to be highly correlated based on the frequency and preliminary summaries from my EDA. I decided to explore these further.

*AM/PM* from Time and *Daylight/Dark - Street Lights On* from Light Condition are correlated, as expected. However, the correlation is not strong enough for them to be considered collinear. No variables will be removed.

5. Road and Weather Condition Correlation

It occurred to me that *Dry/Wet* from Road Conditions should be compared to *Clear/Raining* from Weather Conditions.

*Dry/Wet* from Road Condition and *Clear/Raining* from Weather Condition are correlated, as expected. However, the correlation is not strong enough for them to be considered collinear. No variables will be removed.
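A quick way to check both pairs, assuming an AM/PM flag derived from the integer time and the dummy column names used above (all names here are assumptions):

```python
# Correlations between the paired indicators; values well below ~0.8-0.9
# suggest the variables are related but not collinear.
df["PM"] = (df["TIME_INT"] >= 12 * 60).astype(int)
print(df["PM"].corr(df["LIGHTCOND_Dark - Street Lights On"]))
print(df["ROADCOND_Wet"].corr(df["WEATHER_Raining"]))
```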

6. Default Values

The following are the category labels that contain the overwhelming majority of observations in their respective original variables. They will be set as the default, and the corresponding dummy column will be dropped.
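A sketch of the drop, assuming these are the default categories suggested by the EDA (the exact dummy column names are assumptions):

```python
# Drop the dummy column for each variable's default (most common) category;
# the remaining dummies are interpreted relative to it.
defaults = ["ADDRTYPE_Block", "LIGHTCOND_Daylight", "ROADCOND_Dry", "WEATHER_Clear"]
df = df.drop(columns=defaults)
```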

7. Final Predictors

VI. Classification Models

I am using four machine learning algorithms: Random Forest, K-Nearest Neighbors, Logistic Regression, and a Support Vector Machine.

Each model will be run in a validation curve simulation to optimize its respective parameter/hyperparameter. In addition, each model will be run for both the Latitude/Longitude and Neighborhood training sets.
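One way to run such a simulation is scikit-learn's validation_curve; the snippet below shows the idea for logistic regression over $C$ (X_train and y_train are assumed names for the prepared feature matrix and severity labels, and the exact routine used in the notebook may differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve

# Cross-validated train/validation accuracy across a hyperparameter grid.
param_range = np.logspace(-7, 1, 9)
train_scores, valid_scores = validation_curve(
    LogisticRegression(max_iter=1000), X_train, y_train,
    param_name="C", param_range=param_range, cv=5, scoring="accuracy",
)
print(valid_scores.mean(axis=1))
```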

1. Function

2. Decision Tree (Random Forest)

i. Setup

A random forest of 50 decision trees will be evaluated at maximum depths of 5 to 10 to determine the best fit.
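A sketch of this search (X_train, y_train, X_valid, and y_valid are assumed names for the training and validation splits):

```python
from sklearn.ensemble import RandomForestClassifier

# Compare maximum depths 5 through 10 for a 50-tree random forest.
for depth in range(5, 11):
    rf = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    rf.fit(X_train, y_train)
    print(depth, rf.score(X_valid, y_valid))
```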

ii. Latitude and Longitude

A maximum depth of 6 will be used.

iii. Neighborhoods

A maximum depth of 6 will be used.

3. K-Nearest Neighbors

i. Setup

Selected values of the number of neighbors from $1$ to $500$ will be evaluated to determine the best fit.

Due to the nature of the data and the complexity of the algorithm, KNN required far more time and computing resources than the entire rest of the analysis.

Hence, the selected values were 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 150, 200, 250, 300, 350, 400, 450, and 500.
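A sketch of this loop under the same assumed split names:

```python
from sklearn.neighbors import KNeighborsClassifier

# Evaluate the selected values of K; this is by far the slowest model to fit and score.
k_values = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50,
            100, 150, 200, 250, 300, 350, 400, 450, 500]
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(k, knn.score(X_valid, y_valid))
```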

ii. Latitude and Longitude

$N = 500$ will be used.

iii. Neighborhoods

$N = 500$ will be used.

4. Logistic Regression

i. Setup

A logistic regression will be used. Hyperparameter $C$ values from $10^{-7}$ to $10^{1}$ will be evaluated to determine the best fit.
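A sketch of the grid under the same assumed split names:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Evaluate C = 1e-7 ... 1e1 on a log scale.
for C in np.logspace(-7, 1, 9):
    lr = LogisticRegression(C=C, max_iter=1000)
    lr.fit(X_train, y_train)
    print(C, lr.score(X_valid, y_valid))
```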

ii. Latitude and Longitude

$C = 10^{-2}$ will be used.

iii. Neighborhoods

$C = 10^{-3}$ will be used.

5. Support Vector Machine

i. Setup

A linear support vector classifier with $L2$ penalty will be used. Hyperparameter $C$ values from $10^{-7}$ to $10^{1}$ will be evaluated to determine the best fit.
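A sketch of the same grid for the linear SVC (split names are assumptions as before):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Linear support vector classifier with an L2 penalty over C = 1e-7 ... 1e1.
for C in np.logspace(-7, 1, 9):
    svc = LinearSVC(penalty="l2", C=C, max_iter=10000)
    svc.fit(X_train, y_train)
    print(C, svc.score(X_valid, y_valid))
```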

ii. Latitude and Longitude

$C = 10^{-3}$ will be used.

iii. Neighborhoods

$C = 10^{-3}$ will be used.

VII. Results

1. Function

2. Train Set Results

3. Test Set Results

VIII. Discussion and Conclusion

Four models, using two different location measures, were evaluated on the training set and the test set. Note that all 16 accuracies fall within 67.1-67.3%. I could not see a meaningful difference between using Latitude/Longitude and using Neighborhoods. Neighborhoods did take longer to run in general, as about 90 dummy variables were added to categorize all neighborhoods with non-zero accident counts.

Using Neighborhood as the location predictor, K-Nearest Neighbors with 500 neighbors and a Linear Support Vector Classifier with an L2 penalty yielded 67.27% and 67.26% accuracy, respectively. The SVM took less than 30 seconds, whereas the KNN took several hours.

Since Property Damage and Injury occur approximately $\dfrac{2}{3}$ and $\dfrac{1}{3}$ of the time, randomly classifying Property Damage $\dfrac{2}{3}$ of the time and Injury $\dfrac{1}{3}$ of the time would yield an accuracy of $\left(\dfrac{2}{3}\right)^2 + \left(\dfrac{1}{3}\right)^2 = \dfrac{5}{9} \approx 55.6\%$. The learned models seemed to be a significant improvement over that.

My best algorithms are ~67% accurate because they are classifying accidents as Property Damage ~99% of the time, and ~67% of accidents happen to be Property Damage!

The predictors I used were immediate predictors related to the scene and setting of an accident. Other predictors, such as the number of cars and pedestrians involved, speeding, or driving under the influence, could possibly be useful and improve accuracy, but they take more time to gather. In addition, if one were at the scene of the accident gathering those variables, one could easily determine whether there was Property Damage or Injury. Prediction then becomes unnecessary, because one can simply document the result at the scene.

It seems to me that accidents in general should be predicted to be "Property Damage" twice as often as "Injury", and resource planning should be allocated accordingly. The dataset did not identify accidents involving both, which presumably occurs fairly often.

Note that using Block/Intersection alone yields 67.25% accuracy!

Note

I wanted to explore accidents documented at 00:01, as they were significantly over-represented. There does not appear to be any pattern in the dates (specifically, I suspected the 1st of each month). It is extremely curious to me that many accidents are evidently documented as occurring on the next day.