Machine Learning Methods to Analyze and Predict Crash Injury Severity Based on Contributing Factors for Southeast Michigan

Abstract

Traffic safety is a critical aspect of transportation. Transportation agencies and researchers have been continuously making an effort to ensure minimal error margins in the road system. There is a growing interest in the Southeast Michigan region because of the active role of original equipment manufacturers in spurring research in the new mobility field and autonomous vehicle technology. As new jobs are created and commuting trips increase, traffic safety becomes even more important. Thus, this study presents one of the first efforts to study crash patterns for Southeast Michigan using machine learning models to analyze and predict crash severity in the region based on various contributing factors. We combine a three-step method of correlation analysis, machine learning predictive models, and spatial analysis to develop a rigid schema for crash prediction. Results show that decision tree classifiers provide accurate predictions and rapid computations. Also, spatial plots of predicted injury severities reveal disproportionate errors in model predictions at intersections prompting the need to stratify the crash data for further analyses. This significantly improves severe injury predictions at non-intersection locations and highlights pedestrians as a main important feature. The converse holds for intersections that identify motorcycles as the main feature for severity prediction. Further results are elucidated in this paper.

Keywords

data and data science data analysis safety crash analysis crash data crash prediction models

Roadway safety is a critical component of transportation and one of the main focal points of research in the field. Over the last decade in the U.S.A., records indicate that the high occurrences of traffic crashes result in property damage, injuries, and fatalities ( 1 ). For this reason, it is essential to enhance research methods for predicting crash severity to determine the causes of crashes. In addition to this, it is also of great importance to identify locations with high crash occurrences to enact mitigating measures.

Over the years, a wide variety of statistical and data-mining methods have been used to assess the severity of crash injuries. From one perspective, the analysis of crash injury severity can be divided into two main categories: choice models and machine learning-based methods. Ordered probit, binary, and multinomial logit models are among the most frequently used choice models to predict crash injury severity. The main advantage of these approaches lies in their interpretive capabilities resulting from clear technical formulations between the dependent (i.e., class of the injury severity) and the explanatory variables. Kockelman and Kweon ( 2 ), for example, leveraged ordered probit models for evaluating injury levels under different crash types. One of the conclusions from their work was that pickups and SUVs are not as safe as passenger cars for the driver in single-vehicle crashes. In another study, Abdel-Aty ( 3 ) used ordered probit models and reported that driver’s age, gender, seat belt use, point of impact, speed, and vehicle type were significant in analyzing driver injury severity.

Other perspectives of crash severity analysis include statistical choice models and probability estimation for vehicle damage and injury severity based on various contributing factors ( 4 ). Although widely used for crash analysis, statistical choice models suffer from erroneous predictions, mainly because of the violation of models’ intrinsic assumptions about the endogenous and exogenous variables. Analyzing crash records indicates that less severe crashes are more likely to be underreported. Therefore, non-random samples of dependent variables are usually the case, violating the basic assumption of econometric models ( 5 ). In addition, the spatial and temporal correlation between crashes ( 6 ) and the endogeneity of model predictors ( 5 ) can significantly affect the performance of choice models in predicting the levels of crash severity. Despite substantial progress in addressing these problems, they still pose significant challenges in the field, leading to incorrect parameter estimations and erroneous statistical inferences ( 7 ).

As discussed earlier, there is an increasing interest toward using artificial intelligence techniques and machine learning models to predict crash severity. Several studies indicated that these models outperform traditional statistical methods in accurately predicting severity level in traffic crashes. On the other hand, many recent research works have focused on comparing the performance of different machine learning techniques in crash severity analysis. In their study, Zhang et al. ( 7 ) proved the higher performance of machine learning techniques in characterizing crash severity compared to multinomial logit and ordered probit models. In a similar study, Iranitalab and Khattak ( 8 ) illustrated that the three machine learning techniques of K-nearest neighbor (KNN), support vector machine (SVM), and random forest (RF) classifiers outperform the widely used multinomial logit models in regard to correct predictions. Alkheder et al. ( 9 ) also developed an artificial neural network structure to predict the severity level of traffic crashes that occurred in Abu Dhabi over six years. Results indicated noticeable improvements in the model’s prediction power compared to ordered probit models.

A review of these studies suggests that tree-based modeling structures often provide higher prediction accuracies compared to other machine learning techniques when predicting crash severity level. For example, Zhang et al. ( 7 ) compared the performance of four machine learning models, including decision tree (DT), RF, KNN, and SVM, and indicated that the RF model had the highest overall predicting accuracy. Their results are also consistent with the findings of Iranitalab and Khattak. Mafi et al. ( 10 ) also proved the higher performance of RF models in predicting driver injury severity at signalized intersections. In the present study, we also detected a similar pattern in the performance of adopted machine learning models. Our results indicated that the two tree-based modeling structures, DT and RF classifiers, provide a more balanced and accurate prediction of crash severity level in Southeast Michigan.

A promising example of efforts to enhance predictive modeling concepts for traffic crashes using machine learning techniques is a recent study by Mafi et al. ( 10 ), who leveraged instance-based and RF machine learning models for injury severity prediction for intersection safety. In this study, Mafi et al. ( 10 ) proposed a cost-sensitive approach that accounted for monetary cost penalization for inaccurate injury severity predictions during the development of the algorithms: cost-sensitive models indicated better prediction and minimized bias. In another study, Wahab and Jiang ( 11 ) investigated the severity level of motorcycle crashes using three different machine learning techniques. A relative importance analysis was also conducted to evaluate the effect of each model predictor on injury severity prediction. Safety strategies were then suggested to reduce the severity level in motorcycle crashes based on the most critical determinants identified by the model. Furthermore, machine learning models have been proven in several studies as superior to conventional methods for predicting crash severities in mountainous freeways ( 12 ), work zones ( 13 ), and freeway diverge areas ( 14 ).

With the focus on the Southeast Michigan region, the present study adopts machine learning techniques to predict crash injury severity. The region is regarded as an important area for the future of mobility. The transition from a rust-belt region to a location of increased high-tech activity began about a decade ago and is still witnessing restructuring ( 15 ). As such, we expect to see an increase in economic and industrial activity over the next couple of years. The presence of original equipment manufacturers and their involvements in autonomous vehicle research also means that there is anticipated growth in job opportunities and traffic congestion in urban centers. It is worth mentioning that the Ford Motor Company purchased and is currently remodeling the otherwise washed-out Michigan Central Train Depot located in the Corktown area in Detroit ( 16 ). It is expected that these developments will usher in thousands of commuting trips to the area. General Motors has also announced its decision to invest millions of dollars into expanding its plants in Southeast Michigan. While these developments are essential for socio-economic growth, they also imply increased commuting trips and traffic volumes. Already in 2019, reports indicate that more fatalities were recorded in Michigan than at the same time last year. Traffic safety is an important aspect of transportation to consider in the advent of Southeast Michigan’s growth. Even though statistics exist for traffic safety trends in the region, there have not been extensive studies in recent years for predicting crashes by severity. Such studies will serve as a basis to enact proactive/preventive measures. For this reason, in this study, we undertake a rigorous crash analysis for Southeast Michigan. The overarching objective of this paper is to predict the severity of crash-related injuries for the Southeast Michigan region using machine learning methods. Thus, we seek to answer the following research questions with this study: (i) what are the important contributory factors to crash severity in Southeast Michigan; (ii) how can we leverage spatial techniques to refine crash severity analyses to enhance model predictions; and (iii) which machine learning modeling provide superior performance in predicting crash severity? The structure of the paper is as follows. After this introduction, the second section discusses the methodology for this study, where we introduce the dataset, the machine learning models leveraged in our analyses, as well as initial correlation analyses to select variables of interest for our study. The third section highlights the results with discussions of our findings, and the final section of this paper details our conclusion and recommendation for future studies.

Methodology

To answer the abovementioned research questions, we develop and utilize a three-stage analytical schema that considers descriptive statistics and correlation analysis, injury severity predictive modeling using machine learning models, and spatial representation of model predictions using geographical information system (GIS) tools to guide analytical decisions for improved predictions. This approach ensures that we obtained a streamline set of important and relevant variables to crash severity, as well as refine the data stratification process for more focused analyses to arrive at results that are robust and can inform practical policy interventions.

Overview of Dataset and Data Preparation Methods

Data used for this study is from the Southeast Regional Council of Government (SEMCOG) database ( 17 ). The SEMCOG region is a seven-county region in Michigan with a total population of approximately 4.7 million, according to the region’s July 1, 2018, estimate. The seven counties in the area are Livingston, Macomb, Monroe, Oakland, St. Clair, Washtenaw, and Wayne. The crash database provides historical crash data over six years from 2013 to 2018. There are approximately 800,000 crashes recorded in this database. Furthermore, with anticipated growth in activity in the area, caused by projected developments, there is a chance that trends would escalate if the transportation authorities do not endorse proactive preventive measures.

As part of the initial data analysis, this study chooses several features as potential factors that may impact the crash severity. It includes explanatory variables such as alcohol, drug involvement, weather condition, crash type, and road condition. As a preprocessing method, these categorical variables are flattened into binary dummy values to better evaluate their individual impact on crash severity.

For severity level, the dataset classifies crashes into “Fatal Injury, Suspected Serious Injury, Suspected Minor Injury, Possible Injury, and No Injury” according to the KABCO injury classification scale produced by Federal Highway Administration ( 18 ). Previous literature showed that similar approaches that attempt to classify crash severity into multiple classes (greater than 2) tend to suffer low prediction accuracy, especially for high severity crashes. The low accuracy may result from the relatively small sample size and additional computational complicity for multiclass classification ( 8 ). In this study, the preprocessing step groups the KABCO injury severities into a binary variable “Severe.” It takes the value of “1” as severe for fatal injury and suspected serious injury, and “0” as non-severe for suspected minor injury, possible injury, and no injury. Other pertinent variables in the dataset are described in Table 1.

Table 1.

Descriptive Statistics of Variables

Category	Variable	Mean	Frequency	Variable	Mean	Frequency
Explanatory variables	Alcohol	0.0296	20337	Lane departure	0.1789	122860
	Bicycle	0.0068	4663	Motorcycle	0.0088	6036
	Deer	0.0405	27821	Occupants	2.2771	1563995
	Distracted	0.0215	14767	Pedestrian	0.0091	6223
	Disobey control signal	0.0372	25538	Red light rush	0.0273	18772
	Drug	0.0082	5648	School bus	0.0035	2384
	Elderly driver (65 or above)	0.1553	106659	Young driver (16–24 years old)	0.3405	233896
	Intersection	0.3401	233604	Weekend	0.2180	149738
	Lane departure	0.1789	122860	NA	NA	NA
Weather condition	Clear	0.6121	420408	Sleet or hail	0.0035	2407
	Cloudy	0.2075	142502	Blowing snow	0.0018	1228
	Fog	0.0042	2873	Blowing sand soil dirt	3.93E-05	27
	Rain	0.0915	62822	Smoke	2.33E-05	16
	Snow	0.0696	47781	Unknown weather	0.0081	5552
	Severe crosswinds	0.0018	1205	NA	NA	NA
Crash type	Single motor vehicle	0.1757	120698	Rear end right turn	0.0110	7541
	Head on	0.0129	8868	Sideswipe same	0.1460	100265
	Head-on left turn	0.0249	17075	Sideswipe opposite	0.0214	14683
	Angle	0.1740	119478	Backing	0.0058	3966
	Rear end	0.3506	240818	Other crash type	0.0656	45065
	Rear end left turn	0.0089	6116	Null crash type	0.0033	2262
Light condition	Daylight	0.6967	478537	Dark-lighted	0.1509	103640
	Dawn	0.0227	15570	Dark-unlighted	0.1019	69976
	Dusk	0.0218	14984	NA	NA	NA
Road condition	Dry road	0.7107	488119	Standing or moving water on road	0.0057	3890
	Wet road	0.1612	110752	Sandy road	1.75E-05	12
	Icy road	0.0492	33814	Oily road	4.08E-05	28
	Snowy road	0.0612	42024	Road lanes	3.3311	N/A
	Mud dirt gravel road	0.0025	1727	Speed limit	43.028	N/A
	Slush road	0.0091	6218	Debris road	3.83E-04	263
Severity level	Fatal injury K	0.0026	1780	Possible injury C	0.1401	96198
	Suspected serious injury A	0.0125	8598	No injury O	0.7932	544801
	Suspected minor injury B	0.0516	35470	Severe (suspected serious injury or fatal)	0.0151	10378

Note: NA = not available.

While some machine learning algorithms can work with categorical data, most require input or output variables to be numeric. Input data such as “Lighting” that does not have any ranking for category values can lead to poor performance or less accurate results. To prepare data and get improved prediction each categorical value has been converted into a new unique variable and assigned a binary value of 1 or 0 (0 indicates nonexistent while 1 indicates existent). These new columns are used as features to train the model.

Machine Learning Models for Crash Severity Prediction

Different machine learning classification methods are developed to predict the class of a sample based on its features. Some learning methods require a dataset with both features and their known classes to train the classifier. These methods are known as supervised learning. This study used the supervised classification methods of DT, RF, SVM, artificial neuron network (ANN), Logistic Regression (LR), KNN.

DT

DT’s name presents rules of classification that are simple and interpretable by humans ( 19 ). The internal nodes/branches represent test conditions, and leaf nodes that represent the final classes. The tree starts from the root nodes, which contain all sample sets, moves down internal nodes, and get classified at the leaves. Each internal node divides the observations into classes based on the features that best divide that node’s set of samples into subsets. It computes the information gain from the split by evaluating the change of entropy. The algorithm picks out the attribute with the largest information gain to split the data at each node. DT repeats this procedure until the subset’s size is small enough, or all the samples in the current subset belong to the same class. In this work, the normalized information gain at each node, which can evaluate a feature’s importance, is also used to support the feature selection for the data.

Entropy : E (T, X) = \sum_{c \in X} P (c) E (c),

where T is current state, and X is selected attribute.

RF

RF is an ensemble learning method (use multiple classifiers and aggregate their results) ( 20 ). It consists of a collection of DT classifiers with independent identically distributed random vectors. Each node of RF is split using the best among a subset of tree classifiers randomly chosen at that node. This method has performed very well compared to many other commonly used classifiers and is robust against overfitting ( 21 ).

SVM

SVM is a machine learning approach developed by Vapnik ( 22 ). It is a system for efficiently training the linear learning machines in the kernel-induced feature spaces while utilizing the generalization theory and the optimization theory. It includes a set of related supervised learning methods used for prediction and regression. Structural risk minimization is one of the key theoretical foundations for the learning algorithms of SVMs. In this study, the author selects the linear support vector classifier (LSVC). The preprocessing step flattens most of the inputs into binary dummy variables, and LSVC provides high computational efficiency.

minimize [\frac{1}{n} \sum_{i = 1}^{n} \max (0, 1 - y_{i} (x^{T} x_{i} - b))] + λ ‖ w ‖^{2}

ANN

ANNs are machine learning methods that are inspired by biological neural networks. It consists of a collection of connected units (neurons). Each neuron can receive, process, and pass along information and their network become a parallel processing system. It can adapt different learning rules. This study chooses multilayer perceptron, which is one of the most commonly used learning rules. It can approximate any finite nonlinear functions with very high accuracy ( 23 ).

LR

LR uses a logistic function to model a binary dependent variable. It is a widely used discrete choice model. Many previous approaches in the literature have used its multiclass form, multinomial LR for crash severity analysis ( 8 ). This study classifies the severity as a binary variable (severe/non-severe), so LR will suffice. In the logistic model, the logarithm of the probability of that value being “True” is a linear combination of the observation’s feature variables. The classifier can then mark the data with the calculated probability.

l = \log_{b} \frac{p}{1 - p} = β_{0} + β_{1} x_{1} + β_{2} x_{2}

where b is the base of logarithm, and β_i is the parameters of the model.

KNN

KNN classifies a target observation by looking at the closest k observations. It will classify the observation based on the classification of the majority of the k closest samples. To implement KNN, it needs to decide the value of k and distance function for nearest neighbor. This study determines k value by cross-validation accuracy. For distance function, we choose Euclidean distance, which can be interpreted as the physical distance between two high dimensional points ( 24 ).

Feature Selection

In this study, we adopt two initial feature selection processes to reduce the data size to improve classification results: correlation analysis and DT feature importance. The selected feature is then used in the machine learning models as input for severity prediction. In the first step, we compute a correlation matrix for all the potential features and the severity levels. The correlation is used to estimate the plausibility of a feature to result in a severe crash. Table 2 shows the highest-ranking features based on their correlation values.

Table 2.

Correlation Analysis and Decision Tree Feature Importance

Features	Correlation with severity	Features	Correlation with severity
Pedestrian	0.16735	Dark-lighted	0.02630
Motorcycle	0.14319	Weekend	0.02606
Alcohol	0.10612	Dark-unlighted	0.02021
Drug	0.10243	Units	0.01695
Head-on	0.06544	Occupants	0.01676
Single motor vehicle	0.05054	Clear	0.01505
Bicycle	0.04769	Angle	0.01467
Lane depart	0.04765	Mud dirt gravel road	0.01232
Disobey control signal	0.03122	Other crash type	0.01218
Red light rush	0.02787	Intersection	0.01169
Head-on left turn	0.02693	NA	NA

Note: NA = Not available.

We selected the highlighted features in Table 2 based on their correlation value (>0.02) for further analysis. One observation from the table is that the correlation values are overall quite low (<0.2). We attribute this to the nature of the crash events because each feature will not directly impact crash severity; however, the likelihood of a severe crash may increase. Since the selected features are all binary values, we use a DT classifier to extract the normalized network gain to assess how impactful it is to involve each feature in the classifier. The illustration in Figure 1 shows that the DT first selects the feature split that leads to the largest entropy change at each node: the normalized entropy change is considered as feature importance. These values are computed and extracted with the DT classifier and can indicate a feature’s impact on crash severity. We then use 10 training sets (each 75% of all crash data) chosen randomly to train the DT classifiers and compute the average value of normalized network gains for each parameter as feature importance. Table 3 shows the ranking of features based on their importance: the highlighted features are used as input to train/predict with different machine learning classifiers.

Table 3.

Feature Ranking Based on Importance

Features	Correlation with severity	Features	Correlation with severity
Pedestrian	0.21473	Head-on	0.03712
Motorcycle	0.19164	Head-on left turn	0.02823
Alcohol	0.16210	Dark-unlighted	0.01775
Lane depart	0.10425	Dark-lighted	0.01694
Disobey traffic signal	0.06682	Weekend	0.01167
Bicycle	0.06656	Single motor vehicle	0.00758
Drug	0.04457	Red light rush	0.00176

Figure 1.

Decision tree model structure.

Spatial Analyses

For additional treatment of data in the analytical process, we conduct a spatial analysis to identify where misclassifications and inconsistencies occur in our analyses relative to the roadway construct. This is important for refining our analyses and making policy recommendations. Observations in the initial results are used to stratify our dataset for further analyses as seen and discussed in next section.

Results and Discussion

Balancing the Biased Data

Because of the nature of traffic crashes, most data collected are non-severe: out of 6,86,847 crash data collected through 2013 to 2017, only 10,378 is classified as severe, which is 1.51% of the dataset. It indicates that any classifier that would label all crashes to be non-severe will reach over 98% overall accuracy. The result would be heavily biased toward the non-severe class, and such performance is not desired. This study tries to peruse a balanced performance for accuracy in predicting both severe and non-severe crashes. Thus, it uses a weight balancing technique that oversamples the small size class by duplicating the severe crash samples to increase the cost of wrongly predicting these crashes until both classes have a similar number of samples (balanced weight) for the classifier. This method can improve the accuracy of predicting a rare-occurring event (severe crash in this case) for machine learning problems ( 18 ). In the literature, some other approaches, such as cost-sensitive learning, tie the cost of the wrong prediction to the cost of a crash at that severity level, which can achieve a similar effect ( 10 ). The drawback of such approaches is that they may reduce the overall prediction accuracy for all samples, which can be expected as the distribution of each class in the training data is changed.

Classifier Training and Validation

Cross-validation was run to test the performance of each machine learning algorithm with their accuracy at each severity class. The classifiers’ parameters are calibrated for better performance. The average accuracy of 10 cross-validation scenarios is used to compare between different classifiers.

The results as observed in Table 4 show that all six classifiers provide varied prediction accuracies for severe and non-severe cases. Based on the results, DT, RF, and ANN are the preferred classifiers as the accuracy for each class is more balanced, with approximately 78% accuracy rates for validation sets with non-severe and severe crashes with f1 score of 0.78. Even though SVM, KNN, and LR produced higher accuracies of about 86% for non-severe crashes, the false predictions for severe crashes are more frequent, especially for KNN. Considering the trade-off between model accuracy and computational time, we observed that the DT classifier runs the fastest compared to RF and ANN, while all three have similar accuracy. The impressive performance of the DT, RF, and ANN can be attributed to the selected features for classifications all being binary variables, which fits well with the tree like structure and the neural network structure. The DT classifier is trained with SEMCOG crash data from 2013 to 2017 and used to predict the severity of the 2018 crash data for the SEMCOG region. The accuracy for non-severe crashes is 81.07% and for severe crashes is 66.32% as is shown it Table 5. The test accuracy is reasonably close to the validation results, which indicates that the pattern of crash severity is relatively consistent over the years. Compared to the existing literature, these results gives a similar performance for overall accuracy (70%∼90% in reviewed literature), yet a superior prediction accuracy for severe crashes (10%∼35%), ( 7 , 8 ). Note that the reviewed literature classifies the severity into 3∼5 different levels. This study simplifies it and combine them into two levels (severe or non-severe), which may also contribute to the improved result.

Table 4.

Prediction Accuracy Based on Machine Learning Models

	Classification methods	Decision tree	Random forest	Neuron network	Linear SVC	K-nearest neighbor	Logistic regression
Prediction accuracy for each class in the validation set	Non-severe	0.7758	0.7743	0.7768	0.8822	0.7776	0.8827
Prediction accuracy for each class in the validation set	Severe (recall)	0.7793	0.7793	0.7801	0.6009	0.7686	0.5985
Other classification matricies	Accuracy	0.7775	0.7768	0.7785	0.7422	0.7731	0.7412
	Precision	0.7749	0.7739	0.7801	0.6009	0.7686	0.5985
	F1 score	0.7771	0.7766	0.7780	0.6988	0.7713	0.6972
	AUC	0.8224	0.8241	0.8271	NA	0.8166	0.8174

Note: SVC = support vector classifier; AUC = area under the curve; NA = not available.

Table 5.

Validation Results Based on Decision Tree

	Decision tree
Prediction accuracy for each class in the test set (2018 crash data)	Non-severe	81.07%
	Severe	66.32%
	Overall	80.52%

Spatial Analyses

In addition, we analyze the relationship between crash location and its prediction accuracy. The figures below indicate a section of the roadway network east of Detroit. Here we highlighted the observed crash data points by injury severity in Figure 2a and the predicted data in Figure 2b. These figures identify no-injury crashes by a green point layer and severe-injury crashes by red. While most crashes are accurately predicted, some no-injury crashes are wrongly predicted as severe. This is expected because of the disproportional frequency, denoted with a red point layer in Figure 2c. Even though we identified some falsely predicted crashes clustered at certain intersections, the overall prediction accuracy was impressive as previously discussed. The general observations we made in regard to higher clustering of crashes at or close to intersections are not atypical; however, visualizing the data allowed us to identify areas where most wrong predictions occurred. The locations of inaccurate predictions were predominant at intersections, and this informed further analyses to investigate how we can improve predictions by considering intersections and non-intersections distinctly.

Figure 2.

Map showing crash locations for: (a) recorded crashes, (b) predicted crashed using the decision tree, and (c) accuracy of predicted crashes.

This finding reveals that the current approach does not suffice to spur policy decisions. Consequently, we identified patterns in the DT analyses, pedestrian, motorcycle, and alcohol involvements as the highest determinants of crash severity predictions in decreasing order of importance. Taking an in-depth look at pedestrian crash patterns (Figure 3a), we observed clustering of severe-injury related crashes along major arterials and highways. High-speed vehicles on these roadways would cause severe injuries to pedestrians in most cases. Additionally, most major roadways do not have the requisite infrastructure for pedestrian mobility, increasing pedestrian risk to crash-related injuries. When considering denser city centers such as downtown Detroit and Ann Arbor, indicated in Figure 3, b and c , respectively, we noticed that pedestrian-related crashes are more clustered at intersections—however, with fewer records of severe injuries. We attribute this to slower-moving vehicles in those areas. Additionally, pedestrian injury severity tends to increase closer to major arterials and other suburban regions. These trends are consistent in both Ann Arbor and Detroit geographic areas.

Figure 3.

Maps showing the location of pedestrian crashes by severity in: (a) Wayne County, Michigan, (b) Downtown Detroit, and (c) City of Ann Arbor.

To improve crash prediction accuracies, we stratify the data by considering crashes that occur at intersections distinctly from those that occur on other roadways. We then applied the DT classifier to the data strata. Results for non-intersection crashes show significantly improved prediction accuracies for severe injuries, with further improvements seen when using different cohorts as reported in Table 6. For example, we observe that prediction accuracies are improved when applying a DT classifier to a cohort of crashes involving people between 20 and 60 years at a non-intersection area. Conversely, we observe relatively lower accuracies when predicting severe injuries at intersections. Improving prediction accuracies at intersections will require more rigorous studies by leveraging additional data because of the variance in the causations of intersection-related crashes. In further analyses, we noticed that the order of high-importance features predicted alternate with each data strata. For instance, the variable of highest importance for severity prediction at intersections only is motorcycle, while that for non-intersection is pedestrians. This finding echoes the previous discussions on pedestrian crashes. Similar observations are made for tests for crashes involving young drivers only and elderly drivers only (not stratified by roadway location): the prediction accuracies for severe crashes are less accurate (see Table 6). Complete results for feature importance for data strata in Table 6 are reported in Table 7.

Table 6.

Decision Tree Accuracy for Stratified Data

Crash type	Intersection	Non-intersection	Elderly	Young	All crashes	Crashes not involving elderly, young, or intersection
Prediction accuracy for non-severe crashes	79.91%	79.60%	85.14%	85.14%	79.26	81.03%
Prediction accuracy for severe crashes	62.89%	71.23%	51.78%	51.78%	69.35	71.59%
Highest importance feature	Motorcycle	Pedestrian	Lane depart	Alcohol	Pedestrian	Lane depart

Table 7.

Summary of Feature Importance

Intersection		Non-intersection		Elderly
Variable	Feature importance	Variable	Feature importance	Variable	Feature importance
Motorcycle	0.2088	Pedestrian	0.2104	Lane depart	0.2799
Pedestrian	0.1793	Lane depart	0.1677	Pedestrian	0.1868
Single motor vehicle	0.0961	Motorcycle	0.1575	Motorcycle	0.1866
Disobey control signal	0.0909	Alcohol	0.1398	Alcohol	0.0773
Alcohol	0.0907	Single motor vehicle	0.0728	Drug	0.0511
Drug	0.0655	Drug	0.0702	HeadOn	0.0420
Bicycle	0.0645	Headon	0.0589	Single_Motor_Vehicle	0.0391
Head on left turn	0.0596	Bicycle	0.0508	Weekend	0.0285
Headon	0.0446	Dark-lighted	0.0204	Dark-lighted	0.0277
Red light rush	0.0358	Weekend	0.0150	Dark-unlighted	0.0247
Dark-lighted	0.0328	Dark-unlighted	0.0148	Bicycle	0.0221
Dark-unlighted	0.0159	Head on left turn	0.0113	Head_On_Left_Turn	0.0153
Weekend	0.0155	Disobey control signal	0.0057	Redlightru	0.0144
Lane depart	0.0000	Red light rush	0.0047	Disobey ctrl signal	0.0046
Young		General (all crashes)		No young/no old non-intersection
Variable	Feature importance	Variable	Feature importance	Variable	Feature importance
Alcohol	0.2256	Pedestrian	0.2147	Lanedepart	0.2809
Motorcycle	0.1728	Motorcycle	0.1916	Pedestrian	0.2345
Pedestrian	0.1566	Alcohol	0.1621	Motorcycle	0.1625
HeadOn	0.0959	Lane depart	0.1043	Alcohol	0.1009
Drug	0.0700	Disobey traffic signal	0.0668	Bicycle	0.0772
Disobey ctrl signal	0.0575	Bicycle	0.0666	Drug	0.0391
Single_Motor_Vehicle	0.0431	Drug	0.0446	Single_Motor_Vehicle	0.0294
Lanedepart	0.0394	Head On	0.0371	Dark-unlighted	0.0152
Head_On_Left_Turn	0.0340	Head On Left Turn	0.0282	Weekend	0.0149
Bicycle	0.0297	Dark-unlighted	0.0178	HeadOn	0.0146
Dark-lighted	0.0265	Dark-lighted	0.0169	Dark-lighted	0.0117
Weekend	0.0244	Weekend	0.0117	Head_On_Left_Turn	0.0099
Dark-unlighted	0.0209	Single motor vehicle	0.0076	Disobey ctrl signal	0.0056
Redlightru	0.0036	Red light rush	0.0018	Redlightru	0.0036

As alluded to, analysis of important features in crash severity prediction indicated that pedestrian crashes were predominant at non-intersections arears, while motorcycles affect crash severity at intersections. Indeed, a significant finding of the present study is to highlight the importance of complementing spatial data analysis with machine learning models to achieve a better understanding of the nature of traffic crashes as well as the determinant variables in different crash types/locations. Reviewing the results of variable importance analysis for different crash types/locations in previous studies further supports this idea.

Mafi et al. ( 10 ), for example, followed a similar approach to identify key determinants of driver injury severity at signalized intersections across different age/gender group combinations. While some factors, like area type and median width, were important in all driver groups, authors discovered substantial differences in injury severity determinants for different driver cohorts. Zhang et al. ( 7 ) also adopted a linear sensitivity analysis approach to investigate important variables affecting the overall crash severity in freeway diverge areas. While the number of lanes, ramp length, and land type were found to be the most important variables according to the proposed statistical models, large differences were observed in the way these variables affect the prediction accuracy of the adopted machine learning models.

The key takeaways from the above discussion is that predicting crash severity needs a deep analysis of the environment and the associated geographic region to extract primary determinants of crash severity at each location. In addition, special attention should be paid to analyzing determinant factors when working with machine models. Different machine learning models use distinct mechanisms to explore inherent patterns in datasets and this can, in turn, lead to assigning different importance values for same model variables.

Conclusions

The role of Southeast Michigan as a hub for mobility research now and in the coming years makes this an area of prime interest for socio-economic development and growth. With such change comes an increase in commuting trips and traffic flows on our roadways. This is an area of concern since the increase in congestion and vehicle miles traveled tends to correlate positively with traffic crashes. For this reason, the present study addresses traffic safety issues by undertaking a rigorous crash analysis for Southeast Michigan. This work presents one of the first studies to apply such methods to this important area experiencing rapid industrial and technological resurgence.

More specifically, we predict the severity of crash-related injuries for the Southeast Michigan region using machine learning methods by complementing the predictive modeling with GIS spatial observations. Results indicate that while all models have similar performance in predicting crash severity, DT and RF estimates provide the best predictions with DT preferred because of the computational ease and accuracy. Combining the machine learning methods with GIS analyses revealed crash patterns to be useful in guiding streamlined model executions for intersections and non-intersections distinctly. In this paper, we observed that pedestrian crashes are clustered at non-intersection locations and are the primary determinant of crash severity at those locations.

The results were promising as non-intersection-only analyses showed improved accuracy while intersection-only analyses provided worse predictions. The important features in the models also changed as pedestrian crashes were predominant at non-intersections while motorcycles affect crash severity at intersections. The results in this study echo the importance of machine learning models in crash severity prediction and highlight that when complemented with GIS spatial analysis, methods can be refined to better comprehend observed patterns. One limitation of the study is that the crash data origin from police reports lacks detailed traffic operation and road geometry data such as curvature and traffic control signal, which may also impact crash type and severity. We recommend further studies to incorporate more data and improve model predictions, and we recommend that measures be enacted to protect pedestrians, motorcyclists, and the elderly significantly affected by crashes in the SEMCOG region.

Footnotes

Acknowledgements

The Ford Robotics and Mobility Research team would like to acknowledge SEMCOG officials for making their data available to us for research purposes.

Author Contributions

The authors confirm contribution to the paper as follows: study conception and design: X. Cai, R. Twumasi-Boakye, Y. Rahmati, S. Jain, J. Fishelson; data collection: X. Cai, S. Jain; analysis and interpretation of results: X. Cain, R. Twumasi-Boakye, Y. Rahmati; draft manuscript preparation: X. Cain, R. Twumasi-Boakye, Y. Rahmati, J. Fishelson. All authors reviewed the results and approved the final version of the manuscript.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Xiaolin Cai

Richard Twumasi-Boakye

Yalda Rahmati

James Fishelson

References

Johansson

Vision Zero–Implementing a Policy for Traffic Safety. Safety Science, Vol. 47, No. 6, 2009, pp. 826–831.

Kockelman

K. M.

Kweon

Y.-J.

Driver Injury Severity: An Application of Ordered Probit Models. Accident Analysis & Prevention, Vol. 34, No. 3, 2002, pp. 313–321.

Abdel-Aty

Analysis of Driver Injury Severity Levels at Multiple Locations Using Ordered Probit Models. Journal of Safety Research, Vol. 34, No. 5, 2003, pp. 597–603.

Qin

Wang

Cutler

C. E.

Analysis of Crash Severity Based on Vehicle Damage and Occupant Injuries. Transportation Research Record: Journal of the Transportation Research Board, 2013. 2386: 95–102.

Savolainen

P. T.

Mannering

F. L.

Lord

Quddus

M. A.

The Statistical Analysis of Highway Crash-Injury Severities: A Review and Assessment of Methodological Alternatives. Accident Analysis & Prevention, Vol. 43, No. 5, 2011, pp. 1666–1676.

Paleti

Eluru

Bhat

C. R.

Examining the Influence of Aggressive Driving Behavior on Driver Injury Severity in Traffic Crashes. Accident Analysis & Prevention, Vol. 42, No. 6, 2010, pp. 1839–1854.

Zhang

Comparing Prediction Performance for Crash Injury Severity Among Various Machine Learning and Statistical Methods. IEEE Access, Vol. 6, 2018, pp. 60079–60087.

Iranitalab

Khattak

Comparison of Four Statistical and Machine Learning Methods for Crash Severity Prediction. Accident Analysis & Prevention, Vol. 108, 2017, pp. 27–36.

Alkheder

Taamneh

Severity Prediction of Traffic Accident Using an Artificial Neural Network. Journal of Forecasting, Vol. 36, No. 1, 2017, pp. 100–108.

10.

Mafi

Abdelrazig

Doczy

Machine Learning Methods to Analyze Injury Severity of Drivers From Different Age and Gender Groups. Transportation Research Record: Journal of the Transportation Research Board, 2018. 2672: 171–183.

11.

Wahab

Jiang

A Comparative Study on Machine Learning Based Algorithms for Prediction of Motorcycle Crash Severity. PLoS One, Vol. 14, No. 4, 2019, p. e0214966.

12.

Abdel-Aty

Analyzing Crash Injury Severity for a Mountainous Freeway Incorporating Real-Time Traffic and Weather Data. Safety Science, Vol. 63, 2014, pp. 50–56.

13.

Mokhtarimousavi

Anderson

J. C.

Azizinamini

Hadi

Improved Support Vector Machine Models for Work Zone Crash Injury Severity Prediction and Analysis. Transportation Research Record: Journal of the Transportation Research Board, 2019. 2673: 680–692.

14.

Liu

Wang

Using Support Vector Machine Models for Crash Injury Severity Analysis. Accident Analysis & Prevention, Vol. 45, 2012, pp. 478–486.

15.

Fasenfest

Jacobs

An Anatomy of Change and Transition: The Automobile Industry of Southeast Michigan. Small Business Economics, Vol. 21, No. 2, 2003, pp. 153–172.

16.

Gaudette

Bardelli

Michigan Central Station–Detroit’s Most Iconic Symbol of Growth, Decay and Rebirth Existing Conditions Assessment of Slabs Using GPR. Proc., SAGEEP 2019-32nd Annual Symposium on the Application of Geophysics to Engineering and Environmental Problems, Portland, OR, 2019.

17.

SEMCOG. About SEMCOG. Southeast Michigan Council of Governments, Detroit, MI, 2019.

18.

Batista

G. E.

Prati

R. C.

Monard

M. C.

A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. ACM SIGKDD Explorations Newsletter, Vol. 6, No. 1, 2004, pp. 20–29.

19.

Quinlan

J. R.

C4. 5: Programs for Machine Learning. Elsevier, Amsterdam, Netherlands, 2014.

20.

Liaw

Wiener

Classification and Regression by RandomForest. R News, Vol. 2, No. 3, 2002, pp. 18–22.

21.

Breiman

Random Forests. Machine Learning, Vol. 45, No. 1, 2001, pp. 5–32.

22.

Vapnik

V. N.

An Overview of Statistical Learning Theory. IEEE Transactions on Neural Networks, Vol. 10, No. 5, 1999, pp. 988–999

23.

Mitchell

T. M.

Artificial Neural Networks. Machine Learning, Vol. 45, 1997, pp. 81–127.

24.

Hewson

P. J.

Multivariate Statistics With R. 2009. http://local.disia.unifi.it/rampichini/analisi_multivariata/Hewson_notes.Pdf.