Effect of Socioeconomic and Demographic Factors on Crash Occurrence

Abstract

The crash and fatality rates in the Commonwealth of Kentucky, U.S., have been higher than the national average for the past decades. It has been hypothesized that the distinct socioeconomic conditions of the state could be contributing to and explaining these trends. This study investigated the influence of socioeconomic characteristics on highway safety in Kentucky and attempted to identify the high-risk driver groups, based on crash data and the socioeconomic and demographic features of their residence zip codes. The quasi-induced exposure technique and binary logistic regression were employed to develop a predictive modeling approach for determining the probability of being the at-fault driver in single- and two-unit crashes, based on socioeconomic characteristics of the driver residence zip code. The study identified that socioeconomic features such as income, poverty level, employment, age, gender, rurality, and number of traffic-related convictions of a driver’s zip code influence their likelihood to be at fault in a two-unit crash, while for single-unit crashes, in addition to these variables, educational attainment also had an impact. Younger and older drivers living in zip codes with low socioeconomic conditions have a higher probability of being the at-fault driver in both single- and two-unit crashes. The conclusions of the study can be used to determine the regions (zip codes) and driver groups with a higher likelihood of being the at-fault driver in a crash and to develop effective safety programs for the target groups.

Traffic fatalities are a leading cause of death in the U.S. The Commonwealth of Kentucky has had a higher fatality crash rate than the national average since 1986 ( 1 , 2 ). Moreover, the crash rate per population for Kentucky is also higher than that of the national average. It is therefore important to identify and address the factors contributing to the high collision rates and implement effective safety measures to mitigate them. Overall roadway safety can be improved by addressing the underlying issues that lead to such safety concerns.

Most of the practices on enhancing roadway safety focus on improving roadway systems and geometric designs at high-crash locations. Even though these efforts could improve roadway safety, the increasing trends in crash numbers indicate that there are additional crash-related factors yet to be addressed. It is important that transportation agencies understand the underlying factors contributing to crashes to implement effective countermeasures to improve overall roadway safety. Previously, geometric design, roadway characteristics, driver behavior, demographic factors, and socioeconomic features were identified as factors influencing safety ( 3 – 10 ). The National Motor Vehicle Crash Causation Survey indicates that 94% of the crashes are caused by driver behavior ( 11 ). However, underlying factors that could affect driver behavior resulting in a crash involvement have not been widely discussed. The socioeconomic and demographic characteristics of the driver and their residency environment may influence their driving conduct and eventual crash involvement. Prior research has identified the influence of macro-level socioeconomic characteristics on crash occurrence (e.g., poverty, income, employment, and education) ( 3 , 6 , 8 , 12 – 14 ). However, the focus for most of these studies was the socioeconomic factors of the region where the crash occurred, and this would not allow for understanding the influence of the residence socioeconomic characteristics of the driver involved in a crash, which may yield more information on the underlying factors contributing to crashes. It would help practitioners identify those at-risk driver groups and implement effective countermeasures focusing on improving traffic safety.

Stamatiadis and Puccini demonstrated that the southeastern U.S. states experience consistently higher traffic fatalities compared with other states ( 13 ). They concluded that the distinct socioeconomic characteristics of these states could explain the higher rates. Socioeconomic features such as median household income, unemployment, educational attainment, and percentage of rural population were identified as significant descriptors of the region. Their study also suggests that socioeconomic data associated with a driver’s residence zip code could serve as a potential surrogate measure for explaining the high fatality rates. Therefore, the distinctive socioeconomic features of Kentucky could be a possible explanation for its high crash rates. Exploring this aspect might help to identify the causes leading to the increasing crash trends in the state.

The primary goal of this research is to identify the potential impact of socioeconomic variables on the probability of a driver to be the one at-fault in a crash and use this information to identify potential countermeasures to improve roadway safety. This study investigates the socioeconomic and demographic factors associated with a driver’s residence zip code, employing statistical analysis of historic crash data. The modeling approach estimates the probability of a driver to be at-fault when involved in a crash based on the socioeconomic and demographic features of the driver residence zip code. The objective of the study is to determine the potential descriptors explaining the likelihood of a driver to be at fault in a crash. The models developed in this study can be used to identify high-risk driver groups who are more likely to be involved as the at-fault driver in a crash, which can be used as evidence for implementing efficient safety programs for these groups.

Literature Review

Prior research has identified income, education level, poverty percentage, employment, driver age, and the rurality of an area as the socioeconomic factors more relevant to crash occurrence ( 3 , 6 , 8 , 13 , 15 ). Education and income are identified to have a negative association with crash occurrence, and poverty is positively correlated, while employment varied across studies ( 14 ). Youth is highly correlated with crashes and fatalities. In 2019, 45% of the fatal crashes occurred in rural areas ( 16 ).

The relationship between crash involvement and age has been well documented demonstrating the positive association between young (< 25) and older (> 65) drivers and crashes or fatalities, as well as their higher propensity to be at fault in a crash ( 13 , 17 , 18 ). Previous research findings concluded that older drivers exhibit higher crash involvement than younger drivers and create greater risk for themselves and others ( 18 , 19 ). This study will also evaluate these trends for Kentucky.

Gender and marital status (separated or widowed) of the driver have also been identified to be associated with crash occurrence. In Kentucky, most of the drivers (55%, where the gender was listed) involved in collisions in 2016 were male ( 20 ). This increased to 74% in fatal crashes. These trends are holding true over the years, and the gender-crash (or crash severity) relationship could be influenced by socioeconomic factors in Kentucky. A study in Alabama found that driving under the influence (DUI)-related crashes are associated with male employment and female educational attainment ( 8 ). The marital status of drivers was initially considered in this study but not included in the final model because of multicollinearity. This study will also examine these interactions to determine their potential contribution to crash occurrence.

Prior crash involvement and citations have been also shown to be good predictors of crash occurrence. Past research has examined the effects of crash history/citation on crash occurrence, but none has considered the driver’s socioeconomic attributes. Future crash involvement as an at-fault driver has been examined considering prior crash involvements ( 21 , 22 ). Das et al. did not consider drivers with single crash involvement thus providing a future research opportunity, while Chandraratna et al. used previous crash involvement as the dependent variable without assigning any weight to the number of crash involvements, that is, a driver with one crash involvement was considered as risky as the driver with four (for instance) ( 21 , 22 ). This study will utilize citation information to predict the at-fault probability of a driver involved in a crash when adjusted for socioeconomic characteristics.

Different methods have been used in the past to investigate the role of these factors on crashes, and, while all are valid, a wide range of analytical practices for relating socioeconomic characteristics with crash data is available. In the past, many forms of regression technique have been applied, as well as spatial statistics, clustering, and comparative groupings, to analyze the influence of socioeconomic characteristics on several descriptors of crashes (crash risk, crash frequency, etc.) ( 3 , 5 , 12 ). However, there is little research on how socioeconomic characteristics influence crash occurrence. This research aims to estimate the at-fault likelihood of a driver involved in a crash using as predictors U.S. Census socioeconomic factors associated with their residence. The response variable here is the driver’s fault status, a categorical variable, that is most appropriately estimated through logistic regression. This modeling technique is beneficial for examining the effects of more than one explanatory variable and allows for the results to be displayed as probabilities or odds ratios.

Data

Kentucky crash data for 2013–2016 aggregated at the zip code level were used in this study. According to the Kentucky Geological Survey, there are 746 zip codes in the state ( 14 , 23 ). The 5-digit zip code of the driver residence was obtained from the Kentucky State Police (KSP) records. The majority of crashes were two-unit crashes (about 77%) and the remaining were single-unit (13.7%) or more than two units (9.3%). For this study, only single- and two-unit crashes were utilized, limiting the number of drivers involved to a maximum of two.

This study evaluates the driver fault status; therefore, information on passengers and pedestrians associated with crashes was not considered. The police officer responsible for reporting the crash determines the driver human factors contributing to the crash occurrence ( 14 ). The human factors recorded by the officer for each driver are used to determine their at-fault status. For each crash considered, the driver with a human factor code recorded by the police officer is considered to be the at-fault driver for the crash occurrence, an assumption that has been previously validated ( 14 , 24 , 25 ). For each crash, the driver with a human factor coded as “non-detected” was considered to be not at fault, while the driver with a human factor detected was treated as the at-fault driver ( 14 ). Crash records with multiple human factors identified for the driver were reviewed and the first human factor recorded was used to determine fault status. Crashes in which a human factor code was recorded for both drivers or for neither driver were eliminated from the analysis. These selection criteria avoided identifying multiple at-fault drivers for the same crash in two-unit crashes ( 14 , 24 ). In single-unit crashes, just one vehicle/driver is involved and is considered to be responsible for the crash. The absence of a corresponding not-at-fault driver group is addressed through the use of a not-at-fault driver group randomly obtained from the two-unit crash dataset. This allows for forming the exposure of single-unit crashes as well, an assumption that has been previously tested and validated ( 17 , 24 ). Additionally, a few random samples were drawn to ensure that they were truly random, and all tested samples resulted in the same logistic regression model with no significant differences in the coefficients and estimates.
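The fault-assignment rule for two-unit crashes described above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not the study's actual processing code: the function name and the `NONE_DETECTED` label are assumptions, and each driver is assumed to be represented by the first human factor recorded for them.

```python
from typing import Optional, Tuple

# Hypothetical label standing in for the "no human factor detected" code.
NONE_DETECTED = "none detected"


def assign_fault(factor_a: str, factor_b: str) -> Optional[Tuple[str, str]]:
    """Classify the two drivers of a two-unit crash.

    Returns (status_a, status_b), or None when the crash is dropped because
    a human factor was recorded for both drivers or for neither driver.
    """
    a_flagged = factor_a != NONE_DETECTED
    b_flagged = factor_b != NONE_DETECTED
    if a_flagged == b_flagged:
        return None  # both or neither flagged: crash eliminated
    if a_flagged:
        return ("at-fault", "not-at-fault")
    return ("not-at-fault", "at-fault")
```

Under this rule every retained crash contributes exactly one at-fault and one not-at-fault driver, which is consistent with the equal totals for the two groups in the two-unit dataset.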

If the human factor of a driver involved in a single-unit crash was recorded as “99–none detected,” then those crashes were not included in the analysis. Only drivers with a human factor coded were included in the dataset, and these drivers were considered to be the at-fault drivers. Therefore, it is assumed that the exposure of drivers to a crash is the same for single- and two-unit crashes.

The final datasets included drivers with ages between 15 and 90 years. Drivers were grouped into seven groups: < 20, 20 to 24, 25 to 39, 40 to 64, 65 to 74, 75 to 84 and > 84 to allow for analyzing the age effects on crash occurrence as they may relate to the zip code socioeconomic variables. Table 1 shows the distribution of age groups in the dataset prepared after data processing as outlined above.

Table 1.

Driver Age Distribution, 2013–2016

Fault status	Age group							Total
Fault status	< 20	20–24	25–39	40–64	65–74	75–84	> 84	Total
Two-unit crashes
At-fault	30,582	36,579	68,634	75,568	18,168	9,916	2,303	241,750
Not-at-fault	14,801	24,985	72,739	103,180	18,885	6,240	920	241,750
Single-unit crashes
At-fault	11,792	13,219	22,754	21,433	3,453	1,640	350	74,641
Not-at-fault	4,600	7,778	22,464	31,740	5,840	1,948	271	74,641

Age and gender were obtained for each driver from crash records, and other socioeconomic and demographic variables were collected from the U.S. Census Bureau ( 26 ). This database includes information related to population (such as, total population, race, marital status, education, income, employment, poverty status, rurality), as well as information on households (such as, average home value, number of housing units, household size, household type), in a particular geographical area. Prior research, along with initial analysis conducted as part of this effort, informed the variable selection ( 14 ). Variables chosen for the analysis include Race, Housing, Marital Status, Education, Income, and Other (such as, employment by population ratio, percent below poverty level, rurality, etc.). Note that the U.S. Census Bureau has limited the categorization of areas based on rurality to the county level. There is no direct information available on rurality at the zip code level. Therefore, this study utilized the percentage of people living in rural area of each zip code as a description of rurality.

Additionally, conviction data for 2012–2018 were obtained from the Kentucky Driver License database and examined here. The convictions used in the analysis are those of DUI, speeding, reckless driving, texting while driving, and ignorance of law. Non-moving convictions (which include parking tickets, and violation of court or other legal proceedings such as failure to answer court summons) are not considered to be closely associated with traffic safety and, therefore, were not included in this analysis. The average number of convictions per year was calculated for every zip code, and the variable used was standardized to 1,000 drivers for each zip code.
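The standardization of convictions can be illustrated with a short Python sketch. The function name, its parameters, and the input figures are hypothetical; in particular, how the number of drivers in each zip code was obtained is not stated here and is treated as a given input.

```python
def convictions_per_1000_drivers(total_convictions, n_years, n_drivers):
    """Average yearly convictions in a zip code, scaled to 1,000 drivers."""
    if n_years <= 0 or n_drivers <= 0:
        raise ValueError("n_years and n_drivers must be positive")
    yearly_average = total_convictions / n_years
    return yearly_average / n_drivers * 1000


# Made-up zip code: 1,400 moving convictions over the seven years 2012-2018
# and 5,000 drivers.
example_rate = convictions_per_1000_drivers(1400, 7, 5000)
```

For these made-up inputs the rate is 40 convictions per 1,000 drivers per year.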

Table 2 shows the descriptive statistics of selected socioeconomic variables utilized in this study.

Table 2.

Descriptive Statistics of Selected Variables

Variable	Minimum	Maximum	Mean	Standard deviation
Median housing value (in dollars)	21,400	385,300	98,260.45	50,572.85
Percent of people now married	0	100	53.73	13.16
Percent with less than high school graduate	0	100	21.60	11.94
Household median income (in dollars)	10,881	121,152	41,330.83	15,534.62
Percent rural	0	100	79.93	34.11
Percent below poverty level	0	100	22.97	13.58
Employment by population ratio	0	100	46.43	14.54
Average convictions per 1,000 drivers	0	636	28.90	34.51

Methodology

The main objective of the study was to establish the relationship between the likelihood of a driver to be the one at-fault in a crash and the socioeconomic factors associated with their residence zip codes. The analysis considers driver characteristics (age, gender, and zip code), human factors from crash records, and the socioeconomic and demographic features of the driver residence zip code. The final models can help transportation agencies target their focus on driver groups identified to be at higher risk.

Many socioeconomic variables were chosen for the analysis, which makes it tedious to test all possible variable combinations to develop the best model. Therefore, a systematic approach to variable selection was used to select the appropriate variables and understand the effect of socioeconomic variables on driver at-fault status ( 27 ). Two statistical analyses were thus conducted: correlation analysis and recursive partitioning analysis ( 14 ). The correlation analysis allowed for the identification of those predictor variables that have the greatest ability to predict the dependent variable alone, and of the statistically significant explanatory variables to be included in the next steps of variable selection. Recursive partitioning is a statistical analysis used in predictive modeling in statistics and machine learning, and is a widely practiced approach in variable selection. It attempts to correctly classify the data along a decision tree by splitting it into subgroups based on the variables at hand ( 14 ). It was used here to clarify the association between the potential predictors, as selected from the correlation analysis, and the dependent variable. This step aids in illustrating the importance of the variables that should be considered in the modeling. Next, an advanced variable selection process was used, adding and removing variables back and forth in the logistic regression model, to determine the best candidates for predicting the driver fault status. Potential interactions were tested using an interactive web-based tool called the feasible solution algorithm (FSA) to improve the strength and stability of the model ( 14 , 28 ). FSA helps explore large subsets or higher-order interaction terms in statistical modeling that utilizes big databases, aiming to identify possible variable interactions. The use of FSA improves on existing models and can identify higher-order interactions and best subsets. It helps to estimate models that can have fixed variables, and higher-order interactions and their lower-order terms. This process resulted in binary logistic regression models predicting the at-fault probability of a driver for single-unit and two-unit crashes.

Crash exposure is an important measure to consider while analyzing crash data. There are several exogenous metrics used for this purpose, since crash databases do not contain direct information on exposure. Typically, vehicle miles traveled (VMT), number of licensed drivers, registered vehicles, and similar exogenous factors have been used to define exposure ( 17 ). However, these conventional metrics do not capture the variability in the exposure proportion of the driving population depending on other factors such as time of day, driver gender or age, road type, and so on ( 17 ). The reliability and applicability of these exposure metrics are questionable when examining the safety issues concerning specific groups of drivers or conditions, since the denominator in the ratio of crash occurrence for such subgroups and conditions cannot be obtained ( 14 ). Carr developed the quasi-induced exposure technique to address this problem ( 29 ). The approach assumes that not-at-fault drivers represent the total population in question, and the crash rate measure of exposure is developed in relation to the relative accident involvement ratio (RAIR), which is the ratio of the percentage of at-fault drivers to the percentage of not-at-fault drivers with the same characteristics (e.g., age, gender). The current study adopts the quasi-induced exposure which is widely accepted and validated ( 24 , 25 ). The approach relies on the ability to identify at-fault and not-at-fault drivers in a crash, which is based on the use of the human factors coded in a crash record. A process of cleaning the data is undertaken to develop the final set of drivers to be used in the next step of the analysis, which is the model development using the cleaned data, that is, crashes with only one driver identified as being at-fault.
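As a rough illustration of the RAIR, the two-unit counts in Table 1 can be converted into age-group ratios. The Python sketch below is not part of the study's analysis; the function and variable names are illustrative, and only the counts come from Table 1.

```python
AGE_GROUPS = ["<20", "20-24", "25-39", "40-64", "65-74", "75-84", ">84"]

# Two-unit crash counts from Table 1.
AT_FAULT = [30582, 36579, 68634, 75568, 18168, 9916, 2303]
NOT_AT_FAULT = [14801, 24985, 72739, 103180, 18885, 6240, 920]


def rair(at_fault, not_at_fault):
    """Share of at-fault drivers in each group divided by the share of
    not-at-fault drivers in the same group."""
    total_at = sum(at_fault)
    total_not = sum(not_at_fault)
    return [
        (a / total_at) / (n / total_not)
        for a, n in zip(at_fault, not_at_fault)
    ]


ratios = dict(zip(AGE_GROUPS, rair(AT_FAULT, NOT_AT_FAULT)))
for group, value in ratios.items():
    print(f"{group:>6}: {value:.2f}")
```

A ratio above 1 means the group is over-represented among at-fault drivers relative to its assumed exposure. With the Table 1 counts, drivers under 20 come out at about 2.07 and drivers aged 40 to 64 at about 0.73, before any adjustment for socioeconomic variables.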

Logistic regression is the most appropriate modeling technique for this study, as it is beneficial when more than one explanatory variable influences a binary outcome ( 21 ). Logistic regression expresses the response variable through the log-odds, or logit. The logistic regression model defines the log-odds for the dependent variable as a linear function of explanatory variables.

\ln (\frac{p}{1 - p}) = a + b_{1} X_{1} + b_{2} X_{2} + \dots + b_{n} X_{n}

(1)

In logistic regression, the odds ratio is defined as the ratio of the probability of a driver to be the one at-fault in a crash to the probability of a driver with similar characteristics to be the not-at-fault driver in a crash. The odds ratio is analogous to the RAIR, which is the driver relative crash rate measure in the quasi-induced exposure technique. After mathematical modification of Equation 1, the probability of the characteristics of interest is expressed as a function of the regression equation:

p = \frac{1}{1 + e^{- f (X)}}

(2)

f(X) = a + b_{1} X_{1} + b_{2} X_{2} + \dots + b_{n} X_{n}

where

$f(X)$ is the regression model;

$X_{i}$ is the $i$th explanatory variable;

$a$ is the intercept;

and $b_{i}$ is the $i$th coefficient estimated using the maximum likelihood method.
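The step from Equation 1 to Equation 2 is standard algebra rather than anything specific to this study. Writing the right-hand side of Equation 1 as $f(X)$ and exponentiating both sides gives:

```latex
\frac{p}{1 - p} = e^{f(X)}
\quad\Longrightarrow\quad
p = \frac{e^{f(X)}}{1 + e^{f(X)}} = \frac{1}{1 + e^{-f(X)}}
```

The first equality also shows that the odds $p / (1 - p)$ equal $e^{f(X)}$, so a one-unit increase in $X_{i}$ multiplies the odds by $e^{b_{i}}$.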

Regression models were developed to predict a driver’s at-fault status using binary logistic regression and utilizing socioeconomic and demographic variables chosen through a series of initial assessment as predictive variables. Equation 2 can be used to estimate the likelihood of a driver to be at-fault and is analogous to the RAIR of the quasi-induced exposure methodology measuring crash propensity. Equation 3 can be used to calculate the RAIR of a driver group expressed as the probability of a driver being at fault (p).

\mathrm{RAIR}(\text{at-fault}) = \frac{p}{1 - p}

(3)

Several models were developed for Kentucky utilizing the two-step statistical approach for variable selection discussed previously (i.e., correlation tests and recursive partitioning). Each model was further evaluated to ensure that the best one was identified using widely used criteria such as likelihood-based information criteria (Akaike information criterion [AIC] and Bayesian information criterion [BIC]), the receiver operating characteristic (ROC) curve, and a training and validation method (where a split of 80:20 was used to define the training and validation datasets, respectively). Residuals are an important criterion in regression analysis used in model validation and they are defined as the difference between the observed and predicted value of the dependent variable ( 14 ). The smaller the residual of the model, the better its predictability.
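The evaluation criteria named above can be sketched with the Python standard library alone. This is a minimal illustration using textbook definitions, not the software used in the study; all function names and the toy data are assumptions.

```python
import math
import random


def train_validation_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and split it (80:20 by default)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 outcomes y under probabilities p."""
    return sum(
        math.log(pi) if yi == 1 else math.log(1.0 - pi)
        for yi, pi in zip(y, p)
    )


def aic(loglik, k):
    """Akaike information criterion for a model with k parameters."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian information criterion for k parameters, n observations."""
    return k * math.log(n) - 2 * loglik


def auc(y, p):
    """Area under the ROC curve: the share of (at-fault, not-at-fault)
    pairs in which the at-fault driver has the higher predicted
    probability, counting ties as one half."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in pos
        for b in neg
    )
    return wins / (len(pos) * len(neg))


# Toy outcomes (1 = at-fault) and predicted probabilities, made up.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
p_hat = [0.8, 0.3, 0.6, 0.4, 0.5, 0.2, 0.7, 0.6]
ll = log_likelihood(y_true, p_hat)
print(aic(ll, k=3), bic(ll, k=3, n=len(y_true)), auc(y_true, p_hat))
```

Lower AIC and BIC values indicate a better trade-off between fit and complexity, while an AUC of 0.5 corresponds to a model that cannot separate the two classes and 1.0 to perfect separation.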

Several models utilizing various combinations of socioeconomic variables were tested for single- and two-unit crashes to identify the one with the best predictive ability. Logistic regression was used to calculate at-fault probabilities to identify the target groups/target areas with high crash propensity and provide transportation agencies with targeted foci for their efforts to improve safety.

Results

Two-Unit Crashes

The final two-unit model (Table 3) chosen through the variable selection process includes rurality, poverty level, average convictions, driver population density (defined by 1,000 people living per square mile), age groups, gender, and interaction terms between age and gender. The sample size of the data used to develop the model is shown in Table 1. The FSA used to identify interactions indicated a strong correlation between age and gender in at-fault crash involvement, concurring with prior research findings ( 30 ). The selected model offers better performance than other models tested. AIC and BIC are 33,095.8 and 33,295.4, while the area under the curve (AUC) and classification percentage are both higher, at 0.612 and 62.9%, respectively. It should be noted that AUC measures the model’s capability to distinguish between the two classes: at-fault and not-at-fault status of the driver. The model predicts crash likelihood of drivers in 523 zip codes (out of 721, which reflects 91.12% of Kentucky’s area) within 10% error.

Table 3.

Logistic Regression Models for Two-Unit Crashes

Variables	Estimate	Standard error	95% Wald confidence interval		Wald chi-square	P-value
Variables	Estimate	Standard error	Lower	Upper	Wald chi-square	P-value
Intercept	0.771	0.0169	0.738	0.804	2,071.575	0.000
< 20	0	na	na	na	na	na
20–24	−0.369	0.0185	−0.405	−0.332	397.135	0.000
25–39	−0.807	0.0162	−0.839	−0.775	2,480.328	0.000
40–64	−1.093	0.0158	−1.124	−1.062	4,783.197	0.000
65–74	−0.939	0.0202	−0.979	−0.899	2,152.687	0.000
75–84	−0.394	0.0263	−0.446	−0.343	224.289	0.000
> 84	0.068	0.0539	−0.038	0.174	1.595	0.207
Male	0	na	na	na	na	na
Female	−0.243	0.0201	−0.283	−0.204	147.205	0.000
CON	0.001	0.0004	0	0.002	5.453	0.020
RUR	0.0003	0.0001	7.74E−05	0.001	7.024	0.008
POV	0.002	0.0004	0.001	0.003	32.288	0.000
DOPSQM	1.22E−05	2.86E−06	6.54E−06	1.78E−05	18.023	0.000
< 20 Male	0	na	na	na	na	na
< 20 Female	0	na	na	na	na	na
20–24 Male	0	na	na	na	na	na
20–24 Female	0.032	0.0259	−0.019	0.083	1.537	0.215
25–39 Male	0	na	na	na	na	na
25–39 Female	0.027	0.0227	−0.017	0.072	1.442	0.230
40–64 Male	0	na	na	na	na	na
40–64 Female	0.091	0.0222	0.047	0.134	16.677	0.000
65–74 Male	0	na	na	na	na	na
65–74 Female	0.348	0.0289	0.291	0.404	144.322	0.000
75–84 Male	0	na	na	na	na	na
75–84 Female	0.257	0.0381	0.183	0.332	45.553	0.000
> 84 Male	0	na	na	na	na	na
> 84 Female	0.241	0.0812	0.082	0.4	8.822	0.003

Note: CON = average convictions; DOPSQM = driver population density (defined by 1,000 people living per square mile); POV = poverty level; RUR = rurality; na = not applicable.
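To show how the estimates in Table 3 translate into probabilities, the Python sketch below applies Equation 2 to the tabulated coefficients. The coefficients are copied from Table 3, and the rows with an estimate of 0 (< 20, male) are read as reference categories. The function names are illustrative, and the example zip code is hypothetical: it uses the Table 2 means for convictions, rurality, and poverty, plus a made-up driver population density whose units are assumed to match those used to fit the model.

```python
import math

# Estimates from Table 3 (two-unit crashes).
INTERCEPT = 0.771
AGE = {"<20": 0.0, "20-24": -0.369, "25-39": -0.807, "40-64": -1.093,
       "65-74": -0.939, "75-84": -0.394, ">84": 0.068}
FEMALE = -0.243
AGE_FEMALE = {"<20": 0.0, "20-24": 0.032, "25-39": 0.027, "40-64": 0.091,
              "65-74": 0.348, "75-84": 0.257, ">84": 0.241}
B_CON, B_RUR, B_POV, B_DOPSQM = 0.001, 0.0003, 0.002, 1.22e-05


def linear_predictor(age, female, con, rur, pov, dopsqm):
    """f(X) for one driver; female adds the main effect and interaction."""
    f = INTERCEPT + AGE[age]
    if female:
        f += FEMALE + AGE_FEMALE[age]
    f += B_CON * con + B_RUR * rur + B_POV * pov + B_DOPSQM * dopsqm
    return f


def at_fault_probability(age, female, con, rur, pov, dopsqm):
    """Equation 2 applied to the Table 3 model."""
    f = linear_predictor(age, female, con, rur, pov, dopsqm)
    return 1.0 / (1.0 + math.exp(-f))


def female_to_male_odds_ratio(age):
    """Odds of being at fault for women relative to men of the same age
    group and zip code; the zip code terms cancel out."""
    return math.exp(FEMALE + AGE_FEMALE[age])


# Hypothetical zip code: Table 2 means, made-up density value.
zip_profile = dict(con=28.90, rur=79.93, pov=22.97, dopsqm=500.0)
p_young_male = at_fault_probability("<20", False, **zip_profile)
p_mid_male = at_fault_probability("40-64", False, **zip_profile)
```

For this hypothetical profile, a male driver under 20 has an at-fault probability of about 0.71, compared with about 0.45 for a male driver aged 40 to 64. The female-to-male odds ratio is about 0.78 for drivers under 20 and about 1.11 for the 65–74 group, in line with the narrowing gender gap at older ages.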

The socioeconomic predictors of at-fault likelihood in a two-unit crash are poverty, rurality, average convictions, and driver population density. The positive signs of the coefficient estimates concur with the findings of the correlation analysis. Higher rates of poverty, rurality, population density, and convictions in the driver’s residence zip code result in an increased probability of being at-fault. Percent below the poverty line has a high Wald score, indicating its importance in the model. This concurs with the results of the recursive partitioning analysis.

A U-shape relationship of age and crash involvement is noted here, that is, young and old drivers have higher crash propensity compared with the middle-age driver. Figure 1 shows the odds ratio of driver groups in different age and gender categories. For both genders in two-unit crashes, the at-fault probability is higher for young (< 20 years) and older (> 84 years) drivers. For two-unit crashes, the crash propensity is high for young (< 20 years) males and it reduces with age. After middle age (> 40 years), the probability of male drivers to be at-fault in a two-unit crash increases. In the comparison between genders, the odds ratios show that in young and middle ages, male drivers are responsible for a greater proportion of two-unit crashes. At the same time, the oldest female drivers (> 84 years) are more likely to be at-fault when compared with the youngest female drivers (< 20 years). In other words, the propensity to be the at-fault driver in a crash increases for female drivers after middle age. Under 65 years, male drivers have a higher propensity to cause a crash—gender plays a prominent role—while over 65 years, male and female drivers contribute almost equally to crash occurrences. For the 20–24 and 25–39 groups, the age/gender interaction is not significant because of the predominant effect of age.

Figure 1.

Odds ratio of two-unit crashes: (a) male drivers and (b) female drivers.

Single-Unit Crashes

The selected model for single-unit crashes indicates that the probability of being at-fault increases when a driver’s residence zip code has lower educational attainment, higher rurality, and lower population density. The model also includes the age and gender interaction (Table 4). The model includes rurality, education (defined as the percentage of bachelor’s degree), driver population density, and average convictions as predictor variables in addition to age and gender. The sample size of the single-unit crash data is shown in Table 1. The goodness-of-fit parameters for the model (AIC and BIC) are 26,023.9 and 26,202.3, respectively, indicating better predictive power than other models tested. The AUC is slightly improved while the percent correctly classified in the training and validation datasets remained at 63%. Crash likelihood for drivers in 408 zip codes is predicted correctly within 10% error.

Table 4.

Logistic Regression Models for Single-Unit Crashes

Variables	Estimate	Standard error	95% Wald confidence interval		Wald chi-square	p-Value
Variables	Estimate	Standard error	Lower	Upper	Wald chi-square	p-Value
Intercept	1.008	0.0397	0.93	1.086	645.668	0.000
< 20	0	na	na	na	na	na
20–24	−0.293	0.031	−0.355	−0.23	84.252	0.000
25–39	−0.832	0.02	−0.887	−0.777	885.234	0.000
40–64	−1.326	0.027	−1.38	−1.272	2,329.867	0.000
65–74	−1.635	0.038	−1.71	−1.56	1,835.027	0.000
75–84	−1.316	0.052	−1.418	−1.214	634.162	0.000
> 84	−0.911	0.115	−1.137	−0.685	62.403	0.000
Male	0	na	na	na	na	na
Female	−0.495	0.035	−0.565	−0.425	193.876	0.000
CON	0.002	0.0007	0	0.003	4.679	0.031
RUR	0.008	0.0002	0.007	0.008	1,273.488	0.000
BS	−0.014	0.001	−0.016	−0.012	199.294	0.000
DOPSQM	−6.07E−05	5.43E−06	−7.14E−05	−5.01E−05	124.945	0.000
< 20 Male	0	na	na	na	na	na
< 20 Female	0	na	na	na	na	na
20–24 Male	0	na	na	na	na	na
20–24 Female	−0.148	0.046	−0.239	−0.058	10.318	0.001
25–39 Male	0	na	na	na	na	na
25–39 Female	−0.093	0.04	−0.173	−0.014	5.283	0.022
40–64 Male	0	na	na	na	na	na
40–64 Female	0.004	0.04	−0.074	0.082	0.01	0.920
65–74 Male	0	na	na	na	na	na
65–74 Female	0.38	0.056	0.269	0.491	44.872	0.000
75–84 Male	0	na	na	na	na	na
75–84 Female	0.476	0.077	0.324	0.627	37.798	0.000
> 84 Male	0	na	na	na	na	na
> 84 Female	0.678	0.169	0.346	1.009	16.019	0.000

Note: BS = education (defined as the percentage of bachelor’s degree); CON = average convictions; DOPSQM = driver population density (defined by 1,000 people living per square mile); RUR = rurality; na = not applicable.

Rurality is one of the predictor variables having the highest Wald score and concurring with the recursive partitioning findings. Furthermore, it illustrates the strong association between the rurality of the driver’s residence zip code and their likelihood of being the at-fault driver in a single-unit crash. The other predictor variables are average convictions, percent with bachelor’s degree, and driver population density. The coefficients for percent rural and average convictions have a positive relationship with at-fault status, concurring with the findings from two-unit crashes. The model also identified percent with bachelor’s degree as a predictor and it has a negative association with the response variable. This indicates that people who live in a zip code with a higher proportion of its residents with bachelor’s degrees have a lower chance of being the at-fault driver in single-unit crashes. Driver population density has a negative coefficient, denoting that drivers living in less dense areas are more likely to be the at-fault driver in single-unit crashes. The positive estimate of rurality in the model could explain this, that is, there is a high likelihood that rural areas are less populated, and, therefore, some interaction may be present but not easily detectable. The correlation between rurality and driver population density was also tested to examine the potential presence of multicollinearity; however, no direct association was found ( 27 ).

Age and gender overall display trends similar to those observed for two-unit crashes. The odds ratios (Figure 2) show that male drivers are more likely to cause a single-unit crash at younger ages. Compared with older male drivers, younger and middle-aged drivers (aged < 39 years) are more likely to be the at-fault driver in single-unit crashes. The likelihood of being the at-fault driver in a single-unit crash decreases in older age groups but increases slightly for those > 84 years. Female drivers have a lower likelihood of being the at-fault driver as they age (up to 64 years), but their likelihood increases for those > 65. Under 65 years, male drivers have a higher propensity to be the at-fault driver in a crash, that is, gender plays a prominent role, while for the > 65 group, female drivers have a higher propensity. For age groups between 40 and 64, the age/gender interaction is not significant because the main effect of age plays a more prominent role. Female drivers are less likely to be the at-fault driver in single-unit crashes than their male counterparts, except for those > 84 years of age.

Figure 2.

Odds ratio of single-unit crashes: (a) male drivers and (b) female drivers.

Application of Models

The logistic regression models developed in this study can be used to identify the target zip codes and driver groups for safety programs. A practitioner can use the crash models to calculate the at-fault probability of different age-gender categories based on socioeconomic characteristics. Figure 3 shows an example graph of the effect of rurality on the probability that male drivers in different age categories are the at-fault driver in a single-unit crash. Suppose safety programs are to target drivers with an at-fault probability greater than 0.75 who live in zip codes with rurality of 35% or above. The figure shows that the < 20 and 20–24 groups of many zip codes fall into the target group. It is therefore possible to list the zip codes with rurality of 35% or above and, using the desired threshold probability (for example, greater than 0.75), develop appropriate programs targeting the driver groups of concern to improve their safety.

Figure 3.

Example—association of rurality and at-fault probability of male drivers, single-unit crashes.
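The screening procedure described for Figure 3 can be sketched in a few lines. This is only an illustration under stated assumptions: the intercept, rurality slope, and age-group offsets below are hypothetical placeholders and are not the fitted coefficients of the study, and the zip codes and rurality values are made up. The sketch applies the binary logistic form to each zip code and age group and keeps the combinations that satisfy both the rurality and the probability thresholds.

```python
import math

# HYPOTHETICAL coefficients for illustration only (not the fitted model).
INTERCEPT = 0.2
BETA_RURALITY = 0.012      # change in logit per percentage point of rurality
AGE_GROUP_EFFECT = {       # assumed logit offsets for male drivers
    "< 20": 0.9,
    "20-24": 0.6,
    "40-64": 0.0,
}


def at_fault_probability(rurality_pct, age_group):
    """Binary logistic model: p = 1 / (1 + exp(-linear predictor))."""
    logit = INTERCEPT + BETA_RURALITY * rurality_pct + AGE_GROUP_EFFECT[age_group]
    return 1.0 / (1.0 + math.exp(-logit))


def target_groups(zip_rurality, min_rurality=35.0, min_probability=0.75):
    """Return (zip_code, age_group, probability) tuples meeting both thresholds."""
    selected = []
    for zip_code, rurality in zip_rurality.items():
        if rurality < min_rurality:
            continue  # zip code is not rural enough to be screened
        for group in AGE_GROUP_EFFECT:
            p = at_fault_probability(rurality, group)
            if p > min_probability:
                selected.append((zip_code, group, round(p, 3)))
    return selected


# Made-up zip codes with their percent rurality.
example = {"00001": 10.0, "00002": 40.0, "00003": 90.0}
for row in target_groups(example):
    print(row)
```

With the actual fitted coefficients substituted for the placeholders, the same loop would produce the list of zip codes and driver groups to target.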

Discussion and Conclusions

This research aimed to estimate the at-fault likelihood of a driver involved in a crash, using U.S. Census socioeconomic factors associated with the driver’s residence as predictors. The measure of exposure is based on the quasi-induced exposure technique and is developed in relation to RAIR. Binary logistic regression is used to model two-unit and single-unit crashes, since the dependent variable is the fault status of a driver involved in a crash. The structured statistical process followed here allowed for the development of regression models for two- and single-unit crashes as a function of several socioeconomic and demographic variables.

The initial analysis found that age, gender, and several socioeconomic characteristics, such as rurality, employment, and poverty, increase the likelihood of a driver being the one at fault in single- and two-unit crashes. The models initially considered whether marital status had a significant effect on two-unit crashes, as well as individual and household income, which were correlated with single- and two-unit crashes. However, these variables were eliminated from the final models because the recursive partitioning evaluation did not identify them as good predictors.

The model results for single- and two-unit crashes are fairly similar. For two-unit crashes, fault status is determined as a function of age group, gender, rurality, poverty level, average convictions, and driver population density. For single-unit crashes, these variables are also found to have a significant effect; poverty level, though, was not included when educational attainment (percent with a bachelor’s degree or higher) was considered instead. Overall, people residing in areas with low socioeconomic conditions have a higher propensity to cause both two-unit and single-unit crashes.

The overall U-shaped relationship between age and at-fault probability noted for both single- and two-unit crashes demonstrates that younger and older drivers are more likely to be the at-fault driver in a crash than other age groups. This is consistent with the findings of previous research efforts ( 31 ). Aguero-Valverde and Jovanis concluded that drivers < 25 years and > 65 years have a positive association with crash risk, and the majority of prior research supports this conclusion ( 9 ).

For both single- and two-unit crashes, this study observed a higher crash propensity for male drivers < 20. Involvement as the at-fault driver decreases with increasing age, probably as a result of improved judgment and decision-making gained through longer driving experience. In addition, the higher involvement of young male drivers could be attributed to their driving more miles than young females, which increases their exposure and, therefore, their likelihood of being the at-fault driver in a crash ( 14 , 32 , 33 ). Young men also exhibit more aggressive behavior and risk-taking while driving, which may further explain their higher odds ( 14 , 32 , 34 ). The likelihood of being the at-fault driver in single-unit crashes decreases with age, probably because of greater experience in handling situations that may lead to a single-unit crash. However, this does not hold true for two-unit crashes. Male and female drivers > 65 years are almost equally at fault in two-unit crashes, which could be attributed to aging-related changes affecting their driving performance ( 14 , 35 ). Prior research has shown that age has a greater effect on crash involvement in two-unit crashes, since most of these crashes occur in environments that may be more taxing for older individuals than the environments where single-unit crashes typically occur (rural areas with less traffic) ( 30 , 31 , 36 ).

Many studies of older drivers have identified their increased crash involvement and demonstrated the risk factors they create for themselves and other drivers ( 14 , 18 , 19 ). Prior studies have also shown that young and old drivers have a positive association with crash involvement, indicating their higher propensity to be the at-fault driver in a crash ( 13 , 14 , 17 , 18 ). These results are consistent with the findings of this study.

The findings of this study could help transportation agencies identify groups of drivers at high risk of crash involvement. Using the models developed in this study, target driver groups can be identified as illustrated in Figure 3. The models also make it possible to calculate, for each driver group or area, the probability of being involved in a crash as the at-fault driver as a function of socioeconomic and demographic variables. These probabilities can be depicted in a heatmap to better convey the relative risk of the driver groups or areas. This is crucial, as it provides an evidence-based probability of crash involvement that could be used to implement efficient safety programs targeting such specific groups. The Safety Circuit Rider (SCR) program of the Federal Highway Administration (FHWA) provides safety-related support to transportation agencies tasked with local road safety, addressing the frequency and severity of roadway crashes ( 14 , 37 ). The Kentucky SCR program follows a process in which six high-risk counties are identified annually, followed by a detailed crash data analysis and road safety audits on the county public roadways ( 14 , 38 ). The results of this study can be used to identify the target counties for the SCR program. The zip-code-level models can be aggregated to the county level through a geospatial process that identifies the zip codes in each county and applies a weighted average to establish the county-level probability of at-fault involvement by age, gender, or both.
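The county-level aggregation mentioned above could look like the following minimal sketch. Several things here are assumptions rather than details from the study: the weight is taken to be the number of drivers in each zip code, each zip code is assigned to a single county (the geospatial matching step is not shown), and all records are made up. The helper names `county_probabilities` and `top_counties` are hypothetical.

```python
from collections import defaultdict

# Made-up records: (zip_code, county, driver_count, at_fault_probability).
# Using driver_count as the weight is an assumption for illustration.
ZIP_RECORDS = [
    ("00001", "County A", 1000, 0.60),
    ("00002", "County A", 3000, 0.80),
    ("00003", "County B", 2000, 0.70),
]


def county_probabilities(records):
    """Weighted average of zip-level probabilities within each county."""
    weighted_sum = defaultdict(float)
    total_weight = defaultdict(float)
    for _zip_code, county, weight, probability in records:
        weighted_sum[county] += weight * probability
        total_weight[county] += weight
    return {c: weighted_sum[c] / total_weight[c] for c in weighted_sum}


def top_counties(records, n=6):
    """Rank counties by aggregated probability, highest first, and keep n."""
    ranked = sorted(
        county_probabilities(records).items(),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:n]


for county, probability in top_counties(ZIP_RECORDS):
    print(f"{county}: {probability:.3f}")
```

The default of six counties mirrors the number identified annually by the Kentucky SCR program; running the aggregation separately per age group or gender would give group-specific county rankings.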

High-quality driver education and supervised driving practice are key elements in preventing crashes by novice drivers. Such driver education programs should be mandatory for teen drivers, who are at higher risk, to be eligible for an intermediate license. Virtual driving simulators can also be a cost-effective way to train new drivers, especially young drivers, to handle the dangerous real-world situations they may face. In addition, at-risk drivers (typically young and old drivers) can be issued severe penalties (such as license suspension or revocation) when repeatedly found guilty of a traffic violation or at fault in a crash, and the penalty can be increased on subsequent violations.

It should be noted once more that the findings of this study are limited to two-unit and single-unit crashes; multi-unit crashes involving three or more vehicles are a potential topic for future research. Crash type and severity were also not examined in the current study, which is another limitation of the findings. Both could be associated with the socioeconomic condition of the at-fault driver; however, examining them was not feasible because of the small number of crashes in several zip codes. Considering severity and crash type as dependent variables could provide more insight into how the socioeconomic factors of a driver affect the characteristics of a crash. Other primary causes of crashes, such as the geometric and environmental conditions of the crash location, also need to be investigated. Other employment-related variables, such as the purpose of travel (work or leisure) and the proximity of crash locations to the driver’s residence, may also influence crash occurrence. Crash history of drivers is a better predictor of crash occurrence than conviction records. However, because the crash history database was not accessible, this research could not analyze the effect of driver history on the propensity to be at fault in a future crash; this is a lead for future study. According to the 2018 reports of the National Highway Traffic Safety Administration (NHTSA), about 10 million or more crashes go unreported each year ( 39 ). These crashes are not traceable. Relying on police-reported crashes could therefore bias any safety study or analysis; however, this is unavoidable.

Finally, this study utilizes socioeconomic data collected by the U.S. Census Bureau at the zip-code level. The study assumes that the information observed at the zip-code level also holds true at the individual level. This may not be the case, since most of these values are aggregate representations of individuals. It could thus be argued that the conclusions drawn here, which rest on this assumption, may not be valid. This is known as the ecological fallacy, where conclusions about individuals are drawn from group-level inferences ( 40 , 41 ). It is a common criticism of studies like this one that rely on aggregation. However, such detailed socioeconomic information at the individual (that is, driver) level is not collected and thus not available. The aggregation used here is the only available means of conducting such a study and is therefore a reasonable limitation of the study findings.

Footnotes

Acknowledgements

The authors would like to acknowledge the support of the Kentucky Injury Prevention Center and the Kentucky Transportation Cabinet.

Author Contributions

The authors confirm contribution to the paper as follows: study conception and design: N. Stamatiadis, S. Sagar; data collection: S. Sagar; analysis and interpretation of results: N. Stamatiadis, A. Stromberg, S. Sagar; draft manuscript preparation: N. Stamatiadis, S. Sagar. All authors reviewed the results and approved the final version of the manuscript.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Shraddha Sagar

Nikiforos Stamatiadis

Arnold Stromberg

References

1. Kentucky Transportation Center. Traffic Collision Facts 2002. Kentucky Transportation Center, University of Kentucky, Lexington, Kentucky 40506-0281, 2003. https://uknowledge.uky.edu/cgi/viewcontent.cgi?article=2624&context=ktc_researchreports.

2. Kentucky Transportation Center. Traffic Collision Facts 2017. Kentucky Transportation Center, University of Kentucky, Lexington, Kentucky 40506-0281, 2018. https://uknowledge.uky.edu/cgi/viewcontent.cgi?article=2624&context=ktc_researchreports.

3. Brown, K. T. A Safety Analysis of Spatial Phenomena About the Residences of Drivers Involved in Crashes. Dissertation. Graduate School of Clemson University, 2016. https://tigerprints.clemson.edu/all_dissertations/1746

4. Noland, R. B. The Effect of Infrastructure and Demographic Change on Traffic-Related Fatalities and Crashes: A Case Study of Illinois County-Level Data. Accident Analysis and Prevention, Vol. 36, No. 4, 2004, pp. 525–532.

5. Adanu, E. K., Smith, Powell, Jones. Multilevel Analysis of the Role of Human Factors in Regional Disparities in Crash Outcomes. Accident Analysis and Prevention, Vol. 109, 2017, pp. 10–17.

6. Factor, Mahalel, Yair. Inter-Group Differences in Road-Traffic Crash Involvement. Accident Analysis and Prevention, Vol. 40, 2008, pp. 2000–2007.

7. Hasselberg, Vaeza, Laflamme. Socioeconomic Aspects of the Circumstances and Consequences of Car Crashes among Young Adults. Social Science & Medicine, Vol. 60, No. 2, 2005, pp. 287–295.

8. Adanu, Penmetsa, Jones, Smith. Gendered Analysis of Fatal Crashes among Young Drivers in Alabama, USA. Safety, Vol. 4, No. 3, 2018, p. 29.

9. Aguero-Valverde, Jovanis, P. P. Spatial Analysis of Fatal and Injury Crashes in Pennsylvania. Accident Analysis and Prevention, Vol. 38, No. 3, 2006, pp. 618–625.

10. Hanna, C. L., Laflamme, Bingham, C. R. Fatal Crash Involvement of Unlicensed Young Drivers: County Level Differences According to Material Deprivation and Urbanicity in the United States. Accident Analysis and Prevention, Vol. 45, 2012, pp. 291–295.

11. National Highway Traffic Safety Administration. Critical Reasons for Crashes Investigated in the National Motor Vehicle Crash Causation Survey. National Highway Traffic Safety Administration, U.S. Department of Transportation, Washington, D.C., 2018.

12. Kocatepe, Ulak, M. B., Ozguven, E. E., Horner, M. W., Arghandeh. Socioeconomic Characteristics and Crash Injury Exposure: A Case Study in Florida Using Two-Step Floating Catchment Area Method. Applied Geography, Vol. 87, 2017, pp. 207–221.

13. Stamatiadis, Puccini. Fatal Crash Rates in the Southeastern United States: Why Are They Higher? Transportation Research Record: Journal of the Transportation Research Board, 1999. 1665: 118–124.

14. Stamatiadis, Sagar, Wright, Cambron. Effect of Socioeconomic Factors on Crash Occurrence. Kentucky Transportation Center Research Report, 2020. https://uknowledge.uky.edu/ktc_researchreports/1696/

15. Sagar, Stamatiadis, Wright, Cambron. Identifying High-Risk Commercial Vehicle Drivers Using Sociodemographic Characteristics. Accident Analysis & Prevention, Vol. 143, 2020, p. 105582.

16. Insurance Institute for Highway Safety. Fatality Facts 2019: Urban/Rural Comparison. Insurance Institute for Highway Safety, Virginia, 2019. https://www.iihs.org/topics/fatality-statistics/detail/urban-rural-comparison.

17. Stamatiadis, Deacon, J. A. Quasi-Induced Exposure: Methodology and Insight. Accident Analysis & Prevention, Vol. 29, No. 1, 1997, pp. 37–52.

18. Lyman, Ferguson, S. A., Braver, E. R., Williams, A. F. Older Driver Involvements in Police Reported Crashes and Fatal Crashes: Trends and Projections. Injury Prevention, Vol. 8, No. 2, 2002, pp. 116–120.

19. Sagar, Stamatiadis, Wright, Green. Use of Codes Data to Improve Estimates of at-Fault Risk of Elderly Drivers. Accident Analysis and Prevention, Vol. 144, 2020, p. 105637.

20. Kentucky State Police. Kentucky’s Traffic Collision Facts. Kentucky State Police, 2016. https://uknowledge.uky.edu/ktc_researchreports/1583/

21. Das, Sun, Wang, Leboeuf. Estimating Likelihood of Future Crashes for Crash-Prone Drivers. Journal of Traffic and Transportation Engineering (English Edition), Vol. 2, No. 3, 2015, pp. 145–157.

22. Chandraratna, Stamatiadis, Stromberg. Potential Crash Involvement of Young Novice Drivers with Previous Crash and Citation Records. Human Performance; Simulation And Visualization, Vol. 1937, 2005, pp. 1–6.

23. University of Kentucky. Kentucky Geological Survey. http://www.uky.edu/KGS/gis/bounds.htm. Accessed December 2, 2019.

24. Chandraratna, Stamatiadis. Quasi-Induced Exposure Method: Evaluation of Not-at-Fault Assumption. Accident Analysis and Prevention, Vol. 41, No. 2, 2009, pp. 308–313.

25. Shen, Pope, C. N., Stamatiadis, Zhu. Validation of Not-at-Fault Driver Representativeness Assumption for Quasi-Induced Exposure Using U.S. National Traffic Databases. Journal of Safety Research, Vol. 71, 2019, pp. 243–249.

26. Chen, Zhang, Tian, Bogus, S. M., Yang. Hierarchical Bayesian Random Intercept Model-Based Cross-Level Interaction Decomposition for Truck Driver Injury Severity Investigations. Accident Analysis and Prevention, Vol. 85, 2015, pp. 186–198.

27. Sagar. Effect of Socioeconomic and Demographic Factors of Driver Residence on Crash Occurrence. Theses and dissertations. Civil Engineering, 94, University of Kentucky. https://uknowledge.uky.edu/ce_etds/94.

28. Lambert, Gong, Elliott, C. F., Thompson, Stromberg. Rfsa: An R Package for Finding Best Subsets and Interaction. The R Journal, Vol. 10, 2018, pp. 295–308.

29. Carr, B. R. A Statistical Analysis of Rural Ontario Traffic Accidents Using Induced Exposure Data. Accident Analysis and Prevention, Vol. 1, 1969, pp. 33–357.

30. Regev, Rolison, J. J., Moutari. Crash Risk by Driver Age, Gender, and Time of Day Using a New Exposure Methodology. Journal of Safety Research, Vol. 66, 2018, pp. 131–140.

31. Stamatiadis, Deacon, J. A. Trends in Highway Safety: Effects of an Aging Population on Accident Propensity. Accident Analysis & Prevention, Vol. 27, No. 4, 1995, pp. 443–459.

32. Insurance Institute for Highway Safety, Highway Loss Data Institute. Fatality Facts 2019: Males and Females Gender. Insurance Institute for Highway Safety, Highway Loss Data Institute, 2018. https://www.iihs.org/topics/fatality-statistics/detail/gender.

33. Sivak. Female Drivers in the United States, 1963–2010: From a Minority to a Majority? Traffic Injury Prevention, Vol. 14, No. 3, 2013, pp. 259–260.

34. Turner, McClure. Age and Gender Differences in Risk-Taking Behaviour as an Explanation for High Incidence of Motor Vehicle Crashes as a Driver in Young Males. Injury Control and Safety Promotion, Vol. 10, No. 3, 2003, pp. 123–130.

35. Staplin, Lococo, K. H., Stewart, Decina, L. E. Safety Mobility for Older Drivers Handbook. National Highway Traffic Safety Administration, Washington, D.C., 1999.

36. Stutts, Martell, Staplin. Identifying Behaviors and Situations Associated with Increased Crash Risk for Older Drivers. Office of Behavioral Safety Research and U. S. D. o. T. National Highway Traffic Safety Administration, Washington, D.C., 2009.

37. Gross, Nabors, Eck, Hood. Safety Circuit Rider Programs Best Practices Guide. Federal Highway Administration Office of Safety, 2009. https://safety.fhwa.dot.gov/local_rural/training/fhwasa09019/fhwasa09019.pdf.

38. Safety Circuit Rider Program. University of Kentucky, College of Engineering. https://www.kyt2.com/local-services/safety-circuit-rider-program. Accessed April 20, 2020.

39. The Ultimate List of Driving Statistics for 2020. https://driving-tests.org/driving-statistics/. Accessed July 3, 2020.

40. Tranmer, Steel, D. G. Using Census Data to Investigate the Causes of the Ecological Fallacy. Environment and Planning A, Vol. 30, No. 5, 1998, pp. 817–831.

41. Hsieh, J. J. Ecological Fallacy. Encyclopedia Britannica, 2017. https://www.britannica.com/science/ecological-fallacy. Accessed November 27, 2020.