In this essay we address the problem of estimating, from data provided by the FRA in the WABS databases, the probability of a collision at specific rail crossings in a year. WABS itself makes some estimates, but does not document the method used for making them. We are not trying to replicate the WABS results, but instead develop here an independent method based on logistic regression.
Logistic regression is a way of estimating the probability of an event, such as an accident or collision, happening. The response variable is a binary variable indicating whether an accident occurs or not. Predictor variables are taken from the data fields provided by WABS. Logistic regression allows either continuous or discrete values of predictors, including factors (which might be character data in the data file). It uses a version of regression which chooses coefficients based on maximum likelihood, instead of least square error, which is used in ordinary regression. In addition, the response variable does not follow a normal distribution, but a binomial one, so that ordinary least squares methods of analysis of results are inappropriate.
WABS Accident data is contained in pdf documents downloadable from their website. The files are not downloadable in any other format. We extracted the data from the pdf and edited the files, using Notepad+, to get columns and field names aligned properly, and then used the RStudio import data feature to get it in. The R to do so is:
We first create a shorter, more generic name for the data frame. A summary of the data shows some NAs in AADT, and one in HWYLNS. We have to replace these
dataf = WABS_2019_01_28
summary(dataf)
## RANK PRED_COLLS CROSSING RR
## Min. : 1.00 Min. :0.000057 Length:80 Length:80
## 1st Qu.:20.75 1st Qu.:0.008430 Class :character Class :character
## Median :40.50 Median :0.015378 Mode :character Mode :character
## Mean :40.50 Mean :0.022960
## 3rd Qu.:60.25 3rd Qu.:0.034537
## Max. :80.00 Max. :0.171142
##
## STATE COUNTY CITY
## Length:80 Length:80 Length:80
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## ROAD C C1 C2
## Length:80 Min. :0.0000 Min. :0.0000 Min. :0
## Class :character 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0
## Mode :character Median :0.0000 Median :0.0000 Median :0
## Mean :0.0375 Mean :0.0125 Mean :0
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0
## Max. :1.0000 Max. :1.0000 Max. :0
##
## C3 C4 DATE_CHG WD
## Min. :0 Min. :0 Length:80 Length:80
## 1st Qu.:0 1st Qu.:0 Class :character Class :character
## Median :0 Median :0 Mode :character Mode :character
## Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0
##
## TOT_TRN TOT_TRK TTBL_SPD HWYPVD
## Min. : 0.00 Min. :0.000 Min. : 0.00 Length:80
## 1st Qu.: 8.00 1st Qu.:1.000 1st Qu.:25.00 Class :character
## Median :11.00 Median :1.000 Median :40.00 Mode :character
## Mean :34.41 Mean :1.238 Mean :38.73
## 3rd Qu.:61.00 3rd Qu.:1.000 3rd Qu.:45.00
## Max. :72.00 Max. :3.000 Max. :79.00
##
## HWYLNS AADT
## Min. :0.000 Min. : 1
## 1st Qu.:2.000 1st Qu.: 775
## Median :2.000 Median : 3122
## Mean :2.165 Mean : 6855
## 3rd Qu.:2.000 3rd Qu.: 8623
## Max. :5.000 Max. :48000
## NA's :1 NA's :4
HWYLNS should probably be replaced with a 2, since it is the number of highway lanes at the crossing, and could not be 0. And the AADT ones should be replaced with zeroes. We also note that HWYPVD is always YES, so it need not be considered as a discriminating response variable.
dataf$HWYLNS[which(is.na(dataf$HWYLNS) )] = 2
dataf$AADT[which(is.na(dataf$AADT))] = 0
Now the data look good for analysis. Here is a list of the column names:
## [1] "RANK" "PRED_COLLS" "CROSSING" "RR" "STATE"
## [6] "COUNTY" "CITY" "ROAD" "C" "C1"
## [11] "C2" "C3" "C4" "DATE_CHG" "WD"
## [16] "TOT_TRN" "TOT_TRK" "TTBL_SPD" "HWYPVD" "HWYLNS"
## [21] "AADT"
RANK and PRED_COLLS show the results of the WABS estimation. RANK is the order high to low PRED_COLLS, PRED_COLLS is the estimate of the probability of a collision at the crossing in a year computed by WABS. The data have ranked the crossings in descending order of probability of a collision.
CROSSING, RR, STATE, COUNTY, CITY, and ROAD describe the exact location of the crossing. To create this file we selected only crossings on the SMART train line in Sonoma and Marin counties, California. Thus RR always is the acronym SMRT, STATE is always California, and COUNTY is either Sonoma or Marin. The fields beginning with C are the accident or collision counts for the current cycle (representing one year), and previous cycles of one year length. Since the SMART line only operated in 2017 and 2018, the last 3 fields here are all zeros. WD is a code for the type of equipment at the crossing. TOT_TRN is the number of trains per day, TOT_TRK the number of tracks at the crossing, TTBL_SPD the maximum speed permitted by the timetable schedule, HWYPVD a code for whether the crossing is paved or not (in this sample always YES, so it is not a candidate for a predictor). HWYLNS is the number of highway lanes going through the crossing, and AADT is the avverage daily vehicle traffic count at the crossing. Except for HWYPVD, the variables after DATE_CHG, which represents whether there was a change to the entry made by the railroad during the cycles included (perhaps a re-engineering of the crossing equipment), are all candidates for predictors of the collision likelihood.
Here are the key predictions about the crossings: you can search the table to see highs and lows.
print(dataf[,c(1:2,7:8,15:18, 20:21)])
## # A tibble: 80 x 10
## RANK PRED_COLLS CITY ROAD WD TOT_TRN TOT_TRK TTBL_SPD HWYLNS AADT
## <int> <dbl> <chr> <ch> <chr> <int> <int> <int> <dbl> <dbl>
## 1 1 0.171 SANT~ HEA~ GT 64 1 79 3 20600
## 2 2 0.129 SANT~ STE~ GT 64 1 70 3 8100
## 3 3 0.0504 PETA~ WAS~ GT 60 1 25 5 21549
## 4 4 0.0494 SANT~ COL~ GT 64 2 60 4 26900
## 5 5 0.0485 SANT~ PIN~ GT 64 1 79 4 23900
## 6 6 0.0464 ROHN~ GOL~ GT 10 1 40 2 5024
## 7 7 0.0458 SANR~ 4TH~ FQ 60 2 15 2 48000
## 8 8 0.0455 SANR~ N_S~ GT 60 1 45 2 8500
## 9 9 0.0426 SANT~ GUE~ GT 64 2 40 2 27300
## 10 10 0.0424 PETA~ EAS~ FQ 60 1 25 3 18371
## # ... with 70 more rows
Only 2 have higher predictions than 10%, and the last 10 are all under 0.4%. The last 5 are pedestrian crossings.
To gain some insight into the data we plot the predicted collisions against the six chosen variables.
pairs(dataf[,c(2, 16:18, 20:21)])
We see there are a couple of outliers with very high PRED_COLLS. These occur with high TOT_TRN, TOT_TRK = 2, high TTBL_SPD, and 3 HWYLNS; AADT for these is medium in one case and high in another. Otherwise there is not a clear pattern. Very low PRED_COLLS go with low values of AADT,TTBL_SPD, and TOT_TRN.
A preliminary analysis was done using logistic regression; we used the glm routine in R. We tried a model for C using all the predictors.
glm.fit1=glm(C~as.factor(WD)+TOT_TRN+TOT_TRK+TTBL_SPD+HWYLNS+AADT, data=dataf,family="binomial")
summary(glm.fit1)
##
## Call:
## glm(formula = C ~ as.factor(WD) + TOT_TRN + TOT_TRK + TTBL_SPD +
## HWYLNS + AADT, family = "binomial", data = dataf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.93591 -0.25999 -0.02120 -0.00008 2.29508
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.728e+01 7.279e+03 -0.002 0.998
## as.factor(WD)FQ -3.534e+00 8.758e+03 0.000 1.000
## as.factor(WD)GT 1.268e+01 7.279e+03 0.002 0.999
## as.factor(WD)HS -2.409e+00 1.231e+04 0.000 1.000
## as.factor(WD)NO -3.617e+00 1.450e+04 0.000 1.000
## as.factor(WD)SS -2.371e-01 1.917e+04 0.000 1.000
## as.factor(WD)XB -4.972e-01 1.917e+04 0.000 1.000
## TOT_TRN -1.433e-01 1.433e-01 -1.000 0.317
## TOT_TRK -3.982e+00 1.119e+01 -0.356 0.722
## TTBL_SPD 2.057e-01 1.780e-01 1.156 0.248
## HWYLNS -4.243e-01 1.235e+00 -0.344 0.731
## AADT 8.549e-05 9.695e-05 0.882 0.378
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 25.587 on 79 degrees of freedom
## Residual deviance: 18.116 on 68 degrees of freedom
## AIC: 42.116
##
## Number of Fisher Scoring iterations: 19
A smaller Akaike Information Criterion (AIC) would represent a better fit; one cannot use R2 the conventional least squares regression measure of fit. Interpreting the coefficients, we see that none of the WD crossing type factors contributed anything to the result (p close to 1) . TOT_TRK and HWYLNS were also very likely (p=.772, p=.731) to be unrelated. Of the remaining 3 predictors, TTBL_SPD (timetable train speed) was the most likely to have an effect, and the effect is positive on the probability– the coefficient has a positive sign. The number of trains TOT_TRN is next most effective, in a negative direction– more trains through a crossing make the probability of a collision lower. AADT, the number or autos per day through the crossing, is third, with again a positive effect– more would predict a greater probability of collision – however the coefficient is almost 0.
As a check we can use the caret package to assess the importance of the variables.
library(caret); library(dplyr)
round(varImp(glm.fit1,scale=TRUE),4)
## Overall
## as.factor(WD)FQ 0.0004
## as.factor(WD)GT 0.0017
## as.factor(WD)HS 0.0002
## as.factor(WD)NO 0.0002
## as.factor(WD)SS 0.0000
## as.factor(WD)XB 0.0000
## TOT_TRN 1.0002
## TOT_TRK 0.3559
## TTBL_SPD 1.1555
## HWYLNS 0.3436
## AADT 0.8817
Clearly TTBL_SPD is most important; TOT_TRN is close behind; AADT is third; and TOT_TRK and HWYLNS are quite a bit less important.
It therefore makes sense to rerun the logistic regression, without the WD factors and TOT_TRK, HWYLNS, to see if a better result is obtained.
glm.fit3=glm(C~TOT_TRN+TTBL_SPD+AADT, data=dataf, family="binomial")
summary(glm.fit3)
##
## Call:
## glm(formula = C ~ TOT_TRN + TTBL_SPD + AADT, family = "binomial",
## data = dataf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.85641 -0.35314 -0.05429 -0.00886 2.36721
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.039e+01 6.151e+00 -1.689 0.0912 .
## TOT_TRN -1.690e-01 1.464e-01 -1.155 0.2482
## TTBL_SPD 2.316e-01 1.831e-01 1.265 0.2059
## AADT 8.778e-05 9.642e-05 0.910 0.3626
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 25.587 on 79 degrees of freedom
## Residual deviance: 18.930 on 76 degrees of freedom
## AIC: 26.93
##
## Number of Fisher Scoring iterations: 9
Now we see all predictor coefficients have relatively low p scores– below .37. The coefficients all go in the expected direction, if you consider that when there are lots of trains through a crossing each day, people tend to be more careful at the intersection.
We can use the results of the latter logistic regression to predict the probability of a collision at each crossing. Our results will be different from those of the WABS. We build a report to show the results for the crossings. The report will show city and road, type of equipment at the crossing, the predicted probability of collision, and the standard error of the prediction, which gives an idea of how confident we are in the prediction. it could be used to calculate confidence intervals on the predictions. We marked with a * the crossings at which the standard error was small enough so that Prediction was greater than SE, and hence one standard error below the Prediction was still positive probability of collision.
## City Road Type Prediction SEfit sig
## 1 SANTAROSA PINER_ROAD GT 0.30700 0.2876 *
## 2 PETALUMA CORONA_ROAD GT 0.26425 0.2190 *
## 3 SANTAROSA HEARN_AVE GT 0.24901 0.2239 *
## 4 FULTON AVIATION_BLVD GT 0.19704 0.2095
## 5 FULTON RIVER_ROAD GT 0.14239 0.1141 *
## 6 COTATI EAST_COTATI_AVE GT 0.10823 0.0774 *
## 7 ROHNERTPARK SOUTHWEST_BLVD GT 0.09945 0.0686 *
## 8 WINDSOR WINDSOR_RIVER_RD GT 0.09902 0.0721 *
## 9 FULTON FULTON_ROAD GT 0.09146 0.0675 *
## 10 SANTAROSA BELLEVUE/CORBY_AVE FQ 0.08986 0.1151
## 11 ROHNERTPARK GOLF_COURSE_DRIVE GT 0.08493 0.0571 *
## 12 WINDSOR STARR_ROAD GT 0.08308 0.0634 *
## 13 FULTON SHILOH_ROAD GT 0.08175 0.0628 *
## 14 FULTON AIRPORT_BLVD GT 0.08109 0.0626 *
## 15 WINDSOR MITCHELL_LANE GT 0.07851 0.0615 *
## 16 SONOMA SEARSPOINT/ST-37 GT 0.07651 0.1969
## 17 SANTAROSA SAN_MIGUEL_ROAD GT 0.07650 0.1084
## 18 SANTAROSA BARHAM_AVE GT 0.07406 0.1071
## 19 ROHNERTPARK ROHNERT_PARK_EXPY GT 0.06865 0.0492 *
## 20 HEALDSBURG BAILHACHE_AVE GT 0.06430 0.0478 *
## 21 COTATI EAST_RAILROAD_AVE GT 0.06325 0.0476 *
## 22 SANTAROSA WEST_ROBLES_AVE GT 0.06120 0.0470 *
## 23 SANTAROSA TODD_ROAD GT 0.06070 0.0469 *
## 24 ROHNERTPARK SCENIC_AVE GT 0.06020 0.0468 *
## 25 HEALDSBURG LIMERICK_LANE GT 0.05873 0.0465 *
## 26 HEALDSBURG GRANT_AVE. GT 0.05729 0.0462 *
## 27 PETALUMA NMCDOWELL_BLVD GT 0.05547 0.0573
## 28 PENNGROVE MAIN_ST/WOODWARD GT 0.02283 0.0278
## 29 PENNGROVE OLD_ADOBE_ROAD GT 0.02168 0.0271
## 30 PETALUMA ELY_ROAD_N GT 0.01990 0.0261
## 31 SANRAFAEL 2ND_STREET GT 0.01534 0.0399
## 32 SANTAROSA STEELE_LANE GT 0.01358 0.0323
## 33 HEALDSBURG GRANT_SCHOOL_ROAD NO 0.01011 0.0175
## 34 PETALUMA SOUTH_POINT_ROAD NO 0.01011 0.0175
## 35 SANRAFAEL 3RD_STREET GT 0.00828 0.0209
## 36 SONOMA SR_ROUTE_12/121 GT 0.00765 0.0154
## 37 HEALDSBURG HEALDSBURG_AVE GT 0.00742 0.0158
## 38 SANTAROSA COLLEGE_AVENUE GT 0.00703 0.0230
## 39 SANRAFAEL IRWINSTREET HS 0.00378 0.0101
## 40 SONOMA TOULAY_CREEK_ROAD HS 0.00369 0.0092
## 41 SONOMA SR_ROUTE_12/121 GT 0.00273 0.0078
## 42 HEALDSBURG FRONT_STREET GT 0.00234 0.0068
## 43 NOVATO ROBLAR_DRIVE GT 0.00151 0.0059
## 44 SONOMA SKAGGS_ISLAND_RD GT 0.00144 0.0046
## 45 NOVATO HAMILTON_PKWY GT 0.00137 0.0055
## 46 NOVATO BIKE_PATH_PED GT 0.00131 0.0053
## 47 SANTAROSA SEBASTOPOL_ROAD GT 0.00114 0.0048
## 48 PETALUMA HOPPER_STREET FL 0.00043 0.0018
## 49 PETALUMA W_PAYRAN_STREET GT 0.00029 0.0015
## 50 SANRAFAEL FRANCISCO_BLVD XB 0.00029 0.0013
## 51 SANRAFAEL RICE_STREET FL 0.00029 0.0013
## 52 SANRAFAEL SMITH_RANCH_RD GT 0.00023 0.0012
## 53 SANTAROSA STANDISH_AVE SS 0.00022 0.0010
## 54 NOVATO HANNA_RANCH_RD GT 0.00017 0.0010
## 55 NOVATO Franklin_Ped_crossing FL 0.00017 0.0010
## 56 NAPA MILTON_ROAD GT 0.00015 0.0007
## 57 NOVATO PED_XING HS 0.00013 0.0007
## 58 SANRAFAEL N_SANPEDRO_ROAD GT 0.00009 0.0005
## 59 SANTAROSA GUERNEVILLE_ROAD GT 0.00007 0.0005
## 60 PETALUMA LAKEVILLE_STREET GT 0.00007 0.0004
## 61 NOVATO GRANT_AVE GT 0.00004 0.0003
## 62 SANRAFAEL PACHECO_STREET GT 0.00004 0.0002
## 63 SANRAFAEL PED_XING FL 0.00004 0.0003
## 64 SANRAFAEL PALOMA_AVENUE GT 0.00004 0.0003
## 65 SANTAROSA 9TH_STREET FQ 0.00003 0.0002
## 66 ROHNERTPARK PED_XING GT 0.00003 0.0002
## 67 NOVATO RUSHCREEK_PLACE FQ 0.00002 0.0001
## 68 PETALUMA WASHINGTON_STREET GT 0.00000 0.0000
## 69 SANRAFAEL 4TH_STREET FQ 0.00000 0.0000
## 70 PETALUMA EAST_D_STREET FQ 0.00000 0.0000
## 71 SANTAROSA 3RD_STREET GT 0.00000 0.0000
## 72 NOVATO OLIVE_AVE GT 0.00000 0.0000
## 73 SANRAFAEL MISSION_AVENUE FQ 0.00000 0.0000
## 74 SANRAFAEL CIVIC_CENTER_DR GT 0.00000 0.0000
## 75 SANRAFAEL 5TH_AVENUE FQ 0.00000 0.0000
## 76 SANTAROSA 8TH_STREET GT 0.00000 0.0000
## 77 SANTAROSA 7TH_STREET GT 0.00000 0.0000
## 78 SANTAROSA 6TH_STREET GT 0.00000 0.0000
## 79 PETALUMA CAULFIELD_LANE GT 0.00000 0.0000
## 80 SANTAROSA Railroad_square_ped_xing FL 0.00000 0.0000
Let us select all the data with a Prediction (probability of collison) greater than 1 in 100 (0.01). There are 10.
The table shows 34 crossings have probabilities of a collision in the next year greater than 0.01 or 1 in 100. If we look at the lowest ones, we will screen for the ones with a Prediction less than .0001, or one in ten thousand. We obtain 23 crossings with less than 1 in 10000.
Which variable is most important? The package caret gives us a method for deciding. We see below that TTBL_SPD is a bit more important than TOT_TRN, and AADT is less important. But they are not greatly different in importance.
library(caret)
caret::varImp(glm.fit3,scale=TRUE)
## Overall
## TOT_TRN 1.1546445
## TTBL_SPD 1.2648867
## AADT 0.9104228
Residual analysis is important as well, to see if we have a decent model for prediction. We call this candidatemodel LOGR.
glm.fit3_data = data.frame( RANK=dataf$RANK, name=paste0(dataf$CITY, "; ", substr(dataf$ROAD,1,20) ),
std_resids = scale(glm.fit3$residuals), C = dataf$C, L=glm.fit3$fitted.values, W=dataf$PRED_COLLS )
ggplot(glm.fit3_data, aes( RANK, std_resids, color=factor(C) ) ) +
geom_point(alpha = .5) +
geom_hline(yintercept = 3, color = "orange") + geom_hline( yintercept=0, color = "blue") +
labs( x = "RANK by WABS probability", y="Standardized residual",
title= "Standardized residuals of crossings using LOGR regression", subtitle = "Crossings with collisons have large positive standard errors, others negative standard errors." )
As the above graph and table following show, only three crossings have residuals over 0, and these are the ones with collisions actually occurring. Which are these?
glm.fit3_data %>% filter ( std_resids > 0 )
## RANK name std_resids C L W
## 1 1 SANTAROSA; HEARN_AVE 1.868924 1 0.24901172 0.171142
## 2 6 ROHNERTPARK; GOLF_COURSE_DRIVE 5.012333 1 0.08492718 0.046391
## 3 17 SANTAROSA; TODD_ROAD 6.916577 1 0.06069780 0.036735
Because of this upward bias in the LOGR predictions, we now revise the LOGA model to include only the interaction between TTBL_SPD and AADT. AIC is now negative, signifying a better fit. We first show a histogram of the distribution of prediction gap between LOGA and WABS. There are only 6 outliers with more absolute gap than 0.05.
glm.spdxaadt = glm(glm(C~TTBL_SPD:AADT, data=dataf, family="binomial")) #+TTBL_SPD+AADT
s1=summary(glm.spdxaadt)
newdata = data.frame( HWYLNS= dataf$HWYLNS,TOT_TRN=dataf$TOT_TRN, TTBL_SPD=dataf$TTBL_SPD, AADT=dataf$AADT)
spdxaadt.pred=predict(glm.spdxaadt, newdata, type="response", se.fit=TRUE)
glm.spdxaadt_data = data.frame( RANK=dataf$RANK, name=paste0(dataf$CITY, "; ", substr(dataf$ROAD,1,20) ),
std_resids = round(scale(glm.spdxaadt$residuals),4), C = dataf$C,
L=round(glm.spdxaadt$fitted.values,4), W=round(dataf$PRED_COLLS,4) ) %>%
mutate ( gap=round(L-W, 4) )
ggplot(glm.spdxaadt_data, aes(gap)) +
geom_histogram (color="red", alpha=0.2, size=1) +
coord_cartesian( ylim= c(0,30), xlim=c(-0.05, 0.15)) +
geom_vline ( xintercept = 0) +
labs(title=paste0("Distribution of predictions from LOGA with TTBL_SPD:AADT \n AIC = ", round(s1$aic,3) ) )
We display the big 6 in descending order of the gap between LOGA and WABS (LOGA - WABS). Only one gap is negative.
glm.spdxaadt_data %>% filter ( !between (gap, -.05,.05) ) %>% arrange ( desc(abs(gap) ) )
## RANK name std_resids C L W gap
## 1 5 SANTAROSA; PINER_ROAD -1.0467 0 0.1964 0.0485 0.1479
## 2 4 SANTAROSA; COLLEGE_AVENUE -0.9047 0 0.1697 0.0494 0.1203
## 3 16 PETALUMA; CORONA_ROAD -0.6337 0 0.1189 0.0371 0.0818
## 4 31 SONOMA; SEARSPOINT/ST-37 -0.5296 0 0.0994 0.0212 0.0782
## 5 9 SANTAROSA; GUERNEVILLE_ROAD -0.6343 0 0.1190 0.0426 0.0764
## 6 2 SANTAROSA; STEELE_LANE -0.3623 0 0.0680 0.1293 -0.0613
We try adding TOT_TRN into LOGA. the new AIC is -34.92129, not as good as the previous try. Now there are 10 that miss by more than 0.05. So we reject this version.
## RANK name std_resids C L W gap
## 1 5 SANTAROSA; PINER_ROAD -1.1160 0 0.2078 0.0485 0.1593
## 2 4 SANTAROSA; COLLEGE_AVENUE -0.9399 0 0.1750 0.0494 0.1256
## 3 31 SONOMA; SEARSPOINT/ST-37 -0.7401 0 0.1378 0.0212 0.1166
## 4 43 SANRAFAEL; 2ND_STREET -0.5103 0 0.0950 0.0141 0.0809
## 5 2 SANTAROSA; STEELE_LANE -0.2674 0 0.0498 0.1293 -0.0795
## 6 16 PETALUMA; CORONA_ROAD -0.6223 0 0.1159 0.0371 0.0788
## 7 9 SANTAROSA; GUERNEVILLE_ROAD -0.6046 0 0.1126 0.0426 0.0700
## 8 52 SANRAFAEL; 3RD_STREET -0.4191 0 0.0780 0.0115 0.0665
## 9 28 PETALUMA; NMCDOWELL_BLVD -0.4424 0 0.0824 0.0236 0.0588
## 10 40 FULTON; RIVER_ROAD -0.3597 0 0.0670 0.0158 0.0512
Now there are 10 above 0.05. This is something of a fit but still is skewed right and bimodal. Personally I’d prefer the one with just the two variable interaction. We plot the histogram below.
Now we add in HWYLNS as a factor with TOT_TRN. We now obtain 17 gaps outside the (-0.05,0.05) interval.
## RANK name std_resids C L W gap
## 1 43 SANRAFAEL; 2ND_STREET -1.1043 0 0.2016 0.0141 0.1875
## 2 36 PETALUMA; CAULFIELD_LANE 0.6167 0 -0.1126 0.0172 -0.1298
## 3 31 SONOMA; SEARSPOINT/ST-37 -0.7887 0 0.1440 0.0212 0.1228
## 4 11 SANTAROSA; 3RD_STREET 0.4189 0 -0.0765 0.0423 -0.1188
## 5 1 SANTAROSA; HEARN_AVE 3.9278 1 0.2830 0.1711 0.1119
## 6 10 PETALUMA; EAST_D_STREET -0.7606 0 0.1389 0.0424 0.0965
## 7 5 SANTAROSA; PINER_ROAD -0.7204 0 0.1315 0.0485 0.0830
## 8 13 SANTAROSA; BARHAM_AVE -0.6615 0 0.1207 0.0400 0.0807
## 9 16 PETALUMA; CORONA_ROAD -0.6362 0 0.1161 0.0371 0.0790
## 10 19 SANRAFAEL; CIVIC_CENTER_DR -0.5875 0 0.1073 0.0358 0.0715
## 11 9 SANTAROSA; GUERNEVILLE_ROAD -0.6153 0 0.1123 0.0426 0.0697
## 12 52 SANRAFAEL; 3RD_STREET -0.4426 0 0.0808 0.0115 0.0693
## 13 39 ROHNERTPARK; ROHNERT_PARK_EXPY 0.2321 0 -0.0424 0.0159 -0.0583
## 14 40 FULTON; RIVER_ROAD -0.3765 0 0.0687 0.0158 0.0529
## 15 37 COTATI; EAST_COTATI_AVE -0.3734 0 0.0682 0.0169 0.0513
## 16 29 NOVATO; RUSHCREEK_PLACE 0.1505 0 -0.0275 0.0235 -0.0510
## 17 3 PETALUMA; WASHINGTON_STREET 0.0000 0 0.0000 0.0504 -0.0504
The distribution of gaps is shown here.
The anova table indicates that the interaction of TTBL_SPD and AADT makes predictions whose error is not significantly different from that obtained by adding HWYLNS and TOT_TRN. We should therefore stick with the interaction only model. Six crossings bear closer investigation, since the predictions of the LOGR model are widely different from those of the WABS model. Adding HWYLNS as a factor does not improve the regression.
## Analysis of Deviance Table
##
## Model 1: C ~ as.factor(HWYLNS) + TOT_TRN + TTBL_SPD:AADT
## Model 2: C ~ TTBL_SPD:AADT
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 73 2.6326
## 2 78 2.7802 -5 -0.14767 0.5359
The simpler model is clearly better. So we fix on LOGA using TTBL_SPD:AADT only. This raises the question of why these crossings are predicted differently by the LOGR model chosen. Only one of these, Steele Lane, is predicted lower by LOGR than WABS.
wh=glm.spdxaadt_data[which(abs(glm.spdxaadt_data$gap) >= 0.05),"RANK"]
mutate (dataf, L = glm.spdxaadt_data$L, gap = glm.spdxaadt_data$gap )[ wh, ] %>%
mutate ( name = substr(paste0(CITY, "; ",substr(ROAD, 1, 16)),1,30), W = round(PRED_COLLS,4) ) %>%
select ( RANK, name, W, L, gap, C, TTBL_SPD, AADT, TOT_TRN, HWYLNS, TOT_TRK ) %>%
arrange (desc ( gap))
## # A tibble: 6 x 11
## RANK name W L gap C TTBL_SPD AADT TOT_TRN HWYLNS
## <int> <chr> <dbl> <dbl> <dbl> <int> <int> <dbl> <int> <dbl>
## 1 5 SANT~ 0.0485 0.196 0.148 0 79 23900 64 4
## 2 4 SANT~ 0.0494 0.170 0.120 0 60 26900 64 4
## 3 16 PETA~ 0.0371 0.119 0.0818 0 79 13810 60 2
## 4 31 SONO~ 0.0212 0.0994 0.0782 0 25 35600 6 2
## 5 9 SANT~ 0.0426 0.119 0.0764 0 40 27300 64 2
## 6 2 SANT~ 0.129 0.068 -0.0613 0 70 8100 64 3
## # ... with 1 more variable: TOT_TRK <int>
We note that Sears Point and St 37 crossing actually has 4 lanes of traffic, as a look at google maps shows. It seems that the somewhat higher probability imputed by L might be justified, although we cannot support inclusion of HWYLNS in the LOGR model.
Similarly Guerneville Rd has 4 highway lanes, though the database shows only 2. It has lower TTBL_SPD than College Ave, but higher AADT by a little. This would make it most similar to College Ave. We would expect a prediction higher for College Ave, and both methods get that, though the WABS difference is minimal.
We calculate 95% confidence intervals for the predictions from the interactions model to show how precise the predictions should be thought. We would expect that the WABS predictions should be within the 95% confidence interval for the LOGR predictions. Otherwise the WABS prediction could be anomalous. We do not have information on confidence intervals for the WABS predictions to see if they overlap those for LOGR.
pr = predict(glm.spdxaadt, type="response", se.fit=TRUE)
confint = round(data.frame(W = dataf$PRED_COLLS, L = pr$fit, gap=glm.spdxaadt_data$gap,
lwr= pr$fit - 1.96*pr$se.fit, upr = pr$fit + 1.96 * pr$se.fit),4 )%>%
mutate ( clen = upr - lwr, rank=dataf$RANK, name=paste0(dataf$CITY,"; ",substr(dataf$ROAD,1,20) ) )
confint[wh,]
## W L gap lwr upr clen rank
## 2 0.1293 0.0680 -0.0613 0.0141 0.1218 0.1077 2
## 4 0.0494 0.1697 0.1203 0.0147 0.3247 0.3100 4
## 5 0.0485 0.1964 0.1479 0.0122 0.3806 0.3684 5
## 9 0.0426 0.1190 0.0764 0.0181 0.2199 0.2018 9
## 16 0.0371 0.1189 0.0818 0.0181 0.2197 0.2016 16
## 31 0.0212 0.0994 0.0782 0.0181 0.1806 0.1625 31
## name
## 2 SANTAROSA; STEELE_LANE
## 4 SANTAROSA; COLLEGE_AVENUE
## 5 SANTAROSA; PINER_ROAD
## 9 SANTAROSA; GUERNEVILLE_ROAD
## 16 PETALUMA; CORONA_ROAD
## 31 SONOMA; SEARSPOINT/ST-37
Only for Steele Lane does the WABS prediction fall outside the 95% confidence limits for the L prediction. What is special about Steele Lane? A picture of the intersection from Google Maps is shown here: https://goo.gl/maps/HyDXqU8J5LG2 . While there are only 3 lanes of traffic, the inner lane is a left turn lane into Coffey which crosses the tracks. This perhaps encourages cars waiting for the left turn, and not wanting to miss the red light cycle at Coffee, to chance waiting in the crossing for the turn. With train timetable speeds of 70 here, and no knowledge of the light cycle, trains might not be able to stop; drivers, not knowing this, would be taken unawares by a train that could not stop. The confidence limits give us additional support for the LOGR model, though not necessarily in preference to the WABS model.
We have independently modeled the WABS data for the cycle ending in November 2018, and have obtained a logistic regression model. It predicts the probability of a collision at each crossing during a one year period by using two of the predictors in the WABS data set, TTBL_SPD and AADT. These are the timetable speed of the scheduled trains, and the average auto daily traffic at the crossing. The variables TOT_TRN and HWYLNS, the total number of trains per day, and the number of highway lanes at the crossing were found not to improve the model, as was also true of the discrete crossing type variable WD.
The model obtained regresses C upon the interaction term TTBL_SPD x AADT alone. Thus there is a complex relation between the predictors.
This model which we call the LOGR model predicts probabilities of collision at all 80 intersections. Only 6 crossings have predicted probabilities greater than 0.05 or 1 in 20 chance that observing the crossing for a year will detect a collision.
An analysis of these crossings shows that of the 6, two have erroneous data in the HWYLNS field in WABS data. This was discovered by inspecting the crossing via Google Maps interactive view. However these errata would not impact the predictions made by the LOGA model. We do not know whether the WABS predictions took the erroneous values into account.
Assume that at crossings accident rates are independent; this is likely on a long linear line such as SMART. IN that case we might add the collision probabilities to obtain the number of accidents in a year for the whole line. The WABS method predicts almost 1.84 accidents per year, the LOGR method predicts 3 accidents per year. Last year (2018) there were 3 accidents at crossings on SMART. We could predict that there would be 3 again this year, if no changes are made at crossings.
The author thanks Steve Birdlebough for calling this interesting problem and the WABS dataset to his attention.
[1] WABS dataset
First we try the most significant variable, TTBL_SPD. A plot of the distribution of the gaps will show the range. These gaps appear distributed bimodal and skewed right(?) which means slightly higher estimates than WABS. But it is a pretty good result with no gap larger than about .075 in either direction.
Next we try TOT_TRN, the next most significant variable. We plot the distribution of gaps.
This variable skews gaps to the right, estimating in excess of the WABS estimates. But a look at the predicted values shows a certain amount of discretization, since the TOT_TRN predictor has only a few discrete values. This makes it not a good predictor to include.
Finally we try AADT on its own. We plot the distribution of the gaps.
Again there are relatively few values of the prediction, they tend to be low, and the busy intersections tend to be much lower.
The plots appear below.
grid.arrange(grobs=list(g1,g2,g3), nrow=2, ncol=2)
We now use a simple neural network to predict the probabilities of accident at each crossing. We use the nnet routine in the caret package. Such a routine uses the data to fire adjacent nodes in the net at random, based on random numbers generated internally. Because we need to seed the neural network with a random number we cannot be assured we will get the same predictions each time. However, after we establish that the neural network works we can then bootstrap its results to get confidence intervals on the predictions. Start by setting the random number generation seed to a specific repeatable number. For comparison we select the same 3 variables we chose above. Since there are only 3 collisions, there could be difficulty training the neural net.
We can display the predicted probabilities with the WABS and LOGR predictions for comparison.
print(mutate( out, NNET=round(probs$`1`,4 ) ) )
## name WABS LOGR NNET
## 1 SANTAROSA; HEARN_AVE 0.1711 0.30700 0.0989
## 2 SANTAROSA; STEELE_LANE 0.1293 0.26425 0.0223
## 3 PETALUMA; WASHINGTON_STREET 0.0504 0.24901 0.0989
## 4 SANTAROSA; COLLEGE_AVENUE 0.0494 0.19704 0.0989
## 5 SANTAROSA; PINER_ROAD 0.0485 0.14239 0.0989
## 6 ROHNERTPARK; GOLF_COURSE_DRIVE 0.0464 0.10823 0.0973
## 7 SANRAFAEL; 4TH_STREET 0.0458 0.09945 0.0989
## 8 SANRAFAEL; N_SANPEDRO_ROAD 0.0455 0.09902 0.0223
## 9 SANTAROSA; GUERNEVILLE_ROAD 0.0426 0.09146 0.0989
## 10 PETALUMA; EAST_D_STREET 0.0424 0.08986 0.0989
## 11 SANTAROSA; 3RD_STREET 0.0423 0.08493 0.0223
## 12 SANTAROSA; SAN_MIGUEL_ROAD 0.0412 0.08308 0.0223
## 13 SANTAROSA; BARHAM_AVE 0.0400 0.08175 0.0223
## 14 NOVATO; GRANT_AVE 0.0379 0.08109 0.0223
## 15 NOVATO; OLIVE_AVE 0.0377 0.07851 0.0223
## 16 PETALUMA; CORONA_ROAD 0.0371 0.07651 0.0223
## 17 SANTAROSA; TODD_ROAD 0.0367 0.07650 0.0223
## 18 SANRAFAEL; MISSION_AVENUE 0.0364 0.07406 0.0247
## 19 SANRAFAEL; CIVIC_CENTER_DR 0.0358 0.06865 0.0223
## 20 SANRAFAEL; 5TH_AVENUE 0.0354 0.06430 0.0223
## 21 PETALUMA; W_PAYRAN_STREET 0.0343 0.06325 0.0223
## 22 SANTAROSA; BELLEVUE/CORBY_AVE 0.0332 0.06120 0.0223
## 23 SANTAROSA; SEBASTOPOL_ROAD 0.0325 0.06070 0.0223
## 24 SANRAFAEL; SMITH_RANCH_RD 0.0324 0.06020 0.0223
## 25 PETALUMA; LAKEVILLE_STREET 0.0313 0.05873 0.0223
## 26 SANTAROSA; 9TH_STREET 0.0313 0.05729 0.0223
## 27 NOVATO; ROBLAR_DRIVE 0.0240 0.05547 0.0223
## 28 PETALUMA; NMCDOWELL_BLVD 0.0236 0.02283 0.0989
## 29 NOVATO; RUSHCREEK_PLACE 0.0235 0.02168 0.0223
## 30 SANRAFAEL; IRWINSTREET 0.0231 0.01990 0.0989
## 31 SONOMA; SEARSPOINT/ST-37 0.0212 0.01534 0.0989
## 32 SANRAFAEL; PACHECO_STREET 0.0200 0.01358 0.0223
## 33 SANTAROSA; 8TH_STREET 0.0190 0.01011 0.0223
## 34 SANTAROSA; 7TH_STREET 0.0180 0.01011 0.0223
## 35 SANTAROSA; 6TH_STREET 0.0180 0.00828 0.0223
## 36 PETALUMA; CAULFIELD_LANE 0.0172 0.00765 0.0223
## 37 COTATI; EAST_COTATI_AVE 0.0169 0.00742 0.0989
## 38 ROHNERTPARK; SOUTHWEST_BLVD 0.0163 0.00703 0.0989
## 39 ROHNERTPARK; ROHNERT_PARK_EXPY 0.0159 0.00378 0.0224
## 40 FULTON; RIVER_ROAD 0.0158 0.00369 0.0989
## 41 HEALDSBURG; HEALDSBURG_AVE 0.0149 0.00273 0.0989
## 42 NOVATO; HAMILTON_PKWY 0.0145 0.00234 0.0223
## 43 SANRAFAEL; 2ND_STREET 0.0141 0.00151 0.0989
## 44 NOVATO; HANNA_RANCH_RD 0.0136 0.00144 0.0223
## 45 SONOMA; SR_ROUTE_12/121 0.0129 0.00137 0.0989
## 46 HEALDSBURG; FRONT_STREET 0.0128 0.00131 0.0276
## 47 PENNGROVE; MAIN_ST/WOODWARD 0.0126 0.00114 0.0227
## 48 WINDSOR; WINDSOR_RIVER_RD 0.0126 0.00043 0.0304
## 49 SANRAFAEL; FRANCISCO_BLVD 0.0124 0.00029 0.0989
## 50 PENNGROVE; OLD_ADOBE_ROAD 0.0117 0.00029 0.0223
## 51 SONOMA; SR_ROUTE_12/121 0.0117 0.00029 0.0989
## 52 SANRAFAEL; 3RD_STREET 0.0115 0.00023 0.0989
## 53 FULTON; FULTON_ROAD 0.0114 0.00022 0.0225
## 54 HEALDSBURG; BAILHACHE_AVE 0.0112 0.00017 0.0223
## 55 COTATI; EAST_RAILROAD_AVE 0.0108 0.00017 0.0223
## 56 SANTAROSA; WEST_ROBLES_AVE 0.0099 0.00015 0.0223
## 57 PETALUMA; ELY_ROAD_N 0.0096 0.00013 0.0223
## 58 ROHNERTPARK; SCENIC_AVE 0.0093 0.00009 0.0223
## 59 WINDSOR; STARR_ROAD 0.0091 0.00007 0.0223
## 60 FULTON; SHILOH_ROAD 0.0085 0.00007 0.0223
## 61 HEALDSBURG; LIMERICK_LANE 0.0082 0.00004 0.0223
## 62 FULTON; AIRPORT_BLVD 0.0081 0.00004 0.0223
## 63 SANRAFAEL; PED_XING 0.0081 0.00004 0.0223
## 64 PETALUMA; HOPPER_STREET 0.0078 0.00004 0.0223
## 65 NAPA; MILTON_ROAD 0.0077 0.00003 0.0295
## 66 SANRAFAEL; RICE_STREET 0.0075 0.00003 0.0989
## 67 SONOMA; SKAGGS_ISLAND_RD 0.0068 0.00002 0.0226
## 68 HEALDSBURG; GRANT_AVE. 0.0064 0.00000 0.0223
## 69 WINDSOR; MITCHELL_LANE 0.0060 0.00000 0.0223
## 70 SONOMA; TOULAY_CREEK_ROAD 0.0052 0.00000 0.0223
## 71 FULTON; AVIATION_BLVD 0.0039 0.00000 0.0224
## 72 SANTAROSA; STANDISH_AVE 0.0038 0.00000 0.0239
## 73 SANRAFAEL; PALOMA_AVENUE 0.0016 0.00000 0.0223
## 74 HEALDSBURG; GRANT_SCHOOL_ROAD 0.0003 0.00000 0.0231
## 75 PETALUMA; SOUTH_POINT_ROAD 0.0003 0.00000 0.0231
## 76 NOVATO; PED_XING 0.0002 0.00000 0.0223
## 77 NOVATO; Franklin_Ped_crossin 0.0001 0.00000 0.0223
## 78 SANTAROSA; Railroad_square_ped_ 0.0001 0.00000 0.0223
## 79 NOVATO; BIKE_PATH_PED 0.0001 0.00000 0.0223
## 80 ROHNERTPARK; PED_XING 0.0001 0.00000 0.0452
The neural network does not seem very useful, as the probabilities are all rather low (perhaps appropriate!) and also have only a few unique values. This shows it does not discriminate enough among the cases to come up with a range of probabilities. There are too few accidents to model the data this way. The logistic regression does a better job.