In this essay we address the problem of estimating, from data provided by the FRA in the WABS databases, the probability of a collision at specific rail crossings in a year. WABS itself makes some estimates, but does not document the method used for making them. We are not trying to replicate the WABS results, but instead develop here an independent method based on logistic regression.

Logistic regression is a way of estimating the probability of an event, such as an accident or collision, happening. The response variable is a binary variable indicating whether an accident occurs or not. Predictor variables are taken from the data fields provided by WABS. Logistic regression allows either continuous or discrete values of predictors, including factors (which might be character data in the data file). It uses a version of regression which chooses coefficients based on maximum likelihood, instead of least square error, which is used in ordinary regression. In addition, the response variable does not follow a normal distribution, but a binomial one, so that ordinary least squares methods of analysis of results are inappropriate.

WABS Accident data is contained in pdf documents downloadable from their website. The files are not downloadable in any other format. We extracted the data from the pdf and edited the files, using Notepad+, to get columns and field names aligned properly, and then used the RStudio import data feature to get it in. The R to do so is:

Organizing the data

We first create a shorter, more generic name for the data frame. A summary of the data shows some NAs in AADT, and one in HWYLNS. We have to replace these

dataf = WABS_2019_01_28
summary(dataf)
##       RANK         PRED_COLLS         CROSSING              RR           
##  Min.   : 1.00   Min.   :0.000057   Length:80          Length:80         
##  1st Qu.:20.75   1st Qu.:0.008430   Class :character   Class :character  
##  Median :40.50   Median :0.015378   Mode  :character   Mode  :character  
##  Mean   :40.50   Mean   :0.022960                                        
##  3rd Qu.:60.25   3rd Qu.:0.034537                                        
##  Max.   :80.00   Max.   :0.171142                                        
##                                                                          
##     STATE              COUNTY              CITY          
##  Length:80          Length:80          Length:80         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##      ROAD                 C                C1               C2   
##  Length:80          Min.   :0.0000   Min.   :0.0000   Min.   :0  
##  Class :character   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0  
##  Mode  :character   Median :0.0000   Median :0.0000   Median :0  
##                     Mean   :0.0375   Mean   :0.0125   Mean   :0  
##                     3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :0  
##                                                                  
##        C3          C4      DATE_CHG              WD           
##  Min.   :0   Min.   :0   Length:80          Length:80         
##  1st Qu.:0   1st Qu.:0   Class :character   Class :character  
##  Median :0   Median :0   Mode  :character   Mode  :character  
##  Mean   :0   Mean   :0                                        
##  3rd Qu.:0   3rd Qu.:0                                        
##  Max.   :0   Max.   :0                                        
##                                                               
##     TOT_TRN         TOT_TRK         TTBL_SPD        HWYPVD         
##  Min.   : 0.00   Min.   :0.000   Min.   : 0.00   Length:80         
##  1st Qu.: 8.00   1st Qu.:1.000   1st Qu.:25.00   Class :character  
##  Median :11.00   Median :1.000   Median :40.00   Mode  :character  
##  Mean   :34.41   Mean   :1.238   Mean   :38.73                     
##  3rd Qu.:61.00   3rd Qu.:1.000   3rd Qu.:45.00                     
##  Max.   :72.00   Max.   :3.000   Max.   :79.00                     
##                                                                    
##      HWYLNS           AADT      
##  Min.   :0.000   Min.   :    1  
##  1st Qu.:2.000   1st Qu.:  775  
##  Median :2.000   Median : 3122  
##  Mean   :2.165   Mean   : 6855  
##  3rd Qu.:2.000   3rd Qu.: 8623  
##  Max.   :5.000   Max.   :48000  
##  NA's   :1       NA's   :4

HWYLNS should probably be replaced with a 2, since it is the number of highway lanes at the crossing, and could not be 0. And the AADT ones should be replaced with zeroes. We also note that HWYPVD is always YES, so it need not be considered as a discriminating response variable.

dataf$HWYLNS[which(is.na(dataf$HWYLNS) )] = 2
dataf$AADT[which(is.na(dataf$AADT))] = 0

Now the data look good for analysis. Here is a list of the column names:

##  [1] "RANK"       "PRED_COLLS" "CROSSING"   "RR"         "STATE"     
##  [6] "COUNTY"     "CITY"       "ROAD"       "C"          "C1"        
## [11] "C2"         "C3"         "C4"         "DATE_CHG"   "WD"        
## [16] "TOT_TRN"    "TOT_TRK"    "TTBL_SPD"   "HWYPVD"     "HWYLNS"    
## [21] "AADT"

RANK and PRED_COLLS show the results of the WABS estimation. RANK is the order high to low PRED_COLLS, PRED_COLLS is the estimate of the probability of a collision at the crossing in a year computed by WABS. The data have ranked the crossings in descending order of probability of a collision.

CROSSING, RR, STATE, COUNTY, CITY, and ROAD describe the exact location of the crossing. To create this file we selected only crossings on the SMART train line in Sonoma and Marin counties, California. Thus RR always is the acronym SMRT, STATE is always California, and COUNTY is either Sonoma or Marin. The fields beginning with C are the accident or collision counts for the current cycle (representing one year), and previous cycles of one year length. Since the SMART line only operated in 2017 and 2018, the last 3 fields here are all zeros. WD is a code for the type of equipment at the crossing. TOT_TRN is the number of trains per day, TOT_TRK the number of tracks at the crossing, TTBL_SPD the maximum speed permitted by the timetable schedule, HWYPVD a code for whether the crossing is paved or not (in this sample always YES, so it is not a candidate for a predictor). HWYLNS is the number of highway lanes going through the crossing, and AADT is the avverage daily vehicle traffic count at the crossing. Except for HWYPVD, the variables after DATE_CHG, which represents whether there was a change to the entry made by the railroad during the cycles included (perhaps a re-engineering of the crossing equipment), are all candidates for predictors of the collision likelihood.

Here are the key predictions about the crossings: you can search the table to see highs and lows.

print(dataf[,c(1:2,7:8,15:18, 20:21)])
## # A tibble: 80 x 10
##     RANK PRED_COLLS CITY  ROAD WD    TOT_TRN TOT_TRK TTBL_SPD HWYLNS  AADT
##    <int>      <dbl> <chr> <ch> <chr>   <int>   <int>    <int>  <dbl> <dbl>
##  1     1     0.171  SANT~ HEA~ GT         64       1       79      3 20600
##  2     2     0.129  SANT~ STE~ GT         64       1       70      3  8100
##  3     3     0.0504 PETA~ WAS~ GT         60       1       25      5 21549
##  4     4     0.0494 SANT~ COL~ GT         64       2       60      4 26900
##  5     5     0.0485 SANT~ PIN~ GT         64       1       79      4 23900
##  6     6     0.0464 ROHN~ GOL~ GT         10       1       40      2  5024
##  7     7     0.0458 SANR~ 4TH~ FQ         60       2       15      2 48000
##  8     8     0.0455 SANR~ N_S~ GT         60       1       45      2  8500
##  9     9     0.0426 SANT~ GUE~ GT         64       2       40      2 27300
## 10    10     0.0424 PETA~ EAS~ FQ         60       1       25      3 18371
## # ... with 70 more rows

Only 2 have higher predictions than 10%, and the last 10 are all under 0.4%. The last 5 are pedestrian crossings.

Data preview

To gain some insight into the data we plot the predicted collisions against the six chosen variables.

pairs(dataf[,c(2, 16:18, 20:21)])

We see there are a couple of outliers with very high PRED_COLLS. These occur with high TOT_TRN, TOT_TRK = 2, high TTBL_SPD, and 3 HWYLNS; AADT for these is medium in one case and high in another. Otherwise there is not a clear pattern. Very low PRED_COLLS go with low values of AADT,TTBL_SPD, and TOT_TRN.

Logistic regression analysis

Initial analysis of all variables

A preliminary analysis was done using logistic regression; we used the glm routine in R. We tried a model for C using all the predictors.

glm.fit1=glm(C~as.factor(WD)+TOT_TRN+TOT_TRK+TTBL_SPD+HWYLNS+AADT, data=dataf,family="binomial")
summary(glm.fit1)
## 
## Call:
## glm(formula = C ~ as.factor(WD) + TOT_TRN + TOT_TRK + TTBL_SPD + 
##     HWYLNS + AADT, family = "binomial", data = dataf)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.93591  -0.25999  -0.02120  -0.00008   2.29508  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -1.728e+01  7.279e+03  -0.002    0.998
## as.factor(WD)FQ -3.534e+00  8.758e+03   0.000    1.000
## as.factor(WD)GT  1.268e+01  7.279e+03   0.002    0.999
## as.factor(WD)HS -2.409e+00  1.231e+04   0.000    1.000
## as.factor(WD)NO -3.617e+00  1.450e+04   0.000    1.000
## as.factor(WD)SS -2.371e-01  1.917e+04   0.000    1.000
## as.factor(WD)XB -4.972e-01  1.917e+04   0.000    1.000
## TOT_TRN         -1.433e-01  1.433e-01  -1.000    0.317
## TOT_TRK         -3.982e+00  1.119e+01  -0.356    0.722
## TTBL_SPD         2.057e-01  1.780e-01   1.156    0.248
## HWYLNS          -4.243e-01  1.235e+00  -0.344    0.731
## AADT             8.549e-05  9.695e-05   0.882    0.378
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25.587  on 79  degrees of freedom
## Residual deviance: 18.116  on 68  degrees of freedom
## AIC: 42.116
## 
## Number of Fisher Scoring iterations: 19

A smaller Akaike Information Criterion (AIC) would represent a better fit; one cannot use R2 the conventional least squares regression measure of fit. Interpreting the coefficients, we see that none of the WD crossing type factors contributed anything to the result (p close to 1) . TOT_TRK and HWYLNS were also very likely (p=.772, p=.731) to be unrelated. Of the remaining 3 predictors, TTBL_SPD (timetable train speed) was the most likely to have an effect, and the effect is positive on the probability– the coefficient has a positive sign. The number of trains TOT_TRN is next most effective, in a negative direction– more trains through a crossing make the probability of a collision lower. AADT, the number or autos per day through the crossing, is third, with again a positive effect– more would predict a greater probability of collision – however the coefficient is almost 0.

As a check we can use the caret package to assess the importance of the variables.

library(caret); library(dplyr)
round(varImp(glm.fit1,scale=TRUE),4)
##                 Overall
## as.factor(WD)FQ  0.0004
## as.factor(WD)GT  0.0017
## as.factor(WD)HS  0.0002
## as.factor(WD)NO  0.0002
## as.factor(WD)SS  0.0000
## as.factor(WD)XB  0.0000
## TOT_TRN          1.0002
## TOT_TRK          0.3559
## TTBL_SPD         1.1555
## HWYLNS           0.3436
## AADT             0.8817

Clearly TTBL_SPD is most important; TOT_TRN is close behind; AADT is third; and TOT_TRK and HWYLNS are quite a bit less important.

Reduced form of logistic regression

It therefore makes sense to rerun the logistic regression, without the WD factors and TOT_TRK, HWYLNS, to see if a better result is obtained.

glm.fit3=glm(C~TOT_TRN+TTBL_SPD+AADT, data=dataf, family="binomial")
summary(glm.fit3)
## 
## Call:
## glm(formula = C ~ TOT_TRN + TTBL_SPD + AADT, family = "binomial", 
##     data = dataf)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.85641  -0.35314  -0.05429  -0.00886   2.36721  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -1.039e+01  6.151e+00  -1.689   0.0912 .
## TOT_TRN     -1.690e-01  1.464e-01  -1.155   0.2482  
## TTBL_SPD     2.316e-01  1.831e-01   1.265   0.2059  
## AADT         8.778e-05  9.642e-05   0.910   0.3626  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25.587  on 79  degrees of freedom
## Residual deviance: 18.930  on 76  degrees of freedom
## AIC: 26.93
## 
## Number of Fisher Scoring iterations: 9

Now we see all predictor coefficients have relatively low p scores– below .37. The coefficients all go in the expected direction, if you consider that when there are lots of trains through a crossing each day, people tend to be more careful at the intersection.

Predicting probabilities of collision

We can use the results of the latter logistic regression to predict the probability of a collision at each crossing. Our results will be different from those of the WABS. We build a report to show the results for the crossings. The report will show city and road, type of equipment at the crossing, the predicted probability of collision, and the standard error of the prediction, which gives an idea of how confident we are in the prediction. it could be used to calculate confidence intervals on the predictions. We marked with a * the crossings at which the standard error was small enough so that Prediction was greater than SE, and hence one standard error below the Prediction was still positive probability of collision.

##           City                     Road Type Prediction  SEfit sig
## 1    SANTAROSA               PINER_ROAD   GT    0.30700 0.2876   *
## 2     PETALUMA              CORONA_ROAD   GT    0.26425 0.2190   *
## 3    SANTAROSA                HEARN_AVE   GT    0.24901 0.2239   *
## 4       FULTON            AVIATION_BLVD   GT    0.19704 0.2095    
## 5       FULTON               RIVER_ROAD   GT    0.14239 0.1141   *
## 6       COTATI          EAST_COTATI_AVE   GT    0.10823 0.0774   *
## 7  ROHNERTPARK           SOUTHWEST_BLVD   GT    0.09945 0.0686   *
## 8      WINDSOR         WINDSOR_RIVER_RD   GT    0.09902 0.0721   *
## 9       FULTON              FULTON_ROAD   GT    0.09146 0.0675   *
## 10   SANTAROSA       BELLEVUE/CORBY_AVE   FQ    0.08986 0.1151    
## 11 ROHNERTPARK        GOLF_COURSE_DRIVE   GT    0.08493 0.0571   *
## 12     WINDSOR               STARR_ROAD   GT    0.08308 0.0634   *
## 13      FULTON              SHILOH_ROAD   GT    0.08175 0.0628   *
## 14      FULTON             AIRPORT_BLVD   GT    0.08109 0.0626   *
## 15     WINDSOR            MITCHELL_LANE   GT    0.07851 0.0615   *
## 16      SONOMA         SEARSPOINT/ST-37   GT    0.07651 0.1969    
## 17   SANTAROSA          SAN_MIGUEL_ROAD   GT    0.07650 0.1084    
## 18   SANTAROSA               BARHAM_AVE   GT    0.07406 0.1071    
## 19 ROHNERTPARK        ROHNERT_PARK_EXPY   GT    0.06865 0.0492   *
## 20  HEALDSBURG            BAILHACHE_AVE   GT    0.06430 0.0478   *
## 21      COTATI        EAST_RAILROAD_AVE   GT    0.06325 0.0476   *
## 22   SANTAROSA          WEST_ROBLES_AVE   GT    0.06120 0.0470   *
## 23   SANTAROSA                TODD_ROAD   GT    0.06070 0.0469   *
## 24 ROHNERTPARK               SCENIC_AVE   GT    0.06020 0.0468   *
## 25  HEALDSBURG            LIMERICK_LANE   GT    0.05873 0.0465   *
## 26  HEALDSBURG               GRANT_AVE.   GT    0.05729 0.0462   *
## 27    PETALUMA           NMCDOWELL_BLVD   GT    0.05547 0.0573    
## 28   PENNGROVE         MAIN_ST/WOODWARD   GT    0.02283 0.0278    
## 29   PENNGROVE           OLD_ADOBE_ROAD   GT    0.02168 0.0271    
## 30    PETALUMA               ELY_ROAD_N   GT    0.01990 0.0261    
## 31   SANRAFAEL               2ND_STREET   GT    0.01534 0.0399    
## 32   SANTAROSA              STEELE_LANE   GT    0.01358 0.0323    
## 33  HEALDSBURG        GRANT_SCHOOL_ROAD   NO    0.01011 0.0175    
## 34    PETALUMA         SOUTH_POINT_ROAD   NO    0.01011 0.0175    
## 35   SANRAFAEL               3RD_STREET   GT    0.00828 0.0209    
## 36      SONOMA          SR_ROUTE_12/121   GT    0.00765 0.0154    
## 37  HEALDSBURG           HEALDSBURG_AVE   GT    0.00742 0.0158    
## 38   SANTAROSA           COLLEGE_AVENUE   GT    0.00703 0.0230    
## 39   SANRAFAEL              IRWINSTREET   HS    0.00378 0.0101    
## 40      SONOMA        TOULAY_CREEK_ROAD   HS    0.00369 0.0092    
## 41      SONOMA          SR_ROUTE_12/121   GT    0.00273 0.0078    
## 42  HEALDSBURG             FRONT_STREET   GT    0.00234 0.0068    
## 43      NOVATO             ROBLAR_DRIVE   GT    0.00151 0.0059    
## 44      SONOMA         SKAGGS_ISLAND_RD   GT    0.00144 0.0046    
## 45      NOVATO            HAMILTON_PKWY   GT    0.00137 0.0055    
## 46      NOVATO            BIKE_PATH_PED   GT    0.00131 0.0053    
## 47   SANTAROSA          SEBASTOPOL_ROAD   GT    0.00114 0.0048    
## 48    PETALUMA            HOPPER_STREET   FL    0.00043 0.0018    
## 49    PETALUMA          W_PAYRAN_STREET   GT    0.00029 0.0015    
## 50   SANRAFAEL           FRANCISCO_BLVD   XB    0.00029 0.0013    
## 51   SANRAFAEL              RICE_STREET   FL    0.00029 0.0013    
## 52   SANRAFAEL           SMITH_RANCH_RD   GT    0.00023 0.0012    
## 53   SANTAROSA             STANDISH_AVE   SS    0.00022 0.0010    
## 54      NOVATO           HANNA_RANCH_RD   GT    0.00017 0.0010    
## 55      NOVATO    Franklin_Ped_crossing   FL    0.00017 0.0010    
## 56        NAPA              MILTON_ROAD   GT    0.00015 0.0007    
## 57      NOVATO                 PED_XING   HS    0.00013 0.0007    
## 58   SANRAFAEL          N_SANPEDRO_ROAD   GT    0.00009 0.0005    
## 59   SANTAROSA         GUERNEVILLE_ROAD   GT    0.00007 0.0005    
## 60    PETALUMA         LAKEVILLE_STREET   GT    0.00007 0.0004    
## 61      NOVATO                GRANT_AVE   GT    0.00004 0.0003    
## 62   SANRAFAEL           PACHECO_STREET   GT    0.00004 0.0002    
## 63   SANRAFAEL                 PED_XING   FL    0.00004 0.0003    
## 64   SANRAFAEL            PALOMA_AVENUE   GT    0.00004 0.0003    
## 65   SANTAROSA               9TH_STREET   FQ    0.00003 0.0002    
## 66 ROHNERTPARK                 PED_XING   GT    0.00003 0.0002    
## 67      NOVATO          RUSHCREEK_PLACE   FQ    0.00002 0.0001    
## 68    PETALUMA        WASHINGTON_STREET   GT    0.00000 0.0000    
## 69   SANRAFAEL               4TH_STREET   FQ    0.00000 0.0000    
## 70    PETALUMA            EAST_D_STREET   FQ    0.00000 0.0000    
## 71   SANTAROSA               3RD_STREET   GT    0.00000 0.0000    
## 72      NOVATO                OLIVE_AVE   GT    0.00000 0.0000    
## 73   SANRAFAEL           MISSION_AVENUE   FQ    0.00000 0.0000    
## 74   SANRAFAEL          CIVIC_CENTER_DR   GT    0.00000 0.0000    
## 75   SANRAFAEL               5TH_AVENUE   FQ    0.00000 0.0000    
## 76   SANTAROSA               8TH_STREET   GT    0.00000 0.0000    
## 77   SANTAROSA               7TH_STREET   GT    0.00000 0.0000    
## 78   SANTAROSA               6TH_STREET   GT    0.00000 0.0000    
## 79    PETALUMA           CAULFIELD_LANE   GT    0.00000 0.0000    
## 80   SANTAROSA Railroad_square_ped_xing   FL    0.00000 0.0000

Let us select all the data with a Prediction (probability of collison) greater than 1 in 100 (0.01). There are 10.

The table shows 34 crossings have probabilities of a collision in the next year greater than 0.01 or 1 in 100. If we look at the lowest ones, we will screen for the ones with a Prediction less than .0001, or one in ten thousand. We obtain 23 crossings with less than 1 in 10000.

Importance of variables

Which variable is most important? The package caret gives us a method for deciding. We see below that TTBL_SPD is a bit more important than TOT_TRN, and AADT is less important. But they are not greatly different in importance.

library(caret)
caret::varImp(glm.fit3,scale=TRUE)
##            Overall
## TOT_TRN  1.1546445
## TTBL_SPD 1.2648867
## AADT     0.9104228

Examine residuals

Residual analysis is important as well, to see if we have a decent model for prediction. We call this candidatemodel LOGR.

glm.fit3_data = data.frame( RANK=dataf$RANK, name=paste0(dataf$CITY, "; ", substr(dataf$ROAD,1,20) ),
                            std_resids = scale(glm.fit3$residuals),  C = dataf$C, L=glm.fit3$fitted.values, W=dataf$PRED_COLLS )
ggplot(glm.fit3_data, aes( RANK, std_resids, color=factor(C) ) )  + 
  geom_point(alpha = .5) +
  geom_hline(yintercept = 3, color = "orange") + geom_hline( yintercept=0, color = "blue") +
  labs( x = "RANK by WABS probability", y="Standardized residual", 
        title= "Standardized residuals of crossings using LOGR regression", subtitle = "Crossings with collisons have large positive standard errors, others negative standard errors." )

As the above graph and table following show, only three crossings have residuals over 0, and these are the ones with collisions actually occurring. Which are these?

glm.fit3_data %>% filter ( std_resids  > 0 )
##   RANK                           name std_resids C          L        W
## 1    1           SANTAROSA; HEARN_AVE   1.868924 1 0.24901172 0.171142
## 2    6 ROHNERTPARK; GOLF_COURSE_DRIVE   5.012333 1 0.08492718 0.046391
## 3   17           SANTAROSA; TODD_ROAD   6.916577 1 0.06069780 0.036735

Comparing with the WABS predictions

We will now compare our predictions from LOGR with the WABS predictions in the field PRED_COLLS. The easiest way to do this is via a graph, plotting the PRED against PRED_COLLS. The graph shows that for the most part LOGR predicts a higher probability than WABS, particularly when the probability is over about .025, or 2.5%. Let us then exclude all the ones with WABS predicted collision likelihood less than .02, and the two with WABS prediction above 0.1. This will show us the graph in a more detailed scale. Even in this smaller group it is clear that the LOGR method is obtaining a higher probability than the WABS predictions.
Comparisons of WABS and LOGR predictions

Comparisons of WABS and LOGR predictions

Predictions using interaction terms

Because of this upward bias in the LOGR predictions, we now revise the LOGA model to include only the interaction between TTBL_SPD and AADT. AIC is now negative, signifying a better fit. We first show a histogram of the distribution of prediction gap between LOGA and WABS. There are only 6 outliers with more absolute gap than 0.05.

glm.spdxaadt = glm(glm(C~TTBL_SPD:AADT, data=dataf, family="binomial"))  #+TTBL_SPD+AADT
s1=summary(glm.spdxaadt)
newdata = data.frame( HWYLNS= dataf$HWYLNS,TOT_TRN=dataf$TOT_TRN, TTBL_SPD=dataf$TTBL_SPD, AADT=dataf$AADT)
spdxaadt.pred=predict(glm.spdxaadt, newdata, type="response", se.fit=TRUE)
glm.spdxaadt_data = data.frame( RANK=dataf$RANK, name=paste0(dataf$CITY, "; ", substr(dataf$ROAD,1,20) ),
                            std_resids = round(scale(glm.spdxaadt$residuals),4),  C = dataf$C, 
                            L=round(glm.spdxaadt$fitted.values,4), W=round(dataf$PRED_COLLS,4) ) %>% 
  mutate ( gap=round(L-W, 4) )
ggplot(glm.spdxaadt_data, aes(gap)) +
  geom_histogram (color="red", alpha=0.2, size=1) +
  coord_cartesian( ylim= c(0,30), xlim=c(-0.05, 0.15)) +
  geom_vline ( xintercept = 0) +
  labs(title=paste0("Distribution of predictions from LOGA with TTBL_SPD:AADT \n AIC = ", round(s1$aic,3) ) )

We display the big 6 in descending order of the gap between LOGA and WABS (LOGA - WABS). Only one gap is negative.

glm.spdxaadt_data %>% filter ( !between (gap, -.05,.05) ) %>% arrange ( desc(abs(gap) ) )
##   RANK                        name std_resids C      L      W     gap
## 1    5       SANTAROSA; PINER_ROAD    -1.0467 0 0.1964 0.0485  0.1479
## 2    4   SANTAROSA; COLLEGE_AVENUE    -0.9047 0 0.1697 0.0494  0.1203
## 3   16       PETALUMA; CORONA_ROAD    -0.6337 0 0.1189 0.0371  0.0818
## 4   31    SONOMA; SEARSPOINT/ST-37    -0.5296 0 0.0994 0.0212  0.0782
## 5    9 SANTAROSA; GUERNEVILLE_ROAD    -0.6343 0 0.1190 0.0426  0.0764
## 6    2      SANTAROSA; STEELE_LANE    -0.3623 0 0.0680 0.1293 -0.0613

Attempt to improve on LOGA

We try adding TOT_TRN into LOGA. the new AIC is -34.92129, not as good as the previous try. Now there are 10 that miss by more than 0.05. So we reject this version.

##    RANK                        name std_resids C      L      W     gap
## 1     5       SANTAROSA; PINER_ROAD    -1.1160 0 0.2078 0.0485  0.1593
## 2     4   SANTAROSA; COLLEGE_AVENUE    -0.9399 0 0.1750 0.0494  0.1256
## 3    31    SONOMA; SEARSPOINT/ST-37    -0.7401 0 0.1378 0.0212  0.1166
## 4    43       SANRAFAEL; 2ND_STREET    -0.5103 0 0.0950 0.0141  0.0809
## 5     2      SANTAROSA; STEELE_LANE    -0.2674 0 0.0498 0.1293 -0.0795
## 6    16       PETALUMA; CORONA_ROAD    -0.6223 0 0.1159 0.0371  0.0788
## 7     9 SANTAROSA; GUERNEVILLE_ROAD    -0.6046 0 0.1126 0.0426  0.0700
## 8    52       SANRAFAEL; 3RD_STREET    -0.4191 0 0.0780 0.0115  0.0665
## 9    28    PETALUMA; NMCDOWELL_BLVD    -0.4424 0 0.0824 0.0236  0.0588
## 10   40          FULTON; RIVER_ROAD    -0.3597 0 0.0670 0.0158  0.0512

Now there are 10 above 0.05. This is something of a fit but still is skewed right and bimodal. Personally I’d prefer the one with just the two variable interaction. We plot the histogram below.

Extending model with TOT_TRN

Now we add in HWYLNS as a factor with TOT_TRN. We now obtain 17 gaps outside the (-0.05,0.05) interval.

##    RANK                           name std_resids C       L      W     gap
## 1    43          SANRAFAEL; 2ND_STREET    -1.1043 0  0.2016 0.0141  0.1875
## 2    36       PETALUMA; CAULFIELD_LANE     0.6167 0 -0.1126 0.0172 -0.1298
## 3    31       SONOMA; SEARSPOINT/ST-37    -0.7887 0  0.1440 0.0212  0.1228
## 4    11          SANTAROSA; 3RD_STREET     0.4189 0 -0.0765 0.0423 -0.1188
## 5     1           SANTAROSA; HEARN_AVE     3.9278 1  0.2830 0.1711  0.1119
## 6    10        PETALUMA; EAST_D_STREET    -0.7606 0  0.1389 0.0424  0.0965
## 7     5          SANTAROSA; PINER_ROAD    -0.7204 0  0.1315 0.0485  0.0830
## 8    13          SANTAROSA; BARHAM_AVE    -0.6615 0  0.1207 0.0400  0.0807
## 9    16          PETALUMA; CORONA_ROAD    -0.6362 0  0.1161 0.0371  0.0790
## 10   19     SANRAFAEL; CIVIC_CENTER_DR    -0.5875 0  0.1073 0.0358  0.0715
## 11    9    SANTAROSA; GUERNEVILLE_ROAD    -0.6153 0  0.1123 0.0426  0.0697
## 12   52          SANRAFAEL; 3RD_STREET    -0.4426 0  0.0808 0.0115  0.0693
## 13   39 ROHNERTPARK; ROHNERT_PARK_EXPY     0.2321 0 -0.0424 0.0159 -0.0583
## 14   40             FULTON; RIVER_ROAD    -0.3765 0  0.0687 0.0158  0.0529
## 15   37        COTATI; EAST_COTATI_AVE    -0.3734 0  0.0682 0.0169  0.0513
## 16   29        NOVATO; RUSHCREEK_PLACE     0.1505 0 -0.0275 0.0235 -0.0510
## 17    3    PETALUMA; WASHINGTON_STREET     0.0000 0  0.0000 0.0504 -0.0504

The distribution of gaps is shown here.

The anova table indicates that the interaction of TTBL_SPD and AADT makes predictions whose error is not significantly different from that obtained by adding HWYLNS and TOT_TRN. We should therefore stick with the interaction only model. Six crossings bear closer investigation, since the predictions of the LOGR model are widely different from those of the WABS model. Adding HWYLNS as a factor does not improve the regression.

## Analysis of Deviance Table
## 
## Model 1: C ~ as.factor(HWYLNS) + TOT_TRN + TTBL_SPD:AADT
## Model 2: C ~ TTBL_SPD:AADT
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        73     2.6326                     
## 2        78     2.7802 -5 -0.14767   0.5359

Investigating particular crossings

The simpler model is clearly better. So we fix on LOGA using TTBL_SPD:AADT only. This raises the question of why these crossings are predicted differently by the LOGR model chosen. Only one of these, Steele Lane, is predicted lower by LOGR than WABS.

wh=glm.spdxaadt_data[which(abs(glm.spdxaadt_data$gap) >= 0.05),"RANK"]
mutate (dataf, L = glm.spdxaadt_data$L, gap = glm.spdxaadt_data$gap  )[ wh, ] %>%
  mutate ( name = substr(paste0(CITY, "; ",substr(ROAD, 1, 16)),1,30), W = round(PRED_COLLS,4) ) %>%
  select ( RANK, name, W, L, gap, C, TTBL_SPD, AADT, TOT_TRN, HWYLNS, TOT_TRK ) %>%
  arrange (desc ( gap))
## # A tibble: 6 x 11
##    RANK name       W      L     gap     C TTBL_SPD  AADT TOT_TRN HWYLNS
##   <int> <chr>  <dbl>  <dbl>   <dbl> <int>    <int> <dbl>   <int>  <dbl>
## 1     5 SANT~ 0.0485 0.196   0.148      0       79 23900      64      4
## 2     4 SANT~ 0.0494 0.170   0.120      0       60 26900      64      4
## 3    16 PETA~ 0.0371 0.119   0.0818     0       79 13810      60      2
## 4    31 SONO~ 0.0212 0.0994  0.0782     0       25 35600       6      2
## 5     9 SANT~ 0.0426 0.119   0.0764     0       40 27300      64      2
## 6     2 SANT~ 0.129  0.068  -0.0613     0       70  8100      64      3
## # ... with 1 more variable: TOT_TRK <int>

We note that Sears Point and St 37 crossing actually has 4 lanes of traffic, as a look at google maps shows. It seems that the somewhat higher probability imputed by L might be justified, although we cannot support inclusion of HWYLNS in the LOGR model.

Similarly Guerneville Rd has 4 highway lanes, though the database shows only 2. It has lower TTBL_SPD than College Ave, but higher AADT by a little. This would make it most similar to College Ave. We would expect a prediction higher for College Ave, and both methods get that, though the WABS difference is minimal.

Confidence intervals for predicted probabilities

We calculate 95% confidence intervals for the predictions from the interactions model to show how precise the predictions should be thought. We would expect that the WABS predictions should be within the 95% confidence interval for the LOGR predictions. Otherwise the WABS prediction could be anomalous. We do not have information on confidence intervals for the WABS predictions to see if they overlap those for LOGR.

pr = predict(glm.spdxaadt, type="response", se.fit=TRUE)
confint = round(data.frame(W = dataf$PRED_COLLS, L = pr$fit, gap=glm.spdxaadt_data$gap, 
                lwr= pr$fit - 1.96*pr$se.fit, upr = pr$fit + 1.96 * pr$se.fit),4 )%>% 
  mutate ( clen = upr - lwr, rank=dataf$RANK, name=paste0(dataf$CITY,"; ",substr(dataf$ROAD,1,20) ) )
confint[wh,]
##         W      L     gap    lwr    upr   clen rank
## 2  0.1293 0.0680 -0.0613 0.0141 0.1218 0.1077    2
## 4  0.0494 0.1697  0.1203 0.0147 0.3247 0.3100    4
## 5  0.0485 0.1964  0.1479 0.0122 0.3806 0.3684    5
## 9  0.0426 0.1190  0.0764 0.0181 0.2199 0.2018    9
## 16 0.0371 0.1189  0.0818 0.0181 0.2197 0.2016   16
## 31 0.0212 0.0994  0.0782 0.0181 0.1806 0.1625   31
##                           name
## 2       SANTAROSA; STEELE_LANE
## 4    SANTAROSA; COLLEGE_AVENUE
## 5        SANTAROSA; PINER_ROAD
## 9  SANTAROSA; GUERNEVILLE_ROAD
## 16       PETALUMA; CORONA_ROAD
## 31    SONOMA; SEARSPOINT/ST-37

Only for Steele Lane does the WABS prediction fall outside the 95% confidence limits for the L prediction. What is special about Steele Lane? A picture of the intersection from Google Maps is shown here: https://goo.gl/maps/HyDXqU8J5LG2 . While there are only 3 lanes of traffic, the inner lane is a left turn lane into Coffey which crosses the tracks. This perhaps encourages cars waiting for the left turn, and not wanting to miss the red light cycle at Coffee, to chance waiting in the crossing for the turn. With train timetable speeds of 70 here, and no knowledge of the light cycle, trains might not be able to stop; drivers, not knowing this, would be taken unawares by a train that could not stop. The confidence limits give us additional support for the LOGR model, though not necessarily in preference to the WABS model.

Conclusions

We have independently modeled the WABS data for the cycle ending in November 2018, and have obtained a logistic regression model. It predicts the probability of a collision at each crossing during a one year period by using two of the predictors in the WABS data set, TTBL_SPD and AADT. These are the timetable speed of the scheduled trains, and the average auto daily traffic at the crossing. The variables TOT_TRN and HWYLNS, the total number of trains per day, and the number of highway lanes at the crossing were found not to improve the model, as was also true of the discrete crossing type variable WD.

The model obtained regresses C upon the interaction term TTBL_SPD x AADT alone. Thus there is a complex relation between the predictors.

This model which we call the LOGR model predicts probabilities of collision at all 80 intersections. Only 6 crossings have predicted probabilities greater than 0.05 or 1 in 20 chance that observing the crossing for a year will detect a collision.

An analysis of these crossings shows that of the 6, two have erroneous data in the HWYLNS field in WABS data. This was discovered by inspecting the crossing via Google Maps interactive view. However these errata would not impact the predictions made by the LOGA model. We do not know whether the WABS predictions took the erroneous values into account.

Predicting the number of accidents in a year.

Assume that at crossings accident rates are independent; this is likely on a long linear line such as SMART. IN that case we might add the collision probabilities to obtain the number of accidents in a year for the whole line. The WABS method predicts almost 1.84 accidents per year, the LOGR method predicts 3 accidents per year. Last year (2018) there were 3 accidents at crossings on SMART. We could predict that there would be 3 again this year, if no changes are made at crossings.

Acknowledgements

The author thanks Steve Birdlebough for calling this interesting problem and the WABS dataset to his attention.

References

[1] WABS dataset

Appendix A Univariate logistic regressions

First we try the most significant variable, TTBL_SPD. A plot of the distribution of the gaps will show the range. These gaps appear distributed bimodal and skewed right(?) which means slightly higher estimates than WABS. But it is a pretty good result with no gap larger than about .075 in either direction.

Next we try TOT_TRN, the next most significant variable. We plot the distribution of gaps.

This variable skews gaps to the right, estimating in excess of the WABS estimates. But a look at the predicted values shows a certain amount of discretization, since the TOT_TRN predictor has only a few discrete values. This makes it not a good predictor to include.

Finally we try AADT on its own. We plot the distribution of the gaps.

Again there are relatively few values of the prediction, they tend to be low, and the busy intersections tend to be much lower.

The plots appear below.

grid.arrange(grobs=list(g1,g2,g3), nrow=2, ncol=2)
Comparisons of WABS and single-predictor LOGR predictions

Comparisons of WABS and single-predictor LOGR predictions

Appendix C Neural network prediction

We now use a simple neural network to predict the probabilities of accident at each crossing. We use the nnet routine in the caret package. Such a routine uses the data to fire adjacent nodes in the net at random, based on random numbers generated internally. Because we need to seed the neural network with a random number we cannot be assured we will get the same predictions each time. However, after we establish that the neural network works we can then bootstrap its results to get confidence intervals on the predictions. Start by setting the random number generation seed to a specific repeatable number. For comparison we select the same 3 variables we chose above. Since there are only 3 collisions, there could be difficulty training the neural net.

We can display the predicted probabilities with the WABS and LOGR predictions for comparison.

print(mutate( out, NNET=round(probs$`1`,4 ) ) )
##                               name   WABS    LOGR   NNET
## 1             SANTAROSA; HEARN_AVE 0.1711 0.30700 0.0989
## 2           SANTAROSA; STEELE_LANE 0.1293 0.26425 0.0223
## 3      PETALUMA; WASHINGTON_STREET 0.0504 0.24901 0.0989
## 4        SANTAROSA; COLLEGE_AVENUE 0.0494 0.19704 0.0989
## 5            SANTAROSA; PINER_ROAD 0.0485 0.14239 0.0989
## 6   ROHNERTPARK; GOLF_COURSE_DRIVE 0.0464 0.10823 0.0973
## 7            SANRAFAEL; 4TH_STREET 0.0458 0.09945 0.0989
## 8       SANRAFAEL; N_SANPEDRO_ROAD 0.0455 0.09902 0.0223
## 9      SANTAROSA; GUERNEVILLE_ROAD 0.0426 0.09146 0.0989
## 10         PETALUMA; EAST_D_STREET 0.0424 0.08986 0.0989
## 11           SANTAROSA; 3RD_STREET 0.0423 0.08493 0.0223
## 12      SANTAROSA; SAN_MIGUEL_ROAD 0.0412 0.08308 0.0223
## 13           SANTAROSA; BARHAM_AVE 0.0400 0.08175 0.0223
## 14               NOVATO; GRANT_AVE 0.0379 0.08109 0.0223
## 15               NOVATO; OLIVE_AVE 0.0377 0.07851 0.0223
## 16           PETALUMA; CORONA_ROAD 0.0371 0.07651 0.0223
## 17            SANTAROSA; TODD_ROAD 0.0367 0.07650 0.0223
## 18       SANRAFAEL; MISSION_AVENUE 0.0364 0.07406 0.0247
## 19      SANRAFAEL; CIVIC_CENTER_DR 0.0358 0.06865 0.0223
## 20           SANRAFAEL; 5TH_AVENUE 0.0354 0.06430 0.0223
## 21       PETALUMA; W_PAYRAN_STREET 0.0343 0.06325 0.0223
## 22   SANTAROSA; BELLEVUE/CORBY_AVE 0.0332 0.06120 0.0223
## 23      SANTAROSA; SEBASTOPOL_ROAD 0.0325 0.06070 0.0223
## 24       SANRAFAEL; SMITH_RANCH_RD 0.0324 0.06020 0.0223
## 25      PETALUMA; LAKEVILLE_STREET 0.0313 0.05873 0.0223
## 26           SANTAROSA; 9TH_STREET 0.0313 0.05729 0.0223
## 27            NOVATO; ROBLAR_DRIVE 0.0240 0.05547 0.0223
## 28        PETALUMA; NMCDOWELL_BLVD 0.0236 0.02283 0.0989
## 29         NOVATO; RUSHCREEK_PLACE 0.0235 0.02168 0.0223
## 30          SANRAFAEL; IRWINSTREET 0.0231 0.01990 0.0989
## 31        SONOMA; SEARSPOINT/ST-37 0.0212 0.01534 0.0989
## 32       SANRAFAEL; PACHECO_STREET 0.0200 0.01358 0.0223
## 33           SANTAROSA; 8TH_STREET 0.0190 0.01011 0.0223
## 34           SANTAROSA; 7TH_STREET 0.0180 0.01011 0.0223
## 35           SANTAROSA; 6TH_STREET 0.0180 0.00828 0.0223
## 36        PETALUMA; CAULFIELD_LANE 0.0172 0.00765 0.0223
## 37         COTATI; EAST_COTATI_AVE 0.0169 0.00742 0.0989
## 38     ROHNERTPARK; SOUTHWEST_BLVD 0.0163 0.00703 0.0989
## 39  ROHNERTPARK; ROHNERT_PARK_EXPY 0.0159 0.00378 0.0224
## 40              FULTON; RIVER_ROAD 0.0158 0.00369 0.0989
## 41      HEALDSBURG; HEALDSBURG_AVE 0.0149 0.00273 0.0989
## 42           NOVATO; HAMILTON_PKWY 0.0145 0.00234 0.0223
## 43           SANRAFAEL; 2ND_STREET 0.0141 0.00151 0.0989
## 44          NOVATO; HANNA_RANCH_RD 0.0136 0.00144 0.0223
## 45         SONOMA; SR_ROUTE_12/121 0.0129 0.00137 0.0989
## 46        HEALDSBURG; FRONT_STREET 0.0128 0.00131 0.0276
## 47     PENNGROVE; MAIN_ST/WOODWARD 0.0126 0.00114 0.0227
## 48       WINDSOR; WINDSOR_RIVER_RD 0.0126 0.00043 0.0304
## 49       SANRAFAEL; FRANCISCO_BLVD 0.0124 0.00029 0.0989
## 50       PENNGROVE; OLD_ADOBE_ROAD 0.0117 0.00029 0.0223
## 51         SONOMA; SR_ROUTE_12/121 0.0117 0.00029 0.0989
## 52           SANRAFAEL; 3RD_STREET 0.0115 0.00023 0.0989
## 53             FULTON; FULTON_ROAD 0.0114 0.00022 0.0225
## 54       HEALDSBURG; BAILHACHE_AVE 0.0112 0.00017 0.0223
## 55       COTATI; EAST_RAILROAD_AVE 0.0108 0.00017 0.0223
## 56      SANTAROSA; WEST_ROBLES_AVE 0.0099 0.00015 0.0223
## 57            PETALUMA; ELY_ROAD_N 0.0096 0.00013 0.0223
## 58         ROHNERTPARK; SCENIC_AVE 0.0093 0.00009 0.0223
## 59             WINDSOR; STARR_ROAD 0.0091 0.00007 0.0223
## 60             FULTON; SHILOH_ROAD 0.0085 0.00007 0.0223
## 61       HEALDSBURG; LIMERICK_LANE 0.0082 0.00004 0.0223
## 62            FULTON; AIRPORT_BLVD 0.0081 0.00004 0.0223
## 63             SANRAFAEL; PED_XING 0.0081 0.00004 0.0223
## 64         PETALUMA; HOPPER_STREET 0.0078 0.00004 0.0223
## 65               NAPA; MILTON_ROAD 0.0077 0.00003 0.0295
## 66          SANRAFAEL; RICE_STREET 0.0075 0.00003 0.0989
## 67        SONOMA; SKAGGS_ISLAND_RD 0.0068 0.00002 0.0226
## 68          HEALDSBURG; GRANT_AVE. 0.0064 0.00000 0.0223
## 69          WINDSOR; MITCHELL_LANE 0.0060 0.00000 0.0223
## 70       SONOMA; TOULAY_CREEK_ROAD 0.0052 0.00000 0.0223
## 71           FULTON; AVIATION_BLVD 0.0039 0.00000 0.0224
## 72         SANTAROSA; STANDISH_AVE 0.0038 0.00000 0.0239
## 73        SANRAFAEL; PALOMA_AVENUE 0.0016 0.00000 0.0223
## 74   HEALDSBURG; GRANT_SCHOOL_ROAD 0.0003 0.00000 0.0231
## 75      PETALUMA; SOUTH_POINT_ROAD 0.0003 0.00000 0.0231
## 76                NOVATO; PED_XING 0.0002 0.00000 0.0223
## 77    NOVATO; Franklin_Ped_crossin 0.0001 0.00000 0.0223
## 78 SANTAROSA; Railroad_square_ped_ 0.0001 0.00000 0.0223
## 79           NOVATO; BIKE_PATH_PED 0.0001 0.00000 0.0223
## 80           ROHNERTPARK; PED_XING 0.0001 0.00000 0.0452

The neural network does not seem very useful, as the probabilities are all rather low (perhaps appropriate!) and also have only a few unique values. This shows it does not discriminate enough among the cases to come up with a range of probabilities. There are too few accidents to model the data this way. The logistic regression does a better job.