When trying to use WABS data to put a human face on the results, it’s important not to misuse the data given. WABS gives a probability of collison in the next year given a collection of facts, both descriptive and numeric, about the crossing. WABS does nt reveal how this estimate was obtained.

WABS Accident data is contained in pdf documents downloadable from their website. We extracted the data from the pdf and edited the files using Notepad+ to get columns and field names aligned properly, and then used the RStudio import data feature to get it in. The R to do so is:

library(readr)
WABS_2019_01_28 <- read_table2("C:/Users/bruce/OneDrive/Documents/Friends House/Sonoma Transit/Accidents Prediction WABS data 2019 01 28.txt",skip = 96)
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   PRED_COLLS = col_double(),
##   CROSSING = col_character(),
##   RR = col_character(),
##   STATE = col_character(),
##   COUNTY = col_character(),
##   CITY = col_character(),
##   ROAD = col_character(),
##   DATE_CHG = col_character(),
##   WD = col_character(),
##   HWYPVD = col_character(),
##   AADT = col_number()
## )
## See spec(...) for full column specifications.

We first create a shorter, more generic name for the data frame. A summary of the data shows some NAs in AADT, and one in HWYLNS. We have to replace these

dataf = WABS_2019_01_28
summary(dataf)
##       RANK         PRED_COLLS         CROSSING              RR           
##  Min.   : 1.00   Min.   :0.000057   Length:80          Length:80         
##  1st Qu.:20.75   1st Qu.:0.008430   Class :character   Class :character  
##  Median :40.50   Median :0.015378   Mode  :character   Mode  :character  
##  Mean   :40.50   Mean   :0.022960                                        
##  3rd Qu.:60.25   3rd Qu.:0.034537                                        
##  Max.   :80.00   Max.   :0.171142                                        
##                                                                          
##     STATE              COUNTY              CITY          
##  Length:80          Length:80          Length:80         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##      ROAD                 C                C1               C2   
##  Length:80          Min.   :0.0000   Min.   :0.0000   Min.   :0  
##  Class :character   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0  
##  Mode  :character   Median :0.0000   Median :0.0000   Median :0  
##                     Mean   :0.0375   Mean   :0.0125   Mean   :0  
##                     3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :0  
##                                                                  
##        C3          C4      DATE_CHG              WD           
##  Min.   :0   Min.   :0   Length:80          Length:80         
##  1st Qu.:0   1st Qu.:0   Class :character   Class :character  
##  Median :0   Median :0   Mode  :character   Mode  :character  
##  Mean   :0   Mean   :0                                        
##  3rd Qu.:0   3rd Qu.:0                                        
##  Max.   :0   Max.   :0                                        
##                                                               
##     TOT_TRN         TOT_TRK         TTBL_SPD        HWYPVD         
##  Min.   : 0.00   Min.   :0.000   Min.   : 0.00   Length:80         
##  1st Qu.: 8.00   1st Qu.:1.000   1st Qu.:25.00   Class :character  
##  Median :11.00   Median :1.000   Median :40.00   Mode  :character  
##  Mean   :34.41   Mean   :1.238   Mean   :38.73                     
##  3rd Qu.:61.00   3rd Qu.:1.000   3rd Qu.:45.00                     
##  Max.   :72.00   Max.   :3.000   Max.   :79.00                     
##                                                                    
##      HWYLNS           AADT      
##  Min.   :0.000   Min.   :    1  
##  1st Qu.:2.000   1st Qu.:  775  
##  Median :2.000   Median : 3122  
##  Mean   :2.165   Mean   : 6855  
##  3rd Qu.:2.000   3rd Qu.: 8623  
##  Max.   :5.000   Max.   :48000  
##  NA's   :1       NA's   :4

HWYLNS should probably be replaced with a 2, since it is the number of highway lanes at the crossing, and could not be 0. And the AADT ones should be replaced with zeroes. We also note that HWYPVD is always YES, so it need not be considered as a discriminating response variable.

dataf$HWYLNS[which(is.na(dataf$HWYLNS) )] = 2
dataf$AADT[which(is.na(dataf$AADT))] = 0
unique(dataf$HWYPVD) #shows only YES for HWYPVD paved highway
## [1] "YES"
summary(dataf)
##       RANK         PRED_COLLS         CROSSING              RR           
##  Min.   : 1.00   Min.   :0.000057   Length:80          Length:80         
##  1st Qu.:20.75   1st Qu.:0.008430   Class :character   Class :character  
##  Median :40.50   Median :0.015378   Mode  :character   Mode  :character  
##  Mean   :40.50   Mean   :0.022960                                        
##  3rd Qu.:60.25   3rd Qu.:0.034537                                        
##  Max.   :80.00   Max.   :0.171142                                        
##     STATE              COUNTY              CITY          
##  Length:80          Length:80          Length:80         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##      ROAD                 C                C1               C2   
##  Length:80          Min.   :0.0000   Min.   :0.0000   Min.   :0  
##  Class :character   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0  
##  Mode  :character   Median :0.0000   Median :0.0000   Median :0  
##                     Mean   :0.0375   Mean   :0.0125   Mean   :0  
##                     3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :0  
##        C3          C4      DATE_CHG              WD           
##  Min.   :0   Min.   :0   Length:80          Length:80         
##  1st Qu.:0   1st Qu.:0   Class :character   Class :character  
##  Median :0   Median :0   Mode  :character   Mode  :character  
##  Mean   :0   Mean   :0                                        
##  3rd Qu.:0   3rd Qu.:0                                        
##  Max.   :0   Max.   :0                                        
##     TOT_TRN         TOT_TRK         TTBL_SPD        HWYPVD         
##  Min.   : 0.00   Min.   :0.000   Min.   : 0.00   Length:80         
##  1st Qu.: 8.00   1st Qu.:1.000   1st Qu.:25.00   Class :character  
##  Median :11.00   Median :1.000   Median :40.00   Mode  :character  
##  Mean   :34.41   Mean   :1.238   Mean   :38.73                     
##  3rd Qu.:61.00   3rd Qu.:1.000   3rd Qu.:45.00                     
##  Max.   :72.00   Max.   :3.000   Max.   :79.00                     
##      HWYLNS           AADT      
##  Min.   :0.000   Min.   :    0  
##  1st Qu.:2.000   1st Qu.:  700  
##  Median :2.000   Median : 2600  
##  Mean   :2.163   Mean   : 6512  
##  3rd Qu.:2.000   3rd Qu.: 8275  
##  Max.   :5.000   Max.   :48000

Now the data look good for analysis. Column C is the count of collisions at the crossing. PRED_COLLS is the predicted probability of a collision at the crossing in one year. The data have ranked the crossings in descending order of probability of a collision. We do not know how these were calculated and elsewhere we will propose an alternate approach to their estimation.

Such estimates do allow a ranking of crossings based on the probabilities, and so give insight into the most problematic ones. However, crossings differ. Some are pedestrian only, some are located in busy areas with lots of vehicular traffic, some have more tracks than others; the trains themselves have timetable speeds faster or slower at the various crossings; there are different forms of gating; and there are more lanes of traffic crossing in some places. All these factors could affect the probability of an accident.

For many of the crossings, the current probabilities are quite low. Here we list those with probability less than .001; there are only 7, five of which are pedestrian crossings.

dataf[which(dataf$PRED_COLLS < .001), c("CITY","ROAD","PRED_COLLS")]
## # A tibble: 7 x 3
##   CITY        ROAD                     PRED_COLLS
##   <chr>       <chr>                         <dbl>
## 1 HEALDSBURG  GRANT_SCHOOL_ROAD          0.000341
## 2 PETALUMA    SOUTH_POINT_ROAD           0.000341
## 3 NOVATO      PED_XING                   0.000205
## 4 NOVATO      Franklin_Ped_crossing      0.000145
## 5 SANTAROSA   Railroad_square_ped_xing   0.000141
## 6 NOVATO      BIKE_PATH_PED              0.000086
## 7 ROHNERTPARK PED_XING                   0.000057

How should we interpret these in terms of everyday experience? The probability is so small as to not be comprehendible in terms of the actual risk of an accident should I be at that location. One possibility is to try to grasp how many years it might take before an accident might occur with a sensible probability such as 1%.

We used the WABS data on collisions at grade crossings, for the current period. For the seven pedestrian grade crossings in the SMART system above we estimated the number of years required to reach certain chances like a 1% (one in a hundred) or 0.1% (one in a thousand chance) that there would be an accident in that many years. The model treats each year as a binomial trial, like flipping a coin (albeit unbalanced, with a probability of success – a collision– the given WABS probability). The question then is how many trials are necessary to have a 1% (respectively 0.1%) chance of a collision occurring at least once. The chance of course is 1 - the cumulative binomial distribution function with probability p of 0 collisions, in x trials, and we need to determine the x for which the chance is 1% or 0.1%.

We wrote an algorithm in R to do this given the probability data from WABS for the crossings. The given data is shown here:

name = paste(dataf$CITY[74:80],dataf$ROAD[74:80],sep="; ")
prob = dataf$PRED_COLLS[74:80]
result=data.frame(name = name, prob = prob)
print(result)
##                                  name     prob
## 1       HEALDSBURG; GRANT_SCHOOL_ROAD 0.000341
## 2          PETALUMA; SOUTH_POINT_ROAD 0.000341
## 3                    NOVATO; PED_XING 0.000205
## 4       NOVATO; Franklin_Ped_crossing 0.000145
## 5 SANTAROSA; Railroad_square_ped_xing 0.000141
## 6               NOVATO; BIKE_PATH_PED 0.000086
## 7               ROHNERTPARK; PED_XING 0.000057

We wrote a function, years_to_reach to take this data and compute three numbers. Outputs are; the number of periods (default years) till the percentage is realized (default pct=.01); the actual chance at that year; and the chance at the last period searched, which is set by default to 200 years. The function code is here.

years_to_reach = function ( prob, endcount = 200, by=1, pct=0.01) {
  x=seq(0,endcount,by=by)
  r=matrix(NA, nrow=length(prob), ncol = 3, dimnames=list(NULL, c("years","pyear","lastmax")))
  for (p in 1:length(prob) ) {
    likelihood = 1 - pbinom ( 0, x, prob[p])
    r[p,1] = x[which(likelihood > pct)[1] ]
    r[p,2] = likelihood[which(likelihood > pct)[1]]
    r[p,3] = likelihood[length(x)] 
  }
  r
}

Finally we can override the default percentage likelihood to .001, and display our results as follows:

lopxings=cbind(result, years_to_reach(prob, pct=0.001) )
print(lopxings)
##                                  name     prob years       pyear
## 1       HEALDSBURG; GRANT_SCHOOL_ROAD 0.000341     3 0.001022651
## 2          PETALUMA; SOUTH_POINT_ROAD 0.000341     3 0.001022651
## 3                    NOVATO; PED_XING 0.000205     5 0.001024580
## 4       NOVATO; Franklin_Ped_crossing 0.000145     7 0.001014559
## 5 SANTAROSA; Railroad_square_ped_xing 0.000141     8 0.001127443
## 6               NOVATO; BIKE_PATH_PED 0.000086    12 0.001031512
## 7               ROHNERTPARK; PED_XING 0.000057    18 0.001025503
##      lastmax
## 1 0.06593722
## 2 0.06593722
## 3 0.04017490
## 4 0.02858558
## 5 0.02780802
## 6 0.01705365
## 7 0.01133559

We see that the 5 pedestrian crossings require at least 5 years of watching to have a 1 in 1000 chance of observing a collision. the Rohnert Park crossing would require 18 years of watching to have a 1 in 1000 chance of seeing a collision. The Healdsburg and Petaluma crossings have higher probabilities, and so only 3 years of watching is required to have a 1 in 1000 chance of seeing an accident.

One might complain that 1 in 1000 is still not sensible by humans. If we change the percentage to a 1% chance (0.01) or 1 in 100, which is the default in the function, we get the following:

lopxings=cbind(result, years_to_reach(prob, pct=0.01) )
print(lopxings)
##                                  name     prob years      pyear    lastmax
## 1       HEALDSBURG; GRANT_SCHOOL_ROAD 0.000341    30 0.01017958 0.06593722
## 2          PETALUMA; SOUTH_POINT_ROAD 0.000341    30 0.01017958 0.06593722
## 3                    NOVATO; PED_XING 0.000205    50 0.01019869 0.04017490
## 4       NOVATO; Franklin_Ped_crossing 0.000145    70 0.01009939 0.02858558
## 5 SANTAROSA; Railroad_square_ped_xing 0.000141    72 0.01010135 0.02780802
## 6               NOVATO; BIKE_PATH_PED 0.000086   117 0.01001198 0.01705365
## 7               ROHNERTPARK; PED_XING 0.000057   177 0.01003856 0.01133559

These figures can be interpreted like this. If you stood at Healdsburg’s Grant School Road crossing for 30 years, you would have a 1% chance of seeing a crossing collision. You would have to stand at the Rohnert Park pedestrian crossing for 177 years to have a 1% chance of seeing a crossing collision. Watching for 200 years would still give you only a 1.13% chance of seeing an accident.

Any of these times is so long that of course the technology, schedules and conditions near the crossing would change so much that the original estimates would be useless. Indeed, if we look at data for 2017, the list of cxrossings and probabilities of collision are in different order, and different sizes. So we need to ask about the prediction method– how does WABC calculate the probabilities of collison.

I’ll write more about another method to estimate the probability of a collision another time.

  1. Copyright 2018 Bruce C Hartman Santa Rosa, CA 95409 USA