When trying to use WABS data to put a human face on the results, it’s important not to misuse the data given. WABS gives a probability of collison in the next year given a collection of facts, both descriptive and numeric, about the crossing. WABS does nt reveal how this estimate was obtained.
WABS Accident data is contained in pdf documents downloadable from their website. We extracted the data from the pdf and edited the files using Notepad+ to get columns and field names aligned properly, and then used the RStudio import data feature to get it in. The R to do so is:
library(readr)
WABS_2019_01_28 <- read_table2("C:/Users/bruce/OneDrive/Documents/Friends House/Sonoma Transit/Accidents Prediction WABS data 2019 01 28.txt",skip = 96)
## Parsed with column specification:
## cols(
## .default = col_integer(),
## PRED_COLLS = col_double(),
## CROSSING = col_character(),
## RR = col_character(),
## STATE = col_character(),
## COUNTY = col_character(),
## CITY = col_character(),
## ROAD = col_character(),
## DATE_CHG = col_character(),
## WD = col_character(),
## HWYPVD = col_character(),
## AADT = col_number()
## )
## See spec(...) for full column specifications.
We first create a shorter, more generic name for the data frame. A summary of the data shows some NAs in AADT, and one in HWYLNS. We have to replace these
dataf = WABS_2019_01_28
summary(dataf)
## RANK PRED_COLLS CROSSING RR
## Min. : 1.00 Min. :0.000057 Length:80 Length:80
## 1st Qu.:20.75 1st Qu.:0.008430 Class :character Class :character
## Median :40.50 Median :0.015378 Mode :character Mode :character
## Mean :40.50 Mean :0.022960
## 3rd Qu.:60.25 3rd Qu.:0.034537
## Max. :80.00 Max. :0.171142
##
## STATE COUNTY CITY
## Length:80 Length:80 Length:80
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## ROAD C C1 C2
## Length:80 Min. :0.0000 Min. :0.0000 Min. :0
## Class :character 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0
## Mode :character Median :0.0000 Median :0.0000 Median :0
## Mean :0.0375 Mean :0.0125 Mean :0
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0
## Max. :1.0000 Max. :1.0000 Max. :0
##
## C3 C4 DATE_CHG WD
## Min. :0 Min. :0 Length:80 Length:80
## 1st Qu.:0 1st Qu.:0 Class :character Class :character
## Median :0 Median :0 Mode :character Mode :character
## Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0
##
## TOT_TRN TOT_TRK TTBL_SPD HWYPVD
## Min. : 0.00 Min. :0.000 Min. : 0.00 Length:80
## 1st Qu.: 8.00 1st Qu.:1.000 1st Qu.:25.00 Class :character
## Median :11.00 Median :1.000 Median :40.00 Mode :character
## Mean :34.41 Mean :1.238 Mean :38.73
## 3rd Qu.:61.00 3rd Qu.:1.000 3rd Qu.:45.00
## Max. :72.00 Max. :3.000 Max. :79.00
##
## HWYLNS AADT
## Min. :0.000 Min. : 1
## 1st Qu.:2.000 1st Qu.: 775
## Median :2.000 Median : 3122
## Mean :2.165 Mean : 6855
## 3rd Qu.:2.000 3rd Qu.: 8623
## Max. :5.000 Max. :48000
## NA's :1 NA's :4
HWYLNS should probably be replaced with a 2, since it is the number of highway lanes at the crossing, and could not be 0. And the AADT ones should be replaced with zeroes. We also note that HWYPVD is always YES, so it need not be considered as a discriminating response variable.
dataf$HWYLNS[which(is.na(dataf$HWYLNS) )] = 2
dataf$AADT[which(is.na(dataf$AADT))] = 0
unique(dataf$HWYPVD) #shows only YES for HWYPVD paved highway
## [1] "YES"
summary(dataf)
## RANK PRED_COLLS CROSSING RR
## Min. : 1.00 Min. :0.000057 Length:80 Length:80
## 1st Qu.:20.75 1st Qu.:0.008430 Class :character Class :character
## Median :40.50 Median :0.015378 Mode :character Mode :character
## Mean :40.50 Mean :0.022960
## 3rd Qu.:60.25 3rd Qu.:0.034537
## Max. :80.00 Max. :0.171142
## STATE COUNTY CITY
## Length:80 Length:80 Length:80
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## ROAD C C1 C2
## Length:80 Min. :0.0000 Min. :0.0000 Min. :0
## Class :character 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0
## Mode :character Median :0.0000 Median :0.0000 Median :0
## Mean :0.0375 Mean :0.0125 Mean :0
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0
## Max. :1.0000 Max. :1.0000 Max. :0
## C3 C4 DATE_CHG WD
## Min. :0 Min. :0 Length:80 Length:80
## 1st Qu.:0 1st Qu.:0 Class :character Class :character
## Median :0 Median :0 Mode :character Mode :character
## Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0
## TOT_TRN TOT_TRK TTBL_SPD HWYPVD
## Min. : 0.00 Min. :0.000 Min. : 0.00 Length:80
## 1st Qu.: 8.00 1st Qu.:1.000 1st Qu.:25.00 Class :character
## Median :11.00 Median :1.000 Median :40.00 Mode :character
## Mean :34.41 Mean :1.238 Mean :38.73
## 3rd Qu.:61.00 3rd Qu.:1.000 3rd Qu.:45.00
## Max. :72.00 Max. :3.000 Max. :79.00
## HWYLNS AADT
## Min. :0.000 Min. : 0
## 1st Qu.:2.000 1st Qu.: 700
## Median :2.000 Median : 2600
## Mean :2.163 Mean : 6512
## 3rd Qu.:2.000 3rd Qu.: 8275
## Max. :5.000 Max. :48000
Now the data look good for analysis. Column C is the count of collisions at the crossing. PRED_COLLS is the predicted probability of a collision at the crossing in one year. The data have ranked the crossings in descending order of probability of a collision. We do not know how these were calculated and elsewhere we will propose an alternate approach to their estimation.
Such estimates do allow a ranking of crossings based on the probabilities, and so give insight into the most problematic ones. However, crossings differ. Some are pedestrian only, some are located in busy areas with lots of vehicular traffic, some have more tracks than others; the trains themselves have timetable speeds faster or slower at the various crossings; there are different forms of gating; and there are more lanes of traffic crossing in some places. All these factors could affect the probability of an accident.
For many of the crossings, the current probabilities are quite low. Here we list those with probability less than .001; there are only 7, five of which are pedestrian crossings.
dataf[which(dataf$PRED_COLLS < .001), c("CITY","ROAD","PRED_COLLS")]
## # A tibble: 7 x 3
## CITY ROAD PRED_COLLS
## <chr> <chr> <dbl>
## 1 HEALDSBURG GRANT_SCHOOL_ROAD 0.000341
## 2 PETALUMA SOUTH_POINT_ROAD 0.000341
## 3 NOVATO PED_XING 0.000205
## 4 NOVATO Franklin_Ped_crossing 0.000145
## 5 SANTAROSA Railroad_square_ped_xing 0.000141
## 6 NOVATO BIKE_PATH_PED 0.000086
## 7 ROHNERTPARK PED_XING 0.000057
How should we interpret these in terms of everyday experience? The probability is so small as to not be comprehendible in terms of the actual risk of an accident should I be at that location. One possibility is to try to grasp how many years it might take before an accident might occur with a sensible probability such as 1%.
We used the WABS data on collisions at grade crossings, for the current period. For the seven pedestrian grade crossings in the SMART system above we estimated the number of years required to reach certain chances like a 1% (one in a hundred) or 0.1% (one in a thousand chance) that there would be an accident in that many years. The model treats each year as a binomial trial, like flipping a coin (albeit unbalanced, with a probability of success – a collision– the given WABS probability). The question then is how many trials are necessary to have a 1% (respectively 0.1%) chance of a collision occurring at least once. The chance of course is 1 - the cumulative binomial distribution function with probability p of 0 collisions, in x trials, and we need to determine the x for which the chance is 1% or 0.1%.
We wrote an algorithm in R to do this given the probability data from WABS for the crossings. The given data is shown here:
name = paste(dataf$CITY[74:80],dataf$ROAD[74:80],sep="; ")
prob = dataf$PRED_COLLS[74:80]
result=data.frame(name = name, prob = prob)
print(result)
## name prob
## 1 HEALDSBURG; GRANT_SCHOOL_ROAD 0.000341
## 2 PETALUMA; SOUTH_POINT_ROAD 0.000341
## 3 NOVATO; PED_XING 0.000205
## 4 NOVATO; Franklin_Ped_crossing 0.000145
## 5 SANTAROSA; Railroad_square_ped_xing 0.000141
## 6 NOVATO; BIKE_PATH_PED 0.000086
## 7 ROHNERTPARK; PED_XING 0.000057
We wrote a function, years_to_reach to take this data and compute three numbers. Outputs are; the number of periods (default years) till the percentage is realized (default pct=.01); the actual chance at that year; and the chance at the last period searched, which is set by default to 200 years. The function code is here.
years_to_reach = function ( prob, endcount = 200, by=1, pct=0.01) {
x=seq(0,endcount,by=by)
r=matrix(NA, nrow=length(prob), ncol = 3, dimnames=list(NULL, c("years","pyear","lastmax")))
for (p in 1:length(prob) ) {
likelihood = 1 - pbinom ( 0, x, prob[p])
r[p,1] = x[which(likelihood > pct)[1] ]
r[p,2] = likelihood[which(likelihood > pct)[1]]
r[p,3] = likelihood[length(x)]
}
r
}
Finally we can override the default percentage likelihood to .001, and display our results as follows:
lopxings=cbind(result, years_to_reach(prob, pct=0.001) )
print(lopxings)
## name prob years pyear
## 1 HEALDSBURG; GRANT_SCHOOL_ROAD 0.000341 3 0.001022651
## 2 PETALUMA; SOUTH_POINT_ROAD 0.000341 3 0.001022651
## 3 NOVATO; PED_XING 0.000205 5 0.001024580
## 4 NOVATO; Franklin_Ped_crossing 0.000145 7 0.001014559
## 5 SANTAROSA; Railroad_square_ped_xing 0.000141 8 0.001127443
## 6 NOVATO; BIKE_PATH_PED 0.000086 12 0.001031512
## 7 ROHNERTPARK; PED_XING 0.000057 18 0.001025503
## lastmax
## 1 0.06593722
## 2 0.06593722
## 3 0.04017490
## 4 0.02858558
## 5 0.02780802
## 6 0.01705365
## 7 0.01133559
We see that the 5 pedestrian crossings require at least 5 years of watching to have a 1 in 1000 chance of observing a collision. the Rohnert Park crossing would require 18 years of watching to have a 1 in 1000 chance of seeing a collision. The Healdsburg and Petaluma crossings have higher probabilities, and so only 3 years of watching is required to have a 1 in 1000 chance of seeing an accident.
One might complain that 1 in 1000 is still not sensible by humans. If we change the percentage to a 1% chance (0.01) or 1 in 100, which is the default in the function, we get the following:
lopxings=cbind(result, years_to_reach(prob, pct=0.01) )
print(lopxings)
## name prob years pyear lastmax
## 1 HEALDSBURG; GRANT_SCHOOL_ROAD 0.000341 30 0.01017958 0.06593722
## 2 PETALUMA; SOUTH_POINT_ROAD 0.000341 30 0.01017958 0.06593722
## 3 NOVATO; PED_XING 0.000205 50 0.01019869 0.04017490
## 4 NOVATO; Franklin_Ped_crossing 0.000145 70 0.01009939 0.02858558
## 5 SANTAROSA; Railroad_square_ped_xing 0.000141 72 0.01010135 0.02780802
## 6 NOVATO; BIKE_PATH_PED 0.000086 117 0.01001198 0.01705365
## 7 ROHNERTPARK; PED_XING 0.000057 177 0.01003856 0.01133559
These figures can be interpreted like this. If you stood at Healdsburg’s Grant School Road crossing for 30 years, you would have a 1% chance of seeing a crossing collision. You would have to stand at the Rohnert Park pedestrian crossing for 177 years to have a 1% chance of seeing a crossing collision. Watching for 200 years would still give you only a 1.13% chance of seeing an accident.
Any of these times is so long that of course the technology, schedules and conditions near the crossing would change so much that the original estimates would be useless. Indeed, if we look at data for 2017, the list of cxrossings and probabilities of collision are in different order, and different sizes. So we need to ask about the prediction method– how does WABC calculate the probabilities of collison.
I’ll write more about another method to estimate the probability of a collision another time.