Mobility Clustering

To combine Google Mobility Report data with, for instance, coronavirus prevalence in a specific area, we need to determine how the six reported mobility categories vary, so that we can distinguish patterns and correlate them with time periods. For instance:

  • What are the mobility characteristics of normal behavior periods?
  • When a complete lockdown is imposed, what are the new characteristics of mobility patterns?
  • How does mobility characterize transition periods?
  • What types of mobility do we see in the current period?
  • Can we predict what the mobility category variables will look like in the future, for, say, economic analyses?

We undertook a study of the Google Mobility Report data for two counties in California, Sonoma and Marin, and one in Arizona, Pima. We used clustering, a largely unsupervised form of data mining, to group the daily observations into clusters.

There are many algorithms for clustering data. Some are hierarchical (hclust) clustering, k-means clustering, partitioning around medoids (pam) clustering, large data (clara) clustering, fuzzy analysis (fanny) clustering, divisive analysis (diana) clustering, and agglomerative nesting (agnes) clustering. Others are classification and regression tree (CART) methods, including recursive partitioning (rpart), random forests, and extreme gradient boosting (xgboost).

Each of these methods can yield different results on the same data, so classification into clusters is not an exact science, to say the least. The table below gives some relevant features of the routines we chose and may help show what needs to be done to get each of them to work. The analyst (me) needs to guide each routine differently, according to these capabilities.


Missing data handling [1]
  • hclust, kmeans, pam, clara: Yes.
  • fanny, diana: Yes, for data frames.
  • agnes: Not documented in R.
  • rpart: Yes; you can specify an na.action. The default deletes rows with an NA response but keeps rows whose only NAs are in the predictors; a complicated rule assigns surrogate values to NA predictor data.
  • random forest: Yes; you can specify an na.action, such as na.omit, to omit NA data.
  • xgb: No; you need to interpolate missing predictor data.

Comparison clusters as response
  • hclust, kmeans, pam, clara, fanny, diana, agnes: No.
  • rpart: Yes; you specify a formula with the response and predictors.
  • random forest: Yes; you specify a formula with the response and predictors. You can run it in an unsupervised form, but I used a response.
  • xgb: Yes; you specify them as labels.

Specify number of clusters
  • hclust: You cut the tree to get the number of clusters you want.
  • kmeans, pam: Yes; you specify the number of clusters.
  • clara, fanny, diana, agnes: Yes.
  • rpart: No; you prune the tree (retain only part of it) to one that creates the number of clusters you want, based on 'cost'.
  • random forest: No; you get the number of clusters that the response has.
  • xgb: No; you get the number of clusters in your labels.

Calculate cluster from probabilities
  • hclust, kmeans, pam, clara, fanny, diana, agnes: No.
  • rpart, random forest, xgb: Yes; you choose the cluster with the largest probability.

Cross-validation with training and test data
  • hclust: No; produces the best tree.
  • kmeans, pam, fanny: No; you need to cross-validate yourself.
  • clara, diana, agnes: No.
  • rpart: Yes; cross-validation is used to determine the best 'cost' at each level.
  • random forest: Yes; done automatically in the program.
  • xgb: Yes; you need to train, test, and then predict.

Uses alternate trees
  • hclust: No; creates one hierarchical tree.
  • kmeans, pam, clara: No.
  • fanny: Kind of; assigns fuzzy (probability) membership in clusters.
  • diana: No; divides clusters using a dissimilarity measure until every point is its own cluster.
  • agnes: No; merges clusters using a measure of separation (average, single, or complete).
  • rpart: No; creates one tree, but by a recursive method that modifies the choices at each stage.
  • random forest: Yes; chooses using random trees. A forest is a collection of trees!
  • xgb: Yes; uses complex rules to consider trees rejected previously.

Table of feature handling for several clustering routines. [1] We eliminated missing data as a problem by interpolating missing values, carrying the earlier available value forward.
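As footnote [1] says, missing values were interpolated by carrying the earlier available value forward before any clustering was run. A minimal sketch of that step in R, assuming a data frame named mobility whose six Google category columns use hypothetical short names:

    library(zoo)  # na.locf = "last observation carried forward"

    # mobility: one row per day; the six mobility category columns (hypothetical names)
    cat_cols <- c("retail", "grocery", "parks", "transit", "workplaces", "residential")

    # replace each NA with the most recent earlier non-missing value in its column
    mobility[cat_cols] <- lapply(mobility[cat_cols],
                                 function(x) zoo::na.locf(x, na.rm = FALSE))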

The result of using any one of these routines, for our purposes, is a classification of the observed days of mobility data into four clusters, which turned out to be a reasonable number. I did not pick this number arbitrarily; preliminary analysis suggested that four was the right number of groups for the mobility data to fall into.
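That preliminary check is not reproduced here; one common way to do such a check (an illustration, not necessarily the analysis actually used) is factoextra's fviz_nbclust on a scaled numeric matrix of the six mobility categories, here called mob_mat:

    library(factoextra)

    # mob_mat: scaled numeric matrix, one row per day, one column per mobility category
    # average silhouette width for k = 2..10; the peak suggests a sensible cluster count
    fviz_nbclust(mob_mat, kmeans, method = "silhouette", k.max = 10)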

The first seven methods are run using eclust from the factoextra package in R. xgb requires a training stage and a final prediction stage, and does strong internal cross-validation, so it is probably more robust. rpart and random forest use different strategies and have not yet been fully integrated.
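A sketch of how the two stages might look in R. The object names (mob_mat, the cluster vectors) and the parameter values are illustrative, not the exact settings used in this study:

    library(factoextra)
    library(xgboost)

    # eclust-based methods, asking each for k = 4 clusters
    km <- eclust(mob_mat, FUNcluster = "kmeans", k = 4, graph = FALSE)
    hw <- eclust(mob_mat, FUNcluster = "hclust", k = 4, hc_method = "ward.D2", graph = FALSE)
    # km$cluster and hw$cluster give the cluster assigned to each day

    # xgboost: train against an existing 4-cluster assignment used as 0-based labels
    dtrain <- xgb.DMatrix(data = mob_mat, label = km$cluster - 1)
    fit    <- xgb.train(params = list(objective = "multi:softprob", num_class = 4),
                        data = dtrain, nrounds = 50)

    # predictions are class probabilities; take the cluster with the largest probability
    prob  <- matrix(predict(fit, dtrain), ncol = 4, byrow = TRUE)
    c_xgb <- max.col(prob)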

Below we compare the methods by showing the correlations between their cluster assignments. Some are quite different from others.
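One way such a comparison could be computed (a sketch; it assumes the per-day cluster labels from each method have been collected, consistently numbered, into a data frame clusters_df with columns such as c_ward, c_kse, c_xgb):

    # clusters_df: one row per day, one integer cluster-label column per method
    cor_mat <- cor(clusters_df)           # pairwise Pearson correlations
    round(cor_mat["c_ward", "c_xgb"], 2)  # e.g. agreement between ward and xgb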

In Sonoma County, kse, ward, fanny, and xgb all correlate well at over 85%.

For Marin County, ward, agnes, and xgb correlate at 97%; pam, fanny, diana, and agnes all correlate at 75% or better with kse, and agnes is actually over 90%.

In Pima County, ward, kse, agnes, and xgb all correlate at well over 97%, with agnes and ward giving perfect correlation.

So far, then, we can judge that xgb gives excellent, representative results for all three counties. ward also does quite well, but kse is not as good for Marin. xgb is probably more robust than the methods run via eclust, because it is cross-validated internally and uses information from rejected trees.

We now display some mobility time series graphs coloured by the different clustering methods. Scroll through the windows side by side to compare counties. Your eye should tell you which clustering you think is 'best'.
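A sketch of how one such graph could be drawn with ggplot2, assuming a hypothetical long-format data frame mob_long with a date, a category name, the mobility value, and the cluster assigned by one of the methods:

    library(ggplot2)

    # mob_long: columns date, category, value, cluster (hypothetical long format)
    ggplot(mob_long, aes(x = date, y = value, colour = factor(cluster))) +
      geom_point(size = 0.8) +
      facet_wrap(~ category, ncol = 2) +
      labs(colour = "Cluster", y = "% change from baseline",
           title = "Mobility categories coloured by cluster")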

Clustering Efficacy Scores for four Locations
Method Sonoma Marin Pima Arizona TOTAL
c_ward 20 16 23 23 82
c_compl 8 5 7 13 33
c_kse 19 3 23 23 68
c_pam 15 3 5 6 29
c_clara 15 1 5 6 27
c_fanny 5 4 6 7 22
c_agnes 15 3 23 23 64
c_diana 5 9 9 12 35
c_rpart 20 11 23 23 77
c_rf 20 14 23 23 80
c_xgb 20 15 23 23 81
MAX 20 16 23 23 82
Note:
Breakpoints are 90% = 4, 80% = 3, 70% = 2, 50% = 1.
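If, as the note suggests, each score is built by converting an agreement percentage into points at those breakpoints (my reading of the scoring rule), the conversion could look like this sketch:

    # convert an agreement value (0-1) into 0-4 points using the note's breakpoints:
    # >= 90% -> 4, >= 80% -> 3, >= 70% -> 2, >= 50% -> 1, otherwise 0
    score_points <- function(r) {
      cut(r, breaks = c(-Inf, 0.50, 0.70, 0.80, 0.90, Inf),
          labels = FALSE, right = FALSE) - 1
    }

    score_points(c(0.97, 0.85, 0.72, 0.40))  # 4 3 2 0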