Mobility Clustering

To combine Google Mobility Report data with, for instance, coronavirus prevalence in a specific area, we need to determine how the six reported mobility categories vary, so that we can distinguish patterns and correlate them with time periods. For instance:

  • What are the mobility characteristics of normal behavior periods?
  • When a complete lockdown is imposed, what are the new characteristics of mobility patterns?
  • How does mobility characterize transition periods?
  • What types of mobility do we see in the current period?
  • Can we predict what the mobility category variables will look like in the future, for, say, economic analyses?

We undertook a study of the Google Mobility Report data for two counties in California, Sonoma and Marin, and one in Arizona, Pima. We used clustering, a largely unsupervised form of data mining, to group the daily observations into clusters.

There are many algorithms for clustering data. Some are hierarchical (hclust) clustering, k-means clustering, partitioning around medoids (pam) clustering, large data (clara) clustering, fuzzy analysis (fanny) clustering, divisive analysis (diana) clustering, and agglomerative nesting (agnes) clustering. Others are classification and regression tree (CART) methods, including recursive partitioning (rpart), random forests, and extreme gradient boosting (xgboost).

Each of these methods can yield different results on the same data, so classification into clusters is not an exact science, to say the least. The table below gives some relevant features of the routines we chose and may help show what needs to be done to get each of them to work. The analyst (me) needs to guide each routine differently, according to these capabilities.


Missing data handling [1]
  • hclust, kmeans, pam, clara: Yes.
  • fanny, diana: Yes, for data frames.
  • agnes: Not documented in R.
  • rpart: Yes; you can specify an na.action. The default deletes rows with an NA response but keeps rows whose only NAs are in the predictors; a complicated rule assigns surrogate values to NA predictor data.
  • random forest: Yes; you can specify an na.action, such as na.omit, to omit NA data.
  • xgb: No; you need to interpolate missing predictor data.

Comparison clusters as response
  • hclust, kmeans, pam, clara, fanny, diana, agnes: No.
  • rpart: Yes; you specify a formula with the response and predictors.
  • random forest: Yes; you specify a formula with the response and predictors. You can run it in an unsupervised form, but I used a response.
  • xgb: Yes; you specify them as labels.

Specify number of clusters
  • hclust: You cut the tree to get the number of clusters you want.
  • kmeans, pam: Yes; you specify the number of clusters.
  • clara, fanny, diana, agnes: Yes.
  • rpart: No; you prune the tree (retain only part of it) to one that creates the number of clusters you want, based on 'cost'.
  • random forest: No; you get the number of clusters that the response has.
  • xgb: No; you get the number of clusters in your labels.

Calculate cluster from probabilities
  • hclust, kmeans, pam, clara, fanny, diana, agnes: No.
  • rpart, random forest, xgb: Yes; you choose the cluster with the largest probability.

Cross-validation with training and test data
  • hclust: No; produces the best tree.
  • kmeans, pam, fanny: No; you need to cross-validate yourself.
  • clara, diana, agnes: No.
  • rpart: Yes; cross-validation is used to determine the best 'cost' at each level.
  • random forest: Yes; done automatically in the program.
  • xgb: Yes; you need to train, test, and then predict.

Uses alternate trees
  • hclust: No; creates one hierarchical tree.
  • kmeans, pam, clara: No.
  • fanny: Kind of; assigns fuzzy (probability) membership in clusters.
  • diana: No; divides clusters using a dissimilarity measure until every point is its own cluster.
  • agnes: No; merges clusters using a measure of separation (average, single, or complete).
  • rpart: No; creates one tree, but by a recursive method that modifies the choices at each stage.
  • random forest: Yes; chooses using random trees. A forest is a collection of trees!
  • xgb: Yes; uses complex rules to consider trees rejected previously.

Table of feature handling for several clustering routines. [1] We eliminated missing data as a problem by interpolating missing values, carrying the earlier available value forward.
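As footnote [1] says, missing values were interpolated by carrying the earlier available value forward before any clustering was run. A minimal sketch of that step in R, assuming a data frame named mobility whose six Google category columns use hypothetical short names:

    library(zoo)  # na.locf = "last observation carried forward"

    # mobility: one row per day; the six mobility category columns (hypothetical names)
    cat_cols <- c("retail", "grocery", "parks", "transit", "workplaces", "residential")

    # replace each NA with the most recent earlier non-missing value in its column
    mobility[cat_cols] <- lapply(mobility[cat_cols],
                                 function(x) zoo::na.locf(x, na.rm = FALSE))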

The result of using any one of these routines, for our purposes, is a classification of the observed days of mobility data into four clusters, which turned out to be a reasonable number. I did not pick this number arbitrarily; preliminary analysis suggested that four was the right number of groups for the mobility data to fall into.
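That preliminary check is not reproduced here; one common way to do such a check (an illustration, not necessarily the analysis actually used) is factoextra's fviz_nbclust on a scaled numeric matrix of the six mobility categories, here called mob_mat:

    library(factoextra)

    # mob_mat: scaled numeric matrix, one row per day, one column per mobility category
    # average silhouette width for k = 2..10; the peak suggests a sensible cluster count
    fviz_nbclust(mob_mat, kmeans, method = "silhouette", k.max = 10)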

The first seven methods are run using eclust from the factoextra package in R. xgb requires a training stage and a final prediction stage, and does strong internal cross-validation, so it is probably more robust. rpart and random forest use different strategies and have not yet been fully integrated.
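A sketch of how the two stages might look in R. The object names (mob_mat, the cluster vectors) and the parameter values are illustrative, not the exact settings used in this study:

    library(factoextra)
    library(xgboost)

    # eclust-based methods, asking each for k = 4 clusters
    km <- eclust(mob_mat, FUNcluster = "kmeans", k = 4, graph = FALSE)
    hw <- eclust(mob_mat, FUNcluster = "hclust", k = 4, hc_method = "ward.D2", graph = FALSE)
    # km$cluster and hw$cluster give the cluster assigned to each day

    # xgboost: train against an existing 4-cluster assignment used as 0-based labels
    dtrain <- xgb.DMatrix(data = mob_mat, label = km$cluster - 1)
    fit    <- xgb.train(params = list(objective = "multi:softprob", num_class = 4),
                        data = dtrain, nrounds = 50)

    # predictions are class probabilities; take the cluster with the largest probability
    prob  <- matrix(predict(fit, dtrain), ncol = 4, byrow = TRUE)
    c_xgb <- max.col(prob)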

Below we compare the methods by showing the correlations between their cluster assignments. Some are quite different from others.
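One way such a comparison could be computed (a sketch; it assumes the per-day cluster labels from each method have been collected, consistently numbered, into a data frame clusters_df with columns such as c_ward, c_kse, c_xgb):

    # clusters_df: one row per day, one integer cluster-label column per method
    cor_mat <- cor(clusters_df)           # pairwise Pearson correlations
    round(cor_mat["c_ward", "c_xgb"], 2)  # e.g. agreement between ward and xgb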

In Sonoma County, kse, ward, fanny, and xgb all correlate well at over 85%.

For Marin County, ward, agnes, and xgb correlate at 97%; pam, fanny, diana, and agnes all correlate at 75% or better with kse, and agnes is actually over 90%.

In Pima County, ward, kse, agnes, and xgb all correlate at well over 97%, with agnes and ward giving perfect correlation.

So far, then, we can judge that xgb gives excellent, representative results for all three counties. ward also does quite well, but kse is not as good for Marin. xgb is probably more robust than the methods run via eclust, because it is cross-validated internally and uses information from rejected trees.

We now display some mobility time series graphs coloured by the different clustering methods. Scroll through the windows side by side to compare counties. Your eye should tell you which clustering you think is 'best'.
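A sketch of how one such graph could be drawn with ggplot2, assuming a hypothetical long-format data frame mob_long with a date, a category name, the mobility value, and the cluster assigned by one of the methods:

    library(ggplot2)

    # mob_long: columns date, category, value, cluster (hypothetical long format)
    ggplot(mob_long, aes(x = date, y = value, colour = factor(cluster))) +
      geom_point(size = 0.8) +
      facet_wrap(~ category, ncol = 2) +
      labs(colour = "Cluster", y = "% change from baseline",
           title = "Mobility categories coloured by cluster")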

Clustering Efficacy Scores for four Locations
Method Sonoma Marin Pima Arizona TOTAL
c_ward 20 16 23 23 82
c_compl 8 5 7 13 33
c_kse 19 3 23 23 68
c_pam 15 3 5 6 29
c_clara 15 1 5 6 27
c_fanny 5 4 6 7 22
c_agnes 15 3 23 23 64
c_diana 5 9 9 12 35
c_rpart 20 11 23 23 77
c_rf 20 14 23 23 80
c_xgb 20 15 23 23 81
MAX 20 16 23 23 82
Note:
Breakpoints are 90% = 4, 80% = 3, 70% = 2, 50% = 1.
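If, as the note suggests, each score is built by converting an agreement percentage into points at those breakpoints (my reading of the scoring rule), the conversion could look like this sketch:

    # convert an agreement value (0-1) into 0-4 points using the note's breakpoints:
    # >= 90% -> 4, >= 80% -> 3, >= 70% -> 2, >= 50% -> 1, otherwise 0
    score_points <- function(r) {
      cut(r, breaks = c(-Inf, 0.50, 0.70, 0.80, 0.90, Inf),
          labels = FALSE, right = FALSE) - 1
    }

    score_points(c(0.97, 0.85, 0.72, 0.40))  # 4 3 2 0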