Post by Matthias Studer
Introduction
This document provides a very quick introduction to the
R
code needed to use parametric bootstraps for typology
validation in sequence analysis. Readers interested in the methods and
the exact interpretation of the results are referred to:
- Studer, M. (2021). Validating sequence analysis typologies using parametric bootstraps. Sociological Methodology 51(2), 290–318. https://doi.org/10.1177/00811750211014232
Creating the State Sequence Object
For this example, we will use the mvad
dataset. Let’s
start with the creation of the state sequence object.
## Loading the TraMineR library
library(TraMineR)
## Loading the data
data(mvad)
## State properties
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.lab <- c("employment", "further education", "higher education", "joblessness", "school", "training")
mvad.shortlab <- c("EM","FE","HE","JL","SC","TR")
## Creating the state sequence object
mvad.seq <- seqdef(mvad, 17:86, alphabet = mvad.alphabet, states = mvad.shortlab, labels = mvad.lab, xtstep = 6)
Creating the typology
We will now create a typology using cluster analysis. Readers
interested in more detail are referred to the
WeightedCluster
library manual, which goes into the details
of the creation and computation of cluster quality measures.
We start by computing dissimilarities with the seqdist
function and the Hamming distance. We then use Ward clustering to create
a typology of the trajectories. For this step, we recommend the use of
the fastcluster
library, which considerably speed up the
computations.
## Using fastcluster for hierarchical clustering
library(fastcluster)
## Distance computation
diss <- seqdist(mvad.seq, method="HAM")
## Hierarchical clustering
hc <- hclust(as.dist(diss), method="ward.D")
We can now compute several cluster quality indices using
as.clustrange
function from two to ten groups.
# Loading the WeightedCluster library
library(WeightedCluster)
# Computing cluster quality measures.
clustqual <- as.clustrange(hc, diss=diss, ncluster=10)
clustqual
## PBC HG HGSD ASW ASWw CH R2 CHsq R2sq HC
## cluster2 0.55 0.71 0.69 0.39 0.39 184.57 0.21 347.47 0.33 0.14
## cluster3 0.49 0.57 0.57 0.27 0.27 139.53 0.28 261.70 0.42 0.22
## cluster4 0.46 0.58 0.57 0.24 0.24 124.40 0.35 228.94 0.49 0.23
## cluster5 0.51 0.68 0.67 0.27 0.28 118.20 0.40 241.78 0.58 0.18
## cluster6 0.52 0.70 0.70 0.28 0.29 117.08 0.45 241.96 0.63 0.17
## cluster7 0.55 0.78 0.78 0.32 0.32 114.14 0.49 245.83 0.68 0.13
## cluster8 0.56 0.82 0.82 0.32 0.33 109.22 0.52 244.85 0.71 0.11
## cluster9 0.57 0.85 0.85 0.34 0.35 105.74 0.55 255.12 0.74 0.09
## cluster10 0.55 0.85 0.84 0.32 0.33 101.81 0.57 244.05 0.76 0.10
Parametric Bootstrap
Parametric bootstrap aims to provide baseline values obtained by
clustering similar but non-clustered data. This can be
computed using the seqnullcqi
function with the following
parameters:
R
: number of bootstraps.model
: The null model (see table belowseqdist.args
: list of arguments passed toseqdist
(should be identical to first call to seqdist).hclust.method
: hierarchical clustering method (should be identical to orginal clustering).kmedoid
: IfTRUE
, use PAM (and thewcKMedRange
function) instead of hierarchical clustering.parallel
: IfTRUE
, use parallel computing to speed up the computations.
Combined Randomization
The following R
code estimate expected values of the
cluster quality indices when clustering similar sequences that are not
clustered according to the "combined"
model, using the
Hamming distance and Ward hierarchical clustering. We set
parallel=TRUE
to use parallel computing. You can use
progressbar=TRUE
to show a progress bar of the computations
(not meaningful here within a document)
bcq.combined <- seqnullcqi(mvad.seq, clustqual, R=1000, model="combined", seqdist.args=list(method="HAM"), hclust.method="ward.D", parallel = TRUE)
Once the parametric bootstrap is computed (may take a while…), the
results are stored in the bcq.combined
object. Printing the
object (just by writing its name), already provides several information,
the standardized cluster quality indices and the inconclusive interval
(standardized values).
bcq.combined
## Parametric bootstrap cluster analysis validation
## Sequence analysis null model: list(model = "combined")
## Number of bootstraps: 1000
## Clustering method: hclust with ward.D
## Seqdist arguments: list(method = "HAM")
##
##
## PBC HG HGSD ASW
## cluster2 6.15 6.06 6.15 15.14
## cluster3 4.28 4.31 4.4 8.23
## cluster4 2.28 2.65 2.72 5.3
## cluster5 2.11 2.4 2.46 6.05
## cluster6 1.33 1.67 1.7 5.5
## cluster7 2.29 3.12 3.16 8.29
## cluster8 3.09 4.12 4.15 9.8
## cluster9 3.86 4.68 4.71 11.81
## cluster10 3.24 4.23 4.26 11.69
##
## Null Max-T 0.95 interval [-0.08; 2.3] [0.01; 2.15] [0; 2.16] [-0.22; 2.33]
## ASWw CH R2 CHsq
## cluster2 15.06 29.07 24.65 32.36
## cluster3 8.18 24.52 19.9 24.89
## cluster4 5.26 23.67 18.49 20.78
## cluster5 6.01 24.59 18.46 22.79
## cluster6 5.45 26.91 19.27 22.57
## cluster7 8.23 30.77 21.2 26.43
## cluster8 9.72 33.57 22.6 29.86
## cluster9 11.7 36.88 24.2 35.71
## cluster10 11.55 39.15 25.2 36.89
##
## Null Max-T 0.95 interval [-0.23; 2.33] [-1.16; 2.57] [-1.16; 2.49] [-1.1; 2.61]
## R2sq HC
## cluster2 24.15 -5.18
## cluster3 17.56 -4.75
## cluster4 14.23 -3.23
## cluster5 14.22 -3.22
## cluster6 13.39 -2.8
## cluster7 14.71 -4.84
## cluster8 15.91 -6.4
## cluster9 17.68 -7.47
## cluster10 18.11 -7.36
##
## Null Max-T 0.95 interval [-1.1; 2.49] [-0.15; 2.77]
Looking at the numbers, 2, 9 and 10 groups stand out. To get
non-standardized values, use norm=FALSE
. Notice that
inconclusive ASW intervals are well below the recommended values (over
0.5).
print(bcq.combined, norm=FALSE)
## Parametric bootstrap cluster analysis validation
## Sequence analysis null model: list(model = "combined")
## Number of bootstraps: 1000
## Clustering method: hclust with ward.D
## Seqdist arguments: list(method = "HAM")
##
##
## PBC HG HGSD ASW
## cluster2 0.55 0.71 0.69 0.39
## cluster3 0.49 0.57 0.57 0.27
## cluster4 0.46 0.58 0.57 0.24
## cluster5 0.51 0.68 0.67 0.27
## cluster6 0.52 0.7 0.7 0.28
## cluster7 0.55 0.78 0.78 0.32
## cluster8 0.56 0.82 0.82 0.32
## cluster9 0.57 0.85 0.85 0.34
## cluster10 0.55 0.85 0.84 0.32
##
## Null Max-T 0.95 interval [0.45; 0.54] [0.69; 0.78] [0.68; 0.78] [0.14; 0.2]
## ASWw CH R2 CHsq
## cluster2 0.39 184.57 0.21 347.47
## cluster3 0.27 139.53 0.28 261.7
## cluster4 0.24 124.4 0.35 228.94
## cluster5 0.28 118.2 0.4 241.78
## cluster6 0.29 117.08 0.45 241.96
## cluster7 0.32 114.14 0.49 245.83
## cluster8 0.33 109.22 0.52 244.85
## cluster9 0.35 105.74 0.55 255.12
## cluster10 0.33 101.81 0.57 244.05
##
## Null Max-T 0.95 interval [0.15; 0.21] [40.07; 56.73] [0.31; 0.34] [74.6; 99.79]
## R2sq HC
## cluster2 0.33 0.14
## cluster3 0.42 0.22
## cluster4 0.49 0.23
## cluster5 0.58 0.18
## cluster6 0.63 0.17
## cluster7 0.68 0.13
## cluster8 0.71 0.11
## cluster9 0.74 0.09
## cluster10 0.76 0.1
##
## Null Max-T 0.95 interval [0.48; 0.53] [0.33; 0.44]
Several plots can then be used to inspect the results using the
plot
command and the type
argument. First, one
can look at the sequences generated by the null model by using
type="seqdplot"
.
plot(bcq.combined, type="seqdplot")
The overall distribution of the CQI values can be plotted using
type="density"
. In this case, one also needs to specify the
CQI to be used. All tested number of groups are found to be significant.
Any CQI computed by as.clustrange()
can be used here. To
show the density of the average silhouette width ("ASW"
),
one can use:
plot(bcq.combined, stat="ASW", type="density")
By using type="line"
, we plot the obtained and
bootstrapped CQI values depending on the number of groups. Here
again
plot(bcq.combined, stat="ASW", type="line")
Randomized Sequencing
To use another null model, one only needs to change the
model
argument of the seqnullcqi
function. The
randomized sequencing keep the duration attached to each state, but
randomizes the ordering of the spells. It can be used to uncover
sequencing structure of the data.
bcq.seq <- seqnullcqi(mvad.seq, clustqual, R=1000, model="sequencing", seqdist.args=list(method="HAM"), hclust.method="ward.D", parallel = TRUE)
We can then plot the results as before. Notice that solutions between 3 and 6 are below the critical line.
plot(bcq.seq, stat="ASW", type="line")