Cluster analysis

The Cluster analysis module groups observations, variables, or both into clusters based on similarity. It supports three modes — case clustering, variable clustering, and bi-clustering — with multiple algorithms per mode. A three-step workflow walks you through choosing a method, finding the optimal number of clusters, and running the full analysis with validity metrics and visualizations.

What is cluster analysis? Imagine you have survey responses from 500 people with 20 measured variables. Cluster analysis finds groups of people who answered similarly — without you telling it what the groups should be. Unlike classification (where you already know the categories), clustering discovers categories from the data. It’s used in market segmentation (“what customer types do we have?”), biology (“which species are similar?”), and psychology (“are there distinct response profiles?”).

  1. Select your variables (at least 2 numeric)
  2. Choose a clustering mode and algorithm
  3. Set the cluster range and click Analyze & determine k to find the best number of clusters
  4. Set k, toggle validity metrics and output options
  5. Click Run cluster analysis (or Run biclustering analysis)

Requirements

At least 2 numeric variables must be selected. Non-numeric variables are automatically excluded (and listed in the output).

Step 1: Method & settings

Clustering mode

| Mode | What it groups | When to use |
|---|---|---|
| Case clustering | Rows (observations) | “What types of participants are in my data?” |
| Variable clustering | Columns (variables) | “Which of my variables behave similarly?” — an alternative to factor analysis for grouping variables. |
| Bi-clustering | Rows and columns simultaneously | “Are there subsets of observations that are similar on a subset of variables?” — useful when the cluster structure doesn’t span all variables. |

Algorithms for case and variable clustering

| Algorithm | Description | Best for |
|---|---|---|
| K-Means | Assigns observations to the nearest cluster center, then recomputes centers. Fast and effective for large datasets. | Well-separated, roughly spherical clusters of similar size |
| K-Medoids (PAM) | Like K-Means, but uses actual data points as centers (medoids) instead of computed means. | Data with outliers — medoids are more robust than means |
| Hierarchical | Builds a tree (dendrogram) by progressively merging the most similar observations or splitting the most different ones. | Exploring cluster structure at multiple levels; small to medium datasets |

K-Means settings:

  • Initialization — Hartigan-Wong (default), Lloyd, Forgy, or MacQueen
  • Random starts (default 25) — runs the algorithm multiple times with different starting points and keeps the best result. Higher values reduce the chance of a suboptimal solution.
  • Maximum iterations (default 100)
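
If you want to check a solution outside the module, the equivalent base R call is short. A minimal sketch on built-in example data (the Initialization choice corresponds to kmeans's algorithm argument):

```r
x <- scale(iris[, 1:4])                 # example data, standardized
set.seed(42)                            # the module fixes the seed at 42
km <- kmeans(x, centers = 3, nstart = 25, iter.max = 100,
             algorithm = "Hartigan-Wong")
km$tot.withinss                         # total within-cluster sum of squares
```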

K-Medoids has no algorithm-specific settings beyond the distance metric.

Hierarchical settings:

  • Distance metric — Euclidean (default), Manhattan, Maximum (Chebyshev), Canberra, or Minkowski
  • Linkage method — how to measure distance between clusters:
| Linkage | How it works | Tends to produce |
|---|---|---|
| Ward’s D2 (default) | Minimizes the increase in total within-cluster variance at each merge | Compact, roughly equal-sized clusters |
| Complete | Distance = maximum distance between any two points in the clusters | Compact clusters, sensitive to outliers |
| Average (UPGMA) | Distance = average of all pairwise distances | Moderate-sized, balanced clusters |
| Single | Distance = minimum distance between any two points | Long, chain-like clusters — good for detecting elongated shapes, but prone to “chaining” |
| Centroid | Distance between cluster centroids | Can produce inversions in the dendrogram |
| Median | Like centroid but weights clusters equally regardless of size | Similar to centroid |
| McQuitty (WPGMA) | Like average but weights clusters equally | Similar to average |

A warning appears if Ward’s linkage is combined with a non-Euclidean distance, since Ward’s assumes Euclidean distances.

The distance metric selector is available for PAM and Hierarchical (K-Means uses its own internal distance).
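
For reference, comparable calls using base R and the cluster package look like this (a sketch, not necessarily the module's exact code):

```r
library(cluster)                        # provides pam()
x <- scale(iris[, 1:4])                 # example data, standardized

# K-Medoids with a non-default distance metric
pm <- pam(x, k = 3, metric = "manhattan")
pm$medoids                              # actual data points used as centers

# Hierarchical: distance matrix + linkage, then cut the tree at k = 3
d  <- dist(x, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
cutree(hc, k = 3)                       # cluster assignment per observation
```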

Assumptions:

  • K-Means assumes roughly spherical clusters of similar size and works best with continuous variables. It minimizes within-cluster variance (Euclidean distance), so it struggles with elongated, ring-shaped, or very unequal clusters.
  • PAM makes the same distance-based assumptions but is more robust to outliers since it uses medoids.
  • Hierarchical makes no distributional assumptions, but the choice of linkage method strongly shapes the result. Ward’s assumes Euclidean distances; single linkage can produce chain-like artifacts.
  • All methods assume the selected variables are relevant to the grouping structure. Irrelevant variables add noise and degrade cluster quality.

Which algorithm? K-Means is the default for good reason — it’s fast, scales well, and works for most situations. Use PAM when you have outliers or want interpretable cluster centers (actual data points). Use Hierarchical when you want to visually explore different numbers of clusters with the dendrogram, or when cluster shapes might not be spherical.

Biclustering algorithms

Biclustering finds subgroups of observations that are similar on a subset of variables — unlike standard clustering, which uses all variables for every cluster.

| Algorithm | Description | Determines k automatically? |
|---|---|---|
| BiMax | Finds biclusters of maximal size in binarized data | No |
| Plaid | Additive model — each bicluster is a “layer” added to the background | Yes |
| FABIA | Factor model approach — finds sparse, overlapping biclusters | No |
| Cheng & Church | Finds biclusters with low mean squared residue (high coherence) | No |
| Spectral | Uses singular value decomposition to find checkerboard patterns | Yes |

BiMax settings:

  • Minimum rows and minimum columns (default 2 each)
  • Binarization threshold (default 0) — values above the threshold become 1; all others become 0

Plaid settings:

  • Background model — Row + Column effects (default) or Constant only
  • Maximum layers (default 20)
  • Row/Column release (default 0.7 each, range 0–1) — controls how aggressively rows/columns are pruned from layers

FABIA settings:

  • Sparseness prior for loadings and factors (defaults 0.6 and 0.5) — higher values produce sparser (more selective) biclusters
  • Iterations (default 500)

Cheng & Church settings:

  • Residue threshold (delta) (default 1.5) — maximum allowed mean squared residue. Lower values demand more coherent biclusters.
  • Alpha (default 1.5) — scaling factor for node deletion

Spectral settings:

  • Singular vectors (default 3) — number of SVD dimensions to use
  • Normalization — Log (default), IRRC, or Bistochastization

Which biclustering algorithm? BiMax is a good starting point — simple and fast. Plaid works well when biclusters may overlap and you want the algorithm to decide how many there are. FABIA handles noisy data well due to its probabilistic model. Cheng & Church gives you direct control over coherence via the residue threshold. Spectral is useful for checkerboard-pattern data.
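
Since the module uses the biclust package (see Reproducibility), a minimal BiMax sketch with the settings described above, on built-in example data, looks like this:

```r
library(biclust)
m  <- as.matrix(scale(iris[, 1:4]))     # example numeric matrix
mb <- binarize(m, threshold = 0)        # values above 0 become 1, others 0
bc <- biclust(mb, method = BCBimax(), minr = 2, minc = 2, number = 10)
bc@Number                               # biclusters actually found
```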

Automatic standardization

Variables are automatically standardized (z-scores) if they have different ranges. This prevents variables with larger scales from dominating the distance calculations. When applied, it’s noted in the output.

Why standardize? If one variable ranges 0–100 (exam scores) and another ranges 1–5 (Likert scale), the first would dominate the clustering simply because its numbers are bigger. Standardization puts all variables on the same scale so each contributes equally.
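
In R terms this is scale(), which gives every variable mean 0 and standard deviation 1. A sketch with the two scales from the example above:

```r
exam   <- c(55, 80, 92, 67)             # 0-100 scale
likert <- c(2, 5, 4, 1)                 # 1-5 scale
x <- scale(cbind(exam, likert))         # both columns now mean 0, SD 1
colMeans(x)                             # ~0 for both
apply(x, 2, sd)                         # 1 for both
```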

Step 2: Determine optimal k

Before running the full analysis, this step compares solutions across a range of cluster counts to help you choose k.

Standard clustering (case / variable modes)

Set the cluster range to test (default 2–10) and click Analyze & determine k.

Four diagnostic plots can be toggled:

  • Elbow plot (within-cluster SS) — total within-cluster sum of squares for each k. Look for the “elbow” where the curve bends — adding more clusters past that point gives diminishing returns.
  • Silhouette plot — average silhouette score for each k. The best k (highest score) is highlighted. Reference lines at 0.25, 0.50, and 0.70 help judge quality.
  • Gap statistic plot — compares within-cluster dispersion to what you’d expect from random (uniformly distributed) data. Error bars show uncertainty from Monte Carlo simulation (50 bootstrap samples).
  • Dendrogram (Hierarchical only) — the full tree with a draggable cut line. Drag it up or down to explore different k values — branches are colored by cluster assignment in real time.

What is a silhouette score? For each observation, the silhouette measures how similar it is to its own cluster compared to the nearest other cluster. Values range from −1 to +1: near +1 means the observation is well-placed, near 0 means it sits on the boundary between clusters, and negative values mean it’s probably in the wrong cluster. The average across all observations summarizes overall cluster quality.
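
To see what drives the elbow and silhouette plots, here is a sketch that loops over k and collects the within-cluster SS and average silhouette (using cluster::silhouette on example data):

```r
library(cluster)
x <- scale(iris[, 1:4])
d <- dist(x)

ks  <- 2:10                             # default cluster range
res <- sapply(ks, function(k) {
  set.seed(42)
  km  <- kmeans(x, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)
  c(within_ss = km$tot.withinss,        # elbow plot input
    avg_sil   = mean(sil[, "sil_width"]))
})
ks[which.max(res["avg_sil", ])]         # suggested k (silhouette)
```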

What is the gap statistic? It compares the compactness of your clusters to what you’d get by clustering random noise. If your data has real structure, the gap (difference) will be large. The optimal k is the smallest value where the gap is within one standard error of the maximum — balancing cluster quality against parsimony.
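
The gap statistic and the firstSEmax rule correspond to cluster::clusGap and cluster::maxSE. A sketch with the module's default of 50 bootstrap samples:

```r
library(cluster)
x <- scale(iris[, 1:4])
set.seed(42)
gap <- clusGap(x, FUNcluster = kmeans, nstart = 25, K.max = 10, B = 50)
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"],
      method = "firstSEmax")            # suggested k (gap statistic)
```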

Comparison table

A table with one row per k:

| Metric | Good value | What it measures |
|---|---|---|
| Within SS | Lower | Total within-cluster sum of squares — how tight the clusters are internally |
| Silhouette | Higher (≥ 0.50 reasonable, ≥ 0.70 strong) | How well-separated the clusters are |
| Calinski-Harabasz | Higher | Ratio of between-cluster to within-cluster variance — higher means tighter clusters that are farther apart |
| Davies-Bouldin | Lower | Average similarity between each cluster and its most similar neighbor — lower means clusters are more distinct |
| Dunn | Higher | Ratio of minimum between-cluster distance to maximum within-cluster diameter — sensitive to outliers |
| Gap | Higher | How much better your clustering is compared to random data |
| Hartigan | > 10 suggests adding another cluster | Whether the improvement from k to k+1 is worthwhile |

Best values are highlighted in green. Hartigan values above 10 are highlighted in yellow.

Below the table, suggestions show:

  • Suggested k (silhouette) — the k with the highest average silhouette
  • Suggested k (gap statistic) — using the firstSEmax criterion

Metrics disagree — which one do I trust? They often do. Silhouette is the most intuitive and widely used — start there. Gap statistic tends to favor parsimony (fewer clusters). Calinski-Harabasz and Davies-Bouldin assume roughly spherical clusters, so they may mislead with elongated or irregular shapes. Use the comparison table as a guide, not a verdict — try the top 2–3 candidates and see which produces the most interpretable clusters.

Biclustering (bi-clustering mode)

For algorithms that need a user-specified k (BiMax, FABIA, Cheng & Church), set the bicluster range to test (default 2–10) and click Compare solutions.

Diagnostic plots:

  • Variance explained — percentage of total variance captured at each k
  • Coherence — average mean squared residue (lower = more coherent biclusters)
  • Coverage — three lines showing row, column, and cell coverage percentages

Three optional slower diagnostics (off by default):

  • Stability analysis — bootstrap resampling to measure how consistently the same biclusters appear across samples (Jaccard similarity)
  • F-statistic diagnostics — tests whether row and column effects within each bicluster are statistically significant. An 80% reference line helps gauge quality.
  • Consensus scoring — runs the algorithm multiple times with different seeds and measures agreement between runs

Bicluster comparison table

| Metric | What it measures |
|---|---|
| Found | Actual number of biclusters discovered (may be fewer than requested) |
| Variance explained (%) | How much of the total variance the biclusters capture |
| Incremental change | Additional variance explained compared to the previous k |
| Average MSR | Mean squared residue — how coherent the biclusters are (lower = more coherent) |
| Average overlap | Jaccard similarity between biclusters — high overlap (> 0.3, shown in yellow) means biclusters share many members |
| Cell coverage (%) | Percentage of data cells included in at least one bicluster |

If stability, F-statistic, or consensus diagnostics were enabled, additional columns appear.

A suggested k is provided based on elbow detection in the variance explained curve.

For auto-k algorithms (Plaid, Spectral), Step 2 shows an informational note — the algorithm determines k on its own.

Step 3: Run full analysis

Enter the number of clusters and click Run cluster analysis (or Run biclustering analysis).

Validity metrics (case and variable clustering)

| Metric | Default | Good value | What it measures |
|---|---|---|---|
| Silhouette analysis | On | ≥ 0.50 reasonable, ≥ 0.70 strong | Overall cluster separation quality |
| Calinski-Harabasz | On | Higher is better | Between-cluster vs. within-cluster variance ratio |
| Davies-Bouldin | On | Lower is better | How similar each cluster is to its closest neighbor (lower = more distinct clusters) |
| Dunn index | Off | Higher is better | Ratio of smallest between-cluster distance to largest within-cluster diameter |
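
The module computes these metrics internally; to verify them by hand, the fpc package offers comparable indices (a sketch, not the module's code; Davies-Bouldin is available in other packages such as clusterSim):

```r
library(fpc)                            # cluster.stats(); an assumption, not
                                        # necessarily what the module uses
x  <- scale(iris[, 1:4])
d  <- dist(x)
km <- kmeans(x, centers = 3, nstart = 25)

st <- cluster.stats(d, km$cluster)
st$avg.silwidth                         # average silhouette width
st$ch                                   # Calinski-Harabasz index
st$dunn                                 # Dunn index
```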

When interpretation is enabled, silhouette values include labels:

| Value | Interpretation |
|---|---|
| ≥ 0.70 | Strong structure |
| ≥ 0.50 | Reasonable structure |
| ≥ 0.25 | Weak structure |
| < 0.25 | No substantial structure |

Output options (case and variable clustering)

| Option | Default | What it shows |
|---|---|---|
| Cluster profiles | On | Mean of each variable within each cluster, plus the overall mean. Helps characterize what makes each cluster distinct. Not available in variable mode. |
| Cluster sizes | On | Number of observations and percentage in each cluster |
| Within-cluster SS | On | Sum of squares within each cluster and percentage of total — shows which clusters are tight vs. loose |
| Between-cluster SS | Off | Variance explained by the clustering — the proportion of total variance that falls between clusters rather than within them |
| Cluster centers (medoids) | PAM only | The actual data points used as cluster representatives, with their row indices |
| Silhouette plot | Off | Bar chart showing every observation’s silhouette width, grouped by cluster |
| Variable contribution | Off | F-statistic and eta-squared for each variable — which variables best distinguish the clusters. Not available in variable mode. |
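
For K-Means, the SS-based outputs correspond directly to fields of the base R fit. A quick sketch:

```r
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
km$withinss                             # within-cluster SS, one value per cluster
km$betweenss / km$totss                 # proportion of variance explained
```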

Hierarchical-specific options:

  • Dendrogram (on by default) — tree diagram with branches colored by cluster
  • Optimize leaf ordering — reorders leaves for cleaner visualization

Reading cluster profiles: the profile table is often the most useful output. Look at which variables have high or low means in each cluster compared to the overall mean. If Cluster 1 has high anxiety, high stress, and low well-being while Cluster 2 shows the opposite pattern, you’ve found psychologically distinct groups. Name clusters by their defining characteristics, but remember — the labels are your interpretation, not the data’s (same caveat as factor naming).

Variable contribution (eta-squared): eta-squared tells you what proportion of a variable’s variance is explained by cluster membership. A high value (e.g. 0.60) means the clusters differ sharply on that variable — it’s a strong differentiator. A low value (e.g. 0.05) means the clusters are similar on that variable — it’s not contributing much to the cluster structure.
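
Eta-squared for a variable is the between-cluster share of its total sum of squares, i.e. SS_between / SS_total from a one-way ANOVA. A sketch of the computation:

```r
x  <- scale(iris[, 1:4])
set.seed(42)
cl <- factor(kmeans(x, centers = 3, nstart = 25)$cluster)

eta_sq <- apply(x, 2, function(v) {
  ss <- anova(lm(v ~ cl))[["Sum Sq"]]   # c(between, within)
  ss[1] / sum(ss)                       # proportion explained by clusters
})
round(eta_sq, 2)                        # high values = strong differentiators
```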

Variable clustering specifics

In variable clustering mode, the data matrix is transposed — variables become the “observations” being clustered. Distance between variables is based on their correlation across observations: highly correlated variables end up in the same cluster.
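
One plausible construction for such a distance (the module's exact transform may differ) is 1 minus the correlation matrix:

```r
x <- scale(iris[, 1:4])
d <- as.dist(1 - cor(x))                # r = 1 -> distance 0; r = 0 -> distance 1
hc <- hclust(d, method = "ward.D2")
cutree(hc, k = 2)                       # cluster assignment per variable
```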

The output includes a Variable cluster assignments table showing which variables belong to each cluster. Cluster profiles and variable contribution options are hidden (not applicable). The silhouette plot and dendrogram use variable names as labels.

Variable clustering vs. factor analysis: both group variables, but they work differently. Factor analysis models latent constructs — it assumes your variables are caused by underlying factors and estimates a formal model with loadings and communalities. Variable clustering is purely distance-based — it groups variables that correlate highly, without assuming any generative model. Use factor analysis when you want to model latent structure and compute factor scores. Use variable clustering when you just need a quick grouping — for example, to identify redundant variables before running another analysis, or to check whether your variables naturally fall into the subscales you expect.

If the dendrogram is enabled, variables that merge at low heights are the most similar (highest correlation). Look for distinct branches — each branch is a potential variable cluster. A variable that joins late (high merge height) may not belong clearly to any group.

Output options (biclustering)

| Option | Default | What it shows |
|---|---|---|
| Bicluster summary | On | Each bicluster’s row count, column count, total size, mean value, and mean squared residue |
| Membership tables | On | Which variables belong to which biclusters (checkmark matrix), plus a count column |
| Bicluster profiles | On | Mean of each variable within each bicluster (only variables that belong to that bicluster) |
| Coherence per bicluster | On | MSR, row variance, and column variance for each bicluster |
| Overlap analysis | Off | Jaccard similarity matrices showing how much biclusters share rows and columns |
| Heatmap | On | Color-coded matrix with bicluster membership shown as colored borders |

The heatmap supports optional row and column dendrograms (both on by default), uses a diverging color scale (blue–white–red for standardized data), and shows tooltips with row, column, value, and bicluster membership on hover.
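
Outside the module, the biclust package provides a comparable static heatmap via drawHeatmap. A sketch:

```r
library(biclust)
m  <- as.matrix(scale(iris[, 1:4]))
bc <- biclust(binarize(m, 0), method = BCBimax(), minr = 2, minc = 2,
              number = 5)
if (bc@Number > 0)
  drawHeatmap(m, bicResult = bc, number = 1)   # highlight the first bicluster
```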

Biclustering overview

Every run includes a summary showing variance explained (%), row coverage, column coverage, and cell coverage.

Biclustering coverage: coverage tells you how much of your data is “explained” by the biclusters. Low cell coverage means the biclusters capture only a small part of the data — either the structure is sparse (which is fine for some applications) or k is too low. Row and column coverage tell you whether some observations or variables are left out entirely.

Warnings

The analysis generates warnings for potentially problematic results:

  • Highly unbalanced clusters — smallest cluster < 5% of cases
  • Very small clusters — fewer than 10 observations
  • Extreme size imbalance — largest cluster > 10× the smallest
  • Low silhouette — below 0.25 (“no substantial structure”) or 0.25–0.50 (“weak structure”)
  • Many negative silhouettes — more than 10% of observations have negative silhouette values (likely in the wrong cluster)
  • Low variance explained — below 50%
  • Single-variable clusters (variable mode) — a cluster containing only one variable may not be meaningful

Inserting results into the dataset

Case clustering: click Insert cluster assignments into dataset to create a new categorical variable (e.g. Cluster_k3) with each observation’s cluster number. Cases with missing data receive NA.

Biclustering: click Insert bicluster memberships into dataset to create one binary variable per bicluster (e.g. BC1_k3, BC2_k3). Each variable is 1 if the observation belongs to that bicluster, 0 otherwise.

Inserted variables can be used in further analyses — as grouping variables for comparisons, as predictors in regression, or as group variables for measurement invariance testing.
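
Conceptually, the insertion works like this sketch (hypothetical data frame df; cases dropped by listwise deletion keep NA):

```r
set.seed(42)
df <- data.frame(anxiety = c(rnorm(20, 5), NA),   # one incomplete case
                 stress  = rnorm(21, 3))

ok <- complete.cases(df)                # listwise deletion
km <- kmeans(scale(df[ok, ]), centers = 3, nstart = 25)

df$Cluster_k3 <- NA_integer_            # missing-data cases stay NA
df$Cluster_k3[ok] <- km$cluster
```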

Missing data

Missing values are handled by listwise deletion — only complete cases are used. When the number of complete cases differs from the total, the output reports both counts. A fixed random seed (42) ensures reproducible results.

Missing data and clustering: unlike some analysis modules, cluster analysis does not support pairwise deletion. Every observation needs complete data on all selected variables. If missingness is widespread, consider reducing the number of variables or applying imputation before running the analysis.

Reporting checklist

Key things to include when writing up cluster analysis results:

Method:

  • Clustering mode (case, variable, or bi-clustering)
  • Algorithm used (K-Means, PAM, Hierarchical, or which biclustering algorithm)
  • Distance metric and linkage method (for Hierarchical/PAM)
  • Whether variables were standardized (and why, e.g. different measurement scales)
  • How the number of clusters was determined — which metrics were consulted (silhouette, gap statistic, elbow, dendrogram) and how conflicts were resolved
  • Sample size and number of variables
  • How missing data were handled

Results:

  • Number of clusters and cluster sizes
  • Validity metrics — at minimum average silhouette width; consider also Calinski-Harabasz and Davies-Bouldin
  • Cluster profiles (means per variable per cluster) — the core of interpretation
  • Variance explained (between-cluster SS as percentage of total)
  • Any warnings (imbalanced clusters, low silhouette, negative silhouettes)

For biclustering: report the algorithm, number of biclusters found, variance explained, cell coverage, and coherence (MSR). Include the membership table or heatmap.

Reproducibility

Every analysis prints the underlying R code to the R console — you can inspect, copy, or re-run the exact commands. Cluster analysis uses base R functions (kmeans, hclust) and the cluster package (for PAM and silhouette); biclustering uses the biclust and fabia packages. Citations appear automatically at the top of the output section. A fixed random seed (42) ensures reproducible results across runs.

Common pitfalls

Cluster analysis is exploratory by nature — it will always produce clusters, whether or not they’re meaningful. Keep these points in mind:

Clusters always exist — even in random data. K-Means will partition random noise into k groups and report cluster centers with a straight face. A low silhouette score (< 0.25) or low variance explained (< 50%) is a sign that the “clusters” may not reflect real structure. Always check validity metrics before interpreting.

Results depend on the method. K-Means, PAM, and Hierarchical can produce different clusterings from the same data. Different linkage methods within Hierarchical can produce different clusterings. Different distance metrics can produce different clusterings. If your clusters only appear with one specific combination of settings, they may not be robust. Try multiple approaches and look for consistent patterns.

Too many variables can hurt. With many variables, distances become dominated by noise — every observation looks equally far from every other one (the “curse of dimensionality”). If you have 50 variables, consider reducing them first with factor analysis or PCA and clustering on the factor scores instead.

Don’t test cluster differences on clustering variables. If you cluster people using anxiety and depression scores, then run a t-test asking “do the clusters differ on anxiety?” — of course they do, you made them differ. Testing whether clusters differ on the variables used to create them is circular. Instead, validate clusters against external variables not used in the clustering (e.g. cluster on personality items, then check whether clusters differ on job performance).

Clusters might be arbitrary cuts of a continuum. Not all data has natural groups. Depression scores might form a smooth gradient from low to high rather than distinct “depressed” and “not depressed” clusters. Forcing this into two clusters creates an artificial boundary. A well-known example of this debate is personality typology: researchers have clustered Big Five scores into types like “resilient,” “overcontrolled,” and “undercontrolled” — but the Big Five dimensions themselves are continuously and normally distributed, so the “types” may simply be regions of a smooth space rather than natural categories. Check whether the silhouette plot shows clear separation or a muddy overlap.

Cluster labels are interpretations. Same caveat as factor naming — calling a cluster “Resilient High-Achievers” because it has above-average scores on several positive traits is your interpretation. Report the actual profile means so readers can judge for themselves.

Sample-specific solutions. Cluster structures are sensitive to sample composition. A 3-cluster solution in your sample might not replicate in a different population. If possible, split your data and check whether the same clusters emerge in both halves — the same idea as the stability diagnostics in the biclustering comparison.