R package mdendro enables the calculation of agglomerative hierarchical clustering (AHC), extending the standard functionalities in several ways:
Native handling of both similarity and dissimilarity (distances) matrices.
Calculation of pair-group dendrograms and variable-group multidendrograms [1].
Implementation of the most common AHC methods in both weighted and unweighted forms: single linkage, complete linkage, average linkage (UPGMA and WPGMA), centroid (UPGMC and WPGMC), and Ward.
Implementation of two additional parametric families of methods: versatile linkage [2], and beta flexible. Versatile linkage leads naturally to the definition of two additional methods: harmonic linkage, and geometric linkage.
Calculation of the cophenetic (or ultrametric) matrix.
Calculation of five descriptors of the final dendrogram: cophenetic correlation coefficient, space distortion ratio, agglomerative coefficient, chaining coefficient, and tree balance.
Plots of the descriptors for the parametric methods.
All this functionality is obtained with functions
linkage
, descval
and descplot
.
Function linkage
may be considered as a replacement for
functions hclust
(in package stats)
and agnes
(in package cluster).
To enhance usability and interoperability, the linkage
class includes several methods for plotting, summarizing information,
and class conversion.
There exist two main ways to install mdendro:
Installation from CRAN (recommended method):
install.packages("mdendro")
RStudio has a menu entry (Tools \(\rightarrow\) Install Packages) for this job.
Installation from GitHub (you may need to install first devtools):
install.packages("devtools")
library(devtools)
install_github("sergio-gomez/mdendro")
Since mdendro includes C++ code, you may need to install first Rtools in Windows, or Xcode in MacOS.
Let us start by using the linkage
function to calculate
the complete linkage AHC of the UScitiesD
dataset, a matrix
of distances between a few US cities:
library(mdendro)
<- linkage(UScitiesD, method = "complete") lnk
Now we can plot the resulting dendrogram:
plot(lnk)
The summary of this dendrogram is:
summary(lnk)
## Call:
## linkage(prox = UScitiesD,
## type.prox = "distance",
## digits = 0,
## method = "complete",
## group = "variable")
##
## Number of objects: 10
##
## Binary dendrogram: TRUE
##
## Descriptive measures:
## cor sdr ac cc tb
## 0.8077859 1.0000000 0.7738478 0.3055556 0.9316262
In particular, you can recognize the calculated descriptors:
cor
: cophenetic correlation coefficientsdr
: space distortion ratioac
: agglomerative coefficientcc
: chaining coefficienttb
: tree balanceIt is possible to work with similarity data without having to convert them to distances, provided they are in range [0.0, 1.0]. A typical example would be a matrix of non-negative correlations:
<- as.dist(Harman23.cor$cov)
sim <- linkage(sim, type.prox = "sim")
lnk plot(lnk, main = "Harman23")
There is also the option to choose between unweighted (default) and weighted methods:
par(mfrow = c(1, 2))
<- round(dist(scale(mtcars)), digits = 3)
cars <- list(cex = 0, lab.cex = 0.4)
nodePar
<- linkage(cars, method = "arithmetic")
lnk1 plot(lnk1, main = "unweighted", nodePar = nodePar)
<- linkage(cars, method = "arithmetic", weighted = TRUE)
lnk2 plot(lnk2, main = "weighted", nodePar = nodePar)
When there are tied minimum distances in the agglomeration process,
you may ignore them and proceed choosing a random pair (pair-group
methods) or agglomerate them all at once (variable-group
multidendrograms). With linkage
you can use both
approaches, being multidendrograms the default one:
par(mfrow = c(1, 2))
<- round(dist(scale(mtcars)), digits = 1)
cars <- list(cex = 0, lab.cex = 0.4)
nodePar
<- linkage(cars, method = "complete")
lnk1 plot(lnk1, main = "multidendrogram", nodePar = nodePar)
<- linkage(cars, method = "complete", group = "pair")
lnk2 plot(lnk2, main = "pair-group", nodePar = nodePar)
While multidendrograms are unique, you may obtain structurally different pair-group dendrograms by just reordering the data. As a consequence, the descriptors are invariant to permutations for multidendrograms, but not for pair-group dendrograms:
<- round(dist(scale(mtcars)), digits = 1)
cars <- linkage(cars, method = "complete")
lnk1 <- linkage(cars, method = "complete", group = "pair")
lnk2
# apply random permutation to data
set.seed(666)
<- sample(attr(cars, "Size"))
ord <- as.dist(as.matrix(cars)[ord, ord])
cars_p <- linkage(cars_p, method = "complete")
lnk1p <- linkage(cars_p, method = "complete", group = "pair")
lnk2p
# compare original and permuted cophenetic correlation
c(lnk1$cor, lnk1p$cor)
## [1] 0.7782257 0.7782257
c(lnk2$cor, lnk2p$cor)
## [1] 0.7780010 0.7780994
# compare original and permuted tree balance
c(lnk1$tb, lnk1p$tb)
## [1] 0.9564568 0.9564568
c(lnk2$tb, lnk2p$tb)
## [1] 0.9472909 0.9424148
In multidendrograms, the ranges (rectangles) show the heterogeneity between distances within the group, but they are optional in the plots:
par(mfrow = c(1, 2))
<- round(dist(scale(mtcars)), digits = 1)
cars <- list(cex = 0, lab.cex = 0.4)
nodePar <- linkage(cars, method = "complete")
lnk plot(lnk, col.rng = "bisque", main = "with ranges", nodePar = nodePar)
plot(lnk, col.rng = NULL, main = "without ranges", nodePar = nodePar)
Plots including ranges are only available if you directly use the
plot.linkage
function from mdendro.
Anyway, you may still take advantage of other dendrogram plotting
packages, such as dendextend and
ape:
par(mfrow = c(1, 2))
<- round(dist(scale(mtcars)), digits = 1)
cars <- linkage(cars, method = "complete")
lnk
<- as.dendrogram(lnk)
lnk.dend plot(dendextend::set(lnk.dend, "branches_k_color", k = 4),
main = "dendextend package",
nodePar = list(cex = 0.4, lab.cex = 0.5))
<- as.hclust(lnk)
lnk.hcl <- c("red", "forestgreen", "purple", "orange")
pal4 <- cutree(lnk.hcl, 4)
clu4 plot(ape::as.phylo(lnk.hcl),
type = "fan",
main = "ape package",
tip.color = pal4[clu4],
cex = 0.5)
In addition, you can also use the linkage
function to
plot heatmaps containing multidendrograms:
heatmap(scale(mtcars), hclustfun = linkage)
The list of available AHC linkage methods is the following: single, complete, arithmetic, geometric, harmonic, versatile, ward, centroid and flexible. Their equivalences with the methods in other packages can be found below. The default method is arithmetic, which is also commonly known as average linkage or UPGMA.
par(mfrow = c(3, 3))
<- c("single", "complete", "arithmetic",
methods "geometric", "harmonic", "versatile",
"ward", "centroid", "flexible")
for (m in methods) {
<- linkage(UScitiesD, method = m)
lnk plot(lnk, cex = 0.6, main = m)
}
Two of the methods, versatile and flexible, depend on a parameter that takes values in (-Inf, +Inf) for versatile, and in [-1.0, +1.0] for flexible. Here come some examples:
par(mfrow = c(2, 3))
<- c(-10.0, 0.0, 10.0)
vals for (v in vals) {
<- linkage(UScitiesD, method = "versatile", par.method = v)
lnk plot(lnk, cex = 0.6, main = sprintf("versatile (%.1f)", v))
}
<- c(-0.8, 0.0, 0.8)
vals for (v in vals) {
<- linkage(UScitiesD, method = "flexible", par.method = v)
lnk plot(lnk, cex = 0.6, main = sprintf("flexible (%.1f)", v))
}
It is interesting to know how the descriptors depend on those
parameters. Package mdendro
provides two specific functions for this task, namely
descval
and descplot
, which return just the
numerical values or also the corresponding plot, respectively. For
example, using versatile linkage:
par(mfrow = c(2, 3))
<- c("cor", "sdr", "ac", "cc", "tb")
measures <- c(-Inf, (-20:+20), +Inf)
vals for (m in measures)
descplot(UScitiesD, method = "versatile",
measure = m, par.method = vals,
type = "o", main = m, col = "blue")
Similarly for the flexible method:
par(mfrow = c(2, 3))
<- c("cor", "sdr", "ac", "cc", "tb")
measures <- seq(from = -1, to = +1, by = 0.1)
vals for (m in measures)
descplot(UScitiesD, method = "flexible",
measure = m, par.method = vals,
type = "o", main = m, col = "blue")
For comparison, the same AHC can be found using functions
hclust
and agnes
, where the default plots just
show some differences in aesthetics:
library(mdendro)
<- linkage(UScitiesD, method = "complete")
lnk
library(cluster)
<- agnes(UScitiesD, method = "complete")
agn
# library(stats) # unneeded, stats included by default
<- hclust(UScitiesD, method = "complete")
hcl
par(mfrow = c(1, 3))
plot(lnk)
plot(agn, which.plots = 2)
plot(hcl)
Converting to class dendrogram, we can see all three are structurally equivalent:
<- as.dendrogram(lnk)
lnk.dend <- as.dendrogram(agn)
agn.dend <- as.dendrogram(hcl)
hcl.dend identical(lnk.dend, agn.dend)
## [1] TRUE
par(mfrow = c(1, 2))
plot(lnk.dend, main = "lnk.dend = agn.dend", cex = 0.7)
plot(hcl.dend, main = "hcl.dend", cex = 0.7)
The cophenetic (ultrametric) matrix is readily available as component
coph
of the returned linkage
object, and
coincides with those obtained using the other functions:
<- cophenetic(hcl)
hcl.coph <- cophenetic(agn)
agn.coph
all(lnk$coph == hcl.coph)
## [1] TRUE
all(lnk$coph == agn.coph)
## [1] TRUE
The coincidence also applies to the cophenetic correlation
coefficient and agglomerative coefficient, with the advantage that
linkage
has them all already calculated:
<- cor(UScitiesD, hcl.coph)
hcl.cor all.equal(lnk$cor, hcl.cor)
## [1] TRUE
all.equal(lnk$ac, agn$ac)
## [1] TRUE
The computational efficiency of the three functions is compared next,
both in linear scale (left) and in log-log scale (right). It can be
observed that the time cost of functions linkage
and
hclust
is quadratic, whereas that of function
agnes
is cubic:
Package mdendro
was initially designed to make publicly available implementations of our
work on agglomerative hierarchical clustering (AHC) [1, 2], trying to
reach a wide community through the use of the R language. We already
have two implementations, a Java application called MultiDendrograms,
and a Hierarchical_Clustering
tool within our Radatools
set of programs for the analysis of complex networks, whose Ada source
code can be found in Radalib.
Apart from providing access to our algorithms, we also wanted to
simplify how to use them, and improve the performance of their
implementations. All this was accomplished by using state-of-the-art
methods based on neighbor chains, implementing the base code in C++, and
adding functionality to make the new package very similar and compatible
with the main ones currently in use, namely the functions
hclust
in package stats,
and agnes
in package cluster.
The result is a package mdendro
that includes and largely extends the functionality of these reference
functions, thus with the option of being a replacement for them.
The main novelties in mdendro, which are not available in other packages, are the following:
Calculation of variable-group multidendrograms, which solve the non-unicity problem of AHC when there are tied distances [1].
Native handling of similarity matrices.
Explicit separation of weighted and unweighted methods of AHC.
Implementation of a new parametric family of methods: versatile linkage [2].
Automatic calculation of the cophenetic (or ultrametric) matrix.
Automatic calculation of five descriptors of the final dendrogram: cophenetic correlation coefficient, space distortion ratio, agglomerative coefficient, chaining coefficient, and tree balance.
Plots of the descriptors for the parametric methods.
Multidendrograms are the hierarchical structures that are obtained after applying the variable-group AHC algorithms introduced in [1]. They solve the non-uniqueness problem found in the standard pair-group algorithms and implementations [3]. This problem arises when two or more minimum distances between different clusters are equal during the amalgamation process. The standard approach consists in choosing a pair, breaking the ties between distances, and proceeding in the same way until the final hierarchical classification is obtained. However, different clusterings are possible depending on the criterion used to break the ties (usually a pair is just chosen at random), and the user is unaware of this problem.
The variable-group algorithms group more than two clusters at the same time when ties occur, given rise to a graphical representation called multidendrogram. Their main properties are:
Both hclust
and agnes
ignore this
non-uniqueness problem, thus the need for mdendro.
Let us consider the genetic profiles of 51 grapevine cultivars at 6
microsatellite loci [4]. The distance between two cultivars is
defined as one minus the fraction of shared alleles, that we use to
calculate a distance matrix d
. The main characteristic of
this kind of data is that the number of different distances is very
small:
length(unique(d))
## [1] 11
As a consequence, it becomes very easy to find tied distances during the agglomeration process. The corresponding unique multidendrogram is the following:
<- linkage(d, method = "arithmetic", digits = 3)
lnk <- list(cex = 0, lab.cex = 0.6)
nodePar plot(lnk, col.rng = NULL, nodePar = nodePar, main = "multidendrogram")
The reach of the non-uniqueness problem for this example is the existence of
This number corresponds to arithmetic linkage and a resolution of 3
decimal digits, and has been computed using the
Hierarchical_Clustering
tool in Radatools.
We can check this fact by just calculating the binary dendrogram for
random permutations of the data, and plotting the broad range of values
of their cophenetic correlations, which clearly indicate the existence
of many structurally different dendrograms:
<- 1000
nreps <- vector(length = nreps)
cors <- list()
lnks for (r in 1:nreps) {
<- sample(attr(d, "Size"))
ord <- as.dist(as.matrix(d)[ord, ord])
d2 <- linkage(d2, group = "pair", digits = 3)
lnks[[r]] <- lnks[[r]]$cor
cors[r]
}plot(sort(cors), main = "pair-group cophenetic correlations (same data)")
For example, the first two pair-group dendrograms calculated above are structurally different:
<- as.dendrogram(lnks[[1]])
dend1 <- as.dendrogram(lnks[[2]])
dend2 <- dendextend::highlight_distinct_edges(dend1, dend2)
diff_12 <- dendextend::highlight_distinct_edges(dend2, dend1)
diff_21
par(mfrow = c(2, 1))
<- list(cex = 0, lab.cex = 0.6)
nodePar plot(diff_12, nodePar = nodePar, main = "pair-group (original)")
plot(diff_21, nodePar = nodePar, main = "pair-group (permuted)")
The identification of ties requires the selection of the number of
significant digits in our data. For example, if the original distances
are experimentally obtained with a resolution of three decimal digits,
two distances that differ in the sixth decimal digit should be
considered as equal. If this is not taken into account, ties may be
broken just by the numerical imprecision inherent to computer
representations of real numbers. In the linkage
function,
you can control this level of resolution by adjusting its
digits
argument.
<- linkage(d, method = "arithmetic", digits = 3)
lnk3 <- linkage(d, method = "arithmetic", digits = 1)
lnk1
par(mfrow = c(2, 1))
<- list(cex = 0, lab.cex = 0.6)
nodePar plot(lnk3, col.rng = NULL, nodePar = nodePar,
main = "multidendrogram (digits = 3)")
plot(lnk1, col.rng = NULL, nodePar = nodePar,
main = "multidendrogram (digits = 1)")
Package mdendro also introduces a new family of space-conserving hierarchical clustering methods called versatile linkage [2]. It is based on the use of generalized means, and includes single linkage, complete linkage and arithmetic linkage (a.k.a. average linkage or UPGMA/WPGMA) as particular cases. Additionally, versatile linkage naturally leads to the definition of two new methods, geometric linkage and harmonic linkage (hence the convenience to rename average linkage as arithmetic linkage, to emphasize the existence of different types of averages).
In function linkage
, the parameter of versatile linkage
is introduced using the par.method
argument. It can take
any real value, and the correspondence between versatile linkage and the
above mentioned methods is the following:
linkage | versatile linkage (par.method ) |
---|---|
complete | Inf |
arithmetic | 1 |
geometric | 0 |
harmonic | -1 |
single | -Inf |
Let us show a small example in which we plot the different (multi)dendrograms as we increase the versatile linkage parameter, indicating the corresponding named methods:
= as.dist(matrix(c( 0, 7, 16, 12,
d 7, 0, 9, 19,
16, 9, 0, 12,
12, 19, 12, 0), nrow = 4))
par(mfrow = c(2, 3))
<- c(-Inf, -1, 0, 1, Inf)
vals <- c("single", "harmonic", "geometric", "arithmetic", "complete")
names <- sprintf("versatile (%.1f) = %s", vals, names)
titles
for (i in 1:length(vals)) {
<- linkage(d, method = "versatile", par.method = vals[i], digits = 2)
lnk plot(lnk, ylim = c(0, 20), cex = 0.6, main = titles[i])
}
linkage(prox, type.prox = "distance", digits = NULL,
method = "arithmetic", par.method = 0, weighted = FALSE,
group = "variable")
descval(prox, type.prox = "distance", digits = NULL,
method = "versatile", par.method = c(-1,0,+1), weighted = FALSE,
group = "variable", measure = "cor")
descplot(prox, ..., type.prox = "distance", digits = NULL,
method = "versatile", par.method = c(-1, 0, +1), weighted = FALSE,
group = "variable", measure = "cor", slope = 10)
Arguments:
prox |
A structure of class dist containing
non-negative proximity data (distances or similarities). All the linkage
methods are meant to be used with non-squared proximity data as
input. |
type.prox |
A character string to indicate whether the proximity
data represent "distance" (default) or
"similarity" between objects. Methods "ward"
and "centroid" cannot be used with similarity data as
input, while the rest of the linkage methods can be used with both
distances and similarities. |
digits |
An integer value specifying the precision, i.e., the
number of significant decimal digits to be used for the comparisons
between proximity data. This is an important parameter, since equal
proximity data at a certain precision may become different by increasing
its value. Thus, it may be responsible of the existence of tied
proximity data. If the value of this parameter is negative or
NULL (default), then the precision is automatically set to
the number of significant decimal digits in the input proximity
data. |
method |
A character string specifying the linkage method to be
used. For linkage() , this should be one of:
"single" , "complete" ,
"arithmetic" , "geometric" ,
"harmonic" , "versatile" , "ward" ,
"centroid" or "flexible" . Methods
"versatile" and "flexible" are the only two
methods that can be used in descval() and
descplot() . |
par.method |
A real value, in the case of linkage() , or
a vector of real values, in the case of descval() and
descplot() , required as parameter for the methods
"versatile" and "flexible" . The range of
possible values is [-Inf, +Inf] for "versatile" , and [-1,
+1] for "flexible" . |
weighted |
A logical value to choose between the weighted and the
unweighted (default) versions of some linkage methods. Weighted linkage
gives merging branches in a dendrogram equal weight regardless of the
number of objects carried on each branch. Such a procedure weights
objects unequally, contrasting with unweighted linkage that gives equal
weight to each object in the clusters. This parameter has no effect on
the "single" and "complete" linkages. |
group |
A character string to choose a grouping criterion
between the "variable" -group approach (default) that
returns a multidendrogram, i.e., a multifurcated dendrogram (m-ary
tree), and the "pair" -group approach that returns a
bifurcated dendrogram (binary tree). |
measure |
A character string specifying the descriptive measure
to be plotted. This should be one of: "cor" , for cophenetic
correlation coefficient; "sdr" , for space distortion ratio;
"ac" , for agglomerative coefficient; "cc" , for
chaining coefficient; or "tb" , for tree balance. |
slope |
A real value representing the slope of a sigmoid
function to map the "versatile" linkage unbounded interval
(-Inf, +Inf) onto the bounded interval (-1, +1). It can be used to
improve the distribution of points along the x axis. |
... |
Graphical parameters may also be supplied (see
par ) and are passed to plot.default . |
## S3 method for class 'linkage'
plot(x, type = c("rectangle", "triangle"),
center = FALSE, edge.root = FALSE,
nodePar = NULL, edgePar = list(),
leaflab = c("perpendicular", "textlike", "none"),
dLeaf = NULL, xlab = "", ylab = "", xaxt = "n", yaxt = "s",
horiz = FALSE, frame.plot = FALSE, xlim, ylim,
col.rng = "lightgray", ...)
Arguments:
x |
An object of class linkage , typically
created by linkage() . |
type |
Type of plot |
center |
Logical; if TRUE , nodes are plotted
centered with respect to the leaves in the branch. Otherwise (default),
plot them in the middle of all direct child nodes. |
edge.root |
Logical; if TRUE , draw an edge to the root
node. |
nodePar |
A list of plotting parameters to use for
the nodes (see points ), or NULL by default
which does not draw symbols at the nodes. The list may contain
components named pch , cex , col ,
xpd , and/or bg each of which can have length
two for specifying separate attributes for inner nodes and leaves. Note
that the default of pch is 1:2 , so you may
want to use pch = NA if you specify
nodePar . |
edgePar |
A list of plotting parameters to use for
the edge segments . The list may contain components named
co l, lty and lwd . As with
nodePar , each can have length two for differentiating
leaves and inner nodes. |
leaflab |
A string specifying how leaves are labeled. The default
"perpendicular" writes text vertically (by default),
"textlike" writes text horizontally (in a rectangle), and
"none" suppresses leaf labels. |
dLeaf |
A number specifying the distance in user coordinates
between the tip of a leaf and its label. If NULL as per
default, 3/4 of a letter width or height is used. |
xlab ,ylab |
A label for the axis. |
xaxt ,yaxt |
A character which specifies the axis type. Specifying
"n" suppresses plotting, while any value other than
"n" implies plotting. |
horiz |
Logical indicating if the dendrogram should be drawn horizontally or not. |
frame.plot |
Logical indicating if a box around the plot should be
drawn, see plot.default . |
xlim ,ylim |
Optional x- and y-limits of the plot, passed to
plot.default . The defaults for these show the full
dendrogram. |
col.rng |
Color ("lightgray" by default) to be used
for plotting range rectangles in case of tied heights. If
NULL , range rectangles are not plotted. |
... |
Graphical parameters (see par ) may also be
supplied and are passed to plot.default . |
Based on the plot function for objects of class
dendrogram
(see plot.dendrogram
), the
plot
function for objects of class linkage
adds the possibility of visualizing the existence of tied heights in a
dendrogram thanks to the col.rng
parameter.
An object of class linkage
that describes the
multifurcated dendrogram obtained. The object is a list with the
following components:
call |
The call that produced the result. |
digits |
Number of significant decimal digits used as precision. It is given by the user or automatically set to the number of significant decimal digits in the input proximity data. |
merger |
A list of vectors of integer that describes the merging of clusters at each step of the clustering. If a number j in a vector is negative, then singleton cluster -j was merged at this stage. If j is positive, then the merge was with the cluster formed at stage j of the algorithm. |
height |
A vector with the proximity values between merging clusters (for the particular agglomeration) at the successive stages. |
range |
A vector with the range (the maximum minus the minimum) of proximity values between merging clusters. It is equal to 0 for binary clusters. |
order |
A vector giving a permutation of the original observations to allow for plotting, in the sense that the branches of a clustering tree will not cross. |
coph |
Object of class dist containing the
cophenetic (or ultrametric) proximity data in the output dendrogram,
sorted in the same order as the input proximity data in
prox . |
binary |
A logical value indicating whether the output
dendrogram is a binary tree or, on the contrary, it contains an
agglomeration of more than two clusters due to the existence of tied
proximity data. Its value is always TRUE when the
pair grouping criterion is used. |
cor |
Cophenetic correlation coefficient [5], defined as the Pearson correlation coefficient between the output cophenetic proximity data and the input proximity data. It is a measure of how faithfully the dendrogram preserves the pairwise proximity between objects. |
sdr |
Space distortion ratio [2], calculated as the difference between the maximum and minimum cophenetic proximity data, divided by the difference between the maximum and minimum initial proximity data. Space dilation occurs when the space distortion ratio is greater than 1. |
ac |
Agglomerative coefficient [6], a number between 0 and 1 measuring the strength of the clustering structure obtained. |
cc |
Chaining coefficient [7], a number between 0 and 1 measuring the tendency for clusters to grow by the addition of clusters much smaller rather than by fusion with other clusters of comparable size. |
tb |
Tree balance [2], a number between 0 and 1 measuring the equality in the number of leaves in the branches concerned at each fusion in the hierarchical tree. |
Class linkage
has methods for the following generic
functions: print
, summary
, plot
(see plot.linkage
), as.dendrogram
,
as.hclust
and cophenetic
.
The difference between the available hierarchical clustering methods rests in the way the proximity between two clusters is defined from the proximity between their constituent objects:
"single"
: the proximity between clusters equals the
minimum distance or the maximum similarity between objects.
"complete"
: the proximity between clusters equals
the maximum distance or the minimum similarity between objects.
"arithmetic"
: the proximity between clusters equals
the arithmetic mean proximity between objects. Also known as average
linkage, WPGMA (weighted version) or UPGMA (unweighted
version).
"geometric"
: the proximity between clusters equals
the geometric mean proximity between objects.
"harmonic"
: the proximity between clusters equals
the harmonic mean proximity between objects.
"versatile"
: the proximity between clusters equals
the generalized power mean proximity between objects. It depends on the
value of par.method, with the following linkage methods as particular
cases: "complete"
(par.method = +Inf
),
"arithmetic"
(par.method = +1
),
"geometric"
(par.method = 0
),
"harmonic"
(par.method = -1
) and
"single"
(par.method = -Inf
).
"ward"
: the distance between clusters is a weighted
squared Euclidean distance between the centroids of each cluster. This
method is available only for distance data.
"centroid"
: the distance between clusters equals the
square of the Euclidean distance between the centroids of each cluster.
Also known as WPGMC (weighted version) or UPGMC (unweighted version).
This method is available only for distance data. Note that both centroid
versions, weighted and unweighted, may yield inversions that make
dendrograms difficult to interpret.
"flexible"
: the proximity between clusters is a
weighted sum of the proximity between clusters in the previous
iteration. It depends on the value of par.method
, in the
range [-1, +1], and it is equivalent to "arithmetic"
linkage when par.method = 0
.
Except for the cases containing tied distances, the following
equivalences hold between function linkage
in package
mdendro, function hclust
in package
stats, and function agnes
in package
cluster. Special attention must be paid to the equivalence with
methods "centroid"
and "median"
of function
hclust
, since these methods require the input distances to
be squared before calling hclust
and, consequently, the
square root of its results should be taken afterwards. When relevant,
weighted (W) or unweighted (U) versions of the linkage methods and the
value for par.method
are indicated:
linkage |
hclust |
agnes |
---|---|---|
single | single | single |
complete | complete | complete |
arithmetic, U | average | average |
arithmetic, W | mcquitty | weighted |
geometric, U/W | — | — |
harmonic, U/W | — | — |
versatile, U/W, \(p\) | — | — |
— | ward | — |
ward | ward.D2 | ward |
centroid, U | centroid | — |
centroid, W | median | — |
flexible, U, \(\beta\) | — | gaverage, \(\beta\) |
— | — | gaverage, \(\alpha_1\), \(\alpha_2\), \(\beta\), \(\gamma\) |
flexible, W, \(\beta\) | — | flexible, \((1-\beta)/2\) |
— | — | flexible, \(\alpha_1\), \(\alpha_2\), \(\beta\), \(\gamma\) |
New function descval
, that can be used instead of
descplot
when the plots are not needed.
New logical value added to objects of class linkage
,
indicating whether the output dendrogram is a binary tree.
Fully new implementation from scratch, programmed in C++, using the efficient technique of neighbor chains, including plotting capabilities, and compatible with the main hierarchical clustering and dendrogram packages in R.
Initial R version based on the MultiDendrograms Java core for the calculations.