library(andrews)
#> See the package vignette with `vignette("andrews")`
The andrews
routine, implemented by Jaroslav Myslivec,
has been rewritten and extended. The old routine is still available as
andrews0
and the rewritten routine should give the same
graphics with the same parameter values.
op <- par(mfrow=c(1,2))
andrews0(iris, main="andrews0")
andrews(iris, main="andrews")
par(op)
Some more types of curves has been added. To see more comparisons of
andrews
and andrews0
, run interactively
zzz()
Full parameter list for andrews
is:
andrews (df, type = 1, clr = NULL, step = 100, ymax = 10, # old parameters
alpha = NULL, palcol = NULL, lwd = 1, lty = "solid", ...) # new parameters
The parameters in the call to andrews
that come in
...
are passed to plot
to draw the base
window:
x
, and y
cannot be set by
the user through ...
.main
, sub
,
xlim
, ylim
, xlab
,
ylab
, axes
, asp
and possibly
other parameters.Note that the setting of ylim
has priority over the
setting of ymax
.
The height of the window can be scaled by setting ymax
or ylim
.
ymax
is not set then ylim=c(-10,10)
is
used.ymax
is set to NA
, then
ylim=c(-lim,lim)
, where the maximum height lim
is calculated from the curves. Note that this option doubles the
calculation time.ylim
directly.op<-par(mfrow=c(1,3))
andrews(iris, main="no ymax")
andrews(iris, ymax=NA, main="ymax=NA")
andrews(iris, ylim=c(-1,3), main="ylim=c(-1,3)")
par(op)
The width of the window can be scaled by setting xlim
,
which is particularly useful for type==3
or
type==5
.
andrews(iris, type=3, xlim=c(0,6*pi), ymax=NA)
The parameters lty
and lwd
determine the
linetype and width. The length of the parameters must be either
1
or nrow(df)
. In the first case the values
are applied to each curve, in the second case lty[i]
and
lwd[i]
are used for the i
th line.
andrews(iris, ymax=NA, lwd=3)
The curves can be coloured individually if
length(clr)==nrow(df)
.
andrews(iris, ymax=NA, clr=rainbow(nrow(iris)))
By default, the curves are coloured by a variable of the data frame.
andrews(iris, ymax=NA, clr=5) # iris$Species
If is.numeric(df[,clr])
is FALSE
then
rainbow(nuv)
is used for the colours, where
nuv
is the number of unique values
(nuv = length(unique(df[,clr])
). You can use
palcol
to select another palette that gives
nuv
colours.
andrews(iris, ymax=NA, clr=5, palcol=hcl.colors) # iris$Species
If is.numeric(df[,clr])
is TRUE
, the
variable color(df[,clr])
(scaled to the interval \([0, 1]\)) is used. The default is
function(v) { hsv(1,1,v,1) }
, but any other colouring
function can be used.
andrews(iris, ymax=NA, clr=1) # iris$Sepal.Length
andrews(iris, ymax=NA, clr=1, palcol=function(v) { gray(v) }) # iris$Sepal.Length
or alternatively
andrews(iris, ymax=NA, clr=1, palcol=gray) # iris$Sepal.Length
If you plot many curves then the curve details disappear by
overplotting. If the output device supports the alpha channel, the
parameter alpha
(\(0<\alpha<1\)) is applied to all
curves (length(alpha)==1
) or one for each observation
(length(alpha)==nrow(df)
).
andrews(iris, ymax=NA, alpha=0.1)
andrews(iris, ymax=NA, clr=5, alpha=0.1, lwd=2)
All types of Andrews curves can be written as
\[ f_i(t) = \sum_{k=1}^p x_{ik} g_k(t). \]
Currently, six types of curves are implemented, which are determined
by the parameter type
:
With deftype
the list of types can be queried by
deftype(1)
#> $fun
#> function (n, t)
#> {
#> n <- as.integer(if (n < 1) 1 else n)
#> m <- matrix(1/sqrt(2), nrow = length(t), ncol = n)
#> if (n > 1) {
#> for (i in 2:n) {
#> m[, i] <- if (i%%2)
#> cos((i%/%2) * t)
#> else sin((i%/%2) * t)
#> }
#> }
#> m
#> }
#> <bytecode: 0x563c84433f98>
#> <environment: 0x563c8443fda0>
#>
#> $range
#> [1] -3.141593 3.141593
A curve type consists of a function function(n, t)
which
computes a matrix of \((g_1(t), \ldots,
g_n(t))\) and a default range for plotting, usually \(-\pi\) till \(\pi\).
A new curve type and a default plot range can be implemented by number or name:
deftype("sine", xlim = c(-pi, pi), function(n, t) {
n <- as.integer(if (n<1) 1 else n)
m <- matrix(NA, nrow=length(t), ncol=n)
for (i in 1:n) m[,i] <- sin(i*t)
m
})
andrews(iris, "sine", ymax=NA, clr=5)
The representation of a set of multivariate observations as curves was proposed by Andrews (1972). For a set of orthonormal basis functions \(g_i(t)\) with \(i=1, 2, 3, \ldots\) in an interval \([a, b]\) it holds:
\[\int_a^b g_i(t) g_j(t) dt = \delta_{ij} = \begin{cases} 1 & \text{ if } i=j \\ 0 & \text{ if } i\neq j \end{cases}\]
If one creates from multivariate observations \(x_i=(x_{i1}, \ldots, x_{ip})\) with \(i=1,2,\ldots,n\) a function
\[ f_i(t) = \sum_{k=1}^p x_{ik} g_k(t) \]
then it follows:
Property 1 - The integral of the squared difference of two curves is equal to the squared Euclidean distance between the data points \(x_i\) and \(x_j\): \[\begin{align*} \int_a^b (f_i(t)-f_j(t))^2 dt & = \int_a^b \left(\sum_{k=1}^p (x_{ik}- x_{jk}) g_k(t)\right)^2 dt \\ &= \sum_{k=1}^p \sum_{l=1}^p (x_{ik}- x_{jk}) (x_{il}- x_{jl}) \underbrace{\int_a^b g_k(t)g_l(t) dt}_{= \delta_{kl}}\\ &= \sum_{k=1}^p (x_{ik}- x_{jk})^2 \end{align*}\]
This makes the Andrews curves useful for visualising multivariate outliers and clusters in the data.
None of the basis functions of the six types of Andrews curves form
an orthonormal system, but two (type %in% c(1,2)
) of them
form an orthogonal system:
\[\int_{-\pi}^{\pi} g_i(t)g_j(t) dt = \begin{cases} \pi & \text{ if } i=j \\ 0 & \text{ if } i\neq j \end{cases}\] Thus, property 1 changes to
\[ \int_{-\pi}^{\pi} (f_i(t)-f_j(t))^2 dt = \pi \sum_{k=1}^p (x_{ik}- x_{jk})^2\]
Property 2 - The average curve is the mean value of the observations:
\[\begin{align*} \bar{f}(t) &= \frac1n \sum_{i=1}^n f_i(t) =\frac1n \sum_{i=1}^n \sum_{k=1}^p x_{ik} g_k(t)\\ &=\sum_{k=1}^p g_k(t) \left(\frac1n \sum_{i=1}^n x_{ik}\right)\\ &=\sum_{k=1}^p g_k(t) \bar{x}_k \end{align*}\]
Property 3 - For a given \(t\), the variable \(P_t=\sum_{k=1}^p X_k w_{kt}\) can be considered as a univariate projection of the variables \(X_1\), \(X_2\), …, \(X_p\). If \(Var(X_k)=\sigma\) and \(Cov(X_k,X_l)=\delta_{kl}\) hold then
\[ Var(P_t)= Var\left(\sum_{k=1}^p X_k
w_{kt}\right) = \sum_{k=1}^p w_{kt}^2 Var(X_k)\] For the curves
with type %in% c(1,2,6)
\(Var(P_t)\approx \sigma^2 p/2\) holds, since
\(\sin^2(t)+\cos^2(t)=1\). For these
curves, however, we only cover a subset of the possible one-dimensional
projections.
For type %in% c(3,5)
, the choice of \(\sqrt{p_i}\) ensures that all possible
one-dimensional projections are used, but the range \([a,b]\) is not fixed. Usually \(a=0\) is chosen and \(b\) “appropriate” (in practice
\(b=4\pi\)).