Think Globally, Fit Locally (Saul and Roweis 2003)
Modeling spectral data has garnered wide interest in the last four decades. Spectroscopy is the study of the spectral response of a matrix (e.g. soil, plant material, seeds, etc.) when it interacts with electromagnetic radiation. This spectral response directly or indirectly relates to a wide range of compositional characteristics (chemical, physical or biological) of the matrix. Therefore, it is possible to develop empirical models that can accurately quantify properties of different matrices. In this respect, quantitative spectroscopy techniques are usually fast, non-destructive and cost-efficient in comparison to conventional laboratory methods used in the analyses of such matrices. This has resulted in the development of comprehensive spectral databases for several agricultural products comprising large amounts of observations. The size of such databases increases de facto their complexity. To analyze large and complex spectral data, one must then resort to numerical and statistical tools and methods such as dimensionality reduction, and local spectroscopic modeling based on spectral dissimilarity concepts.
The aim of the resemble
package is to provide tools to efficiently and
accurately extract meaningful quantitative information from large and complex
spectral databases. The core functionalities of the package include:
Simply type and you will get the info you need:
citation(package = "resemble")
##
## To cite resemble in publications use:
##
## Ramirez-Lopez, L., and Stevens, A., and Viscarra Rossel, R., and
## Shen, Z., and Wadoux, A., and Breure, T. (2024). resemble: Regression
## and similarity evaluation for memory-based learning in spectral
## chemometrics. R package Vignette R package version 2.2.3.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{resemble-package,
## title = {resemble: Regression and similarity evaluation for memory-based learning in spectral chemometrics. },
## author = {Leonardo Ramirez-Lopez and Antoine Stevens and Claudio Orellano and Raphael {Viscarra Rossel} and Zefang Shen and Alex Wadoux and Timo Breure},
## publication = {R package Vignette},
## year = {2024},
## note = {R package version 2.2.3},
## url = {https://CRAN.R-project.org/package=resemble},
## }
This vignette uses the soil Near-Infrared (NIR) spectral dataset provided in the
package prospectr
package (Stevens and Ramirez-Lopez 2024). The reason why we use this dataset is because
soils are one of the most complex matrices analyzed with NIR spectroscopy. This
spectral dataset/library was used in the challenge by
Pierna and Dardenne (2008). The library contains NIR absorbance spectra of dried and sieved
825 soil observations/samples. These samples originate from agricultural fields
collected from all over the Walloon region in Belgium. The data are in an R
data.frame
object which is organized as follows:
Response variables:
Nt (Total Nitrogen in g/kg of dry soil): a numerical variable (values are available for 645 samples and missing for 180 samples).
Ciso (Carbon in g/100 g of dry soil): a numerical variable (values are available for 732 and missing for 93 samples).
CEC (Cation Exchange Capacity in meq/100 g of dry soil): A numerical variable (values are available for 447 and missing for 378 samples).
Predictor variables: the predictor variables are in a matrix embedded in
the data frame, which can be accessed via NIRsoil$spc
. These variables
contain the NIR absorbance spectra of the samples recorded between the
1100 nm and 2498 nm of the electromagnetic spectrum at 2 nm interval. Each
column name in the matrix of spectra represents a specific wavelength (in nm).
Set: a binary variable that indicates whether the samples belong to the training subset (represented by 1, 618 samples) or to the test subset (represented by 0, 207 samples).
Load the necessary packages and data:
The dataset can be loaded into R as follows:
This step aims at improving the signal quality of the spectra for quantitative
analysis. In this respect, the following standard methods are applied using the
package prospectr
(Stevens and Ramirez-Lopez 2024):
# obtain a numeric vector of the wavelengths at which spectra is recorded
wavs <- NIRsoil$spc %>% colnames() %>% as.numeric()
# pre-process the spectra:
# - resample it to a resolution of 6 nm
# - use first order derivative
new_res <- 5
poly_order <- 1
window <- 5
diff_order <- 1
NIRsoil$spc_p <- NIRsoil$spc %>%
resample(wav = wavs, new.wav = seq(min(wavs), max(wavs), by = new_res)) %>%
savitzkyGolay(p = poly_order, w = window, m = diff_order)
new_wavs <- as.matrix(as.numeric(colnames(NIRsoil$spc_p)))
matplot(x = wavs, y = t(NIRsoil$spc),
xlab = "Wavelengths, nm",
ylab = "Absorbance",
type = "l", lty = 1, col = "#5177A133")
matplot(x = new_wavs, y = t(NIRsoil$spc_p),
xlab = "Wavelengths, nm",
ylab = "1st derivative",
type = "l", lty = 1, col = "#5177A133")
Both the raw absorbance spectra and the first derivative spectra are shown in Figure 4.1. The first derivative spectra represents the explanatory variables that will be used for all the examples throughout this document.
For more explicit examples, the NIRsoil
data is split into training and
testing subsets:
# training dataset
training <- NIRsoil[NIRsoil$train == 1, ]
# testing dataset
testing <- NIRsoil[NIRsoil$train == 0, ]
In the resemble package we use the following notation (Ramirez-Lopez, Behrens, Schmidt, Stevens, et al. (2013)):
Training observations:
Xr
stands for the matrix of predictor variables in the reference/training
set (spectral data for calibration).
Yr
stands for the response variable(s) in the reference/training set
(dependent variable for calibration). In the context of this package, Yr
is
also referred as to “side information”, which is a variable or set of
variables that are associated to the training observations that can also be used
to support or guide optimization during modeling, but that not necessarily are
part of the input of such models. For example, we will see in latter sections
that Yr
can be used in Principal Component Analysis to help on deciding how
many components are optimal.
Testing observations:
Xu
stands for the matrix of predictor variables in the unknown/test
set (spectral data for validation/testing).
Yu
stands for the response variable(s) in the unknown/test set (dependent
variable for testing).
When conducting exploratory analysis of spectral data, we face the curse of dimensionality. It is such that we may be dealing with (using NIR spectra data as an example) hundreds to thousands of individual wavelengths for each spectrum. When one wants to find patterns in the data, spectral similarities and differences, or detect spectral outliers, it is necessary to reduce the dimension of the spectra while retaining important information.
Principal Component (PC) analysis and Partial Least Squares (PLS) decomposition methods assume that the meaningful structure the data intrinsically lies on a lower dimensional space. Both methods attempt to find a projection matrix that projects the original variables onto a less complex subspaces (represented by new few variables). These new variables mimic the original variability across observations. These two methods can be considered as the standard ones for dimensionality reduction in many fields of spectroscopic analysis.
The difference between PC and PLS is that in PC the objective is to find few new variables (which are orthogonal) that capture as much of the original data variance while in the latter the objective is to find few new variables that maximize their variance with respect to a set of one or more external variables (e.g. response variables or side information variables).
In PC and PLS the input spectra (\(X\), of \(n \times d\) dimensions) is decomposed into two main matrices: a matrix of scores (\(T\)) and a matrix of ladings (\(P\)), so that:
\[X = T \: P + \varepsilon\] where the dimensions of \(T\) and \(P\) are \(n \times o\) and \(o \times d\), and where \(o\) represents a given number of components being retained and \(\varepsilon\) represents the reconstruction error. The maximum \(o\) (number of components) that can be retrieved is limited to \(\textrm{min}(n-1, d)\). One interesting property of \(P\) is that it is equivalent to \(P^{-1}\). This implies that when the PC decomposition is estimated for a given set of observations (\(X_{new}\)) the resulting \(P\) matrix can be directly used to project new spectra onto the same principal component space by: \[T_{new} = X_{new}\:P'\]
In the resemble
package, PC analysis and PLS decomposition are available
through the ortho_projection()
function which offers the following algorithms:
"pca"
: the standard method for PC analysis based on the singular value
decomposition algorithm.
"pca.nipals"
: this algorithm uses the non-linear iterative partial
least squares algorithm (NIPALS, H. Wold 1975) for the purpose of PC analysis.
"pls"
: Here, PLS decomposition also uses the NIPALS algorithm, but in this
case it makes use of side information, which can be a variable or set of
variables that are associated to the training observations and that are used to
project the data. In this case, the variance between the projected variables and
the side information variable(s) is maximized.
The PC analysis of the training spectra can be executed as follows:
# principal component (pc) analysis with the default
# method (singular value decomposition)
pca_tr <- ortho_projection(Xr = training$spc_p, method = "pca")
pca_tr
Plot the ortho_projection
object:
The code above shows that in this dataset, 7 components are required to explain around 97% of the original variance found in the spectra (Figure 5.1).
Equivalent results can be obtained with the NIPALS algorithm:
# principal component (pc) analysis with the default
# NIPALS algorithm
pca_nipals_tr <- ortho_projection(Xr = training$spc_p,
method = "pca.nipals")
pca_nipals_tr
The advantage of the NIPALS algorithm is that it can be faster than SVD when only few components are required.
For a PLS decomposition the method
argument is set to "pls"
. In this case,
side information (Yr
) is required. In the following example, the side
information used is the Total Carbon (Ciso
):
# Partial Least Squares decomposition using
# Total carbon as side information
# (this might take some seconds)
pls_tr <- ortho_projection(Xr = training$spc_p,
Yr = training$Ciso,
method = "pls")
pls_tr
Note that in the previous code, for PLS projection the observations with missing
training$Ciso
are hold out, and then the projection takes place. The
missing observations are projected with the resulting projection matrix and
pooled together with the initial results.
By default the ortho_projection()
function retains all the first components
that, alone, account for at least 1% of the original variance of data. In
the following section we will see that the function also offers additional
options that might be more convenient for choosing the number of components.
Those options can be specified using the
pc_selection
argument. The following options are all the ones available for
that purpose:
"var"
(default option):Those components that alone explain more than a given amount of the original spectral variance are retained. Example:
"cumvar"
:Only the first components that together explain at least a given amount of the original variance are retained. Example:
"opc"
:This is a more sophisticated method in which the selection of the components is based on the side information concept presented in Ramirez-Lopez, Behrens, Schmidt, Stevens, et al. (2013). First let \(P\) be a sequence of retained components (so that \(P = 1, 2, ...,k\)). At each iteration, the function computes a dissimilarity matrix retaining \(p_i\) components. The values in this side information variable are compared against the side information values of their most spectrally similar observations. The optimal number of components retrieved by the function is the one that minimizes the root mean squared differences (RMSD) in the case of continuous variables, or maximizes the kappa index in the case of categorical variables. The RMSD is calucated as follows:
\[\begin{equation} j(i) = NN(xr_i, Xr^{\{-i\}}) \end{equation}\]
\[\begin{equation} RMSD = \sqrt{\frac{1}{m} \sum_{i=1}^m {(y_i - y_{j(i)})^2}} \end{equation}\]
where \(j(i) = NN(xr_i, Xr^{\{-i\}})\) represents a function to obtain the index
of the nearest neighbor observation found in \(Xr\) (excluding the \(i\)th
observation) for \(xr_i\), \(y_i\) is the value of the side variable of the \(i\)th
observation, \(y_{j(i)}\) is the value of the side variable of the nearest
neighbor of the \(i\)th observation and \(m\) is the total number of observations.
Note that for the "opc"
method Yr
is required (i.e. the side information of
the observations). Type help(sim_eval)
in the R
console to get more details
on how the RMSD and kappa are calculated in the function.
The rationale behind the "opc"
method is based on the assumption that the
closer two observations are in terms of their explanatory variables (Xr
), the
closer they may be in terms of their side information (Yr
).
# This uses optimal component selection
# variation in training$spc_p
optimal_sel <- list(method = "opc", value = 40)
pca_tr_opc <- ortho_projection(Xr = training$spc_p,
Yr = training$Ciso,
method = "pca",
pc_selection = optimal_sel)
pca_tr_opc
In the example above, 11 components are required to represent the space in which the overall Total Carbon difference between each sample and its corresponding nearest neighbor is minimized. The following graph shows how the RMSD varies as a function of the number of components (Figure 5.2):
The following code exemplifies how the RMSD is calculated (only for the 11th component, Figure 5.3):
# compute the dissimilarity matrix using all the retained scores
pc_diss <- f_diss(pca_tr_opc$scores, diss_method = "mahalanobis")
# get the nearest neighbor for each sample
nearest_n <- apply(pc_diss, MARGIN = 1, FUN = function(x) order(x)[2])
# compute the RMSD
rmsd <- sqrt(mean((training$Ciso - training$Ciso[nearest_n])^2, na.rm = TRUE))
rmsd
## [1] 0.8570114
# the RSMD for all the components is already available in
# ...$opc_evaluation
pca_tr_opc$opc_evaluation[pca_tr_opc$n_components, , drop = FALSE]
## pc rmsd_Yr
## 11 11 0.8570114
plot(training$Ciso[nearest_n],
training$Ciso,
ylab = "Ciso of the nearest neighbor, %", xlab = "Ciso, %",
col = "#D19C17CC", pch = 16)
grid()
"manual"
:The user explicitly defines how many components to retrieve. Example:
# This uses manual component selection
manual_sel <- list(method = "manual", value = 9)
# PC
pca_tr_manual <- ortho_projection(Xr = training$spc_p,
method = "pca",
pc_selection = manual_sel)
pca_tr_manual
# PLS
pls_tr_manual <- ortho_projection(Xr = training$spc_p,
Yr = training$Ciso,
method = "pls",
pc_selection = manual_sel)
pls_tr_manual
Both PC and PLS methods generate projection matrices that can be used to project
new observations onto the new lower dimensional score space they were built for.
In the case of PC analysis this projection matrix is equivalent to the transposed
matrix of loadings. The predict
method along with a projection model can be
used to project new data:
optimal_sel <- list(method = "opc", value = 40)
# PLS
pls_tr_opc <- ortho_projection(Xr = training$spc_p,
Yr = training$Ciso,
method = "pls",
pc_selection = optimal_sel,
scale = TRUE)
# the pls projection matrix
pls_tr_opc$projection_mat
pls_projected <- predict(pls_tr_opc, newdata = testing$spc_p)
# PC
pca_tr_opc <- ortho_projection(Xr = training$spc_p,
Yr = training$Ciso,
method = "pca",
pc_selection = optimal_sel,
scale = TRUE)
# the pca projection matrix
t(pca_tr_opc$X_loadings)
pca_projected <- predict(pca_tr_opc, newdata = testing$spc_p)
The ortho_projection()
function allows to project two separate datasets in one
run. For example, training and testing data can be passed to the function as
follows:
optimal_sel <- list(method = "opc", value = 40)
pca_tr_ts <- ortho_projection(Xr = training$spc_p,
Xu = testing$spc_p,
Yr = training$Ciso,
method = "pca",
pc_selection = optimal_sel,
scale = TRUE)
plot(pca_tr_ts)
In the above code for PC analyisis, training
and testing
datasets are pooled
together and then projected and split back for presenting the final results.
For the opc
selection method, the dissimilarity
matrices are built only for the training
data and for the observations with
available side information (Total Carbon). These dissimilarity matrices are used
only to find the optimal number of PCs. Note that Xr
and Yr
refer to the
same observations. Also note that the optimal number of PCs might not be the same
as when testing
is not passed to the Xu
argument since the PC projection
model is built from a different pool of observations.
In the case of PLS, the observations used for projection necessarily have to
have side information available, therefore the missing values in Yr
are hold
out during the projection model building. For these samples, the final projection
matrix is use to project them into the PLS space.
optimal_sel <- list(method = "opc", value = 40)
pls_tr_ts <- ortho_projection(Xr = training$spc_p,
Xu = testing$spc_p,
Yr = training$Ciso,
method = "pls",
pc_selection = optimal_sel,
scale = TRUE)
# the same PLS projection model can be obtained with:
pls_tr_ts2 <- ortho_projection(Xr = training$spc_p[!is.na(training$Ciso),],
Yr = training$Ciso[!is.na(training$Ciso)],
method = "pls",
pc_selection = optimal_sel,
scale = TRUE)
identical(pls_tr_ts$projection_mat, pls_tr_ts2$projection_mat)
The ortho_projection()
function allows to pass more than one variable to Yr
(side information):
optimal_sel <- list(method = "opc", value = 40)
pls_multi_yr <- ortho_projection(Xr = training$spc_p,
Xu = testing$spc_p,
Yr = training[, c("Ciso", "Nt", "CEC")],
method = "pls",
pc_selection = optimal_sel,
scale = TRUE)
plot(pls_multi_yr)
In the above code for PLS projections using multivariate side information, the
PLS2 method (based on the NIPALS algorithm) is used (see S. Wold et al. 1983).
The optimal component selection (opc
) also uses the multiple variables passed
to Yr
, the RMSD is computed for each of the variables. Each RMSD is then
standardized and the final RMSD used for optimization is their average.
For the example above, this data can be accessed as follows:
For PC analysis multivariate side information is also allowed for the opc
method. Alternatively, a categorical variable can also be used as side
information for the opc
. In that case, the kappa index is used instead of the
RMSD.
Similarity/dissimilarity measures between objects are often estimated by means of distance measurements, the closer two objects are to one another, the higher the similarity between them. Dissimilarity or distance measures are useful for a number of applications, for example for outlier detection and nearest neighbors search.
The dissimilarity()
function is the main function for measuring dissimilarities
between observations. It is basically a wrapper to other existing dissimilarity
functions within the package (see f_diss()
, cor_diss()
, sid()
and
ortho_diss()
). It allows to compute dissimilarities between:
all the observations in a single matrix.
observations in a matrix against observations in a second matrix.
The dissimilarity methods available in dissimilarity()
are as follows (see
diss_method
argument):
"pca"
: Mahalanobis distance computed on the matrix of scores of a PC projection of
Xr
(and Xu
if provided). PC projection is done using the singular value
decomposition (SVD) algorithm. Type help(ortho_diss)
for more details on the
function called by this method.
"pca.nipals"
: Mahalanobis distance computed on the matrix of scores of a
PC projection of Xr
(and Xu
if provided). PC projection is done
using the non-linear iterative partial least squares (NIPALS) algorithm.
Type help(ortho_diss)
in the R
console for more details on the function
called by this method.
"pls"
: Mahalanobis distance computed on the matrix of scores of a partial
least squares projection of Xr
(and Xu
if provided). In this case, Yr
is
always required. Type help(ortho_diss)
in the R
console for more details on
the function called by this method.
"cor"
: correlation dissimilarity which is based on the coefficient between
observations. Type help(cor_diss)
in the R
console for more details on the
function called by this method.
"euclid"
: Euclidean distance between observations. Type help(f_diss)
in
the R
console for more details on the function called by this method.
"cosine"
: Cosine distance between observations. Type help(f_diss)
in the
R
console for more details on the function called by this method.
"sid"
: spectral information divergence between observations. Type
help(sid)
in the R
console for more details on the function called by this
method.
In this package, the orthogonal space dissimilarities refer to dissimilarity measures performed either in the PC space or in the PLS space.
Since we can assume that the meaningful structure the data lies on a lower dimensional space, we can also assume that this lower dimensional space is optimal to measure the dissimilarity between observations (Ramirez-Lopez, Behrens, Schmidt, Rossel, et al. 2013).
To measure the dissimilarity between observations (\(x_i\) and \(x_j\)), the Mahalanobis distance is computed on their corresponding projected score vectors (\(t_i\) and \(t_j\)) found in the matrix of scores (\(\mathrm T\)):
\[d(x_i,x_j) = d(t_i,t_j) = \sqrt{\frac{1}{z}\sum(t_i - t_j) C^{-1}(t_i - t_j)'}\] where \(z\) is the number of components used, \(C^{-1}\) is the inverse of the covariance matrix computed from the matrix of projected variables for all the observations \(\mathrm T\). Since the projected variables are orthogonal to each other, the resulting \(C^{-1}\) would be equivalent to a diagonal matrix with the variance of each \(\mathrm T\) column in its main diagonal. Therefore, for this case of orthogonal spaces, the Mahalanobis distance is equivalent to the Euclidean distance applied on the variance-scaled \(\mathrm T\) (De Maesschalck, Jouan-Rimbaud, and Massart 2000).
To compute orthogonal dissimilarities in the resemble
package, the
dissimilarity()
function can be used as follows:
# for PC dissimilarity using the default settings
pcd <- dissimilarity(Xr = training$spc_p,
diss_method = "pca")
dim(pcd$dissimilarity)
# for PC dissimilarity using the optimized component selection method
pcd2 <- dissimilarity(Xr = training$spc_p,
diss_method = "pca.nipals",
Yr = training$Ciso,
pc_selection = list("opc", 20),
return_projection = TRUE)
dim(pcd2$dissimilarity)
pcd2$dissimilarity
pcd2$projection # the projection used to compute the dissimilarity matrix
# for PLS dissimilarity
plsd <- dissimilarity(Xr = training$spc_p,
diss_method = "pls",
Yr = training$Ciso,
pc_selection = list("opc", 20),
return_projection = TRUE)
dim(plsd$dissimilarity)
plsd$dissimilarity
plsd$projection # the projection used to compute the dissimilarity matrix
To compute the correlation dissimilarity between training and testing observations:
# For PC dissimilarity using the optimized component selection method
pcd_tr_ts <- dissimilarity(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pca.nipals",
Yr = training$Ciso,
pc_selection = list("opc", 20))
dim(pcd_tr_ts$dissimilarity)
# For PLS dissimilarity
plsd_tr_ts <- dissimilarity(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pls",
Yr = training$Ciso,
pc_selection = list("opc", 20))
dim(plsd_tr_ts$dissimilarity)
In the last two examples, matrices of 618 rows and 207 columns are retrieved. The number of rows is the same as in the training dataset while the number of columns is the same as in the testing dataset. The dissimilarity between the \(i\)th observation in the training dataset and the \(j\)th observation in the testing dataset is stored in the \(i\)th row and the \(j\)th column of the resulting dissimilarity matrices.
It is also possible to measure the dissimilarity between observations in a localized fashion. In this case, first a global dissimilarity matrix is computed. Then, by using this matrix for each target observation, a given set of k-nearest neighbors are identified. These neighbors (together with the target observation) are projected (from the original data space) onto a (local) orthogonal space (using the same parameters specified in the function). In this projected space the Mahalanobis distance between the target observation and its neighbors is recomputed. A missing value is assigned to the observations that do not belong to this set of neighbors (non-neighbor observations). In this case the dissimilarity matrix cannot be considered as a distance metric since it does not necessarily satisfies the symmetry condition for distance matrices (i.e. given two observations \(x_i\) and \(x_j\), the local dissimilarity, \(d\), between them is relative since generally \(d(x_i, x_j) \neq d(x_j, x_i)\)).
For computing this type of localized dissimilarity matrix, two arguments need to
be passed to the dissimilarity()
function: .local
and pre_k
. These are
not formal arguments of the function, however, they are passed to
the ortho_diss()
function which is used by the dissimilarity()
function for
computing the dissimilarities in the orthogonal spaces.
Here are two examples on how to perform localized dissimilarity computations:
# for localized PC dissimilarity using the optimized component selection method
# set the number of neighbors to retain
knn <- 200
local_pcd_tr_ts <- dissimilarity(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pca",
Yr = training$Ciso,
pc_selection = list("opc", 20),
.local = TRUE,
pre_k = knn)
## The neighborhoods of 207 observations contain missing 'Yr' values.
## Check ...$neighborhood_info
dim(local_pcd_tr_ts$dissimilarity)
## [1] 618 207
# For PLS dissimilarity
local_plsd_tr_ts <- dissimilarity(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pls",
Yr = training$Ciso,
pc_selection = list("opc", 20),
.local = TRUE,
pre_k = knn)
## The neighborhoods of 207 observations contain missing 'Yr' values.
## Check ...$neighborhood_info
dim(local_plsd_tr_ts$dissimilarity)
## [1] 618 207
# check the dissimilarity scores between the first two
# observations in the testing dataset and the first 10
# observations in the training dataset
local_plsd_tr_ts$dissimilarity[1:10, 1:2]
## Xu_1 Xu_2
## Xr_1 * *
## Xr_2 1.9537271 2.0093944
## Xr_3 2.0663499 *
## Xr_4 2.3116881 1.6281155
## Xr_5 * *
## Xr_6 * 1.8760568
## Xr_7 2.1609596 2.0513451
## Xr_8 2.1993001 1.8352469
## Xr_9 2.1764284 2.0324282
## Xr_10 * *
## *: not a neighbor
Correlation dissimilarity is based on the Pearson’s \(\rho\) correlation coefficient between observations. The value of Pearson’s \(\rho\) varies between -1 and 1. A correlation of 1 between two observations would indicate that they are perfectly correlated and might have identical characteristics (i.e. they are can be considered as highly similar). A value of -1, conversely, would indicate that the two observations are perfectly negatively correlated (i.e. the two observations are highly dissimilar). The correlation dissimilarity implemented in the package scales the values between 0 (highest dissimilarity) and 1 (highest similarity). To measure \(d\) between two observations \(x_i\) and \(x_j\) based on the correlation dissimilarity the following equation is applied:
\[d(x_i, x_j) = \frac{1}{2} (1 - \rho(x_i, x_j))\]
Note that \(d\) cannot be considered as a distance metric since it does not satisfy the axiom of identity of indiscernibles. Therefore we prefer the use of the term dissimilarity.
The following code demonstrates how to compute the correlation dissimilarity between all observations in the training dataset:
cd_tr <- dissimilarity(Xr = training$spc_p, diss_method = "cor")
dim(cd_tr$dissimilarity)
cd_tr$dissimilarity
To compute the correlation dissimilarity between training and testing observations:
cd_tr_ts <- dissimilarity(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "cor")
dim(cd_tr_ts$dissimilarity)
cd_tr_ts$dissimilarity
Alternatively, the correlation dissimilarity can be computed using a moving window. In this respect, a window size term \(w\) is introduced to the original equation:
\[d(x_i, x_j; w) = \frac{1}{2 w}\sum_{k=1}^{p-w}1 - \rho(x_{i,\{k:k+w\}}, x_{j,\{k:k+w\}})\]
In this case, the correlation dissimilarity is computed by averaging
the moving window correlation measures. The introduction of the window term
increases the computational cost in comparison to the simple correlation
dissimilarity. The moving window correlation dissimilarity can be computed by
setting the diss_method
argument to "cor"
and passing a window size value to
the ws
argument as follows:
In the computation of the Euclidean dissimilarity, each feature has equal significance. Hence, correlated variables which may represent irrelevant features, may have a disproportional influence on the final dissimilarity measurement (Brereton 2003). Therefore, it is not recommended to use this measure directly on the raw data. To compute the dissimilarity between two observations/vectors \(x_i\) and \(x_j\) the package uses the following equation:
\[d(x_i,x_j) = \sqrt{\frac{1}{p} \sum(x_i - x_j) (x_i - x_j)'}\] where \(p\) represents the number of variables.
With the dissimilarity()
function the Euclidean dissimilarity can be computed
as follows:
# compute the dissimilarity between all the training observations
ed <- dissimilarity(Xr = training$spc_p, diss_method = "euclid")
ed$dissimilarity
The dist()
function in the R
package stats
can also be used to compute
Euclidean distances, however the resemble
implementation tends to be faster
(especially for very large matrices):
# compute the dissimilarity between all the training observations
pre_time_resemble <- proc.time()
ed_resemble <- dissimilarity(Xr = training$spc_p, diss_method = "euclid")
post_time_resemble <- proc.time()
post_time_resemble - pre_time_resemble
pre_time_stats <- proc.time()
ed_stats <- dist(training$spc_p, method = "euclid")
post_time_stats <- proc.time()
post_time_stats - pre_time_stats
# scale the results of dist() based on the number of input columns
ed_stats_tr <- sqrt((as.matrix(ed_stats)^2)/ncol(training$spc_p))
ed_stats_tr[1:2, 1:3]
# compare resemble and R stats results of Euclidean distances
ed_resemble$dissimilarity[1:2, 1:3]
In the above code it can be seen that the results of the dist()
require
scaling based on the number of input variables. This means that, by default,
the values output by dist()
increase with the number of input
variables. This is an effect that is already accounted for in the implementation
of the Euclidean (and also Mahalanobis) dissimilarity implementation
of resemble
.
Another advantage of the Euclidean dissimilarity in resemble
over the one in
R
stats
is that the one in resemble
allows the computation of the
dissimilarities between observations in two matrices:
This dissimilarity metric is also known as the “Spectral Angle Mapper” which has been extensively applied in remote sensing as a tool for unsupervised classification and spectral similarity analysis. The cosine dissimilarity between two observations (\(x_i\) and \(x_j\)) is calculated as:
\[d (x_i, x_j) = cos^{-1} \tfrac{\sum_{k=1}^{p} x_{i,k} x_{j,k} } {\sqrt{\sum_{k=1}^{p} x_{i,k}^2} \sqrt{\sum_{k=1}^{p} x_{j,k}^2}}\] where \(p\) is the number of variables.
With the dissimilarity()
function the Euclidean dissimilarity can be computed
as follows:
The spectral information divergence (SID, Chang 2000) indicates how dissimilar are two observations based on their probability distributions. To account for the discrepancy between the distributions of two observations (\(x_i\) and \(x_j\)), the SID method uses the Kullback-Leibler divergence (\(kl\), Kullback and Leibler 1951) measure. Since the \(kl\) is a non-symmetric measure, i.e. \(kl (x_i, x_j) \neq kl(x_j, x_i)\), the dissimilarity between \(x_i\) and \(x_j\) based on this method is computed as:
\[d(x_i, x_j) = kl (x_i, x_j) + kl (x_j, x_i)\]
The following code can be used to compute the SID between the training and testing observations:
sid_tr_ts <- dissimilarity(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "sid")
dim(sid_tr_ts$dissimilarity)
sid_tr_ts$dissimilarity
See the sid()
function in the resemble
package for more details.
Usually, dissimilarity assessment is disregarded and the decision on what method to use is sometimes arbitrary. However, if the estimations of similarity/dissimilarity between observations from its predictor/explanatory variables fail to reflect the real or main similarity/dissimilarity, these estimations can be seen as useless for further analyses.
The package resemble
offers functionality for assessing dissimilarity matrices.
These assestments are based on first nearest neighbor search (1-NN). In this
section, the different methods to measure dissimilarity between spectra are
compared in terms of their ability to retrieve 1-NNs observations with
similar Total Carbon (“Ciso”). This indicates how well the spectral similarity
between observations reflect their compositional similarity.
Compute a dissimilarty matrix for training$spc_p
using the different methods:
# PC dissimilarity with default settings (variance-based
# of components)
pcad <- dissimilarity(training$spc_p, diss_method = "pca", scale = TRUE)
# PLS dissimilarity with default settings (variance-based
# of components)
plsd <- dissimilarity(training$spc_p, diss_method = "pls", Yr = training$Ciso,
scale = TRUE)
# PC dissimilarity with optimal selection of components
opc_sel <- list("opc", 30)
o_pcad <- dissimilarity(training$spc_p,
diss_method = "pca",
Yr = training$Ciso,
pc_selection = opc_sel,
scale = TRUE)
# PLS dissimilarity with optimal selection of components
o_plsd <- dissimilarity(training$spc_p,
diss_method = "pls",
Yr = training$Ciso,
pc_selection = opc_sel,
scale = TRUE)
# Correlation dissimilarity
cd <- dissimilarity(training$spc_p, diss_method = "cor", scale = TRUE)
# Moving window correlation dissimilarity
mcd <- dissimilarity(training$spc_p, diss_method = "cor", ws = 51, scale = TRUE)
# Euclidean dissimilarity
ed <- dissimilarity(training$spc_p, diss_method = "euclid", scale = TRUE)
# Cosine dissimilarity
cosd <- dissimilarity(training$spc_p, diss_method = "cosine", scale = TRUE)
# Spectral information divergence/dissimilarity
sinfd <- dissimilarity(training$spc_p, diss_method = "sid", scale = TRUE)
Use the sim_eval()
function with each dissimilarity matrix to find the closest
observation to each observation and compare them in terms of the Ciso
variable:
Ciso <- as.matrix(training$Ciso)
ev <- NULL
ev[["pcad"]] <- sim_eval(pcad$dissimilarity, side_info = Ciso)
ev[["plsd"]] <- sim_eval(plsd$dissimilarity, side_info = Ciso)
ev[["o_pcad"]] <- sim_eval(o_pcad$dissimilarity, side_info = Ciso)
ev[["o_plsd"]] <- sim_eval(o_plsd$dissimilarity, side_info = Ciso)
ev[["cd"]] <- sim_eval(cd$dissimilarity, side_info = Ciso)
ev[["mcd"]] <- sim_eval(mcd$dissimilarity, side_info = Ciso)
ev[["ed"]] <- sim_eval(ed$dissimilarity, side_info = Ciso)
ev[["cosd"]] <- sim_eval(cosd$dissimilarity, side_info = Ciso)
ev[["sinfd"]] <- sim_eval(sinfd$dissimilarity, side_info = Ciso)
Table 6.1 and Figure 6.1 show the
results of the comparisons (for the training dataset) between the Total Carbon
of the observations and the Total Carbon of their most similar samples (1-NN)
according to the dissimilarity method used. In the example, the spectral
dissimilarity matrices that best reflect the compositions similarity are those
built with the
pls with optimized component selection (o_plsd
) and
pca with optimized component selection (o_pcad
).
comparisons <- lapply(names(ev),
FUN = function(x, label) {
irmsd <- x[[label]]$eval[1]
ir <- x[[label]]$eval[2]
data.frame(Measure = label,
RMSD = irmsd,
r = ir)
},
x = ev)
comparisons
Measure | RMSD | r |
---|---|---|
pcad | 0.85 | 0.9 |
plsd | 0.81 | 0.89 |
o_pcad | 0.8 | 0.91 |
o_plsd | 0.75 | 0.91 |
cd | 0.99 | 0.86 |
mcd | 0.92 | 0.88 |
ed | 0.82 | 0.9 |
cosd | 1.01 | 0.86 |
sinfd | 1.44 | 0.72 |
old_par <- par("mfrow")
par(mfrow = c(3, 3))
p <- sapply(names(ev),
FUN = function(x, label, labs = c("Ciso (1-NN), %", "Ciso, %")) {
xy <- x[[label]]$first_nn[,2:1]
plot(xy, xlab = labs[1], ylab = labs[2], col = "red")
title(label)
grid()
abline(0, 1)
},
x = ev)
par(old_par)
In the package, the k-NN search aims at finding in a given reference set of observations a group of spectrally similar observations for another given set of observations. For an observation, its most similar observations are known as nearest neighbors and they are usually found by using dissimilairty metrics.
In resemble
, the k-nearest neighbor search is implemented in the
function search_neighbors()
. This function uses the dissimilarity()
function
to compute the dissimilarity matrix that serves in the identification of the
neighbors. These neighbors can be retained in two ways: i. by providing a
specific number of neighbors or ii. by setting a dissimilarity
threshold (\(d_{th}\)).
We encourage readers to go through the section where we discuss about dissimilarity measures, which serves as the basis for the examples presented in this section.
This means that the neighboring observations are retained regardless their
dissimilarity/distance to the target observation. Each target observation for
which its neighbors are to be found ends up with the same neighborhood size
(\(k\)). A drawback of this approach is that observations that are in fact
largely dissimilar to the target observation might end up in its neighborhood.
This is because the requirement for building the neighborhood is based on its
size and not on the similarity of the retained observations to the target one.
In the dissimilarity()
function, the neighborhood size is controlled by
the argument k
.
Here is an example that demonstrates how search_neighbors
can be used to
search in the training
set the spectral neighbors of the testing
set:
knn_pc <- search_neighbors(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pca.nipals",
k = 50)
# matrix of neighbors
knn_pc$neighbors
# matrix of neighbor distances (dissimilarity scores)
knn_pc$neighbors_diss
# the index (in the training set) of the first two closest neighbors found in
# training for the first observation in testing:
knn_pc$neighbors[1:2, 1, drop = FALSE]
# the distances of the two closest neighbors found in
# training for the first observation in testing:
knn_pc$neighbors_diss[1:2, 1, drop = FALSE]
# the indices in training that fall in any of the
# neighborhoods of testing
knn_pc$unique_neighbors
In the above code, knn_pc$neighbors
is a matrix showing the results of the
neighbors found. This is a matrix of neighbor indices where every column
represents an observarion in the testing set while every row represents the
neighbor index (in descending order). Every entry represents the index of the
neighbor observation in the training set. The knn_pc$neighbors_diss
matrix
shows the dissimilarity scores corresponding to the neighbors found. For example,
for the first observation in testing
its closest observation found in
training
corresponds to the one with index 526
(knn_pc$neighbors[1]
) which has a dissimilarity score of
3.57 (knn_pc$neighbors_diss[1]
).
Neighbor search can also be conducted with all the dissimilarity measures described in previous sections. The neighbors retrieved will then depend on the dissimilarity method used. Thus, it is recommended to evaluate carefully what dissimilarity metric to use before neighbor search.
Here are other examples of neighbor search based on other dissimilarity measures:
# using PC dissimilarity with optimal selection of components
knn_opc <- search_neighbors(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pca.nipals",
Yr = training$Ciso,
k = 50,
pc_selection = list("opc", 20),
scale = TRUE)
# using PLS dissimilarity with optimal selection of components
knn_pls <- search_neighbors(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pls",
Yr = training$Ciso,
k = 50,
pc_selection = list("opc", 20),
scale = TRUE)
# using correlation dissimilarity
knn_c <- search_neighbors(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "cor",
k = 50, scale = TRUE)
# using moving window correlation dissimilarity
knn_mwc <- search_neighbors(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "cor",
k = 50,
ws = 51, scale = TRUE)
Another example with localized PC and PLS dissimilarity measures:
# using localized PC dissimilarity with optimal selection of components
knn_local_opc <- search_neighbors(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pca.nipals",
Yr = training$Ciso,
k = 50,
pc_selection = list("opc", 20),
scale = TRUE,
.local = TRUE,
pre_k = 250)
# using localized PLS dissimilarity with optimal selection of components
knn_local_opc <- search_neighbors(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pls",
Yr = training$Ciso,
k = 50,
pc_selection = list("opc", 20),
scale = TRUE,
.local = TRUE,
pre_k = 250)
Here, the neighboring observations to be retained must have a dissimilarity
score less or equal to a given dissimilarity threshold (\(d_{th}\)). Therefore,
the neighborhood size of the target observations is not constant. A drawback
with this approach is that choosing a meaningful \(d_{th}\) can be difficult,
especially because its value is largely influenced by the dissimilarity method
used. Furthermore, some neighborhoods retrieved by certain thresholds might be
of a very small size or even empty, which constraints any type of analysis
within such neighborhoods. On the other hand, some neighborhood might end up
with large sizes which might include either redundant observations or in some
other cases where \(d_{th}\) is too large the complexity in the neighborhood might
be large. In the dissimilarity()
function, \(d_{th}\) is controlled by the
argument k_diss
. This argument is accompanied by the argument k_range
which
is used to control the maximum and minimum neighborhood sizes. For example, if
a neighborhood size is below the minimum size \(k_{min}\) specified in k_range
,
the function automatically ignores \(d_{th}\) and retrieves the \(k_{min}\) closest
observations. Similarly, if the neighborhood size is above the maximum size
\(k_{max}\) specified in k_range
the function automatically ignores \(d_{th}\) and
retrieves only a maximum of \(k_{max}\) neighbors.
In the package, we can use search_neighbors()
to find in the training
set
the neighbors of the testing
set which dissimilarity scores are less or equal
to a user-defined threshold:
# a dissimilarity threshold
d_th <- 1
# the minimum number of observations required in each neighborhood
k_min <- 20
# the maximum number of observations allowed in each neighborhood
k_max <- 300
dnn_pc <- search_neighbors(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pca.nipals",
k_diss = d_th,
k_range = c(k_min, k_max),
scale = TRUE)
# matrix of neighbors. The minimum number of indices is 20 (given by k_min)
# and the maximum number of indices is 300 (given by k_max).
# NAs indicate "not a neighbor"
dnn_pc$neighbors
# this reports how many neighbors were found for each observation in
# testing using the input distance threshold (column n_k) and how
# many were finally selected (column final_n_k)
dnn_pc$k_diss_info
# matrix of neighbor distances
dnn_pc$neighbors_diss
# the indices in training that fall in any of the
# neighborhoods of testing
dnn_pc$unique_neighbors
In the code above, the size of the neighborhoods is not constant, the size
variability can be easily visualized with a histogram on
dnn_pc$k_diss_info$n_k
. Figure 7.1, shows that many
neighborhoods were reset to a size of 20 or to a size of 300.
hist(dnn_pc$k_diss_info$final_n_k,
breaks = k_min,
xlab = "Final neighborhood size",
main = "", col = "#EFBF47CC")
In the package, spiking refers to forcing specific observations to be included
in the neighborhoods. For example, if we are searching in the training
set
the neighbors of the testing
set, and if we want to force certain observations
in training
to be included in the neighborhood of each observation in
tetsing
, we can use the spike
argument in search_neighbors()
. For that,
in this argument we will need to pass the indices of training
that we will
be forced into the neighborhoods. The following example demonstrates how to do
that:
measures:
# the indices of the observations that we want to "invite" to every neighborhood
forced_guests <- c(1, 5, 8, 9)
# using PC dissimilarity with optimal selection of components
knn_spiked <- search_neighbors(Xr = training$spc_p,
Xu = testing$spc_p,
diss_method = "pca.nipals",
Yr = training$Ciso,
k = 50,
spike = forced_guests,
pc_selection = list("opc", 20))
# check the first 8 neighbors found in training for the
# first 2 observations in testing
knn_spiked$neighbors[1:8, 1:2]
The previous code shows that the indices specified in forced_guests
are always
selected as part of every neighborhood.
Spiking might be useful when there is a prior knowledge of the similarity between certain observations that cannot be easily pick up by the data.
Memory-based learning (MBL) describes a family of (non-linear) machine learning methods designed to deal with complex spectral datasets (Ramirez-Lopez, Behrens, Schmidt, Stevens, et al. 2013). In MBL, instead of deriving a general or global regression function, a specific regression model is built for each observation requiring a prediction of a response. Each model is fitted using the nearest neighbors of the target observation found in a calibration or reference set [8.1]. While a global function may be very complex, MBL can describe the target function as a collection of less complex local (or locally stable) approximations (Mitchell 1997). For example, for predicting the response variable \(Y\) of a set of \(m\) observations from their explanatory variables \(X\), a set of \(m\) functions are required to be fitted. This can be described as:
\[\hat{y}_i = \hat{f}_i(x_i;\theta_i) \; \forall \; i = \{1, ..., m\}\] where \(\theta_{i}\) represents a set of particular parameters required to fit \(\hat{f}_i\) (e.g. number of factors in a PLS model). Therefore, MBL in the above example can be described broadly as:
\[\hat{f} = \{\hat{f}_1,...,\hat{f}_m\}\] Figure 8.1 illustrates the basic steps in MBL for a set of five observations (\(m = 5\)).
There are four basic aspects behind the steps in Figure 8.1 that must be defined for any MBL algorithm:
A dissimilarity metric: It is required for neighbor search. The dissimilarity metric used must be capable also to reflect the dissimilarity in terms of the response variable for which models are to be built. For example, in soil NIR spectroscopy, the spectral dissisimilarity values of soil samples must be capable of reflecting the compositional dissisimilarity between them. Dissimilarity methods that poorly reflect this general sample dissimilarity are prone to lead to MBL models with poor predictive performance.
How many neighbors to look at?: It is important to optimize the neighborhood size to be used for fitting the local models. Neighborhoods which are too small might be too sensitive to noise and outliers affecting the robustness of the models (Ramirez-Lopez, Behrens, Schmidt, Stevens, et al. 2013). Small neighborhoods might also lack of enough variance to properly capture the relationships between the predictors and the response. On the other hand, large size neighborhoods might introduce complex non-linear relationships between predictors and response which might decrease the accuracy of the models.
How to use the dissimilarity information?: The dissimilarity information can be:
Ignored, this means is only used to retrieve neighbors (e.g. the LOCAL algorithm, Shenk, Westerhaus, and Berzaghi 1997).
Used to weight the training observations according to their dissimilarity to the target observation (e.g. as in locally weighted PLS regression, Naes, Isaksson, and Kowalski 1990).
Used as source of additional predictors (Ramirez-Lopez, Behrens, Schmidt, Stevens, et al. 2013). In this
case, the pairwise dissimilarity matrix between all the \(k\) neighbors
is also retrieved. This matrix of \(k \times k\) dimensions is combined with
the \(p\) predictor variables resulting in a final matrix of predictors
(for the neighborhood) of \(k \times (k+p)\) dimensions. To predict the
target observation, the predictors used are the \(p\) spectral variables in
combination to the vector of distances between the target observation and
its neighbors. In some cases, this approach might lead to an increase on
the predictive performance of the local models. This combined matrix of
predictors can be built as follows:
\[\begin{equation}
\begin{bmatrix}
0_{1,1} & d_{2,1} & ... & d_{1,k} & x_{1,k+1} & x_{1,k+2} & ...& x_{1,k+p}\\
d_{1,2} & 0_{2,2} & ... & d_{2,k} & x_{2,k+1} & x_{2,k+2} & ...& x_{2,k+p}\\
... & ... & ... & ... & ... & ... & ...& ... \\
d_{k,1} & d_{k,2} & ... & 0_{k,k} & x_{k,k+1} & x_{k,k+2} & ...& x_{k,k+p}
\end{bmatrix}
\end{equation}\]
where \(d_{i,j}\) represents the dissimilarity score between the \(i\)th
neighbor and the \(j\)th neighbor.
How to fit the local points?: This is given by the regression method used which is usually a linear one, as the relationships between the explanatory variables and the response are usually assumed linear within the neighborhood.
In the literature MBL is sometimes referred to as local modeling, nevertheless local modeling comprises other approaches, for example, cluster–based modeling and geographical segmentation-based modeling, etc. Hence, MBL can be described as a type of local modeling (Ramirez-Lopez, Behrens, Schmidt, Stevens, et al. 2013).
The mbl()
function in the resemble
package offers the possibility to build
customized memory-based learners. This can be done by choosing from different
dissimilarity metrics, different methods for neighborhood size optimization,
different ways of using the dissimilarity information and different regression
methods for fitting the models within the neighborhoods.
We encourage readers to go through the sections corresponding to dissimilarity measures and k-Nearest Neighbors search which serve as the basis for the examples presented in this section.
The mbl()
function can be described in regards to the four basic aspects
of the MBL methods (which are described few paragraphs above):
The dissimilarity metric: this is controlled by the diss_method
argument of the mbl()
function. The methods available are the same as the ones
described in the dissimilarity measures section.
How many neighbors to look at?: this can be defined either in the k
or in the k_diss
arguments of the mbl()
function. These arguments operate
in a similar fashion as their counterparts in the search_neighbors()
function
described in the k-Nearest Neighbors search section.
However, in mbl()
a vector of different neighborhood sizes can be passed to k
,
or a vector of different distance thresholds can be passed to k_diss
. This
allows to test different values in one run. Here, the k_diss
argument is
also accompanied by the argument k_range
which is used to control the
maximum and minimum neighborhood sizes.
How to use the dissimilarity information: this is controlled by the
diss_usage
argument. If "none"
is passed, the dissimilarity information is
ignored, if "weights"
is passed, the dissimilarity information is used to
weight the training observations (using a tricubic function). If "predictors"
is passed, the dissimilarity information is used as source of additional
predictors.
How to fit the local points?: This is controlled by the method
argument. For this, a local_fit
object (which carries the information of
the regression method and its parameters) is used. There are three methods
available: partial least squares (PLS) regression, weighted average partial
least squares regression (WAPLS, Shenk, Westerhaus, and Berzaghi 1997) and Gaussian process
regression (GPR) with dot product covariance. The following examples show how
to build such local_fit
objects:
# creates an object with instructions to build PLS models
my_plsr <- local_fit_pls(pls_c = 15)
my_plsr
## Partial least squares (pls)
## Number of factors: 15
# creates an object with instructions to build WAPLS models
my_waplsr <- local_fit_wapls(min_pls_c = 3, max_pls_c = 20)
my_waplsr
## Weighted average partial least squares (wapls)
## Min. and max. number of factors: from 3 to 20
# creates an object with instructions to build GPR models
my_gpr <- local_fit_gpr()
my_gpr
## Gaussian process with linear kernel/dot product (gpr)
## Noise: 0.001
A function named mbl_control()
allows to build objects that control internal
validation and some optimization aspects of the mbl()
function. In
mbl_control()
two types of validation can be specified using the
validation_type
argument:
Leave-nearest-neighbor-out cross-validation ("NNv"
): From the group of
neighbors of each target observation, the nearest observation (i.e. the most
similar observation) is excluded and then a local model is fitted using the
remaining neighbors. This model is then used to predict the value of the
response variable of the nearest observation. The set of predicted values in
all the 1-NN observations are finally cross-validated against their
corresponding reference values.
Local leave-group-out cross-validation ("local_cv"
): The group of neighbors
of each observation to be predicted is partitioned into different equal size
subsets. The model fitted with the selected calibration samples is used to
predict the response values of the local validation samples and the local root
mean square error is computed. This process is repeated \(m\) times and
the final local error is computed as the average of the local root mean square
errors obtained for all the \(m\) iterations. In the mbl_control()
function
\(m\) is controlled by the numbe
argument and the size of the subsets is
controlled by the p
argument which indicates the percentage of observations
to be selected from the subset of nearest neighbors. The global error of the
predictions is computed as the average of the local root mean square errors.
Let’s see some examples on how to build objects for controlling the validation
in mbl
.
# create an object with instructions to conduct both validation types
# "NNv" "local_cv"
two_val_control <- mbl_control(validation_type = c("NNv", "local_cv"),
number = 10,
p = 0.75)
The object two_val_control
stores the instructions for conducting both types
of validations ("NNv"
and "local_cv"
). For "local_cv"
, the number of groups
is set to 10 and the percentage of neighbors to build
the local calibration groups is set to 75%.
Now that we have explained the main components for the mbl()
let’s see how the
mbl()
function can be used to predict response variables in the testing
set
by building models with the training
set. The following MBL configuration
reproduces the LOCAL algorithm (Shenk, Westerhaus, and Berzaghi 1997):
# define the dissimilarity method
my_diss <- "cor"
# define the neighborhood sizes to test
my_ks <- seq(80, 200, by = 40)
# define how to use the dissimilarity information (ignore it)
ignore_diss <- "none"
# define the regression method to be used at each neighborhood
my_waplsr <- local_fit_wapls(min_pls_c = 3, max_pls_c = 20)
# for the moment use only "NNv" validation (it will be faster)
nnv_val_control <- mbl_control(validation_type = "NNv")
# predict Total Carbon
# (remove missing values)
local_ciso <- mbl(
Xr = training$spc_p[!is.na(training$Ciso),],
Yr = training$Ciso[!is.na(training$Ciso)],
Xu = testing$spc_p,
k = my_ks,
method = my_waplsr,
diss_method = my_diss,
diss_usage = ignore_diss,
control = nnv_val_control,
scale = TRUE
)
Now let’s explore the local_ciso
object:
local_ciso
##
## Call:
##
## mbl(Xr = training$spc_p[!is.na(training$Ciso), ], Yr = training$Ciso[!is.na(training$Ciso)],
## Xu = testing$spc_p, k = my_ks, method = my_waplsr, diss_method = my_diss,
## diss_usage = ignore_diss, control = nnv_val_control, scale = TRUE)
##
## _______________________________________________________
##
## Total number of observations predicted: 207
## _______________________________________________________
##
## Nearest neighbor validation statistics
##
## k rmse st_rmse r2
## <num> <num> <num> <num>
## 1: 80 0.621 0.0363 0.874
## 2: 120 0.646 0.0377 0.864
## 3: 160 0.636 0.0371 0.868
## 4: 200 0.664 0.0388 0.859
## _______________________________________________________
According to the results obtained in the above example, the neighborhood size
that minimizes the root mean squared error (RMSE) in nearest neighbor
cross-validation is 80. Let’s get the predictions done for the testing
dataset:
bki <- which.min(local_ciso$validation_results$nearest_neighbor_validation$rmse)
bk <- local_ciso$validation_results$nearest_neighbor_validation$k[bki]
# all the prediction results are stored in:
local_ciso$results
# the get_predictions function makes easier to retrieve the
# predictions from the previous object
ciso_hat <- as.matrix(get_predictions(local_ciso))[, bki]
# Plot predicted vs reference
plot(ciso_hat, testing$Ciso,
xlim = c(0, 14),
ylim = c(0, 14),
xlab = "Predicted Total Carbon, %",
ylab = "Total Carbon, %",
main = "LOCAL using argument k")
grid()
abline(0, 1, col = "red")
The prediction root mean squared error is then:
# prediction RMSE:
sqrt(mean((ciso_hat - testing$Ciso)^2, na.rm = TRUE))
## [1] 0.4506581
# squared R
cor(ciso_hat, testing$Ciso, use = "complete.obs")^2
## [1] 0.9155142
Similar results are obtained when the optimization of the neighbrhoods is based on distance thresholds:
# create a vector of dissimilarity thresholds to evaluate
# since the correlation dissimilarity will be used
# these thresholds need to be > 0 and <= 1
dths <- seq(0.025, 0.3, by = 0.025)
# indicate the minimum and maximum sizes allowed for the neighborhood
k_min <- 30
k_max <- 200
local_ciso_diss <- mbl(
Xr = training$spc_p[!is.na(training$Ciso),],
Yr = training$Ciso[!is.na(training$Ciso)],
Xu = testing$spc_p,
k_diss = dths,
k_range = c(k_min, k_max),
method = my_waplsr,
diss_method = my_diss,
diss_usage = ignore_diss,
control = nnv_val_control,
scale = TRUE
)
local_ciso_diss
##
## Call:
##
## mbl(Xr = training$spc_p[!is.na(training$Ciso), ], Yr = training$Ciso[!is.na(training$Ciso)],
## Xu = testing$spc_p, k_diss = dths, k_range = c(k_min, k_max),
## method = my_waplsr, diss_method = my_diss, diss_usage = ignore_diss,
## control = nnv_val_control, scale = TRUE)
##
## _______________________________________________________
##
## Total number of observations predicted: 207
## _______________________________________________________
##
## Nearest neighbor validation statistics
##
## k_diss p_bounded rmse st_rmse r2
## <num> <char> <num> <num> <num>
## 1: 0.025 94.686% 0.611 0.0357 0.878
## 2: 0.050 71.014% 0.613 0.0358 0.877
## 3: 0.075 62.319% 0.620 0.0362 0.874
## 4: 0.100 51.691% 0.610 0.0356 0.879
## 5: 0.125 41.546% 0.588 0.0343 0.887
## 6: 0.150 31.884% 0.537 0.0313 0.906
## 7: 0.175 21.739% 0.571 0.0333 0.894
## 8: 0.200 26.57% 0.615 0.0359 0.877
## 9: 0.225 28.986% 0.595 0.0347 0.884
## 10: 0.250 29.952% 0.628 0.0366 0.871
## 11: 0.275 33.816% 0.618 0.0361 0.875
## 12: 0.300 37.681% 0.625 0.0365 0.873
## _______________________________________________________
The best correlation dissimilarity threshold is 0.15. The column “p_bounded”
in the table of validation results, indicate the percentage of neighborhoods
for which the size was reset either to k_min
or k_max
.
Here we provide few additional examples of some MBL configurations where we make use of another response variable available in the dataset: soil cation exchange capacity (CEC). This variable is perhaps more challenging to predict in comparison to Total Carbon. Table 8.1 provides a summary of the configurations tested in the following code examples.
Abreviation | Dissimilarity method | Dissimilarity usage | Local regression |
---|---|---|---|
local_cec |
Correlation | None | Weighted average PLS |
pc_pred_cec |
optimized PC | Source of predictors | Weighted average PLS |
pls_pred_cec |
optimized PLS | None | Weighted average PLS |
local_gpr_cec |
optimized PC | Source of predictors | Gaussian process |
# Lets define some methods:
my_wapls <- local_fit_wapls(2, 25)
k_min_max <- c(80, 200)
# use the LOCAL algorithm
# specific thresholds for cor dissimilarity
dth_cor <- seq(0.01, 0.3, by = 0.03)
local_cec <- mbl(
Xr = training$spc_p[!is.na(training$CEC),],
Yr = training$CEC[!is.na(training$CEC)],
Xu = testing$spc_p,
k_diss = dth_cor,
k_range = k_min_max,
method = my_wapls,
diss_method = "cor",
diss_usage = "none",
control = nnv_val_control,
scale = TRUE
)
# use one where pca dissmilarity is used and the dissimilarity matrix
# is used as source of additional predictors
# lets define first a an appropriate vector of dissimilarity thresholds
# for the PC dissimilarity method
dth_pc <- seq(0.05, 1, by = 0.1)
pc_pred_cec <- mbl(
Xr = training$spc_p[!is.na(training$CEC),],
Yr = training$CEC[!is.na(training$CEC)],
Xu = testing$spc_p,
k_diss = dth_pc,
k_range = k_min_max,
method = my_wapls,
diss_method = "pca",
diss_usage = "predictors",
control = nnv_val_control,
scale = TRUE
)
# use one where PLS dissmilarity is used and the dissimilarity matrix
# is used as source of additional predictors
pls_pred_cec <- mbl(
Xr = training$spc_p[!is.na(training$CEC),],
Yr = training$CEC[!is.na(training$CEC)],
Xu = testing$spc_p,
Yu = testing$CEC,
k_diss = dth_pc,
k_range = k_min_max,
method = my_wapls,
diss_method = "pls",
diss_usage = "none",
control = nnv_val_control,
scale = TRUE
)
# use one where Gaussian process regression and pca dissmilarity are used
# and the dissimilarity matrix is used as source of additional predictors
local_gpr_cec <- mbl(
Xr = training$spc_p[!is.na(training$CEC),],
Yr = training$CEC[!is.na(training$CEC)],
Xu = testing$spc_p,
k_diss = dth_pc,
k_range = k_min_max,
method = local_fit_gpr(),
diss_method = "pca",
diss_usage = "predictors",
control = nnv_val_control,
scale = TRUE
)
Collect the predictions for each configuration:
# get the indices of the best results according to
# nearest neighbor validation statistics
c_val_name <- "validation_results"
c_nn_val_name <- "nearest_neighbor_validation"
bi_local <- which.min(local_cec[[c_val_name]][[c_nn_val_name]]$rmse)
bi_pc_pred <- which.min(pc_pred_cec[[c_val_name]][[c_nn_val_name]]$rmse)
bi_pls_pred <- which.min(pls_pred_cec[[c_val_name]][[c_nn_val_name]]$rmse)
bi_local_gpr <- which.min(local_gpr_cec[[c_val_name]][[c_nn_val_name]]$rmse)
preds <- cbind(get_predictions(local_cec)[, ..bi_local],
get_predictions(pc_pred_cec)[, ..bi_pc_pred],
get_predictions(pls_pred_cec)[, ..bi_pls_pred],
get_predictions(local_gpr_cec)[, ..bi_local_gpr])
colnames(preds) <- c("local_cec",
"pc_pred_cec",
"pls_pred_cec",
"local_gpr_cec")
preds <- as.matrix(preds)
# R2s
cor(testing$CEC, preds, use = "complete.obs")^2
## local_cec pc_pred_cec pls_pred_cec local_gpr_cec
## [1,] 0.7557315 0.7977637 0.77421 0.7790474
#RMSEs
colMeans((preds - testing$CEC)^2, na.rm = TRUE)^0.5
## local_cec pc_pred_cec pls_pred_cec local_gpr_cec
## 3.334444 3.202677 3.299601 3.189961
The scatter plots in 8.3 ilustrate the prediction results obatined for CEC with each of the MBL configurations tested.
old_par <- par("mfrow", "mar")
par(mfrow = c(2, 2))
plot(testing$CEC, preds[, 2],
xlab = "Predicted CEC, meq/100g",
ylab = "CEC, meq/100g", main = colnames(preds)[2])
abline(0, 1, col = "red")
plot(testing$CEC, preds[, 3],
xlab = "Predicted CEC, meq/100g",
ylab = "CEC, meq/100g", main = colnames(preds)[3])
abline(0, 1, col = "red")
plot(testing$CEC, preds[, 4],
xlab = "Predicted CEC, meq/100g",
ylab = "CEC, meq/100g", main = colnames(preds)[4])
abline(0, 1, col = "red")
par(old_par)
Yu
argumentIf information of the response values in the prediction set is available, then,
the Yu
argument can be used to directly validate the predictions done by
mbl()
. It is not taken into accound for any optimization or modeling step.
It can be used as follows:
# use Yu argument to validate the predictions
pc_pred_nt_yu <- mbl(
Xr = training$spc_p[!is.na(training$Nt),],
Yr = training$Nt[!is.na(training$Nt)],
Xu = testing$spc_p,
Yu = testing$Nt,
k = seq(40, 100, by = 10),
diss_usage = "none",
control = mbl_control(validation_type = "NNv"),
scale = TRUE
)
pc_pred_nt_yu
The mbl()
function uses the foreach()
function of the package foreach
for iterating over every
row/observation passed to the argument Xu
. In the following example,
we use the package doParallel
to set up the cores to be used. Alternatively the package doSNOW
can also be used. In the
following example we use parallel processing to predict Total Nitrogen:
# Running the mbl function using multiple cores
# Execute with two cores, if available, ...
n_cores <- 2
# ... if not then go with 1 core
if (parallel::detectCores() < 2) {
n_cores <- 1
}
# Set the number of cores
library(doParallel)
clust <- makeCluster(n_cores)
registerDoParallel(clust)
# Alternatively:
# library(doSNOW)
# clust <- makeCluster(n_cores, type = "SOCK")
# registerDoSNOW(clust)
# getDoParWorkers()
pc_pred_nt <- mbl(
Xr = training$spc_p[!is.na(training$Nt),],
Yr = training$Nt[!is.na(training$Nt)],
Xu = testing$spc_p,
k = seq(40, 100, by = 10),
diss_usage = "none",
control = mbl_control(validation_type = "NNv"),
scale = TRUE
)
# go back to sequential processing
registerDoSEQ()
try(stopCluster(clust))
pc_pred_nt