The package \textsf{absorber} provides a tool to select variables in a nonlinear multivariate model. More precisely, it performs variable selection from \(n\) observations satisfying the following nonparametric regression model: \begin{equation} \label{eq:model} Y_i = f(x_i) + \varepsilon_i, \quad x_i = \left(x_i^{(1)}, \ldots, x_i^{(p)}\right), \quad 1\leq i \leq n, \end{equation} where \(f\) is an unknown real-valued function and the \(\varepsilon_i\)’s are i.i.d. centered random variables with variance \(\sigma^2\). The \(x_i\)’s are observation points belonging to a compact set \(S\) of \(\mathbb{R}^p\). We also assume that \(f\) actually depends on only \(d\) of the \(p\) variables, with \(d<p\): there exists a real-valued function \(\widetilde{f}\) such that \(f(x)=\widetilde{f}(\widetilde{x})\), where \(x\in\mathbb{R}^p\) and \(\widetilde{x}\in\mathbb{R}^d\) is made of the \(d\) relevant components of \(x\). Variable selection consists in identifying the components of \(\widetilde{x}\). This approach is described in [1], to which we refer the reader for further details and references.
You can install the released version of \textsf{absorber} from CRAN with:
install.packages("absorber")
We first apply our method to \(n=700\) observations satisfying Model \eqref{eq:model} with \(p=5\) and \(f=f_1\), where \(f_1\) is defined in [1]. These observations are obtained with Gaussian noise of standard deviation \(\sigma = 0.25\). In the following, the \(d=2\) relevant variables to select are \(\{3,5\}\) and the irrelevant ones to discard are \(\{1,2,4\}\):
true.dimensions = c(3,5) ; false.dimensions = c(1,2,4)
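To make this setting concrete, the following sketch simulates a data set of the same shape in base R. It is only an illustration under stated assumptions: `f_toy` is a hypothetical stand-in and not the function \(f_1\) of [1], the uniform design on \([0,1]^5\) is assumed, and the names `x_sim` and `y_sim` are introduced here for the example.

```r
set.seed(1)                      # arbitrary seed, for reproducibility
n <- 700; p <- 5; sigma <- 0.25  # values taken from the text

# Assumption: observation points drawn uniformly on [0, 1]^p
x_sim <- matrix(runif(n * p), nrow = n, ncol = p)

# Hypothetical regression function (NOT f_1 from [1]):
# it only uses the 3rd and 5th coordinates of each observation point
f_toy <- function(x) sin(2 * pi * x[, 3]) + (x[, 5] - 0.5)^2

# Noisy responses: Y_i = f(x_i) + eps_i with centered Gaussian noise
y_sim <- f_toy(x_sim) + rnorm(n, mean = 0, sd = sigma)

dim(x_sim)
length(y_sim)
```

By construction, changing columns 1, 2 or 4 of `x_sim` leaves `f_toy(x_sim)` unchanged, which is what makes variables \(\{1,2,4\}\) irrelevant in such a toy model.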
The observation set is loaded from files which are provided within the package, as follows:
library(absorber)
## --- Loading the observation points --- ##
data('x_obs')
head(x_obs)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.3687684 0.16895845 0.7114856 0.1493075 0.2300115
## [2,] 0.7162858 0.47407370 0.2271114 0.8187909 0.3845692
## [3,] 0.5543277 0.63473174 0.9341467 0.4209710 0.1551578
## [4,] 0.2551628 0.55242762 0.8940447 0.8587429 0.6602330
## [5,] 0.1468073 0.21261063 0.8249912 0.7159358 0.6177809
## [6,] 0.3917696 0.01350068 0.6862343 0.8377919 0.6143807
## --- Loading the corresponding noisy values of the response variable --- ##
data('y_obs')
head(y_obs)
## [1] -0.09049367 -1.56817050 0.02365417 0.32580069 1.07158399 1.21354888
The \(\texttt{absorber}\) function of the \(\texttt{absorber}\) package is applied by using the following arguments:
res = absorber(x = x_obs, y = y_obs, M = 3)
Additional arguments can also be provided to this function; see the package documentation (\(\texttt{?absorber}\)) for the full list.
The resulting outputs are the following:
First, we can print the sequence of penalization parameters \(\lambda\) used in our method:
head(res$lambdas)
## [1] 0.01563831 0.01492752 0.01424904 0.01360140 0.01298320 0.01239309
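As a side remark, the six values printed above decrease with a nearly constant ratio, which is consistent with a geometric (log-equispaced) grid. The sketch below only checks this on the printed values; whether the whole sequence is built this way is an assumption that is not confirmed here, and `lam` and `ratios` are names introduced for the example.

```r
# First six penalization parameters, copied from the printed output
lam <- c(0.01563831, 0.01492752, 0.01424904,
         0.01360140, 0.01298320, 0.01239309)

# Ratios of consecutive values; a constant ratio indicates a geometric grid
ratios <- lam[-1] / lam[-length(lam)]
round(ratios, 4)

# Equivalent check on the log scale: the increments should be nearly equal,
# so the spread of the increments should be close to zero
diff(range(diff(log(lam))))
```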
We can then print the set of variables selected for each penalization parameter:
head(res$selec.var)
## [[1]]
## NULL
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 3
##
## [[5]]
## [1] 3
##
## [[6]]
## [1] 3
and finally the variables selected with AIC:
res$aic.var
## [1] 3 5
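Since the relevant variables are known in this simulated example, the AIC selection can be compared with the truth. The following sketch does so in base R; `aic.var` is hard-coded from the output printed above so that the block runs on its own, and the names `true.pos`, `false.pos`, `tpr` and `fpr` are introduced here for illustration.

```r
true.dimensions  <- c(3, 5)     # relevant variables (from the text)
false.dimensions <- c(1, 2, 4)  # irrelevant variables (from the text)
aic.var <- c(3, 5)              # value of res$aic.var printed above

# Relevant variables recovered, and irrelevant variables wrongly selected
true.pos  <- intersect(aic.var, true.dimensions)
false.pos <- intersect(aic.var, false.dimensions)

# True positive rate and false positive rate of the AIC selection
tpr <- length(true.pos) / length(true.dimensions)
fpr <- length(false.pos) / length(false.dimensions)
c(TPR = tpr, FPR = fpr)
```

Here both relevant variables are recovered and no irrelevant variable is selected, so the true positive rate is 1 and the false positive rate is 0.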
The \(\texttt{plot\_selection}\) function of the \(\texttt{absorber}\) package produces a histogram of the selection percentage of each of the variables on which \(f\) may depend. It also displays in red the results obtained with the AIC.
plot_selection(res)
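When the relevant variables are known, as in this simulated example, the selection percentages can also be computed by hand in order to distinguish the relevant variables from the irrelevant ones:
nlam = length(res$lambdas)
# Number of penalization parameters for which each variable is selected
occurrence = data.frame(table(unlist(res$selec.var)))
colnames(occurrence) = c("Covariable", "Percentage")
# Convert the counts into percentages of the number of penalization parameters
occurrence$Percentage = occurrence$Percentage * 100 / nlam
# Sort the variables by decreasing selection percentage
occurrence = occurrence[order(-occurrence$Percentage), , drop = FALSE]
occurrence$Covariable = factor(occurrence$Covariable,
                               levels = unique(occurrence$Covariable))
# Flag the relevant ('real') and irrelevant ('fake') variables
occurrence$Category = as.factor(ifelse(occurrence$Covariable %in% true.dimensions,
                                       'real features', 'fake features'))
str(occurrence)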
## 'data.frame': 5 obs. of 3 variables:
## $ Covariable: Factor w/ 5 levels "3","5","4","2",..: 1 2 3 4 5
## $ Percentage: num 99 65 45 37 36
## $ Category : Factor w/ 2 levels "fake features",..: 2 2 1 1 1
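One caveat of counting with `table(unlist(...))` is that a variable which is never selected does not appear in the result at all. In the example above all five variables are selected at least once, so this does not matter, but a possible variant that keeps zero counts is sketched below. The selection path `sel` is a small hypothetical list used only for illustration; it is not an output of the package.

```r
p <- 5
# Hypothetical selection path over 4 penalization parameters (illustration only)
sel <- list(NULL, 3, c(3, 5), c(3, 5, 4))

# One count per variable index 1..p; variables never selected get a zero
counts <- tabulate(unlist(sel), nbins = p)

# Percentage of penalization parameters for which each variable is selected
percentage <- 100 * counts / length(sel)
names(percentage) <- seq_len(p)
percentage
```

The plot that follows uses the `occurrence` data frame computed earlier, not this toy vector.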
We can then plot the results as a histogram of variable selection percentage:
library(ggplot2)
# Keep only the colors of the categories actually present in the data
color.order = c('firebrick', 'forestgreen')[which(c('fake features', 'real features')
                                                  %in% levels(occurrence$Category))]
# Note: `linewidth` in element_line() requires ggplot2 >= 3.4.0;
# with older versions, use `size` instead
plt_occ = ggplot(data = occurrence, aes(x = Covariable, y = Percentage, fill = Category)) +
  geom_bar(stat = 'identity') +
  scale_fill_manual(values = color.order) +
  ylab('Percentage of selection') +
  theme_bw() +
  theme(legend.title = element_blank(),
        axis.text.x = element_text(size = 16, face = 'bold'),
        axis.text.y = element_text(size = 14),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 15),
        legend.text = element_text(size = 14),
        legend.position = 'bottom',
        legend.key.size = unit(1, "cm"),
        panel.grid.major = element_line(linewidth = 0.6, linetype = 'solid',
                                        colour = "darkgrey"),
        panel.grid.minor = element_line(linewidth = 0.2, linetype = 'solid',
                                        colour = "darkgrey"))
print(plt_occ)
References
[1] Savino, M. E. and Lévy-Leduc, C. (2024) A novel variable selection method in nonlinear multivariate models using B-splines with an application to geoscience. ⟨hal-04434820⟩.