crossval {LogitBoost}    R Documentation

Runs v-fold cross-validation with LogitBoost

Description

The data are divided into v non-overlapping subsets of roughly equal size. Feature selection is then applied to (v-1) of the subsets, which are also used to fit the LogitBoost classifier. Predictions are made for the left-out subset, and the process is repeated for each of the v subsets.
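As an illustration of the partitioning idea only (crossval builds its folds internally; this sketch is not necessarily its exact scheme), n observations could be assigned to v roughly equal, non-overlapping folds as follows:

## Illustrative sketch of a v-fold partition (not the package's internal code)
n <- 30                                              # hypothetical number of observations
v <- 5                                               # number of folds
set.seed(1)
folds <- split(sample(seq_len(n)), rep(seq_len(v), length.out = n))
sapply(folds, length)                                # roughly equal fold sizes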

Usage

crossval(x, y, v=length(y), mfinal=100, presel=0, estimate=0, verbose=FALSE)

Arguments

x A matrix with n rows (different individuals) and p columns (different genes) containing expression values.
y A vector of length n containing the class labels of the individuals, coming from K different classes. The labels need to be coded by consecutive integers from 0 to (K-1); a small recoding sketch is given after this argument list.
v An integer, specifying the type of v-fold cross-validation. The default, v=length(y), means leave-one-out cross-validation. Besides this, every value between 2 and length(y) is valid and means that roughly every v-th observation is left out. Make sure (especially for multiclass problems) that this yields a sensible partition into training and test data.
mfinal An integer, describing the number of iterations for which boosting should be run. The default value is mfinal=100, which is a reasonable choice for gene expression data.
presel An integer, giving the number of features to be used for classification. If presel=0, no feature preselection is carried out.
estimate An integer, specifying the number of folds v for an additional, internal v-fold cross-validation on the respective training data, used to estimate the stopping parameter. Please note that this is (especially for larger values of 'estimate') extremely time consuming. The default value of estimate=0 means no stopping parameter estimation.
verbose Logical, indicating whether progress comments should be printed.
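
Since the class labels must be supplied as integers 0,...,(K-1) rather than as an R factor, a minimal recoding sketch (the factor 'cl' below is a made-up example, not part of the package) is:

## Illustrative recoding of a factor response into labels 0,...,(K-1)
cl <- factor(c("ALL", "AML", "ALL", "AML"))          # hypothetical class factor
y  <- as.integer(cl) - 1                             # factor levels become 0, 1, ..., K-1
table(cl, y)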

Details

Estimating the stopping parameter is computationally very expensive and therefore time consuming.

Value

probs Array whose rows contain, for every boosting iteration, the out-of-sample probabilities that the class label is predicted as 1. For multiclass problems, the third dimension of the array contains the probabilities for the K binary one-against-all partitions of the data; see the sketch after this list.
loglikeli Array containing the log-likelihood across the training instances, used to determine the stopping parameter if estimate>0. For multiclass problems, the third dimension of the array contains the values for the K binary one-against-all partitions of the data.
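
As a hedged illustration of how the probs component might be used in a two-class problem, with fit as in the Examples below and assuming it can be accessed as fit$probs with rows indexing observations and columns indexing boosting iterations (check dim(fit$probs) before relying on this layout):

## Illustrative only: class predictions from the final boosting iteration
## (assumed layout: rows = observations, columns = boosting iterations)
p.final <- fit$probs[, ncol(fit$probs)]              # probabilities at the last iteration
yhat    <- as.integer(p.final > 0.5)                 # predicted class labels 0/1
mean(yhat != leukemia.y)                             # out-of-sample error rate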

Author(s)

Marcel Dettling

References

See "Boosting for Tumor Classification of Gene Expression Data", Dettling and Buhlmann (2002), available on the web page http://stat.ethz.ch/~dettling/boosting.html

See Also

logitboost, summarize

Examples

data(leukemia)

## An example without stopping parameter estimation
fit <- crossval(leukemia.x, leukemia.y, v=5, mfinal=100, presel=75, verbose=TRUE)
summarize(fit, leukemia.y)

## 4-fold cross-validation with stopping parameter estimation by an internal 3-fold cross-validation
fit <- crossval(leukemia.x, leukemia.y, v=4, presel=50, estimate=3, verbose=TRUE)
summarize(fit, leukemia.y)
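
## If stopping parameter estimation was requested (estimate > 0), one hedged way to
## inspect the loglikeli output for this two-class fit is shown below, assuming rows
## index the training instances and columns index the boosting iterations.
avg.ll <- colMeans(fit$loglikeli)                    # average log-likelihood per iteration
which.max(avg.ll)                                    # candidate stopping iteration
plot(avg.ll, type="l", xlab="boosting iteration", ylab="average log-likelihood")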
