categoryEncodings

Travis build status AppVeyor build status Codecov test coverage Lifecycle: stable License: GPL v3 CRAN status

categoryEncodings aims to provide a fast way to encode factor (qualitative) variables through various methods. The package uses data.table as its backend for speed, with as few other dependencies as possible. Most of the methods are based on Johannemann et al. (2019), Sufficient Representations for Categorical Variables (arXiv:1908.09874).

The current version infers factors automatically and chooses an encoding with a very simple heuristic, while also allowing manual control.
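To give a rough idea of what encoding a factor can mean, here is a minimal base R sketch of one approach, a means encoding, in which a factor column is replaced by the per-level means of the numeric columns. The helper `mean_encode_sketch` and the generated column names are hypothetical and written only for illustration; this is not the package's implementation and it does not use data.table.

```r
# Hypothetical helper, NOT part of categoryEncodings: replaces one factor
# column with the per-level means of the remaining numeric columns.
mean_encode_sketch <- function(df, factor_col) {
  # numeric columns that will be averaged within each level
  numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  numeric_cols <- setdiff(numeric_cols, factor_col)

  # one row per level, one column per numeric variable
  level_means <- aggregate(df[numeric_cols],
                           by = list(level = df[[factor_col]]),
                           FUN = mean)
  names(level_means)[-1] <- paste0(factor_col, "_mean_", numeric_cols)

  # look up each row's level and attach the matching means
  idx <- match(df[[factor_col]], level_means$level)
  encoded <- level_means[idx, -1, drop = FALSE]
  rownames(encoded) <- NULL

  # drop the original factor column and append the encoded columns
  cbind(df[setdiff(names(df), factor_col)], encoded)
}

toy_means <- data.frame(x = c(1, 3, 10, 20),
                        y = c(0, 2, 5, 7),
                        grp = c("a", "a", "b", "b"))
encoded_toy <- mean_encode_sketch(toy_means, "grp")
print(encoded_toy)
```

In this toy example the `grp` column is replaced by two columns, `grp_mean_x` and `grp_mean_y`, holding the group means of `x` and `y`.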

Installation

You can install the latest version of categoryEncodings from GitHub using the devtools package:

devtools::install_github("JSzitas/categoryEncodings")

Soon the package will be submitted to CRAN, and hopefully will be accepted.

Example

Here we want to encode all of the factors in a given data.frame.

library(categoryEncodings)
# build a data.frame with five numeric columns and one column of letters
data_fm <- cbind( data.frame(matrix(rnorm(5*100),ncol = 5)),
                  sample(sample(letters, 10), 100, replace = TRUE))
colnames(data_fm)[6] <- "few_letters"

# encoding is done automatically, as is the inference of factors
result <- encode_categories(X = data_fm)
# note that due to the data.table backend, the result has to be saved to an object to be
# visible: otherwise printing is suppressed.
print(result)
    
data_fm <- cbind( data.frame(matrix(rnorm(5*100), ncol = 5)),
                  sample(sample(letters, 10), 100, replace = TRUE),
                  sample(sample(letters, 20), 100, replace = TRUE),
                  sample(sample(1:10, 5), 100, replace = TRUE),
                  sample(sample(1:50, 35), 100, replace = TRUE),
                  sample(1:2, 100, replace = TRUE))
colnames(data_fm)[6:10] <- c( "few_letters",  "many_letters",
                              "some_numbers", "many_numbers",
                              "binary" )
# it does not matter how many factor variables there are, whether they are stored as factors,
# or whether you supply a method to encode them by. Some simple inference of factors is done
# based on the number of distinct values in every variable: a variable that passes a certain
# threshold is deemed to be essentially a factor and is treated as such for conversion.
# You will be notified of which variables are being converted via a warning.

result <- encode_categories(data_fm)
print(result)
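The comments above describe inferring factors from the number of distinct values in each variable. A minimal base R sketch of that kind of heuristic is shown below. The function name `infer_factors_sketch`, the `max_levels` cutoff, and the rule itself (non-numeric columns, or integer-valued columns with at most `max_levels` distinct values) are assumptions made for illustration; the actual threshold and rule used by encode_categories may differ.

```r
# Hypothetical helper, NOT the package's heuristic: returns the names of
# columns that look like factors. The max_levels cutoff is an assumption.
infer_factors_sketch <- function(df, max_levels = 10) {
  is_factor_like <- vapply(df, function(col) {
    # factors and character columns are always treated as categorical
    if (is.factor(col) || is.character(col)) return(TRUE)
    # numeric columns qualify only if they are integer-valued
    # and have few distinct values
    n_distinct <- length(unique(col))
    is.numeric(col) &&
      all(col == round(col), na.rm = TRUE) &&
      n_distinct <= max_levels
  }, logical(1))
  names(df)[is_factor_like]
}

toy_mixed <- data.frame(value  = c(0.5, 1.25, 2.75, 3.1, 4.9, 5.3),
                        letter = c("a", "b", "a", "b", "a", "b"),
                        binary = c(1L, 2L, 1L, 2L, 1L, 2L))
factor_candidates <- infer_factors_sketch(toy_mixed, max_levels = 3)
print(factor_candidates)
```

Here `letter` and `binary` are flagged as factor-like, while the continuous `value` column is left alone.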

 

Contributing

Pull requests are welcome. All contributions will be considered for acceptance, provided they are justifiable and the code is reasonable, regardless of anything related to the person submitting the pull request. Please keep things civil; there is no need for negativity. Please also refrain from adding unnecessary dependencies (e.g. the pipe) to the package: pull requests that add an unnecessary dependency will be denied or suspended until the code can be made dependency-free.