The objective of swaprinc is to streamline the comparison between a regression model using original variables and a model in which some of these variables have been swapped out for principal components.
You can install the released version of swaprinc from CRAN with:
install.packages("swaprinc")
You can install the development version of swaprinc from GitHub with:
# install.packages("devtools")
devtools::install_github("mncube/swaprinc")
In the simple example provided, a regression model estimates the relationship between x1 and y, while controlling for variables x2 through x10.
By using the default engine, “stats”, the statistical model is fitted with stats::lm, and by using the default prc_eng, “stats”, principal components are extracted with stats::prcomp.
The “raw model” is specified by the formula parameter, which is passed to stats::lm. The pca_vars and n_pca_components parameters indicate that variables x2 to x10 will be used to extract three principal components. Subsequently, the “PCA model” is passed to stats::lm as follows: y ~ x1 + PC1 + PC2 + PC3.
By setting the lpca_center and lpca_scale parameters to ‘pca’, the data in pca_vars will be centered and scaled according to the guidelines in the Step-by-Step PCA vignette before being passed to stats::prcomp. The miss_handler parameter, set to ‘omit’, ensures that only complete cases are included by subsetting the data frame rows with stats::complete.cases.
library(swaprinc)
# Create a small simulated dataset
set.seed(40)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n, 5, 15)
x3 <- rnorm(n, -5.5, 20)
x4 <- rnorm(n, 3, 3) + x3*1.5
x5 <- rnorm(n, -2, 4) + x3*.25
x6 <- rnorm(n, -5, 5) + x4
x7 <- rnorm(n, -2, 6)
x8 <- rnorm(n, 2, 7)
x9 <- rnorm(n, -2, 3) + x2*.4
x10 <- rnorm(n, 5, 4)
y <- 1 + 2 * x1 + 3 * x2 + 2.5*x4 - 3.5*x5 + 2*x6 + 1.5*x7 + x8 + 2*x9 + x10 + rnorm(n)
data <- data.frame(y, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)
# Run swaprinc
swaprinc_result <- swaprinc(data,
                            formula = "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10",
                            pca_vars = c("x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10"),
                            n_pca_components = 3,
                            lpca_center = "pca",
                            lpca_scale = "pca",
                            miss_handler = "omit")
# Summarize raw model
summary(swaprinc_result$model_raw)
#>
#> Call:
#> stats::lm(formula = formula, data = data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.22717 -0.55113 0.00573 0.45296 3.03938
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.75250 0.38418 4.562 4.94e-05 ***
#> x1 2.06070 0.17472 11.794 1.96e-14 ***
#> x2 2.93211 0.02976 98.525 < 2e-16 ***
#> x3 0.06272 0.07596 0.826 0.414
#> x4 2.40907 0.06368 37.830 < 2e-16 ***
#> x5 -3.48453 0.04467 -78.007 < 2e-16 ***
#> x6 2.04578 0.03497 58.501 < 2e-16 ***
#> x7 1.49816 0.02621 57.158 < 2e-16 ***
#> x8 0.97008 0.02530 38.345 < 2e-16 ***
#> x9 2.12426 0.06502 32.670 < 2e-16 ***
#> x10 0.99118 0.03787 26.174 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.064 on 39 degrees of freedom
#> Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
#> F-statistic: 7.237e+04 on 10 and 39 DF, p-value: < 2.2e-16
# Summarize pca model
summary(swaprinc_result$model_pca)
#>
#> Call:
#> stats::lm(formula = formula, data = data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.47301 -0.16369 -0.03991 0.15923 0.48937
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4.130e-17 3.701e-02 0.000 1.000
#> x1 6.056e-02 4.050e-02 1.495 0.142
#> PC1 4.227e-01 1.955e-02 21.621 < 2e-16 ***
#> PC2 3.692e-01 2.781e-02 13.275 < 2e-16 ***
#> PC3 -1.513e-01 3.422e-02 -4.423 6.11e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.2617 on 45 degrees of freedom
#> Multiple R-squared: 0.9371, Adjusted R-squared: 0.9315
#> F-statistic: 167.6 on 4 and 45 DF, p-value: < 2.2e-16
# Get model comparisons
print(swaprinc_result$comparison)
#> model r_squared adj_r_squared AIC BIC
#> 1 Raw 0.9999461 0.9999323 159.7190 182.66323
#> 2 PCA 0.9370900 0.9314980 14.5812 26.05334
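For intuition, the swapped model above can be approximated by hand: scale x2 through x10, extract components with stats::prcomp, keep the first three, and refit with stats::lm on y ~ x1 + PC1 + PC2 + PC3. The sketch below is only an illustration of that idea using the simulated data frame from above, not swaprinc's internal code (swaprinc's exact centering and scaling follow the Step-by-Step PCA vignette):
# By-hand approximation of the swap: scale, extract PCs, refit
pcs <- stats::prcomp(data[, paste0("x", 2:10)], center = TRUE, scale. = TRUE)
pca_data <- cbind(data[, c("y", "x1")], pcs$x[, 1:3])  # adds PC1, PC2, PC3
summary(stats::lm(y ~ x1 + PC1 + PC2 + PC3, data = pca_data))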
A common challenge in applied statistics and data science involves performing logistic regression with a set of categorical independent variables. In this motivating example, swaprinc is employed to compare a ‘raw’ logistic regression model containing seven categorical independent variables with a ‘pca’ logistic regression model in which six of the independent variables are replaced by their first three principal components, extracted with Gifi::princals. For a comprehensive tutorial on Gifi, refer to Nonlinear Principal Components Analysis: Multivariate Analysis with Optimal Scaling (MVAOS).
When using the ‘*_options’ parameters to pass arguments to functions within ‘swaprinc’, I recommend summarizing model results with the ‘broom’ and ‘broom.mixed’ packages. This helps avoid the overly long summaries that ‘do.call’ can otherwise produce.
# Create a small simulated dataset
set.seed(42)
n <- 50
x1 <- rnorm(n, 0.5, 4)
x2 <- rnorm(n, 3, 15)
x3 <- rnorm(n, -2.5, 5)
x4 <- -2.5*x2 + 3*x3 + rnorm(n, 0, 4)
x5 <- x2*x3 + rnorm(n, -5, 5)*rnorm(n, 5, 10)
x6 <- rnorm(n, -2, 4)*rnorm(n, 3, 5)
x7 <- x4 + x6 + rnorm(n, 0, 3)
y <- 1 + 2*x1 + 3*x2 + -2*x3 + .5*x4 + x5 + 1.5*x6 + x7 + rnorm(n)
data <- data.frame(y, x1, x2, x3, x4, x5, x6, x7)
# Categorize the variables
yq <- stats::quantile(data$y, c(0, 1/2, 1))
x1q <- stats::quantile(data$x1, c(0, 1/2, 1))
x2q <- stats::quantile(data$x2, c(0, 1/4, 3/4, 1))
x3q <- stats::quantile(data$x3, c(0, 2/5, 3/5, 1))
x4q <- stats::quantile(data$x4, c(0, 1/5, 4/5, 1))
x5q <- stats::quantile(data$x5, c(0, 2/5, 3/5, 1))
x6q <- stats::quantile(data$x6, c(0, 2/5, 4/5, 1))
x7q <- stats::quantile(data$x7, c(0, 2/5, 3/5, 1))
# Load dplyr for the pipe and mutate
library(dplyr)

data <- data %>% dplyr::mutate(
  y = cut(y, breaks = yq, labels = c("0", "1"), include.lowest = TRUE),
  x1 = cut(x1, breaks = x1q, labels = c("control", "treatment"), include.lowest = TRUE),
  x2 = cut(x2, breaks = x2q, labels = c("small", "medium", "large"), include.lowest = TRUE),
  x3 = cut(x3, breaks = x3q, labels = c("short", "average", "tall"), include.lowest = TRUE),
  x4 = cut(x4, breaks = x4q, labels = c("lowbit", "most", "highbit"), include.lowest = TRUE),
  x5 = cut(x5, breaks = x5q, labels = c("under", "healthy", "over"), include.lowest = TRUE),
  x6 = cut(x6, breaks = x6q, labels = c("small", "medium", "large"), include.lowest = TRUE),
  x7 = cut(x7, breaks = x7q, labels = c("small", "medium", "large"), include.lowest = TRUE)) %>%
  dplyr::mutate(y = as.numeric(ifelse(y == "0", 0, 1)))
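As a standalone illustration of the extraction step described above, the sketch below runs Gifi::princals directly on the six categorical predictors. It is only meant to show what the nonlinear PCA scores look like, not to reproduce swaprinc's internal call; here the factor levels are converted to integer codes before fitting, and the object scores play the role of PC1 through PC3:
# Standalone sketch: nonlinear PCA on the categorical predictors with Gifi
cat_codes <- as.data.frame(lapply(data[, c("x2", "x3", "x4", "x5", "x6", "x7")],
                                  as.integer))
gifi_fit <- Gifi::princals(cat_codes, ndim = 3)
head(gifi_fit$objectscores)  # scores standing in for the principal components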
# Run swaprinc with prc_eng set to Gifi
swaprinc_result <- swaprinc(data,
                            formula = "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7",
                            pca_vars = c("x2", "x3", "x4", "x5", "x6", "x7"),
                            n_pca_components = 3,
                            prc_eng = "Gifi",
                            model_options = list(family = binomial(link = "logit")))
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Summarize raw model
broom::tidy(swaprinc_result$model_raw)
#> # A tibble: 14 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -2.90e+ 1 39239. -7.38e- 4 0.999
#> 2 x1treatment 8.68e- 1 2.20 3.94e- 1 0.693
#> 3 x2medium -2.76e+ 1 36042. -7.66e- 4 0.999
#> 4 x2large -3.26e+ 1 38090. -8.57e- 4 0.999
#> 5 x3average 2.51e+ 1 18410. 1.36e- 3 0.999
#> 6 x3tall 4.67e+ 0 17688. 2.64e- 4 1.00
#> 7 x4most 9.15e-12 2.51 3.65e-12 1.00
#> 8 x4highbit -2.22e+ 1 39541. -5.61e- 4 1.00
#> 9 x5healthy 4.02e+ 1 16106. 2.50e- 3 0.998
#> 10 x5over 5.23e+ 1 21037. 2.49e- 3 0.998
#> 11 x6medium -4.18e+ 0 19682. -2.12e- 4 1.00
#> 12 x6large 5.65e+ 1 23134. 2.44e- 3 0.998
#> 13 x7medium -8.68e- 1 2.20 -3.94e- 1 0.693
#> 14 x7large 3.00e+ 1 20792. 1.44e- 3 0.999
# Summarize pca model
broom::tidy(swaprinc_result$model_pca)
#> # A tibble: 5 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -0.162 0.567 -0.286 0.775
#> 2 x1treatment 0.303 0.810 0.374 0.708
#> 3 PC1 -2.00 0.539 -3.72 0.000201
#> 4 PC2 -0.501 0.387 -1.29 0.195
#> 5 PC3 0.339 0.381 0.891 0.373
# Get model comparisons
print(swaprinc_result$comparison)
#> model logLik AIC BIC
#> 1 Raw -4.539636 37.07927 63.84759
#> 2 PCA -20.382553 50.76511 60.32522
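In line with the broom recommendation above, broom::glance() gives compact one-row fit summaries (logLik, AIC, BIC, deviance) for either fitted model, which pair naturally with the comparison table:
# One-row fit summaries for the raw and PCA logistic models
broom::glance(swaprinc_result$model_raw)
broom::glance(swaprinc_result$model_pca)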
Using the same dataset as in the logistic regression example above, it is often useful to compare outcomes across several different swaps. In the example below, the compswap helper function compares results when 2, 3, 4, and 5 principal components replace the six original independent variables.
# Run compswap with prc_eng set to Gifi
compswap_results <- compswap(data,
                             formula = "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7",
                             .pca_varlist = list(c("x2", "x3", "x4", "x5", "x6", "x7")),
                             .n_pca_list = list(2, 3, 4, 5),
                             .prc_eng_list = list("Gifi"),
                             .model_options_list = list(list(family = binomial(link = "logit"))))
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Show available models
summary(compswap_results$all_models)
#> Length Class Mode
#> model_raw 30 glm list
#> model_pca_1 30 glm list
#> model_pca_2 30 glm list
#> model_pca_3 30 glm list
#> model_pca_4 30 glm list
# Get model comparisons
print(compswap_results$all_comparisons)
#> model logLik AIC BIC model_set
#> 1 Raw -4.539636 37.07927 63.84759 model_raw
#> 2 PCA -21.390567 50.78113 58.42923 model_pca_1
#> 3 PCA -20.382553 50.76511 60.32522 model_pca_2
#> 4 PCA -13.780908 39.56182 51.03395 model_pca_3
#> 5 PCA -11.616326 37.23265 50.61681 model_pca_4
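The all_comparisons object is an ordinary data frame, so the candidate swaps can be ranked directly; for example, ordering by BIC (smaller is better):
# Order the comparison table by BIC to shortlist a swap
comp <- compswap_results$all_comparisons
comp[order(comp$BIC), ]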
# View model summaries
lapply(compswap_results$all_models, broom::tidy)
#> $model_raw
#> # A tibble: 14 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -2.90e+ 1 39239. -7.38e- 4 0.999
#> 2 x1treatment 8.68e- 1 2.20 3.94e- 1 0.693
#> 3 x2medium -2.76e+ 1 36042. -7.66e- 4 0.999
#> 4 x2large -3.26e+ 1 38090. -8.57e- 4 0.999
#> 5 x3average 2.51e+ 1 18410. 1.36e- 3 0.999
#> 6 x3tall 4.67e+ 0 17688. 2.64e- 4 1.00
#> 7 x4most 9.15e-12 2.51 3.65e-12 1.00
#> 8 x4highbit -2.22e+ 1 39541. -5.61e- 4 1.00
#> 9 x5healthy 4.02e+ 1 16106. 2.50e- 3 0.998
#> 10 x5over 5.23e+ 1 21037. 2.49e- 3 0.998
#> 11 x6medium -4.18e+ 0 19682. -2.12e- 4 1.00
#> 12 x6large 5.65e+ 1 23134. 2.44e- 3 0.998
#> 13 x7medium -8.68e- 1 2.20 -3.94e- 1 0.693
#> 14 x7large 3.00e+ 1 20792. 1.44e- 3 0.999
#>
#> $model_pca_1
#> # A tibble: 4 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -0.285 0.545 -0.523 0.601
#> 2 x1treatment 0.464 0.767 0.604 0.546
#> 3 PC1 -1.91 0.519 -3.68 0.000229
#> 4 PC2 -0.483 0.375 -1.29 0.198
#>
#> $model_pca_2
#> # A tibble: 5 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -0.162 0.567 -0.286 0.775
#> 2 x1treatment 0.303 0.810 0.374 0.708
#> 3 PC1 -2.00 0.539 -3.72 0.000201
#> 4 PC2 -0.501 0.387 -1.29 0.195
#> 5 PC3 0.339 0.381 0.891 0.373
#>
#> $model_pca_3
#> # A tibble: 6 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -0.348 0.752 -0.463 0.643
#> 2 x1treatment 1.28 1.14 1.12 0.261
#> 3 PC1 -2.78 0.851 -3.26 0.00110
#> 4 PC2 -1.32 0.723 -1.83 0.0674
#> 5 PC3 0.533 0.565 0.943 0.346
#> 6 PC4 1.79 0.680 2.63 0.00851
#>
#> $model_pca_4
#> # A tibble: 7 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -0.737 0.844 -0.874 0.382
#> 2 x1treatment 1.98 1.39 1.43 0.153
#> 3 PC1 -3.39 1.22 -2.77 0.00567
#> 4 PC2 -1.69 0.953 -1.77 0.0769
#> 5 PC3 1.00 0.791 1.27 0.205
#> 6 PC4 2.44 0.973 2.50 0.0123
#> 7 PC5 0.420 0.536 0.785 0.432