lmhelprs

Shu Fai Cheung

Introduction

At the time of writing, lmhelprs has two sets of functions:

Hierarchical Regression Analysis

Users can use anova() manually to do hierarchical regression. hierarchical_lm() is written with these features:

Let’s fit three models for illustration:

library(lmhelprs)
data(data_test1)
lm1a <- lm(y ~ x1 + x2, data_test1)
lm1b <- lm(y ~ x1 + x2 + x3 + x4, data_test1)
lm1c <- lm(y ~ x1 + x2 + x3 + x4 + cat2, data_test1)

They are can be passed to hierarchical_lm() in any order. Hierarchical regression analysis will be conducted correctly, with R-squared and R-squared change:

hierarchical_lm(lm1b, lm1a, lm1c)
#> Analysis of Variance Table
#> 
#> Model 1: y ~ x1 + x2
#> Model 2: y ~ x1 + x2 + x3 + x4
#> Model 3: y ~ x1 + x2 + x3 + x4 + cat2
#>   adj.R.sq   R.sq R.sq.change Res.Df   RSS Df Sum of Sq      F Pr(>F)    
#> 1   0.5090 0.5189     0.00000     97 55.83                               
#> 2   0.7269 0.7380     0.21906     95 30.41  2    25.419 39.314 <0.001 ***
#> 3   0.7242 0.7437     0.00573     92 29.74  3     0.665  0.686  0.563    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If the models are not in hierarchical order, an error will be raised:

lm2a <- lm(y ~ x1 + x2, data_test1)
lm2b <- lm(y ~ x1 + x3 + x4, data_test1)
hierarchical_lm(lm2a, lm2b)
#> Error in hierarchical_lm(lm2a, lm2b): The models do not have hierarchical relations.

Please refer to the help page of hierarchical_lm() on how it works, and the print method of its output (print.hierarchical_lm()) on how to customize the printout.

Test The Highest Order Term

In a linear regression model with a term of second or higher order, the function test_highest() can be used to identify the highest order term, compare the model to a model with this term removed, and estimate and test the R-squared increase due to adding this term.

For example:

lm_mod <- lm(y ~ x1 + cat2 + cat1 + cat2:cat1, data_test1)
summary(lm_mod)
#> 
#> Call:
#> lm(formula = y ~ x1 + cat2 + cat1 + cat2:cat1, data = data_test1)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.54001 -0.63322 -0.02938  0.51963  1.45138 
#> 
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)         -0.099251   0.265818  -0.373    0.710    
#> x1                   0.655101   0.079709   8.219 1.78e-12 ***
#> cat2North           -0.510542   0.366084  -1.395    0.167    
#> cat2South           -0.062337   0.386181  -0.161    0.872    
#> cat2West            -0.070097   0.443321  -0.158    0.875    
#> cat1Beta             0.003442   0.386352   0.009    0.993    
#> cat1Gamma            0.026172   0.341096   0.077    0.939    
#> cat2North:cat1Beta  -0.011252   0.565646  -0.020    0.984    
#> cat2South:cat1Beta  -0.475115   0.577723  -0.822    0.413    
#> cat2West:cat1Beta    0.140028   0.577154   0.243    0.809    
#> cat2North:cat1Gamma  0.136924   0.534636   0.256    0.798    
#> cat2South:cat1Gamma  0.564488   0.497376   1.135    0.260    
#> cat2West:cat1Gamma   0.246710   0.606676   0.407    0.685    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.7945 on 87 degrees of freedom
#> Multiple R-squared:  0.5268, Adjusted R-squared:  0.4615 
#> F-statistic:  8.07 on 12 and 87 DF,  p-value: 5.975e-10

Although it has six product terms, in terms of variables, it only has one higher order term: cat2:cat1, the interaction between cat2 and cat1.

To automatically find this term, and compare this model to a model without this interaction, test_highest() can be used:

test_highest(lm_mod)
#> Analysis of Variance Table
#> 
#> Model 1: y ~ x1 + cat2 + cat1
#> Model 2: y ~ x1 + cat2 + cat1 + cat2:cat1
#>   adj.R.sq   R.sq R.sq.change Res.Df   RSS Df Sum of Sq     F Pr(>F)
#> 1   0.4682 0.5004     0.00000     93 57.97                          
#> 2   0.4615 0.5268     0.02633     87 54.91  6     3.055 0.807  0.567

The R-squared change due to adding this interaction (represented by six product terms) to a model without interaction is 0.02633, and the p-value of this change is 0.567.

The function test_highest() can be used for third and higher order term. For example:

lm_mod3 <- lm(y ~ x1 + x2 + x3*x4*cat2, data_test1)
summary(lm_mod3)
#> 
#> Call:
#> lm(formula = y ~ x1 + x2 + x3 * x4 * cat2, data = data_test1)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.19218 -0.37090 -0.03274  0.35201  1.16865 
#> 
#> Coefficients:
#>                  Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)     -0.078753   0.110170  -0.715  0.47674    
#> x1               0.434185   0.066903   6.490 6.14e-09 ***
#> x2               0.263447   0.057819   4.556 1.79e-05 ***
#> x3               0.536450   0.127818   4.197 6.82e-05 ***
#> x4               0.393184   0.148079   2.655  0.00952 ** 
#> cat2North       -0.119296   0.178050  -0.670  0.50473    
#> cat2South        0.180559   0.165472   1.091  0.27839    
#> cat2West         0.105126   0.176470   0.596  0.55300    
#> x3:x4            0.140337   0.223254   0.629  0.53136    
#> x3:cat2North    -0.358721   0.190254  -1.885  0.06291 .  
#> x3:cat2South    -0.177283   0.158573  -1.118  0.26684    
#> x3:cat2West     -0.341833   0.256182  -1.334  0.18579    
#> x4:cat2North     0.004772   0.186679   0.026  0.97967    
#> x4:cat2South    -0.167742   0.200490  -0.837  0.40521    
#> x4:cat2West      0.046933   0.248811   0.189  0.85085    
#> x3:x4:cat2North -0.202793   0.245556  -0.826  0.41128    
#> x3:x4:cat2South -0.264180   0.243922  -1.083  0.28196    
#> x3:x4:cat2West  -0.274082   0.336493  -0.815  0.41770    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.5771 on 82 degrees of freedom
#> Multiple R-squared:  0.7646, Adjusted R-squared:  0.7159 
#> F-statistic: 15.67 on 17 and 82 DF,  p-value: < 2.2e-16
test_highest(lm_mod3)
#> Analysis of Variance Table
#> 
#> Model 1: y ~ x1 + x2 + x3 + x4 + cat2 + x3:x4 + x3:cat2 + x4:cat2
#> Model 2: y ~ x1 + x2 + x3 * x4 * cat2
#>   adj.R.sq   R.sq R.sq.change Res.Df   RSS Df Sum of Sq     F Pr(>F)
#> 1   0.7217 0.7611    0.000000     85 27.72                          
#> 2   0.7159 0.7646    0.003568     82 27.31  3     0.414 0.414  0.743

The model with three-way interaction is compared to a model with only two-way interactions (x3:x4, x3:cat2, and x4:cat2).

If a model has more than one term of the highest order, an error will be raised:

lm_mod2 <- lm(y ~ x1 + x2*x3 + x2*x4, data_test1)
test_highest(lm_mod2)
#> Error in highest_order(lm_out): No unique highest order term.

Please refer to the help page of test_highest() on how it works,

Issues

If you have any suggestions and found any bugs, please feel feel to open a GitHub issue. Thanks.