The standardize package provides tools for controlling continuous variable scaling and factor contrasts. The goal of these standardizations is to keep the regression parameters on similar scales, and to ensure that the intercept (which is the predicted value of an observation when all other coefficients are multiplied by 0) represents the corrected mean (i.e. the predicted value for an observation which is average in every way, holding covariates at their mean values and averaging over group differences in factors). When the predictors are all on a similar scale, there are computational benefits for both frequentist and Bayesian approaches in mixed effects regressions, reasonable Bayesian priors are easier to specify, and regression output is easier to interpret.
Throughout this vignette, we will use the ptk dataset to demonstrate the use of the standardize package. A summary of the data can be seen below. We will first discuss scaling continuous variables with the scale function from base R, and with the scale_by function from standardize. Then we’ll discuss contrasts for unordered and ordered factors with named_sum_contr and scaled_contr_poly (respectively). Finally, we will use the standardize function to quickly use all of these tools at once.
library(standardize)
#>
#> ***********************************************************
#> Loading standardize package version 0.2.2
#> Call standardize.news() to see new features/changes
#> ***********************************************************
summary(ptk)
#> cdur vdur place stress prevowel
#> Min. : 61.0 Min. : 0.00 Bilabial:228 Post-Tonic:250 a:217
#> 1st Qu.:117.0 1st Qu.: 49.00 Dental :275 Tonic :252 e:169
#> Median :133.0 Median : 62.00 Velar :248 Unstressed:249 i:146
#> Mean :137.5 Mean : 60.69 o:151
#> 3rd Qu.:155.0 3rd Qu.: 77.00 u: 68
#> Max. :281.0 Max. :190.00
#>
#> posvowel wordpos wordfreq speechrate sex
#> a:212 Initial:322 Min. : 1 Min. : 2.000 Female:384
#> e:144 Medial :429 1st Qu.: 1104 1st Qu.: 5.000 Male :367
#> i: 86 Median : 4899 Median : 6.000
#> o:255 Mean : 18840 Mean : 5.812
#> u: 54 3rd Qu.: 26610 3rd Qu.: 7.000
#> Max. :158168 Max. :10.000
#>
#> speaker
#> s02 : 45
#> s03 : 45
#> s04 : 45
#> s07 : 45
#> s11 : 45
#> s16 : 44
#> (Other):482
Continuous variables include covariates (i.e. fixed effects which take on continuous values) and the dependent variable in linear regression. The scale function in base R, with its default arguments, places continuous variables on unit scale by subtracting the mean of the variable and dividing the result by the variable’s standard deviation (also sometimes called z-scoring or simply scaling). The result is that the values in the transformed variable have the same relationship to one another as in the untransformed variable, but the transformed variable has mean 0 and standard deviation 1. As an example, consider the simple linear regression on total consonant duration cdur with speechrate as the predictor when we don’t scale either variable:
mean(ptk$cdur)
#> [1] 137.5473
sd(ptk$cdur)
#> [1] 30.074
mean(ptk$speechrate)
#> [1] 5.811869
sd(ptk$speechrate)
#> [1] 1.294624
summary(lm(cdur ~ speechrate, ptk))
#>
#> Call:
#> lm(formula = cdur ~ speechrate, data = ptk)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -75.064 -19.688 -4.486 16.514 123.142
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 186.7623 4.7061 39.69 <2e-16 ***
#> speechrate -8.4680 0.7904 -10.71 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 28.02 on 749 degrees of freedom
#> Multiple R-squared: 0.1329, Adjusted R-squared: 0.1317
#> F-statistic: 114.8 on 1 and 749 DF, p-value: < 2.2e-16
In the output for the regression on the raw variables, the intercept is 186.8 ms. This is the predicted value when all of the other coefficients are multiplied by 0. In this case, this does not describe anything interpretable, since from the data summary we can see that the minimum speech rate is 2 nuclei per second (and a value of 0 nuclei per second is not possible). The coefficient for speech rate (-8.5) means that for each increase of 1 nucleus per second in speechrate, there is an expected decrease in total consonant duration of 8.5 ms. But is an increase of 1 nucleus per second a large increase? And is a decrease in 8.5 ms a large decrease? This information is not in the output. When we scale cdur and speechrate so that they both have mean 0 and standard deviation 1, the output becomes easier to interpret:
ptk$cdur_scaled <- scale(ptk$cdur)[, 1]
ptk$sr_scaled <- scale(ptk$speechrate)[, 1]
mean(ptk$cdur_scaled)
#> [1] -1.539605e-16
sd(ptk$cdur_scaled)
#> [1] 1
mean(ptk$sr_scaled)
#> [1] -1.573508e-16
sd(ptk$sr_scaled)
#> [1] 1
summary(lm(cdur_scaled ~ sr_scaled, ptk))
#>
#> Call:
#> lm(formula = cdur_scaled ~ sr_scaled, data = ptk)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.4960 -0.6547 -0.1492 0.5491 4.0946
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.164e-16 3.400e-02 0.00 1
#> sr_scaled -3.645e-01 3.402e-02 -10.71 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.9318 on 749 degrees of freedom
#> Multiple R-squared: 0.1329, Adjusted R-squared: 0.1317
#> F-statistic: 114.8 on 1 and 749 DF, p-value: < 2.2e-16
Looking at the output of the regression on the scaled variables, we see that the R-squared value and F-statistic are unchanged, but now the intercept term is 0 and the estimate for the effect of scaled speech rate is -0.36. In this case, a value of 0 for scaled speech rate indicates the average value, and so the intercept represents the predicted consonant duration on unit scale at an average speech rate. The effect of scaled speech rate now represents the expected change in standard deviations of consonant duration for a 1-standard-deviation increase in speech rate; that is, for a 1-SD increase in speech rate, we expect a 0.36-SD decrease in consonant duration, which we could call an effect of moderate magnitude. When we add more covariates into the model, so long as they are all scaled prior to regression, we can directly compare the effect sizes of the different coefficients with no further math required, since they all represent the expected change in dependent variable standard deviations for a 1-SD increase in the covariate.
In addition to the scale function in base R, the standardize package has the function scale_by, which allows a continuous variable to be placed on unit scale within each level of a factor (or the interaction of several factors). For example, say that we are interested in whether or not a speaker’s relative speech rate affects their total consonant durations rather than speech rate in general. In other words, some speakers may simply speak more quickly or more slowly in general, and some may exhibit more speech rate variation than others, and we are interested in modeling speech rate relative to the speaker’s tendencies. In this case, we can use the scale_by function:
ptk$sr_scaled_by_speaker <- scale_by(speechrate ~ speaker, ptk)
mean(ptk$sr_scaled_by_speaker)
#> [1] -3.607564e-17
sd(ptk$sr_scaled_by_speaker)
#> [1] 0.9886017
with(ptk, tapply(speechrate, speaker, mean))
#> s01 s02 s03 s04 s05 s06 s07 s08
#> 5.540706 5.650011 5.773467 6.049278 6.005344 5.619478 6.516664 5.802228
#> s09 s10 s11 s12 s13 s14 s15 s16
#> 5.855702 5.381143 5.702743 5.483161 6.290449 6.353387 5.358924 5.769699
#> s17 s18
#> 5.542354 5.777147
with(ptk, tapply(speechrate, speaker, sd))
#> s01 s02 s03 s04 s05 s06 s07 s08
#> 1.084740 1.230526 1.136553 1.261346 1.294189 1.339326 1.242599 1.350230
#> s09 s10 s11 s12 s13 s14 s15 s16
#> 1.472762 1.184024 1.146887 1.160242 1.437768 1.145203 1.432073 1.254818
#> s17 s18
#> 1.357880 1.250270
with(ptk, tapply(sr_scaled, speaker, mean))
#> s01 s02 s03 s04 s05 s06
#> -0.209453180 -0.125023252 -0.029663172 0.183380177 0.149444392 -0.148607756
#> s07 s08 s09 s10 s11 s12
#> 0.544401163 -0.007446973 0.033857642 -0.332704012 -0.084291855 -0.253902532
#> s13 s14 s15 s16 s17 s18
#> 0.369666620 0.418281977 -0.349865920 -0.032573278 -0.208180769 -0.026820301
with(ptk, tapply(sr_scaled, speaker, sd))
#> s01 s02 s03 s04 s05 s06 s07 s08
#> 0.8378801 0.9504890 0.8779020 0.9742948 0.9996641 1.0345287 0.9598147 1.0429513
#> s09 s10 s11 s12 s13 s14 s15 s16
#> 1.1375979 0.9145695 0.8858839 0.8961996 1.1105678 0.8845834 1.1061685 0.9692528
#> s17 s18
#> 1.0488603 0.9657396
with(ptk, tapply(sr_scaled_by_speaker, speaker, mean))
#> s01 s02 s03 s04 s05
#> -2.960268e-16 3.165268e-18 -3.119346e-16 1.004790e-16 2.183358e-16
#> s06 s07 s08 s09 s10
#> -7.174906e-17 -1.953684e-16 -2.200950e-16 1.468327e-16 -1.266348e-16
#> s11 s12 s13 s14 s15
#> -2.923792e-16 3.850162e-16 1.980309e-16 6.325185e-17 1.427623e-16
#> s16 s17 s18
#> -2.220446e-16 1.582818e-16 -2.505491e-16
with(ptk, tapply(sr_scaled_by_speaker, speaker, sd))
#> s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17 s18
#> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
The overall mean of sr_scaled_by_speaker is still 0, and while the standard deviation is not exactly 1, it is very close at 0.99. When we look at the raw speechrate variable by speaker, we can see that speakers differ somewhat in their mean speech rate and in the amount that the deviate from their own mean speech rate. When we look at the sr_scaled variable that we created earlier with the base R scale function, these same differences persist, but now expressed on unit scale in terms of the overall dataset. For sr_scaled_by_speaker, however, each speaker’s mean value is 0 and each speaker’s standard deviation is 1, and the variable is interpreted as how slowly or quickly the speaker was talking given their own speech rate tendencies.
If we wanted the variables resulting from scale and scale_by to have a different standard deviation than 1, this can be easily accomplished in the following way:
ptk$sr_scaled_0.5 <- scale(ptk$speechrate) * 0.5
ptk$sr_scaled_by_speaker_0.5 <- scale_by(speechrate ~ speaker, ptk, scale = 0.5)
mean(ptk$sr_scaled_0.5)
#> [1] -7.867542e-17
sd(ptk$sr_scaled_0.5)
#> [1] 0.5
with(ptk, tapply(sr_scaled_by_speaker_0.5, speaker, mean))
#> s01 s02 s03 s04 s05
#> -1.480134e-16 1.582634e-18 -1.559673e-16 5.023952e-17 1.091679e-16
#> s06 s07 s08 s09 s10
#> -3.587453e-17 -9.768421e-17 -1.100475e-16 7.341635e-17 -6.331741e-17
#> s11 s12 s13 s14 s15
#> -1.461896e-16 1.925081e-16 9.901544e-17 3.162593e-17 7.138116e-17
#> s16 s17 s18
#> -1.110223e-16 7.914090e-17 -1.252746e-16
with(ptk, tapply(sr_scaled_by_speaker_0.5, speaker, sd))
#> s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17 s18
#> 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
Factors are variables which take on a defined set of categorical values called levels rather than continuous values. In regression, a factor with K levels is modeled through the use of K - 1 dummy variables. Each level of the factor is assigned a value for each dummy variable based on a contrast matrix. So, for example, a factor with four levels has a contrast matrix with four rows (one for each level) and three columns (one for each dummy variable), with the values in the cells of the matrix determining the numerical expression for the factor levels in the dummy variables. There are two general types of factors, ordered and unordered, whose contrasts are treated differently.
Unordered factors take on two or more categorical values which are not intrinsically ordered (or have a somewhat ordered interpretation but there are only two categories, as is sometimes the case with factors coded as false vs. true, 0 vs. 1, or no vs. yes). For unordered factors, the default in R is to use treatment contrasts, where the first level is coded as 0 for all of the dummy variables, and the remaining levels each have a dummy variable for which they are coded +1, and are then coded as 0 for the other dummy variables, as can be seen in the following example using prevowel (the preceding vowel’s phoneme identity).
options(contrasts = c("contr.treatment", "contr.poly"))
contrasts(ptk$prevowel)
#> e i o u
#> a 0 0 0 0
#> e 1 0 0 0
#> i 0 1 0 0
#> o 0 0 1 0
#> u 0 0 0 1
summary(lm(cdur_scaled ~ prevowel, ptk))
#>
#> Call:
#> lm(formula = cdur_scaled ~ prevowel, data = ptk)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.3844 -0.6886 -0.1139 0.5714 4.7076
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.06240 0.06765 0.922 0.3566
#> prevowele -0.11995 0.10224 -1.173 0.2411
#> prevoweli 0.09704 0.10667 0.910 0.3632
#> prevowelo -0.22329 0.10561 -2.114 0.0348 *
#> prevowelu -0.10358 0.13850 -0.748 0.4547
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.9965 on 746 degrees of freedom
#> Multiple R-squared: 0.01219, Adjusted R-squared: 0.006891
#> F-statistic: 2.301 on 4 and 746 DF, p-value: 0.05722
With treatment contrasts, the intercept loses the interpretation of the corrected mean, since when all of the dummy variables in the contrast matrix above are multiplied by 0, the resulting value corresponds to a preceding /a/. The coefficients prevowele … prevowelu then represent the difference between each of the other preceding vowel categories and /a/. To avoid this (and to ensure that the coefficients for the dummy variables stay, on average, closer to zero, but without altering the ultimate interpretation of the results), sum contrasts are used. With sum contrasts in R, the first K - 1 levels each get a dummy variable for which they are coded +1, and then are valued 0 for the other dummy variables. The last level is assigned a value of -1 for all of the dummy variables. Sum contrasts also have additional computational benefits in comparison to treatment contrasts for similar reasons as covariate scaling. Recoding the contrasts for prevowel with sum contrasts, we get:
options(contrasts = c("contr.sum", "contr.poly"))
contrasts(ptk$prevowel)
#> [,1] [,2] [,3] [,4]
#> a 1 0 0 0
#> e 0 1 0 0
#> i 0 0 1 0
#> o 0 0 0 1
#> u -1 -1 -1 -1
summary(lm(cdur_scaled ~ prevowel, ptk))
#>
#> Call:
#> lm(formula = cdur_scaled ~ prevowel, data = ptk)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.3844 -0.6886 -0.1139 0.5714 4.7076
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.007554 0.039210 -0.193 0.8473
#> prevowel1 0.069957 0.065448 1.069 0.2855
#> prevowel2 -0.049994 0.071157 -0.703 0.4825
#> prevowel3 0.167001 0.074958 2.228 0.0262 *
#> prevowel4 -0.153338 0.074051 -2.071 0.0387 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.9965 on 746 degrees of freedom
#> Multiple R-squared: 0.01219, Adjusted R-squared: 0.006891
#> F-statistic: 2.301 on 4 and 746 DF, p-value: 0.05722
With sum contrasts, the intercept maintains the interpretation of the corrected mean, since when all of the dummy variable coefficients are multiplied by 0, it averages over their effects (note that no row in the contrast matrix above has all 0’s, and thus multiplying all of the coefficients by 0 cannot describe any one level; rather, the mean of the values in each column is 0, and so multiplying all of the dummy variable coefficients by 0 averages over their effects). However, one downside to the default implementation is that the coefficients are numbered rather than named. The named_contr_sum function from the standardize package orders levels alphabetically, applies sum contrasts to them, and names the contrasts:
contrasts(ptk$prevowel) <- named_contr_sum(ptk$prevowel)
contrasts(ptk$prevowel)
#> a e i o
#> a 1 0 0 0
#> e 0 1 0 0
#> i 0 0 1 0
#> o 0 0 0 1
#> u -1 -1 -1 -1
summary(lm(cdur_scaled ~ prevowel, ptk))
#>
#> Call:
#> lm(formula = cdur_scaled ~ prevowel, data = ptk)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.3844 -0.6886 -0.1139 0.5714 4.7076
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.007554 0.039210 -0.193 0.8473
#> prevowela 0.069957 0.065448 1.069 0.2855
#> prevowele -0.049994 0.071157 -0.703 0.4825
#> prevoweli 0.167001 0.074958 2.228 0.0262 *
#> prevowelo -0.153338 0.074051 -2.071 0.0387 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.9965 on 746 degrees of freedom
#> Multiple R-squared: 0.01219, Adjusted R-squared: 0.006891
#> F-statistic: 2.301 on 4 and 746 DF, p-value: 0.05722
Note that the named output is identical in all other ways to the number output in this case. The intercept is the corrected mean, prevowela … prevowelo represent the difference between the named category and the corrected mean, and the /u/ category is obtained by subtracting all four the prevowel coefficients from the intercept.
The named_contr_sum function also accepts a scale argument by which the entire contrast matrix is multiplied. This allows the contrast matrix deviation magnitude to be defined. For example:
contrasts(ptk$prevowel) <- named_contr_sum(ptk$prevowel, scale = 0.5)
contrasts(ptk$prevowel)
#> a e i o
#> a 0.5 0.0 0.0 0.0
#> e 0.0 0.5 0.0 0.0
#> i 0.0 0.0 0.5 0.0
#> o 0.0 0.0 0.0 0.5
#> u -0.5 -0.5 -0.5 -0.5
One last note on named_contr_sum is that if there are only two levels and they are equal to (ignoring case) “F” and “T”, “FALSE” and “TRUE”, “N”, and “Y”, “NO”, and “YES”, or “0” and “1”, then their order is reversed. This makes it so the positive level gets the dummy coefficient rather than the negative level, yielding a more intuitive interpretation for the resulting coefficients. For example, if we were interested in differentiating only between high vowels (/i/ and /u/) and non-high vowels (/e/, /o/, and /a/), we could do the following (note also that return_contr can be set to FALSE to create a factor with the contrasts, which is useful when the original variable is not a factor):
ptk$prehigh <- ptk$prevowel %in% c("i", "u")
class(ptk$prehigh)
#> [1] "logical"
unique(ptk$prehigh)
#> [1] FALSE TRUE
ptk$prehigh <- named_contr_sum(ptk$prehigh, return_contr = FALSE)
class(ptk$prehigh)
#> [1] "factor"
levels(ptk$prehigh)
#> [1] "TRUE" "FALSE"
contrasts(ptk$prehigh)
#> TRUE
#> TRUE 1
#> FALSE -1
Ordered factors are variables which take on more than two categorical values, with the categories representing a hierarchy for which we may not expect the trend to be strictly linear. For example, say we are interested in the effect of preceding vowel height, considering three levels (/a/ = Low, /e o/ = Mid, /i u/ = High). We might hypothesize that the general trend is for relatively higher vowels to have shorter durations than relatively lower vowels, but at the same time not expect that the difference between Low and Mid is the same as the difference between Mid and High. In this case we could create an ordered factor preheight with levels Low < Mid < High. In R, the default contrasts for ordered factors are orthogonal polynomial contrasts. That is, if there are K factor levels, then the contrast matrix is an orthogonal polynomial of degree K - 1, where the first contrast column is the linear component, the second the quadratic, the third the cubic, the fourth the fourth power, etc. until the K - 1’th power is reached. Our hypothesis that the trend is in general negative then would be supported by a negative linear trend. This is exemplified in the following code:
ptk$preheight <- "Mid"
ptk$preheight[ptk$prevowel == "a"] <- "Low"
ptk$preheight[ptk$prevowel %in% c("i", "u")] <- "High"
ptk$preheight <- factor(ptk$preheight, ordered = TRUE, levels = c("Low",
"Mid", "High"))
head(ptk$preheight)
#> [1] Low High High High Mid High
#> Levels: Low < Mid < High
contrasts(ptk$preheight)
#> .L .Q
#> [1,] -7.071068e-01 0.4082483
#> [2,] -9.073800e-17 -0.8164966
#> [3,] 7.071068e-01 0.4082483
However, the scale of the contrast matrix columns depends on the number of levels. The mean of each contrast column is always 0, and the columns of the contrast matrix are completely uncorrelated, but the standard deviations of the contrast matrix columns decreases as the number of levels increases:
contr3 <- contr.poly(3)
contr5 <- contr.poly(5)
apply(contr3, 2, mean)
#> .L .Q
#> -6.723860e-17 3.700743e-17
apply(contr5, 2, mean)
#> .L .Q .C ^4
#> -6.591949e-18 2.219362e-17 4.148700e-17 2.783689e-18
apply(contr3, 2, sd)
#> .L .Q
#> 0.7071068 0.7071068
apply(contr5, 2, sd)
#> .L .Q .C ^4
#> 0.5 0.5 0.5 0.5
For this reason, the standardize package has a function scaled_contr_poly where the standard deviation of the contrast matrix columns for ordered factors can be specified through its scale argument (with default 1):
sc_1_contr3 <- scaled_contr_poly(3)
sc_0.5_contr3 <- scaled_contr_poly(3, scale = 0.5)
sc_1_contr3
#> .L .Q
#> [1,] -1.00000e+00 0.5773503
#> [2,] -3.32076e-17 -1.1547005
#> [3,] 1.00000e+00 0.5773503
apply(sc_1_contr3, 2, sd)
#> .L .Q
#> 1 1
sc_0.5_contr3
#> .L .Q
#> [1,] -5.00000e-01 0.2886751
#> [2,] -1.66038e-17 -0.5773503
#> [3,] 5.00000e-01 0.2886751
apply(sc_0.5_contr3, 2, sd)
#> .L .Q
#> 0.5 0.5
The resulting contrasts are still orthogonal polynomial contrasts; they are simply placed on a uniform scale regardless of the number of factor levels. This affects the magnitude of the resulting coefficients, but, as with the standardization of unordered factors and continuous variables discussed above, it does not alter the significance of the variable:
contrasts(ptk$preheight)
#> .L .Q
#> [1,] -7.071068e-01 0.4082483
#> [2,] -9.073800e-17 -0.8164966
#> [3,] 7.071068e-01 0.4082483
summary(lm(cdur_scaled ~ preheight, ptk))
#>
#> Call:
#> lm(formula = cdur_scaled ~ preheight, data = ptk)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.4390 -0.7099 -0.1139 0.5536 4.7076
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.01726 0.03702 0.466 0.641
#> preheight.L 0.02354 0.06792 0.347 0.729
#> preheight.Q 0.15135 0.06007 2.519 0.012 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.997 on 748 degrees of freedom
#> Multiple R-squared: 0.008562, Adjusted R-squared: 0.005911
#> F-statistic: 3.23 on 2 and 748 DF, p-value: 0.04011
contrasts(ptk$preheight) <- scaled_contr_poly(ptk$preheight)
contrasts(ptk$preheight)
#> .L .Q
#> Low -1.00000e+00 0.5773503
#> Mid -3.32076e-17 -1.1547005
#> High 1.00000e+00 0.5773503
summary(lm(cdur_scaled ~ preheight, ptk))
#>
#> Call:
#> lm(formula = cdur_scaled ~ preheight, data = ptk)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.4390 -0.7099 -0.1139 0.5536 4.7076
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.01726 0.03702 0.466 0.641
#> preheight.L 0.01665 0.04803 0.347 0.729
#> preheight.Q 0.10702 0.04248 2.519 0.012 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.997 on 748 degrees of freedom
#> Multiple R-squared: 0.008562, Adjusted R-squared: 0.005911
#> F-statistic: 3.23 on 2 and 748 DF, p-value: 0.04011
The standardize function implements scale, scale_by, named_contr_sum, and scaled_contr_poly automatically, allowing regressions to be easily fit in a standardized space. It takes the following arguments:
The function first calls model.frame. If family is gaussian (i.e. a linear model), then the response (i.e. dependent variable) is placed on unit scale (mean 0 and standard deviation 1) regardless of what the scale argument to standardize is; if the response contains a call to scale_by, then it is placed on unit scale within each level of the conditioning factor. For offsets (again, if family* is gaussian), the values are divided by the standard deviation of the response variable prior to scaling (within-factor-level if scale_by** is used on the response). Then, for all values of family, random effects groups (if any) are coerced to unordered factors, and any other predictors which are characters or which contain only two unique non-NA values (regardless of their class) are converted to unordered factors and assigned contrasts with named_contr_sum, passing along the scale argument. Ordered factors are assigned contrasts with scaled_contr_poly, passing along the scale argument. Continuous predictors which contain a call to scale_by are re-calculated passing along the scale argument. Finally, continuous predictors which do not contain a call to scale_by are scaled using the scale function, ensuring that the resulting variable has standard deviation equal to the scale argument to standardize.
The column names of the model frame are then renamed so that they are valid variable names, and the formula passed to standardize is updated with these variable names. A list with class standardized is then returned with the following elements:
To illustrate the use of the standardize function, we will fit a linear mixed effects regression with lmer in the lme4 package with cdur as the response, place, stress, preheight, the natural log of wordfreq, and speecrhate scaled by speaker as fixed effects, and random intercepts for speaker. We begin by creating preheight and then calling standardize:
ptk$preheight <- "Mid"
ptk$preheight[ptk$prevowel == "a"] <- "Low"
ptk$preheight[ptk$prevowel %in% c("i", "u")] <- "High"
ptk$preheight <- factor(ptk$preheight, ordered = TRUE, levels = c("Low",
"Mid", "High"))
sobj <- standardize(cdur ~ place + stress + preheight + log(wordfreq) +
scale_by(speechrate ~ speaker) + (1 | speaker), ptk)
Now lets examine the sobj object created by standardize:
is.standardized(sobj)
#> [1] TRUE
sobj
#>
#> Call:
#> standardize(formula = cdur ~ place + stress + preheight + log(wordfreq) +
#> scale_by(speechrate ~ speaker) + (1 | speaker), data = ptk)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Standardized Formula:
#> cdur ~ place + stress + preheight + log_wordfreq + speechrate_scaled_by_speaker +
#> (1 | speaker)
#>
#> Variables:
#> Variable Standardized Name Class
#> cdur cdur response.numeric
#> place place factor
#> stress stress factor
#> preheight preheight ordered
#> log(wordfreq) log_wordfreq numeric
#> scale_by(speechrate ~ speaker) speechrate_scaled_by_speaker scaledby
#> speaker speaker group
#>
#> Response has mean 0 and standard deviation 1.
#>
#> Continuous variables have mean 0 and standard deviation 1
#> (within-factor-level if scale_by was used)
#> Unordered factors have sum contrasts with deviation 1
#> Ordered factors have orthogonal polynomial contrasts whose
#> columns have standard deviation 1
#> Grouping factors are coded as unordered factors with default contrasts
names(sobj)
#> [1] "call" "scale" "formula" "family" "data" "offset"
#> [7] "pred" "variables" "contrasts" "groups"
head(sobj$data)
#> cdur place stress preheight log_wordfreq
#> 1 0.58032620 Velar Unstressed Low -1.64753617
#> 2 0.21456174 Velar Post-Tonic High -1.64753617
#> 3 -0.11795140 Dental Unstressed High -0.05570344
#> 4 0.01505386 Dental Post-Tonic High -1.90076049
#> 5 0.91283934 Velar Tonic Mid -0.19203879
#> 6 1.51136300 Dental Post-Tonic High -0.22152148
#> speechrate_scaled_by_speaker speaker
#> 1 -0.4984662 s01
#> 2 1.3452937 s01
#> 3 -0.4984662 s01
#> 4 0.6454974 s01
#> 5 2.2671736 s01
#> 6 -2.3422261 s01
mean(sobj$data$cdur)
#> [1] -1.539605e-16
sd(sobj$data$cdur)
#> [1] 1
mean(sobj$data$log_wordfreq)
#> [1] 1.550764e-16
sd(sobj$data$log_wordfreq)
#> [1] 1
all.equal(scale(log(ptk$wordfreq))[, 1], sobj$data$log_wordfreq[, 1])
#> [1] TRUE
with(sobj$data, tapply(speechrate_scaled_by_speaker, speaker, mean))
#> s01 s02 s03 s04 s05
#> -2.960268e-16 3.165268e-18 -3.119346e-16 1.004790e-16 2.183358e-16
#> s06 s07 s08 s09 s10
#> -7.174906e-17 -1.953684e-16 -2.200950e-16 1.468327e-16 -1.266348e-16
#> s11 s12 s13 s14 s15
#> -2.923792e-16 3.850162e-16 1.980309e-16 6.325185e-17 1.427623e-16
#> s16 s17 s18
#> -2.220446e-16 1.582818e-16 -2.505491e-16
with(sobj$data, tapply(speechrate_scaled_by_speaker, speaker, sd))
#> s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17 s18
#> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
sobj$contrasts
#> $place
#> Bilabial Dental
#> Bilabial 1 0
#> Dental 0 1
#> Velar -1 -1
#>
#> $stress
#> Post-Tonic Tonic
#> Post-Tonic 1 0
#> Tonic 0 1
#> Unstressed -1 -1
#>
#> $preheight
#> .L .Q
#> Low -1.00000e+00 0.5773503
#> Mid -3.32076e-17 -1.1547005
#> High 1.00000e+00 0.5773503
sobj$groups
#> $speaker
#> [1] "s01" "s02" "s03" "s04" "s05" "s06" "s07" "s08" "s09" "s10" "s11" "s12"
#> [13] "s13" "s14" "s15" "s16" "s17" "s18"
We can see that all of the regression variables have been placed on a similar scale, using the default scale argument of 1. The majority of the predictors did not change name, but those which included function calls (log(wordfreq) and scale_by(speechrate ~ speaker)) have been altered so that they are valid variable names. If we were to call standardize with scale = 0.5, then cdur would still have mean 0 and standard deviation 1, but the predictors would all have scale 0.5:
sobj <- standardize(cdur ~ place + stress + preheight + log(wordfreq) +
scale_by(speechrate ~ speaker) + (1 | speaker), ptk, scale = 0.5)
sobj
#>
#> Call:
#> standardize(formula = cdur ~ place + stress + preheight + log(wordfreq) +
#> scale_by(speechrate ~ speaker) + (1 | speaker), data = ptk,
#> scale = 0.5)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Standardized Formula:
#> cdur ~ place + stress + preheight + log_wordfreq + speechrate_scaled_by_speaker +
#> (1 | speaker)
#>
#> Variables:
#> Variable Standardized Name Class
#> cdur cdur response.numeric
#> place place factor
#> stress stress factor
#> preheight preheight ordered
#> log(wordfreq) log_wordfreq numeric
#> scale_by(speechrate ~ speaker) speechrate_scaled_by_speaker scaledby
#> speaker speaker group
#>
#> Response has mean 0 and standard deviation 1.
#>
#> Continuous variables have mean 0 and standard deviation 0.5
#> (within-factor-level if scale_by was used)
#> Unordered factors have sum contrasts with deviation 0.5
#> Ordered factors have orthogonal polynomial contrasts whose
#> columns have standard deviation 0.5
#> Grouping factors are coded as unordered factors with default contrasts
names(sobj)
#> [1] "call" "scale" "formula" "family" "data" "offset"
#> [7] "pred" "variables" "contrasts" "groups"
head(sobj$data)
#> cdur place stress preheight log_wordfreq
#> 1 0.58032620 Velar Unstressed Low -0.82376808
#> 2 0.21456174 Velar Post-Tonic High -0.82376808
#> 3 -0.11795140 Dental Unstressed High -0.02785172
#> 4 0.01505386 Dental Post-Tonic High -0.95038025
#> 5 0.91283934 Velar Tonic Mid -0.09601939
#> 6 1.51136300 Dental Post-Tonic High -0.11076074
#> speechrate_scaled_by_speaker speaker
#> 1 -0.2492331 s01
#> 2 0.6726468 s01
#> 3 -0.2492331 s01
#> 4 0.3227487 s01
#> 5 1.1335868 s01
#> 6 -1.1711131 s01
mean(sobj$data$cdur)
#> [1] -1.539605e-16
sd(sobj$data$cdur)
#> [1] 1
mean(sobj$data$log_wordfreq)
#> [1] 7.753821e-17
sd(sobj$data$log_wordfreq)
#> [1] 0.5
all.equal(0.5 * scale(log(ptk$wordfreq))[, 1], sobj$data$log_wordfreq[, 1])
#> [1] TRUE
with(sobj$data, tapply(speechrate_scaled_by_speaker, speaker, mean))
#> s01 s02 s03 s04 s05
#> -1.480134e-16 1.582634e-18 -1.559673e-16 5.023952e-17 1.091679e-16
#> s06 s07 s08 s09 s10
#> -3.587453e-17 -9.768421e-17 -1.100475e-16 7.341635e-17 -6.331741e-17
#> s11 s12 s13 s14 s15
#> -1.461896e-16 1.925081e-16 9.901544e-17 3.162593e-17 7.138116e-17
#> s16 s17 s18
#> -1.110223e-16 7.914090e-17 -1.252746e-16
with(sobj$data, tapply(speechrate_scaled_by_speaker, speaker, sd))
#> s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17 s18
#> 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
sobj$contrasts
#> $place
#> Bilabial Dental
#> Bilabial 0.5 0.0
#> Dental 0.0 0.5
#> Velar -0.5 -0.5
#>
#> $stress
#> Post-Tonic Tonic
#> Post-Tonic 0.5 0.0
#> Tonic 0.0 0.5
#> Unstressed -0.5 -0.5
#>
#> $preheight
#> .L .Q
#> Low -5.00000e-01 0.2886751
#> Mid -1.66038e-17 -0.5773503
#> High 5.00000e-01 0.2886751
sobj$groups
#> $speaker
#> [1] "s01" "s02" "s03" "s04" "s05" "s06" "s07" "s08" "s09" "s10" "s11" "s12"
#> [13] "s13" "s14" "s15" "s16" "s17" "s18"
The standardized object can be used to fit regression models, and the resulting regression model can then be used with functions from other packages such as lme4, afex, and emmeans with a few caveats.
We fit the mixed effects regression using the standardized object with scale = 0.5 by simply passing its formula and data elements to the lmer function:
library(lme4)
#> Loading required package: Matrix
#> Warning: package 'Matrix' was built under R version 3.6.3
mod <- lmer(sobj$formula, sobj$data)
summary(mod)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula:
#> cdur ~ place + stress + preheight + log_wordfreq + speechrate_scaled_by_speaker +
#> (1 | speaker)
#> Data: sobj$data
#>
#> REML criterion at convergence: 2006.9
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -2.5479 -0.6919 -0.1014 0.5883 4.2981
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> speaker (Intercept) 0.09375 0.3062
#> Residual 0.79081 0.8893
#> Number of obs: 751, groups: speaker, 18
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) 0.01576 0.07946 0.198
#> placeBilabial 0.01651 0.09553 0.173
#> placeDental -0.02341 0.09183 -0.255
#> stressPost-Tonic -0.12491 0.09681 -1.290
#> stressTonic 0.42435 0.09595 4.423
#> preheight.L 0.03529 0.09331 0.378
#> preheight.Q 0.11910 0.07848 1.518
#> log_wordfreq -0.07557 0.06763 -1.117
#> speechrate_scaled_by_speaker -0.61373 0.06643 -9.239
#>
#> Correlation of Fixed Effects:
#> (Intr) plcBlb plcDnt strP-T strssT prhg.L prhg.Q lg_wrd
#> placeBilabl 0.024
#> placeDental -0.039 -0.500
#> strssPst-Tn 0.004 0.053 -0.051
#> stressTonic -0.006 0.003 -0.003 -0.524
#> preheight.L 0.009 0.075 -0.106 -0.272 0.276
#> preheight.Q 0.083 -0.041 -0.127 0.029 -0.059 0.023
#> log_wordfrq 0.005 -0.092 0.062 -0.131 -0.029 0.106 0.093
#> spchrt_sc__ 0.007 0.015 -0.043 0.058 0.046 0.061 0.073 0.013
When predicting new data with the regression model, the new data first needs to be placed into the same standardized space as sobj$data. This can be done by calling predict with the standardied object as the first argument. The predict method for the standardized class also takes logical arguments response (whether or not the new data contains the response variable; default FALSE), fixed (whether or not the new data contains the fixed effects variables; default TRUE), and random (whether or not the new data contains the random effects variables; default TRUE). We will illustrate the use of the standardized predict method using the original data ptk as if it were new data:
newdata <- predict(sobj, ptk)
newdata_fe <- predict(sobj, ptk, random = FALSE)
newdata_re <- predict(sobj, ptk, fixed = FALSE)
head(newdata)
#> place stress preheight log_wordfreq speechrate_scaled_by_speaker speaker
#> 1 Velar Unstressed Low -0.82376808 -0.2492331 s01
#> 2 Velar Post-Tonic High -0.82376808 0.6726468 s01
#> 3 Dental Unstressed High -0.02785172 -0.2492331 s01
#> 4 Dental Post-Tonic High -0.95038025 0.3227487 s01
#> 5 Velar Tonic Mid -0.09601939 1.1335868 s01
#> 6 Dental Post-Tonic High -0.11076074 -1.1711131 s01
head(newdata_fe)
#> place stress preheight log_wordfreq speechrate_scaled_by_speaker
#> 1 Velar Unstressed Low -0.82376808 -0.2492331
#> 2 Velar Post-Tonic High -0.82376808 0.6726468
#> 3 Dental Unstressed High -0.02785172 -0.2492331
#> 4 Dental Post-Tonic High -0.95038025 0.3227487
#> 5 Velar Tonic Mid -0.09601939 1.1335868
#> 6 Dental Post-Tonic High -0.11076074 -1.1711131
head(newdata_re)
#> speaker
#> 1 s01
#> 2 s01
#> 3 s01
#> 4 s01
#> 5 s01
#> 6 s01
This standardized new data can then be used as the newdata argument to predict with the regression model as the first argument. The predict method may generate warnings about contrasts being dropped, but these warnings can be ignored (i.e. the predictions are still correct; warnings shouldn’t occur with the latest version of lme4, but may occur for older versions, and will occur for the predict method for fixed-effects-only models):
# predictions using both the fixed and random effects
preds <- predict(mod, newdata = newdata)
all.equal(preds, fitted(mod))
#> [1] TRUE
# predictions using only the fixed effects
preds_fe <- predict(mod, newdata = newdata_fe, re.form = NA)
head(preds)
#> 1 2 3 4 5 6
#> 0.36567215 -0.07756456 0.32566494 0.13159718 -0.26161147 0.98498387
head(preds_fe)
#> 1 2 3 4 5 6
#> 0.10143783 -0.34179887 0.06143062 -0.13263714 -0.52584579 0.72074955
When obtaining p-values for the predictors with the mixed function in the afex package, the check_contrasts argument should be set to FALSE to ensure that the correct contrasts are used (this shouldn’t matter when a regression model is passed to mixed, but if the formula and data elements of the standardized object are passed directly to mixed, it is necessary; it is best to just always specify check_contrasts = FALSE):
library(afex)
#> Registered S3 methods overwritten by 'car':
#> method from
#> influence.merMod lme4
#> cooks.distance.influence.merMod lme4
#> dfbeta.influence.merMod lme4
#> dfbetas.influence.merMod lme4
#> Warning: replacing previous import 'vctrs::data_frame' by 'tibble::data_frame'
#> when loading 'dplyr'
#> ************
#> Welcome to afex. For support visit: http://afex.singmann.science/
#> - Functions for ANOVAs: aov_car(), aov_ez(), and aov_4()
#> - Methods for calculating p-values with mixed(): 'KR', 'S', 'LRT', and 'PB'
#> - 'afex_aov' and 'mixed' objects can be passed to emmeans() for follow-up tests
#> - NEWS: library('emmeans') now needs to be called explicitly!
#> - Get and set global package options with: afex_options()
#> - Set orthogonal sum-to-zero contrasts globally: set_sum_contrasts()
#> - For example analyses see: browseVignettes("afex")
#> ************
#>
#> Attaching package: 'afex'
#> The following object is masked from 'package:lme4':
#>
#> lmer
pvals <- mixed(mod, data = sobj$data, check_contrasts = FALSE)
#> Formula (the first argument) converted to formula.
#> Fitting one lmer() model. [DONE]
#> Calculating p-values. [DONE]
pvals
#> Mixed Model Anova Table (Type 3 tests, KR-method)
#>
#> Model: cdur ~ place + stress + preheight + log_wordfreq + speechrate_scaled_by_speaker +
#> Model: (1 | speaker)
#> Data: $
#> Data: sobj
#> Data: data
#> Effect df F p.value
#> 1 place 2, 726.63 0.03 .967
#> 2 stress 2, 726.38 10.51 *** <.001
#> 3 preheight 2, 732.84 1.21 .299
#> 4 log_wordfreq 1, 731.67 1.25 .264
#> 5 speechrate_scaled_by_speaker 1, 725.10 85.35 *** <.001
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
The emmeans function in the emmeans package can be called as it would normally be called, but note that if the regression model contains polynomial covariates (i.e. if the formula passed to standardize includes calls to the poly function), then the results may be misleading (as is often the case). We will illustrate the use of emmeans with the stress factor (since the p-value for the overall factor was <.0001):
library(emmeans)
stress_comparisons <- emmeans(mod, pairwise ~ stress)
stress_comparisons
#> $emmeans
#> stress emmean SE df lower.CL upper.CL
#> Post-Tonic -0.0467 0.0932 32.5 -0.2365 0.1431
#> Tonic 0.2279 0.0926 31.6 0.0392 0.4166
#> Unstressed -0.1340 0.0924 31.4 -0.3223 0.0544
#>
#> Results are averaged over the levels of: place, preheight
#> Degrees-of-freedom method: kenward-roger
#> Confidence level used: 0.95
#>
#> $contrasts
#> contrast estimate SE df t.ratio p.value
#> (Post-Tonic) - Tonic -0.2746 0.0842 727 -3.263 0.0033
#> (Post-Tonic) - Unstressed 0.0873 0.0825 726 1.058 0.5406
#> Tonic - Unstressed 0.3619 0.0817 726 4.428 <.0001
#>
#> Results are averaged over the levels of: place, preheight
#> Degrees-of-freedom method: kenward-roger
#> P value adjustment: tukey method for comparing a family of 3 estimates
The standardize package provides several functions which aid in placing regression variables on similar scales, namely scale_by, named_contr_sum, and scaled_contr_poly. The standardize function offers a convenient way to make use of these functions (along with the scale function in base R) automatically, resulting in a standardized object which contains a standardized formula and data frame that can be passed to regression fitting functions. The most common use of the standardize package is to call the standardize function, passing the same formula, data, and family arguments as would normally be passed to a regression fitting function, leaving the scale argument at its default, and then passing the formula and data elements of the standardized object on to the regression fitting function with all other options for the regression fitting function specified as they would normally be specified.
If you use the standardize package in a publication, please cite:
Eager, Christopher D. (2017). standardize: Tools for Standardizing Variables for Regression in R. R package version 0.2.1. https://CRAN.R-project.org/package=standardize
If you analyze the ptk dataset in a publication, please cite:
Eager, Christopher D. (2017). Contrast preservation and constraints on individual phonetic variation. Doctoral thesis. University of Illinois at Urbana-Champaign.