An Introduction to t_TOST

A new function for TOST with t-tests

Aaron R. Caldwell

2024-05-08

In an effort to make TOSTER more informative and easier to use, I created the functions t_TOST and simple_htest. These functions operate very similarly to base R’s t.test function, with a few exceptions. First, t_TOST performs 3 t-tests (one two-tailed and two one-tailed tests). Second, simple_htest allows you to run equivalence testing or minimal effects testing with a t-test or a Wilcoxon-Mann-Whitney test via the alternative argument, and the output is the same as that of t.test or wilcox.test (the object is of the class htest). In addition, these functions have a generic method where two vectors can be supplied or a formula can be given (e.g., y ~ group). These functions make it easier to switch between types of t-tests: all three types (two sample, one sample, and paired samples) can be performed from the same function. Moreover, the summary information and visualizations have been upgraded, which should make the decisions derived from the function more informative and user-friendly.

These functions are not limited to equivalence tests. Minimal effects testing (MET) is possible. MET is useful for situations where the hypothesis is about a minimal effect and the null hypothesis is equivalence.

In the general introduction to this package, we detailed how to look at old results and how to apply TOST to interpreting those results. However, in many cases, users may have new data that needs to be analyzed. Therefore, t_TOST and simple_htest can be applied to new data. This vignette will use the iris and the sleep data.

data('sleep')
data('iris')

Independent Groups

For this example, we will use the sleep data. In this data there is a group variable and an outcome extra.

head(sleep)
#>   extra group ID
#> 1   0.7     1  1
#> 2  -1.6     1  2
#> 3  -0.2     1  3
#> 4  -1.2     1  4
#> 5  -0.1     1  5
#> 6   3.4     1  6

We will assume the data are independent, and that we have equivalence bounds of +/- 0.5 raw units. All we need to do is provide the formula, data, and eqb arguments for the function to run appropriately. In addition, we can set the var.equal argument (to assume equal variance), and the paired argument (sets if the data is paired or not). Both are logical indicators that can be set to TRUE or FALSE. The alpha is automatically set to 0.05 but this can also be adjusted by the user. The Hedges correction is also automatically calculated, but this can be overridden with the bias_correction argument. The hypothesis is automatically set to “EQU” for equivalence but if a minimal effect is of interest then “MET” can be supplied. Note: for this example, we will set smd_ci to “goulet” since it will reduce the time to produce plots.

res1 = t_TOST(formula = extra ~ group,
              data = sleep,
              eqb = .5,
              smd_ci = "goulet")

res1a = t_TOST(x = subset(sleep,group==1)$extra,
               y = subset(sleep,group==2)$extra,
               eqb = .5)

We can also use the “simpler” approach with simple_htest.

# Simple htest

res1b = simple_htest(formula = extra ~ group,
                     data = sleep,
                     mu = .5, # set equivalence bound
                     alternative = "e")

Once the function has run, we can print the results with the print command. This provides a verbose summary of the results.


# t_TOST
print(res1)
#> 
#> Welch Two Sample t-test
#> 
#> The equivalence test was non-significant, t(17.78) = -1.3, p = 0.89
#> The null hypothesis test was non-significant, t(17.78) = -1.86p = 0.08
#> NHST: don't reject null significance hypothesis that the effect is equal to zero 
#> TOST: don't reject null equivalence hypothesis
#> 
#> TOST Results 
#>                 t    df p.value
#> t-test     -1.861 17.78   0.079
#> TOST Lower -1.272 17.78   0.890
#> TOST Upper -2.450 17.78   0.012
#> 
#> Effect Sizes 
#>                Estimate     SE               C.I. Conf. Level
#> Raw             -1.5800 0.8491 [-3.0534, -0.1066]         0.9
#> Hedges's g(av)  -0.7965 0.4976 [-1.6843, -0.0615]         0.9
#> Note: SMD confidence intervals are an approximation. See vignette("SMD_calcs").

# htest

print(res1b)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  extra by group
#> t = -1.2719, df = 17.776, p-value = 0.8901
#> alternative hypothesis: equivalence
#> null values:
#> difference in means difference in means 
#>                -0.5                 0.5 
#> 90 percent confidence interval:
#>  -3.0533815 -0.1066185
#> sample estimates:
#> mean of x mean of y 
#>      0.75      2.33
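The three tests reported above can also be reproduced with base R alone, which may help clarify what t_TOST is doing. The following is a minimal sketch, not the package's internal code: it runs one two-sided Welch test and two one-sided Welch tests against the bounds of -0.5 and 0.5, and takes the larger one-sided p-value as the equivalence p-value. The object names (nhst, tost_lower, tost_upper, p_tost) are chosen here for illustration.

```r
data("sleep")

# Two-tailed test of "difference = 0" (the NHST row)
nhst <- t.test(extra ~ group, data = sleep)

# One-sided test against the lower bound: H1 is "difference > -0.5"
tost_lower <- t.test(extra ~ group, data = sleep,
                     mu = -0.5, alternative = "greater")

# One-sided test against the upper bound: H1 is "difference < 0.5"
tost_upper <- t.test(extra ~ group, data = sleep,
                     mu = 0.5, alternative = "less")

# Equivalence is only claimed if BOTH one-sided tests reject,
# so the TOST p-value is the larger of the two
p_tost <- max(tost_lower$p.value, tost_upper$p.value)

round(c(nhst = nhst$p.value,
        lower = tost_lower$p.value,
        upper = tost_upper$p.value,
        tost = p_tost), 3)
```

The rounded values should agree with the "TOST Results" table printed by t_TOST and with the p-value reported by simple_htest.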

Plots

Another nice feature is the generic plot method that can provide a visual summary of the results (only available for t_TOST). All of the plots in this package were inspired by the concurve R package. There are two types of plots that can be produced. The first, and default, is the consonance density plot (type = "cd").

plot(res1, type = "cd")

The shading pattern can be modified with the ci_shades argument.

# Set to shade only the 90% and 95% CI areas
plot(res1, type = "cd",
     ci_shades = c(.9,.95))

Consonance plots, where all confidence intervals can be plotted simultaneously, can also be produced. The advantage here is that multiple confidence interval lines can be plotted at once.

plot(res1, type = "c",
     ci_lines =  c(.9,.95))
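To get a feel for the numbers behind these lines, the raw-scale confidence intervals at several levels can be computed directly with base R. This is only a rough sketch of the idea (nested intervals for the mean difference at each requested level); it does not reproduce the package's plotting code, and the names conf_levels and cis are made up for this example.

```r
data("sleep")

conf_levels <- c(0.90, 0.95)

# One Welch confidence interval for the mean difference per level
cis <- t(sapply(conf_levels, function(lvl) {
  t.test(extra ~ group, data = sleep, conf.level = lvl)$conf.int
}))
dimnames(cis) <- list(paste0(conf_levels * 100, "%"),
                      c("lower", "upper"))

cis
```

Each row is one interval; higher confidence levels give wider intervals that contain the narrower ones.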

Descriptions

A description of the results can also be produced with the describe method (for t_TOST results) or the describe_htest function (for htest results).

describe(res1)

describe_htest(res1b)

Using the Welch Two Sample t-test, a null hypothesis significance test (NHST), and a equivalence test, via two one-sided tests (TOST), were performed with an alpha-level of 0.05. These tested the null hypotheses that true mean difference is equal to 0 (NHST), and true mean difference is more extreme than -0.5 and 0.5 (TOST). Both the equivalence test (p = 0.89), and the NHST (p = 0.079) were not significant (mean difference = -1.58 90% C.I.[-3.05, -0.107]; Hedges’s g(av) = -0.796 90% C.I.[-1.68, -0.0615]). Therefore, the results are inconclusive: neither null hypothesis can be rejected.

The Welch Two Sample t-test is not statistically significant (t(17.776) = -1.27, p = 0.89, mean of x = 0.75, mean of y = 2.33, 90% C.I.[-3.05, -0.107]) at a 0.05 alpha-level. The null hypothesis cannot be rejected. At the desired error rate, it cannot be stated that the true difference in means is between -0.5 and 0.5.

Paired Samples

To perform a paired samples TOST, the process does not change much. We could run the test the same way by providing a formula. All we would need to do then is change paired to TRUE.

res2 = t_TOST(formula = extra ~ group,
              data = sleep,
              paired = TRUE,
              eqb = .5)
res2
#> 
#> Paired t-test
#> 
#> The equivalence test was non-significant, t(9) = -2.8, p = 0.99
#> The null hypothesis test was significant, t(9) = -4.06p < 0.01
#> NHST: reject null significance hypothesis that the effect is equal to zero 
#> TOST: don't reject null equivalence hypothesis
#> 
#> TOST Results 
#>                 t df p.value
#> t-test     -4.062  9   0.003
#> TOST Lower -2.777  9   0.989
#> TOST Upper -5.348  9 < 0.001
#> 
#> Effect Sizes 
#>               Estimate     SE               C.I. Conf. Level
#> Raw             -1.580 0.3890   [-2.293, -0.867]         0.9
#> Hedges's g(z)   -1.174 0.4412 [-1.8046, -0.4977]         0.9
#> Note: SMD confidence intervals are an approximation. See vignette("SMD_calcs").

res2b = simple_htest(
  formula = extra ~ group,
  data = sleep,
  paired = TRUE,
  mu = .5,
  alternative = "e")
res2b
#> 
#>  Paired t-test
#> 
#> data:  extra by group
#> t = -2.7766, df = 9, p-value = 0.9892
#> alternative hypothesis: equivalence
#> null values:
#> mean difference mean difference 
#>            -0.5             0.5 
#> 90 percent confidence interval:
#>  -2.2930053 -0.8669947
#> sample estimates:
#> mean difference 
#>           -1.58

However, we may have two vectors of data that are paired. So we may want to just provide those separately rather than using a data set and setting the formula. This can be demonstrated with the “iris” data.

res3 = t_TOST(x = iris$Sepal.Length,
              y = iris$Sepal.Width,
              paired = TRUE,
              eqb = 1)
res3
#> 
#> Paired t-test
#> 
#> The equivalence test was non-significant, t(149) = 22.32, p = 1
#> The null hypothesis test was significant, t(149) = 34.815p < 0.01
#> NHST: reject null significance hypothesis that the effect is equal to zero 
#> TOST: don't reject null equivalence hypothesis
#> 
#> TOST Results 
#>                t  df p.value
#> t-test     34.82 149 < 0.001
#> TOST Lower 47.31 149 < 0.001
#> TOST Upper 22.32 149       1
#> 
#> Effect Sizes 
#>               Estimate      SE             C.I. Conf. Level
#> Raw              2.786 0.08002 [2.6536, 2.9184]         0.9
#> Hedges's g(z)    2.828 0.18393 [2.5252, 3.1244]         0.9
#> Note: SMD confidence intervals are an approximation. See vignette("SMD_calcs").

res3a = simple_htest(
  x = iris$Sepal.Length,
  y = iris$Sepal.Width,
  paired = TRUE,
  mu = 1,
  alternative = "e"
)
res3a
#> 
#>  Paired t-test
#> 
#> data:  x and y
#> t = 22.319, df = 149, p-value = 1
#> alternative hypothesis: equivalence
#> null values:
#> mean difference mean difference 
#>              -1               1 
#> 90 percent confidence interval:
#>  2.653551 2.918449
#> sample estimates:
#> mean difference 
#>           2.786
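A paired test is a one-sample test on the within-pair differences, and the equivalence decision at an alpha of 0.05 matches checking whether the 90% confidence interval falls entirely inside the bounds. The sketch below illustrates this with base R only; the variable names (d, ci90, inside_bounds) are illustrative and not part of TOSTER.

```r
data("iris")

# Within-pair differences
d <- iris$Sepal.Length - iris$Sepal.Width

# 90% CI corresponds to two one-sided tests at alpha = 0.05
ci90 <- t.test(d, conf.level = 0.90)$conf.int

eqb <- 1
# Equivalence would require the whole interval to sit within [-1, 1]
inside_bounds <- ci90[1] > -eqb && ci90[2] < eqb

ci90
inside_bounds
```

Here the interval lies entirely above the upper bound, which is consistent with the non-significant equivalence test shown above.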

We may want to perform a Minimal Effect Test with the hypothesis argument set to “MET”.

res_met = t_TOST(x = iris$Sepal.Length,
                 y = iris$Sepal.Width,
                 paired = TRUE,
                 hypothesis = "MET",
                 eqb = 1,
                 smd_ci = "goulet")
res_met
#> 
#> Paired t-test
#> 
#> The minimal effect test was significant, t(149) = 47.31, p < 0.01
#> The null hypothesis test was significant, t(149) = 34.815p < 0.01
#> NHST: reject null significance hypothesis that the effect is equal to zero 
#> TOST: reject null MET hypothesis
#> 
#> TOST Results 
#>                t  df p.value
#> t-test     34.82 149 < 0.001
#> TOST Lower 47.31 149       1
#> TOST Upper 22.32 149 < 0.001
#> 
#> Effect Sizes 
#>               Estimate      SE             C.I. Conf. Level
#> Raw              2.786 0.08002 [2.6536, 2.9184]         0.9
#> Hedges's g(z)    2.835 0.25311 [2.5719, 3.1284]         0.9
#> Note: SMD confidence intervals are an approximation. See vignette("SMD_calcs").

res_metb = simple_htest(x = iris$Sepal.Length,
                       y = iris$Sepal.Width,
                       paired = TRUE,
                       mu = 1,
                       alternative = "minimal.effect")
res_metb
#> 
#>  Paired t-test
#> 
#> data:  x and y
#> t = 22.319, df = 149, p-value < 2.2e-16
#> alternative hypothesis: minimal.effect
#> null values:
#> mean difference mean difference 
#>              -1               1 
#> 90 percent confidence interval:
#>  2.653551 2.918449
#> sample estimates:
#> mean difference 
#>           2.786
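For a minimal effect test the logic of the two one-sided tests is reversed: the effect is declared larger than the bounds if either one-sided test rejects, so the overall p-value is the smaller of the two. A minimal base R sketch of that idea, assuming the same paired iris data and bounds of -1 and 1 (object names are illustrative only):

```r
data("iris")

d <- iris$Sepal.Length - iris$Sepal.Width

# H1: mean difference is below the lower bound
met_below <- t.test(d, mu = -1, alternative = "less")

# H1: mean difference is above the upper bound
met_above <- t.test(d, mu = 1, alternative = "greater")

# Only one of the two needs to reject for a minimal effect
p_met <- min(met_below$p.value, met_above$p.value)

c(t_below = unname(met_below$statistic),
  t_above = unname(met_above$statistic),
  p_met = p_met)
```

The two t statistics should match the "TOST Lower" and "TOST Upper" rows of the t_TOST output, and the statistic for the upper bound is the one reported by simple_htest.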

Descriptions

A description of the results can also be produced with the describe method (for t_TOST results) or the describe_htest function (for htest results).

describe(res_met)

describe_htest(res_metb)

Using the Paired t-test, a null hypothesis significance test (NHST), and a minimal effect test, via two one-sided tests (TOST), were performed with an alpha-level of 0.05. These tested the null hypotheses that true mean difference is equal to 0 (NHST), and true mean difference is greater than -1 or less than 1 (TOST). The minimal effect test was significant, t(149) = 22.319, p < 0.001 (mean difference = 2.786 90% C.I.[2.654, 2.918]; Hedges’s g(z) = 2.835 90% C.I.[2.572, 3.128]). At the desired error rate, it can be stated that the true mean difference is less than -1 or greater than 1.

The Paired t-test is statistically significant (t(149) = 22.319, p < 0.001, mean difference = 2.786, 90% C.I.[2.654, 2.918]) at a 0.05 alpha-level. The null hypothesis can be rejected. At the desired error rate, it can be stated that the true mean difference is less than -1 or greater than 1.

One Sample t-test

In other cases we may just have a one sample test. If that is the case, all we have to do is supply the x argument for the data. For this test we may hypothesize that the mean of Sepal.Length is greater than 5.5 and less than 8.5.

res4 = t_TOST(x = iris$Sepal.Length,
              hypothesis = "EQU",
              eqb = c(5.5,8.5),
              smd_ci = "goulet")
res4
#> 
#> One Sample t-test
#> 
#> The equivalence test was significant, t(149) = 5.08, p < 0.01
#> The null hypothesis test was significant, t(149) = 86.425p < 0.01
#> NHST: reject null significance hypothesis that the effect is equal to zero 
#> TOST: reject null equivalence hypothesis
#> 
#> TOST Results 
#>                  t  df p.value
#> t-test      86.425 149 < 0.001
#> TOST Lower   5.078 149 < 0.001
#> TOST Upper -39.293 149 < 0.001
#> 
#> Effect Sizes 
#>            Estimate      SE             C.I. Conf. Level
#> Raw           5.843 0.06761 [5.7314, 5.9552]         0.9
#> Hedges's g    7.021 0.42002 [6.4067, 7.7882]         0.9
#> Note: SMD confidence intervals are an approximation. See vignette("SMD_calcs").

Only have the summary statistics? No problem!

In some cases you may only have access to the summary statistics. Therefore, we created a function, tsum_TOST, to perform the same tests just based on the summary statistics. This involves providing the function with a number of different arguments.

The results from above can be replicated with tsum_TOST.

res_tsum = tsum_TOST(
  m1 = mean(iris$Sepal.Length, na.rm=TRUE),
  sd1 = sd(iris$Sepal.Length, na.rm=TRUE),
  n1 = length(na.omit(iris$Sepal.Length)),
  hypothesis = "EQU",
  eqb = c(5.5,8.5)
)

res_tsum
#> 
#> One-sample t-test
#> 
#> The equivalence test was significant, t(149) = 5.078, p = 5.62e-07
#> The null hypothesis test was significant, t(149) = 86.425, p = 3.33e-129
#> NHST: reject null significance hypothesis that the effect is equal to zero 
#> TOST: reject null equivalence hypothesis
#> 
#> TOST Results 
#>                  t  df p.value
#> t-test      86.425 149 < 0.001
#> TOST Lower   5.078 149 < 0.001
#> TOST Upper -39.293 149 < 0.001
#> 
#> Effect Sizes 
#>            Estimate      SE             C.I. Conf. Level
#> Raw           5.843 0.06761 [5.7314, 5.9552]         0.9
#> Hedges's g    7.021 0.41350  [6.327, 7.6914]         0.9
#> Note: SMD confidence intervals are an approximation. See vignette("SMD_calcs").
plot(res_tsum)

describe(res_tsum)
#> [1] "Using the One-sample t-test, a null hypothesis significance test (NHST), and a equivalence test, via two one-sided tests (TOST), were performed with an alpha-level of 0.05. These tested the null hypotheses that true mean is equal to 0 (NHST), and true mean is more extreme than 5.5 and 8.5 (TOST). The equivalence test was significant, t(149) = 5.078, p < 0.001 (mean = 5.843 90% C.I.[5.731, 5.955]; Hedges's g = 7.021 90% C.I.[6.327, 7.691]). At the desired error rate, it can be stated that the true mean is between 5.5 and 8.5."
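The test statistics in this one-sample case need nothing more than the mean, standard deviation, and sample size, which is why summary statistics suffice. The following hand calculation is a sketch under that assumption, using only base R; it is not the code tsum_TOST runs internally, and the variable names are chosen for this example.

```r
data("iris")

m <- mean(iris$Sepal.Length)
s <- sd(iris$Sepal.Length)
n <- length(iris$Sepal.Length)

se <- s / sqrt(n)   # standard error of the mean
df <- n - 1

# Lower bound: H1 is "mean > 5.5"
t_low <- (m - 5.5) / se
p_low <- pt(t_low, df = df, lower.tail = FALSE)

# Upper bound: H1 is "mean < 8.5"
t_high <- (m - 8.5) / se
p_high <- pt(t_high, df = df, lower.tail = TRUE)

c(t_low = t_low, t_high = t_high,
  p_tost = max(p_low, p_high))
```

The two t values should match the "TOST Lower" and "TOST Upper" rows reported by tsum_TOST.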

Power Analysis for t-test based TOST

We also created power_t_TOST to allow for power calculations for TOST analyses that utilize t-tests. This function uses a more accurate method than the older functions in TOSTER and matches the results of the commercially available PASS software. The exact calculations of power are based on Owen’s Q-function or on direct integration of the bivariate non-central t-distribution (see footnote 1). Approximate power is implemented via the non-central t-distribution or the ‘shifted’ central t-distribution (Diletti, Hauschke, and Steinijans 1992). The function is limited to power analyses involving one sample, two sample, and paired sample cases. More options are available in the PowerTOST R package.

The interface for this function is quite simple and was intended to mimic the base R function power.t.test. The user must specify the equivalence bounds and leave exactly one of the other options blank (alpha, power, or n). The “true difference” can be set with delta, and the standard deviation (default is 1) can be set with the sd argument. Once everything is set and the function is run, an object of the power.htest class will be returned.

As an example, let’s say we are looking at an equivalence study where we assume the true difference is at least 1 unit, the standard deviation is 2.5, and we set the equivalence bounds to 2.5 units as well. If we want to find the sample size adequate to have 95% power at an alpha of 0.025 we enter the following:

power_t_TOST(n = NULL,
  delta = 1,
  sd = 2.5,
  eqb = 2.5,
  alpha = .025,
  power = .95,
  type = "two.sample")
#> 
#>      Two-sample TOST power calculation 
#> 
#>           power = 0.95
#>            beta = 0.05
#>           alpha = 0.025
#>               n = 73.16747
#>           delta = 1
#>              sd = 2.5
#>          bounds = -2.5, 2.5
#> 
#> NOTE: n is number in *each* group

From the analysis above we would conclude that adequate power is achieved with 74 participants per group and 148 participants in total.
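As a rough sanity check on this sample size, power can also be estimated by simulation. The sketch below is not the exact method used by power_t_TOST; it assumes normally distributed data with equal variances, uses a pooled-variance t-test, and declares equivalence when the 1 - 2 * alpha confidence interval lies inside the bounds. The seed, the number of replications, and all variable names are arbitrary choices made for this example.

```r
set.seed(2024)

n       <- 74     # per group, rounded up from 73.17
delta   <- 1      # assumed true difference
sd_true <- 2.5
eqb     <- 2.5
alpha   <- 0.025

# Each replicate: simulate two groups, then check whether the
# (1 - 2 * alpha) CI for the difference is within (-eqb, eqb)
equivalent <- replicate(2000, {
  x <- rnorm(n, mean = delta, sd = sd_true)
  y <- rnorm(n, mean = 0, sd = sd_true)
  ci <- t.test(x, y, var.equal = TRUE,
               conf.level = 1 - 2 * alpha)$conf.int
  ci[1] > -eqb && ci[2] < eqb
})

power_hat <- mean(equivalent)
power_hat
```

With a few thousand replications the estimate should land close to the targeted 95% power, up to Monte Carlo error.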

References

Diletti, E, D Hauschke, and VW Steinijans. 1992. “Sample Size Determination for Bioequivalence Assessment by Means of Confidence Intervals.” International Journal of Clinical Pharmacology, Therapy, and Toxicology 30 Suppl 1: S51–8.
Labes, Detlew, Helmut Schütz, and Benjamin Lang. 2021. PowerTOST: Power and Sample Size for (Bio)equivalence Studies. https://CRAN.R-project.org/package=PowerTOST.
Phillips, Kem F. 1990. “Power of the Two One-Sided Tests Procedure in Bioequivalence.” Journal of Pharmacokinetics and Biopharmaceutics 18 (2): 137–44. https://doi.org/10.1007/bf01063556.

  1. Inspired by Labes, Schütz, and Lang (2021) in the PowerTOST R package. Please see this package for more options.