Colossus Description

library(Colossus)
library(data.table)
library(parallel)

Colossus Description

Model Structure

At its full potential, Colossus is able to analyze a vast number of possible risk/rate models. Specifically Colossus is designed to allow for a combination of linear and non-linear models to estimate cox proportional hazard ratios, Poisson model rates, and Fine-Grey Competing risks ratios. The simplest of which is an exponential model. The simplest model generally used for hazard ratios in survival analysis is the exponential of a linear function of covariates.

\[ \begin{aligned} R(\vec{\beta},\vec{x})=\exp(\vec{\beta} \cdot \vec{x}) \end{aligned} \]

In Colossus, this general model is extended by abstracting to the sum and product of terms and subterms. In this vignette, the value calculated to quantify the risk of an event will be denoted as the risk, R, however the exact meaning differs. In Cox proportional hazard modeling, the risk calculated is a hazard ratio. In Poisson modeling the risk calculated is an estimated number of events per person-year. The risk (\(R\)) has a set formula dependent on terms (\(T\)). Each term has a formula dependent on the product of subterms (\(S\)). Each subterm is a function of a single covariate (\(x\)) and several parameters (\(\alpha,\beta\)).

There are currently five types of risk models available. The risk can be expressed as an additive model (\(R_A\)), product additive model (\(R_{PA}\)), Product additive excess model (\(R_{PAE}\)), multiplicative excess model (\(R_M\)), or the geometric mixture model with relative risks or excess risks (\(R_{GMIX}\)).

\[ \begin{aligned} R_{A}= \sum_{i=0}^n T_i\\ R_{PA}= T_0 \times \sum_{i=1}^n T_i\\ R_{PAE}= T_0 \times (1 + \sum_{i=1}^n T_i)\\ R_{M}= T_0 \times \prod_{i=1}^n( 1 + T_i)\\ R_{GMIX} = T_0 \times \left(\prod_{i=1}^n(T^{*}_i) \right) ^ \theta \times \left( 1 + \sum_{i=1}^n (T^{*}_i - 1) \right)^{1-\theta}\\ T^{*}_i = \begin{cases} T_i &\text{Relative Risk} \\ T_i+1 &\text{Excess Risk} \end{cases}\\ \end{aligned} \]

Each term is composed of a combination of 4 types of subterms. Every covariate is part of a log-linear subterm, a linear subterm, a product-linear subterm, or a general non-linear term. The log-linear subterm (\(S_{LL}\)) is the exponential of a linear combination of covariates. The linear subterm (\(S_L\)) is a linear combination of covariates. The product-linear subterm (\(S_{PL}\)) is one plus a linear combination of covariates. The general non-linear term (\(S_{NL}\)) is the sum of exponential terms, quadratic terms, linear-threshold terms (\(F_{LT}\)), step function terms (\(F_{STP}\)), linear-quadratic terms (\(F_{LQ}\)), and linear-exponential terms (\(F_{LEXP}\)). Each term is the product of the non-empty subterms.

\[ \begin{aligned} S_{LL}=\prod_{i} (\exp{(x_i \cdot \beta_i)})\\ S_{L}=\sum_i (x_i \cdot \beta_i)\\ S_{PL}=1+ \sum (x_i \cdot \beta_i)\\ S_{NL}=\sum_i (\alpha_i \times \exp(x_i \cdot \beta_i)) + \sum_i (\beta_i \cdot (x_i)^2) + \sum_i F_{LT} + \sum_i F_{STP} + \sum_i F_{LQ} + \sum_i F_{LEXP}\\ F_{LT} = \begin{cases} \alpha_i \cdot (x-\beta_i) & (x>\beta_i) \\ 0 &\text{else} \end{cases}\\ F_{STP} = \begin{cases} \alpha_i & (x>\beta_i) \\ 0 &\text{else} \end{cases}\\ F_{LQ} = \begin{cases} \beta_i \cdot x & (x>\alpha_i) \\ \lambda_i \cdot x^2 + \nu_i &\text{else} \end{cases}\\ F_{LEXP} = \begin{cases} \beta_i \cdot x & (x>\alpha_i) \\ \lambda_i - \exp{(\nu_i + \mu \cdot x)} &\text{else} \end{cases}\\ T_j=S_{LL,j} \times S_{L,j} \times S_{PL,j} \times S_{NL,j} \end{aligned} \]

In short, every element of the risk model has a subterm type, term number, covariate, and parameter value.

Using The Standard Model

In Colossus every equation is defined in a very similar fashion. On the user side of the code, elements of the risk equation can be viewed as rows of a table. The columns store covariate names, term numbers, subterms types, and a starting point. For more complex regression most parameters can be also be set to remain constant over the regression. Doing so forces the parameter value to remain constant but not remove the element from the risk calculations. Constant elements are used for risk calculation, but are not used for calculating steps or standard deviations. For a multiplicative excess risk model the following table and equation are equivalent.

Term number subterm type covariate
0 LogLinear a
1 Linear b
1 Linear c

\[ \begin{aligned} R = \exp{(\beta_{a} \cdot x_{a})} \times (1+\beta_{b} \cdot x_{b} + \beta_{c} \cdot x_{c}) \end{aligned} \]

If the user wanted to update the model to include a new term with a product-linear subterm, then the table would only need to be updated with a new row.

Term number subterm type covariate
0 LogLinear a
1 Linear b
1 Linear c
2 Product-Linear d

\[ \begin{aligned} R = \exp{(\beta_{a} \cdot x_{a})} \times (1 + (\beta_{b} \cdot x_{b} + \beta_{c} \cdot x_{c})) \times (2 + \beta_{d} \cdot x_{d}) \end{aligned} \]

Note that the multiplicative model is taking the product of terms, each of which is the product of subterms. The same equation may be written with subterms distributed over different terms in the multiplicative model without changing the final value. The only exception is subterms moved from or into the “first” term in the multiplicative excess model. The same equation could be expressed in ways with different computational complexity.

The specific model options are passed to the code as separate variables. The model identifier (A, M, PA, PAE, GMIX) is passed as a string. Every model except the additive model has a distinction between the first and remaining terms. Colossus defaults to term 0 being the unique term, however it can be set to any used term number without producing an error.

The final table is also equivalent to the following code

names <- c("a", "b", "c", "d")
term_n <- c(0, 1, 1, 2)
tform <- c("loglin", "lin", "lin", "plin")
modelform <- "M"
fir <- 0

a_n <- c(0.1, 0.1, 0.1, 0.1)

names denotes the column names used, term_n denotes the term numbers, tform denotes the subterm formula, modelform denotes the term formula, fir denotes the “first” term indexed from zero which defaults to 0, and a_n denotes the initial guesses for the parameter values. All of which are assumed to be in the same order. A function is called prior to regression that reorders the inputs in order of terms, subterm types, etc.

Survival Time and Event Data

Colossus performs survival analysis via either a Cox Proportional Hazards or a Poisson model regression. In both cases the user specifies which columns contain time duration and events of interest. For the Poisson model regression these would be which column contains the person-years and number of events for each row of data. For the Cox Proportional Hazards regression the user identifies which columns provide starting and ending times, and what column gives the event status. The event status is assumed to be a binary covariate which is 1 for intervals containing an event. Colossus supports left censored data, right censored data, and interval censored data. The data is interval censored by default, so left and right censoring are handled by defining interval endpoints outside the minimum or maximum event times. For Poisson model regression the user only provides columns for the person-years per row and number of events. Poisson model regression supports any non-negative number of events.

On the user side, the names of columns containing the time and events need to be given. Lets assume we have a dataframe organized as follows:

UserID Starting_Age Ending_Age Cancer_Status
112 18 30 0
114 20 45 0
213 18 57 1

For a Cox Proportional Hazard we need to provide three column names: the starting age, the ending age, and if an event happened during the interval. In this case that would be “Starting_Age”, “Ending_Age”, and “Cancer_Status”. In this example we have interval data, however that may not always be the case. Colossus is designed to allow for the user to input “%trunc%” in-place of a missing side of the interval. Colossus assumes that this means that the person has the missing endpoint outside of the available data range and creates a dummy column to reference as the interval endpoint. So in code these variables would look like this:

df <- data.table(
  "UserID" = c(112, 114, 213, 214, 115, 116, 117),
  "Starting_Age" = c(18, 20, 18, 19, 21, 20, 18),
  "Ending_Age" = c(30, 45, 57, 47, 36, 60, 55),
  "Cancer_Status" = c(0, 0, 1, 0, 1, 0, 0),
  "a" = c(0, 1, 1, 0, 1, 0, 1),
  "b" = c(1, 1.1, 2.1, 2, 0.1, 1, 0.2),
  "c" = c(10, 11, 10, 11, 12, 9, 11),
  "d" = c(0, 0, 0, 1, 1, 1, 1)
)
# For the interval case
time1 <- "Starting_Age"
time2 <- "Ending_Age"
event <- "Cancer_Status"

# Supposing we had left truncated data the following would change
time1 <- "Starting_Age"
time2 <- "%trunc%"

# and with right truncated data the following is used
time1 <- "%trunc%"
time2 <- "Ending_Age"

# setting back to normal
time1 <- "Starting_Age"
time2 <- "Ending_Age"

Suppose instead we were interested in a Poisson model regression, the only difference is instead of providing interval endpoints we give a column of interval lengths. In this case we need to define a new column, but that may not be the case in general. Suppose we are now working with the following table:

UserID Starting_Age Ending_Age Person_Years Cancer_Status
112 18 30 12 0
114 20 45 25 0
213 18 57 39 1

we would just provide the duration column and event column, or “Person_Years” and “Cancer_Status”. Which in code looks like the following:

df$Person_Years <- df$Ending_Age - df$Starting_Age
pyr <- "Person_Years"
event <- "Cancer_Status"

Control and Verbosity

Colossus offers several options that control how the input data is used and returned. This can be split into two general categories. The first category is convergence parameters. These cover the number of iterations used, limits on parameter changes, and stopping criteria. The second category is parameters used for debugging and additional information. A more common option is the verbosity. The verbose mode prints time-steps and intermediate values for the sum of terms, sum of risks, and log-likelihood each iteration. A less common option is the der_iden parameter, which is used change only one parameter at a time by a set amount. This was used in the debugging process for recording intermediate values on a plane. This option could be used by the user in conjunction with the verbose option to plot intermediate values by parameter value.

The convergence parameters currently supported in general are as follows:

Parameter Description
lr learning rate applied to steps taken, smaller values avoid overshooting at the risk of slower convergence. Applied to the solution for setting first derivative of Log-Likelihood to zero
maxiter maximum number of iterations run
maxiters list of maximum iterations if a list of initial values is given
halfmax Maximum number of half-steps taken each iteration
epsilon Minimum parameter change above convergence. If change is below this, the regression exits
dbeta_max learning rate applied to limit on steps taken. Applied to solution to zero Log-Likelihood, for double_step=0
deriv_epsilon Minimum Log-Likelihood first derivative above convergence. If the largest first derivative is below this, the regression exits
abs_max Largest starting step size. Code reduces it every time halfmax is hit without finding an improvement. When abs_max is below epsilon, the code exits
dose_abs_max Largest starting step size for threshold parameters. Assumes the threshold parameters may be orders of magnitude different than other parameters
ties Allows for efron or breslow tie methods
double_step Changes what step calculation method is used. double_step=0 changes each parameter by the ratio of first to second derivative independently. double_step=1 solves the system of equations for all first and second derivatives

The additional information parameters currently supported are as follows:

Parameter Description
verbose integer valued 0-4 controlling what information is printed to the terminal. Each level includes the lower levels. 0: silent, 1: errors printed, 2: warnings printed, 3: notes printed, 4: debug information printed. Errors are situations that stop the regression, warnings are situations that assume default values that the user might not have intended, notes provide information on regression progress, and debug prints out C++ progress and intermediate results. The default level is 2 and True/False is converted to 3/0.
change_all TRUE/FALSE flag to choose if the user wants a regression or to calculate log-likelihoods on a grid changing only one parameter at a time
ncores Number of physical cores assigned, cannot be greater than the number of cores available

The previous options are all contained within a list of parameters, however there are additional standalone parameters which control the convergence and information. They are as follows:

Parameter Description
der_iden If change_all is used, selects which parameter number indexed from 0 to change. The parameter is changed by abs_max or dose_abs_max every iteration over the maximum number of iterations
keep_constant a vector of 0/1 to denote if any parameters should be held constant over the calculations. 0 denoting false and 1 denoting true

Going back to our example, the code may look like the following:

keep_constant <- c(0, 0, 0, 0)
der_iden <- 0

control <- list(
  "ncores" = 1, "lr" = 0.75, "maxiter" = 100, "halfmax" = 5, "epsilon" = 1e-9,
  "dbeta_max" = 0.5, "deriv_epsilon" = 1e-9, "abs_max" = 1.0, "change_all" = TRUE,
  "dose_abs_max" = 100.0, "verbose" = FALSE, "ties" = "breslow", "double_step" = 1
)

In this example, no parameters are held constant, 100 iterations with 5 half-steps are used, a regression is performed with matrix step solutions, and breslow’s method for tied event times is used. There are no errors for unused control parameters being used, they will simply be ignored. Every commonly required control parameter has a default value which is automatically used if not declared. This includes an optional list of model control options. These are used for nonstandard model options not covered in this vignette.

Running the Regression and Output

Finally the user calls the regression function they need. The names used throughout this vignette are the defaults assumed. Both the Cox PH regression and Poisson model regression are called directly and return a list of results. Colossus contains a suite of additional checks it runs on the inputs prior to starting and calculations, which are designed to output explicit details on any issues. Printing error details may require the verbose option to be TRUE. In code the functions are called as follows:

# assuming the table of covariates is stored in a data.table "df"

e <- RunCoxRegression(
  df, time1, time2, event, names, term_n, tform, keep_constant,
  a_n, modelform, fir, der_iden, control
)
print(e)
#> $LogLik
#> [1] -0.6753644
#> 
#> $First_Der
#> [1]  0.000000e+00 -7.187040e-05  7.361232e-05  1.919948e-04
#> 
#> $Second_Der
#>              [,1]         [,2]          [,3]          [,4]
#> [1,] 0.000000e+00 0.000000e+00  0.000000e+00  4.965508e-19
#> [2,] 0.000000e+00 1.742209e-08  7.238366e-07  2.311365e-07
#> [3,] 0.000000e+00 7.238366e-07 -1.501037e-06 -2.356033e-07
#> [4,] 4.965508e-19 2.311365e-07 -2.356033e-07 -3.687577e-06
#> 
#> $beta_0
#> [1]  41.26157  98.72266  96.82311 101.10000
#> 
#> $Standard_Deviation
#> [1]      NaN      NaN 177.9643   0.0000
#> 
#> $AIC
#> [1] 9.350729
#> 
#> $BIC
#> [1] 8.517767
#> 
#> $Parameter_Lists
#> $Parameter_Lists$term_n
#> [1] 0 1 1 2
#> 
#> $Parameter_Lists$tforms
#> [1] "loglin" "lin"    "lin"    "plin"  
#> 
#> $Parameter_Lists$names
#> [1] "a" "b" "c" "d"
#> 
#> 
#> $Control_List
#> $Control_List$Iteration
#> [1] 100
#> 
#> $Control_List$`Maximum Step`
#> [1] 1
#> 
#> $Control_List$`Derivative Limiting`
#> [1] 0.0001919948
#> 
#> 
#> $Converged
#> [1] FALSE
#> 
#> $Status
#> [1] "PASSED"

# or a Poisson model regression
a_n <- c(0.1, 0.1, 0.1, 0.1) # a_n is updated when either regression is called
e <- RunPoissonRegression(
  df, pyr, event, names, term_n, tform, keep_constant, a_n,
  modelform, fir, der_iden, control
)
print(e)
#> $LogLik
#> [1] -44.73982
#> 
#> $First_Der
#> [1]   -19.67462  -290.53655 -2314.56447   -19.64770
#> 
#> $Second_Der
#>              [,1]      [,2]        [,3]          [,4]
#> [1,]   -21.674624 -153.3672  -1393.0153    -4.8644043
#> [2,]  -153.367172 -108.7036   -666.9745  -101.9254174
#> [3,] -1393.015328 -666.9745 -21096.6437 -1058.8112374
#> [4,]    -4.864404 -101.9254  -1058.8112    -0.5929361
#> 
#> $beta_0
#> [1] -0.40680015 -0.01846667 -0.07585910 -0.70133825
#> 
#> $Standard_Deviation
#> [1] 0.383529879 0.092209643 0.008136992 0.535536589
#> 
#> $AIC
#> [1] 93.47964
#> 
#> $BIC
#> [1] 97.26328
#> 
#> $Deviation
#> [1] 85.47964
#> 
#> $Parameter_Lists
#> $Parameter_Lists$term_n
#> [1] 0 1 1 2
#> 
#> $Parameter_Lists$tforms
#> [1] "loglin" "lin"    "lin"    "plin"  
#> 
#> $Parameter_Lists$names
#> [1] "a" "b" "c" "d"
#> 
#> 
#> $Control_List
#> $Control_List$Iteration
#> [1] 12
#> 
#> $Control_List$`Maximum Step`
#> [1] 2.910383e-11
#> 
#> $Control_List$`Derivative Limiting`
#> [1] 2314.564
#> 
#> 
#> $Converged
#> [1] FALSE
#> 
#> $Status
#> [1] "PASSED"

The following are output:

Item Description
LogLik Final Log-Likelihood
First_Der First derivative of Log-Likelihood at the final point
Second_Der Matrix of second derivatives of Log-Likelihood at the final point
beta_0 The final parameter list
Standard_Deviation Inverse-Hessian estimate of the standard deviation
AIC Akaike information criterion for the final solution
Deviation Deviance calculated using the difference in log-likelihood contributions between the saturated model and current model
Parameter_Lists the term numbers and term formulas used, to compare against the expected
Control_List a list containing the number of iterations, the maximum step at the final iteration, and the highest first derivative magnitude
Converged TRUE/FALSE if the regression converged

For both cases, the regressions did not converge. Given that the examples were arbitrary this is not unexpected. The Cox PH ran out of iterations and did not decrease the step limit, so if it were restarted at the new optimal guess it would be expected to continue converging. The Poisson model regression exited due to the step limit being below the threshold, which means that it hit some local optimum. To find a better solution the user would likely have to either change the model equation or provide a new starting guess. This is an example of how to interpret the results of a Colossus run.

There are additional variants of these functions which will be described outside of this section. They return a similar list of results.