This tutorial shows how to use the MADPop package to test for genetic differences between two populations, of which the individuals of a same species contain a variable number of alleles.
## Loading required package: MADPop
## Loading required package: rstan
## Loading required package: StanHeaders
##
## rstan version 2.32.5 (Stan version 2.32.2)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
## change `threads_per_chain` option:
## rstan_options(threads_per_chain = 1)
Our data consists of \(N = 215\) recordings of Major Histocompatibility Complex (MHC) genotypes of lake trout from \(K = 11\) lakes in Ontario, Canada. For each of the fish, between 1-4 alleles in the MHC genotype are recorded. This is partially because duplicate genes are undetectable by current instrumentation, and possibly because the fish possess a variable number of alleles of a given MHC gene.
Our dataset fish215
is included with
MADPop. A random sample from it looks like this:
## Lake A1 A2 A3 A4
## 68 Michipicoten r.3 r.4
## 167 Kingscote r.1 r.8
## 129 Dickey v.3 r.4
## 162 Kingscote w.5
## 43 Crystal r.4 t.7
## 14 Hogan x.4 r.4
The first column is the lake name (or population ID) for each sample,
the remaining four columns are for potentially recorded allele codes
(A1
-A4
). Here the code to identify a unique
allele is a small letter followed by a number, but it could have been
the sequence of integers \(1, 2,\ldots,
A\), which for the fish215
data is \(A = 57\) unique alleles.
It is relatively straightforward to import a CSV file into the format above. An example of this is given along with our raw data in the extdata directory of the local copy of the MADPop package.
Suppose that we wish to compare two lakes, say Dickey and Simcoe. The
allele counts in these lakes are in the table below. It is a subset of
the full contingency table on all \(K =
11\) lakes, which is produced by the MADPop
function UM.suff()
:
popId <- c("Dickey", "Simcoe") # lakes to compare
Xsuff <- UM.suff(fish215) # summary statistics for dataset
ctab <- Xsuff$tab[popId,] # contingency table
ctab <- ctab[,colSums(ctab) > 0] # remove alleles with no counts
#ctab
rbind(ctab, Total = colSums(ctab))
## 1.20 1.4.5 1.4.9.45 1.5 10 20 20.54 27.28.29 3 4 4.10 4.11 4.20.27
## Dickey 1 1 0 2 0 2 1 1 0 0 1 0 1
## Simcoe 0 0 1 1 2 0 0 0 1 1 2 1 0
## Total 1 1 1 3 2 2 1 1 1 1 3 1 1
## 4.20.31 4.20.45 4.27 4.33 4.5 4.53.55 4.56 4.7 4.7.10 5 5.20 5.30 5.32
## Dickey 1 1 1 0 1 1 1 0 0 0 1 1 1
## Simcoe 0 0 0 1 1 0 0 3 1 1 0 0 0
## Total 1 1 1 1 2 1 1 3 1 1 1 1 1
## 5.7 7 9.11
## Dickey 1 0 0
## Simcoe 1 2 1
## Total 2 2 1
The unique allele identifiers are encoded as integers between \(1\) and \(A\) and separated by dots. The original
allele names are stored in Xsuff$A
, such that the genotype
of the first column 1.20
is
## [1] 1 20
## A1 A20
## "r.1" "u.4"
There are \(C = 29\) genotype combinations observed in these two lakes, corresponding to each column of the table.
In the two-population problem we have \(K = 2\) lakes with \(N_1\) and \(N_2\) fish sampled from each. Let \(\boldsymbol Y_k = (Y_{k1}, \ldots Y_{kC})\) denote the counts for each genotype observed in lake \(k\), such that \(\sum_{i=1}^C Y_{ki} = N_k\). The sampling model for these data is \[ \boldsymbol Y_k \stackrel{\textrm{ind}}{\sim}\textrm{Multinomial}(N_k, \boldsymbol \rho_k), \quad k = 1,2, \] where \(\boldsymbol \rho_k = (\rho_{k1},...,\rho_{kC})\) are the population proportions of each genotype, and \(\sum_{i=1}^C \rho_{ki} = 1\).
Our objective is to test \[ \begin{split} H_0 &: \textrm{The two populations have the same genotype proportions} \\ & \phantom{: } \iff \boldsymbol \rho_1 = \boldsymbol \rho_2. \end{split} \] The classical test statistics for assessing \(H_0\) are Pearson’s Chi-Square statistic \(\mathcal X\) and the Likelihood Ratio statistic \(\Lambda\), \[ \mathcal X = \sum_{k=1}^2 \sum_{i=1}^C \frac{(N_k\hat\rho_i - Y_{ki})^2}{N_k\hat\rho_i}, \qquad \Lambda = 2 \sum_{k=1}^2 \sum_{i=1}^C Y_{ki} \log\left(\frac{Y_{ki}}{N_k\hat\rho_i}\right), \qquad \hat \rho_i = \frac{Y_{1i} + Y_{2i}}{N_1 + N_2}. \] Under \(H_0\), the asymptotic distribution of either of these test statistics \(T = \mathcal X\) or \(\Lambda\) is \(\chi^2_{(C-1)}\), such that the \(p\)-value \[ \textrm{p}_\textrm{v}= \textrm{Pr}(T > T_\textrm{obs} \mid H_0) \] for an observed value of \(T_\textrm{obs}\) can be estimated as follows:
# observed values of the test statistics
chi2.obs <- chi2.stat(ctab) # Pearson's chi^2
LRT.obs <- LRT.stat(ctab) # LR test statistic
T.obs <- c(chi2 = chi2.obs, LRT = LRT.obs)
# p-value with asymptotic calculation
C <- ncol(ctab)
pv.asy <- pchisq(q = T.obs, df = C-1, lower.tail = FALSE)
signif(pv.asy, 2)
## chi2 LRT
## 0.330 0.041
The Chi-Square and LR tests are asymptotically equivalent and so should give roughly the same \(p\)-values. The huge discrepancy observed here indicates that the sample sizes are too small for asymptotics to kick in. A more reliable \(p\)-value estimate can be obtained by the Bootstrap method, which in this case consists of simulating \(M\) contigency tables with \(\boldsymbol Y_k^{(m)} \stackrel{\textrm{ind}}{\sim}\textrm{Multinomial}(N_k, \hat{\boldsymbol \rho})\), \(m = 1, \ldots, M\), where \(\hat{\boldsymbol \rho}\) is the estimate of the common probability vector \(\boldsymbol \rho_1 = \boldsymbol \rho_2\) under \(H_0\). For each contingency table \((\boldsymbol Y_1^{(m)}, \boldsymbol Y_2^{(m)})\), we calculate the test statistic \(T^{(m)}\), and the bootstrapped p-value is defined as \[ \textrm{p}_\textrm{v}^{\textrm{boot}}= \frac 1 M \sum_{m=1}^M \delta(T^{(m)} \ge T_\textrm{obs}). \] \(\textrm{p}_\textrm{v}^{\textrm{boot}}\) is calculated with MADPop as follows:
N1 <- sum(ctab[1,]) # size of first sample
N2 <- sum(ctab[2,]) # size of second sample
rho.hat <- colSums(ctab)/(N1+N2) # common probability vector
# bootstrap distribution of the test statistics
# set verbose = TRUE for progress output
system.time({
T.boot <- UM.eqtest(N1 = N1, N2 = N2, p0 = rho.hat, nreps = 1e4,
verbose = FALSE)
})
## user system elapsed
## 0.224 0.014 0.240
## chi2 LRT
## 0.0068 0.0059
Note that the bootstrap \(p\)-values for both tests are roughly the same and decisively reject \(H_0\), whereas the less reliable asymptotic \(p\)-values both failed to reject (at quite different significance levels).
Bootstrapping overcomes many deficiencies of the asymptotic \(p\)-value calculation. However,
bootstrapping has a tendency to reject \(H_0\) when sample sizes are small. To see
why this is, consider all columns of ctab
which have only
one genotype count between the two lakes:
itab1 <- colSums(ctab) == 1 # single count genotypes
cbind(ctab[,itab1],
Other = rowSums(ctab[,!itab1]),
Total = rowSums(ctab))
## 1.20 1.4.5 1.4.9.45 20.54 27.28.29 3 4 4.11 4.20.27 4.20.31 4.20.45 4.27
## Dickey 1 1 0 1 1 0 0 0 1 1 1 1
## Simcoe 0 0 1 0 0 1 1 1 0 0 0 0
## 4.33 4.53.55 4.56 4.7.10 5 5.20 5.30 5.32 9.11 Other Total
## Dickey 0 1 1 0 0 1 1 1 0 7 20
## Simcoe 1 0 0 1 1 0 0 0 1 12 20
There are \(c_1 = 21\) such columns,
accounting for \(\hat p_1 = 0.525\) of
the common genotype distribution under \(H_0\), as estimated from the two-lake
sample. For each of these columns, observing counts in one lake but not
the other provides evidence against \(H_0\). Moreover, under the estimated common
distribution \(\hat {\boldsymbol
\rho}\), it is very unlikely to have counts in only one of the
lakes for each of these \(c_1 = 21\)
genotypes. Therefore, the data appear to provide very strong evidence
against \(H_0\). However, it is not so
unlikely to have \(c_1 = 21\) one-count
genotypes if the true number of unique genotypes in these two lakes is
much larger than the observed value of \(C =
29\). With \(C = 29\) unique
genotypes in only \(N = N_1 + N_2 =
40\) fish samples, it is quite plausible that a new sample of
fish would yield several genotypes which are not present in the original
two-lake sample ctab
.
One way to obtain information about the unobserved genotypes in lakes Dickey and Simcoe is to consider the genotypes in all \(K = 11\) Ontario lakes for which we have collected data. A natural way to do this is through a hierarchical model: \[ \begin{split} \boldsymbol Y_k \mid \boldsymbol \rho_k & \stackrel{\textrm{ind}}{\sim}\textrm{Multinomial}(N_k, \boldsymbol \rho_k), \quad k = 1,\ldots,K \\ \boldsymbol \rho_k & \stackrel{\textrm{iid}}{\sim}\textrm{Dirichlet}(\boldsymbol \alpha), \qquad \boldsymbol \alpha= (\alpha_1, \ldots, \alpha_C). \end{split} \] That is, each population is allowed to have its own probability vector \(\boldsymbol \rho_k\). However, these probability vectors are drawn from a common Dirichlet distribution. The common distribution specifies that before any data is drawn, we have \[ E[\boldsymbol \rho] = \bar{\boldsymbol \alpha}, \qquad \textrm{var}(\rho_i) = \frac{(\bar \alpha_i)(1-\bar \alpha_i)}{\alpha_0 + 1}, \] where \(\alpha_0 = \sum_{i=1}^C \alpha_i\) and \(\bar{\boldsymbol \alpha} = \boldsymbol \alpha/ \alpha_0\). Moreover, the posterior distribution of \(\boldsymbol \rho_k\) given \(\boldsymbol \alpha\) and the data is also Dirichlet: \[ \boldsymbol \rho_k \mid \boldsymbol Y\stackrel{\textrm{ind}}{\sim}\textrm{Dirichlet}(\boldsymbol \alpha+ \boldsymbol Y_k). \] In this sense, the posterior estimate of the proportion of genotype \(i\) in population \(k\), \(\rho_{ki}\), can be non-zero even if \(Y*{ki} = 0\). In practice, \(\boldsymbol \alpha\) is estimated from \(\boldsymbol Y\), the data from all \(K\) populations (details in the following section). The extent to which the genotypes observed in one lake affect inference in the other lakes is determined by \(\alpha*0\) (the larger this value, the larger the effect).
The likelihood function for this model corresponds to a Dirichlet-Multinomial distribution, \[ \begin{split} \mathcal L(\boldsymbol \alpha\mid \boldsymbol Y) = \prod_{k=1}^K p(\boldsymbol Y_k \mid \boldsymbol \alpha) & = \prod_{k=1}^K \int p(\boldsymbol Y_k \mid \boldsymbol \rho_k) p(\boldsymbol \rho_k \mid \boldsymbol \alpha) \,\mathrm{d}\boldsymbol \rho_k \\ & = \prod_{k=1}^K \left[\frac{N_k!\cdot\Gamma(\alpha_0)}{\Gamma(N_k+\alpha_0)} \prod_{i=1}^C \frac{\Gamma(Y_{ki} + \alpha_i)}{Y_{ki}!\cdot\Gamma(\alpha_i)}\right], \end{split} \] from which \(\boldsymbol \alpha\) can be estimated by maximum likelihood (Minka 2000). Alternatively we estimate \(\boldsymbol \alpha\) using a Bayesian approach, which is more readily extensible to more complex genotype models, for which \(\mathcal L(\boldsymbol \alpha\mid \boldsymbol Y)\) has no closed form (see Future Work).
First, we specify a prior distribution \[ \pi(\alpha_0, \bar{\boldsymbol \alpha}) = \frac{1}{(1+ \alpha_0)^2}, \] which is a uniform distribution on the prior probability vector \(\bar{\boldsymbol \alpha}\), and a uniform distribution on the variance-like parameter \(A = (1+\alpha_0)^{-1}\), in the sense that the prior mean and variance of a given genotype population probability are \[ E[\rho_{kj}] = \tilde \alpha_j, \qquad \textrm{var}(\rho_{kj}) = A \cdot \tilde \alpha_j(1-\tilde \alpha_j). \] Then, we sample from the posterior distribution \(p(\boldsymbol \alpha\mid \boldsymbol Y) \propto \mathcal L(\boldsymbol \alpha\mid \boldsymbol Y) \pi(\boldsymbol \alpha)\) using a Markov chain Monte Carlo (MCMC) algorithm provided by the Stan programming language (Stan Development Team 2016), as detailed below.
Under the hierarchical model, the null hypothesis of equal genetic proportions between two populations, say \(k = 1,2\), is \(H_0: \boldsymbol \rho_1 = \boldsymbol \rho_2 = \boldsymbol \rho_{12}\). Our Bayesian hypothesis-testing strategy is implemented in two steps:
In order to estimate \(\boldsymbol
\alpha\) under \(H_0\), we apply
the Dirichlet-Multinomial distribution to \(K-1\) lakes, with the first two being
collapsed into a common count vector \(\boldsymbol Y_{12} = \boldsymbol Y_1 + \boldsymbol
Y_2\) with sample size \(N_{12} = N_1 +
N_2\). MCMC sampling is implemented via a Hybrid Monte Carlo
(HMC) variant provided by the R package
rstan. In MADPop, sampling from \(p(\boldsymbol \alpha\mid \boldsymbol Y,
H_0)\) is accomplished with the function hUM.post()
,
where “hUM” stands for hierarchical Unconstrained Multinomial
model. We start by merging the two samples from equal distributions
under \(H_0\):
## [1] "Dickey" "Simcoe"
eqId0 <- paste0(popId, collapse = ".") # merged lake name
popId0 <- as.character(fish215$Lake) # all lake names
popId0[popId0 %in% popId] <- eqId0
table(popId0, dnn = NULL) # merged lake counts
## Crystal Dickey.Simcoe Hogan Kingscote Macdonald
## 20 40 21 18 19
## Manitou Michipicoten Opeongo Seneca Slate
## 20 20 20 18 19
Next, we sample from \(p(\boldsymbol \alpha\mid \boldsymbol Y, H_0)\) using rstan:
X0 <- cbind(Id = popId0, fish215[-1]) # allele data with merged lakes
nsamples <- 1e4
fit0 <- hUM.post(nsamples = nsamples, X = X0,
rhoId = eqId0, # output rho only for merged lakes
chains = 1, # next two arguments are passed to rstan
warmup = min(1e4, floor(nsamples/10)),
full.stan.out = FALSE)
##
## SAMPLING FOR MODEL 'DirichletMultinomial' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 5.9e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.59 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup)
## Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup)
## Chain 1: Iteration: 1001 / 10000 [ 10%] (Sampling)
## Chain 1: Iteration: 2000 / 10000 [ 20%] (Sampling)
## Chain 1: Iteration: 3000 / 10000 [ 30%] (Sampling)
## Chain 1: Iteration: 4000 / 10000 [ 40%] (Sampling)
## Chain 1: Iteration: 5000 / 10000 [ 50%] (Sampling)
## Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling)
## Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling)
## Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling)
## Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling)
## Chain 1: Iteration: 10000 / 10000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.83 seconds (Warm-up)
## Chain 1: 7.557 seconds (Sampling)
## Chain 1: 8.387 seconds (Total)
## Chain 1:
rstan typically outputs a number of messages and
warnings, many of which are harmless. The MADPop
package lists rstan as a dependency, such that its
sophisticated tuning mechanism is exposed. Optionally,
hUM.post()
returns the entire rstan output
(full.stan.out = TRUE
), which can be run through
rstan’s MCMC convergence diagnostics.
By setting the rhoId
argument of
hUM.post()
, the fit0
object contains samples
from \(p(\boldsymbol \alpha, \boldsymbol
\rho_{12} \mid \boldsymbol Y, H_0)\). Boxplots of the 50 highest
posterior probability genotypes in the common distribution \(\boldsymbol \rho_{12}\) are displayed
below:
rho.post <- fit0$rho[,1,] # p(rho12 | Y)
# sort genotype counts by decreasing order
rho.count <- colSums(Xsuff$tab[popId,])
rho.ord <- order(colMeans(rho.post), decreasing = TRUE)
# plot
nbxp <- 50 # number of genotypes for which to make boxplots
clrs <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2",
"#D55E00", "#CC79A7")
rho.ct <- rho.count[rho.ord[1:nbxp]]
rho.lgd <- unique(sort(rho.ct))
rho.box <- rho.post[,rho.ord[1:nbxp]]
par(mar = c(5,4.5,.5,.5)+.1, cex.axis = .7)
boxplot(x = rho.box,
las = 2, col = clrs[rho.ct+1], pch = 16, cex = .2)
title(xlab = "Genotype", line = 4)
title(ylab = expression(p(bold(rho)[(12)]*" | "*bold(Y))))
legend("topright", legend = rev(rho.lgd), fill = rev(clrs[rho.lgd+1]),
title = "Counts", ncol = 2, cex = .8)
This is calculated analogously to the bootstrapped p-value, by generating \(M\) contingency tables with \(\boldsymbol Y_k^{(m)} \stackrel{\textrm{ind}}{\sim}\textrm{Multinomial}(N_k, \boldsymbol \rho_{12}^{(m)}\), \(m = 1, \ldots, M\). However, for each contingency table, the common probability vector \(\boldsymbol \rho_{12}^{(m)}\) is different, corresponding to a random draw from \(p(\boldsymbol \rho_{12} \mid \boldsymbol Y, H_0)\). The test statistic \(T^{(m)}\) is then calculated and we have \[ \textrm{p}_\textrm{v}^{\textrm{post}}= \frac 1 M \sum_{m=1}^M \delta(T^{(m)} \ge T_{\textrm{obs}}). \] \(\textrm{p}_\textrm{v}^{\textrm{post}}\) is calculated with MADPop as follows:
## user system elapsed
## 0.387 0.032 0.422
We can now compare the three types of p-values (asymptotic, bootstrapped, posterior) for each test statistic (\(\mathcal X\) and \(\Lambda\)):
Asymptotic | Bootstrap | Posterior | |
---|---|---|---|
\(\mathcal X\) | 33.0 | 0.68 | 13 |
\(\Lambda\) | 4.1 | 0.59 | 12 |
We see that \(\textrm{p}_\textrm{v}^{\textrm{post}}\) is much more conservative than \(\textrm{p}_\textrm{v}^{\textrm{boot}}\).
Here we used a completely unconstrained Multinomial model for the genotype counts, in the sense that the only restriction on \(\boldsymbol \rho_k\) is that \(\rho_{ki} \ge 0\) and \(\sum_{i=1}^C \rho_{ki} = 1\). However, it is possible to impose genetic constraints such as Hardy-Weinberg equilibrium, preferential mating, etc., which effectively reduce the degrees of freedom of \(\boldsymbol \rho_k\) below \(C-1\). In this case, the closed-form likelihood \(\mathcal L(\boldsymbol \alpha\mid \boldsymbol Y)\) is typically unavailable, but the Stan code can easily be modified to sample from \(p(\boldsymbol \alpha, \boldsymbol{\mathcal R}\mid \boldsymbol Y)\), where \(\boldsymbol{\mathcal R}= (\boldsymbol \rho_1, \ldots, \boldsymbol \rho_K)\). We hope to feature some of these extensions to the basic Dirichlet-Multinomial model in the next version of MADPop.