In order to be computationally efficient and allow for multiallelic markers, rubias boils most of the data down to a collection of integer vectors in a data structure that we operate on with compiled code. This document describes that data structure (mostly for Eric's benefit at this point; we should have had a document like this a long time ago).
## param_list

The basic data structure is what we call a `param_list`. It has the following named elements, which are briefly described here. We will describe each in detail in separate sections below.
`L`: the number of loci (an integer).

`N`: the number of individuals (an integer).

`C`: the number of collections in the reference data set (an integer).

`A`: the number of alleles at each locus (an integer vector of length `L`).
`CA`: the "cumulative number of alleles" at each locus. For each locus this gives the base-0 index of the first allele at the locus (if you were to line up the alleles at all the loci one after another).
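
Just to make that definition concrete, here is a little sketch (not the package's actual code) of how `CA` relates to `A`:

```r
# A is the number of alleles at each of L loci (toy values here);
# CA[l] is then the base-0 index of the first allele of locus l when
# the alleles of all the loci are lined up one after another.
A  <- c(3L, 2L, 4L)
CA <- cumsum(c(0L, A))[seq_along(A)]   # -> 0 3 5
```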
`coll`: an integer vector of length `N` that gives the index of the collection that each fish is in.

`coll_N`: an integer vector of length `C` that gives the number of fish in each of the collections.
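
Again, only as an illustration of the relationship between the two (not how the package computes it):

```r
# coll gives the collection index of each fish; coll_N is just the
# tally of how many fish fall in each of the C collections.
coll   <- c(1L, 1L, 2L, 3L, 3L, 3L)   # toy example with 6 fish
C      <- 3L
coll_N <- tabulate(coll, nbins = C)   # -> 2 1 3
```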
`RU_vec`: This is the hardest one to figure out / remember. Imagine that each collection has an index from 1 up to `C`, and imagine that each collection belongs to a single reporting unit. Each reporting unit is assigned an integer. Now, sort everything first by reporting unit index and then by collection index. The order that you get is the order of the collections in `RU_vec`. This vector is a named integer vector, with the collections in the order described above. The names are the collection names and the values are the base-1 index of each collection.
`RU_starts`: The base-0 index of the starting position of each reporting unit in the `RU_vec` vector. This is a named integer vector. For example, the first few entries for the chinook data set are:
```r
ploidies <- check_refmix(chinook, 5)
cpar <- tcf2param_list(chinook, 5, summ = FALSE, ploidies = ploidies)
cpar$RU_starts[1:5]
#> CentralValleyfa CentralValleysp CentralValleywi CaliforniaCoast        KlamathR 
#>               0               8              12              13              15
```
and if we look at the first 15 elements of `RU_vec` it gives us the names and the indices of the collections in those first 4 listed reporting units:
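
```r
# Names and base-1 indices of the collections in the first 4 reporting
# units (output not reproduced here):
cpar$RU_vec[1:15]
```

Sketching the general indexing, the collections belonging to reporting unit `r` sit at base-1 positions `RU_starts[r] + 1` through `RU_starts[r + 1]` of `RU_vec`. That assumes an entry for `r + 1` is available, i.e. `r` is not the last reporting unit (or `RU_starts` carries a terminal entry equal to `C`):

```r
r <- 2
cpar$RU_vec[(cpar$RU_starts[r] + 1):cpar$RU_starts[r + 1]]
```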
`I`: an integer vector giving the allelic type of each gene copy carried by each individual. For ploidy = 2 (the only case implemented so far) this vector is of length `N * L * 2`. An entry of 0 denotes missing data, and the observed alleles are named 1, 2, …
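
A couple of sanity checks that follow directly from that description (using `cpar` from the chunk above):

```r
# One entry per gene copy (N * L * 2 for diploids), with 0 marking
# missing data.
stopifnot(length(cpar$I) == cpar$N * cpar$L * 2)
sum(cpar$I == 0)   # number of missing gene copies
```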
`AC`: this is a flat integer vector of the counts of alleles of different types in the different collections. It has length `C * sum(A)` (i.e., the number of collections in the reference times the total number of alleles at all the loci). This is created by a somewhat lengthy process: first the function `reference_allele_counts()` makes a long data frame with `collection`, `locus`, `allele`, and `counts`. That then gets turned into a list of matrices by `a_freq_list()`: one matrix for each locus, in which the rows are the different alleles and the columns are the different collections. Then, in `list_diploid_params()`, that list of matrices gets flattened into one long integer vector.
One of the weaknesses, as I see it now, is that the loci are arranged alphabetically rather than in input order. We should at least include the names of the loci in the order in which they appear so that we can get back to the loci, if necessary. The order of the loci coming out of this process is used to make sure that it corresponds to the order of the loci in `I`, which is good, but not super intuitive. At any rate, from the foregoing, it can be deduced that we can index into this vector as follows (all indexes are base-0): if we want the count of the a-th allele at the l-th locus in the c-th collection, then we get that by base-0 subscripting `AC` by `[C * CA[l] + c * A[l] + a]`, where `C` is the number of collections, `CA` is the cumulative number of alleles, and `A` is the number of alleles at each locus. Now it should be clear why we store `CA`: this is where we use it!
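
In R, which subscripts from 1, that base-0 recipe looks like the following sketch (`ac_entry()` is just an illustrative helper, not a function in the package):

```r
# Count of the a-th allele at the l-th locus in the c-th collection,
# pulled from the flat AC vector.  l, c, and a are base-0 here; the
# final + 1 converts the base-0 position to R's base-1 subscript.
ac_entry <- function(AC, C, CA, A, l, c, a) {
  AC[C * CA[l + 1] + c * A[l + 1] + a + 1]
}
```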
`sum_AC`: the sum of the allele counts at each locus for each collection in the reference data set (basically the number of observed gene copies at the locus in the reference data set). This gets computed in `list_diploid_params()` from the list of matrices returned by `a_freq_list()`. It is of length `L * C`. It is a named vector with names of the form `Locus.Collection`, but I don't think those names get used at all. It gets indexed as `[l * C + c]`.
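
As a sketch, with the same base-0 conventions as above (`sum_ac_entry()` is again just an illustrative helper):

```r
# Number of observed gene copies at the l-th locus in the c-th
# collection (l and c base-0, converted to R's base-1 subscript).
sum_ac_entry <- function(sum_AC, C, l, c) {
  sum_AC[l * C + c + 1]
}
```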
`DP`: this is a vector completely parallel to `AC`, but in which the prior weights have been added to each allele in each collection.

`sum_DP`: this is the sum of the Dirichlet parameters `DP` for each locus and each collection. It is parallel to `sum_AC`.
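
Because `DP` parallels `AC` and `sum_DP` parallels `sum_AC`, the two indexing recipes above combine naturally. For example, the posterior mean frequency of an allele in a collection (a Dirichlet parameter divided by the sum of the parameters for that locus and collection) could be sketched like this (illustrative helper only, not package code):

```r
# Posterior mean frequency of the a-th allele at locus l in collection c
# (l, c, and a base-0), using the same flat layouts as AC and sum_AC.
post_mean_freq <- function(DP, sum_DP, C, CA, A, l, c, a) {
  DP[C * CA[l + 1] + c * A[l + 1] + a + 1] / sum_DP[l * C + c + 1]
}
```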
Finally, we have some entries that we should have had from day one, but didn't, so they aren't consistently used throughout the code to access the names of entities ordered as they ended up ordered:

- `indiv_names`
- `collection_names`
- `repunit_names`
- `locus_names`
Where the `param_list` gets made is a trickier question than it seems, because things are done slightly differently in the different top-level functions.
In both of these functions, the original data set gets read in, `collection` and `repunit` get converted to factors, and then the `param_list` is made inside a single function: `tcf2param_list()`.
Same as above, this uses `tcf2param_list()` after doing a few other steps on the original data frame.
Uses `tcf2param_list()` unless it is using `preCompiledParams`, so that it can run through stuff during `infer_mixture()` to compute the locus-specific means and variances of the log-likelihoods.
This is the tough one. Because we end up doing multiple mixture collections, we couldn't simply use `tcf2param_list()` in the function. Rather, we create a summary for the reference sample (keeping track of alleles found in both the reference and the mixture), and then we split the mixture samples up by mixture collection and use
One problem with the current approach is that it is terribly slow when you start to get 10K+ SNPs. It would be much faster to read and store those data in an 012 matrix. Here is how I am thinking I could deal with that:
For the functions that use `tcf2param_list()` I could just write another function, `tcf2param_list_012()`, that took `D` as just a data frame with `sample_type`, `collection`, `repunit`, and `indiv`, and then had an 012 matrix with the genetic data in it, with indiv names in the rownames and locus names in the colnames. Then we just have to deal with setting up `AC_list` and `I_list` correctly.
Actually, looking at it now, I can just do it in the same `tcf2param_list()` function. If the `d012` parameter is not NULL we would:

- set `cleaned$long` to NULL, and set `cleaned$clean_short` to `D`;
- make `AC_list` directly from the 012 matrix. This should be super straightforward. I would probably want to drop monomorphic loci first.

Cool, in order to do all this I should make two new functions: `reference_allele_counts_012` and `allelic_list_012`. That might give me enough insight that I could easily do it for `infer_mixture`, too.