Binomial Distribution of Allele Data

Summary-level population allele data can be modeled as a binomial mixture distribution over homogeneous reference populations. Here we show an example using the gnomAD v2.1 African/African-American data set, with homogeneous African and European reference panels from 1000 Genomes. We adopt the following model:

$$ P \left( n_i \vert N_i, \Theta \right) = \mbox{Binom} \left( n_i \bigg\vert N_i, \sum_{k=1}^{K} \pi_k \theta_{ik} \right) $$

$$ \ell( \Theta ) = \ln \mathcal{L} (\Theta \vert X) = \sum_{i=1}^S \ln \left[ \mbox{Binom} \left( n_i \bigg\vert N_i, \sum_{k=1}^{K} \pi_k \theta_{ik} \right) \right] $$

where

$S$ is the number of SNPs, indexed by $i$

$K$ is the number of reference ancestries, indexed by $k$

$\pi_k$ is the proportion of ancestry $k$, with $\pi_k \ge 0$ and $\sum_{k=1}^{K} \pi_k = 1$

$n_i$ is the Allele Count for SNP $i$ in the observed data

$N_i$ is the Allele Number for SNP $i$ in the observed data

$\theta_{ik}$ is the Allele Frequency of SNP $i$ in reference ancestry $k$
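As an illustration, the log-likelihood defined above can be evaluated directly. The following is a minimal sketch using only the Python standard library, not the code used for the analysis. The function names (`binom_logpmf`, `log_likelihood`) and the toy inputs are hypothetical, and it assumes every mixed frequency lies strictly between 0 and 1.

```python
import math

def binom_logpmf(n, N, p):
    """Log of Binom(n | N, p); lgamma gives the log binomial coefficient."""
    log_choose = math.lgamma(N + 1) - math.lgamma(n + 1) - math.lgamma(N - n + 1)
    return log_choose + n * math.log(p) + (N - n) * math.log1p(-p)

def log_likelihood(pi, counts, numbers, ref_freqs):
    """Sum over SNPs of log Binom(n_i | N_i, sum_k pi_k * theta_ik)."""
    total = 0.0
    for n_i, N_i, theta_i in zip(counts, numbers, ref_freqs):
        # Mixed allele frequency for this SNP under proportions pi
        p_i = sum(pi_k * theta_ik for pi_k, theta_ik in zip(pi, theta_i))
        total += binom_logpmf(n_i, N_i, p_i)
    return total

# Toy, made-up inputs (NOT gnomAD or 1000 Genomes values)
counts = [30, 55, 12]            # n_i
numbers = [100, 100, 100]        # N_i
ref_freqs = [(0.35, 0.10), (0.60, 0.40), (0.10, 0.20)]  # theta_ik, K = 2

print(log_likelihood((0.8, 0.2), counts, numbers, ref_freqs))
```

Proportions that explain the observed counts well give a larger (less negative) value than proportions that do not.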

This leads to the above image, which shows the log-likelihood maximized at the following ancestry proportions:

AFR: 0.8277273 EUR: 0.1722727
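For two ancestries the maximization is one-dimensional, so a plain grid search is enough to locate the peak. The sketch below is a hypothetical illustration on synthetic data, not the gnomAD analysis: the reference frequencies, the "true" proportion of 0.8, and the name `log_lik_kernel` are all assumptions. The binomial coefficient is dropped because it does not depend on the ancestry proportions.

```python
import math

def log_lik_kernel(pi_afr, counts, numbers, ref_freqs):
    """Log-likelihood up to an additive constant, for K = 2 ancestries."""
    total = 0.0
    for n_i, N_i, (theta_afr, theta_eur) in zip(counts, numbers, ref_freqs):
        p_i = pi_afr * theta_afr + (1.0 - pi_afr) * theta_eur
        total += n_i * math.log(p_i) + (N_i - n_i) * math.log1p(-p_i)
    return total

# Synthetic reference frequencies (hypothetical, strictly inside (0, 1))
ref_freqs = [(0.90, 0.10), (0.15, 0.85), (0.70, 0.20), (0.05, 0.60)]
numbers = [10000] * len(ref_freqs)

# Noise-free synthetic counts generated from an assumed proportion of 0.8
true_pi_afr = 0.8
counts = [
    round(N_i * (true_pi_afr * a + (1.0 - true_pi_afr) * b))
    for N_i, (a, b) in zip(numbers, ref_freqs)
]

# Evaluate the kernel on a grid of candidate proportions and keep the best
grid = [step / 1000 for step in range(1001)]
best_pi_afr = max(grid, key=lambda g: log_lik_kernel(g, counts, numbers, ref_freqs))

print("AFR:", best_pi_afr, "EUR:", 1.0 - best_pi_afr)
```

The grid resolution bounds the precision of the estimate, so a finer grid or a local optimizer would be needed to report more decimal places.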

These values are consistent with known admixture within the gnomAD sample and agree with other estimation methods (Summix, ADMIXTURE). There are several ways to maximize the log-likelihood, including grid search and Expectation-Maximization algorithms. The log-likelihood can also be negated and minimized with gradient-based constrained optimizers, such as Sequential Quadratic Programming.
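One way the Expectation-Maximization option could look is sketched below. It treats the ancestry of each sampled allele as a latent variable, which yields the same binomial model for the counts. This is an assumed formulation on synthetic data, not the implementation behind the reported estimates, and the name `em_ancestry` and all inputs are hypothetical.

```python
def em_ancestry(counts, numbers, ref_freqs, n_anc, tol=1e-10, max_iter=1000):
    """EM estimate of ancestry proportions pi_k from allele counts."""
    pi = [1.0 / n_anc] * n_anc
    total_alleles = sum(numbers)
    for _ in range(max_iter):
        expected = [0.0] * n_anc  # expected number of alleles per ancestry
        for n_i, N_i, theta_i in zip(counts, numbers, ref_freqs):
            # E-step: posterior ancestry weights for alt and ref alleles
            alt_w = [p * t for p, t in zip(pi, theta_i)]
            ref_w = [p * (1.0 - t) for p, t in zip(pi, theta_i)]
            alt_sum, ref_sum = sum(alt_w), sum(ref_w)
            for k in range(n_anc):
                expected[k] += n_i * alt_w[k] / alt_sum
                expected[k] += (N_i - n_i) * ref_w[k] / ref_sum
        # M-step: proportions are the expected share of alleles per ancestry
        new_pi = [e / total_alleles for e in expected]
        converged = max(abs(a - b) for a, b in zip(new_pi, pi)) < tol
        pi = new_pi
        if converged:
            break
    return pi

# Synthetic, noise-free data generated from assumed proportions (0.8, 0.2)
ref_freqs = [(0.90, 0.10), (0.15, 0.85), (0.70, 0.20), (0.05, 0.60)]
numbers = [10000] * len(ref_freqs)
counts = [round(N_i * (0.8 * a + 0.2 * b)) for N_i, (a, b) in zip(numbers, ref_freqs)]

pi_hat = em_ancestry(counts, numbers, ref_freqs, n_anc=2)
print("AFR:", pi_hat[0], "EUR:", pi_hat[1])
```

The update keeps the proportions non-negative and summing to one without an explicit constraint. For a gradient-based alternative, SciPy's `scipy.optimize.minimize` with `method="SLSQP"` accepts the sum-to-one equality constraint and bounds directly.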

Ian Arriaga-MacKenzie
Statistics and Computational Mathematics

My interests include optimization, algorithm design, and efficient experiment implementation.