Binomial Distribution of Allele Data

Ian Arriaga-MacKenzie

Last updated on Jun 3, 2022

Summary-level population allele data can be modeled using a binomial mixture distribution with homogeneous reference populations. Here we show an example using the gnomAD v2.1 African/African-American data set and homogeneous African and European reference panels from 1000 Genomes. We adopt the following model:

$$P(n_{i} \mid N_{i}, Θ) = \mathrm{Binom}\left(n_{i} \mid N_{i}, \sum_{k=1}^{K} π_{k} θ_{k}\right)$$

$$ℓ(Θ) = \ln L(Θ \mid X) = \sum_{i=1}^{S} \ln\left[\mathrm{Binom}\left(n_{i} \mid N_{i}, \sum_{k=1}^{K} π_{k} θ_{k}\right)\right]$$

where

S is the number of SNPs

K is the number of reference ancestries

$π_{k}$ is the proportion of ancestry k, with the proportions summing to 1

$n_{i}$ is the observed Allele Count for SNP i

$N_{i}$ is the observed Allele Number for SNP i

$θ_{k}$ is the Allele Frequency for that SNP in reference ancestry k
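The log-likelihood defined above can be evaluated directly. The following is a minimal sketch in plain Python, not the author's implementation: the function names (`binom_logpmf`, `mixture_loglik`) and the toy numbers are made up for illustration, and it assumes every mixed frequency lies strictly between 0 and 1 so that the logarithms are finite.

```python
import math

def binom_logpmf(n, N, p):
    """Log of Binom(n | N, p); the binomial coefficient is computed via log-gamma."""
    log_choose = math.lgamma(N + 1) - math.lgamma(n + 1) - math.lgamma(N - n + 1)
    return log_choose + n * math.log(p) + (N - n) * math.log(1.0 - p)

def mixture_loglik(pi, counts, numbers, ref_freqs):
    """Sum over SNPs of ln Binom(n_i | N_i, sum_k pi_k * theta_k).

    pi        : ancestry proportions, length K
    counts    : observed allele counts n_i, length S
    numbers   : observed allele numbers N_i, length S
    ref_freqs : S rows, each holding the K reference allele frequencies for one SNP
    """
    total = 0.0
    for n, N, theta in zip(counts, numbers, ref_freqs):
        # Mixed allele frequency for this SNP (assumed to be strictly in (0, 1)).
        p = sum(w * t for w, t in zip(pi, theta))
        total += binom_logpmf(n, N, p)
    return total

# Toy data, made up for illustration: 3 SNPs and 2 ancestries.
# The counts match a mixture with pi = (0.8, 0.2) exactly.
ref_freqs = [(0.10, 0.60), (0.45, 0.05), (0.80, 0.30)]
numbers = [1000, 1000, 1000]
counts = [200, 370, 700]

ll_true = mixture_loglik((0.8, 0.2), counts, numbers, ref_freqs)
ll_off = mixture_loglik((0.5, 0.5), counts, numbers, ref_freqs)
print(ll_true > ll_off)  # the generating proportions score higher
```

Working on the log scale avoids underflow, which matters once the sum runs over many SNPs.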

Evaluating this log-likelihood over the ancestry proportions leads to the above image, which shows the log-likelihood maximized at the following values:

AFR: 0.8277273 EUR: 0.1722727

These values are consistent with known admixture within the gnomAD sample, and agree with estimates from other methods (Summix, ADMIXTURE). There are several ways to maximize the log-likelihood, including grid search and Expectation-Maximization algorithms. Equivalently, the log-likelihood can be negated and minimized using gradient-based constrained optimization methods, such as Sequential Quadratic Programming.
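Two of the approaches named above, grid search and Expectation-Maximization, can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions rather than the method used to produce the reported estimates: the function names and the toy data are hypothetical, the grid search assumes exactly two ancestries, and the EM update treats each allele copy as drawn from ancestry k with probability $π_{k}$.

```python
import math

# Toy data, made up for illustration (not gnomAD or 1000 Genomes values):
# per-SNP reference allele frequencies as (AFR, EUR), with observed counts
# chosen to match a mixture of 0.83 AFR and 0.17 EUR exactly.
ref_freqs = [(0.10, 0.60), (0.45, 0.05), (0.80, 0.30), (0.25, 0.70)]
numbers = [2000, 2000, 2000, 2000]   # allele numbers N_i
counts = [370, 764, 1430, 653]       # allele counts n_i

def loglik_kernel(pi, counts, numbers, ref_freqs):
    """Binomial log-likelihood up to an additive constant.

    The log binomial coefficient is dropped because it does not depend on pi.
    """
    total = 0.0
    for n, N, theta in zip(counts, numbers, ref_freqs):
        p = sum(w * t for w, t in zip(pi, theta))
        total += n * math.log(p) + (N - n) * math.log(1.0 - p)
    return total

def grid_search_two_ancestries(counts, numbers, ref_freqs, steps=1000):
    """Evaluate the log-likelihood on an even grid of pi_1 in [0, 1] (K = 2 only)."""
    grid = [i / steps for i in range(steps + 1)]
    best = max(grid, key=lambda w: loglik_kernel((w, 1.0 - w), counts, numbers, ref_freqs))
    return (best, 1.0 - best)

def em_ancestry_proportions(counts, numbers, ref_freqs, n_iter=500, tol=1e-10):
    """EM for the mixing proportions with the reference frequencies held fixed."""
    K = len(ref_freqs[0])
    pi = [1.0 / K] * K               # start from equal proportions
    total_alleles = sum(numbers)
    for _ in range(n_iter):
        expected = [0.0] * K
        for n, N, theta in zip(counts, numbers, ref_freqs):
            p_alt = sum(w * t for w, t in zip(pi, theta))
            p_ref = 1.0 - p_alt
            for k in range(K):
                # E-step: expected number of this SNP's alleles drawn from ancestry k,
                # split into counted alleles (n) and the remaining alleles (N - n).
                expected[k] += n * pi[k] * theta[k] / p_alt
                expected[k] += (N - n) * pi[k] * (1.0 - theta[k]) / p_ref
        # M-step: proportions are the expected share of all alleles.
        new_pi = [e / total_alleles for e in expected]
        change = max(abs(a - b) for a, b in zip(new_pi, pi))
        pi = new_pi
        if change < tol:
            break
    return pi

pi_grid = grid_search_two_ancestries(counts, numbers, ref_freqs)
pi_em = em_ancestry_proportions(counts, numbers, ref_freqs)
print(pi_grid, pi_em)  # both should land close to (0.83, 0.17) on this toy data
```

Grid search is simple and makes the likelihood surface easy to plot, but its cost grows quickly with the number of ancestries. The EM update keeps the proportions non-negative and summing to 1 at every step without an explicit constraint.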

Statistics and Computational Mathematics