Lab 05: Statistical Models of Genetic Inheritance
Mathematical Foundations: This lab explores the statistical models that form the mathematical core of Bonsai v3's relationship inference capabilities. Understanding these models is crucial for grasping how the system transforms raw genetic data into probabilistic relationship assessments.
Probability Distributions in Genetic Inheritance
The Poisson Model for Segment Counts
The number of IBD segments shared between relatives follows a Poisson distribution, a discrete probability distribution that models the number of events occurring in a fixed interval when these events happen independently at a constant average rate.
In Bonsai v3, the Poisson parameter λ (expected number of segments) for a relationship with meiotic distance d is calculated as:
λ = \frac{a(rd+c)}{2^{d-1}} \cdot p(t)
Where:
- a is the number of common ancestors (1 for half-relationships, 2 for full relationships)
- r is the recombination rate (approximately 34 recombination events per meiosis for the human genome)
- c is the number of chromosomes (22 autosomes for humans)
- d is the meiotic distance between individuals
- p(t) is the probability that a segment exceeds the detection threshold t, calculated as e-dt/100
The function get_eta()
in the moments.py
module implements this calculation:
def get_eta(g, r=34.0, c=22, a=1):
"""
Calculates the IBD segment count parameter (eta) for a relationship.
Args:
g: Meiotic distance (sum of up and down)
r: Recombination rate per meiosis (default 34 for human genome)
c: Number of chromosomes (default 22 for humans)
a: Number of common ancestors (1 for half, 2 for full relationships)
Returns:
eta: Expected number of segments before threshold filtering
"""
return a * (r * g + c) / (2 ** (g - 1))
This formula has a strong theoretical foundation in genetic inheritance models and has been validated against empirical data from known relationships. It captures how the expected number of IBD segments decreases with increasing relationship distance (though not linearly) and differs between full and half relationships.
The Exponential Model for Segment Lengths
The lengths of IBD segments follow an exponential distribution, a continuous probability distribution that describes the time between events in a Poisson point process. In genetic terms, this distribution emerges from the random nature of recombination events along chromosomes.
For a relationship with meiotic distance d, the probability density function for segment length x, given that it exceeds the detection threshold t, is:
f(x|x>t) = \frac{d}{100} \cdot e^{-d(x-t)/100}
This distribution has a mean of 100/d + t, indicating that the expected segment length decreases with increasing meiotic distance. The function get_lam_a_m()
in the likelihoods.py
module calculates the rate parameter for this distribution:
def get_lam_a_m(g, num_ancs, min_seg_len=7.0):
"""
Calculates the rate parameter for segment length distribution.
Args:
g: Meiotic distance (sum of up and down)
num_ancs: Number of common ancestors (1 or 2)
min_seg_len: Minimum detectable segment length
Returns:
lam: Rate parameter for exponential distribution
"""
# Base rate is meiotic distance / 100
lam = g / 100.0
# Adjust for relationship-specific factors
if num_ancs == 2 and g == 2: # Full siblings
lam *= 0.9 # Empirical adjustment for full siblings
return lam
This exponential model is particularly powerful because the rate parameter has a direct biological interpretation: it represents the recombination rate per unit of genetic distance, scaled by the meiotic distance. The model accurately predicts that closer relatives share longer segments on average, while more distant relatives share primarily shorter segments.
The Normal Model for Total IBD
While segment counts follow a Poisson distribution and segment lengths follow an exponential distribution, the total amount of IBD sharing (which is essentially the sum of many segment lengths) can be approximated by a normal distribution due to the Central Limit Theorem.
For a relationship with meiotic distance d, the expected total IBD sharing is:
E[Total IBD] = \lambda \cdot (100/d + t)
Where λ is the expected number of segments and t is the minimum detection threshold. The variance can be approximated as:
Var[Total IBD] \approx \lambda \cdot (100/d)^2
This normal approximation is implemented in the moments.py
module with functions like get_muT()
and get_sigT()
:
def get_muT(g, num_ancs=1, min_seg_len=7.0):
"""
Calculates the expected total IBD sharing.
Args:
g: Meiotic distance
num_ancs: Number of common ancestors
min_seg_len: Minimum segment length
Returns:
muT: Expected total IBD in cM
"""
# Expected number of segments
eta = get_eta(g, a=num_ancs)
# Probability of detection
p_obs = math.exp(-g * min_seg_len / 100.0)
# Expected total IBD
muT = eta * p_obs * (100.0 / g)
return muT
The normal model allows Bonsai v3 to calculate the likelihood of observed total IBD values, which is a key component of relationship inference. However, the actual implementation includes additional refinements to account for the truncation of the distribution at zero (since IBD sharing cannot be negative) and the discrete nature of IBD segments.
The Three-Parameter Model
Relationship Representation as Tuples
Bonsai v3 represents genetic relationships using a compact three-tuple notation: (up, down, num_ancs)
. This representation encodes all the necessary information to model the statistical properties of IBD sharing:
- up: The number of generations from the first individual to the most recent common ancestor (MRCA)
- down: The number of generations from the MRCA to the second individual
- num_ancs: The number of common ancestors (1 for half relationships, 2 for full relationships)
For example:
(0, 1, 1)
: Parent-child (parent is individual 1)(1, 1, 2)
: Full siblings (sharing both parents)(1, 1, 1)
: Half siblings (sharing one parent)(2, 2, 2)
: Full first cousins (sharing both grandparents)
This representation is not only efficient computationally but also biologically meaningful, as the meiotic distance up + down
directly influences the pattern of IBD sharing. The pedigrees.py
module contains functions like get_simple_rel_tuple()
that calculate these relationship tuples from pedigree structures:
def get_simple_rel_tuple(up_node_dict, id1, id2):
"""
Calculate the relationship tuple (up, down, num_ancs) between two individuals.
Args:
up_node_dict: Dictionary mapping individuals to their parents
id1, id2: IDs of the individuals
Returns:
Tuple of (up, down, num_ancs) representing the relationship
"""
if id1 == id2:
return (0, 0, 2) # Self relationship
# Find common ancestors
common_ancs = get_common_ancs(up_node_dict, [id1], [id2])
if not common_ancs:
return None # No relationship found
# Find minimum distances to common ancestors
min_up = float('inf')
min_down = float('inf')
for anc in common_ancs:
up_dist = get_dist_up(up_node_dict, id1, anc)
down_dist = get_dist_up(up_node_dict, id2, anc)
# Update minimum distances
if up_dist + down_dist < min_up + min_down:
min_up = up_dist
min_down = down_dist
# Determine number of common ancestors
num_ancs = 1
if min_up == 1 and min_down == 1:
# Check if they share two parents
parents1 = set(up_node_dict.get(id1, {}).keys())
parents2 = set(up_node_dict.get(id2, {}).keys())
if len(parents1.intersection(parents2)) == 2:
num_ancs = 2
return (min_up, min_down, num_ancs)
This function implements a sophisticated algorithm to identify the closest relationship between two individuals in a pedigree, accounting for complex cases where multiple relationship paths might exist.
The Three Key Parameters
Bonsai v3 characterizes each relationship type using three key statistical parameters that determine the distribution of IBD sharing:
- Lambda (λ): The expected number of IBD segments above the detection threshold
- Mean Segment Length: The expected average length of IBD segments, typically 100/d + t centimorgans
- IBD2 Proportion: The expected proportion of the genome shared as IBD2 (both chromosomes identical)
These parameters fully characterize the expected IBD sharing pattern for a given relationship, enabling quantitative comparison between theoretical expectations and observed data. The actual implementation includes a number of refinements to these basic parameters:
- Relationship-Specific Adjustments: Empirical scaling factors for specific relationships based on observed data
- Chromosome-Specific Models: Separate parameters for each chromosome to account for differences in length and recombination rates
- Background IBD Model: Parameters for population-specific background IBD sharing for unrelated individuals
The moments.py
module contains functions that calculate these parameters for any relationship tuple:
def get_ibd_moments(rel_tuple, min_seg_len=7.0):
"""
Calculate the IBD moments for a relationship tuple.
Args:
rel_tuple: Tuple of (up, down, num_ancs)
min_seg_len: Minimum segment length threshold
Returns:
Dictionary of IBD moments (mu, sigma, ibd2_prop)
"""
up, down, num_ancs = rel_tuple
g = up + down # Meiotic distance
# Handle special case for self
if g == 0 and num_ancs == 2:
return {
'mu1': 0, # Expected segment count
'sigma1': 0, # Std dev of segment count
'mu2': 3400, # Expected total IBD (full genome)
'sigma2': 0, # Std dev of total IBD
'ibd2_prop': 1.0 # Full IBD2
}
# Calculate segment count parameters
mu1 = get_mu1(g, num_ancs, min_seg_len)
sigma1 = get_sigma1(mu1)
# Calculate total IBD parameters
mu2 = get_mu2(g, num_ancs, min_seg_len)
sigma2 = get_sigma2(mu2, g)
# Calculate IBD2 proportion
ibd2_prop = 0.0
if g == 2 and num_ancs == 2: # Full siblings
ibd2_prop = 0.25 # 25% of genome is IBD2 for full siblings
return {
'mu1': mu1,
'sigma1': sigma1,
'mu2': mu2,
'sigma2': sigma2,
'ibd2_prop': ibd2_prop
}
These parameter calculations form the statistical foundation upon which all of Bonsai's relationship inference is built.
Parameter Variation Across Relationships
The three-parameter model produces distinctive patterns across different relationship types, creating a statistical "fingerprint" that can be used for relationship inference:
Relationship | Lambda (λ) | Mean Length (cM) | IBD2 Proportion |
---|---|---|---|
Parent-Child | 30-37 | 100 | 0% |
Full Siblings | 35-45 | 50 | 25% |
Half Siblings | 20-30 | 50 | 0% |
First Cousins | 15-25 | 25 | 0% |
Second Cousins | 8-15 | 16.67 | 0% |
Third Cousins | 3-8 | 12.5 | 0% |
These patterns directly reflect the biological mechanisms of genetic inheritance. For example:
- The lambda parameter decreases with increasing relationship distance due to the reduced probability of inheriting segments from more distant ancestors
- The mean segment length decreases with relationship distance due to additional meioses introducing more recombination points
- The IBD2 proportion is non-zero only for full siblings and duplicate relationships because these are the only relationships where both copies of a chromosome region can be inherited from the same ancestors
The likelihoods.py
module implements functions to calculate these parameters for any relationship tuple and to compute the probability of observed IBD patterns given these parameters.
Likelihood Functions and Inference
Likelihood-Based Relationship Inference
Bonsai v3 uses the three-parameter model to implement likelihood functions that quantify how well observed IBD patterns match the expected patterns for different relationships. The core of this approach is the PwLogLike
class in the likelihoods.py
module:
class PwLogLike:
"""
Class for computing pairwise relationship likelihoods.
"""
def __init__(self, ibd_data, bio_info=None, chrom_lengths=None, min_seg_len=7.0):
"""
Initialize with IBD data and optional biological info.
Args:
ibd_data: IBD segments or precomputed statistics
bio_info: Dictionary with age and sex information
chrom_lengths: Dictionary of chromosome lengths
min_seg_len: Minimum segment length threshold
"""
self.ibd_data = ibd_data
self.bio_info = bio_info
self.chrom_lengths = chrom_lengths or get_default_chrom_lengths()
self.min_seg_len = min_seg_len
# Cache for computed likelihoods
self._ll_cache = {}
def get_relationship_log_like(self, id1, id2, rel_tuple):
"""
Compute log-likelihood of a relationship between two individuals.
Args:
id1, id2: IDs of the individuals
rel_tuple: Tuple of (up, down, num_ancs)
Returns:
Log-likelihood of the relationship
"""
# Check cache
cache_key = (id1, id2, rel_tuple)
if cache_key in self._ll_cache:
return self._ll_cache[cache_key]
# Get observed IBD statistics
obs_stats = self.get_observed_stats(id1, id2)
# Get expected IBD moments for this relationship
exp_moments = get_ibd_moments(rel_tuple, self.min_seg_len)
# Compute segment count likelihood
count_ll = self.get_log_seg_count_pdf(
obs_stats['segment_count'],
exp_moments['mu1'],
exp_moments['sigma1']
)
# Compute total IBD likelihood
total_ll = self.get_log_total_ibd_pdf(
obs_stats['total_ibd'],
exp_moments['mu2'],
exp_moments['sigma2']
)
# Compute IBD2 proportion likelihood
ibd2_ll = self.get_log_ibd2_pdf(
obs_stats['ibd2_proportion'],
exp_moments['ibd2_prop']
)
# Combine likelihoods
combined_ll = self.combine_log_likes([count_ll, total_ll, ibd2_ll])
# Cache and return
self._ll_cache[cache_key] = combined_ll
return combined_ll
This class implements three key likelihood functions:
- Segment Count Likelihood: Based on the Poisson distribution with parameter λ
- Total IBD Likelihood: Based on a normal approximation with parameters μ and σ
- IBD2 Proportion Likelihood: Based on a beta distribution for the proportion of IBD2 sharing
These functions quantify how well the observed IBD statistics match the expected patterns for each relationship type. By combining these likelihoods, Bonsai can compute an overall likelihood score for each potential relationship.
Combined Likelihood Function
The combined likelihood function in Bonsai v3 integrates evidence from multiple sources to create a comprehensive assessment of relationship probability. The combination is performed in log space to prevent numerical underflow:
def combine_log_likes(log_likes, weights=None):
"""
Combine multiple log-likelihoods, optionally with weights.
Args:
log_likes: List of log-likelihood values
weights: Optional list of weights for each likelihood
Returns:
Combined log-likelihood
"""
if weights is None:
weights = [1.0] * len(log_likes)
# Filter out invalid likelihoods (-inf)
valid_likes = [(ll, w) for ll, w in zip(log_likes, weights) if ll != float('-inf')]
if not valid_likes:
return float('-inf')
# Combine weighted log-likelihoods
return sum(ll * w for ll, w in valid_likes) / sum(w for _, w in valid_likes)
This function allows for different weights to be assigned to different types of evidence, reflecting their relative reliability. For example, in Bonsai v3's actual implementation:
- IBD2 evidence is given higher weight for close relationships where it's more informative
- Segment count evidence is given higher weight for distant relationships where total IBD can be more variable
- Age evidence is weighted based on the completeness of the age information
The combined likelihood function implements a form of Bayesian inference, where the likelihood of each relationship hypothesis is updated based on multiple sources of evidence. This approach allows Bonsai to leverage all available information while handling missing or unreliable data gracefully.
Relationship Inference Algorithm
Bonsai v3's relationship inference algorithm uses the likelihood functions to identify the most probable relationship between two individuals. The core algorithm is implemented in the infer_relationship()
method of the PwLogLike
class:
def infer_relationship(self, id1, id2, max_degree=4):
"""
Infer the most likely relationship between two individuals.
Args:
id1, id2: IDs of the individuals
max_degree: Maximum relationship degree to consider
Returns:
List of (relationship tuple, log-likelihood) sorted by likelihood
"""
# Generate all possible relationship tuples up to max_degree
rel_tuples = self.get_relationship_options(max_degree)
# Calculate likelihood for each relationship
likelihoods = []
for rel_tuple in rel_tuples:
# Skip invalid relationships based on sex
if not self.is_valid_relationship(id1, id2, rel_tuple):
continue
# Calculate log-likelihood
log_ll = self.get_relationship_log_like(id1, id2, rel_tuple)
likelihoods.append((rel_tuple, log_ll))
# Sort by likelihood (highest first)
likelihoods.sort(key=lambda x: x[1], reverse=True)
return likelihoods
This algorithm involves several key steps:
- Candidate Generation: Generate all possible relationship tuples up to a specified degree
- Validity Filtering: Filter out relationships that are biologically implausible based on sex and age
- Likelihood Calculation: Calculate the log-likelihood for each valid relationship
- Ranking: Sort relationships by likelihood to identify the most probable ones
The result is a ranked list of relationship hypotheses with their associated likelihoods, which can be used for further analysis or visualization. This approach allows Bonsai to provide not just a single relationship prediction but a complete probabilistic assessment of different possibilities.
Stochasticity and Uncertainty
The Stochastic Nature of Genetic Inheritance
One of the fundamental challenges in genetic genealogy is the inherently stochastic nature of genetic inheritance. Due to the random processes of meiosis and recombination, even the same relationship type can produce widely varying patterns of IBD sharing. Bonsai v3 explicitly models this stochasticity to provide realistic confidence assessments.
Key sources of stochasticity in IBD sharing include:
- Random Assortment: Which chromosomes are inherited from each parent is random
- Recombination Variability: The number and location of recombination events vary randomly
- Segment Breakage: Where segments break during transmission to the next generation is stochastic
The statistical models in Bonsai v3 capture this stochasticity through the variance parameters in the distributions. For example, the variance of segment counts follows the Poisson distribution's property that variance equals mean, while the variance of total IBD is modeled based on the Central Limit Theorem:
def get_sigma1(mu1):
"""
Calculate standard deviation of segment count.
Args:
mu1: Expected number of segments
Returns:
Standard deviation (sqrt of variance)
"""
return math.sqrt(mu1) # For Poisson distribution, variance = mean
def get_sigma2(mu2, g):
"""
Calculate standard deviation of total IBD.
Args:
mu2: Expected total IBD
g: Meiotic distance
Returns:
Standard deviation of total IBD
"""
# Variance increases with expected total and decreases with meiotic distance
return max(mu2 * 0.2, math.sqrt(mu2 * (100.0 / g)))
These variance models are calibrated based on empirical data from thousands of known relationships, ensuring that they accurately reflect the true biological variability in genetic inheritance.
Uncertainty in Relationship Inference
Due to the stochastic nature of genetic inheritance, there is inherent uncertainty in relationship inference. Some relationships may be difficult to distinguish based on IBD patterns alone. Bonsai v3 explicitly quantifies this uncertainty through likelihood ratios and posterior probabilities.
For example, the following relationships often produce similar IBD patterns:
- Half Siblings vs. Grandparent-Grandchild vs. Avuncular: All share ~25% of DNA through a single common ancestor
- First Cousins Once Removed vs. Half First Cousins: Both share ~6.25% of DNA
- Second Cousins vs. First Cousins Twice Removed: Both share ~3.125% of DNA
To handle this uncertainty, Bonsai v3 implements a technique called "likelihood ratio testing" to determine when the evidence is insufficient to distinguish between competing hypotheses:
def is_significantly_better(ll1, ll2, threshold=2.0):
"""
Determine if one likelihood is significantly better than another.
Args:
ll1, ll2: Log-likelihood values
threshold: Log-likelihood difference threshold (2.0 = ~7.4x more likely)
Returns:
True if ll1 is significantly better than ll2
"""
return ll1 - ll2 > threshold
When multiple relationship hypotheses have similar likelihoods, Bonsai reports the uncertainty rather than making an arbitrary choice. This honest assessment of uncertainty is critical for reliable pedigree reconstruction, as it prevents the accumulation of errors in the pedigree structure.
Improving Inference with Additional Evidence
To reduce uncertainty in relationship inference, Bonsai v3 incorporates additional sources of evidence beyond IBD statistics. The most important of these is age information, which can help distinguish relationships with similar genetic sharing patterns.
For example, age differences follow distinct distributions for different relationships:
- Parent-Child: Typically 20-40 years (parent older)
- Siblings: Typically 0-10 years (either direction)
- Grandparent-Grandchild: Typically 40-80 years (grandparent older)
- Avuncular: Typically 10-30 years (aunt/uncle older, but can be reversed)
Bonsai v3 models these age differences as normal distributions with relationship-specific means and variances. The get_age_log_like()
function calculates the likelihood of observed age differences under different relationship hypotheses:
def get_age_log_like(age1, age2, rel_tuple):
"""
Calculate log-likelihood of age difference for a relationship.
Args:
age1, age2: Ages of individuals 1 and 2
rel_tuple: Relationship tuple (up, down, num_ancs)
Returns:
Log-likelihood of the age difference
"""
if age1 is None or age2 is None:
return 0.0 # No age information
up, down, _ = rel_tuple
age_diff = age1 - age2
# Parent-child relationship
if up == 0 and down == 1: # Parent-child (1 is parent)
return stats.norm.logpdf(age_diff, 30, 10)
elif up == 1 and down == 0: # Child-parent (2 is parent)
return stats.norm.logpdf(-age_diff, 30, 10)
# Sibling relationship
elif up == 1 and down == 1: # Siblings
return stats.norm.logpdf(abs(age_diff), 5, 5)
# Other relationships...
# [Additional cases for other relationship types]
# Default - weak constraint for other relationships
return stats.norm.logpdf(abs(age_diff), 0, 50)
By combining genetic evidence with age information, Bonsai can often resolve ambiguities that would be impossible to address using IBD patterns alone. This multi-evidence approach is a key strength of Bonsai v3's statistical framework.
Theoretical Foundation: Bonsai v3's statistical models demonstrate how complex biological phenomena can be captured through mathematical abstractions. By reducing the intricate process of genetic inheritance to a set of probability distributions with biologically meaningful parameters, the system creates a bridge between the physical reality of DNA transmission and the computational tools needed to interpret genetic relationships.
Comparing Notebook and Production Code
The Lab05 notebook provides simplified implementations of the statistical models used in Bonsai v3, while the actual implementation includes additional sophistication:
- Calibrated Parameters: The production code uses parameters calibrated on large datasets of known relationships
- Chromosome-Specific Models: Separate models for each chromosome that account for differences in length and recombination rates
- Population Background Models: Adjustments for population-specific background levels of IBD sharing
- Relationship Priors: Incorporation of prior probabilities for different relationship types
- Age Distribution Models: Sophisticated models of age differences for various relationships
- Comprehensive Validation: Extensive validation against known relationships in diverse populations
Despite these differences, the core mathematical principles demonstrated in the notebook directly correspond to those used in the production system, providing an accurate conceptual foundation for understanding how Bonsai transforms genetic data into relationship assessments.
Interactive Lab Environment
Run the interactive Lab 05 notebook in Google Colab:
Google Colab Environment
Run the notebook in Google Colab for a powerful computing environment with access to Google's resources.
Data will be automatically downloaded from S3 when you run the notebook.
Note: You may need a Google account to save your work in Google Drive.
Beyond the Code
As you explore the statistical models in Bonsai v3, consider these broader implications:
- Model Selection: How the choice of probability distributions affects the system's ability to capture biological reality
- Parameter Estimation: The challenges of calibrating statistical models using empirical data
- Uncertainty Quantification: The importance of honest assessment of confidence in scientific inference
- Multidisciplinary Integration: How Bonsai combines insights from genetics, statistics, and computer science
These considerations highlight how Bonsai v3's statistical models represent a sophisticated application of probability theory to a complex biological problem, creating a powerful framework for deriving meaningful insights from genetic data.
This lab is part of the Bonsai v3 Deep Dive track:
Introduction
Lab 01
Architecture
Lab 02
IBD Formats
Lab 03
Statistics
Lab 04
Models
Lab 05
Relationships
Lab 06
PwLogLike
Lab 07
Age Modeling
Lab 08
Data Structures
Lab 09
Up-Node Dict
Lab 10