Lab 06: Probabilistic Relationship Inference
Bayesian Inference: This lab examines how Bonsai v3 transforms observed IBD patterns into probabilistic assessments of relationships using statistical models and Bayesian inference. Understanding this process is key to grasping the system's core functionality and interpreting its outputs.
The Mathematical Framework
Relationship Representation
Bonsai v3 uses a compact tuple representation for genetic relationships that captures the essential genealogical structure. There are two equivalent formats used in different parts of the codebase:
- Three-value tuple:
(up, down, num_ancs)
up
: Generations from person 1 to common ancestordown
: Generations from common ancestor to person 2num_ancs
: Number of common ancestors (1 for half-relationships, 2 for full relationships)
- Five-value tuple:
(degree1, removal1, degree2, removal2, half)
degree1
: Genealogical degree of person 1 (ancestors to common ancestor)removal1
: Removal for person 1 (generational distance within degree)degree2
: Genealogical degree of person 2 (ancestors to common ancestor)removal2
: Removal for person 2 (generational distance within degree)half
: 1 for half relationships (one common ancestor), 0 for full relationships (two common ancestors)
The three-value tuple representation is more commonly used in the core algorithmic components. For example, the function get_simple_rel_tuple()
in pedigrees.py
returns this representation when computing relationships in pedigree structures:
def get_simple_rel_tuple(up_node_dict, id1, id2):
"""Compute the relationship tuple between two individuals in a pedigree.
Args:
up_node_dict: Dictionary representing the pedigree structure
id1, id2: IDs of the individuals to compare
Returns:
Tuple of (up, down, num_ancs) representing the relationship
"""
This representation directly encodes the meiotic distance between individuals (up + down
), which is the primary determinant of expected IBD sharing patterns. The tuple approach allows the system to represent any possible relationship within a consistent mathematical framework, enabling efficient comparison and computation.
Bayesian Inference for Relationships
At its core, Bonsai v3 implements a Bayesian approach to relationship inference. For two individuals with observed IBD data D, the posterior probability of relationship R is calculated as:
P(R|D) = \frac{P(D|R) \cdot P(R)}{P(D)}
Where:
- P(R|D) is the posterior probability of relationship R given the observed IBD data
- P(D|R) is the likelihood of observing the IBD data given relationship R
- P(R) is the prior probability of relationship R
- P(D) is the marginal probability of the IBD data
Since the goal is to compare different possible relationships for the same IBD data, the denominator P(D) is constant across all relationship hypotheses. This allows Bonsai to focus on the numerator, comparing relationships based on the product of likelihood and prior:
P(R|D) \propto P(D|R) \cdot P(R)
In practice, Bonsai v3 performs these calculations in log space to avoid numerical underflow with very small probabilities:
\log P(R|D) \propto \log P(D|R) + \log P(R)
This Bayesian framework is implemented in the get_relationship_log_like()
method of the PwLogLike
class, which combines log-likelihoods from multiple sources of evidence with relationship priors to compute posterior log-probabilities.
Relationship Priors
Bonsai v3 incorporates prior probabilities for different relationship types, reflecting the relative frequency of these relationships in typical human populations. These priors are implemented in the get_rel_ll_prior()
function in the likelihoods.py
module:
def get_rel_ll_prior(rel_tuple):
"""Calculate prior log-likelihood for a relationship tuple.
Args:
rel_tuple: Tuple of (up, down, num_ancs)
Returns:
Prior log-likelihood
"""
up, down, num_ancs = rel_tuple
# Calculate meiotic distance
m = up + down
# No relationship has a very low prior
if rel_tuple is None:
return -20.0
# Self-comparison has a low prior in most contexts
if m == 0 and num_ancs == 2: # Self
return -10.0
# Prior decreases with increasing meiotic distance
base_prior = -0.5 * m
# Full relationships are more common than half relationships
anc_prior = 0.5 if num_ancs == 2 else 0.0
# Symmetric relationships (up ≈ down) are more common
sym_prior = -0.2 * abs(up - down)
return base_prior + anc_prior + sym_prior
This prior model encodes several important insights about human kinship patterns:
- The probability of relationship decreases with increasing genealogical distance
- Full relationships (sharing two common ancestors) are more common than half relationships
- Symmetric relationships (same generation) are more common than asymmetric ones
The prior serves as a regularization mechanism, guiding the inference process toward more plausible relationships when the genetic evidence alone is ambiguous. This is particularly important for distant relationships where IBD sharing patterns can be highly variable.
Statistical Moments of IBD Distributions
The Lambda Parameter
The lambda parameter in Bonsai v3 controls the distribution of IBD segment lengths. For a relationship with meiotic distance m, the lambda parameter is calculated as:
\lambda = \frac{m}{100}
This parameter has a direct biological interpretation: it represents the rate of decrease in the probability of observing longer segments, reflecting the increased chances of recombination breaking up segments over more meiotic events.
The lambda parameter determines several key properties of the expected IBD distribution:
- The expected mean segment length is approximately 100/m centimorgans
- The probability density function for segment length x is λe-λx
- The probability that a segment exceeds length t is e-λt
The get_lam_a_m()
function in likelihoods.py
implements this calculation with additional refinements for specific relationship types:
def get_lam_a_m(g, num_ancs=1, min_len=7.0):
"""Calculate lambda parameter for segment length distribution.
Args:
g: Meiotic distance (up + down)
num_ancs: Number of common ancestors
min_len: Minimum segment length
Returns:
Lambda parameter
"""
# Base lambda is meiotic distance / 100
lam = g / 100.0
# Adjust for specific relationship patterns
if g == 2 and num_ancs == 2: # Full siblings
lam *= 0.9 # Empirical adjustment for full siblings
return lam
This parameter is central to Bonsai's relationship inference capabilities, as it directly ties the observable pattern of segment lengths to the underlying genealogical relationship.
The Eta Parameter
The eta parameter represents the expected number of IBD segments for a relationship. For a relationship with meiotic distance m and a common ancestors, eta is calculated as:
\eta = \frac{a(rm+c)}{2^{m-1}} \cdot e^{-\frac{m \cdot t}{100}}
Where:
- r is the recombination rate (~34 crossovers per meiosis for the human genome)
- c is the number of chromosomes (22 autosomes)
- t is the minimum detectable segment length
This formula has a strong theoretical foundation in genetic inheritance models:
- The term rm+c represents the expected number of segments before any filtering
- The division by 2m-1 accounts for the probability of inheriting segments through specific lineages
- The exponential term accounts for segments that fall below the detection threshold
The get_eta()
function in moments.py
implements this calculation:
def get_eta(g, r=34.0, c=22, a=1):
"""Calculate expected number of IBD segments.
Args:
g: Meiotic distance
r: Recombination rate per meiosis
c: Number of chromosomes
a: Number of common ancestors
Returns:
Expected number of segments
"""
if g <= 1:
return a * (r * g + c)
else:
return a * (r * g + c) / (2 ** (g - 1))
This parameter directly influences the probability distribution of observed segment counts, which follows a Poisson distribution with mean η:
P(k|R) = \frac{\eta^k e^{-\eta}}{k!}
Where k is the observed number of segments and R is the relationship. This distribution is implemented in the get_log_seg_count_pdf()
function in likelihoods.py
.
Expected Total IBD Sharing
The expected total IBD sharing for a relationship combines the eta and lambda parameters. For a relationship with meiotic distance m and expected segment count η, the expected total IBD sharing is:
E[Total\;IBD] = \eta \cdot (\frac{100}{m} + t)
Where t is the minimum detectable segment length. This formula accounts for the truncation of the exponential distribution at the minimum threshold.
The expected total IBD sharing follows characteristic patterns for different relationships:
- Parent-Child: ~3400 cM (50% of the genome)
- Full Siblings: ~2550 cM (37.5% of the genome)
- Half Siblings/Grandparent/Avuncular: ~1700 cM (25% of the genome)
- First Cousins: ~850 cM (12.5% of the genome)
- Second Cousins: ~212 cM (3.125% of the genome)
The get_expected_total_ibd()
function in moments.py
calculates this value for any relationship tuple:
def get_expected_total_ibd(rel_tuple, min_seg_len=7.0):
"""Calculate expected total IBD sharing.
Args:
rel_tuple: (up, down, num_ancs) tuple
min_seg_len: Minimum segment length threshold
Returns:
Expected total IBD in cM
"""
up, down, num_ancs = rel_tuple
m = up + down
# Special case for self
if m == 0 and num_ancs == 2:
return 3400.0
# Calculate expected segment count
eta = get_eta(m, a=num_ancs)
# Adjust for detection threshold
p_obs = math.exp(-m * min_seg_len / 100.0)
eta_obs = eta * p_obs
# Calculate expected total IBD
expected_total = eta_obs * (100.0 / m + min_seg_len)
return expected_total
This expected value serves as the mean of a normal distribution that models the variation in total IBD sharing. The variance of this distribution is also modeled based on theoretical considerations and empirical calibration.
Likelihood Functions
Likelihood for Segment Counts
The likelihood function for segment counts quantifies the probability of observing k segments given a relationship R. This likelihood is based on the Poisson distribution with parameter η (expected segment count):
P(k|R) = \frac{\eta^k e^{-\eta}}{k!}
In log space, this becomes:
\log P(k|R) = k \log \eta - \eta - \log(k!)
The get_log_seg_count_pdf()
function in likelihoods.py
implements this calculation:
def get_log_seg_count_pdf(k, eta, sigma=None):
"""Calculate log-likelihood of observing k segments.
Args:
k: Observed segment count
eta: Expected segment count
sigma: Standard deviation (not used for Poisson)
Returns:
Log-likelihood
"""
if eta <= 0:
return -np.inf if k > 0 else 0.0
# Use Poisson PMF
return stats.poisson.logpmf(k, eta)
This likelihood function captures the inherent stochasticity in segment counts due to the random nature of recombination and inheritance. It accounts for the fact that even for a fixed relationship type, the observed number of segments can vary significantly.
Likelihood for Segment Lengths
The likelihood function for segment lengths quantifies the probability of observing segments with specific lengths given a relationship R. For a relationship with meiotic distance m, the probability density function for a segment of length x (given that it exceeds the minimum threshold t) is:
f(x|R,x>t) = \frac{m}{100} e^{-\frac{m}{100}(x-t)}
The log-likelihood for a set of segments is the sum of the log-likelihoods for each individual segment:
\log P(\{x_1, x_2, ..., x_n\}|R) = \sum_{i=1}^{n} \log f(x_i|R,x_i>t)
This calculation is implemented in the get_log_seg_len_pdf()
function:
def get_log_seg_len_pdf(segs, lam, min_len=7.0):
"""Calculate log-likelihood of observed segment lengths.
Args:
segs: List of segment lengths
lam: Lambda parameter for exponential distribution
min_len: Minimum segment length threshold
Returns:
Log-likelihood
"""
if not segs:
return 0.0 # No segments
# Calculate log-likelihood for each segment
log_likes = []
for seg_len in segs:
if seg_len < min_len:
continue
# Log-pdf for truncated exponential distribution
adj_len = seg_len - min_len
log_like = np.log(lam) - lam * adj_len
log_likes.append(log_like)
# Sum log-likelihoods
return sum(log_likes)
This function handles the truncated exponential distribution that arises from imposing a minimum detection threshold on segment lengths. It captures the distinctive patterns of segment length distributions that characterize different relationship types.
The Combined Likelihood Function
The core of Bonsai v3's relationship inference is the get_log_seg_pdf()
function, which combines the likelihoods from segment counts and segment lengths into a comprehensive likelihood model:
def get_log_seg_pdf(segments, rel_tuple, min_seg_len=7.0):
"""Calculate log-likelihood of observing segments given a relationship.
Args:
segments: List of segment lengths
rel_tuple: (up, down, num_ancs) tuple
min_seg_len: Minimum segment length threshold
Returns:
Log-likelihood
"""
up, down, num_ancs = rel_tuple
m = up + down
# Handle special case for self
if m == 0 and num_ancs == 2:
return 0.0 if not segments else -np.inf
# Calculate parameters
lam = get_lam_a_m(m, num_ancs, min_seg_len)
eta = get_eta(m, a=num_ancs)
# Adjust for detection threshold
p_obs = math.exp(-m * min_seg_len / 100.0)
eta_obs = eta * p_obs
# Calculate likelihood for segment count
count_ll = get_log_seg_count_pdf(len(segments), eta_obs)
# Calculate likelihood for segment lengths
length_ll = get_log_seg_len_pdf(segments, lam, min_seg_len)
# Combine likelihoods with appropriate weights
# Weight factors are empirically calibrated
count_weight = 0.4
length_weight = 0.6
return count_weight * count_ll + length_weight * length_ll
This function implements a weighted combination of the count and length likelihoods, with weights that are empirically calibrated to optimize inference accuracy. The weights reflect the relative information content of the different statistics - segment lengths often provide more discriminative power for distinguishing close relationships, while segment counts become more important for distant relationships.
In the actual Bonsai v3 implementation, this function includes additional refinements:
- Special handling for parent-child relationships, which have distinctive IBD patterns
- Adjustments for IBD2 segments, which are present in full sibling relationships
- Chromosome-specific models that account for varying recombination rates
- Population-specific calibration factors for different demographic backgrounds
These refinements ensure that the likelihood function accurately captures the complex patterns of IBD sharing observed in real human relationships.
Handling Uncertainty and Stochasticity
The Stochastic Nature of Genetic Inheritance
One of the fundamental challenges in genetic relationship inference is the inherently stochastic nature of genetic inheritance. Due to the random processes of meiosis and recombination, even the same relationship type can produce widely varying patterns of IBD sharing.
Key sources of stochasticity include:
- Random Segregation: Which chromosomes are inherited from each parent is random
- Recombination Variability: The number and location of crossovers during meiosis is variable
- Sampling Error: Genotyping covers only a subset of markers, introducing sampling variance
- Detection Error: IBD detection algorithms have false positives and false negatives
This stochasticity means that a single observed IBD pattern could be consistent with multiple possible relationships. For example, the coefficient of variation (standard deviation divided by mean) for total IBD sharing between second cousins can be as high as 50%, meaning that the actual sharing can easily range from half to double the expected value!
Bonsai v3 explicitly models this stochasticity through the probability distributions in its likelihood functions. The variance parameters in these distributions are calibrated using both theoretical models and empirical data from thousands of known relationships.
Likelihood Ratios and Bayes Factors
To quantify uncertainty in relationship inference, Bonsai v3 uses likelihood ratios (also known as Bayes factors) to compare different relationship hypotheses. For two relationship hypotheses R₁ and R₂, the likelihood ratio is:
LR = \frac{P(D|R_1)}{P(D|R_2)}
In log space, this becomes a simple subtraction:
\log LR = \log P(D|R_1) - \log P(D|R_2)
Bonsai v3 implements a function to determine when one hypothesis is significantly better than another:
def is_significantly_better(ll1, ll2, threshold=2.0):
"""Determine if one log-likelihood is significantly better than another.
Args:
ll1, ll2: Log-likelihood values
threshold: Difference threshold (default 2.0 ≈ 7.4x more likely)
Returns:
True if ll1 is significantly better than ll2
"""
return ll1 - ll2 > threshold
The threshold value of 2.0 corresponds to a likelihood ratio of approximately 7.4, which is often used as a standard for "strong evidence" in Bayesian statistics. When multiple relationship hypotheses have similar likelihoods (differences below the threshold), Bonsai reports the ambiguity rather than making an arbitrary choice.
This approach to uncertainty quantification is critical for reliable pedigree reconstruction, as it prevents the propagation of errors from ambiguous relationships to the larger pedigree structure.
Ambiguous Relationship Cases
Certain pairs of relationships are notoriously difficult to distinguish based on IBD patterns alone. The most common ambiguous cases are:
Ambiguous Group | Shared DNA | Distinguishing Features |
---|---|---|
Half-Siblings Grandparent-Grandchild Avuncular (Aunt/Uncle) |
~25% | Age differences Segment length patterns |
Half-Avuncular First Cousins |
~12.5% | Age differences Segment count |
Half-First Cousins First Cousins Once Removed |
~6.25% | Age differences Generation gap |
Second Cousins Half-First Cousins Once Removed |
~3.125% | Difficult to distinguish reliably |
To address these ambiguities, Bonsai v3 implements several strategies:
- Age-Based Disambiguation: Using age information to distinguish relationships with similar IBD patterns
- Multiple Relationship Paths: Considering all possible ways individuals might be related
- Network Constraints: Using relationships with other individuals to constrain possibilities
- Explicit Uncertainty Reporting: Providing confidence scores and alternative hypotheses
The get_age_log_like()
function in likelihoods.py
specifically addresses the first strategy, computing the likelihood of observed age differences under different relationship hypotheses:
def get_age_log_like(age1, age2, rel_tuple):
"""Calculate log-likelihood of age difference for a relationship.
Args:
age1, age2: Ages of the two individuals
rel_tuple: (up, down, num_ancs) relationship tuple
Returns:
Log-likelihood of the age difference
"""
By combining genetic and demographic evidence, Bonsai can often resolve ambiguities that would be impossible to address using IBD patterns alone.
Statistical Foundation: Bonsai v3's probabilistic relationship inference demonstrates the power of Bayesian modeling for complex biological problems. By explicitly modeling the stochastic nature of genetic inheritance and quantifying uncertainty, the system provides not just point estimates but comprehensive probabilistic assessments of relationships, enabling more reliable pedigree reconstruction in the face of inherent biological variability.
Comparing Notebook and Production Code
The Lab06 notebook provides a simplified implementation of the probabilistic relationship inference methods used in Bonsai v3, while the actual implementation includes additional sophistication:
- Complex Demographic Models: The production code includes complex models of age differences for different relationship types
- Population-Specific Calibration: Parameters are adjusted for different population backgrounds
- Multiple Test Statistic: Additional statistics beyond segment counts and lengths are incorporated
- IBD2 Modeling: Sophisticated handling of IBD2 regions, particularly for distinguishing close relationships
- Multi-way Relationship Analysis: Consideration of multiple individuals simultaneously to constrain possible relationships
- Dynamic Prior Adjustment: Priors that adapt based on context and known relationships
Despite these differences, the fundamental mathematical principles demonstrated in the notebook directly correspond to those used in the production system, providing an accurate conceptual foundation for understanding Bonsai's approach to relationship inference.
Interactive Lab Environment
Run the interactive Lab 06 notebook in Google Colab:
Google Colab Environment
Run the notebook in Google Colab for a powerful computing environment with access to Google's resources.
Data will be automatically downloaded from S3 when you run the notebook.
Note: You may need a Google account to save your work in Google Drive.
Beyond the Code
As you explore the probabilistic relationship inference methods in Bonsai v3, consider these broader implications:
- Probabilistic Thinking: How explicitly representing uncertainty improves decision-making in scientific contexts
- Model Selection: The balance between model complexity and interpretability in statistical inference
- Evidence Integration: How to combine multiple sources of evidence with different levels of reliability
- Communicating Uncertainty: The challenges of conveying probabilistic results to non-specialists
These considerations highlight how Bonsai v3's approach to relationship inference represents not just a technical solution but a principled scientific methodology for drawing conclusions from complex and noisy data.
This lab is part of the Bonsai v3 Deep Dive track:
Introduction
Lab 01
Architecture
Lab 02
IBD Formats
Lab 03
Statistics
Lab 04
Models
Lab 05
Relationships
Lab 06
PwLogLike
Lab 07
Age Modeling
Lab 08
Data Structures
Lab 09
Up-Node Dict
Lab 10