Lab 04: IBD Statistics Extraction and Analysis
Data Analysis: This lab explores how Bonsai v3 extracts and analyzes statistical patterns from IBD segments. Understanding these statistical measures is crucial for relationship inference and pedigree reconstruction.
Core IBD Statistics
The Five Fundamental Metrics
Bonsai v3 extracts five core statistics from IBD segment data, which serve as the foundation for all relationship inference:
- Total Half-IBD (IBD1): The total genetic length (in centiMorgans) of segments where individuals share exactly one allele
- Total Full-IBD (IBD2): The total genetic length (in centiMorgans) of segments where individuals share both alleles
- Number of Half-IBD Segments: Count of distinct IBD1 segments detected
- Number of Full-IBD Segments: Count of distinct IBD2 segments detected
- Maximum Segment Length: Length (in centiMorgans) of the longest shared IBD segment
These statistics are calculated for every pair of individuals who share IBD segments. In the Bonsai v3 codebase, this calculation is performed by the get_ibd_stats_unphased() function in the ibd.py module. This function processes raw IBD segments to compute the summary statistics, handling edge cases such as overlapping segments and chromosome boundaries.
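The idea can be illustrated with a minimal sketch. It is not the get_ibd_stats_unphased() implementation: the segment tuple layout and the function name compute_pair_stats are assumptions made for this example, and the sketch does not merge overlapping segments.

```python
from collections import defaultdict

# Hypothetical segment layout (an assumption, not Bonsai's actual format):
# (id1, id2, chromosome, start_cm, end_cm, ibd_type), with ibd_type 1 or 2.
def compute_pair_stats(segments):
    """Summarize IBD segments into five statistics per unordered pair."""
    stats = defaultdict(lambda: {
        "total_half": 0.0, "total_full": 0.0,
        "num_half": 0, "num_full": 0, "max_seg_cm": 0.0,
    })
    for id1, id2, _chrom, start_cm, end_cm, ibd_type in segments:
        pair = frozenset((id1, id2))  # order of the two IDs does not matter
        length = end_cm - start_cm
        entry = stats[pair]
        if ibd_type == 1:
            entry["total_half"] += length
            entry["num_half"] += 1
        else:
            entry["total_full"] += length
            entry["num_full"] += 1
        entry["max_seg_cm"] = max(entry["max_seg_cm"], length)
    return dict(stats)

# Made-up segments for illustration only.
example_segments = [
    ("A", "B", 1, 10.0, 60.0, 1),
    ("B", "A", 1, 80.0, 100.0, 2),
    ("A", "C", 2, 0.0, 35.0, 1),
]
pair_stats = compute_pair_stats(example_segments)
```

Keying on a frozenset means ("A", "B") and ("B", "A") land in the same entry, which is one simple way to treat pairs as unordered.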
The IBDIndex Class
Bonsai v3 implements an IBDIndex class to efficiently store and retrieve IBD statistics and segments. This class provides a layer of abstraction between raw IBD data and the statistical models used for relationship inference.
Key operations supported by the IBDIndex class include:
- get_stats_for_pair(id1, id2): Retrieves computed statistics for a specific pair of individuals
- get_segments_for_pair(id1, id2): Returns all IBD segments shared between two individuals
- get_all_pairs(): Retrieves all pairs of individuals with detected IBD sharing
- get_total_ibd_between_id_sets(id_set1, id_set2): Calculates total IBD sharing between two sets of individuals
The class uses hash tables to give O(1) average lookup time for most operations, which is essential when processing datasets with thousands of individuals and millions of IBD segments. It also caches computed results to avoid redundant work, which matters most for operations that are called repeatedly during pedigree construction.
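A toy version of such an index can be sketched with a dictionary keyed by unordered pairs. The class name SimpleIBDIndex, the segment layout, and the cache are assumptions for illustration; the real IBDIndex class may be organized quite differently.

```python
from collections import defaultdict

class SimpleIBDIndex:
    """Toy pair-keyed index; not Bonsai's IBDIndex implementation."""

    def __init__(self, segments):
        # segments: iterable of (id1, id2, length_cm, ibd_type); hypothetical layout.
        self._segments = defaultdict(list)
        self._total_cache = {}  # pair -> total cM, filled lazily
        for id1, id2, length_cm, ibd_type in segments:
            self._segments[frozenset((id1, id2))].append((length_cm, ibd_type))

    def get_segments_for_pair(self, id1, id2):
        """Average O(1) dictionary lookup; returns a copy of the segment list."""
        return list(self._segments.get(frozenset((id1, id2)), []))

    def get_all_pairs(self):
        """All pairs with at least one segment, as sorted tuples."""
        return [tuple(sorted(pair)) for pair in self._segments]

    def get_total_for_pair(self, id1, id2):
        """Total shared cM for a pair, computed once and then cached."""
        pair = frozenset((id1, id2))
        if pair not in self._total_cache:
            self._total_cache[pair] = sum(
                length for length, _ in self._segments.get(pair, [])
            )
        return self._total_cache[pair]

index = SimpleIBDIndex([("A", "B", 50.0, 1), ("B", "A", 20.0, 2), ("A", "C", 12.5, 1)])
```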
Relationship-Specific IBD Patterns
The five core IBD statistics form distinctive patterns for different relationship types. Bonsai v3 leverages these patterns for relationship inference:
Relationship | Total IBD1 (cM) | Total IBD2 (cM) | IBD1 Segments | IBD2 Segments | Max Segment (cM) |
---|---|---|---|---|---|
Parent-Child | ~3400 | 0 | ~22-35 | 0 | ~180-280 |
Full Siblings | ~1700 | ~850 | ~35-50 | ~10-20 | ~120-180 |
Half-Siblings | ~1700 | 0 | ~20-30 | 0 | ~100-140 |
First Cousins | ~850 | 0 | ~15-25 | 0 | ~70-100 |
Second Cousins | ~212 | 0 | ~8-15 | 0 | ~40-70 |
These patterns arise from the biological mechanisms of genetic inheritance. For example, a parent and child share exactly one haplotype at every autosomal position, resulting in approximately 3400 cM of IBD1 and no IBD2. Full siblings also share about 50% of their DNA on average, but unlike parent-child pairs, their sharing is split between regions where one haplotype is shared (IBD1), regions where both are shared (IBD2, about 25% of the genome), and regions with no sharing at all.
Critically, Bonsai v3 doesn't just rely on these average values but models the complete statistical distributions of these metrics for each relationship type, accounting for variance due to the stochastic nature of recombination.
Segment Length Distributions
The Importance of Segment Lengths
Beyond the five core statistics, Bonsai v3 analyzes the complete distributions of IBD segment lengths, which provide additional information for relationship inference:
- Close Relationships: Typically have longer segments due to fewer recombination events separating individuals
- Distant Relationships: Have progressively shorter segments as additional generations introduce more recombination points
- Complex Relationships: Often show bimodal or unusual distributions that can help identify unique family structures
The segment length distribution follows an exponential decay pattern, where the rate parameter is directly related to the meiotic distance between individuals. This mathematical relationship provides a powerful tool for distinguishing relationship types that might have similar total IBD amounts but different ancestral paths.
Mathematical Modeling of Segment Lengths
In Bonsai v3, segment length distributions are modeled mathematically using a modified exponential distribution. For a relationship with meiotic distance d (the total number of meioses separating two individuals through their common ancestor), the probability density function for segment length L is approximately:
$$f(L) = \frac{d}{100}\, e^{-dL/100}$$
This means:
- The expected segment length is inversely proportional to meiotic distance
- The probability of finding very long segments decreases exponentially with relationship distance
- The segment length distribution is a key parameter in Bonsai's likelihood models
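These consequences follow directly from the density above and can be checked numerically. The sketch below uses only the unmodified exponential form given in the text; the function names are illustrative and are not taken from Bonsai.

```python
import math

def segment_length_pdf(length_cm, d):
    """Density f(L) = (d/100) * exp(-d*L/100) for meiotic distance d."""
    rate = d / 100.0
    return rate * math.exp(-rate * length_cm)

def expected_segment_length(d):
    """Mean of the exponential distribution: 100/d cM."""
    return 100.0 / d

def prob_longer_than(length_cm, d):
    """Survival function P(L > length_cm) = exp(-d*L/100)."""
    return math.exp(-d * length_cm / 100.0)

# Compare half siblings (d=2), first cousins (d=4), and second cousins (d=6).
for d in (2, 4, 6):
    print(d, expected_segment_length(d), prob_longer_than(50.0, d))
```

Under this simple model, doubling the meiotic distance halves the expected segment length and squares the probability of exceeding any fixed length.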
In the actual implementation, Bonsai v3's likelihoods.py module includes functions such as get_lam_a_m() that implement these mathematical models with empirically calibrated parameters. The models are adjusted for factors like chromosome-specific recombination rates and population background.
Length Distribution Analysis in Practice
The IBDIndex class in Bonsai v3 provides methods to access the complete segment length distributions for analysis:
    def get_segment_length_distribution(self, id1, id2, ibd_type=None):
        """Get the distribution of segment lengths for a specific pair.

        Args:
            id1, id2: Individual IDs
            ibd_type: Optional, 1 for IBD1-only, 2 for IBD2-only, None for all

        Returns:
            List of segment lengths in cM
        """
This function is used by the likelihood models to compute the probability of observed segment length distributions under different relationship hypotheses. It handles various edge cases like missing or corrupt segment data, and can filter by IBD type to analyze IBD1 and IBD2 patterns separately.
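To make the likelihood step concrete, here is a minimal sketch that scores a list of observed segment lengths under several candidate meiotic distances, using the plain exponential density from the previous section. It ignores detection thresholds and the calibrations mentioned earlier, and the observed lengths are made-up numbers.

```python
import math

def exp_log_likelihood(lengths_cm, d):
    """Sum of log f(L) with f(L) = (d/100) * exp(-d*L/100)."""
    rate = d / 100.0
    return sum(math.log(rate) - rate * length for length in lengths_cm)

# Made-up segment lengths in cM, as a length list for one pair might look.
observed = [62.0, 48.5, 41.0, 55.3]

# Score each candidate meiotic distance and keep the best one.
scores = {d: exp_log_likelihood(observed, d) for d in range(1, 9)}
best_d = max(scores, key=scores.get)
print(best_d)
```

The maximum-likelihood rate of an exponential is the reciprocal of the sample mean, so the best integer d lands near 100 divided by the mean segment length.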
When visualized, these distributions reveal characteristic patterns:
- Parent-child relationships show a relatively flat distribution with many long segments
- Sibling relationships show a mixture of long and medium segments
- Cousin relationships show distributions heavily skewed toward shorter segments
- Very distant relationships (3rd cousins and beyond) primarily show segments below 20 cM
These visual patterns provide intuitive confirmation of the mathematical models implemented in Bonsai's codebase.
IBD Network Analysis
Building IBD Networks
Bonsai v3 constructs IBD networks as an intermediate step in pedigree reconstruction. These networks represent individuals as nodes and IBD sharing as weighted edges, providing a powerful visual and computational representation of genetic relatedness.
The network construction process in Bonsai includes:
- Node Creation: Each individual in the dataset becomes a node in the network
- Edge Creation: Edges are created between pairs with IBD sharing above a threshold (typically 7-10 cM)
- Edge Weighting: Each edge is weighted by the total amount of IBD sharing (IBD1 + IBD2)
- Edge Annotation: Additional attributes like IBD statistics are stored on edges for later analysis
The get_id_to_shared_ibd() function in ibd.py is a key component of this process, creating the necessary data structures for efficient network construction. The resulting networks capture the complex patterns of genetic sharing across multiple individuals, allowing for group-level analysis that goes beyond pairwise comparisons.
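The four construction steps can be sketched with plain dictionaries. The input layout, the function name build_ibd_graph, and the 7 cM default are assumptions for this example; this is not the output format of get_id_to_shared_ibd().

```python
def build_ibd_graph(pair_totals, min_cm=7.0):
    """Build an undirected weighted adjacency map from pairwise IBD totals.

    pair_totals maps (id1, id2) -> (total_ibd1_cm, total_ibd2_cm);
    this layout is hypothetical.
    """
    graph = {}
    for (id1, id2), (ibd1_cm, ibd2_cm) in pair_totals.items():
        # Node creation: every individual appears, even without edges.
        graph.setdefault(id1, {})
        graph.setdefault(id2, {})
        # Edge weighting: total sharing is IBD1 + IBD2.
        weight = ibd1_cm + ibd2_cm
        if weight >= min_cm:
            # Edge annotation: keep the component statistics on the edge.
            attrs = {"weight": weight, "ibd1_cm": ibd1_cm, "ibd2_cm": ibd2_cm}
            graph[id1][id2] = attrs
            graph[id2][id1] = attrs
    return graph

# Made-up totals; the C-D pair falls below the threshold and gets no edge.
pair_totals = {
    ("A", "B"): (1700.0, 850.0),
    ("A", "C"): (850.0, 0.0),
    ("C", "D"): (5.0, 0.0),
}
graph = build_ibd_graph(pair_totals)
```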
Community Detection Algorithms
A critical application of IBD networks in Bonsai v3 is the identification of related individuals using community detection algorithms. These algorithms identify clusters of densely connected nodes, which correspond to groups of related individuals.
Bonsai implements several community detection approaches:
- Louvain Method: A hierarchical clustering algorithm that optimizes modularity
- Connected Components: A simple approach for identifying completely separate networks
- Edge Filtering: Progressive removal of weak edges to reveal core family structures
The implementation in get_next_node() and related functions uses these community detection results to prioritize which relationships to analyze first during pedigree reconstruction. This approach provides two major advantages:
- It reduces the computational complexity by processing related groups together
- It improves accuracy by ensuring that closely related individuals are placed in consistent positions
The community detection component allows Bonsai to scale efficiently to large datasets with complex family structures, breaking down the global optimization problem into more manageable subproblems.
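Of the approaches listed above, connected components is the simplest to show. The breadth-first search below is a generic sketch over an adjacency map and is not taken from the Bonsai codebase.

```python
from collections import deque

def connected_components(adjacency):
    """Return a list of sets, one per connected component."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from each node not yet assigned to a component.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

# Made-up network: two families plus one individual with no IBD edges.
adjacency = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B"},
    "D": {"E"}, "E": {"D"},
    "F": set(),
}
components = connected_components(adjacency)
```

Each component can then be processed as its own subproblem, which is the decomposition described above.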
Within-Community vs. Between-Community Analysis
Bonsai v3 performs sophisticated analysis of IBD sharing patterns within and between detected communities:
- Within-Community Sharing: Typically higher, representing close family relationships
- Between-Community Sharing: Typically lower, representing more distant connections
- Bridge Individuals: People with significant connections to multiple communities, often representing important cross-family links
The get_total_ibd_between_id_sets() function in the IBDIndex class specifically supports this type of analysis, calculating the aggregate IBD sharing between groups of individuals. This function is used in several contexts:
- Validating community detection results by measuring internal cohesion
- Identifying potential merger points between separate pedigrees
- Prioritizing which communities to process first during incremental pedigree building
By analyzing both intra- and inter-community IBD patterns, Bonsai can handle complex scenarios like endogamy (marriage within a relatively closed community), which creates unusual patterns of genetic sharing that simpler algorithms would misinterpret.
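The aggregate calculation can be sketched as a sum over pairs that straddle the two sets. The function name and input layout below are assumptions for illustration and do not reproduce the signature or internals of get_total_ibd_between_id_sets().

```python
def total_ibd_between_sets(pair_totals, id_set1, id_set2):
    """Sum total IBD (cM) over pairs with one member in each set.

    pair_totals maps (id1, id2) -> total shared cM; hypothetical layout.
    """
    total = 0.0
    for (id1, id2), total_cm in pair_totals.items():
        forward = id1 in id_set1 and id2 in id_set2
        backward = id2 in id_set1 and id1 in id_set2
        if forward or backward:
            total += total_cm
    return total

# Made-up totals for two small communities.
pair_totals = {
    ("A", "B"): 1700.0,  # within community 1
    ("C", "D"): 900.0,   # within community 2
    ("A", "C"): 40.0,    # between communities
    ("B", "D"): 25.0,    # between communities
}
community1 = {"A", "B"}
community2 = {"C", "D"}
between = total_ibd_between_sets(pair_totals, community1, community2)
within1 = total_ibd_between_sets(pair_totals, community1, community1)
```

Passing the same set twice gives a simple within-community total, so one helper covers both the cohesion and the between-community comparisons.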
From Statistics to Relationship Inference
The PwLogLike Class
At the heart of Bonsai v3's relationship inference capability is the PwLogLike class in the likelihoods.py module. This class implements the statistical models that convert IBD statistics into relationship likelihoods.
Key methods in this class include:
- get_relationship_options(): Generates all possible relationship types up to a specified degree
- get_relationship_log_like(): Computes the log-likelihood of a specific relationship given observed IBD statistics
- get_ll_pedigree_tuple(): Calculates the likelihood of a specific up/down relationship configuration
- get_log_seg_pdf(): Implements the statistical distribution models for segment counts and lengths
The class incorporates sophisticated statistical models for different aspects of IBD sharing:
- Poisson models for the expected number of segments
- Exponential distribution models for segment lengths
- Beta distribution models for IBD2 proportions
- Gaussian approximations for the total amount of sharing
These models are calibrated using empirical data from known relationships, ensuring that the likelihood calculations accurately reflect the biological reality of genetic inheritance.
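As an illustration of the first model type, the sketch below scores an observed segment count under a Poisson model for a few relationships. The expected counts are placeholder midpoints of the ranges in the table earlier in this lab, not Bonsai's calibrated parameters, and the function is not part of PwLogLike.

```python
import math

def poisson_log_pmf(k, lam):
    """log P(K = k) for K ~ Poisson(lam), using lgamma for log(k!)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

# Placeholder expected segment counts (midpoints of the table ranges above).
expected_counts = {
    "half siblings": 25.0,
    "first cousins": 20.0,
    "second cousins": 11.5,
}

observed_count = 12
log_likes = {
    rel: poisson_log_pmf(observed_count, lam)
    for rel, lam in expected_counts.items()
}
best = max(log_likes, key=log_likes.get)
print(best)
```

In a fuller model this count term would be added to the segment length terms, since log-likelihoods of components treated as independent add.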
Relationship Representation with Tuples
Bonsai v3 represents relationships using a compact tuple notation (up, down, num_ancs):
- up: Number of generations from individual 1 to the common ancestor
- down: Number of generations from common ancestor to individual 2
- num_ancs: Number of common ancestors (1 for half relationships, 2 for full)
For example:
- (0, 1, 1): Parent-child (parent is individual 1)
- (1, 1, 2): Full siblings (share both parents)
- (1, 1, 1): Half siblings (share one parent)
- (2, 2, 2): Full first cousins (share both grandparents)
This notation provides a computationally efficient way to represent and manipulate relationship hypotheses. It directly connects to the mathematical models of IBD sharing, where the total meiotic distance up + down determines the expected amount and pattern of sharing.
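The link between a tuple and expected sharing can be made concrete with the standard expectation that each common ancestor contributes a fraction 2^(1 - (up + down)) of the genome. The sketch below assumes a 3400 cM genome, matching the parent-child total in the table earlier in this lab; the function names are illustrative and this is not how get_simple_rel_tuple() or the likelihood code is written.

```python
GENOME_LENGTH_CM = 3400.0  # assumed autosomal map length, as in the table above

def meiotic_distance(rel_tuple):
    """Total number of meioses separating the two individuals."""
    up, down, _num_ancs = rel_tuple
    return up + down

def expected_total_ibd_cm(rel_tuple, genome_cm=GENOME_LENGTH_CM):
    """Expected sharing: genome_cm * num_ancs * 2**(1 - (up + down)).

    For full siblings this counts IBD2 regions twice (IBD1 + 2 * IBD2).
    """
    up, down, num_ancs = rel_tuple
    return genome_cm * num_ancs * 2.0 ** (1 - (up + down))

examples = {
    "parent-child": (0, 1, 1),
    "half siblings": (1, 1, 1),
    "first cousins": (2, 2, 2),
    "second cousins": (3, 3, 2),
}
for name, rel in examples.items():
    print(name, meiotic_distance(rel), expected_total_ibd_cm(rel))
```

These values reproduce the approximate totals in the relationship table (3400, 1700, 850, and about 212 cM).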
The function get_simple_rel_tuple() in pedigrees.py calculates these relationship tuples directly from pedigree structures, providing a bridge between the graph-based representation of pedigrees and the statistical models of IBD sharing.
Combining Multiple Sources of Evidence
A key innovation in Bonsai v3 is its ability to combine multiple sources of evidence for relationship inference:
- IBD1 Statistics: Total amount and segment count of half-identical regions
- IBD2 Statistics: Presence and amount of fully identical regions
- Segment Length Distribution: Pattern of segment lengths, especially for distinguishing relationships with similar total IBD
- Age Information: When available, age differences provide additional constraints on possible relationships
- Sex Information: Constraints on relationship types based on biological sex
The get_relationship_log_like() method combines these different sources into a single log-likelihood score for each relationship hypothesis. The method implements a weighting scheme that accounts for the reliability of different evidence types and their correlations.
This multi-evidence approach is particularly powerful for resolving ambiguous cases where a single statistic might be consistent with multiple relationship types. For example, half-siblings, grandparent-grandchild, and avuncular relationships all share approximately 25% of their DNA, but they can be distinguished using segment length distributions and age information.
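A simplified sketch of this kind of disambiguation is shown below. It assumes the genetic and age evidence are independent, so their log-likelihoods simply add, which is cruder than the weighting described above. All numbers are placeholders: the genetic scores are invented to be nearly tied, and the age-difference models are hypothetical Gaussians, not Bonsai's age model.

```python
import math

def gaussian_log_pdf(x, mean, sd):
    """Log density of a normal distribution with the given mean and sd."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mean) ** 2 / (2.0 * sd * sd)

# Hypothetical age-difference models in years: (mean, standard deviation).
age_models = {
    "half siblings": (0.0, 8.0),
    "avuncular": (25.0, 12.0),
    "grandparent-grandchild": (55.0, 12.0),
}

# Placeholder genetic log-likelihoods, nearly tied because all three
# relationships share about 25% of their DNA.
genetic_log_like = {
    "half siblings": -10.2,
    "avuncular": -10.0,
    "grandparent-grandchild": -10.1,
}

def combined_log_like(rel, age_diff_years):
    """Add the genetic and age log-likelihoods (independence assumed)."""
    mean, sd = age_models[rel]
    return genetic_log_like[rel] + gaussian_log_pdf(age_diff_years, mean, sd)

age_diff = 58.0  # observed age difference in years (made up)
best = max(age_models, key=lambda rel: combined_log_like(rel, age_diff))
print(best)
```

With the genetic scores this close, the age term decides the outcome, which is the situation the paragraph above describes.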
Mathematical Foundation: Bonsai v3's approach to IBD statistics analysis demonstrates how complex biological processes can be modeled with mathematical precision. By combining population genetics theory with empirical calibration, the system transforms raw genetic data into meaningful relationship assessments that account for the inherent randomness in genetic inheritance.
Comparing Notebook and Production Code
The Lab04 notebook provides simplified implementations of IBD statistics extraction and analysis, while the actual Bonsai v3 implementation includes numerous additional capabilities:
- Optimized Algorithms: The production code includes highly optimized algorithms for processing millions of IBD segments efficiently
- Caching: Sophisticated caching mechanisms avoid redundant computation of frequently accessed statistics
- Error Handling: Robust handling of edge cases like noisy data, missing segments, and detector artifacts
- Background IBD: Adjustments for population-specific background levels of IBD sharing
- Calibration: The statistical models are calibrated on large datasets of confirmed relationships for different populations
- Advanced Network Analysis: More sophisticated community detection algorithms adapted for genetic data
Despite these differences, the core concepts and approaches demonstrated in the notebook directly correspond to those used in the production system, providing an accurate foundation for understanding how Bonsai extracts and leverages IBD statistics.
Interactive Lab Environment
Run the interactive Lab 04 notebook in Google Colab for a hosted computing environment.
Data will be automatically downloaded from S3 when you run the notebook.
Note: You may need a Google account to save your work in Google Drive.
Beyond the Code
As you explore IBD statistics extraction and analysis in Bonsai v3, consider these broader implications:
- Statistical Thinking: How probability theory provides a framework for handling uncertainty in biological systems
- Data Representation: The importance of choosing appropriate data structures for efficient computation
- Evidence Integration: How multiple sources of evidence can be combined to improve inference accuracy
- Network Science: The application of graph theory to model complex biological relationships
These considerations highlight how IBD statistics analysis in Bonsai v3 represents a sophisticated application of both computational and statistical principles to a complex biological problem.
This lab is part of the Bonsai v3 Deep Dive track.