Lab 04: IBD Statistics Extraction | Computational Genetic Genealogy

Lab 04: IBD Statistics Extraction and Analysis

Data Analysis: This lab explores how Bonsai v3 extracts and analyzes statistical patterns from IBD segments. Understanding these statistical measures is crucial for relationship inference and pedigree reconstruction.

Core IBD Statistics

The Five Fundamental Metrics

Bonsai v3 extracts five core statistics from IBD segment data, which serve as the foundation for all relationship inference:

Total Half-IBD (IBD1): The total genetic length (in centiMorgans) of segments where individuals share exactly one allele
Total Full-IBD (IBD2): The total genetic length (in centiMorgans) of segments where individuals share both alleles
Number of Half-IBD Segments: Count of distinct IBD1 segments detected
Number of Full-IBD Segments: Count of distinct IBD2 segments detected
Maximum Segment Length: Length (in centiMorgans) of the longest shared IBD segment

These statistics are calculated for every pair of individuals who share IBD segments. In the Bonsai v3 codebase, this calculation is performed by the get_ibd_stats_unphased() function in the ibd.py module. This function processes raw IBD segments to compute these summary statistics efficiently, handling edge cases like overlapping segments and chromosome boundaries.
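As a rough illustration of the five statistics listed above, the following minimal sketch computes them for one pair. It is not the code of get_ibd_stats_unphased(); the Segment record and the pair_stats() helper are hypothetical names introduced here, and the sketch assumes segments are already non-overlapping and labeled with an IBD type of 1 or 2.

```python
from collections import namedtuple

# Hypothetical segment record; the real Bonsai input format may differ.
Segment = namedtuple("Segment", "id1 id2 chrom start_cm end_cm ibd_type")


def pair_stats(segments):
    """Return the five summary statistics for one pair's segments."""
    stats = {
        "total_half": 0.0,  # total IBD1 length in cM
        "total_full": 0.0,  # total IBD2 length in cM
        "num_half": 0,      # number of IBD1 segments
        "num_full": 0,      # number of IBD2 segments
        "max_seg_cm": 0.0,  # longest segment of either type
    }
    for seg in segments:
        length = seg.end_cm - seg.start_cm
        if seg.ibd_type == 1:
            stats["total_half"] += length
            stats["num_half"] += 1
        elif seg.ibd_type == 2:
            stats["total_full"] += length
            stats["num_full"] += 1
        else:
            continue  # ignore records with an unknown IBD type
        stats["max_seg_cm"] = max(stats["max_seg_cm"], length)
    return stats


example = [
    Segment("A", "B", 1, 10.0, 60.0, 1),
    Segment("A", "B", 1, 80.0, 95.0, 2),
    Segment("A", "B", 2, 0.0, 30.0, 1),
]
result = pair_stats(example)
```

A production implementation would also need to merge overlapping segments before summing, which this sketch deliberately leaves out.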

The IBDIndex Class

Bonsai v3 implements an IBDIndex class to efficiently store and retrieve IBD statistics and segments. This class provides a critical layer of abstraction between raw IBD data and the statistical models used for relationship inference.

Key operations supported by the IBDIndex class include:

get_stats_for_pair(id1, id2): Retrieves computed statistics for a specific pair of individuals
get_segments_for_pair(id1, id2): Returns all IBD segments shared between two individuals
get_all_pairs(): Retrieves all pairs of individuals with detected IBD sharing
get_total_ibd_between_id_sets(id_set1, id_set2): Calculates total IBD sharing between two sets of individuals

The class uses hash-table-based data structures so that most lookups run in O(1) time, which matters when a dataset contains thousands of individuals and millions of IBD segments. It also caches computed results to avoid redundant work, which is particularly important for operations that are called repeatedly during pedigree construction.
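The idea of a pair-keyed hash table can be sketched as follows. SimpleIBDIndex is a hypothetical stand-in written for this lab, not Bonsai's IBDIndex, and it assumes segments arrive as (id1, id2, length_cm) tuples with string IDs.

```python
from collections import defaultdict


class SimpleIBDIndex:
    """Hypothetical pair-keyed index; not Bonsai's IBDIndex."""

    def __init__(self, segments):
        # Assumed input format: iterable of (id1, id2, length_cm) tuples.
        self._segments = defaultdict(list)
        for id1, id2, length_cm in segments:
            self._segments[self._key(id1, id2)].append(length_cm)

    @staticmethod
    def _key(id1, id2):
        # Sort the IDs so (A, B) and (B, A) map to the same dictionary key.
        return tuple(sorted((id1, id2)))

    def get_segments_for_pair(self, id1, id2):
        # Average-case O(1) dictionary lookup; returns a copy.
        return list(self._segments.get(self._key(id1, id2), []))

    def get_all_pairs(self):
        return list(self._segments)

    def total_for_pair(self, id1, id2):
        return sum(self.get_segments_for_pair(id1, id2))
```

Normalizing the key order is the main design choice here: callers never need to know which ID was listed first in the raw data.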

Relationship-Specific IBD Patterns

The five core IBD statistics form distinctive patterns for different relationship types. Bonsai v3 leverages these patterns for relationship inference:

Relationship	Total IBD1 (cM)	Total IBD2 (cM)	IBD1 Segments	IBD2 Segments	Max Segment (cM)
Parent-Child	~3400	0	~22-35	0	~180-280
Full Siblings	~1700	~850	~35-50	~10-20	~120-180
Half-Siblings	~1700	0	~20-30	0	~100-140
First Cousins	~850	0	~15-25	0	~70-100
Second Cousins	~212	0	~8-15	0	~40-70

These patterns arise from the biological mechanisms of genetic inheritance. For example, a parent and child share exactly one chromosome of each pair, so they are IBD1 across essentially the whole genome (approximately 3400 cM) with no IBD2. Full siblings also share on average 50% of their DNA, but unlike parent-child pairs this includes regions where both chromosomes are shared: on average about half of the genome is IBD1 (~1700 cM) and about a quarter is IBD2 (~850 cM).

Critically, Bonsai v3 doesn't just rely on these average values but models the complete statistical distributions of these metrics for each relationship type, accounting for variance due to the stochastic nature of recombination.

Segment Length Distributions

The Importance of Segment Lengths

Beyond the five core statistics, Bonsai v3 analyzes the complete distributions of IBD segment lengths, which provide additional information for relationship inference:

Close Relationships: Typically have longer segments due to fewer recombination events separating individuals
Distant Relationships: Have progressively shorter segments as additional generations introduce more recombination points
Complex Relationships: Often show bimodal or unusual distributions that can help identify unique family structures

The segment length distribution follows an exponential decay pattern, where the rate parameter is directly related to the meiotic distance between individuals. This mathematical relationship provides a powerful tool for distinguishing relationship types that might have similar total IBD amounts but different ancestral paths.

Mathematical Modeling of Segment Lengths

In Bonsai v3, segment length distributions are modeled mathematically using a modified exponential distribution. For a relationship with meiotic distance d (the total number of meioses separating two individuals through their common ancestor), the probability density function for segment length L is approximately:

$$f(L) = \frac{d}{100}\, e^{-dL/100}$$

This means:

The expected segment length is inversely proportional to meiotic distance (100/d cM under this model)
The probability of finding very long segments decreases exponentially with relationship distance
The segment length distribution is a key parameter in Bonsai's likelihood models

In the actual implementation, Bonsai v3's likelihoods.py module includes sophisticated functions like get_lam_a_m() that implement these mathematical models with empirically calibrated parameters. The models are adjusted for factors like chromosome-specific recombination rates and population background.
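The simple exponential model above can be evaluated directly. The sketch below is only the textbook form of f(L) with L in centiMorgans, not the calibrated model in likelihoods.py; the function names are introduced here for illustration.

```python
import math


def segment_length_pdf(length_cm, d):
    """Density f(L) = (d/100) * exp(-d * L / 100), with L in cM."""
    rate = d / 100.0
    return rate * math.exp(-rate * length_cm)


def expected_segment_length(d):
    """Mean of the exponential distribution: 100 / d cM."""
    return 100.0 / d


def prob_longer_than(length_cm, d):
    """Survival function P(L > length_cm) = exp(-d * length_cm / 100)."""
    return math.exp(-d * length_cm / 100.0)
```

For example, under this model a pair separated by d = 4 meioses has an expected segment length of 25 cM, while d = 10 gives 10 cM.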

Length Distribution Analysis in Practice

The IBDIndex class in Bonsai v3 provides methods to access the complete segment length distributions for analysis:

def get_segment_length_distribution(self, id1, id2, ibd_type=None):
    """Get the distribution of segment lengths for a specific pair.
    
    Args:
        id1, id2: Individual IDs
        ibd_type: Optional, 1 for IBD1-only, 2 for IBD2-only, None for all
        
    Returns:
        List of segment lengths in cM
    """

This function is used by the likelihood models to compute the probability of observed segment length distributions under different relationship hypotheses. It handles various edge cases like missing or corrupt segment data, and can filter by IBD type to analyze IBD1 and IBD2 patterns separately.
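A standalone sketch of the same idea, filtering by IBD type and skipping malformed records, might look like the following. The segment_lengths() helper and its (start_cm, end_cm, seg_type) input format are assumptions made for this example; the actual method body is not shown in this lab.

```python
def segment_lengths(segments, ibd_type=None):
    """Return segment lengths in cM, longest first.

    segments: iterable of (start_cm, end_cm, seg_type) tuples (assumed format).
    ibd_type: 1 for IBD1 only, 2 for IBD2 only, None for all segments.
    """
    lengths = []
    for start_cm, end_cm, seg_type in segments:
        if ibd_type is not None and seg_type != ibd_type:
            continue
        length = end_cm - start_cm
        if length > 0:  # skip malformed records with non-positive length
            lengths.append(length)
    return sorted(lengths, reverse=True)
```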

When visualized, these distributions reveal characteristic patterns:

Parent-child relationships show a relatively flat distribution with many long segments
Sibling relationships show a mixture of long and medium segments
Cousin relationships show distributions heavily skewed toward shorter segments
Very distant relationships (3rd cousins and beyond) primarily show segments below 20 cM

These visual patterns provide intuitive confirmation of the mathematical models implemented in Bonsai's codebase.

IBD Network Analysis

Building IBD Networks

Bonsai v3 constructs IBD networks as an intermediate step in pedigree reconstruction. These networks represent individuals as nodes and IBD sharing as weighted edges, providing a powerful visual and computational representation of genetic relatedness.

The network construction process in Bonsai includes:

Node Creation: Each individual in the dataset becomes a node in the network
Edge Creation: Edges are created between pairs with IBD sharing above a threshold (typically 7-10 cM)
Edge Weighting: Each edge is weighted by the total amount of IBD sharing (IBD1 + IBD2)
Edge Annotation: Additional attributes like IBD statistics are stored on edges for later analysis

The get_id_to_shared_ibd() function in ibd.py is a key component of this process, creating the necessary data structures for efficient network construction. The resulting networks capture the complex patterns of genetic sharing across multiple individuals, allowing for group-level analysis that goes beyond pairwise comparisons.
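The four construction steps can be sketched with a plain adjacency dictionary. This is an illustrative assumption, not the structure returned by get_id_to_shared_ibd(); build_ibd_network() and its pair_totals input are hypothetical names.

```python
def build_ibd_network(pair_totals, min_cm=7.0):
    """Build an undirected weighted graph as a dict of dicts.

    pair_totals: dict mapping (id1, id2) -> total IBD (IBD1 + IBD2) in cM.
    min_cm: pairs sharing less than this are left unconnected.
    """
    graph = {}
    for (id1, id2), total_cm in pair_totals.items():
        # Node creation: every individual appears, even without edges.
        graph.setdefault(id1, {})
        graph.setdefault(id2, {})
        # Edge creation and weighting: store the weight in both directions.
        if total_cm >= min_cm:
            graph[id1][id2] = total_cm
            graph[id2][id1] = total_cm
    return graph
```

Richer edge annotations could be stored by replacing the float weight with a dictionary of statistics per edge.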

Community Detection Algorithms

A critical application of IBD networks in Bonsai v3 is the identification of related individuals using community detection algorithms. These algorithms identify clusters of densely connected nodes, which correspond to groups of related individuals.

Bonsai implements several community detection approaches:

Louvain Method: A hierarchical clustering algorithm that optimizes modularity
Connected Components: A simple approach for identifying completely separate networks
Edge Filtering: Progressive removal of weak edges to reveal core family structures

The implementation in get_next_node() and related functions uses these community detection results to prioritize which relationships to analyze first during pedigree reconstruction. This approach provides two major advantages:

It reduces the computational complexity by processing related groups together
It improves accuracy by ensuring that closely related individuals are placed in consistent positions

The community detection component allows Bonsai to scale efficiently to large datasets with complex family structures, breaking down the global optimization problem into more manageable subproblems.
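Of the approaches listed above, connected components combined with edge filtering is the simplest to sketch. The breadth-first search below is a generic graph routine, not Bonsai's implementation, and it assumes the graph is a dict mapping each node to a {neighbor: weight} dict.

```python
from collections import deque


def connected_components(graph, min_weight=0.0):
    """Return a list of node sets, ignoring edges lighter than min_weight."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for neighbor, weight in graph[node].items():
                if weight >= min_weight and neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components
```

Raising min_weight step by step mimics progressive edge filtering: weakly connected groups split apart and the closest family clusters remain.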

Within-Community vs. Between-Community Analysis

Bonsai v3 performs sophisticated analysis of IBD sharing patterns within and between detected communities:

Within-Community Sharing: Typically higher, representing close family relationships
Between-Community Sharing: Typically lower, representing more distant connections
Bridge Individuals: People with significant connections to multiple communities, often representing important cross-family links

The get_total_ibd_between_id_sets() function in the IBDIndex class specifically supports this type of analysis, calculating the aggregate IBD sharing between groups of individuals. This function is used in several contexts:

Validating community detection results by measuring internal cohesion
Identifying potential merger points between separate pedigrees
Prioritizing which communities to process first during incremental pedigree building

By analyzing both intra- and inter-community IBD patterns, Bonsai can handle complex scenarios like endogamy (marriage within a relatively closed community), which creates unusual patterns of genetic sharing that simpler algorithms would misinterpret.
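Aggregate sharing between two groups reduces to a filtered sum over pairs. The sketch below shows that idea under the assumption that pairwise totals are available as a dict; it is not the body of get_total_ibd_between_id_sets(), and total_ibd_between_sets() is a hypothetical name.

```python
def total_ibd_between_sets(pair_totals, id_set1, id_set2):
    """Sum total IBD (cM) over pairs with one member in each set.

    pair_totals: dict mapping (id1, id2) -> total IBD in cM (assumed format).
    """
    total = 0.0
    for (id1, id2), cm in pair_totals.items():
        crosses = (id1 in id_set1 and id2 in id_set2) or (
            id1 in id_set2 and id2 in id_set1
        )
        if crosses:
            total += cm
    return total
```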

From Statistics to Relationship Inference

The PwLogLike Class

At the heart of Bonsai v3's relationship inference capability is the PwLogLike class in the likelihoods.py module. This class implements the statistical models that convert IBD statistics into relationship likelihoods.

Key methods in this class include:

get_relationship_options(): Generates all possible relationship types up to a specified degree
get_relationship_log_like(): Computes the log-likelihood of a specific relationship given observed IBD statistics
get_ll_pedigree_tuple(): Calculates the likelihood of a specific up/down relationship configuration
get_log_seg_pdf(): Implements the statistical distribution models for segment counts and lengths

The class incorporates sophisticated statistical models for different aspects of IBD sharing:

Poisson models for the expected number of segments
Exponential distribution models for segment lengths
Beta distribution models for IBD2 proportions
Gaussian approximations for the total amount of sharing

These models are calibrated using empirical data from known relationships, ensuring that the likelihood calculations accurately reflect the biological reality of genetic inheritance.
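To make the first two model components concrete, the sketch below combines a Poisson term for the segment count with an exponential term for each segment length. It is a simplified illustration, not the PwLogLike code, and the expected_count and mean_length_cm values in the example are made-up numbers rather than calibrated parameters.

```python
import math


def log_likelihood(segment_lengths_cm, expected_count, mean_length_cm):
    """Log-likelihood of observed segments under a simple two-part model."""
    n = len(segment_lengths_cm)
    # Poisson log-pmf for observing n segments.
    log_count = n * math.log(expected_count) - expected_count - math.lgamma(n + 1)
    # Exponential log-pdf for each segment length.
    log_lengths = sum(
        -math.log(mean_length_cm) - length / mean_length_cm
        for length in segment_lengths_cm
    )
    return log_count + log_lengths


observed = [30.0, 22.0, 41.0, 18.0]
# Illustrative, uncalibrated parameters for two competing hypotheses.
ll_closer = log_likelihood(observed, expected_count=4.0, mean_length_cm=28.0)
ll_distant = log_likelihood(observed, expected_count=1.0, mean_length_cm=10.0)
```

With these example numbers the closer hypothesis receives the higher log-likelihood, because four segments averaging about 28 cM are unlikely when only one short segment is expected.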

Relationship Representation with Tuples

Bonsai v3 represents relationships using a compact tuple notation (up, down, num_ancs):

up: Number of generations from individual 1 to the common ancestor
down: Number of generations from common ancestor to individual 2
num_ancs: Number of common ancestors (1 for half relationships, 2 for full)

For example:

(0, 1, 1): Parent-child (individual 1 is the parent)
(1, 1, 2): Full siblings (share both parents)
(1, 1, 1): Half siblings (share one parent)
(2, 2, 2): Full first cousins (share two grandparents)

This notation provides a computationally efficient way to represent and manipulate relationship hypotheses. It directly connects to the mathematical models of IBD sharing, where the total meiotic distance up + down determines the expected amount and pattern of sharing.

The function get_simple_rel_tuple() in pedigrees.py calculates these relationship tuples directly from pedigree structures, providing a critical bridge between the graph-based representation of pedigrees and the statistical models of IBD sharing.
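The link between a tuple and expected sharing can be shown with the standard coefficient of relatedness, r = num_ancs · (1/2)^(up + down). The helper names below are introduced for this example and are not functions in pedigrees.py.

```python
def meiotic_distance(rel_tuple):
    """Total number of meioses separating the two individuals."""
    up, down, _num_ancs = rel_tuple
    return up + down


def expected_shared_fraction(rel_tuple):
    """Average fraction of DNA shared (coefficient of relatedness)."""
    up, down, num_ancs = rel_tuple
    return num_ancs * 0.5 ** (up + down)
```

For the examples above this gives 0.5 for parent-child and full siblings, 0.25 for half siblings, and 0.125 for full first cousins. Note that this is the average fraction of DNA shared, which is not the same as the fraction of the genome covered by IBD segments.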

Combining Multiple Sources of Evidence

A key innovation in Bonsai v3 is its ability to combine multiple sources of evidence for relationship inference:

IBD1 Statistics: Total amount and segment count of half-identical regions
IBD2 Statistics: Presence and amount of fully identical regions
Segment Length Distribution: Pattern of segment lengths, especially for distinguishing relationships with similar total IBD
Age Information: When available, age differences provide additional constraints on possible relationships
Sex Information: Constraints on relationship types based on biological sex

The get_relationship_log_like() method combines these different sources into a single log-likelihood score for each relationship hypothesis. The method implements a sophisticated weighting scheme that accounts for the reliability of different evidence types and their correlations.

This multi-evidence approach is particularly powerful for resolving ambiguous cases where a single statistic might be consistent with multiple relationship types. For example, half-siblings, grandparent-grandchild, and avuncular relationships all share approximately 25% of their DNA, but they can be distinguished using segment length distributions and age information.
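The simplest way to combine evidence is to add log-likelihood terms, which assumes the sources are independent. The sketch below uses that assumption and made-up numbers; it does not reproduce the weighting used by get_relationship_log_like(), and both helper names are hypothetical.

```python
def combine_log_likelihoods(evidence):
    """Sum per-source log-likelihood terms for each relationship hypothesis.

    evidence: dict mapping relationship name -> list of log-likelihood terms.
    """
    return {rel: sum(terms) for rel, terms in evidence.items()}


def best_relationship(evidence):
    """Return the hypothesis with the highest combined log-likelihood."""
    totals = combine_log_likelihoods(evidence)
    return max(totals, key=totals.get)


# Illustrative values: identical genetic term, different age term
# for a pair born three years apart.
evidence = {
    "half-siblings": [-10.0, -1.0],
    "grandparent-grandchild": [-10.0, -6.0],
}
best = best_relationship(evidence)
```

Here the genetic evidence alone cannot separate the two hypotheses, and the age term tips the decision toward half-siblings.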

Mathematical Foundation: Bonsai v3's approach to IBD statistics analysis demonstrates how complex biological processes can be modeled with mathematical precision. By combining population genetics theory with empirical calibration, the system transforms raw genetic data into meaningful relationship assessments that account for the inherent randomness in genetic inheritance.

Comparing Notebook and Production Code

The Lab04 notebook provides simplified implementations of IBD statistics extraction and analysis, while the actual Bonsai v3 implementation includes numerous additional capabilities:

Optimized Algorithms: The production code includes highly optimized algorithms for processing millions of IBD segments efficiently
Caching: Sophisticated caching mechanisms avoid redundant computation of frequently accessed statistics
Error Handling: Robust handling of edge cases like noisy data, missing segments, and detector artifacts
Background IBD: Adjustments for population-specific background levels of IBD sharing
Calibration: The statistical models are calibrated on large datasets of confirmed relationships for different populations
Advanced Network Analysis: More sophisticated community detection algorithms adapted for genetic data

Despite these differences, the core concepts and approaches demonstrated in the notebook directly correspond to those used in the production system, providing an accurate foundation for understanding how Bonsai extracts and leverages IBD statistics.

Interactive Lab Environment

Run the interactive Lab 04 notebook in Google Colab:

Google Colab Environment

Run the notebook in Google Colab for a powerful computing environment with access to Google's resources.

Data will be automatically downloaded from S3 when you run the notebook.

Note: You may need a Google account to save your work in Google Drive.


Beyond the Code

As you explore IBD statistics extraction and analysis in Bonsai v3, consider these broader implications:

Statistical Thinking: How probability theory provides a framework for handling uncertainty in biological systems
Data Representation: The importance of choosing appropriate data structures for efficient computation
Evidence Integration: How multiple sources of evidence can be combined to improve inference accuracy
Network Science: The application of graph theory to model complex biological relationships

These considerations highlight how IBD statistics analysis in Bonsai v3 represents a sophisticated application of both computational and statistical principles to a complex biological problem.
