Lab 01: Introduction to Genetic Genealogy and IBD Concepts
Foundational Concepts: This lab introduces the core biological and computational principles of Identity-by-Descent (IBD) that underpin the Bonsai v3 system. Understanding these foundational concepts is essential for mastering Bonsai's approach to pedigree reconstruction.
Biological Foundations
DNA Inheritance and Genetic Recombination
Genetic genealogy rests on the precise biological mechanisms of DNA inheritance. Each person inherits 50% of their autosomal DNA from each parent, but the specific segments inherited are determined by recombination—the process of genetic exchange during meiosis.
Key principles of DNA inheritance in Bonsai:
- Dilution Over Generations: The amount of DNA shared with an ancestor is approximately halved each generation back, creating a predictable decay pattern
- Random Assortment: Which segments are inherited follows stochastic patterns that create variance in actual sharing
- Recombination Hotspots: Certain genomic regions have higher recombination rates, affecting the distribution of segment lengths
- Genetic Distance vs. Physical Distance: Bonsai works with genetic distance (centiMorgans) rather than physical distance (base pairs) to properly model inheritance
Bonsai v3 encodes these biological principles in statistical models of recombination patterns, inheritance probabilities, and the randomness of genetic transmission.
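The dilution principle can be written down directly. The sketch below is illustrative only: the function name is hypothetical and is not taken from the Bonsai codebase, and it returns the expected value, not the realized sharing, which varies around it.

```python
def expected_ancestor_fraction(generations_back: int) -> float:
    """Expected fraction of autosomal DNA inherited from one ancestor.

    Hypothetical helper for illustration: the expectation halves with
    every generation separating a person from the ancestor.
    """
    if generations_back < 0:
        raise ValueError("generations_back must be non-negative")
    return 0.5 ** generations_back


labels = ["self", "parent", "grandparent", "great-grandparent"]
for generations_back, label in enumerate(labels):
    fraction = expected_ancestor_fraction(generations_back)
    print(f"{label}: {fraction:.1%} expected")
```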
Identity by Descent (IBD)
Identity by Descent (IBD) is the fundamental unit of genetic relatedness in Bonsai. Two individuals share an IBD segment when they have inherited the exact same chromosomal segment from a common ancestor. IBD segments are the genetic "fingerprints" of relatedness that Bonsai analyzes to reconstruct family relationships.
Critical distinctions in IBD analysis:
- IBD1 (Half-Identical Regions): DNA segments where two individuals share one of their two copies of a chromosome region; present in every genetic relationship
- IBD2 (Fully Identical Regions): DNA segments where two individuals share both copies of a chromosome region; indicative of very close relationships such as full siblings
- IBS (Identity by State): Segments that appear identical by chance but are not from a recent common ancestor—a source of potential false positives
Bonsai v3's architecture is designed around detecting, processing, and analyzing these IBD segments to extract relationship information. The system processes raw IBD segments from detectors like IBIS, Refined IBD, or HapIBD, applying sophisticated statistical models to distinguish true IBD from potential artifacts.
Measuring IBD: The CentiMorgan Scale
Bonsai quantifies IBD segments using centiMorgans (cM), a unit of genetic distance that accounts for recombination rates across the genome. Unlike physical measures (base pairs), centiMorgans provide a more accurate representation of inheritance patterns.
Key properties of the centiMorgan scale in Bonsai v3:
- The human genome is approximately 3400 cM in total length
- Two loci 1 cM apart have approximately a 1% chance of being separated by a crossover in a single meiosis
- cM distances vary across the genome based on recombination hotspots
- Segments under ~7 cM are generally considered less reliable for relationship inference
In the Bonsai v3 implementation, IBD segment lengths in cM directly inform statistical models of relationship likelihood. Longer segments generally indicate closer relationships, as they have had fewer generations in which recombination could break them up.
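The link between segment length and relationship distance can be sketched with a standard simplification: crossovers fall as a Poisson process at a rate of one per 100 cM per meiosis (no interference). Under that assumption, the chance that a stretch survives intact falls exponentially with its length and with the number of meioses. The function names are hypothetical, and this is not presented as the model Bonsai uses.

```python
import math


def prob_segment_unbroken(length_cm: float, meioses: int) -> float:
    """Probability that no crossover lands inside a stretch of `length_cm`
    across `meioses` independent meioses (Poisson crossover assumption)."""
    return math.exp(-meioses * length_cm / 100.0)


def mean_segment_length_cm(meioses: int) -> float:
    """Approximate mean IBD segment length for a pair separated by `meioses`
    meioses, ignoring chromosome ends and detection thresholds."""
    return 100.0 / meioses


for meioses in (2, 4, 6):
    print(
        f"{meioses} meioses: mean length ~{mean_segment_length_cm(meioses):.1f} cM, "
        f"P(20 cM stretch unbroken) = {prob_segment_unbroken(20.0, meioses):.3f}"
    )
```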
IBD Patterns in Relationships
Relationship-Specific IBD Signatures
Different relationship types exhibit characteristic patterns of IBD sharing that Bonsai v3 uses to distinguish between them. These patterns encompass total shared IBD, distribution of segment lengths, IBD1 vs. IBD2 ratios, and segment locations.
Key relationship signatures in Bonsai v3:
- Parent-Child: ~50% total IBD, exclusively IBD1, in segments that span essentially the full length of each autosome
- Full Siblings: ~50% total IBD, but with IBD2 regions (approximately 25% of the genome) and a more variable distribution
- Half-Siblings/Avuncular/Grandparental: ~25% total IBD, exclusively IBD1, distinguishable from one another mainly by age differences
- First Cousins: ~12.5% total IBD, exclusively IBD1, with shorter average segment lengths
- Second Cousins: ~3.125% total IBD, exclusively IBD1, with even shorter segments
Bonsai v3 leverages these distinctive patterns not just through the average expected sharing, but by modeling the complete distribution of sharing patterns for each relationship type. This allows it to handle the inherent stochasticity in genetic inheritance.
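The average percentages in the list above follow from a simple path-counting rule: each shared ancestor contributes one half raised to the number of meioses on the path connecting the two relatives. The table and function below are a minimal sketch with hypothetical names. They reproduce only the averages, not the full distributions that Bonsai models.

```python
# (number of shared ancestors, meioses on the path between the pair through
# each ancestor). For direct-line relationships the "shared ancestor" is the
# older relative themself.
RELATIONSHIP_PATHS = {
    "parent-child": (1, 1),
    "full siblings": (2, 2),
    "half-siblings": (1, 2),
    "grandparent-grandchild": (1, 2),
    "avuncular": (2, 3),
    "first cousins": (2, 4),
    "second cousins": (2, 6),
}


def expected_shared_fraction(num_ancestors: int, meioses: int) -> float:
    """Expected fraction of autosomal DNA shared by the pair."""
    return num_ancestors * 0.5 ** meioses


for name, (num_ancestors, meioses) in RELATIONSHIP_PATHS.items():
    fraction = expected_shared_fraction(num_ancestors, meioses)
    print(f"{name}: {fraction:.3%}")
```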
Statistical Variance in IBD Sharing
Actual IBD sharing between relatives exhibits substantial variance due to the random nature of recombination and inheritance. Bonsai v3 explicitly models this variance to provide accurate relationship probability calculations.
Sources of variance handled by Bonsai:
- Mendelian Randomness: The random assortment of chromosomes during meiosis
- Recombination Variance: The stochastic nature of crossover points
- Detection Artifacts: False positives and false negatives in IBD detection
- Population Background: Baseline levels of sharing in different populations
The Bonsai v3 codebase accounts for this variance in the likelihoods.py module, which implements probability distributions for IBD sharing that reflect the variance observed in real data.
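A small Monte Carlo experiment shows how much realized sharing can drift from its expectation. The sketch below simulates the fraction of a grandchild's autosomal DNA that comes from one specific grandparent. It rests on loud simplifications that are assumptions of this example, not properties of Bonsai: 22 chromosomes of equal genetic length summing to roughly 3400 cM, and crossovers placed as a Poisson process with no interference. It covers only the first two sources of variance listed above.

```python
import random
import statistics

NUM_CHROMOSOMES = 22
# Simplification: equal-length chromosomes summing to ~34 Morgans (3400 cM).
CHROM_LENGTH_MORGANS = 34.0 / NUM_CHROMOSOMES


def grandparent_fraction_one_chromosome(rng: random.Random, length: float) -> float:
    """Fraction of one transmitted chromosome that traces to a chosen grandparent."""
    from_grandparent = rng.random() < 0.5  # which grandparental copy the gamete starts on
    position = 0.0
    inherited = 0.0
    while True:
        # Distance to the next crossover: exponential with mean 1 Morgan.
        end = min(position + rng.expovariate(1.0), length)
        if from_grandparent:
            inherited += end - position
        if end >= length:
            break
        position = end
        from_grandparent = not from_grandparent  # a crossover switches the source copy
    return inherited / length


def simulate_grandparent_sharing(rng: random.Random) -> float:
    """Fraction of a grandchild's autosomal DNA from one specific grandparent."""
    per_chromosome = [
        grandparent_fraction_one_chromosome(rng, CHROM_LENGTH_MORGANS)
        for _ in range(NUM_CHROMOSOMES)
    ]
    # The parent's gamete supplies half of the grandchild's autosomal DNA.
    return 0.5 * sum(per_chromosome) / NUM_CHROMOSOMES


rng = random.Random(42)
samples = [simulate_grandparent_sharing(rng) for _ in range(2000)]
mean_sharing = statistics.mean(samples)
sd_sharing = statistics.stdev(samples)
print(f"mean = {mean_sharing:.3f}, sd = {sd_sharing:.3f}")
```

The mean should sit close to the 25% expectation, while the spread across simulated grandchildren shows why a single observed percentage cannot pin down a relationship on its own.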
Bonsai's Approach to Pedigree Reconstruction
The Up-Node Dictionary Data Structure
At the heart of Bonsai v3's implementation is the "up-node dictionary"—an efficient data structure for representing and manipulating pedigrees. This structure encodes parent-child relationships in a directed graph format that facilitates rapid computation of genetic relationships.
Structure and implementation:
- Each individual is represented by a unique ID (positive for observed individuals, negative for inferred/latent individuals)
- Each individual maps to a dictionary of their parents, with values indicating the relationship type
- Founders (individuals with no parents in the dataset) map to empty dictionaries
- This structure enables efficient traversal, relationship determination, and pedigree operations
The up-node dictionary is implemented in the pedigrees.py module, which provides a comprehensive set of functions for manipulating these structures, calculating relationships between individuals, finding common ancestors, and combining sub-pedigrees.
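The structure described above can be illustrated with a plain nested dictionary. Everything in this sketch is hypothetical: the example pedigree, the use of the value 1 to mark a direct parent link, and the helper names, which are not the functions exported by pedigrees.py.

```python
# Observed individuals have positive IDs; latent (inferred) ones are negative.
# Assumption for this example: the value 1 marks a direct parent link.
up_node_dict = {
    1: {-1: 1, -2: 1},  # individual 1 has two latent parents
    2: {-1: 1, -2: 1},  # individual 2 is a full sibling of 1
    3: {1: 1, -3: 1},   # individual 3 is a child of 1 and latent partner -3
    -1: {},             # founders map to empty dictionaries
    -2: {},
    -3: {},
}


def get_ancestors(up_dict: dict, node: int) -> set:
    """Collect every ancestor of `node` by walking parent links upward."""
    ancestors = set()
    stack = list(up_dict.get(node, {}))
    while stack:
        parent = stack.pop()
        if parent not in ancestors:
            ancestors.add(parent)
            stack.extend(up_dict.get(parent, {}))
    return ancestors


def common_ancestors(up_dict: dict, a: int, b: int) -> set:
    """Ancestors shared by individuals `a` and `b`."""
    return get_ancestors(up_dict, a) & get_ancestors(up_dict, b)


def founders(up_dict: dict) -> set:
    """Individuals with no parents recorded in the pedigree."""
    return {node for node, parents in up_dict.items() if not parents}


print(common_ancestors(up_node_dict, 2, 3))
print(founders(up_node_dict))
```

Here individuals 2 and 3 are an aunt or uncle and a niece or nephew, so their common ancestors are the two latent grandparents of 3.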
Pedigree Reconstruction Workflow
Bonsai v3 implements a sophisticated multi-stage workflow for reconstructing pedigrees from raw IBD data:
- IBD Processing: Raw IBD segments from detectors are loaded, filtered, and normalized
- Pairwise Relationship Inference: The PwLogLike class computes relationship likelihoods between all pairs
- Small Pedigree Construction: High-confidence clusters are assembled into initial sub-pedigrees
- Pedigree Merging: Sub-pedigrees are systematically combined based on IBD evidence
- Incremental Addition: Remaining individuals are added one by one to the growing pedigree
- Optimization: The final pedigree is refined to maximize likelihood and resolve ambiguities
This stepwise approach allows Bonsai to tackle the combinatorial complexity of pedigree reconstruction by breaking it down into more manageable sub-problems. The workflow is primarily implemented in the bonsai.py and connections.py modules.
Statistical Relationship Models
Bonsai v3 employs sophisticated statistical models to evaluate potential relationships based on observed IBD patterns. These models are the mathematical foundation of Bonsai's accuracy in relationship inference.
Key statistical components:
- Segment Length Distributions: Models for the expected distribution of IBD segment lengths for different relationship types
- Segment Count Distributions: Poisson models for the expected number of IBD segments
- Age Difference Models: Probability distributions for age differences in various relationships
- Log-Likelihood Calculation: Combined scoring of genetic and demographic evidence
These statistical models are implemented in the likelihoods.py module, particularly in the PwLogLike class. This class computes log-likelihood scores for different relationship hypotheses, allowing Bonsai to select the most probable relationships.
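To make the idea of a combined log-likelihood concrete, here is a toy scorer that adds a Poisson term for the number of segments to an exponential term for their lengths. It is a sketch under stated assumptions, not the PwLogLike implementation: the hypothesis table, function names, and parameter formulas are hypothetical. It ignores detection thresholds, IBD2, and age evidence, and uses the rough rules that mean segment length is 100/m cM for m meioses and that expected count is IBD1 coverage divided by mean length.

```python
import math

GENOME_CM = 3400.0  # approximate genome length quoted in this lab

# Hypothetical hypotheses: (number of common ancestors, meioses between the pair)
HYPOTHESES = {
    "half_siblings": (1, 2),
    "first_cousins": (2, 4),
    "second_cousins": (2, 6),
}


def segment_model(num_ancestors: int, meioses: int) -> tuple:
    """Return (expected segment count, mean segment length in cM) for the toy model."""
    mean_length = 100.0 / meioses
    ibd1_coverage = num_ancestors * 0.5 ** (meioses - 1)  # fraction of genome in IBD1
    expected_count = ibd1_coverage * GENOME_CM / mean_length
    return expected_count, mean_length


def log_likelihood(segment_lengths_cm: list, num_ancestors: int, meioses: int) -> float:
    """Poisson log-probability of the count plus exponential log-density of the lengths."""
    expected_count, mean_length = segment_model(num_ancestors, meioses)
    n = len(segment_lengths_cm)
    log_count = n * math.log(expected_count) - expected_count - math.lgamma(n + 1)
    log_lengths = sum(-math.log(mean_length) - x / mean_length for x in segment_lengths_cm)
    return log_count + log_lengths


def best_hypothesis(segment_lengths_cm: list) -> tuple:
    """Score every hypothesis and return the best name with all scores."""
    scores = {
        name: log_likelihood(segment_lengths_cm, a, m)
        for name, (a, m) in HYPOTHESES.items()
    }
    return max(scores, key=scores.get), scores


# Made-up observation: 34 segments of 25 cM each.
observed = [25.0] * 34
best, scores = best_hypothesis(observed)
print(best)
for name, score in scores.items():
    print(f"  {name}: {score:.1f}")
```

Working in log space keeps the product of many small probabilities numerically stable and lets independent sources of evidence be combined by addition.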
Implementation Details
The Bonsai v3 Codebase Organization
Bonsai v3 is organized into a modular codebase with distinct components handling different aspects of the pedigree reconstruction process:
- bonsai.py: Core functions for building pedigrees, the main entry points for using the library
- likelihoods.py: Statistical models for relationship inference, including the PwLogLike class
- pedigrees.py: Data structures and functions for manipulating pedigrees (up-node dictionaries)
- ibd.py: Processing and analysis of IBD segments
- connections.py: Functions for connecting and merging pedigrees
- utils.py: General utility functions used throughout the codebase
- twins.py: Specialized handling for twin relationships
- moments.py: Calculation of statistical moments for IBD distributions
- exceptions.py: Custom exception types for error handling
This modular design enables efficient development, testing, and extension of the Bonsai system, while maintaining a clear separation of concerns between different components.
IBD Processing
Bonsai v3 includes tools for processing raw IBD detection output, implemented primarily in the ibd.py module.
Key IBD processing functionalities:
- Loading IBD Segments: Functions for parsing IBD output from various detectors
- Filtering IBD Segments: Removal of short, unreliable, or artifact segments
- Normalizing Phased/Unphased Data: Handling both phased and unphased IBD formats
- Computing Total IBD: Summation of IBD sharing between individual pairs
- IBD Statistics: Calculation of segment counts, average lengths, and other metrics
The ibd.py module provides essential functionality for transforming raw detector output into the standardized format that Bonsai's statistical models require.
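The filtering and summary steps listed above might look like the following in miniature. The segment record layout (two IDs, chromosome, start and end in cM), the 7 cM default threshold, and the function names are assumptions made for this example. The real ibd.py formats and signatures are not shown in this lab.

```python
from collections import defaultdict

# Hypothetical segment records: (id1, id2, chromosome, start_cm, end_cm)
segments = [
    (1, 2, 1, 10.0, 45.0),
    (1, 2, 3, 0.0, 5.0),    # 5 cM: below the threshold, will be dropped
    (2, 1, 7, 20.0, 32.5),  # same pair, listed in the opposite order
    (1, 3, 2, 50.0, 58.0),
]


def filter_segments(segments: list, min_cm: float = 7.0) -> list:
    """Keep only segments at least `min_cm` long."""
    return [s for s in segments if s[4] - s[3] >= min_cm]


def pair_statistics(segments: list) -> dict:
    """Segment count, total cM, and mean cM for each unordered pair."""
    totals = defaultdict(lambda: {"count": 0, "total_cm": 0.0})
    for id1, id2, _chrom, start_cm, end_cm in segments:
        pair = (min(id1, id2), max(id1, id2))  # treat (1, 2) and (2, 1) alike
        totals[pair]["count"] += 1
        totals[pair]["total_cm"] += end_cm - start_cm
    return {
        pair: {**t, "mean_cm": t["total_cm"] / t["count"]}
        for pair, t in totals.items()
    }


kept = filter_segments(segments)
stats = pair_statistics(kept)
for pair, summary in sorted(stats.items()):
    print(pair, summary)
```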
Performance Optimizations
Bonsai v3 incorporates numerous optimizations to handle large datasets efficiently:
- Caching: Implemented in the caching.py module, provides memoization of expensive calculations
- Efficient Data Structures: The up-node dictionary representation enables O(1) access to parental relationships
- Prioritization: Algorithms focus on high-confidence relationships first, reducing the search space
- Incremental Building: Processing individuals in batches reduces computational complexity
- Parallel Processing Support: Critical components support parallel computation
These optimizations allow Bonsai v3 to scale to datasets with thousands of individuals while maintaining reasonable computational performance.
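Memoization, the idea behind the caching bullet, can be demonstrated with the standard library's functools.lru_cache. How caching.py actually implements it is not described here, so treat this as a generic sketch with a hypothetical pedigree and function name.

```python
from functools import lru_cache

# Hypothetical up-node dictionary: 4 -> 3 -> 1 -> founders.
up_node_dict = {
    4: {3: 1, -4: 1},
    3: {1: 1, -3: 1},
    1: {-1: 1, -2: 1},
    -1: {}, -2: {}, -3: {}, -4: {},
}


@lru_cache(maxsize=None)
def generations_to_founder(node: int) -> int:
    """Longest chain of parent links from `node` up to a founder.

    Each parent lookup is a single dictionary access; repeated queries
    for the same node are answered from the cache.
    """
    parents = up_node_dict[node]
    if not parents:
        return 0
    return 1 + max(generations_to_founder(parent) for parent in parents)


depth = generations_to_founder(4)        # computed recursively
depth_again = generations_to_founder(4)  # served from the cache
print(depth, generations_to_founder.cache_info())

# If the pedigree is modified, stale results must be discarded:
# generations_to_founder.cache_clear()
```

A cache keyed on node IDs is only valid while the pedigree is unchanged, which is why the invalidation step matters in a system that keeps editing pedigrees during reconstruction.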
Practical Applications
Real-World Use Cases
Bonsai v3's architecture is designed to address several real-world genetic genealogy scenarios:
- Unknown Parentage: Identifying biological parents through genetic connections
- Genetic Genealogy Research: Validating and extending paper-trail genealogy
- Population Studies: Reconstructing pedigrees within genetic databases
- Medical Genetics: Identifying inheritance patterns for genetic traits
- Forensic Applications: Finding genetic relatives for identification purposes
For each of these applications, Bonsai v3 provides not just relationship predictions but confidence estimates and alternative hypotheses, enabling informed decision-making.
Limitations and Challenges
While Bonsai v3 represents a sophisticated approach to pedigree reconstruction, it faces several inherent challenges:
- Genetic Equivalence: Some relationships produce statistically indistinguishable IBD patterns
- Missing Data: Incomplete sampling of pedigrees creates ambiguity
- IBD Detection Quality: Errors in the underlying IBD data propagate to relationship inference
- Endogamy: Population structures with high rates of consanguinity create complex patterns
- Computational Complexity: Optimal pedigree reconstruction is NP-hard, requiring heuristic approaches
Bonsai v3 addresses these challenges through probability-based approaches rather than deterministic algorithms, explicitly modeling uncertainty and providing confidence scores for inferred relationships.
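One generic way to turn log-likelihood scores into confidence values is to normalize them into probabilities, assuming equal prior weight on every hypothesis. The scores below are made up, and the function is a hypothetical illustration, not Bonsai's confidence calculation. The example also shows the genetic equivalence problem: two hypotheses with identical likelihoods split the probability evenly.

```python
import math


def normalize_log_likelihoods(log_likes: dict) -> dict:
    """Convert log-likelihoods to probabilities that sum to 1 (equal priors).

    Subtracting the maximum before exponentiating avoids numerical underflow.
    """
    peak = max(log_likes.values())
    weights = {name: math.exp(value - peak) for name, value in log_likes.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Made-up scores: half-siblings and avuncular are indistinguishable here.
scores = {"half_siblings": -152.0, "avuncular": -152.0, "first_cousins": -160.0}
probabilities = normalize_log_likelihoods(scores)
for name, probability in probabilities.items():
    print(f"{name}: {probability:.4f}")
```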
Core Principle: Bonsai v3 approaches genetic genealogy as a statistical inference problem, not a deterministic matching process. This probabilistic foundation allows it to handle the inherent uncertainty in genetic inheritance and produce reliable relationship estimates even with incomplete or noisy data.
Comparing Notebook and Production Code
The Lab01 notebook provides a learning environment for exploring IBD and genetic genealogy concepts, while the actual Bonsai v3 codebase implements production-ready algorithms. Key differences include:
- Simplified Models: The notebook uses pedagogical models to illustrate concepts, while the actual codebase implements more sophisticated mathematical models based on empirical data
- Data Scale: The notebook works with small example datasets, while the production code is optimized for large-scale pedigree reconstruction
- Error Handling: The production code includes comprehensive error handling and edge case management not shown in the simplified examples
- Performance Optimization: The actual codebase includes numerous optimizations for computational efficiency
- Parameter Calibration: The production code uses carefully calibrated parameters derived from large-scale empirical analyses
Despite these differences, the core concepts and approaches introduced in the lab directly correspond to the actual implementation in Bonsai v3, providing a solid foundation for understanding how the system works.
Interactive Lab Environment
Run the interactive Lab 01 notebook in Google Colab, which provides a hosted computing environment.
Data will be automatically downloaded from S3 when you run the notebook.
Note: You may need a Google account to save your work in Google Drive.
Beyond the Code
As you explore the Bonsai v3 system, consider these broader implications:
- Epistemological Foundations: How genetic evidence interacts with other forms of genealogical knowledge
- Ethical Considerations: Privacy implications of inferring relationships from genetic data
- Cultural Context: How biological relationships interact with social and cultural definitions of kinship
- Algorithmic Transparency: The importance of explaining relationship inferences in accessible terms
These considerations situate Bonsai v3 within broader scientific, ethical, and social contexts, highlighting the importance of thoughtful application of these powerful computational tools.
This lab is part of the Bonsai v3 Deep Dive track: Introduction (Lab 01), Architecture (Lab 02), IBD Formats (Lab 03), Statistics (Lab 04), Models (Lab 05), Relationships (Lab 06), PwLogLike (Lab 07), Age Modeling (Lab 08), Data Structures (Lab 09), Up-Node Dict (Lab 10).