Lab 01: IBD and Genealogy Introduction | Computational Genetic Genealogy

Lab 01: Introduction to Genetic Genealogy and IBD Concepts

Foundational Concepts: This lab introduces the core biological and computational principles of Identity-by-Descent (IBD) that underpin the Bonsai v3 system. Understanding these foundational concepts is essential for mastering Bonsai's approach to pedigree reconstruction.

Biological Foundations

DNA Inheritance and Genetic Recombination

Genetic genealogy rests on the biological mechanisms of DNA inheritance. Each person inherits 50% of their autosomal DNA from each parent, but which segments of a parent's two chromosome copies are passed on is determined by recombination, the exchange of genetic material between homologous chromosomes during meiosis.

Key principles of DNA inheritance in Bonsai:

Dilution Over Generations: The amount of DNA shared with an ancestor is approximately halved each generation back, creating a predictable decay pattern
Random Assortment: Which segments are inherited follows stochastic patterns that create variance in actual sharing
Recombination Hotspots: Certain genomic regions have higher recombination rates, affecting the distribution of segment lengths
Genetic Distance vs. Physical Distance: Bonsai works with genetic distance (centiMorgans) rather than physical distance (base pairs) to properly model inheritance

Bonsai v3 encodes these biological principles in statistical models of recombination patterns, inheritance probabilities, and the stochastic nature of genetic transmission.
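The "dilution over generations" principle can be made concrete with a short calculation. The sketch below is illustrative only: the function name is hypothetical and not part of the Bonsai codebase, and it gives the average expectation while ignoring the variance that random assortment introduces.

```python
def expected_ancestor_fraction(generations_back: int) -> float:
    """Average fraction of autosomal DNA inherited from one ancestor.

    Each generation halves the expected contribution, so an ancestor
    g generations back contributes 0.5 ** g on average.
    """
    if generations_back < 1:
        raise ValueError("generations_back must be at least 1")
    return 0.5 ** generations_back


# Average contribution of a parent, grandparent, and great-grandparent.
for g, label in enumerate(["parent", "grandparent", "great-grandparent"], start=1):
    print(f"{label}: {expected_ancestor_fraction(g):.3f}")
```

Actual sharing with any ancestor beyond a parent scatters around these averages, which is why the models discussed later work with distributions rather than point values.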

Identity by Descent (IBD)

Identity by Descent (IBD) is the fundamental unit of genetic relatedness in Bonsai. Two individuals share an IBD segment when they have inherited the exact same chromosomal segment from a common ancestor. IBD segments are the genetic "fingerprints" of relatedness that Bonsai analyzes to reconstruct family relationships.

Critical distinctions in IBD analysis:

IBD1 (Half-Identical Regions): DNA segments where two individuals share one of the two chromosome copies; present in every type of genetic relationship
IBD2 (Fully Identical Regions): DNA segments where two individuals share both chromosome copies; indicative of relationships through both parents, such as full siblings
IBS (Identity by State): Segments whose alleles match without having been inherited from a recent common ancestor; a source of potential false positives when mistaken for IBD

Bonsai v3's architecture is designed around detecting, processing, and analyzing these IBD segments to extract relationship information. The system processes raw IBD segments from detectors like IBIS, Refined IBD, or HapIBD, applying sophisticated statistical models to distinguish true IBD from potential artifacts.
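As a minimal sketch of the IBD1/IBD2 distinction, the snippet below totals segment lengths by IBD type for one pair of individuals. The `Segment` record layout and the example values are assumptions made for illustration; real detector output (IBIS, Refined IBD, HapIBD) uses its own column formats, which are not reproduced here.

```python
from collections import namedtuple

# Hypothetical record layout, not a real detector format.
Segment = namedtuple("Segment", ["chrom", "start_cm", "end_cm", "ibd_type"])


def total_by_type(segments):
    """Sum segment lengths (in cM) separately for IBD1 and IBD2."""
    totals = {1: 0.0, 2: 0.0}
    for seg in segments:
        totals[seg.ibd_type] += seg.end_cm - seg.start_cm
    return totals


# Made-up segments for a single pair of individuals.
segments = [
    Segment("1", 10.0, 60.0, 1),
    Segment("1", 80.0, 95.0, 2),
    Segment("2", 0.0, 40.0, 1),
]

totals = total_by_type(segments)
# Any IBD2 at all points toward a relationship through both parents.
has_ibd2 = totals[2] > 0
```

Keeping the two totals separate matters because the IBD2 total is what separates, for example, full siblings from other relationships with similar overall sharing.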

Measuring IBD: The CentiMorgan Scale

Bonsai quantifies IBD segments using centiMorgans (cM), a unit of genetic distance that accounts for recombination rates across the genome. Unlike physical measures (base pairs), centiMorgans provide a more accurate representation of inheritance patterns.

Key properties of the centiMorgan scale in Bonsai v3:

The human genome is approximately 3400 cM in total length
Two loci 1 cM apart have approximately a 1% chance of being separated by recombination in a single meiosis
The number of base pairs per cM varies across the genome, because recombination rates differ between regions (for example, at hotspots)
Segments under ~7 cM are generally considered less reliable for relationship inference

In the Bonsai v3 implementation, IBD segment lengths in cM directly inform statistical models of relationship likelihood. Longer segments generally indicate closer relationships, as they have had fewer generations in which recombination could break them up.
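Two small formulas illustrate why cM length carries information about relationship distance. The sketch below uses the Haldane map function, which assumes crossovers occur as a Poisson process with no interference; this is a standard textbook model chosen here for illustration, not something the lab states that Bonsai uses. Both function names are hypothetical.

```python
import math


def haldane_recombination_fraction(d_cm: float) -> float:
    """Recombination fraction between two loci d_cm apart (Haldane model)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))


def prob_segment_unbroken(length_cm: float, meioses: int) -> float:
    """Probability that no crossover lands inside a stretch of the given
    length in any of the given number of meioses (Poisson crossovers)."""
    return math.exp(-meioses * length_cm / 100.0)


# At short distances the recombination fraction is close to d_cm / 100.
r_one_cm = haldane_recombination_fraction(1.0)

# The same 20 cM stretch is less likely to survive intact over more meioses.
survive_2 = prob_segment_unbroken(20.0, meioses=2)
survive_8 = prob_segment_unbroken(20.0, meioses=8)
```

The second function is the reason long segments suggest close relationships: each additional meiosis multiplies the survival probability of a given stretch by the same factor below one.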

IBD Patterns in Relationships

Relationship-Specific IBD Signatures

Different relationship types exhibit characteristic patterns of IBD sharing that Bonsai v3 uses to distinguish between them. These patterns encompass total shared IBD, distribution of segment lengths, IBD1 vs. IBD2 ratios, and segment locations.

Key relationship signatures in Bonsai v3:

Parent-Child: ~50% total IBD, exclusively IBD1, covering essentially the whole genome in very long segments
Full Siblings: ~50% total IBD on average, made up of IBD1 across approximately 50% of the genome and IBD2 across approximately 25%, with a more variable distribution
Half-Siblings/Avuncular/Grandparental: ~25% total IBD, exclusively IBD1; the three share the same expected total, so age patterns help distinguish them
First Cousins: ~12.5% total IBD, exclusively IBD1, with shorter average segment lengths
Second Cousins: ~3.125% total IBD, exclusively IBD1, with even shorter segments

Bonsai v3 leverages these distinctive patterns not just through the average expected sharing, but by modeling the complete distribution of sharing patterns for each relationship type. This allows it to handle the inherent stochasticity in genetic inheritance.
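The average percentages in the list above follow from counting meioses. As a sketch, the expected shared fraction for relatives connected through `a` common ancestors, each path spanning `m` meioses in total, is `a * 0.5 ** m`. The function and table names below are illustrative and not part of Bonsai; the formula reproduces only the averages, not the full distributions that Bonsai models.

```python
def expected_shared_fraction(meioses: int, common_ancestors: int) -> float:
    """Average fraction of DNA shared, given the number of meioses on the
    path through each common ancestor and the number of such ancestors."""
    return common_ancestors * 0.5 ** meioses


# (meioses separating the pair, number of common ancestors)
RELATIONSHIPS = {
    "parent-child": (1, 1),
    "full siblings": (2, 2),
    "half-siblings": (2, 1),
    "grandparent-grandchild": (2, 1),
    "avuncular": (3, 2),
    "first cousins": (4, 2),
    "second cousins": (6, 2),
}

expected = {
    name: expected_shared_fraction(m, a) for name, (m, a) in RELATIONSHIPS.items()
}
for name, frac in expected.items():
    print(f"{name}: {frac:.5f}")
```

Half-siblings, grandparent-grandchild, and avuncular pairs all come out at 0.25, which is why total sharing alone cannot separate them.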

Statistical Variance in IBD Sharing

Actual IBD sharing between relatives exhibits substantial variance due to the random nature of recombination and inheritance. Bonsai v3 explicitly models this variance to provide accurate relationship probability calculations.

Sources of variance handled by Bonsai:

Mendelian Randomness: The random assortment of chromosomes during meiosis
Recombination Variance: The stochastic nature of crossover points
Detection Artifacts: False positives and false negatives in IBD detection
Population Background: Baseline levels of sharing in different populations

The Bonsai v3 codebase implements sophisticated statistical models to account for this variance. The likelihoods.py module contains implementations of probability distributions for IBD sharing that accurately reflect the empirical variance observed in real data.
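To get a feel for how much totals can scatter, the toy simulation below draws a Poisson number of segments with exponentially distributed lengths and records the total for each simulated pair. This is a deliberately simplified model, not the distribution implemented in likelihoods.py, and the parameters (35 segments averaging 25 cM) are made-up values chosen only for illustration.

```python
import math
import random
import statistics


def sample_poisson(rng: random.Random, lam: float) -> int:
    """Draw a Poisson variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_total_cm(rng, mean_segments, mean_length_cm):
    """Total shared cM for one simulated pair under the toy model."""
    n_segments = sample_poisson(rng, mean_segments)
    return sum(rng.expovariate(1.0 / mean_length_cm) for _ in range(n_segments))


rng = random.Random(42)  # fixed seed so the run is repeatable
totals = [
    simulate_total_cm(rng, mean_segments=35.0, mean_length_cm=25.0)
    for _ in range(2000)
]
mean_total = statistics.mean(totals)
sd_total = statistics.stdev(totals)
print(f"mean total: {mean_total:.0f} cM, standard deviation: {sd_total:.0f} cM")
```

Even in this simple model the standard deviation is a sizeable fraction of the mean, so two pairs with the same true relationship can show quite different totals.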

Bonsai's Approach to Pedigree Reconstruction

The Up-Node Dictionary Data Structure

At the heart of Bonsai v3's implementation is the "up-node dictionary"—an efficient data structure for representing and manipulating pedigrees. This structure encodes parent-child relationships in a directed graph format that facilitates rapid computation of genetic relationships.

Structure and implementation:

Each individual is represented by a unique ID (positive for observed individuals, negative for inferred/latent individuals)
Each individual maps to a dictionary of their parents, with values indicating the relationship type
Founders (individuals with no parents in the dataset) map to empty dictionaries
This structure enables efficient traversal, relationship determination, and pedigree operations

The up-node dictionary is implemented in the pedigrees.py module, which provides a comprehensive set of functions for manipulating these structures, calculating relationships between individuals, finding common ancestors, and combining sub-pedigrees.
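A small example makes the structure described above concrete. The helper functions here are hypothetical illustrations written for this lab text; they are not the functions exported by pedigrees.py. The value `1` stored for each parent is an assumption standing in for "one generation up".

```python
# Observed individuals have positive IDs, inferred individuals negative IDs.
up_node_dict = {
    1: {-1: 1, -2: 1},  # individual 1 with two inferred parents
    2: {-1: 1, -2: 1},  # individual 2, full sibling of 1
    3: {1: 1},          # child of 1; other parent not in the pedigree
    -1: {},             # founder
    -2: {},             # founder
}


def get_founders(up_dct):
    """Individuals that map to an empty parent dictionary."""
    return sorted(node for node, parents in up_dct.items() if not parents)


def get_ancestors(up_dct, node):
    """All ancestors of a node, found by walking parent links upward."""
    ancestors = set()
    stack = list(up_dct.get(node, {}))
    while stack:
        current = stack.pop()
        if current not in ancestors:
            ancestors.add(current)
            stack.extend(up_dct.get(current, {}))
    return ancestors


def common_ancestors(up_dct, a, b):
    """Ancestors shared by two individuals."""
    return get_ancestors(up_dct, a) & get_ancestors(up_dct, b)
```

For instance, `common_ancestors(up_node_dict, 2, 3)` walks up from both individuals and returns the two founders, which is the starting point for deciding that 2 and 3 are an avuncular pair.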

Pedigree Reconstruction Workflow

Bonsai v3 reconstructs pedigrees from raw IBD data in a multi-stage workflow:

IBD Processing: Raw IBD segments from detectors are loaded, filtered, and normalized
Pairwise Relationship Inference: The PwLogLike class computes relationship likelihoods between all pairs
Small Pedigree Construction: High-confidence clusters are assembled into initial sub-pedigrees
Pedigree Merging: Sub-pedigrees are systematically combined based on IBD evidence
Incremental Addition: Remaining individuals are added one by one to the growing pedigree
Optimization: The final pedigree is refined to maximize likelihood and resolve ambiguities

This stepwise approach allows Bonsai to tackle the combinatorial complexity of pedigree reconstruction by breaking it down into more manageable sub-problems. The workflow is primarily implemented in the bonsai.py and connections.py modules.
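The idea of starting from high-confidence clusters can be sketched with a simple union-find grouping over pairwise IBD totals. This is an illustrative stand-in, not the algorithm in bonsai.py or connections.py: the function name, the threshold, and the example totals are all assumptions.

```python
def cluster_by_ibd(pair_totals_cm, min_total_cm):
    """Group individuals connected by pairs sharing at least min_total_cm."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Visit the strongest pairs first, mirroring a confidence-first ordering.
    ordered = sorted(pair_totals_cm.items(), key=lambda kv: -kv[1])
    for (a, b), total in ordered:
        root_a, root_b = find(a), find(b)
        if total >= min_total_cm and root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for person in list(parent):
        clusters.setdefault(find(person), set()).add(person)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Made-up pairwise totals in cM.
pair_totals = {
    (1, 2): 2600.0,
    (1, 3): 1750.0,
    (2, 3): 1700.0,
    (4, 5): 900.0,
    (3, 4): 40.0,
}
clusters = cluster_by_ibd(pair_totals, min_total_cm=400.0)
```

Each resulting cluster would then be a candidate for building a small sub-pedigree, with weaker links such as the 40 cM pair left for the later merging stage.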

Statistical Relationship Models

Bonsai v3 employs sophisticated statistical models to evaluate potential relationships based on observed IBD patterns. These models are the mathematical foundation of Bonsai's accuracy in relationship inference.

Key statistical components:

Segment Length Distributions: Models for the expected distribution of IBD segment lengths for different relationship types
Segment Count Distributions: Poisson models for the expected number of IBD segments
Age Difference Models: Probability distributions for age differences in various relationships
Log-Likelihood Calculation: Combined scoring of genetic and demographic evidence

These statistical models are implemented in the likelihoods.py module, particularly in the PwLogLike class. This class computes log-likelihood scores for different relationship hypotheses, allowing Bonsai to select the most probable relationships.
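The way segment counts and lengths combine into a log-likelihood can be sketched as follows. This is a simplified model (Poisson count, independent exponential lengths) written for illustration; it does not reproduce the PwLogLike interface, and the two hypotheses with their parameters are invented placeholders rather than calibrated values.

```python
import math


def log_poisson_pmf(k: int, lam: float) -> float:
    """Log probability of observing k events under Poisson(lam)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def log_exponential_pdf(x: float, mean: float) -> float:
    """Log density of an exponential distribution with the given mean."""
    return -math.log(mean) - x / mean


def segment_log_likelihood(lengths_cm, expected_count, mean_length_cm):
    """Log-likelihood of the observed segments under one hypothesis."""
    ll = log_poisson_pmf(len(lengths_cm), expected_count)
    ll += sum(log_exponential_pdf(x, mean_length_cm) for x in lengths_cm)
    return ll


# Placeholder hypotheses: (expected segment count, mean segment length in cM).
HYPOTHESES = {
    "close": (30.0, 30.0),
    "distant": (3.0, 12.0),
}


def best_hypothesis(lengths_cm):
    """Return the highest-scoring hypothesis and all scores."""
    scores = {
        name: segment_log_likelihood(lengths_cm, count, mean)
        for name, (count, mean) in HYPOTHESES.items()
    }
    return max(scores, key=scores.get), scores
```

Calling `best_hypothesis([15.0, 9.0, 22.0])` favors the "distant" placeholder, because three segments is far from the 30 expected under "close". A demographic term such as an age-difference log-probability would be added to the same sum.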

Implementation Details

The Bonsai v3 Codebase Organization

Bonsai v3 is organized into a modular codebase with distinct components handling different aspects of the pedigree reconstruction process:

bonsai.py: Core functions for building pedigrees, the main entry points for using the library
likelihoods.py: Statistical models for relationship inference, including the PwLogLike class
pedigrees.py: Data structures and functions for manipulating pedigrees (up-node dictionaries)
ibd.py: Processing and analysis of IBD segments
connections.py: Functions for connecting and merging pedigrees
utils.py: General utility functions used throughout the codebase
twins.py: Specialized handling for twin relationships
moments.py: Calculation of statistical moments for IBD distributions
exceptions.py: Custom exception types for error handling

This modular design enables efficient development, testing, and extension of the Bonsai system, while maintaining a clear separation of concerns between different components.

IBD Processing

Bonsai v3 processes raw IBD detector output primarily in the ibd.py module.

Key IBD processing functions:

Loading IBD Segments: Functions for parsing IBD output from various detectors
Filtering IBD Segments: Removal of short, unreliable, or artifact segments
Normalizing Phased/Unphased Data: Handling both phased and unphased IBD formats
Computing Total IBD: Summation of IBD sharing between individual pairs
IBD Statistics: Calculation of segment counts, average lengths, and other metrics

The ibd.py module provides essential functionality for transforming raw detector output into the standardized format that Bonsai's statistical models require.
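The filtering and summary steps listed above can be sketched in a few lines. The tuple layout, the function name, and the 7 cM default are assumptions for illustration (the threshold echoes the reliability guideline mentioned earlier in this lab); the actual ibd.py functions and formats are not shown here.

```python
from collections import defaultdict

# Hypothetical record layout: (id1, id2, chromosome, start_cm, end_cm)
RAW_SEGMENTS = [
    ("A", "B", "1", 5.0, 45.0),
    ("B", "A", "3", 10.0, 22.5),    # same pair, IDs in the other order
    ("A", "B", "7", 100.0, 104.0),  # 4 cM, dropped by the filter
    ("A", "C", "2", 30.0, 38.0),
]


def summarize_pairs(segments, min_cm=7.0):
    """Filter out short segments, then compute per-pair statistics."""
    lengths = defaultdict(list)
    for id1, id2, _chrom, start_cm, end_cm in segments:
        length = end_cm - start_cm
        if length >= min_cm:
            pair = tuple(sorted((id1, id2)))  # order-independent pair key
            lengths[pair].append(length)
    return {
        pair: {
            "total_cm": sum(vals),
            "count": len(vals),
            "mean_cm": sum(vals) / len(vals),
        }
        for pair, vals in lengths.items()
    }


summary = summarize_pairs(RAW_SEGMENTS)
```

Sorting the two IDs before using them as a key ensures that segments reported as (A, B) and (B, A) are accumulated under the same pair.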

Performance Optimizations

Bonsai v3 incorporates numerous optimizations to handle large datasets efficiently:

Caching: Implemented in the caching.py module, provides memoization of expensive calculations
Efficient Data Structures: The up-node dictionary representation enables O(1) access to parental relationships
Prioritization: Algorithms focus on high-confidence relationships first, reducing the search space
Incremental Building: Processing individuals in batches reduces computational complexity
Parallel Processing Support: Critical components support parallel computation

These optimizations allow Bonsai v3 to scale to datasets with thousands of individuals while maintaining reasonable computational performance.
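Memoization of the kind described under Caching can be illustrated with the standard library's `functools.lru_cache`. This sketch shows only the general technique; how Bonsai's own caching is implemented is not covered in this lab, and the function being cached is a hypothetical stand-in for an expensive calculation.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the function body actually runs


@lru_cache(maxsize=None)
def expected_fraction(meioses: int, common_ancestors: int) -> float:
    """Stand-in for an expensive calculation keyed by relationship shape."""
    CALLS["count"] += 1
    return common_ancestors * 0.5 ** meioses


# The same relationship shape is requested many times, as it would be
# when scoring many pairs against the same hypothesis.
for _ in range(1000):
    expected_fraction(4, 2)

print(CALLS["count"], expected_fraction.cache_info())
```

Because arguments must be hashable for `lru_cache`, a design like this works best when relationship hypotheses are represented as tuples of integers rather than as mutable objects.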

Practical Applications

Real-World Use Cases

Bonsai v3's architecture is designed to address several real-world genetic genealogy scenarios:

Unknown Parentage: Identifying biological parents through genetic connections
Genetic Genealogy Research: Validating and extending paper-trail genealogy
Population Studies: Reconstructing pedigrees within genetic databases
Medical Genetics: Identifying inheritance patterns for genetic traits
Forensic Applications: Finding genetic relatives for identification purposes

For each of these applications, Bonsai v3 provides not just relationship predictions but confidence estimates and alternative hypotheses, enabling informed decision-making.

Limitations and Challenges

While Bonsai v3 represents a sophisticated approach to pedigree reconstruction, it faces several inherent challenges:

Genetic Equivalence: Some relationships produce statistically indistinguishable IBD patterns
Missing Data: Incomplete sampling of pedigrees creates ambiguity
IBD Detection Quality: Errors in the underlying IBD data propagate to relationship inference
Endogamy: Population structures with high rates of consanguinity create complex patterns
Computational Complexity: Optimal pedigree reconstruction is NP-hard, requiring heuristic approaches

Bonsai v3 addresses these challenges through probability-based approaches rather than deterministic algorithms, explicitly modeling uncertainty and providing confidence scores for inferred relationships.

Core Principle: Bonsai v3 approaches genetic genealogy as a statistical inference problem, not a deterministic matching process. This probabilistic foundation allows it to handle the inherent uncertainty in genetic inheritance and produce reliable relationship estimates even with incomplete or noisy data.

Comparing Notebook and Production Code

The Lab01 notebook provides a learning environment for exploring IBD and genetic genealogy concepts, while the actual Bonsai v3 codebase implements production-ready algorithms. Key differences include:

Simplified Models: The notebook uses pedagogical models to illustrate concepts, while the actual codebase implements more sophisticated mathematical models based on empirical data
Data Scale: The notebook works with small example datasets, while the production code is optimized for large-scale pedigree reconstruction
Error Handling: The production code includes comprehensive error handling and edge case management not shown in the simplified examples
Performance Optimization: The actual codebase includes numerous optimizations for computational efficiency
Parameter Calibration: The production code uses carefully calibrated parameters derived from large-scale empirical analyses

Despite these differences, the core concepts and approaches introduced in the lab directly correspond to the actual implementation in Bonsai v3, providing a solid foundation for understanding how the system works.

Interactive Lab Environment

The interactive Lab 01 notebook runs in Google Colab. Data will be automatically downloaded from S3 when you run the notebook.

Note: You may need a Google account to save your work in Google Drive.

Beyond the Code

As you explore the Bonsai v3 system, consider these broader implications:

Epistemological Foundations: How genetic evidence interacts with other forms of genealogical knowledge
Ethical Considerations: Privacy implications of inferring relationships from genetic data
Cultural Context: How biological relationships interact with social and cultural definitions of kinship
Algorithmic Transparency: The importance of explaining relationship inferences in accessible terms

These considerations situate Bonsai v3 within broader scientific, ethical, and social contexts, highlighting the importance of thoughtful application of these powerful computational tools.

This lab is part of the Bonsai v3 Deep Dive track:

Lab 01: Introduction
Lab 02: Architecture
Lab 03: IBD Formats
Lab 04: Statistics
Lab 05: Models
Lab 06: Relationships
Lab 07: PwLogLike
Lab 08: Age Modeling
Lab 09: Data Structures
Lab 10: Up-Node Dict