Computational Genetic Genealogy

IBD and Genealogy Introduction

Lab 01: Introduction to Genetic Genealogy and IBD Concepts

Foundational Concepts: This lab introduces the core biological and computational principles of Identity-by-Descent (IBD) that underpin the Bonsai v3 system. Understanding these foundational concepts is essential for mastering Bonsai's approach to pedigree reconstruction.

Biological Foundations

DNA Inheritance and Genetic Recombination

Genetic genealogy rests on the biological mechanisms of DNA inheritance. Each person inherits half of their autosomal DNA from each parent, but which segments of a parent's two chromosome copies are passed on is determined by recombination, the exchange of genetic material between homologous chromosomes during meiosis.

Key principles of DNA inheritance in Bonsai:

  • Dilution Over Generations: The amount of DNA shared with an ancestor is approximately halved each generation back, creating a predictable decay pattern
  • Random Assortment: Which segments are inherited is random, so actual sharing varies around the expected value
  • Recombination Hotspots: Certain genomic regions have higher recombination rates, affecting the distribution of segment lengths
  • Genetic Distance vs. Physical Distance: Bonsai works with genetic distance (centiMorgans) rather than physical distance (base pairs) to properly model inheritance

Bonsai v3 implements these biological principles through sophisticated statistical models that account for recombination patterns, inheritance probabilities, and the stochastic nature of genetic transmission.
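
The dilution principle can be made concrete in a few lines of Python. This is a minimal sketch of the expectation only; the function name is illustrative and is not part of Bonsai. Actual sharing varies around these values because of recombination.

```python
def expected_ancestor_fraction(generations_back: int) -> float:
    """Expected fraction of autosomal DNA inherited from a single ancestor
    who lived `generations_back` generations ago (1 = parent)."""
    if generations_back < 1:
        raise ValueError("generations_back must be at least 1")
    return 0.5 ** generations_back


# Print the expected contribution of a parent, grandparent, and so on.
for g in range(1, 6):
    print(f"{g} generation(s) back: {expected_ancestor_fraction(g):.4%}")
```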

Identity by Descent (IBD)

Identity by Descent (IBD) is the fundamental unit of genetic relatedness in Bonsai. Two individuals share an IBD segment when they have inherited the exact same chromosomal segment from a common ancestor. IBD segments are the genetic "fingerprints" of relatedness that Bonsai analyzes to reconstruct family relationships.

Critical distinctions in IBD analysis:

  • IBD1 (Half-Identical Regions): DNA segments where two individuals share one of the two chromosome copies at a locus; found in every type of genealogical relationship
  • IBD2 (Fully Identical Regions): DNA segments where two individuals share both chromosome copies at a locus; indicative of very close relationships such as full siblings
  • IBS (Identity by State): Segments whose alleles match without having been inherited from a recent common ancestor; a source of potential false positives

Bonsai v3's architecture is designed around detecting, processing, and analyzing these IBD segments to extract relationship information. The system processes raw IBD segments from detectors such as IBIS, Refined IBD, or hap-IBD, applying statistical models to distinguish true IBD from detection artifacts.

Measuring IBD: The CentiMorgan Scale

Bonsai quantifies IBD segments using centiMorgans (cM), a unit of genetic distance that accounts for recombination rates across the genome. Unlike physical measures (base pairs), centiMorgans provide a more accurate representation of inheritance patterns.

Key properties of the centiMorgan scale in Bonsai v3:

  • The human genome is approximately 3400 cM in total length
  • 1 cM corresponds to roughly a 1% chance that a crossover falls between two loci in a single meiosis
  • cM distances vary across the genome based on recombination hotspots
  • Segments under ~7 cM are generally considered less reliable for relationship inference

In the Bonsai v3 implementation, IBD segment lengths in cM directly inform statistical models of relationship likelihood. Longer segments generally indicate closer relationships, as they have had fewer generations in which recombination could break them up.
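
The link between segment length and relationship distance can be illustrated with a common simplification, assumed here rather than taken from the Bonsai source: if crossovers are treated as a Poisson process along the chromosome, a segment that has passed through m meioses has an approximately exponential length with mean 100/m cM (ignoring chromosome ends). The function names below are illustrative.

```python
import math


def mean_segment_length_cm(meioses: int) -> float:
    """Approximate mean IBD segment length after `meioses` meioses."""
    return 100.0 / meioses


def prob_segment_longer_than(threshold_cm: float, meioses: int) -> float:
    """P(length > threshold) for an exponential length with mean 100/m cM."""
    return math.exp(-meioses * threshold_cm / 100.0)


# Compare close and distant relationships against a 7 cM cutoff.
for m in (2, 4, 6, 10):
    print(
        f"m={m:2d}: mean length {mean_segment_length_cm(m):5.1f} cM, "
        f"P(length > 7 cM) = {prob_segment_longer_than(7.0, m):.3f}"
    )
```

Under this model, the more meioses separate two people, the larger the share of their segments that falls below a fixed length cutoff.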

IBD Patterns in Relationships

Relationship-Specific IBD Signatures

Different relationship types exhibit characteristic patterns of IBD sharing that Bonsai v3 uses to distinguish between them. These patterns encompass total shared IBD, distribution of segment lengths, IBD1 vs. IBD2 ratios, and segment locations.

Key relationship signatures in Bonsai v3:

  • Parent-Child: ~50% total IBD, exclusively IBD1, spanning essentially the entire autosomal genome
  • Full Siblings: ~50% total IBD, but with IBD2 regions (approximately 25% of the genome) and a more variable distribution
  • Half-Siblings/Avuncular/Grandparental: ~25% total IBD, exclusively IBD1; these share similar totals and are distinguished mainly by age differences
  • First Cousins: ~12.5% total IBD, exclusively IBD1, with shorter average segment lengths
  • Second Cousins: ~3.125% total IBD, exclusively IBD1, with even shorter segments

Bonsai v3 leverages these distinctive patterns not just through the average expected sharing, but by modeling the complete distribution of sharing patterns for each relationship type. This allows it to handle the inherent stochasticity in genetic inheritance.
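
The average percentages listed above follow from a simple rule: each meiosis on the path between two people halves the expected sharing, and each independent ancestral path contributes separately. The sketch below encodes that rule; the class and field names are illustrative and not part of Bonsai, and it covers only the mean, not the full distribution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    name: str
    meioses: int   # meioses on the path connecting the two people
    lineages: int  # independent ancestral paths: 1 for direct-line and
                   # half relationships, 2 for full relationships

    def expected_shared_fraction(self) -> float:
        return self.lineages * 0.5 ** self.meioses


RELATIONSHIPS = [
    Relationship("parent-child", meioses=1, lineages=1),
    Relationship("full siblings", meioses=2, lineages=2),
    Relationship("half-siblings", meioses=2, lineages=1),
    Relationship("first cousins", meioses=4, lineages=2),
    Relationship("second cousins", meioses=6, lineages=2),
]

for rel in RELATIONSHIPS:
    print(f"{rel.name:15s} {rel.expected_shared_fraction():.5f}")
```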

Statistical Variance in IBD Sharing

Actual IBD sharing between relatives exhibits substantial variance due to the random nature of recombination and inheritance. Bonsai v3 explicitly models this variance to provide accurate relationship probability calculations.

Sources of variance handled by Bonsai:

  • Mendelian Randomness: The random assortment of chromosomes during meiosis
  • Recombination Variance: The stochastic nature of crossover points
  • Detection Artifacts: False positives and false negatives in IBD detection
  • Population Background: Baseline levels of sharing in different populations

The Bonsai v3 codebase implements sophisticated statistical models to account for this variance. The likelihoods.py module contains implementations of probability distributions for IBD sharing that accurately reflect the empirical variance observed in real data.
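
To get a feel for how large this variance can be, the toy simulation below draws a Poisson number of segments with exponentially distributed lengths and sums them. It is an illustration only, not the model in likelihoods.py, and the parameters (34 expected segments averaging 25 cM) are made-up values chosen for the example.

```python
import random
import statistics


def sample_poisson(rng: random.Random, mean: float) -> int:
    """Draw a Poisson count by counting unit-rate arrivals in [0, mean)."""
    count = 0
    t = rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def simulate_total_ibd(rng: random.Random, expected_segments: float,
                       mean_length_cm: float) -> float:
    """Total shared cM for one simulated pair under the toy model."""
    n_segments = sample_poisson(rng, expected_segments)
    # expovariate takes a rate, so the mean length is 1 / rate.
    return sum(rng.expovariate(1.0 / mean_length_cm) for _ in range(n_segments))


rng = random.Random(0)  # fixed seed so the run is repeatable
totals = [simulate_total_ibd(rng, 34.0, 25.0) for _ in range(2000)]
print(f"mean total  = {statistics.mean(totals):.0f} cM")
print(f"stdev total = {statistics.stdev(totals):.0f} cM")
```

Even in this simplified setting, pairs with identical expected sharing differ by hundreds of cM, which is why a point estimate alone is a weak basis for relationship inference.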

Bonsai's Approach to Pedigree Reconstruction

The Up-Node Dictionary Data Structure

At the heart of Bonsai v3's implementation is the "up-node dictionary"—an efficient data structure for representing and manipulating pedigrees. This structure encodes parent-child relationships in a directed graph format that facilitates rapid computation of genetic relationships.

Structure and implementation:

  • Each individual is represented by a unique ID (positive for observed individuals, negative for inferred/latent individuals)
  • Each individual maps to a dictionary of their parents, with values indicating the relationship type
  • Founders (individuals with no parents in the dataset) map to empty dictionaries
  • This structure enables efficient traversal, relationship determination, and pedigree operations

The up-node dictionary is implemented in the pedigrees.py module, which provides a comprehensive set of functions for manipulating these structures, calculating relationships between individuals, finding common ancestors, and combining sub-pedigrees.
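
A small hand-built example shows the shape of the structure described above. The pedigree, the helper functions, and the use of the value 1 to mark a direct parent link are assumptions made for this sketch; they are not the API of pedigrees.py.

```python
# Hypothetical three-generation pedigree in up-node form.
# Positive IDs are observed people; negative IDs are inferred ancestors.
# The value 1 marks a direct parent link (an assumption for this sketch).
up_node_dict = {
    1: {-1: 1, -2: 1},  # person 1 has two inferred parents
    2: {-1: 1, -2: 1},  # person 2 is a full sibling of person 1
    3: {1: 1},          # person 3 is a child of person 1
    4: {2: 1},          # person 4 is a child of person 2
    -1: {},             # founders map to empty dictionaries
    -2: {},
}


def get_ancestors(up_dict, node):
    """Return the set of all ancestors of `node` by walking parent links."""
    ancestors = set()
    stack = list(up_dict.get(node, {}))
    while stack:
        parent = stack.pop()
        if parent not in ancestors:
            ancestors.add(parent)
            stack.extend(up_dict.get(parent, {}))
    return ancestors


def common_ancestors(up_dict, a, b):
    """Ancestors shared by individuals `a` and `b`."""
    return get_ancestors(up_dict, a) & get_ancestors(up_dict, b)


# Persons 3 and 4 are first cousins through the two inferred founders.
print(common_ancestors(up_node_dict, 3, 4))
```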

Pedigree Reconstruction Workflow

Bonsai v3 implements a sophisticated multi-stage workflow for reconstructing pedigrees from raw IBD data:

  1. IBD Processing: Raw IBD segments from detectors are loaded, filtered, and normalized
  2. Pairwise Relationship Inference: The PwLogLike class computes relationship likelihoods between all pairs
  3. Small Pedigree Construction: High-confidence clusters are assembled into initial sub-pedigrees
  4. Pedigree Merging: Sub-pedigrees are systematically combined based on IBD evidence
  5. Incremental Addition: Remaining individuals are added one by one to the growing pedigree
  6. Optimization: The final pedigree is refined to maximize likelihood and resolve ambiguities

This stepwise approach allows Bonsai to tackle the combinatorial complexity of pedigree reconstruction by breaking it down into more manageable sub-problems. The workflow is primarily implemented in the bonsai.py and connections.py modules.
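
Stage 3 depends on grouping individuals who are connected by strong evidence. One simple way to form such groups, shown here as an illustration rather than as Bonsai's method, is to link every pair whose total IBD meets a threshold and take the connected components with a union-find structure. The function name, threshold, and pair totals below are made up.

```python
def cluster_by_ibd(pair_totals_cm, threshold_cm):
    """Group IDs into clusters linked by pairs sharing >= threshold_cm."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), total_cm in pair_totals_cm.items():
        root_a, root_b = find(a), find(b)  # registers both IDs
        if total_cm >= threshold_cm and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical total IBD (cM) for a handful of pairs.
pair_totals = {(1, 2): 2600.0, (2, 3): 1700.0, (3, 4): 40.0, (4, 5): 900.0}
print(cluster_by_ibd(pair_totals, threshold_cm=400.0))
```

The weak 40 cM link between individuals 3 and 4 does not join the two groups, so each group could be reconstructed separately before any merge is attempted.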

Statistical Relationship Models

Bonsai v3 employs sophisticated statistical models to evaluate potential relationships based on observed IBD patterns. These models are the mathematical foundation of Bonsai's accuracy in relationship inference.

Key statistical components:

  • Segment Length Distributions: Models for the expected distribution of IBD segment lengths for different relationship types
  • Segment Count Distributions: Poisson models for the expected number of IBD segments
  • Age Difference Models: Probability distributions for age differences in various relationships
  • Log-Likelihood Calculation: Combined scoring of genetic and demographic evidence

These statistical models are implemented in the likelihoods.py module, particularly in the PwLogLike class. This class computes log-likelihood scores for different relationship hypotheses, allowing Bonsai to select the most probable relationships.
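
The idea of scoring hypotheses by log-likelihood can be sketched with a deliberately simplified model: a Poisson term for the number of segments plus an exponential term for each segment length. This is not the PwLogLike implementation, and the hypothesis parameters and observed lengths are invented for the example.

```python
import math


def poisson_log_pmf(k: int, mean: float) -> float:
    """Log-probability of observing k events under a Poisson(mean) model."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def exponential_log_pdf(x: float, mean: float) -> float:
    """Log-density of length x under an exponential model with this mean."""
    return -math.log(mean) - x / mean


def toy_log_likelihood(segment_lengths_cm, expected_count, mean_length_cm):
    """Combine the segment-count and segment-length terms."""
    score = poisson_log_pmf(len(segment_lengths_cm), expected_count)
    score += sum(exponential_log_pdf(x, mean_length_cm) for x in segment_lengths_cm)
    return score


# Hypothetical hypotheses: (expected segment count, mean length in cM).
hypotheses = {"hypothesis_A": (34.0, 25.0), "hypothesis_B": (8.0, 12.0)}
observed = [31.0, 12.5, 48.2, 9.7, 22.1, 17.4, 60.3, 8.8]

scores = {
    name: toy_log_likelihood(observed, count, length)
    for name, (count, length) in hypotheses.items()
}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

Because the two terms are summed, a hypothesis that explains the segment lengths well can still lose if it predicts far more segments than were observed.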

Implementation Details

The Bonsai v3 Codebase Organization

Bonsai v3 is organized into a modular codebase with distinct components handling different aspects of the pedigree reconstruction process:

  • bonsai.py: Core functions for building pedigrees, the main entry points for using the library
  • likelihoods.py: Statistical models for relationship inference, including the PwLogLike class
  • pedigrees.py: Data structures and functions for manipulating pedigrees (up-node dictionaries)
  • ibd.py: Processing and analysis of IBD segments
  • connections.py: Functions for connecting and merging pedigrees
  • utils.py: General utility functions used throughout the codebase
  • twins.py: Specialized handling for twin relationships
  • moments.py: Calculation of statistical moments for IBD distributions
  • exceptions.py: Custom exception types for error handling

This modular design enables efficient development, testing, and extension of the Bonsai system, while maintaining a clear separation of concerns between different components.

IBD Processing

Bonsai v3 includes sophisticated tools for processing raw IBD detection output, implemented primarily in the ibd.py module:

Key IBD processing functionalities:

  • Loading IBD Segments: Functions for parsing IBD output from various detectors
  • Filtering IBD Segments: Removal of short, unreliable, or artifact segments
  • Normalizing Phased/Unphased Data: Handling both phased and unphased IBD formats
  • Computing Total IBD: Summation of IBD sharing between individual pairs
  • IBD Statistics: Calculation of segment counts, average lengths, and other metrics

The ibd.py module provides essential functionality for transforming raw detector output into the standardized format that Bonsai's statistical models require.
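
The filtering and summary steps listed above can be sketched as follows. The row layout, function name, and 7 cM default are assumptions for this example; real detector output has more columns and the ibd.py functions may be organized differently.

```python
from collections import defaultdict

# Hypothetical rows: (id1, id2, chromosome, start_cm, end_cm)
segments = [
    (1, 2, 1, 10.0, 45.5),
    (2, 1, 3, 80.0, 92.0),   # same pair, IDs listed in the other order
    (1, 2, 7, 5.0, 9.0),     # 4 cM: dropped by the length filter
    (1, 3, 2, 100.0, 118.0),
]


def summarize_pairs(rows, min_length_cm=7.0):
    """Filter short segments, then compute per-pair count, total, and mean."""
    lengths = defaultdict(list)
    for id1, id2, _chrom, start_cm, end_cm in rows:
        length = end_cm - start_cm
        if length >= min_length_cm:
            pair = (min(id1, id2), max(id1, id2))  # order-independent key
            lengths[pair].append(length)
    return {
        pair: {
            "count": len(values),
            "total_cm": sum(values),
            "mean_cm": sum(values) / len(values),
        }
        for pair, values in lengths.items()
    }


summary = summarize_pairs(segments)
for pair, stats in sorted(summary.items()):
    print(pair, stats)
```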

Performance Optimizations

Bonsai v3 incorporates numerous optimizations to handle large datasets efficiently:

  • Caching: Implemented in the caching.py module, provides memoization of expensive calculations
  • Efficient Data Structures: The up-node dictionary representation gives average-case O(1) lookup of an individual's parents
  • Prioritization: Algorithms focus on high-confidence relationships first, reducing the search space
  • Incremental Building: Processing individuals in batches reduces computational complexity
  • Parallel Processing Support: Critical components support parallel computation

These optimizations allow Bonsai v3 to scale to datasets with thousands of individuals while maintaining reasonable computational performance.
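
Memoization of the kind described in the caching bullet can be demonstrated with Python's standard functools.lru_cache. The decorated function here is a stand-in chosen for brevity; it is not a Bonsai function, and the source does not say how Bonsai's own cache is implemented.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def expected_shared_fraction(meioses: int, lineages: int) -> float:
    """Stand-in for an expensive calculation keyed by hashable arguments."""
    return lineages * 0.5 ** meioses


expected_shared_fraction(4, 2)  # computed and stored
expected_shared_fraction(4, 2)  # served from the cache
info = expected_shared_fraction.cache_info()
print(f"hits={info.hits}, misses={info.misses}")
```

The same pattern pays off when many pairs of individuals map to the same relationship parameters, because each distinct argument combination is evaluated only once.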

Practical Applications

Real-World Use Cases

Bonsai v3's architecture is designed to address several real-world genetic genealogy scenarios:

  • Unknown Parentage: Identifying biological parents through genetic connections
  • Genetic Genealogy Research: Validating and extending paper-trail genealogy
  • Population Studies: Reconstructing pedigrees within genetic databases
  • Medical Genetics: Identifying inheritance patterns for genetic traits
  • Forensic Applications: Finding genetic relatives for identification purposes

For each of these applications, Bonsai v3 provides not just relationship predictions but confidence estimates and alternative hypotheses, enabling informed decision-making.

Limitations and Challenges

While Bonsai v3 represents a sophisticated approach to pedigree reconstruction, it faces several inherent challenges:

  • Genetic Equivalence: Some relationships produce statistically indistinguishable IBD patterns
  • Missing Data: Incomplete sampling of pedigrees creates ambiguity
  • IBD Detection Quality: Errors in the underlying IBD data propagate to relationship inference
  • Endogamy: Population structures with high rates of consanguinity create complex patterns
  • Computational Complexity: Optimal pedigree reconstruction is NP-hard, requiring heuristic approaches

Bonsai v3 addresses these challenges through probability-based approaches rather than deterministic algorithms, explicitly modeling uncertainty and providing confidence scores for inferred relationships.
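
One generic way to turn log-likelihoods for competing hypotheses into comparable confidence scores is to normalize them, assuming equal prior weight on each hypothesis. The sketch below does this with the usual subtract-the-maximum step for numerical stability; the function name and the scores are invented, and the source does not state that Bonsai computes its confidence values this way.

```python
import math


def normalize_log_likelihoods(log_likes):
    """Convert log-likelihoods to probabilities, assuming equal priors."""
    peak = max(log_likes.values())
    # Subtracting the peak keeps exp() from underflowing to zero.
    weights = {name: math.exp(value - peak) for name, value in log_likes.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Made-up log-likelihoods for three candidate relationships.
example_scores = {
    "first cousins": -41.2,
    "half first cousins": -43.0,
    "second cousins": -49.5,
}
probabilities = normalize_log_likelihoods(example_scores)
for name, p in sorted(probabilities.items(), key=lambda item: -item[1]):
    print(f"{name:20s} {p:.3f}")
```

Reporting the runner-up alongside the top hypothesis makes it visible when two relationships are nearly indistinguishable.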

Core Principle: Bonsai v3 approaches genetic genealogy as a statistical inference problem, not a deterministic matching process. This probabilistic foundation allows it to handle the inherent uncertainty in genetic inheritance and produce reliable relationship estimates even with incomplete or noisy data.

Comparing Notebook and Production Code

The Lab01 notebook provides a learning environment for exploring IBD and genetic genealogy concepts, while the actual Bonsai v3 codebase implements production-ready algorithms.

Despite these differences, the core concepts and approaches introduced in the lab directly correspond to the actual implementation in Bonsai v3, providing a solid foundation for understanding how the system works.

Interactive Lab Environment

Run the interactive Lab 01 notebook in Google Colab.

Data will be automatically downloaded from S3 when you run the notebook.

Note: You may need a Google account to save your work in Google Drive.


Beyond the Code

As you explore the Bonsai v3 system, consider these broader implications:

These considerations situate Bonsai v3 within broader scientific, ethical, and social contexts, highlighting the importance of thoughtful application of these powerful computational tools.

This lab is part of the Bonsai v3 Deep Dive track:

  • Lab 01: Introduction
  • Lab 02: Architecture
  • Lab 03: IBD Formats
  • Lab 04: Statistics
  • Lab 05: Models
  • Lab 06: Relationships
  • Lab 07: PwLogLike
  • Lab 08: Age Modeling
  • Lab 09: Data Structures
  • Lab 10: Up-Node Dict