Lab 02: Bonsai v3 Architecture Overview and Core Data Structures
System Design: This lab examines the architectural design and data structures of Bonsai v3. Understanding how this system is organized is essential for mastering its operation and extending its capabilities.
Architectural Overview
Design Philosophy
Bonsai v3 is built around several key design principles that shape its architecture:
- Modularity: The system is organized into distinct components with clear responsibilities, enabling easier maintenance and extension
- Separation of Concerns: Different aspects of pedigree reconstruction are handled by specialized modules
- Statistical Foundations: The system represents relationship inference as a probabilistic problem, using likelihood-based approaches
- Incremental Construction: Pedigrees are built progressively, from closely related clusters to more distant connections
- Efficient Data Structures: The system uses specialized data structures optimized for pedigree operations
These principles enable Bonsai v3 to handle the complexities of pedigree reconstruction while maintaining code readability and performance.
Module Organization
Bonsai v3 is organized into several key modules, each with specific responsibilities:
- bonsai.py: Main entry point and orchestration module that coordinates the complete workflow
- ibd.py: Handles IBD data processing, conversion, and extraction of genetic sharing statistics
- likelihoods.py: Implements statistical models for relationship inference and likelihood calculation
- pedigrees.py: Provides data structures and operations for representing and manipulating pedigrees
- connections.py: Handles the logic for connecting individuals and merging pedigree fragments
- utils.py: Contains utility functions used throughout the codebase
- constants.py: Stores system constants and configuration parameters
- caching.py: Implements performance optimizations through memoization
- exceptions.py: Defines custom exception types for error handling
- rendering.py: Provides visualization and rendering capabilities for pedigrees
This modular organization allows each component to focus on its specific responsibility while working together to achieve the overall goal of pedigree reconstruction.
Main Workflow
The core workflow in Bonsai v3 is orchestrated by the build_pedigree() function in bonsai.py, which follows these steps:
- Data Preprocessing: IBD data is loaded, filtered, and normalized for analysis
- Relationship Inference: Pairwise relationships between individuals are computed using likelihood models
- Initial Clusters: Closely related individuals are grouped into small, high-confidence pedigree clusters
- Cluster Merging: Small clusters are progressively merged into larger pedigrees based on genetic evidence
- Refinement: The merged pedigree is optimized to maximize overall likelihood
- Output Generation: The final pedigree is returned along with likelihood scores and statistics
Each step in this workflow involves multiple operations handled by different modules, coordinated by the main build_pedigree() function.
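The steps above can be condensed into a toy skeleton. Everything here is illustrative: the expected-sharing values are rough textbook averages in centimorgans, classification on total cM alone is a drastic simplification of the likelihood models described later, and build_pedigree_sketch() is not the production function.

```python
# Approximate expected total IBD sharing (cM) for a few relationship
# classes. These are rough textbook averages, not Bonsai's actual models.
EXPECTED_CM = {"parent-child": 3400, "full-sibling": 2550,
               "half-sibling": 1700, "first-cousin": 850, "unrelated": 0}

def classify_pair(total_cm):
    """Pick the relationship whose expected sharing is closest to total_cm."""
    return min(EXPECTED_CM, key=lambda r: abs(EXPECTED_CM[r] - total_cm))

def build_pedigree_sketch(pair_totals):
    """Illustrative skeleton of the build_pedigree() stages: infer a
    relationship per pair, then seed clusters from close relatives.
    Merging and refinement are elided in this sketch."""
    rels = {pair: classify_pair(cm) for pair, cm in pair_totals.items()}
    close = [p for p, r in rels.items() if r != "unrelated"]
    return {"relationships": rels, "close_pairs": close}

totals = {(1, 2): 3350.0, (1, 3): 820.0, (2, 3): 10.0}
out = build_pedigree_sketch(totals)
print(out["relationships"][(1, 2)])  # parent-child
```

In the real system, the per-pair inference step integrates segment lengths, counts, and ages rather than a single total, but the overall shape (pairwise inference, then clustering on high-confidence pairs) is the same.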
The Up-Node Dictionary: Bonsai's Central Data Structure
Structure and Components
The up-node dictionary is the fundamental data structure used in Bonsai v3 to represent pedigrees. It efficiently encodes parent-child relationships in a dictionary-based representation:
```python
{
    1000: {1001: 1, 1002: 1},  # Individual 1000 has parents 1001 and 1002
    1003: {1001: 1, 1002: 1},  # Individual 1003 has the same parents (siblings)
    1004: {-1: 1, -2: 1},      # Individual 1004 has inferred parents -1 and -2
    1005: {-1: 1, 1002: 1},    # Individual 1005 has one inferred parent and one known parent
    -1: {1006: 1, 1007: 1},    # Inferred individual -1 has parents 1006 and 1007
    1001: {},                  # Empty dictionaries represent founders with no recorded parents
    1002: {},
    1006: {},
    1007: {},
    -2: {},
}
```
Key aspects of this structure include:
- Each individual is identified by a unique ID (key in the outer dictionary)
- The value for each individual is another dictionary mapping to their parents
- The inner dictionary's values (typically 1) can represent relationship types
- Individuals with no parents (founders) map to empty dictionaries
- Negative IDs represent inferred (latent) individuals not present in the original data
This structure allows Bonsai to represent arbitrary pedigree configurations while maintaining computational efficiency.
Advantages of the Up-Node Dictionary
The up-node dictionary offers several advantages over alternative pedigree representations:
- Memory Efficiency: Only stores necessary relationships, making it compact for large pedigrees
- Fast Lookup: Provides O(1) access to an individual's parents
- Flexible Structure: Can represent complex relationships including half-siblings, multiple generations, and missing individuals
- Explicit Latent Nodes: Uses negative IDs to clearly identify inferred individuals not present in the original data
- Simple Operations: Common pedigree operations (finding ancestors, descendants, siblings) can be implemented efficiently
- Direct Manipulation: Allows for straightforward modification of the pedigree structure
These properties make the up-node dictionary ideal for Bonsai's incremental pedigree construction process, which requires frequent pedigree manipulations and relationship queries.
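A few of these operations (parent lookup, ancestor traversal, sibling finding) can be sketched in a handful of lines. The helper names below are illustrative, not necessarily the pedigrees.py API:

```python
def get_parents(up_dict, iid):
    """O(1) lookup of an individual's parents in an up-node dict."""
    return set(up_dict.get(iid, {}))

def get_ancestors(up_dict, iid):
    """All ancestors of iid, found by walking parent links upward."""
    anc, stack = set(), list(get_parents(up_dict, iid))
    while stack:
        a = stack.pop()
        if a not in anc:
            anc.add(a)
            stack.extend(get_parents(up_dict, a))
    return anc

def get_siblings(up_dict, iid):
    """Individuals sharing at least one parent with iid (full or half sibs)."""
    parents = get_parents(up_dict, iid)
    return {o for o in up_dict
            if o != iid and parents & get_parents(up_dict, o)}

up_dict = {
    1000: {1001: 1, 1002: 1},
    1003: {1001: 1, 1002: 1},
    1005: {-1: 1, 1002: 1},
    -1: {1006: 1, 1007: 1},
    1001: {}, 1002: {}, 1006: {}, 1007: {},
}
print(get_siblings(up_dict, 1000))   # {1003, 1005}
print(get_ancestors(up_dict, 1005))  # {-1, 1002, 1006, 1007}
```

Note that 1005 appears among 1000's siblings because they share parent 1002: the same sketch handles full and half siblings without extra machinery.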
Key Operations
Bonsai v3 implements numerous operations on up-node dictionaries in the pedigrees.py module:
- Relationship Computation: Functions like get_simple_rel_tuple() determine the relationship between any two individuals
- Ancestor Identification: get_common_ancs() finds common ancestors between individuals
- Pedigree Traversal: Functions for navigating up and down the pedigree structure
- Pedigree Joining: combine_up_dicts() merges separate pedigree fragments
- Pedigree Validation: Functions that check biological consistency (no cycles, appropriate sexes)
- Node Addition/Removal: Operations to modify the pedigree structure
These operations form the foundation for more complex algorithms in the Bonsai system, such as pedigree merging and relationship inference.
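Pedigree joining in particular is easy to picture on the up-node representation. The following is a toy stand-in for combine_up_dicts(), assuming the simplest case where a child in one fragment gains a parent from the other; the real function handles conflicts and many more connection patterns:

```python
def combine_up_dicts_sketch(up1, up2, child, parent):
    """Toy merge of two pedigree fragments: copy both up-node dicts,
    then record `parent` (from up2) as a parent of `child` (from up1).
    Illustrative stand-in for Bonsai's combine_up_dicts()."""
    merged = {i: dict(p) for i, p in up1.items()}
    for i, p in up2.items():
        merged.setdefault(i, {}).update(p)
    merged.setdefault(child, {})[parent] = 1
    # Basic biological consistency check: at most two parents per individual
    assert len(merged[child]) <= 2, "an individual cannot have >2 parents"
    return merged

frag_a = {1000: {}, 1003: {}}
frag_b = {-1: {}, 2000: {-1: 1}}
merged = combine_up_dicts_sketch(frag_a, frag_b, child=1000, parent=-1)
print(merged[1000])  # {-1: 1}
```

Because fragments are plain dictionaries, merging is just dictionary union plus a new parent link, which is part of why the up-node representation suits incremental construction.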
Core Modules in Depth
IBD Processing (ibd.py)
The ibd.py module handles all aspects of IBD data management and processing:
- IBD Loading: Functions for parsing IBD segments from various detector formats
- Format Conversion: Converting between phased and unphased IBD representations
- Segment Filtering: Removing unreliable segments based on length or quality thresholds
- IBD Statistics: Extracting metrics like total sharing, segment counts, and length distributions
- Sharing Analysis: Identifying individuals with significant genetic sharing
Key functions include get_id_to_shared_ibd(), which summarizes genetic sharing between pairs of individuals, and get_closest_pair(), which identifies the most closely related individuals based on IBD patterns.
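The core of both functions can be sketched in a few lines. The `_sketch` versions below are illustrative analogs operating on `(id1, id2, length_cm)` tuples, not the actual ibd.py signatures:

```python
from collections import defaultdict

def get_id_to_shared_ibd_sketch(segments):
    """Total IBD sharing (cM) per unordered pair of individuals;
    toy analog of get_id_to_shared_ibd()."""
    totals = defaultdict(float)
    for id1, id2, cm in segments:
        totals[tuple(sorted((id1, id2)))] += cm
    return dict(totals)

def get_closest_pair_sketch(totals):
    """Pair with the greatest total sharing; toy analog of get_closest_pair()."""
    return max(totals, key=totals.get)

segments = [(1, 2, 88.5), (2, 1, 41.5), (1, 3, 22.0), (2, 3, 7.3)]
totals = get_id_to_shared_ibd_sketch(segments)
print(totals[(1, 2)])                   # 130.0
print(get_closest_pair_sketch(totals))  # (1, 2)
```

Sorting each ID pair makes the summary order-independent, so `(1, 2)` and `(2, 1)` segments accumulate into the same entry.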
Relationship Likelihoods (likelihoods.py)
The likelihoods.py module implements the statistical models used for relationship inference:
- PwLogLike Class: The primary class for computing pairwise relationship likelihoods
- Segment Distribution Models: Statistical models for IBD segment lengths in different relationships
- Segment Count Models: Poisson models for the expected number of IBD segments
- Age Models: Probability distributions for age differences in various relationships
- Combined Scoring: Methods for integrating genetic and demographic evidence
The PwLogLike class is particularly important, as it provides the foundation for all relationship inferences in Bonsai v3, implementing sophisticated statistical models calibrated on empirical data.
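As one concrete piece, a Poisson model for segment counts scores each candidate relationship by how well its expected segment count explains the observed count. The expected counts below are made-up illustrative numbers; Bonsai derives its expectations from meiosis counts and genome-length constants:

```python
import math

def segment_count_loglik(n_obs, expected):
    """Poisson log-likelihood of observing n_obs IBD segments when a
    relationship predicts `expected` segments on average."""
    return n_obs * math.log(expected) - expected - math.lgamma(n_obs + 1)

# Hypothetical expected segment counts per relationship (illustrative only)
expected_counts = {"half-sibling": 30.0, "first-cousin": 17.0,
                   "second-cousin": 5.0}

n_obs = 18
logliks = {rel: segment_count_loglik(n_obs, lam)
           for rel, lam in expected_counts.items()}
best = max(logliks, key=logliks.get)
print(best)  # first-cousin
```

In the real PwLogLike class, this count term is combined with segment-length and age terms into a single log-likelihood per candidate relationship.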
Pedigree Connections (connections.py)
The connections.py module handles the logic for connecting pedigree fragments:
- Connection Assessment: assess_connections() evaluates potential connections between pedigrees
- Connection Points: find_closest_pedigrees() identifies optimal points to join pedigrees
- Pedigree Merging: combine_pedigrees() connects separate pedigrees into larger structures
- Connection Validation: Functions that ensure connections satisfy biological constraints
- Age-based Validation: passes_age_check() verifies that connections are chronologically plausible
This module enables Bonsai's hierarchical construction approach, where small pedigree fragments are progressively combined into larger structures.
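The flavor of an age-based validation can be shown with a toy check for a proposed parent-child connection. The thresholds and the treatment of missing ages here are illustrative assumptions, not Bonsai's actual policy:

```python
def passes_age_check_sketch(age_child, age_parent, min_gap=12, max_gap=60):
    """Toy chronological plausibility check for a proposed parent-child
    connection; thresholds are illustrative, not Bonsai's actual values."""
    if age_child is None or age_parent is None:
        return True  # here, a missing age is treated as non-disqualifying
    gap = age_parent - age_child
    return min_gap <= gap <= max_gap

print(passes_age_check_sketch(30, 58))  # True
print(passes_age_check_sketch(30, 35))  # False: gap too small to be a parent
```

Checks like this prune biologically implausible connections cheaply, before any expensive likelihood evaluation.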
Data Flow and Tracking Structures
End-to-End Data Flow
The complete data flow in Bonsai v3 proceeds through several transformations:
- Input: Raw IBD detector output (segment lists, either phased or unphased)
- Preprocessing: IBD data is normalized, filtered, and converted to a standard format
- Pairwise Analysis: IBD patterns between each pair of individuals are analyzed
- Relationship Inference: Most likely relationships are computed for each pair
- Clustering: Closely related individuals are grouped into initial pedigree fragments
- Merging: Pedigree fragments are progressively combined based on connecting relationships
- Refinement: The merged pedigree is optimized to maximize overall likelihood
- Output: The final pedigree is returned as an up-node dictionary with likelihood scores
This flow is coordinated by the build_pedigree() function in bonsai.py, which orchestrates the entire process.
Pedigree Tracking Structures
During the pedigree construction process, Bonsai v3 maintains several tracking structures to manage pedigree fragments:
- idx_to_up_dict_ll_list: Maps pedigree indices to lists of (pedigree, log-likelihood) pairs, storing multiple hypothesis pedigrees for each cluster
- id_to_idx: Maps individual IDs to their current pedigree cluster indices
- idx_to_id_set: Maps pedigree indices to sets of contained individual IDs
These structures enable efficient pedigree merging operations by tracking which individuals belong to which pedigree fragments and maintaining multiple hypotheses when ambiguity exists.
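A toy merge of two clusters shows how the three structures are updated together. The structure names mirror the text; the contents and the merge function are made up for illustration:

```python
# Two single-hypothesis clusters before merging (contents are made up)
idx_to_up_dict_ll_list = {
    0: [({1000: {}, 1003: {}}, -12.5)],
    1: [({2000: {}, 2001: {}}, -8.1)],
}
id_to_idx = {1000: 0, 1003: 0, 2000: 1, 2001: 1}
idx_to_id_set = {0: {1000, 1003}, 1: {2000, 2001}}

def merge_clusters(a, b):
    """Fold cluster b into cluster a, keeping every hypothesis combination
    and updating all three tracking structures consistently."""
    merged = []
    for ped_a, ll_a in idx_to_up_dict_ll_list[a]:
        for ped_b, ll_b in idx_to_up_dict_ll_list[b]:
            merged.append(({**ped_a, **ped_b}, ll_a + ll_b))
    idx_to_up_dict_ll_list[a] = merged
    del idx_to_up_dict_ll_list[b]
    idx_to_id_set[a] |= idx_to_id_set.pop(b)
    for iid in idx_to_id_set[a]:
        id_to_idx[iid] = a

merge_clusters(0, 1)
print(idx_to_id_set[0])  # {1000, 1003, 2000, 2001}
print(id_to_idx[2001])   # 0
```

Because log-likelihoods add, the merged hypothesis carries the sum of the two fragments' scores (before any connection-specific terms, which this sketch omits).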
Multiple Hypothesis Tracking
A key feature of Bonsai v3 is its ability to track multiple alternative pedigree hypotheses:
- Each pedigree index in idx_to_up_dict_ll_list can map to multiple (pedigree, likelihood) pairs
- These alternatives represent different possible structures with varying probabilities
- During merging, the system considers combinations of hypotheses from each fragment
- The system prunes unlikely hypotheses to maintain computational efficiency
This approach allows Bonsai to handle ambiguity in relationship inference, maintaining multiple possibilities until additional evidence clarifies the correct structure.
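The pruning step can be sketched as a keep-top-k filter with a log-likelihood cutoff. Both thresholds here are illustrative; Bonsai's actual pruning policy may differ:

```python
def prune_hypotheses(ped_ll_list, keep=3, max_ll_gap=10.0):
    """Keep the top-`keep` (pedigree, log-likelihood) hypotheses, dropping
    any whose score trails the best by more than max_ll_gap.
    Thresholds are illustrative, not Bonsai's actual values."""
    ranked = sorted(ped_ll_list, key=lambda t: t[1], reverse=True)
    best = ranked[0][1]
    return [(p, ll) for p, ll in ranked[:keep] if best - ll <= max_ll_gap]

hyps = [("A", -5.0), ("B", -6.2), ("C", -30.0), ("D", -7.9)]
print([p for p, _ in prune_hypotheses(hyps)])  # ['A', 'B', 'D']
```

Hypothesis "C" is dropped even though `keep=3` leaves room for it, because its score trails the best by far more than the cutoff.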
Performance Optimizations
Caching and Memoization
Bonsai v3 employs several caching strategies to optimize performance:
- Relationship Caching: The caching.py module implements memoization for expensive relationship calculations
- IBD Statistics Caching: IBD summary statistics are computed once and reused
- Likelihood Caching: Relationship likelihood scores are cached to avoid redundant computation
- Traversal Caching: Results of pedigree traversal operations are stored for reuse
These caching mechanisms significantly improve performance, especially for large pedigrees where the same relationships and traversal operations are queried repeatedly.
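The memoization pattern itself is standard Python. The sketch below caches a recursive pedigree traversal with functools.lru_cache; the function and its hashable pedigree encoding are illustrative, not what caching.py actually caches:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_paths_up(parents, iid, depth):
    """Number of ancestor paths of length `depth` above iid. The cache
    avoids recomputing shared sub-traversals, the same memoization idea
    used in caching.py (which caches different quantities)."""
    if depth == 0:
        return 1
    pmap = dict(parents)
    return sum(count_paths_up(parents, p, depth - 1)
               for p in pmap.get(iid, ()))

# Pedigree encoded hashably as (child, (parent, parent)) pairs, since
# lru_cache requires hashable arguments (plain dicts are not hashable).
parents = ((1000, (1001, 1002)), (1001, (1003, 1004)), (1002, (1005, 1006)))
print(count_paths_up(parents, 1000, 2))  # 4
```

Repeated queries against the same pedigree hit the cache instead of re-walking parent links, which is where the large-pedigree speedups come from.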
Algorithmic Optimizations
Beyond caching, Bonsai v3 incorporates several algorithmic optimizations:
- Progressive Construction: Building pedigrees incrementally rather than attempting global optimization
- Prioritization: Focusing on high-confidence relationships first to establish a reliable core structure
- Pruning: Discarding low-likelihood alternative hypotheses to maintain manageable state space
- Efficient Data Structures: Using dictionaries and sets for O(1) lookups in critical operations
- Locality-Based Processing: Organizing computation to maximize cache efficiency
These optimizations enable Bonsai v3 to scale to large datasets while maintaining reasonable computational requirements.
Architectural Insight: Bonsai v3's architecture embodies an important principle in computational genetics: complex problems can often be decomposed into simpler, more manageable sub-problems. By breaking pedigree reconstruction into a series of localized decisions about relationships and connections, Bonsai transforms an intractable global optimization problem into a series of tractable local optimizations.
Comparing Notebook and Production Code
The Lab02 notebook provides simplified examples of Bonsai v3's data structures and operations, while the actual implementation includes additional complexity and optimizations:
- Error Handling: The production code includes comprehensive error checking and validation
- Edge Cases: The real implementation handles numerous special cases not covered in the educational examples
- Performance Optimizations: The production code includes sophisticated caching and optimization strategies
- Parallel Processing: Some components support parallel execution for better performance
- Logging and Diagnostics: The real system includes detailed logging for troubleshooting
Despite these differences, the core data structures and algorithmic approaches demonstrated in the lab directly correspond to those used in the production system, providing an accurate conceptual foundation.
Interactive Lab Environment
Run the interactive Lab 02 notebook in Google Colab. Data will be automatically downloaded from S3 when you run the notebook.
Note: You may need a Google account to save your work in Google Drive.
Beyond the Code
As you explore Bonsai v3's architecture, consider these broader implications:
- Design Principles: How modular design and separation of concerns enable complex system development
- Balancing Act: The tradeoffs between computational efficiency, accuracy, and code maintainability
- Extensibility: How the architecture facilitates adding new features or supporting new data types
- Software Engineering Practices: How testing, documentation, and error handling contribute to robust software
These considerations highlight how Bonsai v3's architecture reflects not just genetic genealogy principles but also software engineering best practices.
This lab is part of the Bonsai v3 Deep Dive track:
- Lab 01: Introduction
- Lab 02: Architecture
- Lab 03: IBD Formats
- Lab 04: Statistics
- Lab 05: Models
- Lab 06: Relationships
- Lab 07: PwLogLike
- Lab 08: Age Modeling
- Lab 09: Data Structures
- Lab 10: Up-Node Dict