Computational Genetic Genealogy

Bonsai v3 Architecture and Core Data Structures

Lab 02: Bonsai v3 Architecture Overview and Core Data Structures

System Design: This lab examines the architectural design and data structures of Bonsai v3. Understanding how this system is organized is essential for mastering its operation and extending its capabilities.

Architectural Overview

Design Philosophy

Bonsai v3 is built around several key design principles that shape its architecture:

  • Modularity: The system is organized into distinct components with clear responsibilities, enabling easier maintenance and extension
  • Separation of Concerns: Different aspects of pedigree reconstruction are handled by specialized modules
  • Statistical Foundations: The system represents relationship inference as a probabilistic problem, using likelihood-based approaches
  • Incremental Construction: Pedigrees are built progressively, from closely related clusters to more distant connections
  • Efficient Data Structures: The system uses specialized data structures optimized for pedigree operations

These principles enable Bonsai v3 to handle the complexities of pedigree reconstruction while maintaining code readability and performance.

Module Organization

Bonsai v3 is organized into several key modules, each with specific responsibilities:

  • bonsai.py: Main entry point and orchestration module that coordinates the complete workflow
  • ibd.py: Handles IBD data processing, conversion, and extraction of genetic sharing statistics
  • likelihoods.py: Implements statistical models for relationship inference and likelihood calculation
  • pedigrees.py: Provides data structures and operations for representing and manipulating pedigrees
  • connections.py: Handles the logic for connecting individuals and merging pedigree fragments
  • utils.py: Contains utility functions used throughout the codebase
  • constants.py: Stores system constants and configuration parameters
  • caching.py: Implements performance optimizations through memoization
  • exceptions.py: Defines custom exception types for error handling
  • rendering.py: Provides visualization and rendering capabilities for pedigrees

This modular organization allows each component to focus on its specific responsibility while working together to achieve the overall goal of pedigree reconstruction.

Main Workflow

The core workflow in Bonsai v3 is orchestrated by the build_pedigree() function in bonsai.py, which follows these steps:

  1. Data Preprocessing: IBD data is loaded, filtered, and normalized for analysis
  2. Relationship Inference: Pairwise relationships between individuals are computed using likelihood models
  3. Initial Clusters: Closely related individuals are grouped into small, high-confidence pedigree clusters
  4. Cluster Merging: Small clusters are progressively merged into larger pedigrees based on genetic evidence
  5. Refinement: The merged pedigree is optimized to maximize overall likelihood
  6. Output Generation: The final pedigree is returned along with likelihood scores and statistics

Each step in this workflow involves multiple operations handled by different modules, coordinated by the main build_pedigree() function.

The Up-Node Dictionary: Bonsai's Central Data Structure

Structure and Components

The up-node dictionary is the fundamental data structure used in Bonsai v3 to represent pedigrees. It efficiently encodes parent-child relationships in a dictionary-based representation:

{
    1000: {1001: 1, 1002: 1},  # Individual 1000 has parents 1001 and 1002
    1003: {1001: 1, 1002: 1},  # Individual 1003 has the same parents (siblings)
    1004: {-1: 1, -2: 1},      # Individual 1004 has inferred parents -1 and -2
    1005: {-1: 1, 1002: 1},    # Individual 1005 has one inferred parent and one known parent
    -1: {1006: 1, 1007: 1},    # Inferred individual -1 has parents 1006 and 1007
    1001: {},  # Empty dictionaries represent founder individuals with no recorded parents
    1002: {},
    1006: {},
    1007: {},
    -2: {}
}

Key aspects of this structure include:

  • Each individual is identified by a unique ID (key in the outer dictionary)
  • The value for each individual is another dictionary mapping to their parents
  • The inner dictionary's values (typically 1) can represent relationship types
  • Individuals with no parents (founders) map to empty dictionaries
  • Negative IDs represent inferred (latent) individuals not present in the original data

This structure allows Bonsai to represent arbitrary pedigree configurations while maintaining computational efficiency.

Advantages of the Up-Node Dictionary

The up-node dictionary offers several advantages over alternative pedigree representations:

  • Memory Efficiency: Only stores necessary relationships, making it compact for large pedigrees
  • Fast Lookup: Provides O(1) access to an individual's parents
  • Flexible Structure: Can represent complex relationships including half-siblings, multiple generations, and missing individuals
  • Explicit Latent Nodes: Uses negative IDs to clearly identify inferred individuals not present in the original data
  • Simple Operations: Common pedigree operations (finding ancestors, descendants, siblings) can be implemented efficiently
  • Direct Manipulation: Allows for straightforward modification of the pedigree structure

These properties make the up-node dictionary ideal for Bonsai's incremental pedigree construction process, which requires frequent pedigree manipulations and relationship queries.

Key Operations

Bonsai v3 implements numerous operations on up-node dictionaries in the pedigrees.py module:

  • Relationship Computation: Functions like get_simple_rel_tuple() determine the relationship between any two individuals
  • Ancestor Identification: get_common_ancs() finds common ancestors between individuals
  • Pedigree Traversal: Functions for navigating up and down the pedigree structure
  • Pedigree Joining: combine_up_dicts() merges separate pedigree fragments
  • Pedigree Validation: Functions that check biological consistency (no cycles, appropriate sexes)
  • Node Addition/Removal: Operations to modify the pedigree structure

These operations form the foundation for more complex algorithms in the Bonsai system, such as pedigree merging and relationship inference.

Core Modules in Depth

IBD Processing (ibd.py)

The ibd.py module handles all aspects of IBD data management and processing:

  • IBD Loading: Functions for parsing IBD segments from various detector formats
  • Format Conversion: Converting between phased and unphased IBD representations
  • Segment Filtering: Removing unreliable segments based on length or quality thresholds
  • IBD Statistics: Extracting metrics like total sharing, segment counts, and length distributions
  • Sharing Analysis: Identifying individuals with significant genetic sharing

Key functions include get_id_to_shared_ibd(), which summarizes genetic sharing between pairs of individuals, and get_closest_pair(), which identifies the most closely related individuals based on IBD patterns.

Relationship Likelihoods (likelihoods.py)

The likelihoods.py module implements the statistical models used for relationship inference:

  • PwLogLike Class: The primary class for computing pairwise relationship likelihoods
  • Segment Distribution Models: Statistical models for IBD segment lengths in different relationships
  • Segment Count Models: Poisson models for the expected number of IBD segments
  • Age Models: Probability distributions for age differences in various relationships
  • Combined Scoring: Methods for integrating genetic and demographic evidence

The PwLogLike class is particularly important, as it provides the foundation for all relationship inferences in Bonsai v3, implementing sophisticated statistical models calibrated on empirical data.

Pedigree Connections (connections.py)

The connections.py module handles the logic for connecting pedigree fragments:

  • Connection Assessment: assess_connections() evaluates potential connections between pedigrees
  • Connection Points: find_closest_pedigrees() identifies optimal points to join pedigrees
  • Pedigree Merging: combine_pedigrees() connects separate pedigrees into larger structures
  • Connection Validation: Functions that ensure connections satisfy biological constraints
  • Age-based Validation: passes_age_check() verifies that connections are chronologically plausible

This module enables Bonsai's hierarchical construction approach, where small pedigree fragments are progressively combined into larger structures.

Data Flow and Tracking Structures

End-to-End Data Flow

The complete data flow in Bonsai v3 proceeds through several transformations:

  1. Input: Raw IBD detector output (segment lists, either phased or unphased)
  2. Preprocessing: IBD data is normalized, filtered, and converted to a standard format
  3. Pairwise Analysis: IBD patterns between each pair of individuals are analyzed
  4. Relationship Inference: Most likely relationships are computed for each pair
  5. Clustering: Closely related individuals are grouped into initial pedigree fragments
  6. Merging: Pedigree fragments are progressively combined based on connecting relationships
  7. Refinement: The merged pedigree is optimized to maximize overall likelihood
  8. Output: The final pedigree is returned as an up-node dictionary with likelihood scores

This flow is coordinated by the build_pedigree() function in bonsai.py, which orchestrates the entire process.

Pedigree Tracking Structures

During the pedigree construction process, Bonsai v3 maintains several tracking structures to manage pedigree fragments:

  • idx_to_up_dict_ll_list: Maps pedigree indices to lists of (pedigree, log-likelihood) pairs, storing multiple hypothesis pedigrees for each cluster
  • id_to_idx: Maps individual IDs to their current pedigree cluster indices
  • idx_to_id_set: Maps pedigree indices to sets of contained individual IDs

These structures enable efficient pedigree merging operations by tracking which individuals belong to which pedigree fragments and maintaining multiple hypotheses when ambiguity exists.

Multiple Hypothesis Tracking

A key feature of Bonsai v3 is its ability to track multiple alternative pedigree hypotheses:

  • Each pedigree index in idx_to_up_dict_ll_list can map to multiple (pedigree, likelihood) pairs
  • These alternatives represent different possible structures with varying probabilities
  • During merging, the system considers combinations of hypotheses from each fragment
  • The system prunes unlikely hypotheses to maintain computational efficiency

This approach allows Bonsai to handle ambiguity in relationship inference, maintaining multiple possibilities until additional evidence clarifies the correct structure.

Performance Optimizations

Caching and Memoization

Bonsai v3 employs several caching strategies to optimize performance:

  • Relationship Caching: The caching.py module implements memoization for expensive relationship calculations
  • IBD Statistics Caching: IBD summary statistics are computed once and reused
  • Likelihood Caching: Relationship likelihood scores are cached to avoid redundant computation
  • Traversal Caching: Results of pedigree traversal operations are stored for reuse

These caching mechanisms significantly improve performance, especially for large pedigrees where the same relationships and traversal operations are queried repeatedly.

Algorithmic Optimizations

Beyond caching, Bonsai v3 incorporates several algorithmic optimizations:

  • Progressive Construction: Building pedigrees incrementally rather than attempting global optimization
  • Prioritization: Focusing on high-confidence relationships first to establish a reliable core structure
  • Pruning: Discarding low-likelihood alternative hypotheses to maintain manageable state space
  • Efficient Data Structures: Using dictionaries and sets for O(1) lookups in critical operations
  • Locality-Based Processing: Organizing computation to maximize cache efficiency

These optimizations enable Bonsai v3 to scale to large datasets while maintaining reasonable computational requirements.

Architectural Insight: Bonsai v3's architecture embodies an important principle in computational genetics: complex problems can often be decomposed into simpler, more manageable sub-problems. By breaking pedigree reconstruction into a series of localized decisions about relationships and connections, Bonsai transforms an intractable global optimization problem into a series of tractable local optimizations.

Comparing Notebook and Production Code

The Lab02 notebook provides simplified examples of Bonsai v3's data structures and operations, while the actual implementation includes additional complexity and optimizations:

Despite these differences, the core data structures and algorithmic approaches demonstrated in the lab directly correspond to those used in the production system, providing an accurate conceptual foundation.

Interactive Lab Environment

Run the interactive Lab 02 notebook in Google Colab:

Google Colab Environment

Run the notebook in Google Colab for a powerful computing environment with access to Google's resources.

Data will be automatically downloaded from S3 when you run the notebook.

Note: You may need a Google account to save your work in Google Drive.

Open Lab 02 Notebook in Google Colab

Beyond the Code

As you explore Bonsai v3's architecture, consider these broader implications:

These considerations highlight how Bonsai v3's architecture reflects not just genetic genealogy principles but also software engineering best practices.

This lab is part of the Bonsai v3 Deep Dive track:

Introduction

Lab 01

Architecture

Lab 02

IBD Formats

Lab 03

Statistics

Lab 04

Models

Lab 05

Relationships

Lab 06

PwLogLike

Lab 07

Age Modeling

Lab 08

Data Structures

Lab 09

Up-Node Dict

Lab 10