Lab 02: Bonsai v3 Architecture Overview and Core Data Structures
System Design: This lab examines the architectural design and data structures of Bonsai v3. Understanding how this system is organized is essential for mastering its operation and extending its capabilities.
Architectural Overview
Design Philosophy
Bonsai v3 is built around several key design principles that shape its architecture:
- Modularity: The system is organized into distinct components with clear responsibilities, enabling easier maintenance and extension
- Separation of Concerns: Different aspects of pedigree reconstruction are handled by specialized modules
- Statistical Foundations: The system represents relationship inference as a probabilistic problem, using likelihood-based approaches
- Incremental Construction: Pedigrees are built progressively, from closely related clusters to more distant connections
- Efficient Data Structures: The system uses specialized data structures optimized for pedigree operations
These principles enable Bonsai v3 to handle the complexities of pedigree reconstruction while maintaining code readability and performance.
Module Organization
Bonsai v3 is organized into several key modules, each with specific responsibilities:
- bonsai.py: Main entry point and orchestration module that coordinates the complete workflow
- ibd.py: Handles IBD data processing, conversion, and extraction of genetic sharing statistics
- likelihoods.py: Implements statistical models for relationship inference and likelihood calculation
- pedigrees.py: Provides data structures and operations for representing and manipulating pedigrees
- connections.py: Handles the logic for connecting individuals and merging pedigree fragments
- utils.py: Contains utility functions used throughout the codebase
- constants.py: Stores system constants and configuration parameters
- caching.py: Implements performance optimizations through memoization
- exceptions.py: Defines custom exception types for error handling
- rendering.py: Provides visualization and rendering capabilities for pedigrees
This modular organization allows each component to focus on its specific responsibility while working together to achieve the overall goal of pedigree reconstruction.
Main Workflow
The core workflow in Bonsai v3 is orchestrated by the build_pedigree() function in bonsai.py, which follows these steps:
- Data Preprocessing: IBD data is loaded, filtered, and normalized for analysis
- Relationship Inference: Pairwise relationships between individuals are computed using likelihood models
- Initial Clusters: Closely related individuals are grouped into small, high-confidence pedigree clusters
- Cluster Merging: Small clusters are progressively merged into larger pedigrees based on genetic evidence
- Refinement: The merged pedigree is optimized to maximize overall likelihood
- Output Generation: The final pedigree is returned along with likelihood scores and statistics
Each step in this workflow involves multiple operations handled by different modules, coordinated by the main build_pedigree() function.
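The steps above can be condensed into a toy skeleton. Everything here is illustrative: the expected-sharing values are rough textbook averages in centimorgans, classification on total cM alone is a drastic simplification of the likelihood models described later, and build_pedigree_sketch() is not the production function.

```python
# Approximate expected total IBD sharing (cM) for a few relationship
# classes. These are rough textbook averages, not Bonsai's actual models.
EXPECTED_CM = {"parent-child": 3400, "full-sibling": 2550,
               "half-sibling": 1700, "first-cousin": 850, "unrelated": 0}

def classify_pair(total_cm):
    """Pick the relationship whose expected sharing is closest to total_cm."""
    return min(EXPECTED_CM, key=lambda r: abs(EXPECTED_CM[r] - total_cm))

def build_pedigree_sketch(pair_totals):
    """Illustrative skeleton of the build_pedigree() stages: infer a
    relationship per pair, then seed clusters from close relatives.
    Merging and refinement are elided in this sketch."""
    rels = {pair: classify_pair(cm) for pair, cm in pair_totals.items()}
    close = [p for p, r in rels.items() if r != "unrelated"]
    return {"relationships": rels, "close_pairs": close}

totals = {(1, 2): 3350.0, (1, 3): 820.0, (2, 3): 10.0}
out = build_pedigree_sketch(totals)
print(out["relationships"][(1, 2)])  # parent-child
```

In the real system, the per-pair inference step integrates segment lengths, counts, and ages rather than a single total, but the overall shape (pairwise inference, then clustering on high-confidence pairs) is the same.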
The Up-Node Dictionary: Bonsai's Central Data Structure
Structure and Components
The up-node dictionary is the fundamental data structure used in Bonsai v3 to represent pedigrees. It efficiently encodes parent-child relationships in a dictionary-based representation:
```python
{
    1000: {1001: 1, 1002: 1},  # Individual 1000 has parents 1001 and 1002
    1003: {1001: 1, 1002: 1},  # Individual 1003 has the same parents (siblings)
    1004: {-1: 1, -2: 1},      # Individual 1004 has inferred parents -1 and -2
    1005: {-1: 1, 1002: 1},    # Individual 1005 has one inferred parent and one known parent
    -1: {1006: 1, 1007: 1},    # Inferred individual -1 has parents 1006 and 1007
    1001: {},                  # Empty dictionaries represent founders with no recorded parents
    1002: {},
    1006: {},
    1007: {},
    -2: {},
}
```
Key aspects of this structure include:
- Each individual is identified by a unique ID (key in the outer dictionary)
- The value for each individual is another dictionary mapping to their parents
- The inner dictionary's values (typically 1) can represent relationship types
- Individuals with no parents (founders) map to empty dictionaries
- Negative IDs represent inferred (latent) individuals not present in the original data
This structure allows Bonsai to represent arbitrary pedigree configurations while maintaining computational efficiency.
Advantages of the Up-Node Dictionary
The up-node dictionary offers several advantages over alternative pedigree representations:
- Memory Efficiency: Only stores necessary relationships, making it compact for large pedigrees
- Fast Lookup: Provides O(1) access to an individual's parents
- Flexible Structure: Can represent complex relationships including half-siblings, multiple generations, and missing individuals
- Explicit Latent Nodes: Uses negative IDs to clearly identify inferred individuals not present in the original data
- Simple Operations: Common pedigree operations (finding ancestors, descendants, siblings) can be implemented efficiently
- Direct Manipulation: Allows for straightforward modification of the pedigree structure
These properties make the up-node dictionary ideal for Bonsai's incremental pedigree construction process, which requires frequent pedigree manipulations and relationship queries.
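A few of these operations (parent lookup, ancestor traversal, sibling finding) can be sketched in a handful of lines. The helper names below are illustrative, not necessarily the pedigrees.py API:

```python
def get_parents(up_dict, iid):
    """O(1) lookup of an individual's parents in an up-node dict."""
    return set(up_dict.get(iid, {}))

def get_ancestors(up_dict, iid):
    """All ancestors of iid, found by walking parent links upward."""
    anc, stack = set(), list(get_parents(up_dict, iid))
    while stack:
        a = stack.pop()
        if a not in anc:
            anc.add(a)
            stack.extend(get_parents(up_dict, a))
    return anc

def get_siblings(up_dict, iid):
    """Individuals sharing at least one parent with iid (full or half sibs)."""
    parents = get_parents(up_dict, iid)
    return {o for o in up_dict
            if o != iid and parents & get_parents(up_dict, o)}

up_dict = {
    1000: {1001: 1, 1002: 1},
    1003: {1001: 1, 1002: 1},
    1005: {-1: 1, 1002: 1},
    -1: {1006: 1, 1007: 1},
    1001: {}, 1002: {}, 1006: {}, 1007: {},
}
print(get_siblings(up_dict, 1000))   # {1003, 1005}
print(get_ancestors(up_dict, 1005))  # {-1, 1002, 1006, 1007}
```

Note that 1005 appears among 1000's siblings because they share parent 1002: the same sketch handles full and half siblings without extra machinery.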
Key Operations
Bonsai v3 implements numerous operations on up-node dictionaries in the pedigrees.py module:
- Relationship Computation: Functions like get_simple_rel_tuple() determine the relationship between any two individuals
- Ancestor Identification: get_common_ancs() finds common ancestors between individuals
- Pedigree Traversal: Functions for navigating up and down the pedigree structure
- Pedigree Joining: combine_up_dicts() merges separate pedigree fragments
- Pedigree Validation: Functions that check biological consistency (no cycles, appropriate sexes)
- Node Addition/Removal: Operations to modify the pedigree structure
These operations form the foundation for more complex algorithms in the Bonsai system, such as pedigree merging and relationship inference.
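Pedigree joining in particular is easy to picture on the up-node representation. The following is a toy stand-in for combine_up_dicts(), assuming the simplest case where a child in one fragment gains a parent from the other; the real function handles conflicts and many more connection patterns:

```python
def combine_up_dicts_sketch(up1, up2, child, parent):
    """Toy merge of two pedigree fragments: copy both up-node dicts,
    then record `parent` (from up2) as a parent of `child` (from up1).
    Illustrative stand-in for Bonsai's combine_up_dicts()."""
    merged = {i: dict(p) for i, p in up1.items()}
    for i, p in up2.items():
        merged.setdefault(i, {}).update(p)
    merged.setdefault(child, {})[parent] = 1
    # Basic biological consistency check: at most two parents per individual
    assert len(merged[child]) <= 2, "an individual cannot have >2 parents"
    return merged

frag_a = {1000: {}, 1003: {}}
frag_b = {-1: {}, 2000: {-1: 1}}
merged = combine_up_dicts_sketch(frag_a, frag_b, child=1000, parent=-1)
print(merged[1000])  # {-1: 1}
```

Because fragments are plain dictionaries, merging is just dictionary union plus a new parent link, which is part of why the up-node representation suits incremental construction.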
Core Modules in Depth
IBD Processing (ibd.py)
The ibd.py module handles all aspects of IBD data management and processing:
- IBD Loading: Functions for parsing IBD segments from various detector formats
- Format Conversion: Converting between phased and unphased IBD representations
- Segment Filtering: Removing unreliable segments based on length or quality thresholds
- IBD Statistics: Extracting metrics like total sharing, segment counts, and length distributions
- Sharing Analysis: Identifying individuals with significant genetic sharing
Key functions include get_id_to_shared_ibd(), which summarizes genetic sharing between pairs of individuals, and get_closest_pair(), which identifies the most closely related individuals based on IBD patterns.
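The core of both functions can be sketched in a few lines. The `_sketch` versions below are illustrative analogs operating on `(id1, id2, length_cm)` tuples, not the actual ibd.py signatures:

```python
from collections import defaultdict

def get_id_to_shared_ibd_sketch(segments):
    """Total IBD sharing (cM) per unordered pair of individuals;
    toy analog of get_id_to_shared_ibd()."""
    totals = defaultdict(float)
    for id1, id2, cm in segments:
        totals[tuple(sorted((id1, id2)))] += cm
    return dict(totals)

def get_closest_pair_sketch(totals):
    """Pair with the greatest total sharing; toy analog of get_closest_pair()."""
    return max(totals, key=totals.get)

segments = [(1, 2, 88.5), (2, 1, 41.5), (1, 3, 22.0), (2, 3, 7.3)]
totals = get_id_to_shared_ibd_sketch(segments)
print(totals[(1, 2)])                   # 130.0
print(get_closest_pair_sketch(totals))  # (1, 2)
```

Sorting each ID pair makes the summary order-independent, so `(1, 2)` and `(2, 1)` segments accumulate into the same entry.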
Relationship Likelihoods (likelihoods.py)
The likelihoods.py module implements the statistical models used for relationship inference:
- PwLogLike Class: The primary class for computing pairwise relationship likelihoods
- Segment Distribution Models: Statistical models for IBD segment lengths in different relationships
- Segment Count Models: Poisson models for the expected number of IBD segments
- Age Models: Probability distributions for age differences in various relationships
- Combined Scoring: Methods for integrating genetic and demographic evidence
The PwLogLike class is particularly important, as it provides the foundation for all relationship inferences in Bonsai v3, implementing sophisticated statistical models calibrated on empirical data.
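As one concrete piece, a Poisson model for segment counts scores each candidate relationship by how well its expected segment count explains the observed count. The expected counts below are made-up illustrative numbers; Bonsai derives its expectations from meiosis counts and genome-length constants:

```python
import math

def segment_count_loglik(n_obs, expected):
    """Poisson log-likelihood of observing n_obs IBD segments when a
    relationship predicts `expected` segments on average."""
    return n_obs * math.log(expected) - expected - math.lgamma(n_obs + 1)

# Hypothetical expected segment counts per relationship (illustrative only)
expected_counts = {"half-sibling": 30.0, "first-cousin": 17.0,
                   "second-cousin": 5.0}

n_obs = 18
logliks = {rel: segment_count_loglik(n_obs, lam)
           for rel, lam in expected_counts.items()}
best = max(logliks, key=logliks.get)
print(best)  # first-cousin
```

In the real PwLogLike class, this count term is combined with segment-length and age terms into a single log-likelihood per candidate relationship.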
Pedigree Connections (connections.py)
The connections.py module handles the logic for connecting pedigree fragments:
- Connection Assessment: assess_connections() evaluates potential connections between pedigrees
- Connection Points: find_closest_pedigrees() identifies optimal points to join pedigrees
- Pedigree Merging: combine_pedigrees() connects separate pedigrees into larger structures
- Connection Validation: Functions that ensure connections satisfy biological constraints
- Age-based Validation: passes_age_check() verifies that connections are chronologically plausible
This module enables Bonsai's hierarchical construction approach, where small pedigree fragments are progressively combined into larger structures.
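The flavor of an age-based validation can be shown with a toy check for a proposed parent-child connection. The thresholds and the treatment of missing ages here are illustrative assumptions, not Bonsai's actual policy:

```python
def passes_age_check_sketch(age_child, age_parent, min_gap=12, max_gap=60):
    """Toy chronological plausibility check for a proposed parent-child
    connection; thresholds are illustrative, not Bonsai's actual values."""
    if age_child is None or age_parent is None:
        return True  # here, a missing age is treated as non-disqualifying
    gap = age_parent - age_child
    return min_gap <= gap <= max_gap

print(passes_age_check_sketch(30, 58))  # True
print(passes_age_check_sketch(30, 35))  # False: gap too small to be a parent
```

Checks like this prune biologically implausible connections cheaply, before any expensive likelihood evaluation.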
Data Flow and Tracking Structures
End-to-End Data Flow
The complete data flow in Bonsai v3 proceeds through several transformations:
- Input: Raw IBD detector output (segment lists, either phased or unphased)
- Preprocessing: IBD data is normalized, filtered, and converted to a standard format
- Pairwise Analysis: IBD patterns between each pair of individuals are analyzed
- Relationship Inference: Most likely relationships are computed for each pair
- Clustering: Closely related individuals are grouped into initial pedigree fragments
- Merging: Pedigree fragments are progressively combined based on connecting relationships
- Refinement: The merged pedigree is optimized to maximize overall likelihood
- Output: The final pedigree is returned as an up-node dictionary with likelihood scores
This flow is coordinated by the build_pedigree() function in bonsai.py, which orchestrates the entire process.
Pedigree Tracking Structures
During the pedigree construction process, Bonsai v3 maintains several tracking structures to manage pedigree fragments:
- idx_to_up_dict_ll_list: Maps pedigree indices to lists of (pedigree, log-likelihood) pairs, storing multiple hypothesis pedigrees for each cluster
- id_to_idx: Maps individual IDs to their current pedigree cluster indices
- idx_to_id_set: Maps pedigree indices to sets of contained individual IDs
These structures enable efficient pedigree merging operations by tracking which individuals belong to which pedigree fragments and maintaining multiple hypotheses when ambiguity exists.
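A toy merge of two clusters shows how the three structures are updated together. The structure names mirror the text; the contents and the merge function are made up for illustration:

```python
# Two single-hypothesis clusters before merging (contents are made up)
idx_to_up_dict_ll_list = {
    0: [({1000: {}, 1003: {}}, -12.5)],
    1: [({2000: {}, 2001: {}}, -8.1)],
}
id_to_idx = {1000: 0, 1003: 0, 2000: 1, 2001: 1}
idx_to_id_set = {0: {1000, 1003}, 1: {2000, 2001}}

def merge_clusters(a, b):
    """Fold cluster b into cluster a, keeping every hypothesis combination
    and updating all three tracking structures consistently."""
    merged = []
    for ped_a, ll_a in idx_to_up_dict_ll_list[a]:
        for ped_b, ll_b in idx_to_up_dict_ll_list[b]:
            merged.append(({**ped_a, **ped_b}, ll_a + ll_b))
    idx_to_up_dict_ll_list[a] = merged
    del idx_to_up_dict_ll_list[b]
    idx_to_id_set[a] |= idx_to_id_set.pop(b)
    for iid in idx_to_id_set[a]:
        id_to_idx[iid] = a

merge_clusters(0, 1)
print(idx_to_id_set[0])  # {1000, 1003, 2000, 2001}
print(id_to_idx[2001])   # 0
```

Because log-likelihoods add, the merged hypothesis carries the sum of the two fragments' scores (before any connection-specific terms, which this sketch omits).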
Multiple Hypothesis Tracking
A key feature of Bonsai v3 is its ability to track multiple alternative pedigree hypotheses:
- Each pedigree index in idx_to_up_dict_ll_list can map to multiple (pedigree, likelihood) pairs
- These alternatives represent different possible structures with varying probabilities
- During merging, the system considers combinations of hypotheses from each fragment
- The system prunes unlikely hypotheses to maintain computational efficiency
This approach allows Bonsai to handle ambiguity in relationship inference, maintaining multiple possibilities until additional evidence clarifies the correct structure.
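The pruning step can be sketched as a keep-top-k filter with a log-likelihood cutoff. Both thresholds here are illustrative; Bonsai's actual pruning policy may differ:

```python
def prune_hypotheses(ped_ll_list, keep=3, max_ll_gap=10.0):
    """Keep the top-`keep` (pedigree, log-likelihood) hypotheses, dropping
    any whose score trails the best by more than max_ll_gap.
    Thresholds are illustrative, not Bonsai's actual values."""
    ranked = sorted(ped_ll_list, key=lambda t: t[1], reverse=True)
    best = ranked[0][1]
    return [(p, ll) for p, ll in ranked[:keep] if best - ll <= max_ll_gap]

hyps = [("A", -5.0), ("B", -6.2), ("C", -30.0), ("D", -7.9)]
print([p for p, _ in prune_hypotheses(hyps)])  # ['A', 'B', 'D']
```

Hypothesis "C" is dropped even though `keep=3` leaves room for it, because its score trails the best by far more than the cutoff.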
Performance Optimizations
Caching and Memoization
Bonsai v3 employs several caching strategies to optimize performance:
- Relationship Caching: The caching.py module implements memoization for expensive relationship calculations
- IBD Statistics Caching: IBD summary statistics are computed once and reused
- Likelihood Caching: Relationship likelihood scores are cached to avoid redundant computation
- Traversal Caching: Results of pedigree traversal operations are stored for reuse
These caching mechanisms significantly improve performance, especially for large pedigrees where the same relationships and traversal operations are queried repeatedly.
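The memoization pattern itself is standard Python. The sketch below caches a recursive pedigree traversal with functools.lru_cache; the function and its hashable pedigree encoding are illustrative, not what caching.py actually caches:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_paths_up(parents, iid, depth):
    """Number of ancestor paths of length `depth` above iid. The cache
    avoids recomputing shared sub-traversals, the same memoization idea
    used in caching.py (which caches different quantities)."""
    if depth == 0:
        return 1
    pmap = dict(parents)
    return sum(count_paths_up(parents, p, depth - 1)
               for p in pmap.get(iid, ()))

# Pedigree encoded hashably as (child, (parent, parent)) pairs, since
# lru_cache requires hashable arguments (plain dicts are not hashable).
parents = ((1000, (1001, 1002)), (1001, (1003, 1004)), (1002, (1005, 1006)))
print(count_paths_up(parents, 1000, 2))  # 4
```

Repeated queries against the same pedigree hit the cache instead of re-walking parent links, which is where the large-pedigree speedups come from.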
Algorithmic Optimizations
Beyond caching, Bonsai v3 incorporates several algorithmic optimizations:
- Progressive Construction: Building pedigrees incrementally rather than attempting global optimization
- Prioritization: Focusing on high-confidence relationships first to establish a reliable core structure
- Pruning: Discarding low-likelihood alternative hypotheses to maintain manageable state space
- Efficient Data Structures: Using dictionaries and sets for O(1) lookups in critical operations
- Locality-Based Processing: Organizing computation to maximize cache efficiency
These optimizations enable Bonsai v3 to scale to large datasets while maintaining reasonable computational requirements.
Architectural Insight: Bonsai v3's architecture embodies an important principle in computational genetics: complex problems can often be decomposed into simpler, more manageable sub-problems. By breaking pedigree reconstruction into a series of localized decisions about relationships and connections, Bonsai transforms an intractable global optimization problem into a series of tractable local optimizations.
Comparing Notebook and Production Code
The Lab02 notebook provides simplified examples of Bonsai v3's data structures and operations, while the actual implementation includes additional complexity and optimizations:
- Error Handling: The production code includes comprehensive error checking and validation
- Edge Cases: The real implementation handles numerous special cases not covered in the educational examples
- Performance Optimizations: The production code includes sophisticated caching and optimization strategies
- Parallel Processing: Some components support parallel execution for better performance
- Logging and Diagnostics: The real system includes detailed logging for troubleshooting
Despite these differences, the core data structures and algorithmic approaches demonstrated in the lab directly correspond to those used in the production system, providing an accurate conceptual foundation.
Interactive Lab Environment
Run the interactive Lab 02 notebook in Google Colab. Data will be automatically downloaded from S3 when you run the notebook.
Note: You may need a Google account to save your work in Google Drive.
Beyond the Code
As you explore Bonsai v3's architecture, consider these broader implications:
- Design Principles: How modular design and separation of concerns enable complex system development
- Balancing Act: The tradeoffs between computational efficiency, accuracy, and code maintainability
- Extensibility: How the architecture facilitates adding new features or supporting new data types
- Software Engineering Practices: How testing, documentation, and error handling contribute to robust software
These considerations highlight how Bonsai v3's architecture reflects not just genetic genealogy principles but also software engineering best practices.
This lab is part of the Bonsai v3 Deep Dive track:
- Lab 01: Introduction
- Lab 02: Architecture
- Lab 03: IBD Formats
- Lab 04: Statistics
- Lab 05: Models
- Lab 06: Relationships
- Lab 07: PwLogLike
- Lab 08: Age Modeling
- Lab 09: Data Structures
- Lab 10: Up-Node Dict