Lab 03: IBD Data Formats and Preprocessing
Data Engineering: This lab explores the various IBD data formats used in genetic genealogy and how Bonsai v3 processes them. Understanding these formats and preprocessing steps is essential for effectively working with IBD detection tools and integrating them with Bonsai.
IBD Data Format Fundamentals
Unphased IBD Format
The unphased IBD format is the primary input format for Bonsai v3. Detectors such as IBIS emit unphased segments directly, while haplotype-aware output from tools like Refined IBD and HapIBD can be collapsed into it. In this format, segments are represented without haplotype-specific information:
[id1, id2, chromosome, start_bp, end_bp, is_full_ibd, seg_cm]
The fields in this format have specific meanings:
- id1, id2: Identifiers for the two individuals sharing the IBD segment (usually ordered so that id1 < id2)
- chromosome: The chromosome number (1-22) where the segment is located
- start_bp, end_bp: The starting and ending positions in base pairs
- is_full_ibd: A boolean flag indicating whether the segment is IBD1 (0) or IBD2 (1)
- seg_cm: The genetic length of the segment in centiMorgans
In Bonsai v3's implementation, the ibd.py module contains functions for working with this format, including get_ibd_stats_unphased(), which extracts statistical summaries from unphased IBD data. This is the format Bonsai most commonly works with in practice, since phased detector output can always be reduced to it.
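As a minimal sketch, one record in this layout could be represented in Python as follows. The UnphasedSegment and parse_unphased names, the integer IDs, and the sample values are illustrative assumptions, not part of Bonsai's API; only the field order comes from the format above.

```python
from typing import NamedTuple


class UnphasedSegment(NamedTuple):
    """One row of the unphased layout (hypothetical helper, not Bonsai API)."""
    id1: int
    id2: int
    chromosome: int
    start_bp: int
    end_bp: int
    is_full_ibd: bool
    seg_cm: float


def parse_unphased(row):
    """Build a segment from a 7-item row, ordering the pair so id1 < id2."""
    id1, id2, chrom, start_bp, end_bp, is_full, seg_cm = row
    lo, hi = sorted((int(id1), int(id2)))
    return UnphasedSegment(lo, hi, int(chrom), int(start_bp), int(end_bp),
                           bool(int(is_full)), float(seg_cm))


# Made-up example row: an IBD1 segment on chromosome 7.
seg = parse_unphased([1002, 1001, 7, 5_000_000, 25_000_000, 0, 21.4])
print(seg.id1, seg.id2, seg.is_full_ibd)  # 1001 1002 False
```

Sorting the pair at parse time means later lookups by (id1, id2) do not need to check both orderings.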
Phased IBD Format
The phased IBD format contains haplotype-specific information, indicating which exact copies of the chromosome contain the shared segment:
[id1, id2, hap1, hap2, chromosome, start_cm, end_cm, seg_cm]
The fields in this format provide more detailed information:
- id1, id2: Identifiers for the two individuals sharing the IBD segment
- hap1, hap2: Indicators (0 or 1) specifying which haplotype from each individual contains the shared segment
- chromosome: The chromosome number (1-22) where the segment is located
- start_cm, end_cm: The starting and ending positions in centiMorgans (genetic distance)
- seg_cm: The genetic length of the segment in centiMorgans
Phased IBD data provides richer information by specifying exactly which haplotype copies match between individuals. This can be valuable for distinguishing complex relationships, as the pattern of haplotype sharing offers additional clues about genealogical connections. In the real Bonsai v3 implementation, phased data enables more precise relationship inference and can improve pedigree reconstruction accuracy.
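The field constraints listed above can be checked mechanically. The sketch below is an illustration only: PhasedSegment and validate_phased are hypothetical names, and the check that seg_cm equals end_cm minus start_cm assumes all three values come from the same genetic map.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhasedSegment:
    """One row of the phased layout (hypothetical helper, not Bonsai API)."""
    id1: int
    id2: int
    hap1: int
    hap2: int
    chromosome: int
    start_cm: float
    end_cm: float
    seg_cm: float


def validate_phased(seg, tol=1e-6):
    """Return a list of problems found in one phased record (empty if none)."""
    problems = []
    if seg.hap1 not in (0, 1) or seg.hap2 not in (0, 1):
        problems.append("haplotype index must be 0 or 1")
    if not 1 <= seg.chromosome <= 22:
        problems.append("chromosome must be in 1-22")
    if seg.end_cm <= seg.start_cm:
        problems.append("end_cm must exceed start_cm")
    # Assumption: seg_cm is measured on the same map as start_cm and end_cm.
    if abs((seg.end_cm - seg.start_cm) - seg.seg_cm) > tol:
        problems.append("seg_cm does not match end_cm - start_cm")
    return problems


good = PhasedSegment(1, 2, 0, 1, 3, 10.0, 25.5, 15.5)
bad = PhasedSegment(1, 2, 2, 0, 23, 30.0, 20.0, 5.0)
print(validate_phased(good))      # []
print(len(validate_phased(bad)))  # 4
```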
Format Variations in IBD Detectors
Different IBD detection tools produce slightly different output formats that Bonsai v3 must handle:
- IBIS: Produces unphased output with segment coordinates in both base pairs and centiMorgans
- Refined IBD: Produces phased output with specific haplotype indicators
- HapIBD: Produces phased output with detailed quality scores
- GERMLINE: An older tool that produces unphased output in its own format
Bonsai v3's ibd.py module includes functions to normalize these different formats into a standard representation for consistent processing. This normalization is a critical preprocessing step that enables Bonsai to work with output from any IBD detector.
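One way to picture this normalization is a per-detector column map that is applied to each raw row. The sketch below uses two made-up layouts, tool_a and tool_b; they are not the actual column orders of IBIS, Refined IBD, HapIBD, or GERMLINE, which should be taken from each tool's documentation. The normalize_row name is likewise hypothetical.

```python
# Hypothetical column layouts for two imaginary detectors.
LAYOUTS = {
    "tool_a": {"id1": 0, "id2": 1, "chromosome": 2, "start_bp": 3,
               "end_bp": 4, "ibd_type": 5, "seg_cm": 6},
    "tool_b": {"id1": 0, "id2": 2, "chromosome": 4, "start_bp": 5,
               "end_bp": 6, "ibd_type": None, "seg_cm": 7},
}


def normalize_row(fields, tool):
    """Map one whitespace-split detector row to the unphased 7-field list."""
    cols = LAYOUTS[tool]
    id1, id2 = sorted((fields[cols["id1"]], fields[cols["id2"]]))
    # Assumption: a layout without an IBD-type column reports IBD1 only.
    is_full = (cols["ibd_type"] is not None
               and fields[cols["ibd_type"]] == "IBD2")
    return [id1, id2, int(fields[cols["chromosome"]]),
            int(fields[cols["start_bp"]]), int(fields[cols["end_bp"]]),
            is_full, float(fields[cols["seg_cm"]])]


row_a = "S2 S1 3 1000000 9000000 IBD2 8.5".split()
row_b = "S1 1 S2 2 3 1000000 9000000 8.5".split()
print(normalize_row(row_a, "tool_a"))
print(normalize_row(row_b, "tool_b"))
```

Keeping the layouts in data rather than in code makes it cheap to add another detector without touching the conversion logic.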
IBD Data Processing in Bonsai v3
The ibd.py Module
The ibd.py module in Bonsai v3 is responsible for all aspects of IBD data processing and serves as the interface between raw IBD detector output and Bonsai's pedigree reconstruction algorithms. This module implements numerous functions for working with IBD data:
- get_phased_to_unphased(): Converts phased IBD segments to unphased format by combining segments that overlap on different haplotypes
- get_unphased_to_phased(): Creates pseudo-phased segments from unphased IBD data by assigning segments to random or inferred haplotypes
- get_ibd_stats_unphased(): Extracts statistical summaries from unphased IBD data, including total sharing, segment counts, and length distributions
- filter_ibd_segments(): Applies quality filters to remove unreliable segments based on length or other criteria
- normalize_ibd_segments(): Standardizes IBD segment representations from different detectors
- get_id_to_shared_ibd(): Creates a mapping from individual pairs to their shared IBD segments
- get_closest_pair(): Identifies the most closely related pairs based on IBD sharing
These functions collectively transform raw IBD detector output into the structured data that Bonsai's relationship inference algorithms require. The module handles all the complexities of different IBD formats, coordinate systems, and data quality issues.
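To make the last two ideas in the list concrete, here is a small sketch of grouping segments by pair and then picking the pair with the most shared cM. The function names pair_to_segments and closest_pair are stand-ins chosen for this example; the signatures and return types of the real get_id_to_shared_ibd() and get_closest_pair() are not shown in this lab and may differ.

```python
from collections import defaultdict


def pair_to_segments(segments):
    """Group unphased rows by the (id1, id2) pair that shares them."""
    by_pair = defaultdict(list)
    for seg in segments:
        id1, id2 = sorted(seg[:2])
        by_pair[(id1, id2)].append(seg)
    return dict(by_pair)


def closest_pair(by_pair):
    """Return the pair with the largest total shared cM (last field)."""
    return max(by_pair, key=lambda pair: sum(s[6] for s in by_pair[pair]))


# Made-up rows: [id1, id2, chromosome, start_bp, end_bp, is_full_ibd, seg_cm]
segments = [
    [1, 2, 1, 1_000_000, 30_000_000, False, 35.0],
    [2, 1, 4, 2_000_000, 20_000_000, False, 22.0],
    [2, 3, 9, 5_000_000, 15_000_000, False, 12.0],
]
by_pair = pair_to_segments(segments)
print(closest_pair(by_pair))  # (1, 2)
```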
Format Conversion Implementation
The conversion between phased and unphased formats involves sophisticated algorithms implemented in the ibd.py module:
Phased to Unphased Conversion
The get_phased_to_unphased() function implements a multi-step process:
- Group segments by individual pair and chromosome
- Identify segments that overlap on different haplotypes
- Determine if overlapping segments indicate IBD2 regions
- Create unphased segments that represent the combined information
- Apply appropriate coordinate conversions (cM to bp when necessary)
This process is particularly important for integrating phased IBD detection results from tools like Refined IBD into Bonsai's framework. The function handles complexities such as partial overlaps, multiple overlapping segments, and consistent pair ordering.
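The grouping and overlap steps can be sketched with a sweep over segment endpoints. This is a simplified illustration, not the Bonsai implementation: the phased_to_unphased name is hypothetical, coordinates stay in cM (the cM-to-bp step is skipped), and any stretch covered by two or more phased segments is assumed to involve distinct haplotypes in both individuals and is therefore labeled IBD2.

```python
from collections import defaultdict


def phased_to_unphased(phased):
    """Collapse rows [id1, id2, hap1, hap2, chrom, start_cm, end_cm, seg_cm]
    into rows [id1, id2, chrom, start_cm, end_cm, is_full_ibd, seg_cm]."""
    groups = defaultdict(list)
    for id1, id2, _h1, _h2, chrom, start, end, _cm in phased:
        a, b = sorted((id1, id2))          # consistent pair ordering
        groups[(a, b, chrom)].append((start, end))

    out = []
    for (a, b, chrom), intervals in sorted(groups.items()):
        # Sweep over endpoints, tracking how many segments cover each stretch.
        events = sorted([(s, 1) for s, _ in intervals] +
                        [(e, -1) for _, e in intervals])
        depth, prev = 0, None
        for pos, delta in events:
            if depth > 0 and pos > prev:
                is_full = depth >= 2       # assumed IBD2 when covered twice
                last = out[-1] if out else None
                if (last and last[:3] == [a, b, chrom]
                        and last[4] == prev and last[5] == is_full):
                    # Extend the adjacent run of the same IBD type.
                    last[4], last[6] = pos, pos - last[3]
                else:
                    out.append([a, b, chrom, prev, pos, is_full, pos - prev])
            depth += delta
            prev = pos
    return out


phased = [
    [1, 2, 0, 0, 1, 10.0, 50.0, 40.0],
    [2, 1, 1, 1, 1, 30.0, 70.0, 40.0],
]
for row in phased_to_unphased(phased):
    print(row)
```

In the example, the two phased segments overlap between 30 and 50 cM, so the output is an IBD1 stretch, an IBD2 stretch, and another IBD1 stretch.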
Unphased to Phased Conversion
The get_unphased_to_phased() function performs the more challenging inverse operation:
- For IBD2 segments, create pairs of phased segments (one for each haplotype pair)
- For IBD1 segments, apply heuristics to assign them to specific haplotypes
- When possible, use other segments from the same individuals to infer likely haplotype assignments
- Apply coordinate conversions (bp to cM when necessary)
This conversion is inherently underdetermined: unphased data lacks haplotype specificity, so the assigned haplotypes are inferred rather than observed. Bonsai uses observed patterns of IBD sharing across the genome to infer the most likely haplotype configurations.
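A deliberately naive sketch of this direction is shown below. It covers only the first two steps: IBD2 rows become two phased rows, and IBD1 rows get a fixed placeholder haplotype assignment instead of an inferred one. The unphased_to_phased name is hypothetical, and the input is assumed to already carry cM start and end positions, so the bp-to-cM step is omitted.

```python
def unphased_to_phased(unphased):
    """Expand rows [id1, id2, chrom, start_cm, end_cm, is_full_ibd, seg_cm]
    into pseudo-phased rows
    [id1, id2, hap1, hap2, chrom, start_cm, end_cm, seg_cm]."""
    phased = []
    for id1, id2, chrom, start, end, is_full, seg_cm in unphased:
        # Placeholder: put the (first) shared copy on haplotype 0 of each person.
        phased.append([id1, id2, 0, 0, chrom, start, end, seg_cm])
        if is_full:
            # IBD2 means both haplotype pairs match, so add the second copy.
            phased.append([id1, id2, 1, 1, chrom, start, end, seg_cm])
    return phased


rows = [
    [1, 2, 1, 10.0, 30.0, False, 20.0],
    [1, 2, 1, 30.0, 50.0, True, 20.0],
]
pseudo = unphased_to_phased(rows)
print(len(pseudo))  # 3
```

Because the haplotype labels here are arbitrary, downstream code should not read meaning into them; they only make the record shape compatible with phased consumers.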
IBD Statistics Calculation
One of the most critical functions in ibd.py is get_ibd_stats_unphased(), which extracts statistical summaries that serve as the foundation for relationship inference:
Key Statistics Computed
- total_half: Total length (in cM) of IBD1 sharing
- total_full: Total length (in cM) of IBD2 sharing
- num_half: Number of IBD1 segments
- num_full: Number of IBD2 segments
- max_seg_cm: Length of the largest segment
- half_seg_lengths: List of all IBD1 segment lengths
- full_seg_lengths: List of all IBD2 segment lengths
These statistics capture the key dimensions of IBD sharing that differentiate relationship types. For example:
- Parent-child relationships show ~3400 cM of IBD1 sharing with no IBD2
- Full siblings show ~2550 cM total sharing with substantial IBD2 segments
- First cousins show ~850 cM of IBD1 sharing with smaller segment sizes
The function handles numerous edge cases and corrections, including overlapping segments, chromosome-specific effects, and detector-specific biases. It implements the statistical foundation upon which all higher-level relationship inference in Bonsai is built.
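The statistics listed above can be computed from a pair's segment list in a few lines. The following is a simplified sketch under stated assumptions: the ibd_stats_unphased name is hypothetical, all rows belong to a single pair, and none of the overlap or detector-bias corrections mentioned above are applied.

```python
def ibd_stats_unphased(segments):
    """Summarize rows [id1, id2, chrom, start_bp, end_bp, is_full_ibd, seg_cm]
    for one pair of individuals."""
    half = [s[6] for s in segments if not s[5]]   # IBD1 lengths
    full = [s[6] for s in segments if s[5]]       # IBD2 lengths
    return {
        "total_half": sum(half),
        "total_full": sum(full),
        "num_half": len(half),
        "num_full": len(full),
        "max_seg_cm": max(half + full, default=0.0),
        "half_seg_lengths": half,
        "full_seg_lengths": full,
    }


# Made-up segments for one pair.
segments = [
    [1, 2, 1, 1_000_000, 30_000_000, False, 35.0],
    [1, 2, 4, 2_000_000, 20_000_000, False, 22.0],
    [1, 2, 9, 5_000_000, 15_000_000, True, 12.0],
]
stats = ibd_stats_unphased(segments)
print(stats["total_half"], stats["total_full"], stats["max_seg_cm"])
# 57.0 12.0 35.0
```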
Advanced IBD Processing Topics
Genetic Map Integration
Bonsai v3's IBD processing includes sophisticated integration with genetic maps to convert between physical coordinates (base pairs) and genetic coordinates (centiMorgans):
- Position Conversion: Functions like convert_bp_to_cm() and convert_cm_to_bp() implement nonlinear transformations based on empirical recombination rates
- Map Selection: The system supports multiple genetic maps (HapMap, deCODE, etc.) with different characteristics
- Interpolation: For positions between known map points, the system uses linear interpolation to estimate genetic distances
- Edge Handling: Special logic handles chromosome edges and regions with sparse map data
Accurate coordinate conversion is essential because genetic distance (cM) is more relevant for relationship inference than physical distance (bp). Two positions that are one centiMorgan apart have approximately a 1% chance of being separated by a crossover in a single meiosis, which makes genetic distance directly informative about how many generations separate two individuals.
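Linear interpolation between map points, with clamping at the chromosome edges, can be sketched as follows. The bp_to_cm function and the four-point map are illustrative assumptions: the signatures of convert_bp_to_cm() and convert_cm_to_bp() are not given in this lab, and the map values are made up rather than taken from HapMap or deCODE.

```python
from bisect import bisect_right


def bp_to_cm(pos_bp, map_bp, map_cm):
    """Linearly interpolate a genetic position from sorted map points.

    map_bp must be strictly increasing and the same length as map_cm.
    """
    if pos_bp <= map_bp[0]:
        return map_cm[0]           # clamp before the first map point
    if pos_bp >= map_bp[-1]:
        return map_cm[-1]          # clamp after the last map point
    i = bisect_right(map_bp, pos_bp)
    bp0, bp1 = map_bp[i - 1], map_bp[i]
    cm0, cm1 = map_cm[i - 1], map_cm[i]
    return cm0 + (cm1 - cm0) * (pos_bp - bp0) / (bp1 - bp0)


# Toy map for one chromosome (made-up values, not a real genetic map).
map_bp = [0, 1_000_000, 3_000_000, 4_000_000]
map_cm = [0.0, 2.0, 3.0, 6.0]

start_cm = bp_to_cm(500_000, map_bp, map_cm)
end_cm = bp_to_cm(3_500_000, map_bp, map_cm)
print(start_cm, end_cm, end_cm - start_cm)  # 1.0 4.5 3.5
```

The toy map has a different slope in each interval, which is why equal physical distances do not translate to equal genetic lengths.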
IBD Quality Filtering
The filter_ibd_segments() function in ibd.py implements quality control to remove unreliable segments:
- Length Filtering: Removes segments below a minimum cM threshold (typically 4-7 cM)
- LOD Score Filtering: Uses statistical confidence scores when available from detectors
- Segment Density Checks: Identifies and filters out regions with suspiciously high segment density
- Population-Specific Filtering: Applies different thresholds based on population background
- Detector-Specific Adjustments: Customizes filtering based on known characteristics of different IBD detectors
Effective filtering is critical because false positive IBD segments can lead to incorrect relationship inferences. Bonsai v3 implements empirically calibrated filtering strategies based on analysis of confirmed relationships.
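Length filtering, the first item in the list above, is the simplest of these checks to illustrate. In this sketch the filter_by_length name is hypothetical and the 7.0 cM default is just one value from the 4-7 cM range quoted above; the other filters (LOD score, density, population, detector) are not modeled.

```python
def filter_by_length(segments, min_cm=7.0):
    """Split rows into (kept, dropped) by the seg_cm value in the last field."""
    kept, dropped = [], []
    for seg in segments:
        (kept if seg[6] >= min_cm else dropped).append(seg)
    return kept, dropped


# Made-up rows: [id1, id2, chromosome, start_bp, end_bp, is_full_ibd, seg_cm]
segments = [
    [1, 2, 1, 1_000_000, 30_000_000, False, 35.0],
    [1, 2, 4, 2_000_000, 4_000_000, False, 3.2],
    [1, 2, 9, 5_000_000, 15_000_000, False, 7.0],
]
kept, dropped = filter_by_length(segments)
print(len(kept), len(dropped))  # 2 1
```

Returning the dropped rows as well makes it easy to inspect how much sharing a given threshold discards.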
IBD Segment Merging and Splitting
Bonsai v3 includes sophisticated algorithms for merging nearby segments and splitting problematic segments:
Segment Merging
The merge_nearby_segments() function identifies and combines segments that likely represent the same IBD region but were detected as separate due to technical limitations:
- Identifies segments separated by small gaps (below a configurable threshold)
- Applies heuristics to determine if they should be merged (considering haplotype consistency)
- Creates new merged segments with appropriate boundaries and attributes
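The gap-based part of this logic can be sketched as a single pass over sorted intervals. The merge_nearby name and the 1.0 cM default gap are assumptions for illustration; the sketch works on (start_cm, end_cm) intervals for one pair on one chromosome and does not model the haplotype-consistency heuristics.

```python
def merge_nearby(intervals, max_gap_cm=1.0):
    """Merge (start_cm, end_cm) intervals whose gap is at most max_gap_cm."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap_cm:
            # Close enough to the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]


intervals = [(40.0, 50.0), (10.0, 20.0), (20.5, 30.0)]
print(merge_nearby(intervals))  # [(10.0, 30.0), (40.0, 50.0)]
```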
Segment Splitting
Conversely, split_large_segments() identifies suspiciously large segments that may represent detector errors or complex IBD patterns:
- Identifies segments above size thresholds based on relationship context
- Analyzes internal patterns to identify potential split points
- Creates multiple smaller segments when evidence supports splitting
These operations help correct for technical artifacts in IBD detection, improving the accuracy of downstream relationship inference. The algorithms implement sophisticated statistical models of expected segment characteristics.
From IBD Statistics to Relationship Inference
The Statistical Foundation
The IBD statistics extracted by get_ibd_stats_unphased() form the foundation for Bonsai v3's relationship inference. Different relationship types exhibit distinctive patterns across multiple IBD statistics:
Relationship | Expected Total IBD (cM) | Expected Segments | IBD2 Presence | Typical Max Segment (cM) |
---|---|---|---|---|
Parent-Child | 3400 | 23 | No | ~280 |
Full Siblings | 2550 | 50 | Yes (25%) | ~160 |
Half-siblings | 1700 | 28 | No | ~160 |
First Cousins | 850 | 15 | No | ~123 |
Second Cousins | 212.5 | 4 | No | ~78 |
The actual implementation in Bonsai v3 goes beyond these simple averages to model the complete statistical distributions of these values for each relationship type. This allows for probabilistic inference that accounts for the natural variance in genetic inheritance.
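The "Expected Total IBD" column can be reproduced from the expected fraction of the genome that each relationship shares as IBD1 or IBD2. The sketch below assumes a 3400 cM autosomal genome, as implied by the parent-child row, and counts each IBD2 region once toward the total; the names are illustrative.

```python
GENOME_CM = 3400  # autosomal length implied by the parent-child row

# Expected (fraction of genome IBD1, fraction of genome IBD2).
FRACTIONS = {
    "Parent-Child": (1.0, 0.0),
    "Full Siblings": (0.5, 0.25),
    "Half-siblings": (0.5, 0.0),
    "First Cousins": (0.25, 0.0),
    "Second Cousins": (0.0625, 0.0),
}


def expected_total_cm(relationship):
    """Expected genome length covered by any IBD (IBD2 regions counted once)."""
    ibd1, ibd2 = FRACTIONS[relationship]
    return GENOME_CM * (ibd1 + ibd2)


for rel in FRACTIONS:
    print(rel, expected_total_cm(rel))
```

For full siblings this gives 3400 x (0.5 + 0.25) = 2550 cM, matching the table, with the 25% IBD2 fraction shown in the "IBD2 Presence" column.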
From Statistics to Likelihoods
The ibd.py module connects to the likelihoods.py module through the IBD statistics it produces. The workflow follows these steps:
- get_ibd_stats_unphased() in ibd.py extracts statistical summaries from IBD segments
- These statistics are passed to the PwLogLike class in likelihoods.py
- The PwLogLike class computes likelihood scores for different relationship hypotheses
- These likelihood scores guide the pedigree reconstruction process
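To show the shape of this workflow, here is a toy stand-in for the likelihood step. It is not how PwLogLike works: it scores only the total shared cM with a Gaussian whose standard deviation is an arbitrary 15% of the expected value, whereas the real model also uses segment counts and IBD2 information. The toy_log_likelihoods name and the rel_sd parameter are hypothetical.

```python
import math

# Expected totals taken from the relationship table in this lab.
EXPECTED_TOTAL_CM = {
    "Parent-Child": 3400.0,
    "Full Siblings": 2550.0,
    "Half-siblings": 1700.0,
    "First Cousins": 850.0,
    "Second Cousins": 212.5,
}


def toy_log_likelihoods(total_cm, rel_sd=0.15):
    """Gaussian log-likelihood of total_cm under each relationship.

    rel_sd (sd as a fraction of the mean) is an arbitrary illustrative value.
    """
    scores = {}
    for rel, mean in EXPECTED_TOTAL_CM.items():
        sd = rel_sd * mean
        z = (total_cm - mean) / sd
        scores[rel] = -0.5 * z * z - math.log(sd * math.sqrt(2 * math.pi))
    return scores


# Statistics as they might come out of the summary step (made-up values).
stats = {"total_half": 800.0, "total_full": 0.0}
scores = toy_log_likelihoods(stats["total_half"] + stats["total_full"])
best = max(scores, key=scores.get)
print(best)  # First Cousins
```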
This connection between IBD processing and relationship inference is at the core of Bonsai v3's approach. The system transforms raw genetic data into statistical summaries and then into probabilistic relationship assessments, creating a seamless pathway from detector output to pedigree structures.
Engineering Insight: Bonsai v3's IBD processing module demonstrates an important principle in bioinformatics: successful algorithms must bridge the gap between theoretical models and messy real-world data. The ibd.py module's handling of different formats, coordinate systems, and data quality issues enables Bonsai to work robustly with actual IBD detection results rather than just idealized data.
Comparing Notebook and Production Code
The Lab03 notebook provides simplified implementations of IBD processing functions, while the actual Bonsai v3 implementation includes numerous additional capabilities and optimizations:
- Performance Optimization: The production code includes efficient data structures and algorithms optimized for large-scale IBD datasets
- Robust Error Handling: Comprehensive validation and error handling for various edge cases in IBD data
- Detector-Specific Logic: Specialized handling for artifacts and biases from different IBD detection tools
- Genetic Map Integration: Sophisticated integration with genetic maps for accurate coordinate conversion
- Population Background Modeling: Adjustments for population-specific patterns of IBD sharing
- Caching: Memory of previous calculations to avoid redundant computation
Despite these differences, the core concepts and data structures demonstrated in the notebook directly correspond to those used in the production system, providing an accurate conceptual foundation for understanding Bonsai's IBD processing approach.
Interactive Lab Environment
Run the interactive Lab 03 notebook in Google Colab:
Data will be automatically downloaded from S3 when you run the notebook.
Note: You may need a Google account to save your work in Google Drive.
Beyond the Code
As you explore IBD data formats and processing in Bonsai v3, consider these broader implications:
- Data Quality Impacts: How the quality of IBD detection directly affects the accuracy of relationship inference
- Format Standardization: The importance of standardized data formats in scientific computing
- Error Propagation: How errors in early processing stages can cascade through complex pipelines
- Statistical Robustness: The value of statistical approaches that account for variance and uncertainty
These considerations highlight how data preprocessing in Bonsai v3 is not merely a technical necessity but a critical component that directly impacts the system's overall performance and accuracy.
This lab is part of the Bonsai v3 Deep Dive track.