Computational Genetic Genealogy

Lab 03: IBD Data Formats and Preprocessing

Data Engineering: This lab explores the various IBD data formats used in genetic genealogy and how Bonsai v3 processes them. Understanding these formats and preprocessing steps is essential for effectively working with IBD detection tools and integrating them with Bonsai.

IBD Data Format Fundamentals

Unphased IBD Format

The unphased IBD format is the primary input format for Bonsai v3. Detectors such as IBIS produce unphased output directly, and output from phased detectors such as Refined IBD and HapIBD can be converted to it. In this format, segments are represented without haplotype-specific information:

[id1, id2, chromosome, start_bp, end_bp, is_full_ibd, seg_cm]

The fields in this format have specific meanings:

  • id1, id2: Identifiers for the two individuals sharing the IBD segment (usually ordered so that id1 < id2)
  • chromosome: The chromosome number (1-22) where the segment is located
  • start_bp, end_bp: The starting and ending positions in base pairs
  • is_full_ibd: A flag indicating whether the pair shares one haplotype across the segment (IBD1, value 0) or both haplotypes (IBD2, value 1)
  • seg_cm: The genetic length of the segment in centiMorgans

In Bonsai v3's implementation, the ibd.py module contains functions for working with this format, including get_ibd_stats_unphased(), which extracts statistical summaries from unphased IBD data. Because it is Bonsai's primary input, this is the format you will work with most often in practice.
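
As a minimal sketch (not Bonsai's own code), one unphased record can be modelled as a typed tuple. The names UnphasedSegment and parse_unphased are hypothetical and introduced only for illustration, and integer identifiers are assumed. The field order follows the list format above.

```python
from typing import NamedTuple, Sequence


class UnphasedSegment(NamedTuple):
    """One unphased IBD segment; field order follows the list format above."""
    id1: int
    id2: int
    chromosome: int
    start_bp: int
    end_bp: int
    is_full_ibd: bool  # False = IBD1, True = IBD2
    seg_cm: float


def parse_unphased(row: Sequence) -> UnphasedSegment:
    """Coerce a raw 7-element row into a typed record with id1 <= id2."""
    id1, id2, chrom, start, end, full, cm = row
    # Unphased segments carry no haplotype labels, so swapping ids is safe.
    id1, id2 = sorted((int(id1), int(id2)))
    if int(end) < int(start):
        raise ValueError("end_bp must not precede start_bp")
    return UnphasedSegment(id1, id2, int(chrom), int(start), int(end),
                           bool(int(full)), float(cm))


seg = parse_unphased([1002, 1001, 7, 15_000_000, 42_500_000, 0, 31.4])
print(seg)
```

Sorting the identifiers on the way in means every pair has a single canonical key, which simplifies grouping segments by pair later.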

Phased IBD Format

The phased IBD format contains haplotype-specific information, indicating which exact copies of the chromosome contain the shared segment:

[id1, id2, hap1, hap2, chromosome, start_cm, end_cm, seg_cm]

The fields in this format provide more detailed information:

  • id1, id2: Identifiers for the two individuals sharing the IBD segment
  • hap1, hap2: Indicators (0 or 1) specifying which haplotype from each individual contains the shared segment
  • chromosome: The chromosome number (1-22) where the segment is located
  • start_cm, end_cm: The starting and ending positions in centiMorgans (genetic distance)
  • seg_cm: The genetic length of the segment in centiMorgans

Phased IBD data provides richer information by specifying exactly which haplotype copies match between individuals. This can be valuable for distinguishing complex relationships, as the pattern of haplotype sharing offers additional clues about genealogical connections. In the real Bonsai v3 implementation, phased data enables more precise relationship inference and can improve pedigree reconstruction accuracy.
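
One practical consequence of haplotype labels is that reordering a pair's identifiers must also swap hap1 and hap2. The sketch below illustrates this; PhasedSegment and ordered are hypothetical names chosen for this example, not Bonsai functions, and integer identifiers are assumed.

```python
from typing import NamedTuple


class PhasedSegment(NamedTuple):
    """One phased IBD segment; field order follows the list format above."""
    id1: int
    id2: int
    hap1: int  # 0 or 1: which haplotype of id1 carries the segment
    hap2: int  # 0 or 1: which haplotype of id2 carries the segment
    chromosome: int
    start_cm: float
    end_cm: float
    seg_cm: float


def ordered(seg: PhasedSegment) -> PhasedSegment:
    """Return the segment with id1 <= id2, swapping haplotypes with the ids."""
    if seg.id1 <= seg.id2:
        return seg
    return seg._replace(id1=seg.id2, id2=seg.id1,
                        hap1=seg.hap2, hap2=seg.hap1)


raw = PhasedSegment(1002, 1001, 0, 1, 7, 20.5, 51.9, 31.4)
fixed = ordered(raw)
print(fixed)
```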

Format Variations in IBD Detectors

Different IBD detection tools produce slightly different output formats that Bonsai v3 must handle:

  • IBIS: Produces unphased output with segment coordinates in both base pairs and centiMorgans
  • Refined IBD: Produces phased output with haplotype indicators and a LOD score for each segment
  • HapIBD: Produces phased output with haplotype indicators and the segment length in centiMorgans
  • GERMLINE: An older tool that produces unphased output in its own format

Bonsai v3's ibd.py module includes functions to normalize these different formats into a standard representation for consistent processing. This normalization is a critical preprocessing step that enables Bonsai to work with output from any IBD detector.
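
To make the idea of normalization concrete, here is a hedged sketch that parses one line of phased detector output into a common record. It assumes an eight-column, whitespace-delimited layout (sample 1, haplotype index 1 or 2, sample 2, haplotype index 1 or 2, chromosome, start bp, end bp, cM length); check the detector's own documentation before relying on that column order. The function name normalize_phased_line and the dictionary keys are hypothetical and are not taken from Bonsai.

```python
def normalize_phased_line(line: str) -> dict:
    """Parse one 8-column phased detector line into a normalized record."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}")
    id1, hap1, id2, hap2, chrom, start_bp, end_bp, seg_cm = fields
    if chrom.startswith("chr"):  # accept both "7" and "chr7"
        chrom = chrom[3:]
    rec = {
        "id1": id1,
        "id2": id2,
        "hap1": int(hap1) - 1,  # assumed 1/2 indexing mapped to 0/1
        "hap2": int(hap2) - 1,
        "chromosome": int(chrom),  # autosomes only in this sketch
        "start_bp": int(start_bp),
        "end_bp": int(end_bp),
        "seg_cm": float(seg_cm),
    }
    # Enforce a canonical pair order, keeping haplotypes attached to their ids.
    if rec["id1"] > rec["id2"]:
        rec["id1"], rec["id2"] = rec["id2"], rec["id1"]
        rec["hap1"], rec["hap2"] = rec["hap2"], rec["hap1"]
    return rec


rec = normalize_phased_line("S2 1 S1 2 chr7 15000000 42500000 31.4")
print(rec)
```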

IBD Data Processing in Bonsai v3

The ibd.py Module

The ibd.py module in Bonsai v3 is responsible for all aspects of IBD data processing and serves as the interface between raw IBD detector output and Bonsai's pedigree reconstruction algorithms. This module implements numerous functions for working with IBD data:

  • get_phased_to_unphased(): Converts phased IBD segments to unphased format by combining segments that overlap on different haplotypes
  • get_unphased_to_phased(): Creates pseudo-phased segments from unphased IBD data by assigning segments to random or inferred haplotypes
  • get_ibd_stats_unphased(): Extracts statistical summaries from unphased IBD data, including total sharing, segment counts, and length distributions
  • filter_ibd_segments(): Applies quality filters to remove unreliable segments based on length or other criteria
  • normalize_ibd_segments(): Standardizes IBD segment representations from different detectors
  • get_id_to_shared_ibd(): Creates a mapping from individual pairs to their shared IBD segments
  • get_closest_pair(): Identifies the most closely related pairs based on IBD sharing

These functions collectively transform raw IBD detector output into the structured data that Bonsai's relationship inference algorithms require. The module handles all the complexities of different IBD formats, coordinate systems, and data quality issues.

Format Conversion Implementation

The ibd.py module implements conversion in both directions between the phased and unphased formats:

Phased to Unphased Conversion

The get_phased_to_unphased() function implements a multi-step process:

  1. Group segments by individual pair and chromosome
  2. Identify segments that overlap on different haplotypes
  3. Determine if overlapping segments indicate IBD2 regions
  4. Create unphased segments that represent the combined information
  5. Apply appropriate coordinate conversions (cM to bp when necessary)

This process is particularly important for integrating phased IBD detection results from tools like Refined IBD into Bonsai's framework. The function handles complexities such as partial overlaps, multiple overlapping segments, and consistent pair ordering.
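
Steps 2 to 4 above can be sketched with a simple sweep over segment boundaries. This is an illustrative reimplementation under simplifying assumptions, not the body of get_phased_to_unphased(): it handles one pair on one chromosome, uses a single coordinate system, and calls a region IBD2 only when two overlapping matches use different haplotypes in both individuals. The function name phased_to_unphased is hypothetical.

```python
from itertools import combinations


def phased_to_unphased(segs):
    """segs: list of (hap1, hap2, start, end) for ONE pair on ONE chromosome.

    Returns (start, end, is_full_ibd) pieces covering every region with at
    least one phased segment; adjacent pieces in the same state are merged.
    """
    points = sorted({p for _, _, s, e in segs for p in (s, e)})
    out = []
    for lo, hi in zip(points, points[1:]):
        # Haplotype pairs of all segments spanning this elementary interval.
        active = [(h1, h2) for h1, h2, s, e in segs if s <= lo and hi <= e]
        if not active:
            continue
        # IBD2 needs two matches using different haplotypes in BOTH people.
        full = any(a[0] != b[0] and a[1] != b[1]
                   for a, b in combinations(active, 2))
        if out and out[-1][1] == lo and out[-1][2] == full:
            out[-1] = (out[-1][0], hi, full)  # extend the previous piece
        else:
            out.append((lo, hi, full))
    return out


segs = [(0, 0, 10.0, 60.0), (1, 1, 40.0, 90.0)]
print(phased_to_unphased(segs))
# [(10.0, 40.0, False), (40.0, 60.0, True), (60.0, 90.0, False)]
```

Two overlapping segments on the same haplotype pair are treated as duplicate evidence for IBD1 rather than as IBD2, which is why the check compares haplotype labels instead of simply counting coverage.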

Unphased to Phased Conversion

The get_unphased_to_phased() function performs the more challenging inverse operation:

  1. For IBD2 segments, create pairs of phased segments (one for each haplotype pair)
  2. For IBD1 segments, apply heuristics to assign them to specific haplotypes
  3. When possible, use other segments from the same individuals to infer likely haplotype assignments
  4. Apply coordinate conversions (bp to cM when necessary)

This conversion is inherently ambiguous because unphased data carries no haplotype information, so any haplotype assignment is an inference rather than a recovery of the original phase. Bonsai uses observed patterns of IBD sharing across the genome to infer the most likely haplotype configurations.
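
The simplest possible version of this conversion uses fixed placeholder haplotypes and performs no inference at all. The sketch below shows only step 1 (IBD2 becomes two phased segments) and a trivial stand-in for step 2 (every IBD1 segment is assigned to haplotype pair 0/0). The function name is hypothetical and the assignment rule is an assumption made for illustration, not Bonsai's heuristic.

```python
def unphased_to_pseudo_phased(segs):
    """segs: list of (start, end, is_full_ibd) for one pair on one chromosome.

    Returns a list of (hap1, hap2, start, end). Haplotype labels are
    arbitrary placeholders: they do not reflect true phase.
    """
    phased = []
    for start, end, is_full in segs:
        phased.append((0, 0, start, end))      # every region is at least IBD1
        if is_full:
            phased.append((1, 1, start, end))  # IBD2 adds the other pair
    return phased


pseudo = unphased_to_pseudo_phased([(10.0, 40.0, False), (40.0, 60.0, True)])
print(pseudo)
# [(0, 0, 10.0, 40.0), (0, 0, 40.0, 60.0), (1, 1, 40.0, 60.0)]
```

Because the labels are placeholders, downstream code should use the result only where the total amount of IBD1 and IBD2 matters, not where the identity of the haplotype does.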

IBD Statistics Calculation

One of the most critical functions in ibd.py is get_ibd_stats_unphased(), which extracts statistical summaries that serve as the foundation for relationship inference:

Key Statistics Computed
  • total_half: Total length (in cM) of IBD1 sharing
  • total_full: Total length (in cM) of IBD2 sharing
  • num_half: Number of IBD1 segments
  • num_full: Number of IBD2 segments
  • max_seg_cm: Length of the largest segment
  • half_seg_lengths: List of all IBD1 segment lengths
  • full_seg_lengths: List of all IBD2 segment lengths

These statistics capture the key dimensions of IBD sharing that differentiate relationship types. For example:

  • Parent-child relationships show ~3400 cM of IBD1 sharing with no IBD2
  • Full siblings show ~2550 cM total sharing with substantial IBD2 segments
  • First cousins show ~850 cM of IBD1 sharing with smaller segment sizes

The function handles numerous edge cases and corrections, including overlapping segments, chromosome-specific effects, and detector-specific biases. It implements the statistical foundation upon which all higher-level relationship inference in Bonsai is built.
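
The statistics listed above can be computed in a few lines once segments are reduced to (is_full_ibd, seg_cm) pairs. This is a simplified sketch with none of the edge-case handling just described. The function name summarize_pair and the dictionary layout are assumptions for illustration and do not reproduce the return value of get_ibd_stats_unphased().

```python
def summarize_pair(segments):
    """segments: list of (is_full_ibd, seg_cm) tuples for ONE pair."""
    half = [cm for is_full, cm in segments if not is_full]
    full = [cm for is_full, cm in segments if is_full]
    return {
        "total_half": sum(half),
        "total_full": sum(full),
        "num_half": len(half),
        "num_full": len(full),
        "max_seg_cm": max(half + full, default=0.0),
        "half_seg_lengths": half,
        "full_seg_lengths": full,
    }


stats = summarize_pair([(False, 30.0), (False, 12.5), (True, 8.0)])
print(stats["total_half"], stats["total_full"], stats["max_seg_cm"])
# 42.5 8.0 30.0
```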

Advanced IBD Processing Topics

Genetic Map Integration

Bonsai v3's IBD processing includes sophisticated integration with genetic maps to convert between physical coordinates (base pairs) and genetic coordinates (centiMorgans):

  • Position Conversion: Functions like convert_bp_to_cm() and convert_cm_to_bp() implement nonlinear transformations based on empirical recombination rates
  • Map Selection: The system supports multiple genetic maps (HapMap, deCODE, etc.) with different characteristics
  • Interpolation: For positions between known map points, the system uses linear interpolation to estimate genetic distances
  • Edge Handling: Special logic handles chromosome edges and regions with sparse map data

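Accurate coordinate conversion is essential because genetic distance (cM) is more relevant for relationship inference than physical distance (bp). One centiMorgan corresponds to an expected 0.01 crossovers per meiosis between two positions, which is roughly a 1% chance that the positions are separated by recombination in a single generation. Segment lengths in cM therefore reflect directly how many generations of recombination a shared segment has survived.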
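
Linear interpolation between map points, with clamping at the chromosome ends, can be sketched as follows. The map values are made up for the example, and bp_to_cm is a hypothetical helper rather than Bonsai's conversion function.

```python
from bisect import bisect_right

# Hypothetical genetic map for one chromosome: positions in bp and cM,
# both strictly increasing. Real maps have thousands of points.
MAP_BP = [1_000_000, 5_000_000, 10_000_000, 20_000_000]
MAP_CM = [0.0, 6.0, 9.0, 25.0]


def bp_to_cm(bp, map_bp=MAP_BP, map_cm=MAP_CM):
    """Interpolate linearly; clamp positions outside the map to its ends."""
    if bp <= map_bp[0]:
        return map_cm[0]
    if bp >= map_bp[-1]:
        return map_cm[-1]
    i = bisect_right(map_bp, bp)  # map_bp[i - 1] <= bp < map_bp[i]
    frac = (bp - map_bp[i - 1]) / (map_bp[i] - map_bp[i - 1])
    return map_cm[i - 1] + frac * (map_cm[i] - map_cm[i - 1])


start_cm = bp_to_cm(7_500_000)   # 7.5
end_cm = bp_to_cm(15_000_000)    # 17.0
print(end_cm - start_cm)         # 9.5
```

The overall bp-to-cM mapping is nonlinear because the slope differs between map intervals, even though each interval is interpolated linearly.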

IBD Quality Filtering

The filter_ibd_segments() function in ibd.py implements sophisticated quality control to remove unreliable segments:

  • Length Filtering: Removes segments below a minimum cM threshold (typically 4-7 cM)
  • LOD Score Filtering: Uses statistical confidence scores when available from detectors
  • Segment Density Checks: Identifies and filters out regions with suspiciously high segment density
  • Population-Specific Filtering: Applies different thresholds based on population background
  • Detector-Specific Adjustments: Customizes filtering based on known characteristics of different IBD detectors

Effective filtering is critical because false positive IBD segments can lead to incorrect relationship inferences. Bonsai v3 implements empirically calibrated filtering strategies based on analysis of confirmed relationships.
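
The two simplest filters above, minimum length and minimum LOD score, might look like this. The function name, the dictionary keys, and the default threshold are assumptions for illustration and do not reproduce the signature of filter_ibd_segments().

```python
def filter_segments(segments, min_cm=7.0, min_lod=None):
    """Keep segments that pass a length filter and, if given, a LOD filter.

    segments: list of dicts with a "seg_cm" key and an optional "lod" key.
    Segments without a LOD score are never rejected by the LOD filter.
    """
    kept = []
    for seg in segments:
        if seg["seg_cm"] < min_cm:
            continue
        lod = seg.get("lod")
        if min_lod is not None and lod is not None and lod < min_lod:
            continue
        kept.append(seg)
    return kept


segments = [
    {"seg_cm": 3.2, "lod": 5.0},
    {"seg_cm": 9.8, "lod": 1.5},
    {"seg_cm": 22.0},
]
print(len(filter_segments(segments)))               # 2
print(len(filter_segments(segments, min_lod=3.0)))  # 1
```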

IBD Segment Merging and Splitting

Bonsai v3 includes sophisticated algorithms for merging nearby segments and splitting problematic segments:

Segment Merging

The merge_nearby_segments() function identifies and combines segments that likely represent the same IBD region but were detected as separate due to technical limitations:

  • Identifies segments separated by small gaps (below a configurable threshold)
  • Applies heuristics to determine if they should be merged (considering haplotype consistency)
  • Creates new merged segments with appropriate boundaries and attributes

Segment Splitting

Conversely, split_large_segments() identifies suspiciously large segments that may represent detector errors or complex IBD patterns:

  • Identifies segments above size thresholds based on relationship context
  • Analyzes internal patterns to identify potential split points
  • Creates multiple smaller segments when evidence supports splitting

These operations help correct for technical artifacts in IBD detection, improving the accuracy of downstream relationship inference. The algorithms implement sophisticated statistical models of expected segment characteristics.
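
The gap-based merging described above reduces to a single pass over sorted intervals. The sketch below covers only the gap test and boundary update for one pair on one chromosome. It ignores haplotype consistency, and merge_nearby is a hypothetical name rather than the Bonsai function.

```python
def merge_nearby(intervals, max_gap):
    """Merge (start, end) intervals separated by at most max_gap.

    Start, end, and max_gap must share one unit (all bp or all cM).
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to (or overlapping) the previous interval: extend.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(merge_nearby([(10, 20), (21, 30), (50, 60)], max_gap=2))
# [(10, 30), (50, 60)]
```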

From IBD Statistics to Relationship Inference

The Statistical Foundation

The IBD statistics extracted by get_ibd_stats_unphased() form the foundation for Bonsai v3's relationship inference. Different relationship types exhibit distinctive patterns across multiple IBD statistics:

Relationship     Expected Total IBD (cM)   Expected Segments   IBD2 Presence   Typical Max Segment (cM)
Parent-Child     3400                      23                  No              ~280
Full Siblings    2550                      50                  Yes (25%)       ~160
Half-siblings    1700                      28                  No              ~160
First Cousins    850                       15                  No              ~123
Second Cousins   212.5                     4                   No              ~78

The actual implementation in Bonsai v3 goes beyond these simple averages to model the complete statistical distributions of these values for each relationship type. This allows for probabilistic inference that accounts for the natural variance in genetic inheritance.
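
The expected totals in the table follow from simple halving arithmetic. For outbred relationships with no IBD2, the expected IBD1 fraction is the number of common ancestors times one half raised to the number of meioses separating the pair, minus one. The sketch below reproduces the table's values; the 3400 cM genome length is taken from the parent-child row, and the function name is hypothetical.

```python
GENOME_CM = 3400.0  # autosomal map length implied by the parent-child row


def expected_ibd1_cm(meioses: int, common_ancestors: int) -> float:
    """Expected IBD1 total for an outbred relationship without IBD2."""
    return GENOME_CM * common_ancestors * 0.5 ** (meioses - 1)


parent_child = expected_ibd1_cm(meioses=1, common_ancestors=1)    # 3400.0
half_siblings = expected_ibd1_cm(meioses=2, common_ancestors=1)   # 1700.0
first_cousins = expected_ibd1_cm(meioses=4, common_ancestors=2)   # 850.0
second_cousins = expected_ibd1_cm(meioses=6, common_ancestors=2)  # 212.5

# Full siblings are the exception: on average 50% of the genome is IBD1
# and 25% is IBD2, so 75% of the genome is covered by some IBD.
full_siblings = (0.50 + 0.25) * GENOME_CM                         # 2550.0
```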

From Statistics to Likelihoods

The ibd.py module connects to the likelihoods.py module through the IBD statistics it produces. The workflow follows these steps:

  1. get_ibd_stats_unphased() in ibd.py extracts statistical summaries from IBD segments
  2. These statistics are passed to the PwLogLike class in likelihoods.py
  3. The PwLogLike class computes likelihood scores for different relationship hypotheses
  4. These likelihood scores guide the pedigree reconstruction process

This connection between IBD processing and relationship inference is at the core of Bonsai v3's approach. The system transforms raw genetic data into statistical summaries and then into probabilistic relationship assessments, creating a seamless pathway from detector output to pedigree structures.
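
As a toy illustration of step 3, the sketch below scores an observed total against each relationship's expected total using a Gaussian log-likelihood. It is not the PwLogLike model: the expected values come from the table above, and the 15% relative spread is an arbitrary placeholder, not a calibrated parameter.

```python
import math

EXPECTED_TOTAL_CM = {
    "Parent-Child": 3400.0,
    "Full Siblings": 2550.0,
    "Half-siblings": 1700.0,
    "First Cousins": 850.0,
    "Second Cousins": 212.5,
}
REL_SD = 0.15  # placeholder relative spread; NOT a calibrated value


def toy_log_likelihoods(total_cm):
    """Gaussian log-likelihood of total_cm under each relationship."""
    scores = {}
    for rel, mean in EXPECTED_TOTAL_CM.items():
        sd = REL_SD * mean
        z = (total_cm - mean) / sd
        scores[rel] = -0.5 * z * z - math.log(sd * math.sqrt(2 * math.pi))
    return scores


scores = toy_log_likelihoods(900.0)
best = max(scores, key=scores.get)
print(best)  # First Cousins
```

A single statistic cannot separate relationships with similar totals, which is one reason the real model uses segment counts and length distributions as well.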

Engineering Insight: Bonsai v3's IBD processing module demonstrates an important principle in bioinformatics: successful algorithms must bridge the gap between theoretical models and messy real-world data. The ibd.py module's sophisticated handling of different formats, coordinate systems, and data quality issues enables Bonsai to work robustly with actual IBD detection results rather than just idealized data.

Comparing Notebook and Production Code

The Lab 03 notebook provides simplified implementations of IBD processing functions, while the actual Bonsai v3 implementation includes numerous additional capabilities and optimizations.

Despite these differences, the core concepts and data structures demonstrated in the notebook directly correspond to those used in the production system, providing an accurate conceptual foundation for understanding Bonsai's IBD processing approach.

Interactive Lab Environment

Run the interactive Lab 03 notebook in Google Colab. Data will be automatically downloaded from S3 when you run the notebook.

Note: You may need a Google account to save your work in Google Drive.

Beyond the Code

As you explore IBD data formats and processing in Bonsai v3, keep in mind that data preprocessing is not merely a technical necessity but a critical component that directly impacts the system's overall performance and accuracy.

This lab is part of the Bonsai v3 Deep Dive track.