Lab 28: Integration with Other Genealogical Tools

Core Component: This lab explores how Bonsai v3 integrates with other genealogical tools and systems, particularly through the DRUID algorithm and other integration mechanisms. Understanding these integration capabilities is essential for creating comprehensive genetic genealogy workflows that leverage multiple data sources and analytical approaches.

Beyond Standalone Analysis

The Integration Imperative

While Bonsai v3 provides powerful genetic relationship inference capabilities, real-world genetic genealogy typically involves multiple tools and data sources. Integrating with those systems lets an analysis combine evidence that no single tool provides:

Key Integration Benefits

Complementary Capabilities: Different tools excel at different aspects of genetic genealogy
Multiple Data Sources: Incorporating DNA, documentary, and contextual information
Workflow Continuity: Supporting end-to-end genetic genealogy processes
Expertise Leverage: Utilizing specialized algorithms from various domains
Ecosystem Compatibility: Fitting into existing user workflows and tool chains

Bonsai v3's integration capabilities enable it to function both as a standalone analysis tool and as a component within larger genetic genealogy workflows.

The Genetic Genealogy Ecosystem

Bonsai integrates with several categories of external tools and systems:

DNA Testing Platforms: Direct-to-consumer testing companies and research databases
Family Tree Systems: Genealogical record management tools
IBD Detection Tools: Specialized algorithms for identifying IBD segments
Population Genetics Software: Tools for analyzing population structure and admixture
Visualization Systems: Specialized tools for representing genetic relationships

The DRUID Algorithm

Deep Relatedness Utilizing Identity by Descent

One of Bonsai v3's key integration mechanisms is the DRUID (Deep Relatedness Utilizing Identity by Descent) algorithm, implemented in the druid.py module. This algorithm provides standardized relationship inference that can integrate with external systems:

DRUID Core Functionality

The DRUID algorithm uses a generalized approach to infer relationship degrees from IBD sharing data:

from typing import Optional

def infer_degree_generalized_druid(
    total_ibd: float,
    num_segments: Optional[int] = None,
    longest_segment: Optional[float] = None,
    total_full_ibd: Optional[float] = None,
):
    """
    Infer relationship degree using the generalized DRUID algorithm.
    
    This algorithm estimates the degree of relationship based on
    total IBD sharing and optional segment characteristics.
    
    Args:
        total_ibd: Total IBD sharing in centiMorgans
        num_segments: Optional number of IBD segments
        longest_segment: Optional length of longest segment in cM
        total_full_ibd: Optional total fully identical region length
        
    Returns:
        Estimated relationship degree (1.0 = first degree, etc.)
    """
    # Implementation uses model-based prediction of relationship degree
    # based on IBD statistics, calibrated with known relationships

This function provides a standardized interface for relationship inference that external systems can easily incorporate, without needing to understand Bonsai's more complex internal mechanisms.
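As a rough illustration of how a degree can be derived from these inputs, the sketch below uses the textbook kinship relation rather than Bonsai's calibrated model. It assumes a genome length of 3,400 cM, that `total_ibd` counts each shared region once, and that `total_full_ibd` is the part of that total shared on both haplotypes. The function name `estimate_degree_from_kinship` and its defaults are hypothetical and are not part of druid.py.

```python
import math
from typing import Optional

# Assumed autosomal map length in cM; the right value depends on the genetic map used.
GENOME_LENGTH_CM = 3400.0


def estimate_degree_from_kinship(
    total_ibd: float,
    total_full_ibd: Optional[float] = None,
    genome_length: float = GENOME_LENGTH_CM,
) -> float:
    """Estimate relationship degree from IBD totals via the kinship coefficient."""
    if total_ibd <= 0:
        return math.inf  # no detectable sharing: treat as unrelated
    full = total_full_ibd or 0.0
    # Fraction of the genome shared on both haplotypes (IBD2)
    k2 = min(full / genome_length, 1.0)
    # Fraction of the genome shared on exactly one haplotype (IBD1)
    k1 = min(max(total_ibd - full, 0.0) / genome_length, 1.0)
    kinship = k1 / 4 + k2 / 2
    # Degree d corresponds to kinship 2^-(d+1): 1/4 for first degree, 1/8 for second, ...
    return math.log2(1 / kinship) - 1
```

Under these assumptions, 3,400 cM of half-identical sharing maps to degree 1 and 850 cM maps to degree 3. A calibrated model would also use the segment count and longest segment, which this sketch ignores.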

DRUID Integration Example

# Example of how an external tool might use DRUID
def analyze_match_with_druid(match_data):
    """
    Analyze a DNA match using the DRUID algorithm.
    
    Args:
        match_data: Dictionary with match statistics
        
    Returns:
        Dictionary with relationship prediction
    """
    # Extract IBD statistics from match data
    total_ibd = match_data['shared_cm']
    num_segments = match_data.get('num_segments')
    longest_segment = match_data.get('longest_segment')
    
    # Call DRUID algorithm
    degree = infer_degree_generalized_druid(
        total_ibd=total_ibd,
        num_segments=num_segments,
        longest_segment=longest_segment
    )
    
    # Convert degree to relationship description
    relationship = degree_to_relationship(degree)
    
    return {
        'predicted_degree': degree,
        'relationship_description': relationship,
        'confidence': calculate_confidence(total_ibd, degree)
    }
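The example above calls `degree_to_relationship` without defining it. A minimal sketch of such a helper might look like the following; the label table is an illustrative assumption, since each degree covers several distinct relationship types and the labels Bonsai actually uses are not given here.

```python
import math

# Illustrative labels only; each degree covers several distinct relationship types.
DEGREE_LABELS = {
    1: "parent/child or full sibling",
    2: "grandparent, aunt/uncle, or half sibling",
    3: "first cousin or great-grandparent",
    4: "first cousin once removed",
    5: "second cousin",
}


def degree_to_relationship(degree: float) -> str:
    """Map a (possibly fractional) degree estimate to a coarse text label."""
    if math.isinf(degree):
        return "unrelated"
    nearest = round(degree)
    if nearest < 1:
        return "self or identical twin"
    return DEGREE_LABELS.get(nearest, f"distant relative (about degree {nearest})")
```

Rounding to the nearest integer discards the uncertainty in the estimate, so a caller that needs more nuance may prefer to report the raw degree alongside the label.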

Key DRUID Advantages

Simplicity: Straightforward interface requiring minimal data
Standardization: Consistent relationship degree scale
Robustness: Works with varied input quality and completeness
Calibration: Empirically calibrated with known relationships
Extensibility: Can incorporate additional evidence when available

Data Exchange Formats

Standardized Information Transfer

Effective integration requires standardized data exchange formats. Bonsai v3 supports several key formats for importing and exporting genetic and relationship data:

IBD Data Formats

Bonsai supports several common IBD data formats:

Format	Description	Common Sources
Phased IBD Format	Detailed segment data with phase information	Research tools like Refined-IBD, hap-IBD
Unphased Segment Format	Simpler format without phase information	Consumer testing companies, IBIS
Summary Statistics Format	Aggregated IBD metrics without segment details	Consumer websites, limited data sharing
Match List Format	Simple listing of genetic matches and basic metrics	Consumer testing platforms, simple exports

Pedigree Data Formats

For exchanging pedigree information, Bonsai supports:

GEDCOM: Standard genealogical data exchange format
CSV Relationship Format: Simple tabular relationship data
JSON Pedigree Format: Hierarchical pedigree representation
Graph Exchange XML Format (GEXF): XML-based format for network structures
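To make the difference between the tabular and hierarchical pedigree formats concrete, the sketch below serializes one small pedigree both ways. The child-to-parents dictionary, the field names (`id`, `parents`, `child_id`, `parent_id`), and both function names are assumptions chosen for illustration, not Bonsai's documented schemas.

```python
import csv
import io
import json

# Hypothetical pedigree: each child ID maps to a list of parent IDs.
pedigree = {"child1": ["father1", "mother1"], "father1": ["grandfather1"]}


def pedigree_to_json(ped: dict) -> str:
    """Serialize as a list of {"id", "parents"} records."""
    records = [{"id": child, "parents": parents} for child, parents in sorted(ped.items())]
    return json.dumps({"individuals": records}, indent=2)


def pedigree_to_csv(ped: dict) -> str:
    """Serialize as one child,parent row per parent-child link."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["child_id", "parent_id"])
    for child, parents in sorted(ped.items()):
        for parent in parents:
            writer.writerow([child, parent])
    return buffer.getvalue()
```

The CSV form is easy to load into spreadsheets, while the JSON form keeps each individual's parents together and leaves room for extra per-person fields.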

Format Conversion Example

# Example of converting between IBD formats
def convert_to_bonsai_format(external_segment_data, format_type):
    """
    Convert external IBD data to Bonsai's internal format.
    
    Args:
        external_segment_data: IBD data in external format
        format_type: String identifying the external format
        
    Returns:
        IBD segments in Bonsai's internal format
    """
    bonsai_segments = []
    
    if format_type == "23andme":
        # Convert 23andMe format
        for segment in external_segment_data:
            bonsai_segments.append({
                "chromosome": segment["chromosome"],
                "start_pos": int(segment["start_point"]),
                "end_pos": int(segment["end_point"]),
                "cm_length": float(segment["centimorgans"]),
                "snp_count": int(segment["snps"])
            })
    
    elif format_type == "ancestry":
        # Convert an Ancestry-style segment table (illustrative column names;
        # AncestryDNA itself does not export segment-level data)
        for segment in external_segment_data:
            bonsai_segments.append({
                "chromosome": segment["Chr"],
                "start_pos": int(segment["Start"]),
                "end_pos": int(segment["End"]),
                "cm_length": float(segment["cM"]),
                "snp_count": int(segment.get("SNPs", 0))
            })
    
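    # More format conversions (additional elif branches) go here...

    else:
        # Fail loudly instead of silently returning an empty list
        raise ValueError(f"Unsupported IBD format: {format_type!r}")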
    
    return bonsai_segments

Data Transformation Challenges

Converting between different data formats presents several challenges:

Information Loss: Some formats contain less information than others
Coordinate Systems: Different genomic coordinate references
Identifier Mapping: Reconciling different individual identifiers
Quality Variations: Varying data quality and completeness

Bonsai's data exchange utilities include mechanisms to handle these challenges and maintain data integrity during format conversions.
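Information loss is easiest to see when segment-level data is reduced to summary statistics. The sketch below performs that reduction on segments that use the `cm_length` key from the conversion example; the function name and output keys are assumptions for illustration.

```python
def summarize_segments(segments):
    """Collapse segment-level IBD data into aggregate statistics.

    The result keeps the total, the count, and the longest segment, but
    drops chromosome and position, so the conversion cannot be reversed.
    """
    lengths = [seg["cm_length"] for seg in segments]
    return {
        "total_cm": sum(lengths),
        "num_segments": len(lengths),
        "longest_segment": max(lengths, default=0.0),
    }
```

Going from segments to a summary is always possible; going back is not, which is why it helps to record the original format alongside any converted data.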

API Integration

Programmatic Access and Control

Bonsai v3 provides several API mechanisms for programmatic integration with other systems:

Python API

Bonsai's primary API is its Python interface, which allows direct integration with other Python-based tools:

# Example of using Bonsai's Python API
from bonsai.v3 import PedigreeBuilder, IBDProcessor

# Initialize Bonsai components
ibd_processor = IBDProcessor()
pedigree_builder = PedigreeBuilder()

# Process IBD data
processed_ibd = ibd_processor.process_segments(raw_segments)

# Build pedigree from processed IBD
pedigree = pedigree_builder.build_from_ibd(processed_ibd)

# Export results in desired format
pedigree.export_to_gedcom("results.ged")

This API enables seamless integration with other Python-based genetic and genealogical tools, creating unified analysis workflows.

Command-Line Interface

For integration with non-Python systems, Bonsai provides a command-line interface:

# Example of command-line integration
$ bonsai-process --input segments.csv --format 23andme --output processed.json
$ bonsai-build --input processed.json --output pedigree.ged

This command-line interface enables easy integration with shell scripts, workflows, and other command-line tools.
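A script written in Python can drive the same two commands through `subprocess`. The sketch below only assumes the command names and flags shown above; the wrapper function names are hypothetical.

```python
import subprocess


def build_bonsai_commands(segments_path, source_format, processed_path, pedigree_path):
    """Return the argument lists for the two CLI steps shown above."""
    return [
        ["bonsai-process", "--input", segments_path,
         "--format", source_format, "--output", processed_path],
        ["bonsai-build", "--input", processed_path, "--output", pedigree_path],
    ]


def run_bonsai_cli(segments_path, source_format, processed_path, pedigree_path):
    """Run both steps in order, raising CalledProcessError if either fails."""
    for command in build_bonsai_commands(
        segments_path, source_format, processed_path, pedigree_path
    ):
        # Passing a list (not a shell string) avoids quoting problems in file names
        subprocess.run(command, check=True)
```

Keeping command construction separate from execution makes the wrapper testable without the CLI installed.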

Web API

For distributed or service-oriented architectures, Bonsai can be deployed with a REST API:

# Example of REST API access
POST /api/v1/process-ibd
{
  "segments": [...],
  "format": "23andme"
}

Response:
{
  "processed_data": [...],
  "statistics": {...}
}

POST /api/v1/build-pedigree
{
  "processed_data": [...],
  "parameters": {...}
}

Response:
{
  "pedigree": {...},
  "statistics": {...}
}

This web API enables integration with web applications, cloud-based services, and other distributed systems.
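A client for the first endpoint could be sketched with the standard library alone. The base URL below is a placeholder assumption, since the deployment address is not specified here, and the function name is hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # hypothetical deployment address


def build_process_ibd_request(segments, source_format, base_url=BASE_URL):
    """Build (but do not send) a POST request for the process-ibd endpoint."""
    payload = json.dumps({"segments": segments, "format": source_format}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/process-ibd",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the response body could then be decoded with `json.load`.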

Integration with DNA Testing Platforms

Connecting with Commercial and Research Platforms

Bonsai v3 includes specific integration mechanisms for major DNA testing platforms, enabling direct data exchange and analysis coordination:

Supported Testing Platforms

23andMe: Personal genome testing focused on health and ancestry
AncestryDNA: Genealogy-focused genetic testing service
Family Tree DNA: Service focusing on genetic genealogy and deep ancestry
MyHeritage DNA: Combined genetic testing and family tree service
LivingDNA: Testing with detailed geographic ancestry resolution
All of Us: NIH research program with genetic data
UK Biobank: Large-scale biomedical database and research resource

Integration Approaches

Bonsai supports several methods for integrating with these platforms:

Data Import: Reading raw data files downloaded from testing platforms
API Connections: Direct API integration where supported
Format Conversion: Converting between platform-specific and standard formats
Browser Extensions: Supporting data extraction from web interfaces

23andMe Integration Example

# Example of integrating with 23andMe data
def process_23andme_data(raw_data_file, matches_file):
    """
    Process 23andMe data files for Bonsai analysis.
    
    Args:
        raw_data_file: Path to 23andMe raw data file
        matches_file: Path to 23andMe matches CSV export
        
    Returns:
        Processed data ready for Bonsai analysis
    """
    # Load and parse raw genotype data
    genotypes = parse_23andme_raw_data(raw_data_file)
    
    # Load and parse matches data
    matches = parse_23andme_matches(matches_file)
    
    # Convert to Bonsai format
    bonsai_segments = []
    for match in matches:
        match_segments = extract_segments_from_match(match)
        bonsai_segments.extend(match_segments)
    
    # Process with Bonsai (ibd_processor is an IBDProcessor instance,
    # created as in the Python API example)
    processed_data = ibd_processor.process_segments(bonsai_segments)
    
    return processed_data
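The raw-data parsing step can be sketched as follows. 23andMe raw data files are tab-separated with `#` comment lines and four columns (rsid, chromosome, position, genotype). The function name and the returned dictionary layout are assumptions, and unlike the helper called above this version takes an iterable of lines rather than a path.

```python
def parse_23andme_raw_lines(lines):
    """Parse 23andMe raw-data lines into {rsid: record} (a minimal sketch)."""
    genotypes = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and the commented header block
        if not line or line.startswith("#"):
            continue
        rsid, chromosome, position, genotype = line.split("\t")
        genotypes[rsid] = {
            "chromosome": chromosome,
            "position": int(position),
            "genotype": genotype,  # e.g. "AG"; "--" marks a no-call
        }
    return genotypes
```

An open file object can be passed directly, for example `with open(path) as handle: parse_23andme_raw_lines(handle)`.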

Platform-Specific Considerations

Each testing platform has unique characteristics that affect integration:

Data Completeness: Some platforms provide more detailed data than others
Access Mechanisms: Varying API availability and data export options
Coordinate Systems: Different genomic build references
Privacy Controls: Platform-specific restrictions on data sharing

Bonsai's integration modules account for these differences to provide consistent analysis capabilities across platforms.

Integration with Family Tree Systems

Combining Genetic and Documentary Evidence

One of the most powerful aspects of genetic genealogy is the integration of genetic evidence with traditional family tree information. Bonsai v3 supports bidirectional integration with family tree systems:

Family Tree Import

Bonsai can import existing family tree data to:

Provide Context: Using known relationships to inform genetic analysis
Generate Hypotheses: Creating relationship hypotheses to test with genetic data
Pre-populate Pedigrees: Starting with documentary pedigrees and confirming/extending with genetic evidence
Identify Gaps: Finding areas where genetic evidence might resolve uncertainties

Family Tree Export

Bonsai can export its analysis results to family tree systems for:

Verification: Confirming documentary relationships with genetic evidence
Extension: Adding genetically discovered relationships to existing trees
Correction: Identifying and resolving contradictions between genetic and documentary evidence
Documentation: Recording confidence levels and evidence sources
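The verification and correction steps amount to comparing the degree implied by the documentary tree with the degree inferred from DNA. The sketch below flags pairs where the two disagree by more than a tolerance; the input layout (pair tuples mapped to degrees), the function name, and the default tolerance are assumptions for illustration.

```python
def find_degree_conflicts(documentary, genetic, tolerance=1.0):
    """List pairs whose documentary and genetic degrees disagree.

    Both inputs map an (id_a, id_b) tuple to a relationship degree.
    Pairs without genetic evidence are skipped rather than flagged.
    """
    conflicts = []
    for pair, doc_degree in sorted(documentary.items()):
        gen_degree = genetic.get(pair)
        if gen_degree is not None and abs(gen_degree - doc_degree) > tolerance:
            conflicts.append((pair, doc_degree, gen_degree))
    return conflicts
```

A flagged pair is a prompt for further research, not proof of an error, since degree estimates for distant relatives carry wide uncertainty.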

GEDCOM Integration Example

# Example of integrating with GEDCOM family tree data
def integrate_gedcom_with_genetic_data(gedcom_file, genetic_data):
    """
    Integrate GEDCOM family tree with genetic data in Bonsai.
    
    Args:
        gedcom_file: Path to GEDCOM file
        genetic_data: Processed genetic data from Bonsai
        
    Returns:
        Integrated pedigree with both documentary and genetic evidence
    """
    # Parse GEDCOM file
    gedcom_pedigree = parse_gedcom(gedcom_file)
    
    # Convert to Bonsai pedigree format
    documentary_pedigree = convert_to_bonsai_pedigree(gedcom_pedigree)
    
    # Create genetic pedigree
    genetic_pedigree = pedigree_builder.build_from_ibd(genetic_data)
    
    # Integrate the pedigrees
    integrated_pedigree = pedigree_integrator.integrate_pedigrees(
        documentary_pedigree,
        genetic_pedigree,
        conflict_resolution="genetic_priority"
    )
    
    # Annotate with confidence information
    annotated_pedigree = confidence_annotator.annotate_pedigree(
        integrated_pedigree,
        genetic_data
    )
    
    return annotated_pedigree
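The parsing step can be illustrated with a minimal reader for GEDCOM family records. It relies only on the standard `FAM`, `HUSB`, `WIFE`, and `CHIL` tags and ignores everything else; the function name and the child-to-parents output are assumptions, and a production workflow would more likely use a full GEDCOM library.

```python
def parse_gedcom_families(lines):
    """Return {child_id: [parent_ids]} from the FAM records of a GEDCOM file."""
    families = []
    current = None  # the FAM record being read, if any
    for raw in lines:
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        if level == "0":
            # Level-0 lines look like "0 @F1@ FAM"; the record type is the third field
            is_family = len(parts) == 3 and parts[2] == "FAM"
            current = {"parents": [], "children": []} if is_family else None
            if is_family:
                families.append(current)
        elif current is not None and level == "1" and len(parts) == 3:
            if tag in ("HUSB", "WIFE"):
                current["parents"].append(parts[2])
            elif tag == "CHIL":
                current["children"].append(parts[2])
    parents_of = {}
    for family in families:
        for child in family["children"]:
            parents_of.setdefault(child, []).extend(family["parents"])
    return parents_of
```

The pointers keep their GEDCOM form (such as `@I3@`), so mapping them to the identifiers used in the genetic data remains a separate step.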

Supported Family Tree Systems

Bonsai can integrate with various family tree systems:

Desktop Software: Programs like Family Tree Maker, RootsMagic, Legacy
Online Services: Platforms like Ancestry, MyHeritage, FamilySearch
Open Source Systems: Tools like Gramps, webtrees
Research Databases: Specialized academic and professional systems

Integration with IBD Detection Tools

Leveraging Specialized Detection Algorithms

Bonsai focuses on relationship inference from IBD data, but often relies on specialized external tools for the initial IBD detection. Bonsai v3 includes integration mechanisms for several IBD detection tools:

Supported IBD Detection Tools

GERMLINE: Fast IBD detection for large datasets
Refined-IBD: High-precision IBD detection
IBDseq: IBD detection for sequencing data
IBIS: IBD detection in unphased data (IBD via identical by state)
hap-IBD: Haplotype-based IBD detection
iLASH: IBD detection for biobank-scale data

Integration Workflow

Input Preparation: Formatting genetic data for IBD detection tools
Tool Execution: Running the detection algorithm (directly or via wrappers)
Result Processing: Converting detection results to Bonsai's internal format
Quality Assessment: Evaluating the reliability of detected segments
Normalization: Adjusting for tool-specific biases and characteristics
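The quality assessment step often begins with a simple length and marker-count filter. The sketch below applies one to segments in the dictionary layout used by the conversion example; the 7 cM default and the function name are illustrative assumptions, not Bonsai settings.

```python
def filter_reliable_segments(segments, min_cm=7.0, min_snps=0):
    """Keep segments that meet minimum length and SNP-count thresholds."""
    return [
        seg for seg in segments
        if seg["cm_length"] >= min_cm and seg.get("snp_count", 0) >= min_snps
    ]
```

Appropriate thresholds depend on the detection tool and marker density, so they are best treated as parameters to tune per data source.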

Refined-IBD Integration Example

# Example of integrating with Refined-IBD
def process_with_refined_ibd(vcf_file, map_file):
    """
    Process VCF data with Refined-IBD and integrate with Bonsai.
    
    Args:
        vcf_file: Path to VCF file with genetic data
        map_file: Path to genetic map file
        
    Returns:
        Processed IBD segments ready for Bonsai analysis
    """
    # Prepare Refined-IBD input
    refined_ibd_input = prepare_refined_ibd_input(vcf_file, map_file)
    
    # Run Refined-IBD (external process)
    refined_ibd_output = run_refined_ibd(refined_ibd_input)
    
    # Parse Refined-IBD output
    detected_segments = parse_refined_ibd_output(refined_ibd_output)
    
    # Convert to Bonsai format
    bonsai_segments = convert_to_bonsai_format(detected_segments, "refined-ibd")
    
    # Process with Bonsai
    processed_segments = ibd_processor.process_segments(bonsai_segments)
    
    return processed_segments
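The parsing step might look like the sketch below. It assumes the nine-column Refined IBD output layout (two sample ID and haplotype index pairs, chromosome, start, end, LOD score, cM length), which should be verified against the documentation of the version in use. The function name and the output keys `id1`, `id2`, and `lod` are assumptions.

```python
def parse_refined_ibd_lines(lines):
    """Parse Refined IBD output lines into segment dictionaries (a sketch)."""
    segments = []
    for line in lines:
        fields = line.split()
        if len(fields) != 9:
            continue  # skip blank or malformed lines
        id1, _hap1, id2, _hap2, chrom, start, end, lod, cm = fields
        segments.append({
            "id1": id1,
            "id2": id2,
            "chromosome": chrom,
            "start_pos": int(start),
            "end_pos": int(end),
            "lod": float(lod),
            "cm_length": float(cm),
        })
    return segments
```

Keeping the LOD score lets later steps weigh segments by detection confidence instead of treating all of them equally.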

Tool Selection Considerations

Different IBD detection tools have different strengths and limitations:

Tool	Strengths	Limitations	Best For
GERMLINE	Speed, scalability	Lower precision	Large datasets, initial screening
Refined-IBD	Accuracy, modeled error rates	Computational intensity	High-precision requirements
hap-IBD	Robust to phasing errors	Complex parameters	Datasets with phasing challenges
IBIS	Works with unphased data	Less sensitive to short segments	Unphased consumer data

Bonsai's integration modules account for these differences and can adjust its analysis approach based on the IBD detection tool used.

Creating Integrated Workflows

End-to-End Genetic Genealogy Processes

By combining Bonsai v3 with other tools and systems, researchers can create comprehensive genetic genealogy workflows tailored to specific research questions and contexts:

Example Workflow: Unknown Parentage Case

Data Collection: Testing with multiple platforms for maximum match coverage
IBD Detection: Using specialized tools to identify shared DNA segments
Relationship Inference: Using Bonsai to predict relationships from IBD patterns
Match Clustering: Grouping matches by likely family branches
Tree Building: Constructing partial family trees for each cluster
Common Ancestor Identification: Finding connecting points between trees
Hypothesis Validation: Using documentary research to verify predictions
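The match clustering step in this workflow can be sketched as finding connected components among matches that also match each other. The version below uses a small union-find; the input (pairs of match IDs known to share DNA with each other) and the function name are assumptions, and real clustering methods usually weight links by shared cM.

```python
def cluster_matches(shared_pairs):
    """Group match IDs into clusters linked by shared-match pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups fast
            x = parent[x]
        return x

    for a, b in shared_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for member in parent:
        clusters.setdefault(find(member), set()).add(member)
    # Sort members and clusters so the output is deterministic
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])
```

Each resulting cluster is a candidate family branch for the tree building step that follows.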

Example Workflow: Population Study

Sample Collection: Gathering genetic data from the population of interest
Admixture Analysis: Using population genetics tools to assess ancestry
IBD Detection: Identifying shared segments within the population
Relationship Network Construction: Using Bonsai to build a comprehensive relationship network
Historical Context Integration: Incorporating documentary and demographic information
Network Analysis: Applying social network analysis to the relationship structure
Visualization and Reporting: Presenting the findings with appropriate visualizations

Integration Pipeline Example

# Example of a complete integration pipeline
def run_integrated_workflow(raw_data_files, known_relationships=None):
    """
    Run a complete integrated genetic genealogy workflow.
    
    Args:
        raw_data_files: List of paths to raw genetic data files
        known_relationships: Optional dict of known family relationships
        
    Returns:
        Complete analysis results
    """
    # Phase 1: Data preparation
    processed_data = []
    for file_path in raw_data_files:
        file_format = detect_file_format(file_path)
        processed_file = process_raw_data(file_path, file_format)
        processed_data.append(processed_file)
    
    # Phase 2: IBD detection (using appropriate external tool)
    ibd_segments = detect_ibd_segments(processed_data)
    
    # Phase 3: Relationship inference with Bonsai
    relationship_predictions = bonsai_analyzer.infer_relationships(ibd_segments)
    
    # Phase 4: Family tree integration
    if known_relationships:
        integrated_pedigree = integrate_with_known_relationships(
            relationship_predictions, 
            known_relationships
        )
    else:
        integrated_pedigree = build_pedigree_from_predictions(relationship_predictions)
    
    # Phase 5: Visualization and reporting
    visualizations = generate_visualizations(integrated_pedigree)
    report = generate_analysis_report(integrated_pedigree, relationship_predictions)
    
    return {
        "pedigree": integrated_pedigree,
        "relationships": relationship_predictions,
        "visualizations": visualizations,
        "report": report
    }

Conclusion and Next Steps

Bonsai v3's integration capabilities enable it to function as a key component in comprehensive genetic genealogy workflows, connecting with DNA testing platforms, family tree systems, IBD detection tools, and other specialized resources. Through mechanisms like the DRUID algorithm, standardized data exchange formats, and flexible APIs, Bonsai can adapt to diverse research contexts and leverage complementary tools to enhance its relationship inference capabilities.

By understanding and utilizing these integration mechanisms, researchers can create powerful, customized workflows that combine the strengths of multiple tools and data sources to address complex genetic genealogy challenges.

In the next lab, we'll explore how to implement end-to-end pedigree reconstruction pipelines using Bonsai v3, integrating all the components we've studied throughout this course.

This lab is part of the Visualization & Advanced Applications track (Labs 21 through 30).