Computational Genetic Genealogy

Lab 20: Error Handling and Data Validation

Core Component: This lab explores the error handling and data validation techniques used in Bonsai v3 to ensure robust performance even with imperfect input data. Effective error handling is critical for genetic genealogy applications, where data quality can vary significantly and consistent results are essential.

The Importance of Robust Error Handling

Why Error Handling Matters in Genetic Genealogy

Genetic genealogy applications deal with complex, often imperfect data from various sources. Robust error handling is essential to:

  • Maintain Data Integrity: Ensure that invalid data doesn't compromise analysis results
  • Provide Actionable Feedback: Help users understand and address data issues
  • Enable Graceful Degradation: Continue operation despite partial failures
  • Support Debugging: Facilitate efficient troubleshooting of issues

Without proper error handling, genetic genealogy tools can produce misleading results or fail entirely when encountering unexpected data patterns, significantly reducing their usefulness in real-world scenarios.

Custom Exception Hierarchy

Specialized Exceptions for Clear Error Communication

Bonsai v3 implements a custom exception hierarchy that enables more targeted error handling and clearer error messages. This hierarchical approach allows catching specific types of errors or broader categories as needed.

Bonsai's Exception Hierarchy

class BonsaiException(Exception):
    """Base exception class for all Bonsai-specific exceptions."""
    
    def __init__(self, message, details=None):
        self.message = message
        self.details = details or {}
        super().__init__(self.message)
    
# Input-related exceptions
class InputError(BonsaiException):
    """Base class for input-related errors."""
    pass

class ValidationError(InputError):
    """Raised when input data fails validation."""
    pass

class DataFormatError(InputError):
    """Raised when input data has incorrect format."""
    pass

class MissingDataError(InputError):
    """Raised when required data is missing."""
    pass

# Processing-related exceptions
class ProcessingError(BonsaiException):
    """Base class for processing-related errors."""
    pass

# Configuration-related exceptions
class ConfigurationError(BonsaiException):
    """Raised when there's an issue with the configuration."""
    pass

# Resource-related exceptions
class ResourceError(BonsaiException):
    """Base class for resource-related errors."""
    pass

This hierarchy allows for more granular error handling. For example, code can catch specific types of input errors (like ValidationError) while letting other types of errors propagate, or catch all Bonsai-specific errors (BonsaiException) while letting system errors propagate.
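As a rough illustration of this granularity, the sketch below redefines a trimmed-down copy of the hierarchy and catches at two levels: ValidationError first, then any other BonsaiException, while unrelated errors propagate. The load_segment and safe_load helpers are hypothetical names used only for this example and are not part of Bonsai.

```python
class BonsaiException(Exception):
    """Trimmed-down copy of the base class shown above."""

    def __init__(self, message, details=None):
        self.message = message
        self.details = details or {}
        super().__init__(self.message)


class InputError(BonsaiException):
    pass


class ValidationError(InputError):
    pass


class MissingDataError(InputError):
    pass


def load_segment(record):
    """Hypothetical loader that raises a different exception per failure type."""
    if record is None:
        raise MissingDataError("record is missing")
    if record["end"] <= record["start"]:
        raise ValidationError("end must exceed start", details=dict(record))
    return record


def safe_load(record):
    """Handle the narrowest exception first, then the broader category."""
    try:
        return "ok", load_segment(record)
    except ValidationError as e:
        # Specific handler: the record is present but malformed.
        return "invalid", e.details
    except BonsaiException as e:
        # Broad handler: any other Bonsai-specific failure.
        return "bonsai_error", e.message
    # Non-Bonsai errors (for example KeyError) are deliberately not caught.
```

Ordering matters here: because ValidationError is a subclass of BonsaiException, the specific handler has to come first or it would never run.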

Benefits of Custom Exceptions

Custom exceptions provide several key advantages:

  • More informative error messages with context
  • Ability to include structured details about the error
  • Clearer distinction between different error types
  • More targeted error recovery strategies

Input Validation Techniques

Ensuring Data Quality at the Entry Point

Input validation is a critical first line of defense against data errors. Bonsai v3 uses several complementary techniques for thorough input validation:

Data Classes with Built-in Validation

Core data structures like IBDSegment use Python's dataclass with custom validation in the __post_init__ method:

from dataclasses import dataclass

@dataclass
class IBDSegment:
    """A data class representing an IBD segment with built-in validation."""
    start_pos: int
    end_pos: int
    cm: float
    snps: int
    chromosome: str = "1"  # Default to chromosome 1
    
    def __post_init__(self):
        """Validate the segment after initialization."""
        # Check types
        if not isinstance(self.start_pos, int):
            raise ValidationError(
                "start_pos must be an integer",
                details={"start_pos": self.start_pos}
            )
        
        # Check values
        if self.start_pos < 0:
            raise ValidationError(
                "start_pos must be non-negative",
                details={"start_pos": self.start_pos}
            )
        
        if self.end_pos <= self.start_pos:
            raise ValidationError(
                "end_pos must be greater than start_pos",
                details={"start_pos": self.start_pos, "end_pos": self.end_pos}
            )
        
        # Additional validations...

Validator Functions for Reusable Validation

For more flexible validation across different contexts, Bonsai v3 uses dedicated validator functions:

class Validator:
    """A collection of validation functions for genetic data."""
    
    @staticmethod
    def validate_chromosome(chrom):
        """Validate a chromosome identifier."""
        valid_chromosomes = [str(i) for i in range(1, 23)] + ["X", "Y"]
        
        if not isinstance(chrom, str):
            raise ValidationError(
                "Chromosome must be a string",
                details={"chromosome": chrom, "type": type(chrom).__name__}
            )
        
        # Normalize the chromosome format (strip a "chr" prefix in any letter case)
        normalized = chrom.strip().upper()
        if normalized.startswith("CHR"):
            normalized = normalized[3:]
        
        if normalized not in valid_chromosomes:
            raise ValidationError(
                "Chromosome must be 1-22, X, or Y",
                details={"chromosome": chrom, "normalized": normalized}
            )
        
        return normalized
        
    # Additional validators...

These validator functions can be used across the codebase to ensure consistent validation rules and reduce duplication of validation logic.
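One way this reuse might look is a data class whose __post_init__ delegates to a shared validator, so that construction and filtering apply the same rule. The sketch below is an assumption about how the pieces could fit together, not Bonsai's actual code: Segment and the module-level validate_chromosome are simplified stand-ins for IBDSegment and Validator.validate_chromosome.

```python
from dataclasses import dataclass


class ValidationError(Exception):
    """Stand-in for Bonsai's ValidationError (message plus details dict)."""

    def __init__(self, message, details=None):
        super().__init__(message)
        self.details = details or {}


VALID_CHROMOSOMES = {str(i) for i in range(1, 23)} | {"X", "Y"}


def validate_chromosome(chrom):
    """Return the normalized chromosome label or raise ValidationError."""
    if not isinstance(chrom, str):
        raise ValidationError(
            "Chromosome must be a string",
            details={"chromosome": chrom, "type": type(chrom).__name__},
        )
    normalized = chrom.strip().upper()
    if normalized.startswith("CHR"):
        normalized = normalized[3:]
    if normalized not in VALID_CHROMOSOMES:
        raise ValidationError(
            "Chromosome must be 1-22, X, or Y",
            details={"chromosome": chrom, "normalized": normalized},
        )
    return normalized


@dataclass
class Segment:
    """Simplified stand-in for IBDSegment."""

    start_pos: int
    end_pos: int
    chromosome: str = "1"

    def __post_init__(self):
        # Reuse the shared validator and store the normalized value.
        self.chromosome = validate_chromosome(self.chromosome)
        if self.end_pos <= self.start_pos:
            raise ValidationError(
                "end_pos must be greater than start_pos",
                details={"start_pos": self.start_pos, "end_pos": self.end_pos},
            )
```

With this arrangement, Segment(100, 200, "chrX") stores the chromosome as "X", and any other code that calls validate_chromosome accepts and rejects exactly the same labels.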

Validation Best Practices

Bonsai v3 follows these validation best practices:

  • Validate early to catch errors at their source
  • Provide detailed error messages that explain the issue
  • Include context (such as variable names and values) in error messages
  • Return normalized values to ensure consistent data format
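The practices above can be sketched with a centimorgan validator. Later code in this lab calls Validator.validate_centimorgans, but its body is not shown, so the version below is a hypothetical implementation written for illustration; the accepted input types and the name parameter are assumptions.

```python
import math


class ValidationError(Exception):
    """Stand-in for Bonsai's ValidationError (message plus details dict)."""

    def __init__(self, message, details=None):
        super().__init__(message)
        self.details = details or {}


def validate_centimorgans(value, name="cm"):
    """Return value as a finite, non-negative float, or raise ValidationError."""
    # Reject booleans explicitly: bool is a subclass of int in Python.
    if isinstance(value, bool) or not isinstance(value, (int, float, str)):
        raise ValidationError(
            f"{name} must be a number",
            details={name: value, "type": type(value).__name__},
        )
    try:
        cm = float(value)  # also accepts numeric strings such as "7.5"
    except (ValueError, OverflowError):
        raise ValidationError(
            f"{name} is not a valid number", details={name: value}
        ) from None
    if math.isnan(cm) or math.isinf(cm) or cm < 0:
        raise ValidationError(
            f"{name} must be finite and non-negative", details={name: value}
        )
    # Normalized return value: always a float, whatever the input type.
    return cm
```

The error message names the offending parameter and the details dict carries the rejected value, so a caller can log or display both without parsing the message text.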

Defensive Programming

Anticipating and Preventing Problems

Defensive programming is a technique that anticipates potential problems and handles them gracefully before they can cause failures. Bonsai v3 uses several defensive programming techniques:

Precondition and Postcondition Checks

Critical functions verify that their inputs meet required conditions (preconditions) and that their outputs satisfy expected properties (postconditions):

def get_common_ancestors(id1, id2, up_dict):
    """Find common ancestors of two individuals in a pedigree."""
    # Precondition checks
    if up_dict is None:
        raise MissingDataError("Pedigree data is missing (up_dict is None)")
    
    if not isinstance(up_dict, dict):
        raise ValidationError(
            "Pedigree data must be a dictionary",
            details={"actual_type": type(up_dict).__name__}
        )
    
    # Validate individual IDs
    for id_val, label in [(id1, "id1"), (id2, "id2")]:
        if id_val is None:
            raise MissingDataError(f"Individual ID is missing ({label} is None)")
        
        if id_val not in up_dict:
            raise ValidationError(
                "Individual not found in pedigree",
                details={"id": id_val, "label": label}
            )
    
    # Function implementation...
    
    # Postcondition checks
    if id1 in ancestors2:
        # id1 is an ancestor of id2
        assert id1 in common_ancestors, "id1 should be in common ancestors if it's an ancestor of id2"
    
    return common_ancestors

Check for Cycles in Pedigrees

Bonsai v3 performs specific checks for common data integrity issues, such as cycles in pedigree structures:

def check_for_cycles(iid, visited=None, path=None):
    """Check if there are cycles in the ancestry path."""
    if visited is None:
        visited = set()
    if path is None:
        path = []
    
    if iid in path:
        # Found a cycle
        cycle_path = path[path.index(iid):] + [iid]
        raise ValidationError(
            "Cycle detected in pedigree",
            details={"cycle": "->".join(str(i) for i in cycle_path)}
        )
    
    # Continue checking...

Defensive Programming in Action

Consider how Bonsai v3 handles complex pedigree operations defensively:

  1. First, validate all input parameters thoroughly
  2. Check for structural issues like cycles in the pedigree
  3. Verify relationships for biological plausibility
  4. Apply constraints based on known demographic patterns
  5. Validate the final output against expected properties

This multi-layered approach prevents incorrect pedigree structures from being created or processed, ensuring reliable results.
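The structural check in step 2 can be sketched as a depth-first search over the whole pedigree. This is a minimal illustration rather than Bonsai's implementation: it assumes up_dict maps each individual ID to an iterable of parent IDs (a dict keyed by parent ID also works), and find_cycle and check_pedigree are hypothetical names.

```python
class ValidationError(Exception):
    """Stand-in for Bonsai's ValidationError (message plus details dict)."""

    def __init__(self, message, details=None):
        super().__init__(message)
        self.details = details or {}


def find_cycle(up_dict):
    """Return one ancestry cycle as a list of IDs, or None if there is none."""
    IN_PROGRESS, DONE = 1, 2
    state = {}  # IDs absent from this dict have not been visited yet

    def visit(iid, path):
        state[iid] = IN_PROGRESS
        path.append(iid)
        for parent in up_dict.get(iid, ()):
            if state.get(parent) == IN_PROGRESS:
                # The parent is already on the current path: a cycle.
                return path[path.index(parent):] + [parent]
            if parent not in state:
                cycle = visit(parent, path)
                if cycle is not None:
                    return cycle
        path.pop()
        state[iid] = DONE  # fully explored; safe to reach again via another child
        return None

    for iid in up_dict:
        if iid not in state:
            cycle = visit(iid, [])
            if cycle is not None:
                return cycle
    return None


def check_pedigree(up_dict):
    """Raise ValidationError if any individual is its own ancestor."""
    cycle = find_cycle(up_dict)
    if cycle is not None:
        raise ValidationError(
            "Cycle detected in pedigree",
            details={"cycle": "->".join(str(i) for i in cycle)},
        )
```

Marking fully explored individuals as done keeps shared ancestors (two parents with the same grandparent) from being reported as cycles and avoids re-walking the same ancestry. The recursion depth equals the number of generations on a path, which is small for realistic pedigrees.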

Graceful Degradation

Continuing Operation Despite Partial Failures

In complex genetic genealogy applications, it's important to continue operating even when parts of the system fail. Bonsai v3 implements graceful degradation strategies to maintain functionality with reduced capabilities when faced with errors.

Fallback Mechanisms

When a preferred operation fails, Bonsai v3 often provides fallback mechanisms:

def get_segments(self, chromosome=None, min_cm=None, min_snps=None):
    """Get segments with optional filtering."""
    # Start with all segments
    result = self.segments
    
    # Apply filters if provided
    if chromosome is not None:
        try:
            chrom = Validator.validate_chromosome(chromosome)
            result = [s for s in result if s["chromosome"] == chrom]
        except ValidationError as e:
            # Fallback: Log the error but return empty list
            self.logger.warning(f"Invalid chromosome filter: {e}")
            return []
    
    if min_cm is not None:
        try:
            cm_threshold = Validator.validate_centimorgans(min_cm)
            result = [s for s in result if s["cm"] >= cm_threshold]
        except ValidationError as e:
            # Fallback: Use the configured threshold instead
            self.logger.warning(
                f"Invalid min_cm filter ({min_cm}), using default: {self.config['min_cm']}"
            )
            result = [s for s in result if s["cm"] >= self.config["min_cm"]]
    
    return result

Component Isolation

Bonsai v3 isolates different analysis components so that failures in one don't necessarily compromise the entire system:

def analyze(self, data):
    """Analyze genetic data with graceful degradation."""
    # Track all analysis results and errors
    results = {}
    errors = {}
    
    # Try each analyzer and gracefully handle failures
    for name, analyzer in self.analyzers.items():
        if name not in self.available_analyzers:
            self.logger.debug(f"Skipping unavailable analyzer: {name}")
            continue
        
        try:
            self.logger.debug(f"Running {name} analyzer")
            results[name] = analyzer(data)
        except MissingDataError as e:
            self.logger.info(f"{name} analyzer skipped: {e}")
            errors[name] = {"error": "missing_data", "message": str(e)}
        except ValidationError as e:
            self.logger.info(f"{name} analyzer failed validation: {e}")
            errors[name] = {"error": "validation", "message": str(e)}
        except Exception as e:
            self.logger.warning(f"{name} analyzer failed unexpectedly: {e}")
            errors[name] = {"error": "unexpected", "message": str(e)}
            
            # Mark this analyzer as unavailable for future calls
            self.available_analyzers.remove(name)
    
    # If all analyzers failed, raise an error
    if not results and errors:
        raise ProcessingError(
            "All analyzers failed",
            details={"errors": errors}
        )
    
    # Try to combine results for a final assessment
    final_assessment = self._combine_results(results, errors)
    
    # Return the complete analysis with partial results if needed
    return {
        "individual_results": results,
        "errors": errors,
        "final_assessment": final_assessment,
        "available_analyzers": list(self.available_analyzers),
        "status": "partial" if errors else "complete"
    }

Gradual Degradation Principle

Bonsai v3 follows the principle of gradual degradation—as more components or data elements become unavailable, the system's capabilities reduce proportionally rather than failing completely. This allows it to extract maximum value from available data even in suboptimal conditions.
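To make the principle concrete, here is a toy sketch in which the final assessment is built from whichever analyzers succeeded and reports how much of the intended evidence it used. Everything in it is hypothetical: the two estimators, their thresholds, and combine_results (the lab calls self._combine_results without showing it) are illustrative assumptions, not Bonsai code or real genetic thresholds.

```python
def by_total_cm(data):
    """Toy estimator that needs 'total_cm'; the threshold is illustrative only."""
    return 1.0 if data["total_cm"] > 2000 else 3.0


def by_age_gap(data):
    """Toy estimator that needs 'age_gap'; the threshold is illustrative only."""
    return 1.0 if data["age_gap"] >= 15 else 2.0


def combine_results(results, errors):
    """Average the estimates that succeeded and report evidence coverage."""
    attempted = len(results) + len(errors)
    if not results:
        return {"estimate": None, "coverage": 0.0}
    return {
        "estimate": sum(results.values()) / len(results),
        "coverage": len(results) / attempted,
    }


def analyze(analyzers, data):
    """Run every analyzer in isolation and degrade as inputs go missing."""
    results, errors = {}, {}
    for name, analyzer in analyzers.items():
        try:
            results[name] = analyzer(data)
        except KeyError as e:
            # A missing field disables this analyzer only.
            errors[name] = f"missing field {e}"
    if not results:
        status = "failed"
    elif errors:
        status = "partial"
    else:
        status = "complete"
    return {
        "final_assessment": combine_results(results, errors),
        "errors": errors,
        "status": status,
    }
```

With both fields present the status is "complete" with coverage 1.0. Dropping age_gap yields "partial" with coverage 0.5, and an empty input yields "failed" with no estimate.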

Logging and Debugging

Comprehensive Visibility into System Operation

Effective logging and debugging facilities are essential for diagnosing and resolving issues in complex genetic genealogy applications. Bonsai v3's logging layer adds structured context, timing checkpoints, and exception summaries to standard log messages.

Structured Logging

Bonsai v3 uses structured logging to include relevant context with each log message:

def _log(self, level, message, **kwargs):
    """Internal method to format and log messages with structured data."""
    # Append structured data if provided
    if kwargs:
        # Format structured data for readability
        data_str = ", ".join(f"{k}={self._format_value(v)}" for k, v in kwargs.items())
        full_message = f"{message} [{data_str}]"
    else:
        full_message = message
    
    # Log the message
    self.logger.log(level, full_message)

Performance Tracking

To diagnose performance issues, Bonsai v3 includes checkpoint logging for timing critical operations:

def checkpoint(self, name):
    """Log a performance checkpoint."""
    now = time.time()
    elapsed = now - self.last_checkpoint
    total_elapsed = now - self.start_time
    
    self.info(
        f"Checkpoint: {name}",
        elapsed_seconds=f"{elapsed:.3f}",
        total_elapsed=f"{total_elapsed:.3f}"
    )
    
    self.last_checkpoint = now
    return elapsed

Exception Logging

Bonsai v3 provides specialized functions for logging exceptions with context:

def log_exception(self, e, context=None):
    """Log an exception with context."""
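    # Format the traceback of e itself, so this also works outside an except block
    tb = "".join(traceback.format_exception(type(e), e, e.__traceback__))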
    context_dict = context or {}
    
    # Extract exception details
    exc_type = type(e).__name__
    exc_message = str(e)
    
    # Make a clean traceback for logging
    tb_lines = tb.split('\n')
    if len(tb_lines) > 10:
        # Truncate if too long
        tb_summary = '\n'.join(tb_lines[:3] + ["..."] + tb_lines[-5:])
    else:
        tb_summary = tb
    
    # Log the exception
    self.error(
        f"Exception: {exc_type}: {exc_message}",
        exception_type=exc_type,
        **context_dict
    )
    
    # Log the traceback at debug level
    self.debug(f"Traceback:\n{tb_summary}")

Logging in a Multi-Step Process

Here's how Bonsai v3 might log a typical pedigree reconstruction process:

  1. INFO: "Starting pedigree reconstruction for 120 individuals"
  2. DEBUG: "Loading IBD data from file: data.seg"
  3. INFO: "Loaded 1,523 IBD segments between 87 pairs of individuals"
  4. INFO: "Checkpoint: Data preparation completed [elapsed_seconds=2.342, total_elapsed=2.342]"
  5. INFO: "Validated segments [valid=1498, invalid=25]"
  6. WARNING: "25 segments failed validation [reason=end_before_start]"
  7. INFO: "Checkpoint: Relationship inference started [elapsed_seconds=0.128, total_elapsed=2.470]"
  8. INFO: "Completed relationship inference for 87 pairs"
  9. INFO: "Checkpoint: Pedigree construction [elapsed_seconds=5.231, total_elapsed=7.701]"
  10. INFO: "Constructed pedigree with 112 individuals in 14 family groups"

This comprehensive logging provides clear visibility into each step of the process, making it easier to identify and address any issues that arise.
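A minimal sketch of a logger that could produce lines in this shape, built on Python's standard logging module, is shown below. The StructuredLogger class name and its constructor are assumptions made for illustration. It also uses time.perf_counter, a monotonic clock, where the excerpt above uses time.time.

```python
import logging
import time


class StructuredLogger:
    """Sketch: appends key=value context to each message."""

    def __init__(self, name="bonsai.sketch"):
        self.logger = logging.getLogger(name)
        # perf_counter is monotonic, so elapsed times cannot go negative
        # if the system clock is adjusted mid-run.
        self.start_time = self.last_checkpoint = time.perf_counter()

    def _log(self, level, message, **kwargs):
        if kwargs:
            data = ", ".join(f"{k}={v}" for k, v in kwargs.items())
            message = f"{message} [{data}]"
        self.logger.log(level, message)

    def debug(self, message, **kwargs):
        self._log(logging.DEBUG, message, **kwargs)

    def info(self, message, **kwargs):
        self._log(logging.INFO, message, **kwargs)

    def warning(self, message, **kwargs):
        self._log(logging.WARNING, message, **kwargs)

    def checkpoint(self, name):
        """Log time since the previous checkpoint and since start."""
        now = time.perf_counter()
        elapsed = now - self.last_checkpoint
        self.info(
            f"Checkpoint: {name}",
            elapsed_seconds=f"{elapsed:.3f}",
            total_elapsed=f"{now - self.start_time:.3f}",
        )
        self.last_checkpoint = now
        return elapsed
```

After configuring output with logging.basicConfig(level=logging.INFO), a call such as log.info("Validated segments", valid=1498, invalid=25) emits the message "Validated segments [valid=1498, invalid=25]".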

Practical Implementation in Bonsai v3

Integrating Error Handling Throughout the Codebase

In Bonsai v3, error handling is not an afterthought but an integral part of every component. This comprehensive approach ensures that errors are caught and handled appropriately at every level of the system.

Component                               | Error Handling Approach          | Key Techniques
----------------------------------------|----------------------------------|-----------------------------------------------------
IBD Processing (ibd.py)                 | Strict validation with fallbacks | Data structure validation, threshold-based filtering
Relationship Inference (likelihoods.py) | Graceful degradation             | Multiple evidence sources, confidence scoring
Pedigree Construction (pedigree.py)     | Defensive programming            | Cycle detection, consistency checking
Data Loading (loaders.py)               | Progressive enhancement          | Format detection, partial parsing
User Interface (ui.py)                  | User-friendly error reporting    | Actionable error messages, suggested fixes

This strategic application of different error handling techniques ensures that Bonsai v3 is robust in real-world usage scenarios with imperfect data.

Conclusion and Next Steps

Error handling and data validation are critical components of Bonsai v3's architecture, ensuring reliable operation even with imperfect data inputs. By implementing a comprehensive error handling strategy—including custom exceptions, thorough validation, defensive programming, graceful degradation, and detailed logging—Bonsai v3 achieves the robustness required for real-world genetic genealogy applications.

These techniques allow Bonsai v3 to focus on extracting maximum value from available data while clearly communicating any limitations or issues to users, leading to more trustworthy and actionable results.

In the next lab, we'll explore pedigree rendering and visualization techniques in Bonsai v3, which help users interpret and understand the complex pedigree structures inferred from genetic data.

Interactive Lab Environment

The interactive Lab 20 notebook runs in Google Colab. Data is downloaded automatically from S3 when you run the notebook.

Note: You may need a Google account to save your work in Google Drive.

This lab is part of the Pedigree Building & Optimization track.