Lab 27: Custom Prior Probability Models

Core Component: This lab explores Bonsai v3's prior probability framework, which allows for incorporating demographic information, historical records, and domain-specific knowledge to enhance the accuracy of relationship predictions. Understanding how to develop and integrate custom prior models is essential for adapting Bonsai to specific research contexts.

The Power of Prior Probabilities

Beyond Genetic Data Alone

Genetic data provides strong evidence for relationship inference, but prior knowledge about how likely each relationship is in a given context can further improve prediction accuracy. Bonsai v3's prior probability framework supports this integration.

The Role of Priors in Bayesian Inference

Bonsai's relationship inference follows Bayesian principles, where the posterior probability of a relationship depends on both the likelihood of the observed genetic data and the prior probability of the relationship:

P(Relationship | Genetic Data) ∝ P(Genetic Data | Relationship) × P(Relationship)

The prior probability term, P(Relationship), represents our belief about how likely a relationship is before considering the genetic evidence. Constructing these priors carefully from contextual knowledge can improve relationship inference accuracy, especially where genetic evidence alone is ambiguous.
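As a worked illustration of the proportionality above, the following minimal sketch multiplies likelihoods by priors and normalizes the result. The helper name `posterior`, the relationship labels, and all numbers are assumptions made up for illustration; they are not Bonsai outputs or part of its API.

```python
def posterior(likelihoods, priors):
    """Combine per-relationship likelihoods and priors into a normalized posterior."""
    unnormalized = {rel: likelihoods[rel] * priors[rel] for rel in likelihoods}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("every relationship has zero prior or zero likelihood")
    return {rel: value / total for rel, value in unnormalized.items()}


# Illustrative numbers only: the genetic data barely separates the two hypotheses.
likelihoods = {"half-siblings": 0.40, "grandparent-grandchild": 0.38}
# Hypothetical context: a large age gap makes the sibling hypothesis unlikely.
priors = {"half-siblings": 0.05, "grandparent-grandchild": 0.95}

result = posterior(likelihoods, priors)
best = max(result, key=result.get)
print(best, round(result[best], 3))
```

With these inputs the likelihoods alone slightly favor half-siblings, while the posterior favors the grandparent hypothesis, which is the kind of case where a prior changes the call.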

When Priors Matter Most

Prior probabilities have the greatest impact in scenarios where:

Genetic Evidence is Ambiguous: When multiple relationships have similar likelihoods
Data is Limited: When genetic data is sparse or uncertain
Domain Knowledge is Strong: When you have reliable contextual information
Specific Relationships are Particularly Likely/Unlikely: When certain relationships have strongly skewed probabilities in your context
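The first scenario in the list above can be made concrete with a small sketch that measures how far a prior moves the posterior under ambiguous versus decisive genetic evidence. The function `prior_shift` and the numbers are hypothetical; total variation distance is used here only as one convenient way to quantify the change.

```python
def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


def prior_shift(likelihoods, priors):
    """Total variation distance between the posterior under a uniform prior
    and the posterior under the supplied prior (0 means the prior had no effect)."""
    flat = normalize(likelihoods)
    informed = normalize({rel: likelihoods[rel] * priors[rel] for rel in likelihoods})
    return 0.5 * sum(abs(flat[rel] - informed[rel]) for rel in likelihoods)


priors = {"half-siblings": 0.2, "avuncular": 0.8}
ambiguous = {"half-siblings": 0.50, "avuncular": 0.50}
decisive = {"half-siblings": 0.999, "avuncular": 0.001}

shift_ambiguous = prior_shift(ambiguous, priors)
shift_decisive = prior_shift(decisive, priors)
print(shift_ambiguous, shift_decisive)
```

The same prior shifts the ambiguous case substantially and the decisive case hardly at all, which matches the intuition that priors matter most when the likelihoods are close.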

The prior.py Module in Bonsai v3

Framework for Custom Prior Models

Bonsai v3 includes a dedicated prior.py module that provides a framework for defining, evaluating, and combining prior probability models:

# Simplified representation of the prior.py module structure
class PriorModel:
    """
    Base class for relationship prior probability models.
    
    This class defines the interface for prior models and
    provides common functionality for prior probability
    calculation.
    """
    
    def get_prior_probability(self, id1, id2, relationship):
        """
        Calculate the prior probability of a specific relationship
        between two individuals.
        
        Args:
            id1: ID of the first individual
            id2: ID of the second individual
            relationship: Relationship tuple or identifier
            
        Returns:
            Prior probability (0-1) of the relationship
        """
        # Implementation in derived classes
        raise NotImplementedError
    
    def get_relationship_priors(self, id1, id2, relationships=None):
        """
        Calculate prior probabilities for multiple possible
        relationships between two individuals.
        
        Args:
            id1: ID of the first individual
            id2: ID of the second individual
            relationships: Optional list of relationships to consider
            
        Returns:
            Dictionary mapping relationships to prior probabilities
        """
        # Default implementation calls get_prior_probability
        # for each relationship and normalizes results
        # ...

class CompositePriorModel(PriorModel):
    """
    A prior model that combines multiple component models.
    
    This class allows for integrating multiple sources of
    prior information through weighted combination.
    """
    
    def __init__(self, component_models, weights=None):
        """
        Initialize with component models and optional weights.
        
        Args:
            component_models: List of PriorModel instances
            weights: Optional list of weights for each model
        """
        # ...
    
    def get_prior_probability(self, id1, id2, relationship):
        """
        Calculate combined prior probability from component models.
        
        Args:
            id1: ID of the first individual
            id2: ID of the second individual
            relationship: Relationship tuple or identifier
            
        Returns:
            Weighted combination of prior probabilities
        """
        # Implementation combines results from component models
        # ...

This flexible framework allows for creating custom prior models tailored to specific research contexts and combining multiple sources of information to form comprehensive prior probability distributions.
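The elided method bodies in the listing above could be filled in many ways. The sketch below shows one plausible version, assuming that per-relationship scores are normalized to sum to 1 and that component models are combined by a weighted average. `TablePriorModel` is a hypothetical helper introduced only for the demonstration, and none of this is confirmed to match the actual prior.py implementation.

```python
class PriorModel:
    """Minimal stand-in for the base class interface."""

    def get_prior_probability(self, id1, id2, relationship):
        raise NotImplementedError

    def get_relationship_priors(self, id1, id2, relationships):
        """Score each relationship, then normalize the scores to sum to 1."""
        scores = {rel: self.get_prior_probability(id1, id2, rel) for rel in relationships}
        total = sum(scores.values())
        if total == 0:
            # No relationship is supported: fall back to a uniform distribution.
            return {rel: 1.0 / len(scores) for rel in scores}
        return {rel: score / total for rel, score in scores.items()}


class TablePriorModel(PriorModel):
    """Hypothetical helper: looks up a fixed score per relationship."""

    def __init__(self, table):
        self.table = table

    def get_prior_probability(self, id1, id2, relationship):
        return self.table.get(relationship, 0.0)


class CompositePriorModel(PriorModel):
    """Weighted average of component model scores (one possible combination rule)."""

    def __init__(self, component_models, weights=None):
        self.component_models = list(component_models)
        if weights is None:
            weights = [1.0] * len(self.component_models)
        if len(weights) != len(self.component_models):
            raise ValueError("need exactly one weight per component model")
        total = sum(weights)
        self.weights = [weight / total for weight in weights]

    def get_prior_probability(self, id1, id2, relationship):
        return sum(
            weight * model.get_prior_probability(id1, id2, relationship)
            for weight, model in zip(self.weights, self.component_models)
        )


age_like = TablePriorModel({"parent-child": 1.0, "full-siblings": 0.01})
location_like = TablePriorModel({"parent-child": 0.5, "full-siblings": 0.5})
combined = CompositePriorModel([age_like, location_like], weights=[3, 1])
priors = combined.get_relationship_priors("A", "B", ["parent-child", "full-siblings"])
print(priors)
```

A weighted average lets a confident component dominate without allowing any single component to veto a relationship outright, which is a design choice rather than a requirement of the framework.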

Types of Prior Models

Common Prior Probability Sources

Bonsai v3 supports several types of prior models, each drawing on different sources of information:

1. Demographic Prior Models

These models use demographic characteristics to inform relationship probabilities:

Age-Based Priors: Using age differences to constrain plausible relationships
Geographic Priors: Incorporating spatial proximity or migration patterns
Ethnicity Priors: Considering population background and admixture patterns

Example: Age-Based Prior Model

class AgePriorModel(PriorModel):
    """
    A prior model based on age differences between individuals.
    
    This model assigns prior probabilities to relationships based
    on the compatibility of age differences with relationship types.
    """
    
    def __init__(self, age_dict):
        """
        Initialize with age information.
        
        Args:
            age_dict: Dictionary mapping individual IDs to ages
        """
        self.age_dict = age_dict
    
    def get_prior_probability(self, id1, id2, relationship):
        """
        Calculate prior probability based on age difference.
        
        Args:
            id1: ID of the first individual
            id2: ID of the second individual
            relationship: Relationship tuple
            
        Returns:
            Prior probability based on age compatibility
        """
        # Get ages or return uniform prior if ages unavailable
        age1 = self.age_dict.get(id1)
        age2 = self.age_dict.get(id2)
        if age1 is None or age2 is None:
            return 1.0  # Uniform prior when ages unknown
        
        age_diff = age1 - age2  # Positive if id1 is older
        
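        # Get relationship type (get_relationship_type is a helper assumed to be
        # defined elsewhere; it maps a relationship tuple to a label)
        rel_type = get_relationship_type(relationship)
        
        # Apply age-appropriate priors for each relationship type
        if rel_type == "parent-child":
            # Parents typically 15-50 years older than children
            # (this branch assumes id1 is the parent, so age_diff is positive)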
            if 15 <= age_diff <= 50:
                return 1.0
            elif 10 <= age_diff < 15 or 50 < age_diff <= 60:
                return 0.5  # Possible but less common
            else:
                return 0.01  # Highly unlikely
        
        elif rel_type == "full-siblings":
            # Siblings typically 0-15 years apart
            if abs(age_diff) <= 15:
                return 1.0
            elif abs(age_diff) <= 25:
                return 0.5  # Possible but less common
            else:
                return 0.01  # Highly unlikely
        
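        # More relationship types...
        
        # Fall back to the neutral score for relationship types not handled
        # above, so the method never returns None
        return 1.0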

2. Historical Record Priors

These models incorporate information from documentary sources:

Family Tree Priors: Using existing genealogical records
Census Data Priors: Incorporating household composition information
Vital Records Priors: Using birth, marriage, and death records
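A documentary prior of the kind listed above might be sketched as follows. The class name `DocumentedRelationshipPrior`, the `reliability` parameter, and the rule of spreading the remaining probability evenly are all assumptions made for illustration, not part of Bonsai.

```python
class DocumentedRelationshipPrior:
    """Hypothetical prior built from documentary records.

    A recorded relationship receives probability `reliability`; the remainder
    is spread evenly over the other candidate relationships. Pairs without a
    record receive a uniform prior.
    """

    def __init__(self, documented, reliability=0.9):
        if not 0.0 < reliability < 1.0:
            raise ValueError("reliability must be strictly between 0 and 1")
        # Store pairs as frozensets so lookup ignores the order of the two IDs.
        # Simplification: this drops the direction of asymmetric relationships.
        self.documented = {frozenset(pair): rel for pair, rel in documented.items()}
        self.reliability = reliability

    def get_relationship_priors(self, id1, id2, relationships):
        recorded = self.documented.get(frozenset((id1, id2)))
        count = len(relationships)
        if recorded is None or recorded not in relationships:
            return {rel: 1.0 / count for rel in relationships}
        if count == 1:
            return {recorded: 1.0}
        leftover = (1.0 - self.reliability) / (count - 1)
        return {rel: self.reliability if rel == recorded else leftover
                for rel in relationships}


candidates = ["parent-child", "full-siblings", "avuncular"]
records = {("anna", "bert"): "parent-child"}  # illustrative record only
model = DocumentedRelationshipPrior(records, reliability=0.9)

documented_pair = model.get_relationship_priors("bert", "anna", candidates)
unknown_pair = model.get_relationship_priors("anna", "carl", candidates)
print(documented_pair, unknown_pair)
```

Keeping `reliability` below 1 leaves room for the genetic evidence to overturn an incorrect record.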

3. Population Structure Priors

These models account for population-level patterns:

Endogamy Priors: Adjusting for elevated background relatedness
Population Size Priors: Considering the size of relevant populations
Migration Pattern Priors: Incorporating historical population movements

Uniform vs. Informative Priors

Prior models in Bonsai can range from uniform (all relationships equally likely a priori) to highly informative (strong preference for specific relationships). The choice depends on:

Data Quality: How reliable is your contextual information?
Research Goals: Are you testing hypotheses or exploring possibilities?
Context: How strong are the population-level patterns in your research context?

Bonsai generally favors moderately informative priors that guide inference without overwhelming genetic evidence.
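One rough way to place a prior on the spectrum from uniform to highly informative is its Shannon entropy: a uniform prior has the maximum entropy for its number of categories, and a prior concentrated on one relationship has entropy near zero. The sketch below uses made-up distributions, and entropy is offered only as an illustrative diagnostic, not as something Bonsai is known to compute.

```python
import math


def entropy_bits(distribution):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


uniform = {"parent-child": 0.25, "full-siblings": 0.25,
           "half-siblings": 0.25, "first-cousins": 0.25}
informative = {"parent-child": 0.85, "full-siblings": 0.05,
               "half-siblings": 0.05, "first-cousins": 0.05}

uniform_bits = entropy_bits(uniform)
informative_bits = entropy_bits(informative)
print(uniform_bits, informative_bits)
```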

Building Custom Prior Models

Developing Tailored Prior Distributions

Creating effective custom prior models involves several key steps:

1. Identifying Relevant Information Sources

Begin by identifying sources of information that could inform relationship probabilities:

Demographic Data: Ages, locations, ethnicity information
Historical Records: Census data, parish records, family bibles
Population Studies: Endogamy rates, migration patterns, cultural practices
Expert Knowledge: Insights from genealogists, historians, or cultural experts

2. Quantifying Prior Beliefs

Convert qualitative knowledge into quantitative probability distributions:

Direct Specification: Explicitly setting probabilities based on expert judgment
Statistical Modeling: Using historical data to estimate relationship frequencies
Constraint-Based Methods: Using logical constraints to bound probabilities
Parameterized Functions: Creating mathematical models of relationship likelihood
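The statistical modeling approach above can be sketched as counting relationship types in a set of known pedigrees and smoothing the counts so that unobserved relationships keep a small nonzero prior. The function name, the pseudocount value, and the counts below are hypothetical.

```python
from collections import Counter


def estimate_relationship_frequencies(observed, relationships, pseudocount=1.0):
    """Estimate prior probabilities from observed relationship labels.

    Additive (Laplace) smoothing gives every candidate relationship a nonzero
    prior, even if it never appears in the reference data.
    """
    counts = Counter(rel for rel in observed if rel in relationships)
    total = sum(counts.values()) + pseudocount * len(relationships)
    if total == 0:
        raise ValueError("no observations and no pseudocount to fall back on")
    return {rel: (counts[rel] + pseudocount) / total for rel in relationships}


# Illustrative counts only, not taken from any real dataset.
observed = ["first-cousins"] * 6 + ["second-cousins"] * 3
candidates = ["first-cousins", "second-cousins", "half-siblings"]
frequencies = estimate_relationship_frequencies(observed, candidates)
print(frequencies)
```

Without the pseudocount, a relationship absent from the reference data would get a prior of zero, and no amount of genetic evidence could then recover it.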

Example: Creating a Location-Based Prior Model

import math


class LocationPriorModel(PriorModel):
    """
    Prior model based on geographic proximity.
    
    This model assigns higher prior probabilities to
    relationships between individuals who lived near
    each other.
    """
    
    def __init__(self, location_dict, distance_matrix):
        """
        Initialize with location data.
        
        Args:
            location_dict: Dictionary mapping IDs to location identifiers
            distance_matrix: Matrix of distances between locations
        """
        self.location_dict = location_dict
        self.distance_matrix = distance_matrix
        
        # Parameters derived from historical data
        self.distance_decay_rates = {
            "parent-child": 0.01,      # Slow decay - family members tend to live close
            "siblings": 0.01,          # Slow decay - siblings often live near each other
            "cousins": 0.05,           # Moderate decay - cousins somewhat dispersed
            "distant": 0.1,            # Rapid decay - distant relatives often separated
            "unrelated": 0.001         # Very slow decay - unrelated people everywhere
        }
    
    def get_prior_probability(self, id1, id2, relationship):
        """
        Calculate prior probability based on geographic proximity.
        
        Args:
            id1: ID of the first individual
            id2: ID of the second individual
            relationship: Relationship tuple
            
        Returns:
            Prior probability based on location compatibility
        """
        # Get locations or return uniform prior if unavailable
        loc1 = self.location_dict.get(id1)
        loc2 = self.location_dict.get(id2)
        if loc1 is None or loc2 is None:
            return 1.0  # Uniform prior when locations unknown
        
        # Calculate distance
        distance = self.distance_matrix[loc1][loc2]
        
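        # Get relationship category for distance decay parameter
        # (get_relationship_category is not shown here; it is assumed to be
        # defined on this class and to return a key of distance_decay_rates)
        rel_category = self.get_relationship_category(relationship)
        decay_rate = self.distance_decay_rates[rel_category]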
        
        # Apply exponential distance decay model
        prior_prob = math.exp(-decay_rate * distance)
        
        return prior_prob
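To interpret the decay rates in the exponential distance decay model above, it can help to convert each rate into the distance at which the score falls to one half, which is ln(2) divided by the rate. The sketch below reuses the example rates from the listing; the helper name `half_distance` is an assumption, and the distance unit is whatever unit the distance matrix uses.

```python
import math


def half_distance(decay_rate):
    """Distance at which exp(-decay_rate * distance) equals 0.5."""
    if decay_rate <= 0:
        raise ValueError("decay_rate must be positive")
    return math.log(2) / decay_rate


decay_rates = {
    "parent-child": 0.01,
    "siblings": 0.01,
    "cousins": 0.05,
    "distant": 0.1,
    "unrelated": 0.001,
}

half_distances = {category: half_distance(rate) for category, rate in decay_rates.items()}
for category, distance in half_distances.items():
    print(f"{category}: score halves after {distance:.1f} distance units")
```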

3. Validating Prior Models

It's essential to validate prior models to ensure they reflect reality:

Historical Validation: Testing against known historical relationships
Expert Review: Having domain experts evaluate prior distributions
Sensitivity Analysis: Assessing how prior variations affect inference
Cross-Validation: Testing predictive performance on reserved data
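A sensitivity analysis like the one listed above can be sketched as a sweep over one prior parameter while recording the top-ranked relationship at each value. The helper names and the numbers are hypothetical; the point is only to show the shape of such a check.

```python
def posterior(likelihoods, priors):
    """Normalized product of likelihood and prior for each relationship."""
    unnormalized = {rel: likelihoods[rel] * priors[rel] for rel in likelihoods}
    total = sum(unnormalized.values())
    return {rel: value / total for rel, value in unnormalized.items()}


def sensitivity_sweep(likelihoods, make_priors, parameter_values):
    """Map each parameter value to the relationship with the highest posterior."""
    calls = {}
    for value in parameter_values:
        result = posterior(likelihoods, make_priors(value))
        calls[value] = max(result, key=result.get)
    return calls


def make_priors(p_half_siblings):
    """Two-hypothesis prior controlled by a single parameter."""
    return {"half-siblings": p_half_siblings, "avuncular": 1.0 - p_half_siblings}


likelihoods = {"half-siblings": 0.6, "avuncular": 0.4}  # illustrative values
calls = sensitivity_sweep(likelihoods, make_priors, [0.1, 0.3, 0.5, 0.7])
is_stable = len(set(calls.values())) == 1
print(calls, is_stable)
```

If the top call changes within the range of prior values that you consider defensible, the conclusion depends on the prior and should be reported with that caveat.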

4. Combining Multiple Prior Sources

Often, you'll want to integrate multiple sources of prior information:

Weighted Combination: Combining models with importance weights
Sequential Updating: Using one prior's output as another's input
Constraint Satisfaction: Finding distributions that satisfy all constraints
Hierarchical Models: Structuring priors in levels of specificity
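As one alternative to averaging, sources that are assumed to be roughly independent can be combined by a weighted geometric mean (log-linear pooling), which is equivalent to multiplying the priors when all weights are 1. The function below is a sketch under that independence assumption, with made-up distributions; it is not taken from Bonsai.

```python
import math


def log_linear_pool(distributions, weights):
    """Weighted geometric mean of several prior distributions, renormalized."""
    if len(distributions) != len(weights):
        raise ValueError("need exactly one weight per distribution")
    pooled = {}
    for rel in distributions[0]:
        log_score = 0.0
        for dist, weight in zip(distributions, weights):
            if weight == 0:
                continue  # a zero-weight source is ignored entirely
            if dist[rel] == 0:
                log_score = float("-inf")  # any weighted source can veto
                break
            log_score += weight * math.log(dist[rel])
        pooled[rel] = math.exp(log_score)
    total = sum(pooled.values())
    if total == 0:
        raise ValueError("the sources jointly rule out every relationship")
    return {rel: value / total for rel, value in pooled.items()}


age_prior = {"parent-child": 0.8, "full-siblings": 0.2}
location_prior = {"parent-child": 0.5, "full-siblings": 0.5}
record_prior = {"parent-child": 0.0, "full-siblings": 1.0}

pooled = log_linear_pool([age_prior, location_prior], [1.0, 1.0])
vetoed = log_linear_pool([age_prior, record_prior], [1.0, 1.0])
print(pooled, vetoed)
```

Unlike a weighted average, this rule lets a single source assign zero probability and thereby eliminate a relationship, so it suits hard constraints better than soft, error-prone records.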

Integrating Prior Models with Genetic Evidence

Combining Priors and Likelihoods

Once prior models are defined, they must be combined with the genetic evidence to produce the final relationship inference.

The Bayesian Framework

Bonsai v3 follows Bayesian principles for integrating priors with genetic evidence:

# Simplified pseudocode for Bayesian integration
def infer_relationship_with_priors(id1, id2, genetic_data, prior_model):
    """
    Infer relationship using both genetic data and prior model.
    
    Args:
        id1: ID of the first individual
        id2: ID of the second individual
        genetic_data: Genetic comparison data between individuals
        prior_model: Prior probability model
        
    Returns:
        Dictionary mapping relationships to posterior probabilities
    """
    # Define possible relationships to consider
    relationships = get_plausible_relationships()
    
    # Calculate likelihood for each relationship
    likelihoods = {}
    for rel in relationships:
        likelihoods[rel] = calculate_likelihood(genetic_data, rel)
    
    # Get prior probabilities
    priors = prior_model.get_relationship_priors(id1, id2, relationships)
    
    # Calculate unnormalized posterior (prior × likelihood)
    unnormalized_posterior = {}
    for rel in relationships:
        unnormalized_posterior[rel] = priors[rel] * likelihoods[rel]
    
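    # Normalize to get proper probability distribution
    total = sum(unnormalized_posterior.values())
    if total == 0:
        # Every relationship has zero prior or zero likelihood
        raise ValueError("no relationship has nonzero prior and likelihood")
    posterior = {rel: prob / total for rel, prob in unnormalized_posterior.items()}
    
    return posterior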
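Likelihoods over many genetic segments are often extremely small, so multiplying them directly can underflow to zero. The sketch below performs the same prior-times-likelihood update in log space, subtracting the largest log score before exponentiating. The function name and the log-likelihood values are assumptions for illustration.

```python
import math


def posterior_from_logs(log_likelihoods, priors):
    """Posterior from log-likelihoods and priors, computed stably in log space."""
    log_scores = {}
    for rel, log_likelihood in log_likelihoods.items():
        prior = priors[rel]
        # A zero prior rules the relationship out regardless of the data.
        log_scores[rel] = log_likelihood + math.log(prior) if prior > 0 else float("-inf")
    peak = max(log_scores.values())
    if peak == float("-inf"):
        raise ValueError("every relationship has zero prior")
    # Subtracting the peak keeps the largest exponent at exp(0) = 1.
    weights = {rel: math.exp(score - peak) for rel, score in log_scores.items()}
    total = sum(weights.values())
    return {rel: weight / total for rel, weight in weights.items()}


# Illustrative values: exp(-1050) underflows to 0.0 in double precision.
log_likelihoods = {"half-siblings": -1050.0, "avuncular": -1052.0, "first-cousins": -1060.0}
priors = {"half-siblings": 0.2, "avuncular": 0.7, "first-cousins": 0.1}

result = posterior_from_logs(log_likelihoods, priors)
best = max(result, key=result.get)
print(best, result)
```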

Integration Challenges

Several challenges arise when integrating priors with genetic evidence:

Prior Strength Calibration: Determining how much weight to give priors relative to genetic evidence
Prior Uncertainty: Accounting for uncertainty in the prior models themselves
Relationship Alignment: Ensuring prior and likelihood models use compatible relationship definitions
Computational Efficiency: Maintaining performance with complex prior calculations

Prior Strength Calibration

class CalibratedPriorModel(PriorModel):
    """
    A prior model with adjustable strength.
    
    This wrapper allows controlling how strongly the
    prior influences the posterior relative to the
    likelihood.
    """
    
    def __init__(self, base_prior_model, strength=1.0):
        """
        Initialize with base model and strength parameter.
        
        Args:
            base_prior_model: The underlying prior model
            strength: How strongly to weight the prior (0=uniform, 1=full strength)
        """
        if not 0.0 <= strength <= 1.0:
            raise ValueError("strength must be between 0 and 1")
        self.base_model = base_prior_model
        self.strength = strength
    
    def get_prior_probability(self, id1, id2, relationship):
        """
        Calculate prior with adjusted strength.
        
        Args:
            id1: ID of the first individual
            id2: ID of the second individual
            relationship: Relationship tuple
            
        Returns:
            Adjusted prior probability
        """
        # Get base prior
        base_prior = self.base_model.get_prior_probability(id1, id2, relationship)
        
        # Adjust strength (interpolate between uniform and full prior)
        if self.strength == 1.0:
            return base_prior
        elif self.strength == 0.0:
            return 1.0  # Uniform prior
        else:
            # Interpolate between uniform (1.0) and base prior
            return (1.0 - self.strength) + self.strength * base_prior
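The listing above adjusts strength by linear interpolation. A different technique, shown here only as an alternative sketch and not as Bonsai's method, is power tempering: each prior probability is raised to the power `strength` and the result is renormalized. The function name `temper` and the example distribution are hypothetical.

```python
def temper(priors, strength):
    """Raise each prior to the power `strength` and renormalize.

    strength = 0 yields a uniform distribution, strength = 1 leaves the prior
    unchanged, and intermediate values shrink the ratios between relationships.
    """
    if strength < 0:
        raise ValueError("strength must be non-negative")
    powered = {rel: p ** strength for rel, p in priors.items()}
    total = sum(powered.values())
    return {rel: value / total for rel, value in powered.items()}


base = {"parent-child": 0.8, "full-siblings": 0.2}
flattened = temper(base, 0.0)
halfway = temper(base, 0.5)
unchanged = temper(base, 1.0)
print(flattened, halfway, unchanged)
```

With power tempering, the prior ratio between two relationships becomes the original ratio raised to `strength`, so a strength of 0.5 turns a 4:1 preference into 2:1.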

When Priors and Genetic Evidence Conflict

When prior models strongly contradict genetic evidence, several approaches are possible:

Flag for Review: Identify cases where priors and evidence disagree for human review
Evidence Threshold: Override priors when genetic evidence is particularly strong
Alternative Hypothesis Exploration: Present multiple possible interpretations
Seek Additional Evidence: Gather more data to resolve the contradiction
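The first two approaches above can be combined into a simple check: flag a pair when the genetic evidence strongly favors a relationship that the prior treats as very unlikely. The function name and both thresholds below are hypothetical and would need to be chosen for the dataset at hand.

```python
def find_conflict(likelihoods, priors, prior_floor=0.05, bayes_factor_min=100.0):
    """Report whether strong genetic evidence contradicts the prior.

    The Bayes factor here is the ratio of the best likelihood to the
    second-best likelihood, so at least two relationships are required.
    """
    if len(likelihoods) < 2:
        raise ValueError("need at least two candidate relationships")
    ranked = sorted(likelihoods, key=likelihoods.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if likelihoods[runner_up] == 0:
        bayes_factor = float("inf")
    else:
        bayes_factor = likelihoods[best] / likelihoods[runner_up]
    strong_evidence = bayes_factor >= bayes_factor_min
    prior_disfavors = priors[best] <= prior_floor
    return {
        "genetic_best": best,
        "bayes_factor": bayes_factor,
        "flag_for_review": strong_evidence and prior_disfavors,
    }


likelihoods = {"full-siblings": 0.9, "half-siblings": 0.001}  # illustrative
skeptical_prior = {"full-siblings": 0.01, "half-siblings": 0.99}
neutral_prior = {"full-siblings": 0.5, "half-siblings": 0.5}

conflict = find_conflict(likelihoods, skeptical_prior)
no_conflict = find_conflict(likelihoods, neutral_prior)
print(conflict, no_conflict)
```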

Case Studies in Prior Model Development

Learning from Real-World Examples

Three case studies illustrate how prior models can be developed and applied in different contexts.

Case Study 1: Endogamous Population

For a research project involving an endogamous historical population:

Challenge: Standard relationship priors failed to account for elevated background relatedness
Approach:
- Analyzed 50 known pedigrees from the population
- Measured typical relationship frequencies within the community
- Created a custom prior model with adjusted relationship probabilities
Result: 35% improvement in relationship prediction accuracy compared to uniform priors

Case Study 2: Historical Migrations

For a project tracking family connections across a historical migration:

Challenge: Determining likely family connections between origin and destination regions
Approach:
- Integrated historical migration records into a geographic prior model
- Created time-dependent spatial probability distributions
- Incorporated known migration patterns from historical records
Result: Successfully identified multiple previously unknown family connections across regions

Case Study 3: Genealogical Records Integration

For a project integrating DNA evidence with existing family trees:

Challenge: Determining how to weight sometimes-incorrect documentary evidence against genetic data
Approach:
- Created a prior model based on documentary records
- Calibrated prior strength based on record reliability metrics
- Implemented a conflict detection system to flag major discrepancies
Result: Identified several documentary errors while confirming most recorded relationships

Ethical Considerations in Prior Model Development

Ensuring Responsible Prior Specification

Developing prior models raises ethical considerations that should be addressed before the models are used.

Potential Ethical Issues

Bias Amplification: Priors based on biased historical data may perpetuate those biases
Cultural Assumptions: Prior models may incorporate culturally specific assumptions
Privacy Implications: Some prior information may have privacy implications
Confirmation Bias: Priors may be inadvertently selected to confirm expected relationships

Best Practices for Ethical Prior Development

Transparency: Document all assumptions and data sources used in prior development
Validation: Test priors against diverse datasets to ensure they don't disadvantage specific groups
Sensitivity Analysis: Examine how variations in prior assumptions affect conclusions
Cultural Competence: Consult with cultural experts when developing priors for specific populations
Privacy Protection: Ensure prior models don't inadvertently reveal sensitive information

Documenting Prior Assumptions

When developing custom prior models, it's important to thoroughly document:

The data sources used to develop the prior
The assumptions made during prior specification
The strength of the prior relative to genetic evidence
Any known limitations or potential biases in the prior
Validation methods and results

This documentation enables critical evaluation of results and transparent scientific practice.
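One lightweight way to keep this documentation next to the model is a small structured record that can be serialized with the analysis results. The class `PriorModelRecord` and its field names are hypothetical, chosen to mirror the items listed above, and the example values are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class PriorModelRecord:
    """Documentation record for a custom prior model."""
    name: str
    data_sources: List[str]
    assumptions: List[str]
    strength: float  # weight of the prior relative to the genetic evidence
    limitations: List[str] = field(default_factory=list)
    validation: List[str] = field(default_factory=list)


# Placeholder values for illustration only.
record = PriorModelRecord(
    name="age-based prior",
    data_sources=["participant-reported birth years"],
    assumptions=["parents are 15-50 years older than their children"],
    strength=0.5,
    limitations=["reported ages may be inaccurate"],
)

serialized = json.dumps(asdict(record), indent=2)
print(serialized)
```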

Conclusion and Next Steps

Custom prior probability models provide a powerful mechanism for integrating domain knowledge, historical records, and demographic information into genetic genealogy analysis. By developing and applying appropriate prior models within Bonsai v3's flexible framework, researchers can significantly enhance the accuracy and contextual relevance of relationship predictions.

The key to effective prior model development lies in carefully balancing informative contextual knowledge with appropriate caution about the strength of prior assumptions. When done well, prior models complement genetic evidence to create more robust and accurate relationship inferences.

In the next lab, we'll explore how Bonsai v3 integrates with other genealogical tools through the DRUID algorithm and other integration mechanisms, enabling comprehensive genetic genealogy workflows.

This lab is part of the Visualization & Advanced Applications track:

- Lab 21: Rendering
- Lab 22: Interpreting
- Lab 23: Twins
- Lab 24: Complex
- Lab 25: Real-World
- Lab 26: Performance
- Lab 27: Prior Models
- Lab 28: Integration
- Lab 29: End-to-End
- Lab 30: Advanced