Lab 27: Custom Prior Probability Models
Core Component: This lab explores Bonsai v3's prior probability framework, which allows for incorporating demographic information, historical records, and domain-specific knowledge to enhance the accuracy of relationship predictions. Understanding how to develop and integrate custom prior models is essential for adapting Bonsai to specific research contexts.
The Power of Prior Probabilities
Beyond Genetic Data Alone
While genetic data provides powerful evidence for relationship inference, incorporating prior knowledge about the relative likelihood of different relationships can significantly enhance prediction accuracy. Bonsai v3's prior probability framework enables this integration:
The Role of Priors in Bayesian Inference
Bonsai's relationship inference follows Bayesian principles, where the posterior probability of a relationship depends on both the likelihood of the observed genetic data and the prior probability of the relationship:
P(Relationship | Genetic Data) ∝ P(Genetic Data | Relationship) × P(Relationship)
The prior probability term, P(Relationship), represents our belief about the relationship probability before considering the genetic evidence. By carefully constructing these priors based on contextual knowledge, we can improve relationship inference accuracy, especially in cases where genetic evidence alone is ambiguous.
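To make the proportionality concrete, here is a minimal sketch that multiplies per-relationship likelihoods by priors and renormalizes. The function name `posterior` and all numbers are illustrative assumptions, not part of Bonsai v3's API. The three candidate relationships were chosen because they are all second-degree, so shared DNA alone separates them poorly.

```python
def posterior(likelihoods, priors):
    """Combine likelihoods and priors into a normalized posterior.

    Both arguments map a relationship label to a non-negative number.
    """
    unnormalized = {rel: likelihoods[rel] * priors[rel] for rel in likelihoods}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("every relationship has zero posterior mass")
    return {rel: value / total for rel, value in unnormalized.items()}


# Illustrative values only: the likelihoods are nearly tied, while the
# prior (for example, from a small age gap) disfavors a grandparent link.
likelihoods = {
    "half-siblings": 0.40,
    "grandparent-grandchild": 0.38,
    "avuncular": 0.22,
}
priors = {
    "half-siblings": 0.6,
    "grandparent-grandchild": 0.1,
    "avuncular": 0.3,
}

post = posterior(likelihoods, priors)
```

With these inputs the half-sibling hypothesis ends up with roughly 0.70 of the posterior mass, even though its likelihood is only slightly higher than the grandparent hypothesis.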
When Priors Matter Most
Prior probabilities have the greatest impact in scenarios where:
- Genetic Evidence is Ambiguous: When multiple relationships have similar likelihoods
- Data is Limited: When genetic data is sparse or uncertain
- Domain Knowledge is Strong: When you have reliable contextual information
- Specific Relationships are Particularly Likely/Unlikely: When certain relationships have strongly skewed probabilities in your context
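The first condition can be illustrated with the odds form of Bayes' rule: posterior odds equal the likelihood ratio times the prior odds. The sketch below uses hypothetical numbers and a hypothetical helper name. It shows the same prior reversing the ranking when the genetic evidence is nearly tied, and having no practical effect when the evidence is decisive.

```python
def posterior_odds(lik_a, lik_b, prior_a, prior_b):
    """Posterior odds of relationship A over B.

    Computed as (likelihood ratio) * (prior odds). Values above 1 favor A.
    """
    return (lik_a / lik_b) * (prior_a / prior_b)


# Same prior in both cases: B is considered six times more likely than A.
prior_a, prior_b = 0.1, 0.6

# Ambiguous genetic evidence: likelihoods are almost equal,
# so the prior decides the outcome (odds fall below 1).
ambiguous = posterior_odds(0.40, 0.38, prior_a, prior_b)

# Decisive genetic evidence: A is a million times more likely,
# so the prior barely matters (odds stay far above 1).
decisive = posterior_odds(1e-3, 1e-9, prior_a, prior_b)
```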
The prior.py Module in Bonsai v3
Framework for Custom Prior Models
Bonsai v3 includes a dedicated prior.py module that provides a framework for defining, evaluating, and combining prior probability models:
# Simplified representation of the prior.py module structure

class PriorModel:
    """
    Base class for relationship prior probability models.

    This class defines the interface for prior models and provides
    common functionality for prior probability calculation.
    """

    def get_prior_probability(self, id1, id2, relationship):
        """
        Calculate the prior probability of a specific relationship
        between two individuals.

        Args:
            id1: ID of the first individual
            id2: ID of the second individual
            relationship: Relationship tuple or identifier

        Returns:
            Prior probability (0-1) of the relationship
        """
        # Implementation in derived classes
        raise NotImplementedError

    def get_relationship_priors(self, id1, id2, relationships=None):
        """
        Calculate prior probabilities for multiple possible relationships
        between two individuals.

        Args:
            id1: ID of the first individual
            id2: ID of the second individual
            relationships: Optional list of relationships to consider

        Returns:
            Dictionary mapping relationships to prior probabilities
        """
        # Default implementation calls get_prior_probability
        # for each relationship and normalizes results
        # ...


class CompositePriorModel(PriorModel):
    """
    A prior model that combines multiple component models.

    This class allows for integrating multiple sources of prior
    information through weighted combination.
    """

    def __init__(self, component_models, weights=None):
        """
        Initialize with component models and optional weights.

        Args:
            component_models: List of PriorModel instances
            weights: Optional list of weights for each model
        """
        # ...

    def get_prior_probability(self, id1, id2, relationship):
        """
        Calculate combined prior probability from component models.

        Args:
            id1: ID of the first individual
            id2: ID of the second individual
            relationship: Relationship tuple or identifier

        Returns:
            Weighted combination of prior probabilities
        """
        # Implementation combines results from component models
        # ...
This flexible framework allows for creating custom prior models tailored to specific research contexts and combining multiple sources of information to form comprehensive prior probability distributions.
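The listing above elides the method bodies. As a hedged illustration of how they might be filled in, the sketch below implements the normalizing default and a weighted arithmetic mean for the composite. It is not the actual prior.py code: the uniform fallback, the required `relationships` argument, and the `TablePriorModel` helper are assumptions made for this example.

```python
class PriorModel:
    """Sketch of the base interface described above (not the real module)."""

    def get_prior_probability(self, id1, id2, relationship):
        raise NotImplementedError

    def get_relationship_priors(self, id1, id2, relationships):
        """Evaluate each candidate relationship, then normalize to sum to 1."""
        raw = {rel: self.get_prior_probability(id1, id2, rel) for rel in relationships}
        if not raw:
            return {}
        total = sum(raw.values())
        if total <= 0:
            # No usable information: fall back to a uniform distribution
            return {rel: 1.0 / len(raw) for rel in raw}
        return {rel: value / total for rel, value in raw.items()}


class TablePriorModel(PriorModel):
    """Hypothetical component: a fixed weight per relationship type."""

    def __init__(self, table, default=1.0):
        self.table = table
        self.default = default

    def get_prior_probability(self, id1, id2, relationship):
        return self.table.get(relationship, self.default)


class CompositePriorModel(PriorModel):
    """Weighted arithmetic mean of the component models' priors."""

    def __init__(self, component_models, weights=None):
        if not component_models:
            raise ValueError("at least one component model is required")
        if weights is None:
            weights = [1.0] * len(component_models)
        if len(weights) != len(component_models):
            raise ValueError("need exactly one weight per component model")
        total = sum(weights)
        if total <= 0:
            raise ValueError("weights must sum to a positive value")
        self.component_models = list(component_models)
        self.weights = [w / total for w in weights]

    def get_prior_probability(self, id1, id2, relationship):
        return sum(
            w * m.get_prior_probability(id1, id2, relationship)
            for w, m in zip(self.weights, self.component_models)
        )


# Usage with made-up numbers: an age-like source weighted 3:1 over a records-like source
age_like = TablePriorModel({"parent-child": 0.01, "full-siblings": 1.0})
records_like = TablePriorModel({"parent-child": 0.2, "full-siblings": 0.8})
combined = CompositePriorModel([age_like, records_like], weights=[3, 1])
combined_priors = combined.get_relationship_priors(
    "A", "B", ["parent-child", "full-siblings"]
)
```

Normalizing the weights in the constructor keeps the combined value on the same scale as the components, whatever units the caller uses for the weights.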
Types of Prior Models
Common Prior Probability Sources
Bonsai v3 supports several types of prior models, each drawing on different sources of information:
1. Demographic Prior Models
These models use demographic characteristics to inform relationship probabilities:
- Age-Based Priors: Using age differences to constrain plausible relationships
- Geographic Priors: Incorporating spatial proximity or migration patterns
- Ethnicity Priors: Considering population background and admixture patterns
Example: Age-Based Prior Model
class AgePriorModel(PriorModel):
    """
    A prior model based on age differences between individuals.

    This model assigns prior probabilities to relationships based on
    the compatibility of age differences with relationship types.
    """

    def __init__(self, age_dict):
        """
        Initialize with age information.

        Args:
            age_dict: Dictionary mapping individual IDs to ages
        """
        self.age_dict = age_dict

    def get_prior_probability(self, id1, id2, relationship):
        """
        Calculate prior probability based on age difference.

        Args:
            id1: ID of the first individual
            id2: ID of the second individual
            relationship: Relationship tuple

        Returns:
            Prior probability based on age compatibility
        """
        # Get ages, or return a neutral weight if either age is unavailable
        age1 = self.age_dict.get(id1)
        age2 = self.age_dict.get(id2)
        if age1 is None or age2 is None:
            return 1.0  # Uniform prior when ages unknown

        age_diff = age1 - age2  # Positive if id1 is older

        # Get relationship type
        # (get_relationship_type is assumed to be defined elsewhere)
        rel_type = get_relationship_type(relationship)

        # Apply age-appropriate priors for each relationship type
        if rel_type == "parent-child":
            # Parents typically 15-50 years older than children.
            # This check assumes id1 is the parent.
            if 15 <= age_diff <= 50:
                return 1.0
            if 10 <= age_diff < 15 or 50 < age_diff <= 60:
                return 0.5  # Possible but less common
            return 0.01  # Highly unlikely

        if rel_type == "full-siblings":
            # Siblings typically 0-15 years apart
            if abs(age_diff) <= 15:
                return 1.0
            if abs(age_diff) <= 25:
                return 0.5  # Possible but less common
            return 0.01  # Highly unlikely

        # More relationship types...

        # Neutral weight for relationship types not covered above,
        # so the method never returns None
        return 1.0
2. Historical Record Priors
These models incorporate information from documentary sources:
- Family Tree Priors: Using existing genealogical records
- Census Data Priors: Incorporating household composition information
- Vital Records Priors: Using birth, marriage, and death records
3. Population Structure Priors
These models account for population-level patterns:
- Endogamy Priors: Adjusting for elevated background relatedness
- Population Size Priors: Considering the size of relevant populations
- Migration Pattern Priors: Incorporating historical population movements
Uniform vs. Informative Priors
Prior models in Bonsai can range from uniform (all relationships equally likely a priori) to highly informative (strong preference for specific relationships). The choice depends on:
- Data Quality: How reliable is your contextual information?
- Research Goals: Are you testing hypotheses or exploring possibilities?
- Context: How strong are the population-level patterns in your research context?
Bonsai generally favors moderately informative priors that guide inference without overwhelming genetic evidence.
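One way to put a number on where a prior sits between uniform and highly informative is its normalized Shannon entropy. This is a general-purpose measure, not something Bonsai is stated to compute, and `normalized_entropy` is a hypothetical helper. A value of 1 means uniform; a value near 0 means almost all mass sits on one relationship.

```python
import math


def normalized_entropy(prior):
    """Shannon entropy of a prior, scaled to the range [0, 1].

    `prior` maps relationship labels to non-negative weights; the
    weights are normalized internally, so they need not sum to 1.
    """
    total = sum(prior.values())
    if total <= 0:
        raise ValueError("prior must contain positive mass")
    if len(prior) < 2:
        return 0.0
    probs = [value / total for value in prior.values() if value > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    # Dividing by log(n) makes a uniform prior over n options score 1
    return entropy / math.log(len(prior))


uniform_prior = {"parent-child": 1, "full-siblings": 1, "half-siblings": 1, "unrelated": 1}
moderate_prior = {"parent-child": 0.1, "full-siblings": 0.4, "half-siblings": 0.3, "unrelated": 0.2}
strong_prior = {"parent-child": 0.01, "full-siblings": 0.97, "half-siblings": 0.01, "unrelated": 0.01}

scores = {
    "uniform": normalized_entropy(uniform_prior),
    "moderate": normalized_entropy(moderate_prior),
    "strong": normalized_entropy(strong_prior),
}
```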
Building Custom Prior Models
Developing Tailored Prior Distributions
Creating effective custom prior models involves several key steps:
1. Identifying Relevant Information Sources
Begin by identifying sources of information that could inform relationship probabilities:
- Demographic Data: Ages, locations, ethnicity information
- Historical Records: Census data, parish records, family bibles
- Population Studies: Endogamy rates, migration patterns, cultural practices
- Expert Knowledge: Insights from genealogists, historians, or cultural experts
2. Quantifying Prior Beliefs
Convert qualitative knowledge into quantitative probability distributions:
- Direct Specification: Explicitly setting probabilities based on expert judgment
- Statistical Modeling: Using historical data to estimate relationship frequencies
- Constraint-Based Methods: Using logical constraints to bound probabilities
- Parameterized Functions: Creating mathematical models of relationship likelihood
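As a sketch of the statistical modeling option, the function below turns counts of documented relationships into a prior, using additive (Laplace) smoothing so that relationship types absent from the records do not get a probability of exactly zero. The function name, the pseudocount default, and the sample data are assumptions for illustration.

```python
from collections import Counter


def estimate_relationship_prior(observed, relationships, pseudocount=1.0):
    """Estimate a prior from observed relationship labels.

    observed: iterable of relationship labels taken from records
    relationships: the candidate relationships the prior should cover
    pseudocount: value added to every count (additive smoothing)
    """
    counts = Counter(observed)
    total = sum(counts[rel] for rel in relationships) + pseudocount * len(relationships)
    if total == 0:
        raise ValueError("no observations and no pseudocount: prior is undefined")
    return {rel: (counts[rel] + pseudocount) / total for rel in relationships}


# Made-up sample: ten documented pairs from a hypothetical community
observed_pairs = ["first-cousins"] * 6 + ["second-cousins"] * 3 + ["unrelated"]
candidates = ["first-cousins", "second-cousins", "half-siblings", "unrelated"]

estimated_prior = estimate_relationship_prior(observed_pairs, candidates)
```

Here "half-siblings" never appears in the sample but still receives 1/14 of the mass, which leaves room for the genetic evidence to support it.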
Example: Creating a Location-Based Prior Model
import math


class LocationPriorModel(PriorModel):
    """
    Prior model based on geographic proximity.

    This model assigns higher prior probabilities to relationships
    between individuals who lived near each other.
    """

    def __init__(self, location_dict, distance_matrix):
        """
        Initialize with location data.

        Args:
            location_dict: Dictionary mapping IDs to location identifiers
            distance_matrix: Matrix of distances between locations
        """
        self.location_dict = location_dict
        self.distance_matrix = distance_matrix

        # Parameters derived from historical data.
        # A larger rate makes the prior fall off faster with distance,
        # so calibrate these values against your own records.
        self.distance_decay_rates = {
            "parent-child": 0.01,  # Slow decay
            "siblings": 0.01,      # Slow decay
            "cousins": 0.05,       # Moderate decay
            "distant": 0.1,        # Rapid decay
            "unrelated": 0.001,    # Very slow decay
        }

    def get_prior_probability(self, id1, id2, relationship):
        """
        Calculate prior probability based on geographic proximity.

        Args:
            id1: ID of the first individual
            id2: ID of the second individual
            relationship: Relationship tuple

        Returns:
            Prior probability based on location compatibility
        """
        # Get locations, or return a neutral weight if either is unavailable
        loc1 = self.location_dict.get(id1)
        loc2 = self.location_dict.get(id2)
        if loc1 is None or loc2 is None:
            return 1.0  # Uniform prior when locations unknown

        # Calculate distance
        distance = self.distance_matrix[loc1][loc2]

        # Get relationship category for distance decay parameter
        # (get_relationship_category is assumed to be implemented elsewhere
        # and to return one of the keys of distance_decay_rates)
        rel_category = self.get_relationship_category(relationship)
        decay_rate = self.distance_decay_rates[rel_category]

        # Apply exponential distance decay model
        return math.exp(-decay_rate * distance)
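To interpret the decay rates in the model above, it can help to convert each rate into the distance at which the exponential weight drops to one half, which is ln(2) divided by the rate. The helpers below are hypothetical, and the distance units are whatever the distance matrix uses.

```python
import math


def location_weight(decay_rate, distance):
    """Exponential distance decay: exp(-decay_rate * distance)."""
    return math.exp(-decay_rate * distance)


def half_distance(decay_rate):
    """Distance at which the weight falls to 0.5."""
    if decay_rate <= 0:
        raise ValueError("decay_rate must be positive")
    return math.log(2) / decay_rate


# Rates copied from the listing above
decay_rates = {
    "parent-child": 0.01,
    "siblings": 0.01,
    "cousins": 0.05,
    "distant": 0.1,
    "unrelated": 0.001,
}

half_distances = {category: half_distance(rate) for category, rate in decay_rates.items()}
```

A rate of 0.01 halves the weight about every 69 distance units, while a rate of 0.1 halves it about every 7. With these particular values the weight for distant relatives therefore falls off faster with distance than the weight for parents and children, which is worth checking against the intended model before reusing the numbers.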
3. Validating Prior Models
It's essential to validate prior models to ensure they reflect reality:
- Historical Validation: Testing against known historical relationships
- Expert Review: Having domain experts evaluate prior distributions
- Sensitivity Analysis: Assessing how prior variations affect inference
- Cross-Validation: Testing predictive performance on reserved data
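The sensitivity analysis step can be sketched as a sweep over prior strength that records which relationship wins at each setting. Everything here is an assumption for illustration: the helper names, the linear blend with a uniform prior, and the numbers. The input prior is expected to sum to 1.

```python
def posterior(likelihoods, priors):
    """Normalize likelihood * prior over the candidate relationships."""
    unnormalized = {rel: likelihoods[rel] * priors[rel] for rel in likelihoods}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("every relationship has zero posterior mass")
    return {rel: value / total for rel, value in unnormalized.items()}


def temper(priors, strength):
    """Blend a normalized prior with a uniform one.

    strength 0 gives the uniform prior, strength 1 gives `priors` unchanged.
    """
    uniform = 1.0 / len(priors)
    return {rel: (1.0 - strength) * uniform + strength * p for rel, p in priors.items()}


def sensitivity_sweep(likelihoods, priors, strengths):
    """Return (strength, best relationship, its posterior) for each strength."""
    results = []
    for strength in strengths:
        post = posterior(likelihoods, temper(priors, strength))
        best = max(post, key=post.get)
        results.append((strength, best, post[best]))
    return results


# Illustrative inputs: the genetic evidence slightly favors a grandparent link,
# while the prior favors half-siblings.
likelihoods = {"half-siblings": 0.40, "grandparent-grandchild": 0.45, "avuncular": 0.15}
priors = {"half-siblings": 0.6, "grandparent-grandchild": 0.1, "avuncular": 0.3}

sweep = sensitivity_sweep(likelihoods, priors, [0.0, 0.25, 0.5, 0.75, 1.0])
```

If the winning relationship changes within the range of strengths you consider plausible, as it does here between 0 and 0.25, the conclusion depends on the prior and deserves closer scrutiny.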
4. Combining Multiple Prior Sources
Often, you'll want to integrate multiple sources of prior information:
- Weighted Combination: Combining models with importance weights
- Sequential Updating: Using one prior's output as another's input
- Constraint Satisfaction: Finding distributions that satisfy all constraints
- Hierarchical Models: Structuring priors in levels of specificity
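For the weighted combination option, an arithmetic mean is one choice; a log-linear pool (weighted geometric mean) is another, and it behaves differently: a source that considers a relationship nearly impossible pulls the combined value down much harder. The sketch below is a generic implementation under assumed names and inputs, not a description of how Bonsai combines priors.

```python
import math


def log_linear_pool(distributions, weights):
    """Weighted geometric mean of several priors, renormalized to sum to 1.

    distributions: list of dicts over the same relationship labels
    weights: one non-negative weight per distribution
    """
    if len(distributions) != len(weights):
        raise ValueError("need exactly one weight per distribution")
    # Sources with zero weight are ignored entirely
    active = [(w, d) for w, d in zip(weights, distributions) if w > 0]
    if not active:
        raise ValueError("at least one weight must be positive")

    pooled = {}
    for rel in distributions[0]:
        if any(d[rel] <= 0 for _, d in active):
            pooled[rel] = 0.0  # any active source ruling it out vetoes it
        else:
            pooled[rel] = math.exp(sum(w * math.log(d[rel]) for w, d in active))

    total = sum(pooled.values())
    if total == 0:
        raise ValueError("the sources rule out every relationship")
    return {rel: value / total for rel, value in pooled.items()}


# Made-up inputs: an age-based source is confident, a records-based source is not
age_source = {"parent-child": 0.02, "full-siblings": 0.98}
records_source = {"parent-child": 0.5, "full-siblings": 0.5}

pooled_prior = log_linear_pool([age_source, records_source], [0.5, 0.5])
```

With equal weights, the pooled parent-child prior is 0.125, whereas a plain average of the two sources would give 0.26.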
Integrating Prior Models with Genetic Evidence
Combining Priors and Likelihoods
Once prior models are defined, they need to be effectively integrated with genetic evidence for optimal relationship inference:
The Bayesian Framework
Bonsai v3 follows Bayesian principles for integrating priors with genetic evidence:
# Simplified pseudocode for Bayesian integration.
# get_plausible_relationships and calculate_likelihood are placeholders
# for functionality defined elsewhere.

def infer_relationship_with_priors(id1, id2, genetic_data, prior_model):
    """
    Infer relationship using both genetic data and prior model.

    Args:
        id1: ID of the first individual
        id2: ID of the second individual
        genetic_data: Genetic comparison data between individuals
        prior_model: Prior probability model

    Returns:
        Dictionary mapping relationships to posterior probabilities
    """
    # Define possible relationships to consider
    relationships = get_plausible_relationships()

    # Calculate likelihood for each relationship
    likelihoods = {}
    for rel in relationships:
        likelihoods[rel] = calculate_likelihood(genetic_data, rel)

    # Get prior probabilities
    priors = prior_model.get_relationship_priors(id1, id2, relationships)

    # Calculate unnormalized posterior (prior × likelihood)
    unnormalized_posterior = {}
    for rel in relationships:
        unnormalized_posterior[rel] = priors[rel] * likelihoods[rel]

    # Normalize to get proper probability distribution
    total = sum(unnormalized_posterior.values())
    if total == 0:
        # Every candidate has zero prior or zero likelihood
        raise ValueError("No relationship has non-zero posterior probability")
    posterior = {rel: prob / total for rel, prob in unnormalized_posterior.items()}

    return posterior
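Likelihoods computed over many genomic segments can be far too small to represent as ordinary floats, in which case multiplying them directly underflows to zero. A common remedy, sketched below under the assumption that log-likelihoods are available, is to work in log space and normalize with the log-sum-exp trick. The function name and the numbers are hypothetical.

```python
import math


def log_space_posterior(log_likelihoods, priors):
    """Posterior from log-likelihoods and priors, stable against underflow."""
    log_unnormalized = {}
    for rel, log_lik in log_likelihoods.items():
        prior = priors[rel]
        # A zero prior rules the relationship out entirely
        log_unnormalized[rel] = log_lik + math.log(prior) if prior > 0 else -math.inf

    peak = max(log_unnormalized.values())
    if peak == -math.inf:
        raise ValueError("every relationship has zero posterior mass")

    # log-sum-exp: subtract the peak before exponentiating
    log_total = peak + math.log(
        sum(math.exp(value - peak) for value in log_unnormalized.values())
    )
    return {rel: math.exp(value - log_total) for rel, value in log_unnormalized.items()}


# Illustrative log-likelihoods; exp(-1200) is 0.0 in double precision
log_likelihoods = {"full-siblings": -1200.0, "half-siblings": -1203.0, "unrelated": -1400.0}
flat_priors = {rel: 1.0 / 3.0 for rel in log_likelihoods}

stable_posterior = log_space_posterior(log_likelihoods, flat_priors)
```

Only the differences between log-likelihoods affect the result, so a gap of 3 log units still yields a well-defined posterior even though each individual likelihood is numerically zero.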
Integration Challenges
Several challenges arise when integrating priors with genetic evidence:
- Prior Strength Calibration: Determining how much weight to give priors relative to genetic evidence
- Prior Uncertainty: Accounting for uncertainty in the prior models themselves
- Relationship Alignment: Ensuring prior and likelihood models use compatible relationship definitions
- Computational Efficiency: Maintaining performance with complex prior calculations
Prior Strength Calibration
class CalibratedPriorModel(PriorModel):
    """
    A prior model with adjustable strength.

    This wrapper allows controlling how strongly the prior influences
    the posterior relative to the likelihood.
    """

    def __init__(self, base_prior_model, strength=1.0):
        """
        Initialize with base model and strength parameter.

        Args:
            base_prior_model: The underlying prior model
            strength: How strongly to weight the prior
                      (0=uniform, 1=full strength)
        """
        self.base_model = base_prior_model
        self.strength = strength

    def get_prior_probability(self, id1, id2, relationship):
        """
        Calculate prior with adjusted strength.

        Args:
            id1: ID of the first individual
            id2: ID of the second individual
            relationship: Relationship tuple

        Returns:
            Adjusted prior probability
        """
        # Get base prior
        base_prior = self.base_model.get_prior_probability(id1, id2, relationship)

        # Interpolate between uniform (1.0) and base prior.
        # strength == 0.0 yields 1.0 (uniform); strength == 1.0 yields base_prior.
        return (1.0 - self.strength) + self.strength * base_prior
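It is worth checking numerically what a given strength does to the contrast between likely and unlikely relationships. The sketch below compares the linear interpolation used above with an alternative, raising the prior to the power of the strength. The power form is a swapped-in technique shown only for comparison, and both function names are hypothetical.

```python
def interpolate_prior(base_prior, strength):
    """Linear blend between a neutral weight of 1.0 and the base prior."""
    return (1.0 - strength) + strength * base_prior


def power_prior(base_prior, strength):
    """Alternative tempering: base_prior ** strength (1.0 when strength is 0)."""
    return base_prior ** strength


likely, unlikely = 1.0, 0.01  # a 100:1 preference at full strength

# Ratio of the two weights after tempering at strength 0.5
linear_ratio = interpolate_prior(likely, 0.5) / interpolate_prior(unlikely, 0.5)
power_ratio = power_prior(likely, 0.5) / power_prior(unlikely, 0.5)
```

At strength 0.5 the linear blend shrinks a 100:1 preference to about 2:1, while the power form keeps it at 10:1. Which behavior is appropriate depends on how much the prior source is trusted.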
When Priors and Genetic Evidence Conflict
When prior models strongly contradict genetic evidence, several approaches are possible:
- Flag for Review: Identify cases where priors and evidence disagree for human review
- Evidence Threshold: Override priors when genetic evidence is particularly strong
- Alternative Hypothesis Exploration: Present multiple possible interpretations
- Seek Additional Evidence: Gather more data to resolve the contradiction
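The first two approaches can be combined in a simple check: flag a pair when the relationship favored by the prior differs from the one favored by the genetic evidence, and the genetic evidence favors its choice by more than a threshold. The function name, the returned fields, and the default threshold of 2 (a likelihood ratio of 100) are assumptions for this sketch.

```python
import math


def find_conflict(likelihoods, priors, min_log10_ratio=2.0):
    """Return a description of a prior/evidence conflict, or None.

    A conflict is reported when the prior and the likelihood prefer
    different relationships and the likelihood ratio between those two
    relationships is at least 10 ** min_log10_ratio.
    """
    genetic_choice = max(likelihoods, key=likelihoods.get)
    prior_choice = max(priors, key=priors.get)
    if genetic_choice == prior_choice:
        return None

    top = likelihoods[genetic_choice]
    other = likelihoods[prior_choice]
    if top <= 0:
        return None  # no genetic support for anything; nothing to compare
    log_ratio = math.inf if other <= 0 else math.log10(top / other)
    if log_ratio < min_log10_ratio:
        return None  # evidence too weak to call it a real disagreement

    return {
        "genetic_choice": genetic_choice,
        "prior_choice": prior_choice,
        "log10_likelihood_ratio": log_ratio,
    }


# Illustrative case: records suggest siblings, DNA strongly suggests parent-child
conflict = find_conflict(
    likelihoods={"parent-child": 1e-2, "full-siblings": 1e-7},
    priors={"parent-child": 0.05, "full-siblings": 0.95},
)

# Illustrative case: evidence is nearly tied, so the disagreement is not flagged
no_conflict = find_conflict(
    likelihoods={"half-siblings": 0.40, "avuncular": 0.38},
    priors={"half-siblings": 0.1, "avuncular": 0.9},
)
```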
Case Studies in Prior Model Development
Learning from Real-World Examples
Several case studies illustrate the development and application of prior models in different contexts:
Case Study 1: Endogamous Population
For a research project involving an endogamous historical population:
- Challenge: Standard relationship priors failed to account for elevated background relatedness
- Approach:
  - Analyzed 50 known pedigrees from the population
  - Measured typical relationship frequencies within the community
  - Created a custom prior model with adjusted relationship probabilities
- Result: 35% improvement in relationship prediction accuracy compared to uniform priors
Case Study 2: Historical Migrations
For a project tracking family connections across a historical migration:
- Challenge: Determining likely family connections between origin and destination regions
- Approach:
  - Integrated historical migration records into a geographic prior model
  - Created time-dependent spatial probability distributions
  - Incorporated known migration patterns from historical records
- Result: Successfully identified multiple previously unknown family connections across regions
Case Study 3: Genealogical Records Integration
For a project integrating DNA evidence with existing family trees:
- Challenge: Determining how to weight sometimes-incorrect documentary evidence against genetic data
- Approach:
  - Created a prior model based on documentary records
  - Calibrated prior strength based on record reliability metrics
  - Implemented a conflict detection system to flag major discrepancies
- Result: Identified several documentary errors while confirming most recorded relationships
Ethical Considerations in Prior Model Development
Ensuring Responsible Prior Specification
Developing prior models raises important ethical considerations that must be addressed:
Potential Ethical Issues
- Bias Amplification: Priors based on biased historical data may perpetuate those biases
- Cultural Assumptions: Prior models may incorporate culturally specific assumptions
- Privacy Implications: Some prior information may have privacy implications
- Confirmation Bias: Priors may be inadvertently selected to confirm expected relationships
Best Practices for Ethical Prior Development
- Transparency: Document all assumptions and data sources used in prior development
- Validation: Test priors against diverse datasets to ensure they don't disadvantage specific groups
- Sensitivity Analysis: Examine how variations in prior assumptions affect conclusions
- Cultural Competence: Consult with cultural experts when developing priors for specific populations
- Privacy Protection: Ensure prior models don't inadvertently reveal sensitive information
Documenting Prior Assumptions
When developing custom prior models, it's important to thoroughly document:
- The data sources used to develop the prior
- The assumptions made during prior specification
- The strength of the prior relative to genetic evidence
- Any known limitations or potential biases in the prior
- Validation methods and results
This documentation enables critical evaluation of results and transparent scientific practice.
Conclusion and Next Steps
Custom prior probability models provide a powerful mechanism for integrating domain knowledge, historical records, and demographic information into genetic genealogy analysis. By developing and applying appropriate prior models within Bonsai v3's flexible framework, researchers can significantly enhance the accuracy and contextual relevance of relationship predictions.
The key to effective prior model development lies in carefully balancing informative contextual knowledge with appropriate caution about the strength of prior assumptions. When done well, prior models complement genetic evidence to create more robust and accurate relationship inferences.
In the next lab, we'll explore how Bonsai v3 integrates with other genealogical tools through the DRUID algorithm and other integration mechanisms, enabling comprehensive genetic genealogy workflows.
Your Learning Pathway
Interactive Lab Environment
Run the interactive Lab 27 notebook in Google Colab:
Data will be automatically downloaded from S3 when you run the notebook.
Note: You may need a Google account to save your work in Google Drive.
This lab is part of the Visualization & Advanced Applications track:
- Lab 21: Rendering
- Lab 22: Interpreting
- Lab 23: Twins
- Lab 24: Complex
- Lab 25: Real-World
- Lab 26: Performance
- Lab 27: Prior Models
- Lab 28: Integration
- Lab 29: End-to-End
- Lab 30: Advanced