Lab 08: Age-Based Relationship Modeling
Core Component: This lab explores how Bonsai v3 incorporates age information to enhance relationship inference. While genetic data provides strong evidence for biological relationships, age information can help resolve ambiguities and improve accuracy, especially for relationships with similar genetic signatures.
The Role of Age in Genetic Genealogy
Why Age Matters
Age information is a powerful complement to genetic data in relationship inference for several reasons:
- Disambiguation: Some relationships have similar genetic signatures but different expected age differences (e.g., grandparent vs. aunt/uncle)
- Biological Constraints: Age differences can render certain relationships biologically impossible (e.g., a younger person cannot be a parent of an older person)
- Generation Inference: Age patterns can help determine generational structure in complex pedigrees
- Resolution of Uncertainty: When genetic evidence is ambiguous, age can tip the scales toward the correct relationship
Bonsai v3's age-based modeling system is implemented primarily within the PwLogLike
class we explored in Lab 07, but with specialized methods focusing on age-based likelihood computation.
The key aging-related methods in the production code include:
get_pw_age_ll() # Calculate age-based log-likelihood for a relationship
_get_age_parameters() # Get expected age difference for a relationship
_get_age_weight() # Calculate weight for age evidence
_validate_age_info() # Check if age information is biologically plausible
These methods work together to extract relationship evidence from age data and integrate it with genetic evidence for comprehensive relationship inference.
Age Difference Distributions in Bonsai v3
Bonsai v3 models age differences between relatives using normal (Gaussian) distributions, where the parameters (mean and standard deviation) depend on the specific relationship type. The implementation is grounded in demographic research on typical age differences in family structures.
The core function for retrieving age parameters is _get_age_parameters()
, which returns relationship-specific parameters:
def _get_age_parameters(self, relationship_tuple):
"""Get expected age difference parameters for a relationship.
Args:
relationship_tuple: (up, down, num_ancs) tuple
Returns:
Dictionary with mean, standard deviation, and direction parameters
"""
up, down, num_ancs = relationship_tuple
# Default values
result = {
'mean': 0,
'std_dev': 10,
'direction': 0 # 0: no constraint, 1: id1 older, -1: id2 older
}
# Handle special cases first
if up == 0 and down == 0 and num_ancs == 2: # Self
result['mean'] = 0
result['std_dev'] = 0.1 # Almost no variation
return result
# Parent-child relationships
if up == 0 and down == 1: # id1 is parent of id2
result['mean'] = 30
result['std_dev'] = 10
result['direction'] = 1 # id1 should be older
return result
if up == 1 and down == 0: # id1 is child of id2
result['mean'] = -30
result['std_dev'] = 10
result['direction'] = -1 # id1 should be younger
return result
# Sibling relationships
if up == 1 and down == 1: # siblings (full or half)
result['mean'] = 0
result['std_dev'] = 10
return result
# [Additional relationship types...]
# For other complex relationships, use a generalization
# based on total meiotic distance
total_meiotic = up + down
if total_meiotic > 0:
if up > down: # id1 is in an older generation
result['mean'] = 30 * (up - down)
result['direction'] = 1
elif down > up: # id2 is in an older generation
result['mean'] = 30 * (up - down) # Will be negative
result['direction'] = -1
else: # Same generation
result['mean'] = 0
result['std_dev'] = 10 + 5 * (total_meiotic - 1) # Increase std dev for more distant relationships
return result
The parameters encode three key pieces of information:
- mean: The expected age difference in years (positive if id1 should be older, negative if id2 should be older)
- std_dev: The standard deviation of the age difference distribution, capturing the natural variation
- direction: A flag indicating whether there's a directional constraint (e.g., parents must be older than children)
These parameters are calibrated based on extensive demographic research and validation against real family datasets. For example, the average parent-child age difference of 30 years with a standard deviation of 10 years has been found to be consistent across various populations, though with some cultural and historical variations.
Age-Based Likelihood Calculation
Computing Age-Based Likelihoods
Once the age parameters are determined for a relationship, Bonsai v3 calculates the likelihood of the observed age difference under that relationship hypothesis using the get_pw_age_ll()
method:
def get_pw_age_ll(self, id1, id2, relationship_tuple):
"""Calculate age-based log-likelihood of a relationship.
Args:
id1, id2: IDs of the individuals
relationship_tuple: (up, down, num_ancs) tuple
Returns:
Age-based log-likelihood
"""
# Get age information
if id1 not in self.id_to_info or id2 not in self.id_to_info:
return 0.0 # No age information available
age1 = self.id_to_info[id1].get('age')
age2 = self.id_to_info[id2].get('age')
if age1 is None or age2 is None:
return 0.0 # No age information available
# Calculate age difference
age_diff = age1 - age2
# Get expected age parameters for this relationship
params = self._get_age_parameters(relationship_tuple)
mean = params['mean']
std_dev = params['std_dev']
direction = params['direction']
# Check for biological impossibility
if direction == 1 and age_diff < 16: # id1 should be older by at least 16 years
return float('-inf') # Biologically impossible
elif direction == -1 and age_diff > -16: # id1 should be younger by at least 16 years
return float('-inf') # Biologically impossible
# Calculate log-likelihood using normal distribution
return stats.norm.logpdf(age_diff, mean, std_dev)
The method follows these key steps:
- Retrieve age information for both individuals
- Calculate the age difference (id1's age minus id2's age)
- Get the expected age parameters for the proposed relationship
- Apply biological constraints (e.g., parents must be at least 16 years older than children)
- Calculate the log-likelihood using a normal distribution
The use of log-likelihoods (rather than regular likelihoods) is important for numerical stability and makes it easier to combine with genetic likelihoods, which are also calculated in log space.
The biological impossibility check is particularly important, as it completely rules out relationships that violate fundamental biological constraints, regardless of genetic evidence. This helps prevent errors where genetic noise might otherwise lead to biologically impossible relationship inferences.
Handling Missing and Uncertain Age Information
Real-world data often includes missing or uncertain age information. Bonsai v3 implements several strategies to handle these scenarios:
- Neutral Contribution for Missing Data: When age information is missing for one or both individuals, the age-based component doesn't contribute to the final likelihood (returns 0 in log space)
- Incorporating Uncertainty: When age information is uncertain (e.g., approximate birth years), the standard deviation is increased to reflect this uncertainty:
def _adjust_age_uncertainty(self, id1, id2, std_dev):
"""Adjust standard deviation based on age uncertainty.
Args:
id1, id2: IDs of the individuals
std_dev: Base standard deviation
Returns:
Adjusted standard deviation
"""
# Get uncertainty information if available
uncertainty1 = self.id_to_info[id1].get('age_uncertainty', 0)
uncertainty2 = self.id_to_info[id2].get('age_uncertainty', 0)
# Combine uncertainties with base standard deviation
# Using sum of variances for independent random variables
combined_var = std_dev**2 + uncertainty1**2 + uncertainty2**2
adjusted_std = math.sqrt(combined_var)
return adjusted_std
For partial age information, Bonsai v3 can sometimes extract weak evidence. For example, if only one individual's age is known, but the relationship has a strong directional constraint (like parent-child), a partial assessment can still be made.
The system can also infer age ranges based on known relationships. For instance, if A is known to be a parent of B, and B's age is known to be 30, then A's age can be estimated to be approximately 60 (with appropriate variance).
This sophisticated handling of age uncertainty ensures that the system makes the best use of available information without being overconfident in its inferences.
Integrating Age with Genetic Evidence
The Combined Likelihood Approach
Bonsai v3 combines age-based and genetic likelihoods using a weighted sum approach in log space:
combined_ll = genetic_ll + age_weight * age_ll
This integration occurs in the get_log_like()
method of the PwLogLike
class, which we explored in the previous lab:
def get_log_like(self, id1, id2, relationship_tuple, condition=True):
"""Calculate the log-likelihood of a relationship between two individuals.
Args:
id1, id2: IDs of the individuals
relationship_tuple: (up, down, num_ancs) tuple
condition: Whether to condition on observing IBD
Returns:
Log-likelihood of the relationship
"""
# [Cache checking code...]
# Get genetic likelihood
gen_ll = self.get_pw_gen_ll(id1, id2, relationship_tuple, condition)
# Get age-based likelihood if available
if id1 in self.id_to_info and id2 in self.id_to_info:
age1 = self.id_to_info[id1].get('age')
age2 = self.id_to_info[id2].get('age')
if age1 is not None and age2 is not None:
# Calculate age weight
age_weight = self._get_age_weight(id1, id2)
# Get age-based likelihood
age_ll = self.get_pw_age_ll(id1, id2, relationship_tuple)
# Combine likelihoods
ll = gen_ll + age_weight * age_ll
else:
ll = gen_ll
else:
ll = gen_ll
# [Cache updating code...]
return ll
The age_weight
parameter is crucial as it controls the influence of age evidence relative to genetic evidence. This weight is dynamically determined by the _get_age_weight()
method based on several factors:
def _get_age_weight(self, id1, id2):
"""Calculate weight for age-based likelihood.
Args:
id1, id2: IDs of the individuals
Returns:
Weight for age-based likelihood
"""
# Base weight
base_weight = 0.25 # Weight relative to genetic evidence
# Adjust based on age information quality
info1 = self.id_to_info[id1]
info2 = self.id_to_info[id2]
# Check for quality indicators
age_quality1 = info1.get('age_quality', 1.0)
age_quality2 = info2.get('age_quality', 1.0)
age_uncertainty1 = info1.get('age_uncertainty', 0)
age_uncertainty2 = info2.get('age_uncertainty', 0)
# Reduce weight for less reliable age information
if age_uncertainty1 > 0 or age_uncertainty2 > 0:
# More uncertainty means less weight
max_uncertainty = max(age_uncertainty1, age_uncertainty2)
uncertainty_factor = 1.0 / (1.0 + 0.1 * max_uncertainty)
base_weight *= uncertainty_factor
# Use explicit quality indicators if available
quality_factor = min(age_quality1, age_quality2)
return base_weight * quality_factor
The typical weight range is between 0.1 and 0.5, with higher weights given when age information is precise and reliable. This weighted approach ensures that genetic evidence remains the primary determinant of relationships, with age providing a complementary source of evidence.
Using Age to Disambiguate Relationships
One of the most valuable applications of age-based modeling is disambiguating relationships with similar genetic signatures. Consider these relationships, all of which have similar expected IBD sharing (approximately 25%):
- Half siblings
- Grandparent-grandchild
- Aunt/uncle-niece/nephew
Genetically, these relationships are difficult to distinguish, especially with limited data. However, they have distinct age difference patterns:
- Half siblings: Mean age difference close to 0 years
- Grandparent-grandchild: Mean age difference around 60 years
- Aunt/uncle-niece/nephew: Mean age difference around 20-25 years
Bonsai v3 leverages these age patterns to dramatically improve disambiguation accuracy. Consider this example case from the production code:
def disambiguate_relationships(id1, id2, age1, age2, ibd_segments):
"""Example of using age to disambiguate relationships with similar genetic signatures.
Args:
id1, id2: Individual IDs
age1, age2: Individual ages
ibd_segments: List of IBD segments between individuals
Returns:
Most likely relationship and likelihood ratio
"""
# Create biographical information
bio_info = [
{'genotype_id': id1, 'age': age1},
{'genotype_id': id2, 'age': age2}
]
# Initialize PwLogLike
pw_ll = PwLogLike(bio_info=bio_info, unphased_ibd_seg_list=ibd_segments)
# Relationships to test (similar genetic signatures)
relationships = [
((1, 1, 1), "Half Sibling"),
((0, 2, 1), "Grandparent"),
((2, 0, 1), "Grandchild"),
((1, 2, 1), "Aunt/Uncle"),
((2, 1, 1), "Niece/Nephew")
]
# Calculate likelihoods with and without age
results = []
for rel_tuple, rel_name in relationships:
# Without age (genetic only)
pw_ll.use_age = False
gen_ll = pw_ll.get_log_like(id1, id2, rel_tuple)
# With age (combined)
pw_ll.use_age = True
combined_ll = pw_ll.get_log_like(id1, id2, rel_tuple)
# Age contribution
age_contribution = combined_ll - gen_ll
results.append({
'relationship': rel_name,
'genetic_ll': gen_ll,
'combined_ll': combined_ll,
'age_contribution': age_contribution
})
# Sort by combined likelihood
results.sort(key=lambda x: x['combined_ll'], reverse=True)
# Calculate likelihood ratio between top two
if len(results) >= 2:
top_ll = results[0]['combined_ll']
second_ll = results[1]['combined_ll']
likelihood_ratio = math.exp(top_ll - second_ll)
else:
likelihood_ratio = float('inf')
return results[0]['relationship'], likelihood_ratio
In real-world testing, age-based disambiguation has been shown to improve relationship classification accuracy by 15-30% for these ambiguous relationship types, demonstrating the power of integrating multiple evidence sources.
Implementation Details and Extensions
Sex-Specific Age Models
Bonsai v3's production code includes sex-specific age models that account for biological sex differences in reproductive patterns. For example, the timing of parenthood tends to differ between males and females:
def _get_sex_specific_age_parameters(self, relationship_tuple, sex1, sex2):
"""Get sex-specific age parameters for a relationship.
Args:
relationship_tuple: (up, down, num_ancs) tuple
sex1, sex2: Biological sex of individuals ('M', 'F', or None)
Returns:
Adjusted age parameters
"""
# Get base parameters
params = self._get_age_parameters(relationship_tuple)
# No sex information or not a relationship where it matters
if sex1 is None or sex2 is None:
return params
up, down, num_ancs = relationship_tuple
# Adjust parent-child relationships
if up == 0 and down == 1: # id1 is parent of id2
if sex1 == 'M': # Father-child
params['mean'] += 3 # Fathers tend to be slightly older
elif sex1 == 'F': # Mother-child
params['mean'] -= 3 # Mothers tend to be slightly younger
params['std_dev'] -= 1 # With less variance
elif up == 1 and down == 0: # id1 is child of id2
if sex2 == 'M': # Child-father
params['mean'] -= 3 # Inverse of above
elif sex2 == 'F': # Child-mother
params['mean'] += 3
params['std_dev'] -= 1
# [Additional relationship-specific adjustments...]
return params
These sex-specific adjustments are based on demographic research showing systematic differences in parental age patterns between males and females across many populations. They are particularly valuable for distinguishing maternal from paternal relationships when both genetic and age evidence are available.
Cultural and Historical Calibration
The age models in Bonsai v3 can be calibrated for different populations and historical periods. This is important because age patterns vary across cultures and have changed over time. The implementation includes parameters that can be adjusted based on population context:
def _adjust_for_population_context(self, params, population_context):
"""Adjust age parameters based on population context.
Args:
params: Base age parameters
population_context: Dictionary with context information
Returns:
Adjusted parameters
"""
# No context information
if not population_context:
return params
# Extract context information
era = population_context.get('era')
culture = population_context.get('culture')
# Historical era adjustments
if era == 'pre_1900':
# Earlier parenthood in historical eras
if params['direction'] != 0: # Parent-child related
params['mean'] *= 0.85 # About 15% younger
elif era == 'post_1980':
# Later parenthood in modern era
if params['direction'] != 0: # Parent-child related
params['mean'] *= 1.1 # About 10% older
# Cultural adjustments
if culture in ['A', 'B', 'C']: # Specific cultural patterns
# Apply culture-specific adjustments
adjustment_factor = CULTURE_ADJUSTMENTS.get(culture, 1.0)
params['mean'] *= adjustment_factor
return params
These contextual adjustments make Bonsai v3's age modeling system adaptable to diverse genealogical scenarios, from modern populations to historical pedigree reconstruction where age patterns may differ significantly from contemporary norms.
Core Component: Bonsai v3's age-based relationship modeling provides a powerful complement to genetic evidence, helping to resolve ambiguities and improve relationship inference accuracy. By modeling age differences as relationship-specific normal distributions and integrating them with genetic evidence through a weighted likelihood approach, the system leverages all available information for more accurate pedigree reconstruction.
Comparing Notebook and Production Code
The Lab08 notebook provides a simplified exploration of age-based relationship modeling, while the production implementation in Bonsai v3 includes additional sophistication:
- More Nuanced Age Models: The production code includes finer-grained age models for a wider range of relationships
- Population-Specific Calibration: Parameters can be adjusted for different population contexts
- Sex-Specific Adjustments: Different age patterns for males and females in reproductive relationships
- Handling of Uncertain Ages: Sophisticated treatment of age ranges and approximate birth years
- Dynamic Weight Adjustment: Age evidence weight varies based on relationship type and evidence quality
- Biological Validity Checking: More sophisticated checks for biologically impossible age combinations
The notebook provides a valuable introduction to the key concepts, but the production implementation represents years of refinement to handle the complexities of real-world demographic patterns across diverse human populations.
Interactive Lab Environment
Run the interactive Lab 08 notebook in Google Colab:
Google Colab Environment
Run the notebook in Google Colab for a powerful computing environment with access to Google's resources.
Data will be automatically downloaded from S3 when you run the notebook.
Note: You may need a Google account to save your work in Google Drive.
Beyond the Code
As you explore age-based relationship modeling, consider these broader implications:
- Evidence Integration: How combining different types of evidence can lead to more robust inferences than any single source
- Demographic Patterns: The importance of understanding human demographic patterns in genetic genealogy
- Modeling Uncertainty: How statistical distributions can represent natural variation in biological and social patterns
- Contextual Calibration: The need to adapt models to different historical and cultural contexts
These considerations highlight how age-based modeling represents not just a technical enhancement but a bridge between genetic data and human demographic patterns, connecting the molecular to the social aspects of genealogy.
This lab is part of the Bonsai v3 Deep Dive track:
Introduction
Lab 01
Architecture
Lab 02
IBD Formats
Lab 03
Statistics
Lab 04
Models
Lab 05
Relationships
Lab 06
PwLogLike
Lab 07
Age Modeling
Lab 08
Data Structures
Lab 09
Up-Node Dict
Lab 10