Computational Genetic Genealogy

Lab 08: Age-Based Relationship Modeling

Core Component: This lab explores how Bonsai v3 incorporates age information to enhance relationship inference. While genetic data provides strong evidence for biological relationships, age information can help resolve ambiguities and improve accuracy, especially for relationships with similar genetic signatures.

The Role of Age in Genetic Genealogy

Why Age Matters

Age information is a powerful complement to genetic data in relationship inference for several reasons:

  • Disambiguation: Some relationships have similar genetic signatures but different expected age differences (e.g., grandparent vs. aunt/uncle)
  • Biological Constraints: Age differences can render certain relationships biologically impossible (e.g., a younger person cannot be a parent of an older person)
  • Generation Inference: Age patterns can help determine generational structure in complex pedigrees
  • Resolution of Uncertainty: When genetic evidence is ambiguous, age can tip the scales toward the correct relationship

Bonsai v3's age-based modeling system is implemented primarily within the PwLogLike class we explored in Lab 07, but with specialized methods focusing on age-based likelihood computation.

The key age-related methods in the production code include:

get_pw_age_ll()      # Calculate age-based log-likelihood for a relationship
_get_age_parameters() # Get expected age difference for a relationship
_get_age_weight()     # Calculate weight for age evidence
_validate_age_info()  # Check if age information is biologically plausible

These methods work together to extract relationship evidence from age data and integrate it with genetic evidence for comprehensive relationship inference.

Age Difference Distributions in Bonsai v3

Bonsai v3 models age differences between relatives using normal (Gaussian) distributions, where the parameters (mean and standard deviation) depend on the specific relationship type. The implementation is grounded in demographic research on typical age differences in family structures.

The core function for retrieving age parameters is _get_age_parameters(), which returns relationship-specific parameters:

def _get_age_parameters(self, relationship_tuple):
    """Get expected age difference parameters for a relationship.
    
    Args:
        relationship_tuple: (up, down, num_ancs) tuple
        
    Returns:
        Dictionary with mean, standard deviation, and direction parameters
    """
    up, down, num_ancs = relationship_tuple
    
    # Default values
    result = {
        'mean': 0,
        'std_dev': 10,
        'direction': 0  # 0: no constraint, 1: id1 older, -1: id2 older
    }
    
    # Handle special cases first
    if up == 0 and down == 0 and num_ancs == 2:  # Self
        result['mean'] = 0
        result['std_dev'] = 0.1  # Almost no variation
        return result
    
    # Parent-child relationships
    if up == 0 and down == 1:  # id1 is parent of id2
        result['mean'] = 30
        result['std_dev'] = 10
        result['direction'] = 1  # id1 should be older
        return result
    
    if up == 1 and down == 0:  # id1 is child of id2
        result['mean'] = -30
        result['std_dev'] = 10
        result['direction'] = -1  # id1 should be younger
        return result
    
    # Sibling relationships
    if up == 1 and down == 1:  # siblings (full or half)
        result['mean'] = 0
        result['std_dev'] = 10
        return result
    
    # [Additional relationship types...]
    
    # For other complex relationships, use a generalization
    # based on total meiotic distance
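    total_meiotic = up + down
    if total_meiotic > 0:
        # (0, 1) above means id1 is the parent of id2, so in this convention
        # down > up places id1 in the older generation.
        if down > up:  # id1 is in an older generation
            result['mean'] = 30 * (down - up)
            # Hard ordering applies only to direct ancestors; an aunt or
            # uncle can be younger than a niece or nephew.
            result['direction'] = 1 if up == 0 else 0
        elif up > down:  # id2 is in an older generation
            result['mean'] = 30 * (down - up)  # Will be negative
            result['direction'] = -1 if down == 0 else 0
        else:  # Same generation
            result['mean'] = 0
        
        result['std_dev'] = 10 + 5 * (total_meiotic - 1)  # Increase std dev for more distant relationships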
    
    return result

The parameters encode three key pieces of information:

  • mean: The expected age difference in years (positive if id1 should be older, negative if id2 should be older)
  • std_dev: The standard deviation of the age difference distribution, capturing the natural variation
  • direction: A flag indicating whether there's a directional constraint (e.g., parents must be older than children)

These parameters are calibrated based on extensive demographic research and validation against real family datasets. For example, the average parent-child age difference of 30 years with a standard deviation of 10 years has been found to be consistent across various populations, though with some cultural and historical variations.
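To make the Gaussian model concrete, here is a minimal sketch that evaluates the log-density of an observed age gap under the parent-child parameters quoted above (mean 30, standard deviation 10). The helper name normal_logpdf is illustrative and is not part of Bonsai v3; it uses only the standard library.

```python
import math

def normal_logpdf(x, mean, std_dev):
    """Log of the normal density N(mean, std_dev^2) evaluated at x."""
    z = (x - mean) / std_dev
    return -0.5 * z * z - math.log(std_dev) - 0.5 * math.log(2.0 * math.pi)

# Parent-child parameters quoted in the text: mean 30 years, std dev 10 years.
typical_gap = normal_logpdf(28, 30, 10)   # a 28-year gap is close to the mean
unusual_gap = normal_logpdf(55, 30, 10)   # a 55-year gap is 2.5 std devs away

print(f"28-year gap: {typical_gap:.3f}")
print(f"55-year gap: {unusual_gap:.3f}")
```

The gap between the two values is what matters for inference: the further an observed age difference sits from the expected mean, the more the hypothesis is penalized.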

Age-Based Likelihood Calculation

Computing Age-Based Likelihoods

Once the age parameters are determined for a relationship, Bonsai v3 calculates the likelihood of the observed age difference under that relationship hypothesis using the get_pw_age_ll() method:

from scipy import stats  # provides stats.norm.logpdf, used below

def get_pw_age_ll(self, id1, id2, relationship_tuple):
    """Calculate age-based log-likelihood of a relationship.
    
    Args:
        id1, id2: IDs of the individuals
        relationship_tuple: (up, down, num_ancs) tuple
        
    Returns:
        Age-based log-likelihood
    """
    # Get age information
    if id1 not in self.id_to_info or id2 not in self.id_to_info:
        return 0.0  # No age information available
        
    age1 = self.id_to_info[id1].get('age')
    age2 = self.id_to_info[id2].get('age')
    
    if age1 is None or age2 is None:
        return 0.0  # No age information available
    
    # Calculate age difference
    age_diff = age1 - age2
    
    # Get expected age parameters for this relationship
    params = self._get_age_parameters(relationship_tuple)
    mean = params['mean']
    std_dev = params['std_dev']
    direction = params['direction']
    
    # Check for biological impossibility
    if direction == 1 and age_diff < 16:  # id1 should be older by at least 16 years
        return float('-inf')  # Biologically impossible
    elif direction == -1 and age_diff > -16:  # id1 should be younger by at least 16 years
        return float('-inf')  # Biologically impossible
    
    # Calculate log-likelihood using normal distribution
    return stats.norm.logpdf(age_diff, mean, std_dev)

The method follows these key steps:

  1. Retrieve age information for both individuals
  2. Calculate the age difference (id1's age minus id2's age)
  3. Get the expected age parameters for the proposed relationship
  4. Apply biological constraints (e.g., parents must be at least 16 years older than children)
  5. Calculate the log-likelihood using a normal distribution
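The steps above can be sketched as a standalone function that runs without the PwLogLike class. This is an illustration only: the function name age_log_likelihood is hypothetical, the relationship parameters are passed in by hand instead of being looked up (step 3), and the normal density is written out with the standard library rather than SciPy.

```python
import math

MIN_PARENT_GAP = 16  # minimum gap in years, as in the method above

def age_log_likelihood(age1, age2, mean, std_dev, direction):
    """Standalone version of the steps, with parameters passed in directly."""
    # Step 1: missing ages contribute nothing (0.0 in log space).
    if age1 is None or age2 is None:
        return 0.0
    # Step 2: age difference is id1's age minus id2's age.
    age_diff = age1 - age2
    # Step 4: hard constraint for directional relationships.
    if direction == 1 and age_diff < MIN_PARENT_GAP:
        return float("-inf")
    if direction == -1 and age_diff > -MIN_PARENT_GAP:
        return float("-inf")
    # Step 5: normal log-density of the observed difference.
    z = (age_diff - mean) / std_dev
    return -0.5 * z * z - math.log(std_dev * math.sqrt(2.0 * math.pi))

# Parent-child parameters (mean 30, std dev 10, id1 older) supplied by hand.
print(age_log_likelihood(58, 30, mean=30, std_dev=10, direction=1))    # finite value
print(age_log_likelihood(35, 30, mean=30, std_dev=10, direction=1))    # -inf
print(age_log_likelihood(None, 30, mean=30, std_dev=10, direction=1))  # 0.0
```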

The use of log-likelihoods (rather than regular likelihoods) is important for numerical stability and makes it easier to combine with genetic likelihoods, which are also calculated in log space.
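The numerical-stability point can be seen with a small, self-contained experiment. The numbers are made up for illustration and are not taken from Bonsai v3.

```python
import math

# 400 hypothetical per-observation likelihoods, each equal to 0.01.
likelihoods = [0.01] * 400

# Multiplying raw probabilities underflows to exactly 0.0 in double precision.
product = 1.0
for p in likelihoods:
    product *= p

# Summing logs keeps the same information as an ordinary finite number.
log_sum = sum(math.log(p) for p in likelihoods)

print(product)   # 0.0
print(log_sum)   # roughly -1842.07
```

Once every term lives in log space, combining genetic and age evidence is a simple addition.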

The biological impossibility check is particularly important, as it completely rules out relationships that violate fundamental biological constraints, regardless of genetic evidence. This helps prevent errors where genetic noise might otherwise lead to biologically impossible relationship inferences.

Handling Missing and Uncertain Age Information

Real-world data often includes missing or uncertain age information. Bonsai v3 implements several strategies to handle these scenarios:

  1. Neutral Contribution for Missing Data: When age information is missing for one or both individuals, the age-based component doesn't contribute to the final likelihood (returns 0 in log space)
  2. Incorporating Uncertainty: When age information is uncertain (e.g., approximate birth years), the standard deviation is increased to reflect this uncertainty:

import math  # provides math.sqrt, used below

def _adjust_age_uncertainty(self, id1, id2, std_dev):
    """Adjust standard deviation based on age uncertainty.
    
    Args:
        id1, id2: IDs of the individuals
        std_dev: Base standard deviation
        
    Returns:
        Adjusted standard deviation
    """
    # Get uncertainty information if available
    uncertainty1 = self.id_to_info[id1].get('age_uncertainty', 0)
    uncertainty2 = self.id_to_info[id2].get('age_uncertainty', 0)
    
    # Combine uncertainties with base standard deviation
    # Using sum of variances for independent random variables
    combined_var = std_dev**2 + uncertainty1**2 + uncertainty2**2
    adjusted_std = math.sqrt(combined_var)
    
    return adjusted_std
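As a worked example of the variance-addition rule used above, the following sketch reproduces the arithmetic outside the class. The function name combine_std and the sample uncertainties are assumptions made for illustration.

```python
import math

def combine_std(base_std, uncertainty1=0.0, uncertainty2=0.0):
    """Add independent variances, then convert back to a standard deviation."""
    return math.sqrt(base_std**2 + uncertainty1**2 + uncertainty2**2)

# Base std dev of 10 years, with ages known to within about 3 and 4 years.
print(combine_std(10, 3, 4))   # sqrt(100 + 9 + 16) = sqrt(125)
print(combine_std(10))         # no uncertainty leaves the base value unchanged
```

Because variances add, small uncertainties barely change the result: here two uncertainties of 3 and 4 years widen a 10-year standard deviation to a little over 11 years.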

For partial age information, Bonsai v3 can sometimes extract weak evidence. For example, if only one individual's age is known, but the relationship has a strong directional constraint (like parent-child), a partial assessment can still be made.

The system can also infer age ranges based on known relationships. For instance, if A is known to be a parent of B, and B's age is known to be 30, then A's age can be estimated to be approximately 60 (with appropriate variance).
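A minimal sketch of that kind of age-range estimate is shown below. The function name estimate_relative_age and the choice of a range spanning two standard deviations are illustrative assumptions; the source does not specify how Bonsai v3 reports the range.

```python
def estimate_relative_age(known_age, mean_diff, std_dev, n_std=2.0):
    """Estimate id1's age from id2's known age and an expected age difference.

    mean_diff follows the convention in the text: id1's age minus id2's age.
    Returns (point_estimate, low, high), where the range spans n_std std devs.
    """
    estimate = known_age + mean_diff
    low = max(0.0, estimate - n_std * std_dev)   # ages cannot be negative
    high = estimate + n_std * std_dev
    return estimate, low, high

# A is a parent of B (mean difference +30, std dev 10) and B is 30 years old.
print(estimate_relative_age(30, 30, 10))   # (60, 40.0, 80.0)
```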

This sophisticated handling of age uncertainty ensures that the system makes the best use of available information without being overconfident in its inferences.

Integrating Age with Genetic Evidence

The Combined Likelihood Approach

Bonsai v3 combines age-based and genetic likelihoods using a weighted sum approach in log space:

combined_ll = genetic_ll + age_weight * age_ll
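The effect of this formula is easiest to see with numbers. In the sketch below all log-likelihood values are hypothetical, and the weight of 0.25 is simply a plausible value chosen for the example.

```python
def combine(genetic_ll, age_ll, age_weight=0.25):
    """combined_ll = genetic_ll + age_weight * age_ll, as in the formula above."""
    return genetic_ll + age_weight * age_ll

# Hypothetical numbers: genetics slightly favors half siblings,
# but a large age gap makes that hypothesis far less plausible.
half_sib = combine(genetic_ll=-100.0, age_ll=-20.0)
grandparent = combine(genetic_ll=-101.0, age_ll=-3.6)

print(half_sib)      # -105.0
print(grandparent)   # about -101.9
```

With genetic evidence alone the half-sibling hypothesis leads by one log unit; after the weighted age term is added, the grandparent hypothesis leads by about three.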

This integration occurs in the get_log_like() method of the PwLogLike class, which we explored in the previous lab:

def get_log_like(self, id1, id2, relationship_tuple, condition=True):
    """Calculate the log-likelihood of a relationship between two individuals.
    
    Args:
        id1, id2: IDs of the individuals
        relationship_tuple: (up, down, num_ancs) tuple
        condition: Whether to condition on observing IBD
        
    Returns:
        Log-likelihood of the relationship
    """
    # [Cache checking code...]
    
    # Get genetic likelihood
    gen_ll = self.get_pw_gen_ll(id1, id2, relationship_tuple, condition)
    
    # Get age-based likelihood if available
    if id1 in self.id_to_info and id2 in self.id_to_info:
        age1 = self.id_to_info[id1].get('age')
        age2 = self.id_to_info[id2].get('age')
        
        if age1 is not None and age2 is not None:
            # Calculate age weight
            age_weight = self._get_age_weight(id1, id2)
            
            # Get age-based likelihood
            age_ll = self.get_pw_age_ll(id1, id2, relationship_tuple)
            
            # Combine likelihoods
            ll = gen_ll + age_weight * age_ll
        else:
            ll = gen_ll
    else:
        ll = gen_ll
    
    # [Cache updating code...]
    return ll

The age_weight parameter is crucial as it controls the influence of age evidence relative to genetic evidence. This weight is dynamically determined by the _get_age_weight() method based on several factors:

def _get_age_weight(self, id1, id2):
    """Calculate weight for age-based likelihood.
    
    Args:
        id1, id2: IDs of the individuals
        
    Returns:
        Weight for age-based likelihood
    """
    # Base weight
    base_weight = 0.25  # Weight relative to genetic evidence
    
    # Adjust based on age information quality
    info1 = self.id_to_info[id1]
    info2 = self.id_to_info[id2]
    
    # Check for quality indicators
    age_quality1 = info1.get('age_quality', 1.0)
    age_quality2 = info2.get('age_quality', 1.0)
    age_uncertainty1 = info1.get('age_uncertainty', 0)
    age_uncertainty2 = info2.get('age_uncertainty', 0)
    
    # Reduce weight for less reliable age information
    if age_uncertainty1 > 0 or age_uncertainty2 > 0:
        # More uncertainty means less weight
        max_uncertainty = max(age_uncertainty1, age_uncertainty2)
        uncertainty_factor = 1.0 / (1.0 + 0.1 * max_uncertainty)
        base_weight *= uncertainty_factor
    
    # Use explicit quality indicators if available
    quality_factor = min(age_quality1, age_quality2)
    
    return base_weight * quality_factor

In the code shown here the weight starts at 0.25 and can only decrease (assuming quality values of at most 1.0): it shrinks as age uncertainty grows or as the quality indicators drop, so the highest weight goes to precise, reliable age information. This weighted approach ensures that genetic evidence remains the primary determinant of relationships, with age providing a complementary source of evidence.
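The weighting rule can be checked by hand with a standalone mirror of the method above. The function name age_weight and the sample inputs are illustrative; the constants 0.25 and 0.1 are the ones shown in the method.

```python
def age_weight(uncertainty1=0.0, uncertainty2=0.0,
               quality1=1.0, quality2=1.0, base_weight=0.25):
    """Standalone mirror of the weighting rule shown above."""
    max_uncertainty = max(uncertainty1, uncertainty2)
    if max_uncertainty > 0:
        # More uncertainty means less weight.
        base_weight *= 1.0 / (1.0 + 0.1 * max_uncertainty)
    # The less reliable of the two quality indicators sets the final scale.
    return base_weight * min(quality1, quality2)

print(age_weight())                               # 0.25
print(age_weight(uncertainty2=5))                 # 0.25 / 1.5
print(age_weight(uncertainty2=5, quality1=0.8))   # (0.25 / 1.5) * 0.8
```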

Using Age to Disambiguate Relationships

One of the most valuable applications of age-based modeling is disambiguating relationships with similar genetic signatures. Consider these relationships, all of which have similar expected IBD sharing (approximately 25%):

  • Half siblings
  • Grandparent-grandchild
  • Aunt/uncle-niece/nephew

Genetically, these relationships are difficult to distinguish, especially with limited data. However, they have distinct age difference patterns:

  • Half siblings: Mean age difference close to 0 years
  • Grandparent-grandchild: Mean age difference around 60 years
  • Aunt/uncle-niece/nephew: Mean age difference around 20-25 years
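The age patterns listed above are enough to rank the three hypotheses on age alone. The sketch below does this with a hand-written normal log-density. The means follow the list (22.5 is the midpoint of 20-25), while the standard deviations are assumptions chosen for illustration and are not taken from Bonsai v3. The age gap is the older person's age minus the younger person's age.

```python
import math

def normal_logpdf(x, mean, std_dev):
    z = (x - mean) / std_dev
    return -0.5 * z * z - math.log(std_dev * math.sqrt(2.0 * math.pi))

# (mean, std dev) per relationship; the std devs are illustrative assumptions.
AGE_MODELS = {
    "Half sibling": (0.0, 10.0),
    "Grandparent-grandchild": (60.0, 15.0),
    "Aunt/uncle-niece/nephew": (22.5, 15.0),
}

def rank_by_age(age_gap):
    """Return (relationship, log-likelihood) pairs, best first."""
    scores = {name: normal_logpdf(age_gap, mean, sd)
              for name, (mean, sd) in AGE_MODELS.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

for gap in (3, 25, 58):
    best, score = rank_by_age(gap)[0]
    print(f"gap of {gap} years -> {best} ({score:.2f})")
```

With these assumed parameters, gaps of 3, 25, and 58 years favor the half-sibling, aunt/uncle, and grandparent hypotheses respectively.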

Bonsai v3 leverages these age patterns to dramatically improve disambiguation accuracy. Consider this example case from the production code:

import math  # provides math.exp, used for the likelihood ratio below

def disambiguate_relationships(id1, id2, age1, age2, ibd_segments):
    """Example of using age to disambiguate relationships with similar genetic signatures.
    
    Args:
        id1, id2: Individual IDs
        age1, age2: Individual ages
        ibd_segments: List of IBD segments between individuals
        
    Returns:
        Most likely relationship and likelihood ratio
    """
    # Create biographical information
    bio_info = [
        {'genotype_id': id1, 'age': age1},
        {'genotype_id': id2, 'age': age2}
    ]
    
    # Initialize PwLogLike
    pw_ll = PwLogLike(bio_info=bio_info, unphased_ibd_seg_list=ibd_segments)
    
    # Relationships to test (similar genetic signatures)
    relationships = [
        ((1, 1, 1), "Half Sibling"),
        ((0, 2, 1), "Grandparent"),
        ((2, 0, 1), "Grandchild"),
        ((1, 2, 1), "Aunt/Uncle"),
        ((2, 1, 1), "Niece/Nephew")
    ]
    
    # Calculate likelihoods with and without age
    results = []
    for rel_tuple, rel_name in relationships:
        # Without age (genetic only): call the genetic component directly,
        # since get_log_like() as shown above always adds the age term
        gen_ll = pw_ll.get_pw_gen_ll(id1, id2, rel_tuple, True)
        
        # With age (combined)
        combined_ll = pw_ll.get_log_like(id1, id2, rel_tuple)
        
        # Age contribution
        age_contribution = combined_ll - gen_ll
        
        results.append({
            'relationship': rel_name,
            'genetic_ll': gen_ll,
            'combined_ll': combined_ll,
            'age_contribution': age_contribution
        })
    
    # Sort by combined likelihood
    results.sort(key=lambda x: x['combined_ll'], reverse=True)
    
    # Calculate likelihood ratio between top two
    if len(results) >= 2:
        top_ll = results[0]['combined_ll']
        second_ll = results[1]['combined_ll']
        likelihood_ratio = math.exp(top_ll - second_ll)
    else:
        likelihood_ratio = float('inf')
    
    return results[0]['relationship'], likelihood_ratio
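The function above reports only the top relationship and a likelihood ratio. If normalized probabilities over all candidates are wanted, one common option is the log-sum-exp trick sketched below. The helper name normalize_log_likelihoods, the equal-prior assumption, and the sample numbers are all hypothetical and are not part of Bonsai v3.

```python
import math

def normalize_log_likelihoods(log_likes):
    """Convert {name: log-likelihood} into probabilities that sum to 1.

    Assumes equal prior weight on every hypothesis (an assumption made
    for this sketch). Subtracting the maximum before exponentiating
    avoids underflow when the log-likelihoods are large and negative.
    """
    finite = [ll for ll in log_likes.values() if ll != float("-inf")]
    if not finite:
        raise ValueError("every hypothesis was ruled out")
    peak = max(finite)
    weights = {name: math.exp(ll - peak) for name, ll in log_likes.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical combined log-likelihoods for three candidate relationships.
example = {
    "Half Sibling": -1205.0,
    "Grandparent": -1201.0,
    "Aunt/Uncle": float("-inf"),   # ruled out by the age constraint
}
for name, prob in normalize_log_likelihoods(example).items():
    print(f"{name}: {prob:.3f}")
```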

In real-world testing, age-based disambiguation has been shown to improve relationship classification accuracy by 15-30% for these ambiguous relationship types, demonstrating the power of integrating multiple evidence sources.

Implementation Details and Extensions

Sex-Specific Age Models

Bonsai v3's production code includes sex-specific age models that account for biological sex differences in reproductive patterns. For example, the timing of parenthood tends to differ between males and females:

def _get_sex_specific_age_parameters(self, relationship_tuple, sex1, sex2):
    """Get sex-specific age parameters for a relationship.
    
    Args:
        relationship_tuple: (up, down, num_ancs) tuple
        sex1, sex2: Biological sex of individuals ('M', 'F', or None)
        
    Returns:
        Adjusted age parameters
    """
    # Get base parameters
    params = self._get_age_parameters(relationship_tuple)
    
    # No sex information or not a relationship where it matters
    if sex1 is None or sex2 is None:
        return params
        
    up, down, num_ancs = relationship_tuple
    
    # Adjust parent-child relationships
    if up == 0 and down == 1:  # id1 is parent of id2
        if sex1 == 'M':  # Father-child
            params['mean'] += 3  # Fathers tend to be slightly older
        elif sex1 == 'F':  # Mother-child
            params['mean'] -= 3  # Mothers tend to be slightly younger
            params['std_dev'] -= 1  # With less variance
            
    elif up == 1 and down == 0:  # id1 is child of id2
        if sex2 == 'M':  # Child-father
            params['mean'] -= 3  # Inverse of above
        elif sex2 == 'F':  # Child-mother
            params['mean'] += 3
            params['std_dev'] -= 1
    
    # [Additional relationship-specific adjustments...]
    
    return params

These sex-specific adjustments are based on demographic research showing systematic differences in parental age patterns between males and females across many populations. They are particularly valuable for distinguishing maternal from paternal relationships when both genetic and age evidence are available.
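A small standalone example shows the size of these adjustments for the parent-child case. The function name adjust_for_parent_sex is hypothetical; the shifts of 3 years and 1 year are the ones in the method above. Unlike the method, this sketch returns a copy so the base parameters stay unchanged.

```python
import copy

BASE_PARENT_CHILD = {"mean": 30, "std_dev": 10, "direction": 1}

def adjust_for_parent_sex(params, parent_sex):
    """Return a shifted copy of parent-child parameters; the input is not modified."""
    adjusted = copy.copy(params)
    if parent_sex == "M":      # father: slightly larger expected gap
        adjusted["mean"] += 3
    elif parent_sex == "F":    # mother: slightly smaller gap, less spread
        adjusted["mean"] -= 3
        adjusted["std_dev"] -= 1
    return adjusted

print(adjust_for_parent_sex(BASE_PARENT_CHILD, "M"))
print(adjust_for_parent_sex(BASE_PARENT_CHILD, "F"))
print(adjust_for_parent_sex(BASE_PARENT_CHILD, None))
```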

Cultural and Historical Calibration

The age models in Bonsai v3 can be calibrated for different populations and historical periods. This is important because age patterns vary across cultures and have changed over time. The implementation includes parameters that can be adjusted based on population context:

def _adjust_for_population_context(self, params, population_context):
    """Adjust age parameters based on population context.
    
    Args:
        params: Base age parameters
        population_context: Dictionary with context information
        
    Returns:
        Adjusted parameters
    """
    # No context information
    if not population_context:
        return params
        
    # Extract context information
    era = population_context.get('era')
    culture = population_context.get('culture')
    
    # Historical era adjustments
    if era == 'pre_1900':
        # Earlier parenthood in historical eras
        if params['direction'] != 0:  # Parent-child related
            params['mean'] *= 0.85  # About 15% younger
            
    elif era == 'post_1980':
        # Later parenthood in modern era
        if params['direction'] != 0:  # Parent-child related
            params['mean'] *= 1.1  # About 10% older
    
    # Cultural adjustments
    if culture in ['A', 'B', 'C']:  # Specific cultural patterns
        # Apply culture-specific adjustments
        adjustment_factor = CULTURE_ADJUSTMENTS.get(culture, 1.0)
        params['mean'] *= adjustment_factor
    
    return params

These contextual adjustments make Bonsai v3's age modeling system adaptable to diverse genealogical scenarios, from modern populations to historical pedigree reconstruction where age patterns may differ significantly from contemporary norms.
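The era scaling can be illustrated with a short standalone sketch. The names ERA_FACTORS and scale_mean_for_era are assumptions made for this example; the factors 0.85 and 1.1 are the ones in the method above, and the culture-specific table is left out because its values are not given.

```python
ERA_FACTORS = {"pre_1900": 0.85, "post_1980": 1.1}   # factors from the method above

def scale_mean_for_era(params, era):
    """Return a copy of params with the mean scaled for directional relationships."""
    adjusted = dict(params)
    if adjusted["direction"] != 0:
        adjusted["mean"] *= ERA_FACTORS.get(era, 1.0)
    return adjusted

parent_child = {"mean": 30, "std_dev": 10, "direction": 1}
siblings = {"mean": 0, "std_dev": 10, "direction": 0}

print(scale_mean_for_era(parent_child, "pre_1900")["mean"])    # about 25.5
print(scale_mean_for_era(parent_child, "post_1980")["mean"])   # about 33
print(scale_mean_for_era(siblings, "pre_1900")["mean"])        # 0, unchanged
```

Only relationships with a directional constraint are rescaled, which mirrors the direction check in the method.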

Core Component: Bonsai v3's age-based relationship modeling provides a powerful complement to genetic evidence, helping to resolve ambiguities and improve relationship inference accuracy. By modeling age differences as relationship-specific normal distributions and integrating them with genetic evidence through a weighted likelihood approach, the system leverages all available information for more accurate pedigree reconstruction.

Comparing Notebook and Production Code

The Lab 08 notebook provides a simplified exploration of age-based relationship modeling, while the production implementation in Bonsai v3 includes additional sophistication.

The notebook provides a valuable introduction to the key concepts, but the production implementation represents years of refinement to handle the complexities of real-world demographic patterns across diverse human populations.

Interactive Lab Environment

The interactive Lab 08 notebook runs in Google Colab. Data will be automatically downloaded from S3 when you run the notebook.

Note: You may need a Google account to save your work in Google Drive.

Beyond the Code

As you explore age-based relationship modeling, consider its broader implications. Age-based modeling represents not just a technical enhancement but a bridge between genetic data and human demographic patterns, connecting the molecular to the social aspects of genealogy.

This lab is part of the Bonsai v3 Deep Dive track:

  • Lab 01: Introduction
  • Lab 02: Architecture
  • Lab 03: IBD Formats
  • Lab 04: Statistics
  • Lab 05: Models
  • Lab 06: Relationships
  • Lab 07: PwLogLike
  • Lab 08: Age Modeling
  • Lab 09: Data Structures
  • Lab 10: Up-Node Dict