Computational Genetic Genealogy

Lab 22: Interpreting Results and Confidence Measures

Core Component: This lab explores how Bonsai v3 interprets genetic data to make relationship predictions and assigns confidence levels to these predictions. Understanding the statistical foundation of relationship inference and confidence assessment is essential for making reliable genealogical conclusions from genetic data.

The Challenge of Interpretation in Genetic Genealogy

Why Confidence Matters

Genetic genealogy involves making inferences about relationships based on patterns of DNA sharing. However, these inferences come with varying degrees of certainty due to several inherent challenges:

  • Biological Randomness: The random nature of genetic recombination creates variability in DNA sharing, even among relatives of the same degree
  • Overlap in Patterns: Different relationships can produce similar patterns of DNA sharing
  • Data Quality Issues: Testing and analysis limitations can introduce noise and uncertainty
  • Complexity: Real-world family structures often include complex relationships that don't fit simple models

Given these challenges, relationship predictions reported without confidence measures invite misinterpretation and incorrect genealogical conclusions. Bonsai v3 addresses this by attaching likelihood-based confidence measures to its predictions, so that the uncertainty is quantified and communicated alongside each result.

Likelihood-Based Inference

The Statistical Foundation of Relationship Prediction

At the core of Bonsai's relationship prediction is the computation of likelihoods—statistical measures of how well the observed data fits different relationship hypotheses.

The Likelihood Framework

For a given pair of individuals, Bonsai calculates the likelihood of observing their shared DNA data under different relationship hypotheses:

P(Data | Relationship) = Likelihood of the relationship hypothesis, i.e., the probability of the observed data if that relationship were true

This likelihood-based approach has several advantages:

  • It provides a principled way to compare different relationship hypotheses
  • It can incorporate multiple types of evidence (genetic, age, etc.)
  • It naturally leads to confidence measures based on likelihood ratios

Likelihood Calculation in Bonsai

Bonsai's PwLogLike class implements likelihood calculations for pairwise relationships:

class PwLogLike:
    """
    Class for computing pairwise log-likelihoods for different relationship
    hypotheses.
    
    This class combines genetic and age-based likelihoods to determine
    the most likely relationship between a pair of individuals.
    """
    
    def get_pw_gen_ll(self, node1, node2, rel_tuple):
        """
        Calculate the genetic component of the log-likelihood.
        
        Args:
            node1: ID of the first individual
            node2: ID of the second individual
            rel_tuple: Relationship tuple (up, down, num_ancestors)
            
        Returns:
            Log-likelihood score for the genetic component
        """
        # Implementation calculates likelihood based on IBD sharing patterns
        
    def get_pw_age_ll(self, node1, node2, rel_tuple):
        """
        Calculate the age component of the log-likelihood.
        
        Args:
            node1: ID of the first individual
            node2: ID of the second individual
            rel_tuple: Relationship tuple (up, down, num_ancestors)
            
        Returns:
            Log-likelihood score for the age component
        """
        # Implementation calculates likelihood based on age differences

The total log-likelihood is typically the sum of the genetic and age-based components, following the principle that if these sources of evidence are independent, their log-likelihoods can be added:

total_ll = genetic_ll + age_ll
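
As a minimal sketch of this additive combination, the snippet below sums genetic and age log-likelihoods for a few hypotheses and ranks them. The relationship names, the helper combine_log_likelihoods, and every numeric value are assumptions made for illustration; none of them is output of PwLogLike.

```python
# Hypothetical per-hypothesis log-likelihood components (illustrative values only).
genetic_ll = {"half_sibling": -4.2, "grandparent": -4.0, "aunt_uncle": -4.5}
age_ll = {"half_sibling": -2.1, "grandparent": -9.3, "aunt_uncle": -3.0}


def combine_log_likelihoods(genetic, age):
    """Sum the two components for each hypothesis (assumes they are independent)."""
    return {rel: genetic[rel] + age[rel] for rel in genetic}


total_ll = combine_log_likelihoods(genetic_ll, age_ll)

# Rank hypotheses from most to least likely.
ranked = sorted(total_ll, key=total_ll.get, reverse=True)
best = ranked[0]
print(ranked)
```

In this made-up case the grandparent hypothesis has the best genetic score, but its poor age score pushes it to last place once the components are added.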

Working in Log Space

Bonsai performs most likelihood calculations in logarithmic space (using log-likelihoods) rather than raw likelihoods. This approach has several advantages:

  • Prevents numerical underflow when working with very small probabilities
  • Converts multiplication of probabilities to addition of log-probabilities
  • Improves numerical stability in optimization procedures
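
The underflow problem is easy to reproduce. The sketch below multiplies many small probabilities directly and then sums their logarithms instead; the count of 400 terms and the value 0.01 are arbitrary choices for illustration.

```python
import math

# 400 independent pieces of evidence, each with probability 0.01 (illustrative).
probs = [0.01] * 400

# Direct multiplication: the true value (1e-800) is far below the smallest
# representable double, so the product collapses to zero.
raw_product = math.prod(probs)

# Summing logs keeps the same information in a well-behaved range.
log_sum = sum(math.log(p) for p in probs)

print(raw_product)
print(log_sum)
```

Once the raw product reaches zero, all hypotheses look equally impossible and can no longer be compared, whereas the log-likelihoods remain finite and comparable.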

Confidence Intervals for Relationship Degree

Quantifying Uncertainty in Degree Estimation

When inferring the degree of relationship between individuals, it's important to capture the uncertainty in the estimate. Bonsai v3 calculates confidence intervals to provide a range of plausible relationship degrees rather than just a point estimate.

The get_total_ibd_deg_lbd_pt_ubd Function

This function computes a confidence interval for the degree of relationship based on the total amount of shared IBD:

def get_total_ibd_deg_lbd_pt_ubd(a, L, condition=True, alpha=0.05):
    """
    Calculate a confidence interval for relationship degree based on total
    IBD length.
    
    Args:
        a: Number of common ancestors (1 for most relationships)
        L: Total IBD length in centiMorgans
        condition: Whether to condition on observing at least one segment
        alpha: Significance level (default: 0.05 for 95% confidence)
        
    Returns:
        Tuple of (lower_bound, point_estimate, upper_bound)
    """

The confidence interval provides valuable information about the reliability of the degree estimate:

  • A narrow interval indicates high confidence in the estimated degree
  • A wide interval suggests considerable uncertainty
  • The interval may span multiple degrees, indicating several plausible relationships

Example: Confidence Interval Interpretation

Consider a confidence interval for relationship degree:

  • Lower bound: 3.2 meioses
  • Point estimate: 4.5 meioses
  • Upper bound: 6.1 meioses

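This interval spans degrees 3 to 6. Under the common degree convention (first cousins are degree 3 and second cousins are degree 5), the relationship could be anything from first cousin to second cousin once removed, and the point estimate of 4.5 falls between first cousin once removed and second cousin. This uncertainty is valuable information for the genealogist.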

Bonsai computes these confidence intervals using a statistical approach based on the Highest Posterior Density (HPD) interval: the narrowest set of degree values that together contain the specified probability mass (e.g., 95%).
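
To make the HPD idea concrete, here is a generic sketch over a discrete set of degrees. It is not Bonsai's implementation, which this lab does not show; the function name hpd_set and the posterior values are hypothetical.

```python
def hpd_set(posterior, mass=0.95):
    """Return the smallest set of degrees whose total probability is >= mass.

    posterior: dict mapping degree -> probability (assumed to sum to 1).
    """
    chosen, total = [], 0.0
    # Add degrees from most to least probable until the target mass is reached.
    by_probability = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    for degree, prob in by_probability:
        chosen.append(degree)
        total += prob
        if total >= mass:
            break
    return sorted(chosen), total


# Hypothetical posterior over relationship degree (illustrative values only).
posterior = {2: 0.01, 3: 0.14, 4: 0.40, 5: 0.30, 6: 0.12, 7: 0.03}
degrees, covered = hpd_set(posterior, mass=0.95)
print(degrees, covered)
```

With these values the 95% set is degrees 3 through 6. Because degrees are added in order of probability, no smaller set of degrees can reach the same mass.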

Age-Based Constraints

Enhancing Predictions with Age Information

Age information provides valuable constraints that can significantly improve relationship inference. Different relationships have different characteristic age differences, and biological constraints limit certain relationships (e.g., parents must be older than children).

Age-Based Likelihood Component

Bonsai incorporates age information through an age-based likelihood component:

from scipy import stats

def get_age_log_like(age1, age2, rel_tuple):
    """
    Calculate the log-likelihood of observing a particular age difference
    given a relationship type.
    
    Args:
        age1: Age of the first individual
        age2: Age of the second individual
        rel_tuple: Relationship tuple (up, down, num_ancestors)
        
    Returns:
        Log-likelihood score
    """
    # Get mean and standard deviation for the age difference
    # distribution for this relationship
    mean, std = get_age_mean_std_for_rel_tuple(rel_tuple)
    
    # Calculate log-likelihood using normal distribution
    age_diff = age1 - age2
    return stats.norm.logpdf(age_diff, loc=mean, scale=std)

This age-based component is combined with the genetic component to give a more complete assessment of relationship likelihood. Typical age differences and constraints for common relationships are:

Relationship     Typical Age Difference   Constraints
Parent-Child     20-40 years              Parent must be older than child
Full Siblings    0-10 years               No strict constraints
Grandparent      40-80 years              Grandparent must be older than grandchild
Aunt/Uncle       5-50 years               Usually older than niece/nephew
First Cousins    0-30 years               No strict constraints

Impact of Age Constraints

Age information can significantly improve relationship predictions, especially when genetic evidence alone is ambiguous. For example:

  • Half-siblings and grandparent-grandchild relationships can have similar genetic patterns but very different age distributions
  • Aunt/uncle and half-sibling relationships may be distinguished by age differences
  • Age constraints can rule out biologically impossible relationships even when genetic data suggests them
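
The first point above can be illustrated with a small, self-contained sketch of a normal age-difference model. The (mean, std) values in AGE_DIFF_PARAMS are assumptions chosen for illustration, not Bonsai's fitted parameters, and the normal log-density is written out with the standard library rather than scipy so the snippet has no dependencies.

```python
import math

# Hypothetical (mean, std) of the age difference in years per relationship.
# These numbers are assumptions for illustration, not Bonsai's fitted values.
AGE_DIFF_PARAMS = {
    "half_sibling": (0.0, 10.0),
    "grandparent": (55.0, 10.0),
}


def normal_logpdf(x, mean, std):
    """Log density of a normal distribution with the given mean and std."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def age_log_like(age1, age2, relationship):
    """Log-likelihood of the observed age difference under one relationship."""
    mean, std = AGE_DIFF_PARAMS[relationship]
    return normal_logpdf(age1 - age2, mean, std)


# A 70-year-old and an 18-year-old: an age difference of 52 years.
ll_half_sibling = age_log_like(70, 18, "half_sibling")
ll_grandparent = age_log_like(70, 18, "grandparent")
print(ll_half_sibling, ll_grandparent)
```

Under these assumed parameters the grandparent hypothesis gains about 13.5 log-likelihood units over half-sibling from age alone, which would outweigh a small genetic preference in the other direction.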

Multiple Hypothesis Testing

Comparing Alternative Relationship Hypotheses

In genetic genealogy, we often need to compare multiple relationship hypotheses to determine the most likely connection between individuals. Bonsai v3 implements several approaches for systematic hypothesis testing.

Bayes Factors

One key method for comparing hypotheses is the Bayes factor, which is the ratio of the likelihoods of two competing hypotheses:

BF = P(Data | Hypothesis1) / P(Data | Hypothesis2)

In log space, this becomes:

log(BF) = log(P(Data | Hypothesis1)) - log(P(Data | Hypothesis2))

The Bayes factor measures the strength of evidence for Hypothesis1 over Hypothesis2: values above 1 favor Hypothesis1, values below 1 favor Hypothesis2, and values further from 1 indicate stronger evidence.

Bayes Factor Range    Interpretation
1-3                   Barely worth mentioning; very weak evidence
3-10                  Substantial evidence
10-30                 Strong evidence
30-100                Very strong evidence
>100                  Decisive evidence
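
The scale tabulated above can be applied mechanically once a log Bayes factor is known. The sketch below is one way to do that; the function names are hypothetical, and the treatment of boundary values (for example, exactly 3 counting as "substantial") is an assumption, since the table does not say which range a boundary belongs to.

```python
import math


def log_bayes_factor(ll_h1, ll_h2):
    """Log Bayes factor of hypothesis 1 over hypothesis 2."""
    return ll_h1 - ll_h2


def interpret_bayes_factor(bf):
    """Map a Bayes factor (>= 1) onto the evidence scale."""
    if bf < 1:
        raise ValueError("expected bf >= 1; swap the hypotheses so the favored one is first")
    if bf < 3:
        return "Barely worth mentioning; very weak evidence"
    if bf < 10:
        return "Substantial evidence"
    if bf < 30:
        return "Strong evidence"
    if bf <= 100:
        return "Very strong evidence"
    return "Decisive evidence"


# Illustrative log-likelihoods: a difference of 3 log units.
bf = math.exp(log_bayes_factor(-3.0, -6.0))
label = interpret_bayes_factor(bf)
print(round(bf, 2), label)
```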

Posterior Probabilities

Another approach is to calculate posterior probabilities for multiple hypotheses, which represent the probability of each relationship given the observed data:

P(Relationship | Data) ∝ P(Data | Relationship) × P(Relationship)

This approach allows for incorporating prior probabilities (P(Relationship)) that reflect existing knowledge or beliefs about the relationships. The posterior probabilities are normalized to sum to 1, providing a probability distribution over all considered relationships.
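
A minimal sketch of this normalization, working from log-likelihoods, is shown below. The function name posterior_probabilities, the relationship labels, and all numeric values are hypothetical. Subtracting the maximum before exponentiating is a standard guard against underflow and does not change the normalized result.

```python
import math


def posterior_probabilities(log_likelihoods, priors=None):
    """Normalize exp(log-likelihood) * prior over all hypotheses considered."""
    rels = list(log_likelihoods)
    if priors is None:
        # Equal priors when nothing else is known.
        priors = {rel: 1.0 / len(rels) for rel in rels}
    log_post = {rel: log_likelihoods[rel] + math.log(priors[rel]) for rel in rels}
    # Shift by the maximum so the largest term exponentiates to exactly 1.
    peak = max(log_post.values())
    weights = {rel: math.exp(lp - peak) for rel, lp in log_post.items()}
    total = sum(weights.values())
    return {rel: w / total for rel, w in weights.items()}


# Illustrative values only.
log_likes = {"half_sibling": -5.0, "aunt_uncle": -6.0, "grandparent": -9.0}
flat = posterior_probabilities(log_likes)

# An informative prior (for example from documentary evidence) can change the ranking.
priors = {"half_sibling": 0.1, "aunt_uncle": 0.8, "grandparent": 0.1}
informed = posterior_probabilities(log_likes, priors)
print(flat)
print(informed)
```

With equal priors the half-sibling hypothesis leads; with the assumed informative prior the aunt/uncle hypothesis overtakes it, which shows how strongly the posterior can depend on P(Relationship).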

Example: Multiple Hypothesis Testing

Consider a case where we observe 850 cM of shared DNA between two individuals:

from math import exp

# Log-likelihoods for each hypothesis (illustrative values)
parent_child_ll = -32.5  # Very low; this sharing is too low for parent-child
full_sibling_ll = -28.7  # Low; this sharing is too low for full siblings
half_sibling_ll = -4.2   # High; this sharing is typical for half-siblings
first_cousin_ll = -3.8   # Highest; this sharing is typical for first cousins
second_cousin_ll = -12.1 # Moderate; this sharing is high for second cousins

# Posterior probabilities (assuming equal priors), obtained by
# exponentiating each log-likelihood and normalizing to sum to 1
posterior_parent_child = 0.0000
posterior_full_sibling = 0.0000
posterior_half_sibling = 0.4013
posterior_first_cousin = 0.5986
posterior_second_cousin = 0.0001

# Bayes factor: first cousin vs half-sibling
bf = exp(first_cousin_ll - half_sibling_ll)  # exp(0.4) ≈ 1.49

Interpretation: First cousin is the most likely relationship (60% probability), closely followed by half-sibling (40%). The Bayes factor of 1.49 indicates very weak evidence favoring first cousin over half-sibling. In this case, both relationships should be considered plausible explanations.

Visualizing Uncertainty

Communicating Confidence Visually

Effectively communicating uncertainty in relationship predictions helps users make informed decisions. Bonsai v3 supports several approaches for visualizing confidence and uncertainty:

1. Probability Distribution Visualization

Visualizing the full distribution of relationship probabilities provides a comprehensive view of the evidence:

  • Bar charts showing posterior probabilities for different relationships
  • Highlighting the most likely relationships
  • Showing the relative probabilities of alternative hypotheses
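
As a dependency-free sketch of such a bar chart, the function below renders posterior probabilities as plain text, most probable relationship first. It is an illustration only and is not part of Bonsai; the function name and the example probabilities are hypothetical.

```python
def text_bar_chart(posteriors, width=40):
    """Render posterior probabilities as a plain-text horizontal bar chart."""
    lines = []
    label_width = max(len(name) for name in posteriors)
    # Sort so the most probable relationship appears first.
    ordered = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    for name, prob in ordered:
        bar = "#" * round(prob * width)
        lines.append(f"{name:<{label_width}}  {bar} {prob:.1%}")
    return "\n".join(lines)


# Illustrative posterior probabilities.
chart = text_bar_chart({"half-sibling": 0.25, "first cousin": 0.50, "aunt/uncle": 0.25})
print(chart)
```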

2. Confidence Color Coding

In pedigree visualizations, confidence levels can be encoded using color:

  • Green for high-confidence relationships (e.g., Bayes factor > 100)
  • Yellow for moderate-confidence relationships (e.g., Bayes factor 10-100)
  • Red for low-confidence relationships (e.g., Bayes factor < 10)
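
The example thresholds above translate directly into a lookup. This is a sketch with a hypothetical function name; assigning a Bayes factor of exactly 100 to yellow is an assumption, since the ranges above leave that boundary open.

```python
def confidence_color(bayes_factor):
    """Map a Bayes factor to a display color using the example thresholds."""
    if bayes_factor > 100:
        return "green"   # high confidence
    if bayes_factor >= 10:
        return "yellow"  # moderate confidence
    return "red"         # low confidence


print([confidence_color(bf) for bf in (250, 40, 1.5)])
```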

3. Confidence Interval Visualization

For degree estimation, confidence intervals can be visualized to show the range of plausible relationships:

  • Horizontal bars showing the confidence interval range
  • Point estimates highlighted within the interval
  • Relationship labels for different degrees within the interval

4. Alternative Pedigree Hypotheses

For complex cases, visualizing multiple plausible pedigree structures can be valuable:

  • Side-by-side comparison of alternative pedigree structures
  • Annotating each alternative with its posterior probability
  • Highlighting differences between alternative structures

Uncertainty Visualization Best Practices

Effective uncertainty visualization follows certain principles:

  • Use clear visual encodings that are intuitive to understand
  • Provide context for interpreting confidence levels
  • Allow comparison between alternatives
  • Avoid visual elements that could be misleading
  • Include both quantitative measures and qualitative interpretations

Practical Interpretation Guidelines

Making Sense of Relationship Predictions

Beyond the technical aspects of confidence calculation, there are practical guidelines for interpreting relationship predictions in genetic genealogy:

1. Prioritize High-Confidence Predictions

Focus first on relationships with high confidence levels:

  • Build your pedigree foundation with the most reliable connections
  • Use high-confidence relationships as anchors for placing uncertain relationships
  • Document confidence levels to distinguish facts from hypotheses

2. Consider Multiple Lines of Evidence

Strengthen conclusions by incorporating multiple sources of information:

  • Genetic evidence (total cM, number of segments, segment sizes)
  • Age and demographic information
  • Documentary evidence from traditional genealogy
  • Input from additional relatives who have tested

3. Be Aware of Confounding Factors

Recognize situations that can complicate relationship inference:

  • Endogamy: Marriage within a relatively closed community can inflate DNA sharing
  • Population structure: Shared ancestry within a population can create background relatedness
  • Multiple relationships: Individuals can be related through multiple pathways
  • Adoption and misattributed parentage: Documentary relationships may not match genetic relationships

4. Interpret Confidence Appropriately

Understand what confidence measures mean in practical terms:

  • High confidence doesn't always mean correctness, especially if key information is missing
  • Low confidence doesn't necessarily mean the prediction is wrong, only that alternatives are plausible
  • Confidence measures are relative to the hypotheses being considered

Example: Interpreting a Complex Case

Consider this analysis of a relationship with mixed signals:

  • Genetic data: 900 cM shared across 30 segments
  • Age difference: 25 years (older individual is 65, younger is 40)
  • Top relationship hypotheses:
    • Half-sibling (30% probability)
    • Aunt/Uncle (25% probability)
    • First cousin (40% probability)
    • Others (5% probability)

Interpretation: This is a case with significant ambiguity. First cousin is the leading hypothesis at 40%, but no single relationship reaches even 50%, and the genetic data alone is consistent with several relationship types. The age difference slightly favors aunt/uncle or first cousin over half-sibling but isn't decisive. Testing additional shared relatives would be the recommended next step to resolve the ambiguity.

Conclusion and Next Steps

Interpreting results and understanding confidence measures are essential skills for computational genetic genealogy. Bonsai v3 provides sophisticated statistical methods for quantifying confidence in relationship predictions, empowering users to make informed decisions about family connections based on genetic data.

The likelihood-based approach, confidence intervals, age constraints, and multiple hypothesis testing provide a comprehensive framework for relationship inference that acknowledges and quantifies uncertainty. By properly interpreting these confidence measures and following the practical guidelines outlined in this lab, users can build more reliable family trees and make stronger genealogical conclusions.

In the next lab, we'll explore how Bonsai v3 handles twins and close relatives, which present special challenges and opportunities for genetic genealogy analysis.

Interactive Lab Environment

Run the interactive Lab 22 notebook in Google Colab:

Data will be automatically downloaded from S3 when you run the notebook.

Note: You may need a Google account to save your work in Google Drive.

Open Lab 22 Notebook in Google Colab