Computational Genetic Genealogy

Probabilistic Relationship Inference

Lab 06: Probabilistic Relationship Inference

Bayesian Inference: This lab examines how Bonsai v3 transforms observed IBD patterns into probabilistic assessments of relationships using statistical models and Bayesian inference. Understanding this process is key to grasping the system's core functionality and interpreting its outputs.

The Mathematical Framework

Relationship Representation

Bonsai v3 uses a compact tuple representation for genetic relationships that captures the essential genealogical structure. There are two equivalent formats used in different parts of the codebase:

  1. Three-value tuple: (up, down, num_ancs)
    • up: Generations from person 1 to common ancestor
    • down: Generations from common ancestor to person 2
    • num_ancs: Number of common ancestors (1 for half-relationships, 2 for full relationships)
  2. Five-value tuple: (degree1, removal1, degree2, removal2, half)
    • degree1: Genealogical degree of person 1 (ancestors to common ancestor)
    • removal1: Removal for person 1 (generational distance within degree)
    • degree2: Genealogical degree of person 2 (ancestors to common ancestor)
    • removal2: Removal for person 2 (generational distance within degree)
    • half: 1 for half relationships (one common ancestor), 0 for full relationships (two common ancestors)

The three-value tuple representation is more commonly used in the core algorithmic components. For example, the function get_simple_rel_tuple() in pedigrees.py returns this representation when computing relationships in pedigree structures:

def get_simple_rel_tuple(up_node_dict, id1, id2):
    """Compute the relationship tuple between two individuals in a pedigree.
    
    Args:
        up_node_dict: Dictionary representing the pedigree structure
        id1, id2: IDs of the individuals to compare
        
    Returns:
        Tuple of (up, down, num_ancs) representing the relationship
    """

This representation directly encodes the meiotic distance between individuals (up + down), which is the primary determinant of expected IBD sharing patterns. The tuple approach allows the system to represent any possible relationship within a consistent mathematical framework, enabling efficient comparison and computation.

Bayesian Inference for Relationships

At its core, Bonsai v3 implements a Bayesian approach to relationship inference. For two individuals with observed IBD data D, the posterior probability of relationship R is calculated as:

P(R|D) = \frac{P(D|R) \cdot P(R)}{P(D)}

Where:

  • P(R|D) is the posterior probability of relationship R given the observed IBD data
  • P(D|R) is the likelihood of observing the IBD data given relationship R
  • P(R) is the prior probability of relationship R
  • P(D) is the marginal probability of the IBD data

Since the goal is to compare different possible relationships for the same IBD data, the denominator P(D) is constant across all relationship hypotheses. This allows Bonsai to focus on the numerator, comparing relationships based on the product of likelihood and prior:

P(R|D) \propto P(D|R) \cdot P(R)

In practice, Bonsai v3 performs these calculations in log space to avoid numerical underflow with very small probabilities:

\log P(R|D) \propto \log P(D|R) + \log P(R)

This Bayesian framework is implemented in the get_relationship_log_like() method of the PwLogLike class, which combines log-likelihoods from multiple sources of evidence with relationship priors to compute posterior log-probabilities.

Relationship Priors

Bonsai v3 incorporates prior probabilities for different relationship types, reflecting the relative frequency of these relationships in typical human populations. These priors are implemented in the get_rel_ll_prior() function in the likelihoods.py module:

def get_rel_ll_prior(rel_tuple):
    """Calculate prior log-likelihood for a relationship tuple.
    
    Args:
        rel_tuple: Tuple of (up, down, num_ancs)
        
    Returns:
        Prior log-likelihood
    """
    up, down, num_ancs = rel_tuple
    
    # Calculate meiotic distance
    m = up + down
    
    # No relationship has a very low prior
    if rel_tuple is None:
        return -20.0
    
    # Self-comparison has a low prior in most contexts
    if m == 0 and num_ancs == 2:  # Self
        return -10.0
    
    # Prior decreases with increasing meiotic distance
    base_prior = -0.5 * m
    
    # Full relationships are more common than half relationships
    anc_prior = 0.5 if num_ancs == 2 else 0.0
    
    # Symmetric relationships (up ≈ down) are more common
    sym_prior = -0.2 * abs(up - down)
    
    return base_prior + anc_prior + sym_prior

This prior model encodes several important insights about human kinship patterns:

  • The probability of relationship decreases with increasing genealogical distance
  • Full relationships (sharing two common ancestors) are more common than half relationships
  • Symmetric relationships (same generation) are more common than asymmetric ones

The prior serves as a regularization mechanism, guiding the inference process toward more plausible relationships when the genetic evidence alone is ambiguous. This is particularly important for distant relationships where IBD sharing patterns can be highly variable.

Statistical Moments of IBD Distributions

The Lambda Parameter

The lambda parameter in Bonsai v3 controls the distribution of IBD segment lengths. For a relationship with meiotic distance m, the lambda parameter is calculated as:

\lambda = \frac{m}{100}

This parameter has a direct biological interpretation: it represents the rate of decrease in the probability of observing longer segments, reflecting the increased chances of recombination breaking up segments over more meiotic events.

The lambda parameter determines several key properties of the expected IBD distribution:

  • The expected mean segment length is approximately 100/m centimorgans
  • The probability density function for segment length x is λe-λx
  • The probability that a segment exceeds length t is e-λt

The get_lam_a_m() function in likelihoods.py implements this calculation with additional refinements for specific relationship types:

def get_lam_a_m(g, num_ancs=1, min_len=7.0):
    """Calculate lambda parameter for segment length distribution.
    
    Args:
        g: Meiotic distance (up + down)
        num_ancs: Number of common ancestors
        min_len: Minimum segment length
        
    Returns:
        Lambda parameter
    """
    # Base lambda is meiotic distance / 100
    lam = g / 100.0
    
    # Adjust for specific relationship patterns
    if g == 2 and num_ancs == 2:  # Full siblings
        lam *= 0.9  # Empirical adjustment for full siblings
    
    return lam

This parameter is central to Bonsai's relationship inference capabilities, as it directly ties the observable pattern of segment lengths to the underlying genealogical relationship.

The Eta Parameter

The eta parameter represents the expected number of IBD segments for a relationship. For a relationship with meiotic distance m and a common ancestors, eta is calculated as:

\eta = \frac{a(rm+c)}{2^{m-1}} \cdot e^{-\frac{m \cdot t}{100}}

Where:

  • r is the recombination rate (~34 crossovers per meiosis for the human genome)
  • c is the number of chromosomes (22 autosomes)
  • t is the minimum detectable segment length

This formula has a strong theoretical foundation in genetic inheritance models:

  • The term rm+c represents the expected number of segments before any filtering
  • The division by 2m-1 accounts for the probability of inheriting segments through specific lineages
  • The exponential term accounts for segments that fall below the detection threshold

The get_eta() function in moments.py implements this calculation:

def get_eta(g, r=34.0, c=22, a=1):
    """Calculate expected number of IBD segments.
    
    Args:
        g: Meiotic distance
        r: Recombination rate per meiosis
        c: Number of chromosomes
        a: Number of common ancestors
        
    Returns:
        Expected number of segments
    """
    if g <= 1:
        return a * (r * g + c)
    else:
        return a * (r * g + c) / (2 ** (g - 1))

This parameter directly influences the probability distribution of observed segment counts, which follows a Poisson distribution with mean η:

P(k|R) = \frac{\eta^k e^{-\eta}}{k!}

Where k is the observed number of segments and R is the relationship. This distribution is implemented in the get_log_seg_count_pdf() function in likelihoods.py.

Expected Total IBD Sharing

The expected total IBD sharing for a relationship combines the eta and lambda parameters. For a relationship with meiotic distance m and expected segment count η, the expected total IBD sharing is:

E[Total\;IBD] = \eta \cdot (\frac{100}{m} + t)

Where t is the minimum detectable segment length. This formula accounts for the truncation of the exponential distribution at the minimum threshold.

The expected total IBD sharing follows characteristic patterns for different relationships:

  • Parent-Child: ~3400 cM (50% of the genome)
  • Full Siblings: ~2550 cM (37.5% of the genome)
  • Half Siblings/Grandparent/Avuncular: ~1700 cM (25% of the genome)
  • First Cousins: ~850 cM (12.5% of the genome)
  • Second Cousins: ~212 cM (3.125% of the genome)

The get_expected_total_ibd() function in moments.py calculates this value for any relationship tuple:

def get_expected_total_ibd(rel_tuple, min_seg_len=7.0):
    """Calculate expected total IBD sharing.
    
    Args:
        rel_tuple: (up, down, num_ancs) tuple
        min_seg_len: Minimum segment length threshold
        
    Returns:
        Expected total IBD in cM
    """
    up, down, num_ancs = rel_tuple
    m = up + down
    
    # Special case for self
    if m == 0 and num_ancs == 2:
        return 3400.0
    
    # Calculate expected segment count
    eta = get_eta(m, a=num_ancs)
    
    # Adjust for detection threshold
    p_obs = math.exp(-m * min_seg_len / 100.0)
    eta_obs = eta * p_obs
    
    # Calculate expected total IBD
    expected_total = eta_obs * (100.0 / m + min_seg_len)
    
    return expected_total

This expected value serves as the mean of a normal distribution that models the variation in total IBD sharing. The variance of this distribution is also modeled based on theoretical considerations and empirical calibration.

Likelihood Functions

Likelihood for Segment Counts

The likelihood function for segment counts quantifies the probability of observing k segments given a relationship R. This likelihood is based on the Poisson distribution with parameter η (expected segment count):

P(k|R) = \frac{\eta^k e^{-\eta}}{k!}

In log space, this becomes:

\log P(k|R) = k \log \eta - \eta - \log(k!)

The get_log_seg_count_pdf() function in likelihoods.py implements this calculation:

def get_log_seg_count_pdf(k, eta, sigma=None):
    """Calculate log-likelihood of observing k segments.
    
    Args:
        k: Observed segment count
        eta: Expected segment count
        sigma: Standard deviation (not used for Poisson)
        
    Returns:
        Log-likelihood
    """
    if eta <= 0:
        return -np.inf if k > 0 else 0.0
    
    # Use Poisson PMF
    return stats.poisson.logpmf(k, eta)

This likelihood function captures the inherent stochasticity in segment counts due to the random nature of recombination and inheritance. It accounts for the fact that even for a fixed relationship type, the observed number of segments can vary significantly.

Likelihood for Segment Lengths

The likelihood function for segment lengths quantifies the probability of observing segments with specific lengths given a relationship R. For a relationship with meiotic distance m, the probability density function for a segment of length x (given that it exceeds the minimum threshold t) is:

f(x|R,x>t) = \frac{m}{100} e^{-\frac{m}{100}(x-t)}

The log-likelihood for a set of segments is the sum of the log-likelihoods for each individual segment:

\log P(\{x_1, x_2, ..., x_n\}|R) = \sum_{i=1}^{n} \log f(x_i|R,x_i>t)

This calculation is implemented in the get_log_seg_len_pdf() function:

def get_log_seg_len_pdf(segs, lam, min_len=7.0):
    """Calculate log-likelihood of observed segment lengths.
    
    Args:
        segs: List of segment lengths
        lam: Lambda parameter for exponential distribution
        min_len: Minimum segment length threshold
        
    Returns:
        Log-likelihood
    """
    if not segs:
        return 0.0  # No segments
    
    # Calculate log-likelihood for each segment
    log_likes = []
    for seg_len in segs:
        if seg_len < min_len:
            continue
            
        # Log-pdf for truncated exponential distribution
        adj_len = seg_len - min_len
        log_like = np.log(lam) - lam * adj_len
        log_likes.append(log_like)
    
    # Sum log-likelihoods
    return sum(log_likes)

This function handles the truncated exponential distribution that arises from imposing a minimum detection threshold on segment lengths. It captures the distinctive patterns of segment length distributions that characterize different relationship types.

The Combined Likelihood Function

The core of Bonsai v3's relationship inference is the get_log_seg_pdf() function, which combines the likelihoods from segment counts and segment lengths into a comprehensive likelihood model:

def get_log_seg_pdf(segments, rel_tuple, min_seg_len=7.0):
    """Calculate log-likelihood of observing segments given a relationship.
    
    Args:
        segments: List of segment lengths
        rel_tuple: (up, down, num_ancs) tuple
        min_seg_len: Minimum segment length threshold
        
    Returns:
        Log-likelihood
    """
    up, down, num_ancs = rel_tuple
    m = up + down
    
    # Handle special case for self
    if m == 0 and num_ancs == 2:
        return 0.0 if not segments else -np.inf
    
    # Calculate parameters
    lam = get_lam_a_m(m, num_ancs, min_seg_len)
    eta = get_eta(m, a=num_ancs)
    
    # Adjust for detection threshold
    p_obs = math.exp(-m * min_seg_len / 100.0)
    eta_obs = eta * p_obs
    
    # Calculate likelihood for segment count
    count_ll = get_log_seg_count_pdf(len(segments), eta_obs)
    
    # Calculate likelihood for segment lengths
    length_ll = get_log_seg_len_pdf(segments, lam, min_seg_len)
    
    # Combine likelihoods with appropriate weights
    # Weight factors are empirically calibrated
    count_weight = 0.4
    length_weight = 0.6
    
    return count_weight * count_ll + length_weight * length_ll

This function implements a weighted combination of the count and length likelihoods, with weights that are empirically calibrated to optimize inference accuracy. The weights reflect the relative information content of the different statistics - segment lengths often provide more discriminative power for distinguishing close relationships, while segment counts become more important for distant relationships.

In the actual Bonsai v3 implementation, this function includes additional refinements:

  • Special handling for parent-child relationships, which have distinctive IBD patterns
  • Adjustments for IBD2 segments, which are present in full sibling relationships
  • Chromosome-specific models that account for varying recombination rates
  • Population-specific calibration factors for different demographic backgrounds

These refinements ensure that the likelihood function accurately captures the complex patterns of IBD sharing observed in real human relationships.

Handling Uncertainty and Stochasticity

The Stochastic Nature of Genetic Inheritance

One of the fundamental challenges in genetic relationship inference is the inherently stochastic nature of genetic inheritance. Due to the random processes of meiosis and recombination, even the same relationship type can produce widely varying patterns of IBD sharing.

Key sources of stochasticity include:

  • Random Segregation: Which chromosomes are inherited from each parent is random
  • Recombination Variability: The number and location of crossovers during meiosis is variable
  • Sampling Error: Genotyping covers only a subset of markers, introducing sampling variance
  • Detection Error: IBD detection algorithms have false positives and false negatives

This stochasticity means that a single observed IBD pattern could be consistent with multiple possible relationships. For example, the coefficient of variation (standard deviation divided by mean) for total IBD sharing between second cousins can be as high as 50%, meaning that the actual sharing can easily range from half to double the expected value!

Bonsai v3 explicitly models this stochasticity through the probability distributions in its likelihood functions. The variance parameters in these distributions are calibrated using both theoretical models and empirical data from thousands of known relationships.

Likelihood Ratios and Bayes Factors

To quantify uncertainty in relationship inference, Bonsai v3 uses likelihood ratios (also known as Bayes factors) to compare different relationship hypotheses. For two relationship hypotheses R₁ and R₂, the likelihood ratio is:

LR = \frac{P(D|R_1)}{P(D|R_2)}

In log space, this becomes a simple subtraction:

\log LR = \log P(D|R_1) - \log P(D|R_2)

Bonsai v3 implements a function to determine when one hypothesis is significantly better than another:

def is_significantly_better(ll1, ll2, threshold=2.0):
    """Determine if one log-likelihood is significantly better than another.
    
    Args:
        ll1, ll2: Log-likelihood values
        threshold: Difference threshold (default 2.0 ≈ 7.4x more likely)
        
    Returns:
        True if ll1 is significantly better than ll2
    """
    return ll1 - ll2 > threshold

The threshold value of 2.0 corresponds to a likelihood ratio of approximately 7.4, which is often used as a standard for "strong evidence" in Bayesian statistics. When multiple relationship hypotheses have similar likelihoods (differences below the threshold), Bonsai reports the ambiguity rather than making an arbitrary choice.

This approach to uncertainty quantification is critical for reliable pedigree reconstruction, as it prevents the propagation of errors from ambiguous relationships to the larger pedigree structure.

Ambiguous Relationship Cases

Certain pairs of relationships are notoriously difficult to distinguish based on IBD patterns alone. The most common ambiguous cases are:

Ambiguous Group Shared DNA Distinguishing Features
Half-Siblings
Grandparent-Grandchild
Avuncular (Aunt/Uncle)
~25% Age differences
Segment length patterns
Half-Avuncular
First Cousins
~12.5% Age differences
Segment count
Half-First Cousins
First Cousins Once Removed
~6.25% Age differences
Generation gap
Second Cousins
Half-First Cousins Once Removed
~3.125% Difficult to distinguish reliably

To address these ambiguities, Bonsai v3 implements several strategies:

  1. Age-Based Disambiguation: Using age information to distinguish relationships with similar IBD patterns
  2. Multiple Relationship Paths: Considering all possible ways individuals might be related
  3. Network Constraints: Using relationships with other individuals to constrain possibilities
  4. Explicit Uncertainty Reporting: Providing confidence scores and alternative hypotheses

The get_age_log_like() function in likelihoods.py specifically addresses the first strategy, computing the likelihood of observed age differences under different relationship hypotheses:

def get_age_log_like(age1, age2, rel_tuple):
    """Calculate log-likelihood of age difference for a relationship.
    
    Args:
        age1, age2: Ages of the two individuals
        rel_tuple: (up, down, num_ancs) relationship tuple
        
    Returns:
        Log-likelihood of the age difference
    """

By combining genetic and demographic evidence, Bonsai can often resolve ambiguities that would be impossible to address using IBD patterns alone.

Statistical Foundation: Bonsai v3's probabilistic relationship inference demonstrates the power of Bayesian modeling for complex biological problems. By explicitly modeling the stochastic nature of genetic inheritance and quantifying uncertainty, the system provides not just point estimates but comprehensive probabilistic assessments of relationships, enabling more reliable pedigree reconstruction in the face of inherent biological variability.

Comparing Notebook and Production Code

The Lab06 notebook provides a simplified implementation of the probabilistic relationship inference methods used in Bonsai v3, while the actual implementation includes additional sophistication:

Despite these differences, the fundamental mathematical principles demonstrated in the notebook directly correspond to those used in the production system, providing an accurate conceptual foundation for understanding Bonsai's approach to relationship inference.

Interactive Lab Environment

Run the interactive Lab 06 notebook in Google Colab:

Google Colab Environment

Run the notebook in Google Colab for a powerful computing environment with access to Google's resources.

Data will be automatically downloaded from S3 when you run the notebook.

Note: You may need a Google account to save your work in Google Drive.

Open Lab 06 Notebook in Google Colab

Beyond the Code

As you explore the probabilistic relationship inference methods in Bonsai v3, consider these broader implications:

These considerations highlight how Bonsai v3's approach to relationship inference represents not just a technical solution but a principled scientific methodology for drawing conclusions from complex and noisy data.

This lab is part of the Bonsai v3 Deep Dive track:

Introduction

Lab 01

Architecture

Lab 02

IBD Formats

Lab 03

Statistics

Lab 04

Models

Lab 05

Relationships

Lab 06

PwLogLike

Lab 07

Age Modeling

Lab 08

Data Structures

Lab 09

Up-Node Dict

Lab 10