Lab 17: Incremental Individual Addition | Computational Genetic Genealogy

Lab 17: Incremental Individual Addition Strategies

Core Component: This lab explores how Bonsai v3 efficiently adds new individuals to existing pedigrees one at a time, rather than rebuilding the entire structure. These incremental addition strategies are critical for performance when working with large datasets and for updating pedigrees as new genetic data becomes available.

The Challenge of Incremental Addition

Why Incremental Addition Matters

In real-world genetic genealogy applications, data often arrives incrementally. New individuals may take DNA tests and need to be placed in an existing family structure without recomputing the entire pedigree. Bonsai v3 addresses this challenge through specialized functions that efficiently find optimal placement for new individuals within existing pedigrees.

The key advantages of incremental addition include:

Computational Efficiency: Adding one individual is much faster than rebuilding an entire pedigree
Stability: Existing relationships in the pedigree remain stable as new individuals are added
Scalability: The approach can handle continuous addition of new individuals over time
Real-time Updates: Results can be updated promptly as new data becomes available

The core of this functionality in Bonsai v3 is the add_individual_to_pedigree function:

def add_individual_to_pedigree(
    id_val: int,
    up_dct: dict[int, dict[int, int]],
    id_to_shared_ibd: dict[tuple[int, int], list[dict]],
    id_to_info: dict[int, dict],
    pw_ll: Any,
    max_up: int = 3,
    n_keep: int = 5,
    return_only_best: bool = True,
):
    """
    Add a single individual to an existing pedigree.
    
    Args:
        id_val: ID of the individual to add
        up_dct: Up-node dictionary representing the existing pedigree
        id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
        id_to_info: Dict with demographic information for individuals
        pw_ll: PwLogLike instance for likelihood calculation
        max_up: Maximum number of generations to extend upward
        n_keep: Number of top pedigrees to keep
        return_only_best: Whether to return only the best pedigree
        
    Returns:
        List of (pedigree, likelihood) pairs or just the best one
    """
    # Actual implementation details...

This function orchestrates the process of finding the optimal way to add a new individual to an existing pedigree. It evaluates different potential connection points and relationship configurations, much like the pedigree merging process we saw previously, but optimized for the single-individual case.

Finding Potential Connection Points

The first step in adding a new individual to a pedigree is to identify potential connection points. This is handled by the find_potential_connection_points function:

def find_potential_connection_points(
    id_val: int,
    up_dct: dict[int, dict[int, int]],
    id_to_shared_ibd: dict[tuple[int, int], list[dict]],
):
    """
    Find potential points for connecting a new individual to a pedigree.
    
    Args:
        id_val: ID of the individual to add
        up_dct: Up-node dictionary representing the existing pedigree
        id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
        
    Returns:
        connection_points: List of (existing_id, ibd_amount) tuples
    """
    # Find individuals in the pedigree who share IBD with the new individual
    connection_points = []
    for id_pair, segments in id_to_shared_ibd.items():
        # Check if one of the IDs is the new individual
        if id_val not in id_pair:
            continue
            
        # Get the other ID
        other_id = id_pair[0] if id_pair[1] == id_val else id_pair[1]
        
        # Check if the other ID is in the pedigree
        if other_id in up_dct:
            # Calculate total shared IBD
            total_cm = sum(seg['length_cm'] for seg in segments)
            
            # Add to connection points
            connection_points.append((other_id, total_cm))
    
    # Sort by amount of IBD sharing (descending)
    connection_points.sort(key=lambda x: x[1], reverse=True)
    
    return connection_points

This function identifies individuals in the existing pedigree who share IBD with the new individual. These individuals form the potential connection points through which the new individual can be added to the pedigree. The function sorts these connection points by the amount of IBD sharing, prioritizing stronger genetic connections.

The key insight here is that individuals who share more IBD with the new individual are likely to be more closely related, making them better candidates for connection points. By focusing on these high-IBD connections first, Bonsai can often find the optimal placement more efficiently.

Evaluating Different Placement Options

Once potential connection points are identified, Bonsai v3 evaluates different options for adding the new individual to the pedigree. This involves determining the optimal relationship between the new individual and existing individuals in the pedigree.

The evaluate_placement_options function handles this process:

def evaluate_placement_options(
    id_val: int,
    connection_points: list[tuple[int, float]],
    up_dct: dict[int, dict[int, int]],
    id_to_shared_ibd: dict[tuple[int, int], list[dict]],
    id_to_info: dict[int, dict],
    pw_ll: Any,
    max_up: int = 3,
):
    """
    Evaluate different options for placing a new individual in a pedigree.
    
    Args:
        id_val: ID of the individual to add
        connection_points: List of (existing_id, ibd_amount) tuples
        up_dct: Up-node dictionary representing the existing pedigree
        id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
        id_to_info: Dict with demographic information for individuals
        pw_ll: PwLogLike instance for likelihood calculation
        max_up: Maximum number of generations to extend upward
        
    Returns:
        placement_options: List of (up_dct_new, likelihood) tuples
    """
    placement_options = []
    
    # Get demographic information for the new individual
    new_info = id_to_info.get(id_val, {})
    
    # For each potential connection point
    for existing_id, _ in connection_points:
        # Get demographic information for the existing individual
        existing_info = id_to_info.get(existing_id, {})
        
        # Get IBD segments between the individuals
        pair = (min(id_val, existing_id), max(id_val, existing_id))
        ibd_segs = id_to_shared_ibd.get(pair, [])
        
        # Get potential relationship configurations
        rel_configs = get_connection_degs_and_log_likes(
            id1=id_val,
            id2=existing_id,
            id_to_shared_ibd=id_to_shared_ibd,
            id_to_info=id_to_info,
            pw_ll=pw_ll,
            max_up=max_up,
        )
        
        # For each potential relationship configuration
        for up, down, num_ancs, log_like in rel_configs:
            # Create a new version of the pedigree with this relationship
            up_dct_new = add_with_relationship(
                id_val=id_val,
                existing_id=existing_id,
                up_dct=up_dct,
                up=up,
                down=down,
                num_ancs=num_ancs,
            )
            
            # Calculate total likelihood of the new pedigree
            total_likelihood = calculate_pedigree_likelihood(
                up_dct_new,
                id_to_shared_ibd,
                id_to_info,
                pw_ll,
            )
            
            # Add to placement options
            placement_options.append((up_dct_new, total_likelihood))
    
    # Sort by likelihood (descending)
    placement_options.sort(key=lambda x: x[1], reverse=True)
    
    return placement_options

This function systematically evaluates different ways to add the new individual to the pedigree, considering various relationship configurations with each potential connection point. It calculates the likelihood of each resulting pedigree and ranks them, allowing Bonsai to identify the most probable placement for the new individual.

The key steps in this process are:

For each potential connection point, identify possible relationship configurations
For each relationship configuration, create a new version of the pedigree
Calculate the likelihood of each resulting pedigree
Rank the pedigrees by likelihood

This approach allows Bonsai to efficiently explore the space of possible placements for the new individual and identify the most probable one based on genetic evidence.

Implementing the Addition

Adding Individuals with Different Relationship Types

Once the optimal relationship configuration is identified, Bonsai v3 needs to physically add the new individual to the pedigree with that relationship. The add_with_relationship function handles this:

def add_with_relationship(
    id_val: int,
    existing_id: int,
    up_dct: dict[int, dict[int, int]],
    up: int,
    down: int,
    num_ancs: int,
):
    """
    Add a new individual to a pedigree with a specific relationship.
    
    Args:
        id_val: ID of the individual to add
        existing_id: ID of the existing individual to connect to
        up_dct: Up-node dictionary representing the existing pedigree
        up, down, num_ancs: Relationship parameters
        
    Returns:
        up_dct_new: Updated pedigree with the new individual
    """
    # Create a copy of the existing pedigree
    up_dct_new = copy.deepcopy(up_dct)
    
    # Ensure the new individual exists in the pedigree
    if id_val not in up_dct_new:
        up_dct_new[id_val] = {}
    
    # Handle different relationship types
    if up == 0 and down == 1:  # id_val is parent of existing_id
        # Add id_val as parent of existing_id
        up_dct_new[existing_id][id_val] = 1
        
    elif up == 1 and down == 0:  # existing_id is parent of id_val
        # Add existing_id as parent of id_val
        up_dct_new[id_val][existing_id] = 1
        
    elif up == 1 and down == 1:  # Siblings
        # Get existing parents of existing_id
        existing_parents = list(up_dct_new.get(existing_id, {}).keys())
        
        if num_ancs == 1:  # Half siblings
            # Need at least one parent to create half siblings
            if existing_parents:
                shared_parent = existing_parents[0]
                up_dct_new[id_val][shared_parent] = 1
            else:
                # Create a new ungenotyped parent
                new_parent_id = get_min_id(up_dct_new) - 1
                up_dct_new[existing_id][new_parent_id] = 1
                up_dct_new[id_val][new_parent_id] = 1
                up_dct_new[new_parent_id] = {}
                
        elif num_ancs == 2:  # Full siblings
            # Need two parents to create full siblings
            if len(existing_parents) >= 2:
                parent1, parent2 = existing_parents[:2]
                up_dct_new[id_val][parent1] = 1
                up_dct_new[id_val][parent2] = 1
            elif len(existing_parents) == 1:
                # Use the existing parent and create a new one
                shared_parent = existing_parents[0]
                new_parent_id = get_min_id(up_dct_new) - 1
                up_dct_new[existing_id][new_parent_id] = 1
                up_dct_new[id_val][shared_parent] = 1
                up_dct_new[id_val][new_parent_id] = 1
                up_dct_new[new_parent_id] = {}
            else:
                # Create two new ungenotyped parents
                parent1_id = get_min_id(up_dct_new) - 1
                parent2_id = parent1_id - 1
                up_dct_new[existing_id][parent1_id] = 1
                up_dct_new[existing_id][parent2_id] = 1
                up_dct_new[id_val][parent1_id] = 1
                up_dct_new[id_val][parent2_id] = 1
                up_dct_new[parent1_id] = {}
                up_dct_new[parent2_id] = {}
                
    else:  # More distant relationships
        # Handle more complex relationships through common ancestors
        # Create ancestry chains as needed
        # ... (implementation details for more distant relationships)
    
    return up_dct_new

This function implements the physical addition of a new individual to the pedigree with a specific relationship to an existing individual. It handles various relationship types, including:

Parent-Child: Adding the new individual as a parent or child of an existing individual
Siblings: Adding the new individual as a half or full sibling of an existing individual
More Distant Relationships: Adding the new individual with more complex relationships that involve common ancestors

For sibling relationships, the function attempts to reuse existing parents when available, or creates new ungenotyped parents as needed. This approach ensures that the resulting pedigree remains consistent with all previously established relationships while properly incorporating the new individual.

Maintaining Multiple Hypotheses

An important aspect of Bonsai v3's incremental addition strategy is the maintenance of multiple hypotheses about how a new individual might fit into the pedigree. This approach allows Bonsai to avoid getting stuck in local optima and explore different placement possibilities.

The select_best_placements function handles this aspect:

def select_best_placements(
    placement_options: list[tuple[dict, float]],
    n_keep: int = 5,
):
    """
    Select the top placement options to keep.
    
    Args:
        placement_options: List of (pedigree, likelihood) pairs
        n_keep: Number of top options to keep
        
    Returns:
        best_placements: List of top (pedigree, likelihood) pairs
    """
    # Filter out any invalid pedigrees
    valid_options = [(ped, ll) for ped, ll in placement_options if is_valid_pedigree(ped)]
    
    # Sort by likelihood (descending)
    valid_options.sort(key=lambda x: x[1], reverse=True)
    
    # Keep only the top n_keep options
    best_placements = valid_options[:n_keep]
    
    return best_placements

This function selects the top n_keep placement options from all the evaluated possibilities. By maintaining multiple hypotheses, Bonsai can:

Explore different possible placements for the new individual
Compare these placements based on their likelihoods
Select the most probable placement while keeping alternative possibilities

This approach is particularly valuable when the genetic evidence is ambiguous or when there are multiple plausible ways to fit the new individual into the pedigree.

The function also includes a validation step (is_valid_pedigree) to ensure that the resulting pedigrees are biologically and structurally valid. This helps prevent invalid configurations, such as cycles in the pedigree or other inconsistencies.

Optimizing the Addition Process

Efficient Addition Strategies

Bonsai v3 includes several optimizations to make the incremental addition process more efficient, especially for large pedigrees. These optimizations include:

def optimize_addition(
    id_val: int,
    up_dct: dict[int, dict[int, int]],
    id_to_shared_ibd: dict[tuple[int, int], list[dict]],
    id_to_info: dict[int, dict],
):
    """
    Apply optimizations to make the addition process more efficient.
    
    Args:
        id_val: ID of the individual to add
        up_dct: Up-node dictionary representing the existing pedigree
        id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
        id_to_info: Dict with demographic information for individuals
        
    Returns:
        optimized_data: Tuple of optimized parameters for addition
    """
    # Calculate IBD threshold based on pedigree size
    pedigree_size = len(up_dct)
    ibd_threshold = max(20, 10 * math.log(pedigree_size))
    
    # Filter connection points by IBD threshold
    connection_points = find_potential_connection_points(id_val, up_dct, id_to_shared_ibd)
    filtered_points = [(id_i, ibd) for id_i, ibd in connection_points if ibd >= ibd_threshold]
    
    # Limit the number of connection points to evaluate
    max_points = min(len(filtered_points), 10)
    top_points = filtered_points[:max_points]
    
    # Determine optimal max_up parameter based on IBD amounts
    if top_points:
        max_ibd = top_points[0][1]
        if max_ibd > 1700:  # Parent-child level
            suggested_max_up = 1
        elif max_ibd > 700:  # Sibling level
            suggested_max_up = 2
        elif max_ibd > 200:  # First cousin level
            suggested_max_up = 3
        elif max_ibd > 50:   # Second cousin level
            suggested_max_up = 4
        else:
            suggested_max_up = 5
    else:
        suggested_max_up = 3  # Default
    
    # Filter id_to_shared_ibd to include only relevant pairs
    relevant_ids = set([id_val])
    relevant_ids.update([id_i for id_i, _ in top_points])
    
    filtered_ibd = {}
    for (id1, id2), segments in id_to_shared_ibd.items():
        if id1 in relevant_ids or id2 in relevant_ids:
            filtered_ibd[(id1, id2)] = segments
    
    return (top_points, suggested_max_up, filtered_ibd)

This function applies several key optimizations to make the addition process more efficient:

Dynamic IBD Threshold: Adjusts the minimum IBD threshold based on pedigree size
Connection Point Filtering: Limits the number of connection points to evaluate
Adaptive max_up Parameter: Determines the optimal maximum number of generations to consider based on IBD amounts
IBD Data Filtering: Restricts the IBD data to only the relevant pairs

These optimizations significantly reduce the computational cost of adding new individuals to large pedigrees while maintaining accuracy. By focusing on the most promising connection points and using adaptive parameters, Bonsai can efficiently find optimal placements even in complex scenarios.

Integration with the Overall Pedigree Building Process

The incremental addition strategy is integrated into Bonsai v3's overall pedigree building process through the build_pedigree_incremental function:

def build_pedigree_incremental(
    id_to_shared_ibd: dict[tuple[int, int], list[dict]],
    id_to_info: dict[int, dict],
    pw_ll: Any,
    max_up: int = 3,
    n_keep: int = 5,
    id_order: list[int] = None,
):
    """
    Build a pedigree incrementally by adding individuals one at a time.
    
    Args:
        id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
        id_to_info: Dict with demographic information for individuals
        pw_ll: PwLogLike instance for likelihood calculation
        max_up: Maximum number of generations to extend upward
        n_keep: Number of top pedigrees to keep
        id_order: Order in which to add individuals (if None, determined automatically)
        
    Returns:
        best_pedigree: The final best pedigree
    """
    # Get all individuals from the IBD data
    all_ids = set()
    for id1, id2 in id_to_shared_ibd:
        all_ids.add(id1)
        all_ids.add(id2)
    
    # If no specific order provided, order by IBD connectivity
    if id_order is None:
        id_order = order_by_connectivity(all_ids, id_to_shared_ibd)
    
    # Start with the first individual
    first_id = id_order[0]
    up_dct = {first_id: {}}  # Single individual pedigree
    
    # Add remaining individuals one at a time
    for id_val in id_order[1:]:
        # Add this individual to the current pedigree
        result = add_individual_to_pedigree(
            id_val=id_val,
            up_dct=up_dct,
            id_to_shared_ibd=id_to_shared_ibd,
            id_to_info=id_to_info,
            pw_ll=pw_ll,
            max_up=max_up,
            n_keep=n_keep,
            return_only_best=True,
        )
        
        # Update the current pedigree
        if result:
            up_dct = result[0]
    
    return up_dct

This function builds a pedigree incrementally by adding individuals one at a time in a specific order. The key steps are:

Identifying all individuals from the IBD data
Determining an optimal order for adding individuals (if not specified)
Starting with a single individual
Adding the remaining individuals one at a time

The order in which individuals are added can significantly impact the quality of the resulting pedigree. Bonsai typically orders individuals by their IBD connectivity — adding highly connected individuals first to establish a stable core pedigree, and then adding less connected individuals around this core.

This incremental approach allows Bonsai to efficiently build large pedigrees that would be computationally infeasible to optimize directly. By breaking the problem into a series of simpler incremental additions, Bonsai can scale to much larger datasets while maintaining accuracy.

Core Component: Incremental individual addition is a critical optimization in Bonsai v3 that enables efficient pedigree reconstruction at scale. By adding individuals one at a time to an existing pedigree, rather than rebuilding the entire structure, Bonsai can handle large datasets and continuous updates with new genetic data. The sophisticated algorithms for finding optimal placements and efficiently evaluating different options allow Bonsai to maintain accuracy while dramatically reducing computational costs.

Comparing Notebook and Bonsai v3

The Lab17 notebook explores incremental individual addition strategies through simplified implementations and examples. While the notebook provides an educational introduction to the key concepts, the actual Bonsai v3 implementation includes additional sophistication:

Sophisticated Ordering Strategies: The production code includes advanced algorithms for determining the optimal order in which to add individuals.
Parallel Evaluation: The real implementation can evaluate multiple placement options in parallel for better performance.
Metadata Integration: More comprehensive integration of non-genetic information (age, sex, etc.) in the placement process.
Caching and Memoization: The production code includes sophisticated caching mechanisms to avoid redundant computations.
Error Recovery: More robust handling of edge cases and ability to recover from sub-optimal placements.

These differences allow the production implementation to handle the full complexity of real-world pedigree reconstruction tasks, while the notebook provides a more accessible introduction to the core concepts.

Interactive Lab Environment

Run the interactive Lab 17 notebook in Google Colab:

Google Colab Environment

Run the notebook in Google Colab for a powerful computing environment with access to Google's resources.

Data will be automatically downloaded from S3 when you run the notebook.

Note: You may need a Google account to save your work in Google Drive.

Open Lab 17 Notebook in Google Colab

Beyond the Code

As you explore incremental individual addition strategies, consider these broader implications:

Growing Datasets: How these techniques enable analysis of ever-growing genetic genealogy databases
Adaptive Pedigree Building: The potential for continuously updating pedigrees as new data becomes available
Ordering Strategies: The impact of different individual addition orders on the resulting pedigree structure
Balancing Accuracy and Performance: The trade-offs involved in optimizing for efficient additions
Adaptability: How incremental approaches can adjust to new evidence without requiring complete recomputation

These considerations highlight how incremental individual addition strategies are not just technical optimizations but fundamental enablers of large-scale pedigree reconstruction from genetic data.

This lab is part of the Bonsai v3 Deep Dive track:

Introduction

Lab 01

Architecture

Lab 02

IBD Formats

Lab 03

Statistics

Lab 04

Models

Lab 05

Relationships

Lab 06

PwLogLike

Lab 07

Age Modeling

Lab 08

Data Structures

Lab 09

Up-Node Dict

Lab 10

Connection Points

Lab 11

Relationship Assessment

Lab 12

Small Pedigrees

Lab 13

Optimizing Pedigrees

Lab 14

Combine Up Dicts

Lab 15

Merging Pedigrees

Lab 16

Incremental Addition

Lab 17