Lab 15: The combine_up_dicts() Algorithm

Core Component: This lab explores the combine_up_dicts() algorithm, the central mechanism in Bonsai v3 for scaling from small pedigree structures to larger ones. This algorithm is essential for reconstructing complex family networks by systematically merging smaller pedigree units through an iterative, likelihood-based approach.

Scaling from Small to Large Pedigrees

The Challenge of Large-Scale Pedigree Reconstruction

Reconstructing large pedigrees presents significant challenges:

Combinatorial Explosion: The number of possible pedigree configurations grows exponentially with the number of individuals
Computational Constraints: Directly optimizing large pedigrees would require prohibitive computational resources
Incomplete Data: Real-world genetic data often contains missing information and measurement errors
Conflicting Evidence: Different pieces of genetic evidence might suggest incompatible relationships

To address these challenges, Bonsai v3 implements a bottom-up, incremental approach through the combine_up_dicts() function. This function, found in utils/bonsaitree/bonsaitree/v3/connections.py, lies at the heart of Bonsai's ability to scale from small pedigree structures to more complex family networks.

def combine_up_dicts(
    id_to_up_dct: dict[int, dict[int, dict[int, int]]],
    id_to_shared_ibd: dict[tuple[int, int], list[dict]],
    id_to_info: dict[int, dict],
    pw_ll: Any,
    max_up: int = 3,
    n_keep: int = 5,
    id_set_to_exclude: set[int] = None,
):
    """
    Combine all pedigrees in id_to_up_dct incrementally.
    
    Args:
        id_to_up_dct: Dictionary mapping IDs to their up-node dictionaries
        id_to_shared_ibd: Dictionary mapping pairs of IDs to their IBD segments
        id_to_info: Dictionary with demographic information for individuals
        pw_ll: PwLogLike instance for likelihood calculation
        max_up: Maximum number of generations to extend upward
        n_keep: Number of top pedigrees to keep at each step
        id_set_to_exclude: Set of IDs to exclude from combination
        
    Returns:
        List of optimized pedigrees with their likelihoods
    """
    # Core algorithm implementation...

This function orchestrates the pedigree merging process, addressing the scaling challenges through several key strategies:

Divide and Conquer: Breaking the large problem into smaller, more manageable subproblems (small pedigrees)
Greedy Merging: Iteratively combining the pedigrees with the strongest genetic connections first
Hypothesis Tracking: Maintaining multiple candidate pedigrees to avoid local optima
Intelligent Pruning: Focusing computational resources on the most promising pedigree configurations

By following this incremental approach, Bonsai v3 can reconstruct large-scale pedigrees that would be computationally infeasible to optimize directly.

The Iteration Process: Finding and Merging Pedigrees

The core of combine_up_dicts() is an iterative process that repeatedly identifies and merges the most closely related pedigrees until all individuals are connected or no more reliable connections can be made. This process involves several key steps:

def combine_up_dicts(...):
    # ... initialization code ...
    
    # Main iteration loop
    while len(idx_to_up_dict_ll_list) > 1:
        # Find closest pedigrees to merge based on IBD sharing
        idx1, idx2 = find_closest_pedigrees(
            idx_to_id_set=idx_to_id_set,
            id_to_shared_ibd=id_to_shared_ibd,
            min_cm=min_cm,
        )
        
        # If no more pedigrees can be connected, break
        if idx1 is None or idx2 is None:
            break
            
        # Get list of pedigrees to combine
        up_dict_ll_list1 = idx_to_up_dict_ll_list[idx1]
        up_dict_ll_list2 = idx_to_up_dict_ll_list[idx2]
        
        # Try all combinations of pedigrees
        all_combined = []
        for up_dict1, ll1 in up_dict_ll_list1:
            for up_dict2, ll2 in up_dict_ll_list2:
                # Find optimal way to connect these pedigrees
                combined_list = combine_pedigrees(
                    up_dct1=up_dict1,
                    up_dct2=up_dict2,
                    id_to_shared_ibd=id_to_shared_ibd,
                    id_to_info=id_to_info,
                    pw_ll=pw_ll,
                    max_up=max_up,
                    keep_num=n_keep,
                    return_many=True,
                )
                
                # Add all combined pedigrees to the list
                all_combined.extend([(ped, ll1 + ll2 + new_ll) 
                                     for ped, new_ll in combined_list])
        
        # Keep only top n_keep combined pedigrees
        all_combined.sort(key=lambda x: x[1], reverse=True)
        kept_combined = all_combined[:n_keep]
        
        # Merge pedigree tracking structures
        # ... update idx_to_up_dict_ll_list, id_to_idx, idx_to_id_set ...
    
    # Return the final list of pedigrees with likelihoods
    return idx_to_up_dict_ll_list[0]

The key steps in this iteration process are:

Finding the Closest Pedigrees: The find_closest_pedigrees() function identifies which pedigrees share the most IBD and should be merged next. This prioritizes merging the most closely related pedigrees first.
Trying All Combinations: For each pedigree in the first set and each pedigree in the second set, Bonsai tries to combine them using combine_pedigrees().
Evaluating Connection Options: When combining pedigrees, Bonsai explores various connection points and relationship types to find the optimal configuration.
Pruning Hypotheses: After all combinations are evaluated, only the top n_keep pedigrees with the highest likelihoods are retained for the next iteration.
Updating Tracking Structures: The pedigree tracking data structures are updated to reflect the merged pedigrees, and the process repeats until all pedigrees are connected or no more connections are possible.

This iterative approach allows Bonsai v3 to build complex pedigrees piece by piece, focusing computational resources on the most promising configurations at each step.

Finding the Optimal Connection Points

A critical component of the combine_up_dicts() algorithm is finding the optimal way to connect two pedigrees. This is handled by the combine_pedigrees() function, which systematically explores different connection options:

def combine_pedigrees(
    up_dct1: dict[int, dict[int, int]],
    up_dct2: dict[int, dict[int, int]],
    id_to_shared_ibd: dict[tuple[int, int], list[dict]],
    id_to_info: dict[int, dict],
    pw_ll: Any,
    max_up: int = 3,
    keep_num: int = 3,
    return_many: bool = False,
):
    """
    Combine two pedigrees into one, using IBD sharing to guide the connection.
    """
    # Find pairs of individuals that connect the pedigrees
    con_pairs = find_connecting_pairs(
        up_dct1=up_dct1,
        up_dct2=up_dct2,
        id_to_shared_ibd=id_to_shared_ibd,
    )
    
    if not con_pairs:
        return None if not return_many else []
    
    # Get all possible connection points in each pedigree
    con_pts1 = get_possible_connection_point_set(up_dct1)
    con_pts2 = get_possible_connection_point_set(up_dct2)
    
    # Restrict to connection points involving individuals with IBD
    con_pts1 = restrict_connection_point_set(up_dct1, con_pts1, 
                                          [id1 for id1, _ in con_pairs])
    con_pts2 = restrict_connection_point_set(up_dct2, con_pts2, 
                                          [id2 for _, id2 in con_pairs])
    
    # Try different ways to connect the pedigrees
    all_combined = []
    for (id1, id2) in con_pairs:
        # Find connection points that involve these IDs
        rel_pts1 = [pt for pt in con_pts1 if pt[0] == id1]
        rel_pts2 = [pt for pt in con_pts2 if pt[0] == id2]
        
        # Try each pair of connection points
        for cp1 in rel_pts1:
            for cp2 in rel_pts2:
                # Try different relationship configurations
                for up in range(max_up + 1):
                    for down in range(max_up + 1):
                        if up + down > max_up:
                            continue
                            
                        # Try both 1 and 2 ancestors
                        for num_ancs in [1, 2]:
                            # Connect the pedigrees
                            combs = connect_pedigrees_through_points(
                                id1=cp1[0], 
                                id2=cp2[0],
                                pid1=cp1[1], 
                                pid2=cp2[1],
                                up_dct1=up_dct1, 
                                up_dct2=up_dct2,
                                deg1=up, 
                                deg2=down,
                                num_ancs=num_ancs,
                            )
                            
                            # Evaluate each combination
                            for comb in combs:
                                ll = get_ped_like(
                                    up_dct=comb,
                                    id_to_shared_ibd=id_to_shared_ibd,
                                    id_to_info=id_to_info,
                                    pw_ll=pw_ll,
                                )
                                
                                all_combined.append((comb, ll))
    
    # Sort by likelihood and return
    all_combined.sort(key=lambda x: x[1], reverse=True)
    return all_combined[:keep_num] if return_many else all_combined[0][0]

This function systematically explores the space of possible connections between two pedigrees:

Identifying Connection Candidates: Bonsai identifies pairs of individuals (one from each pedigree) who share IBD segments, making them candidates for connection.
Finding Connection Points: Using get_possible_connection_point_set(), Bonsai identifies all possible points in each pedigree where connections can be made.
Restricting to Relevant Points: Connection points are filtered to focus on those involving individuals who share IBD, reducing the search space.
Systematic Exploration: Bonsai then systematically tries different connection configurations by varying:
- The number of generations up from the first individual (up)
- The number of generations down from the second individual (down)
- The number of common ancestors (num_ancs)
Evaluation: Each combined pedigree is evaluated using get_ped_like() to calculate its likelihood given the observed IBD data.
Ranking: Combinations are ranked by likelihood, and the top keep_num are returned.

This systematic approach allows Bonsai v3 to find the most likely way to connect pedigrees, even when there are multiple plausible connection options.

Tracking and Merging Pedigrees

Pedigree Tracking Data Structures

To efficiently manage the pedigree merging process, Bonsai v3 maintains several data structures that track which individuals belong to which pedigrees and how multiple pedigree hypotheses are evaluated:

def combine_up_dicts(...):
    # Initialize pedigree tracking structures
    idx_to_up_dict_ll_list = {}  # Maps pedigree indices to lists of (pedigree, log-likelihood) pairs
    id_to_idx = {}               # Maps individual IDs to their pedigree index
    idx_to_id_set = {}           # Maps pedigree indices to sets of contained individual IDs
    
    # Initialize with individual pedigrees
    for i, (id_val, up_dict) in enumerate(id_to_up_dct.items()):
        if id_set_to_exclude and id_val in id_set_to_exclude:
            continue
            
        idx_to_up_dict_ll_list[i] = [(up_dict, 0.0)]  # Initial likelihood = 0
        id_to_idx[id_val] = i
        idx_to_id_set[i] = {id_val}
    
    # Merging operations update these structures
    # ... main iteration loop ...
    
    # After finding closest pedigrees and combining them:
    
    # Create new index for merged pedigree
    new_idx = max(idx_to_up_dict_ll_list.keys()) + 1
    
    # Store combined pedigrees under new index
    idx_to_up_dict_ll_list[new_idx] = kept_combined
    
    # Update id_to_idx for all individuals in merged pedigrees
    id_set1 = idx_to_id_set[idx1]
    id_set2 = idx_to_id_set[idx2]
    merged_id_set = id_set1.union(id_set2)
    
    for id_val in merged_id_set:
        id_to_idx[id_val] = new_idx
        
    # Update idx_to_id_set with merged set
    idx_to_id_set[new_idx] = merged_id_set
    
    # Remove old pedigree records
    del idx_to_up_dict_ll_list[idx1]
    del idx_to_up_dict_ll_list[idx2]
    del idx_to_id_set[idx1]
    del idx_to_id_set[idx2]

These data structures serve several important purposes:

Multiple Hypothesis Tracking: For each pedigree index, idx_to_up_dict_ll_list maintains multiple candidate pedigrees with their likelihoods, allowing Bonsai to consider alternative hypotheses.
Individual-to-Pedigree Mapping: The id_to_idx dictionary enables efficient lookup of which pedigree contains a given individual.
Pedigree Membership Tracking: The idx_to_id_set dictionary allows Bonsai to quickly determine which individuals are in a given pedigree.

When pedigrees are merged, these data structures are updated accordingly:

A new pedigree index is created for the merged pedigree
The combined pedigrees are stored under this new index
Individual-to-pedigree mappings are updated to reflect the merger
The old pedigree records are removed

This efficient tracking system allows Bonsai v3 to manage complex pedigree merging operations while maintaining multiple hypotheses about the optimal pedigree structure.

Merging the Physical Pedigrees

While the tracking data structures manage the logical organization of pedigrees, the actual merging of physical pedigree structures is handled by the connect_pedigrees_through_points() function:

def connect_pedigrees_through_points(
    id1 : int,
    id2 : int,
    pid1 : Optional[int],
    pid2 : Optional[int],
    up_dct1 : dict[int, dict[int, int]],
    up_dct2 : dict[int, dict[int, int]],
    deg1 : int,
    deg2 : int,
    num_ancs : int,
    simple : bool=True,
):
    """
    Connect up_dct1 to up_dct2 through points id1 in up_dct1
    and id2 in up_dct2. Also connect through partner points
    pid1 and pid2, if indicated. Connect id1 to id2 through
    a relationship specified by (deg1, deg2, num_ancs).
    """
    # Validate connection parameters
    if deg1 == deg2 == 0 and (id1 > 0 and id2 > 0) and id1 != id2:
        return []  # Can't connect directly on different genotyped individuals
    
    if deg1 == deg2 == 0 and (pid1 != pid2):
        if pid1 is None or pid2 is None:
            return []
        elif pid1 > 0 and pid2 > 0:
            return []  # Can't connect on genotyped partners
    
    # Make copies to avoid modifying originals
    up_dct1 = copy.deepcopy(up_dct1)
    up_dct2 = copy.deepcopy(up_dct2)
    
    # Extend lineages upward if needed
    if deg1 > 0:
        up_dct1, _, id1, pid1 = extend_up(
            iid=id1,
            deg=deg1,
            num_ancs=num_ancs,
            up_dct=up_dct1,
        )
    
    if deg2 > 0:
        up_dct2, _, id2, pid2 = extend_up(
            iid=id2,
            deg=deg2,
            num_ancs=num_ancs,
            up_dct=up_dct2,
        )
    
    # Shift IDs in up_dct2 to avoid conflicts
    min_id = get_min_id(up_dct1) - 1
    up_dct2, id_map = shift_ids(
        ped=up_dct2,
        shift=min_id,
    )
    id2 = id_map.get(id2, id2)
    pid2 = id_map.get(pid2, pid2)
    
    # Create ID mapping for connection
    if simple:
        if (pid1 is not None) and (pid2 is not None):
            id_map_list = [
                {id1 : id2, pid1 : pid2}
            ]
        else:
            id_map_list = [
                {id1 : id2}
            ]
    else:
        id_map_list = get_all_matches(
            id1=id1,
            id2=id2,
            pid1=pid1,
            pid2=pid2,
            up_dct1=up_dct1,
            up_dct2=up_dct2,
        )
    
    # Connect pedigrees using ID mapping
    connect_dct_list = []
    for id_map in id_map_list:
        up_dct = connect_on(
            id_map=id_map,
            up_dct1=up_dct1,
            up_dct2=up_dct2,
        )
        connect_dct_list.append(up_dct)
    
    return connect_dct_list

This function handles the intricate process of physically merging two pedigree structures through specified connection points. The key steps are:

Validation: Ensuring the connection is biologically plausible (e.g., can't connect directly on different genotyped individuals).
Lineage Extension: Using extend_up() to add ancestors if the connection requires multiple generations.
ID Management: Shifting IDs in the second pedigree to avoid conflicts with the first pedigree.
Connection Mapping: Creating a mapping between IDs in the two pedigrees that specifies how they should be connected.
Physical Merging: Using connect_on() to physically merge the pedigrees based on the ID mapping.

The actual connection is performed by connect_on(), which merges two pedigrees based on an ID mapping:

def connect_on(
    id_map: dict[int, int],
    up_dct1: dict[int, dict[int, int]],
    up_dct2: dict[int, dict[int, int]],
):
    """
    Connect up_dct1 to up_dct2 based on id_map.
    Map values in up_dct1 to keys in up_dct2.
    """
    # Create result pedigree starting with all of up_dct1
    result = copy.deepcopy(up_dct1)
    
    # Add all nodes from up_dct2 not in the mapping
    for node2, parents2 in up_dct2.items():
        # Skip nodes that are values in the mapping
        if node2 in id_map.values():
            continue
            
        # Add node if not already in result
        if node2 not in result:
            result[node2] = {}
            
        # Add parents, mapping them if necessary
        for parent2, deg2 in parents2.items():
            mapped_parent = next((k for k, v in id_map.items() if v == parent2), parent2)
            result[node2][mapped_parent] = deg2
    
    # Connect nodes that are mapped
    for node1, node2 in id_map.items():
        # Transfer parents from node2 to node1
        for parent2, deg2 in up_dct2.get(node2, {}).items():
            mapped_parent = next((k for k, v in id_map.items() if v == parent2), parent2)
            result[node1][mapped_parent] = deg2
            
        # Make node1 a parent for all children of node2
        for child2, parents2 in up_dct2.items():
            if node2 in parents2 and child2 not in id_map.values():
                if child2 not in result:
                    result[child2] = {}
                result[child2][node1] = parents2[node2]
    
    return result

This function handles the details of merging pedigrees at the structural level:

It creates a new pedigree that includes all nodes from the first pedigree
It adds nodes from the second pedigree that aren't part of the connection mapping
It transfers parents and children between the mapped nodes to create the merged structure

Through this process, Bonsai v3 can physically merge pedigree structures in various ways, exploring different connection configurations to find the most likely explanation for the observed genetic data.

Evaluating Merged Pedigrees

After merging pedigrees, Bonsai v3 evaluates the likelihood of the resulting structure to determine if it provides a good explanation for the observed genetic data. This evaluation is performed by the get_ped_like() function:

def get_ped_like(
    up_dct: dict[int, dict[int, int]],
    id_to_shared_ibd: dict[tuple[int, int], list[dict]],
    id_to_info: dict[int, dict],
    pw_ll: Any,
):
    """
    Calculate the log-likelihood of a pedigree given IBD data.
    """
    log_like = 0.0
    
    # Get all genotyped individuals in the pedigree
    all_ids = get_all_id_set(up_dct)
    gen_ids = [i for i in all_ids if i > 0]  # Genotyped IDs are positive
    
    # For each pair of genotyped individuals
    for i in range(len(gen_ids)):
        id1 = gen_ids[i]
        for j in range(i+1, len(gen_ids)):
            id2 = gen_ids[j]
            
            # Skip if no IBD data for this pair
            pair = (min(id1, id2), max(id1, id2))
            if pair not in id_to_shared_ibd:
                continue
                
            # Get IBD data for this pair
            ibd_segs = id_to_shared_ibd[pair]
            
            # Get relationship from pedigree
            rel_tuple = get_simple_rel_tuple(up_dct, id1, id2)
            
            # Calculate likelihood of this relationship given IBD
            pair_ll = pw_ll.get_ibd_log_like(
                id1=id1,
                id2=id2,
                rel_tuple=rel_tuple,
                ibd_segs=ibd_segs,
            )
            
            # Add to total log-likelihood
            log_like += pair_ll
    
    # Add age-based likelihood component
    age_ll = get_age_log_like(up_dct, id_to_info)
    log_like += age_ll
    
    # Add structural likelihood component
    struct_ll = get_structural_log_like(up_dct)
    log_like += struct_ll
    
    return log_like

This function calculates the total log-likelihood of a pedigree by:

Identifying Genotyped Pairs: Finding all pairs of genotyped individuals in the pedigree.
Determining Relationships: Using get_simple_rel_tuple() to determine the relationship implied by the pedigree for each pair.
Calculating Genetic Likelihood: Using pw_ll.get_ibd_log_like() to calculate how well the relationship explains the observed IBD data.
Adding Age Constraints: Incorporating age-based constraints through get_age_log_like().
Adding Structural Penalties: Including penalties for structurally implausible pedigrees through get_structural_log_like().

The total log-likelihood provides a principled measure of how well a pedigree explains the observed genetic data while respecting biological constraints. This allows Bonsai v3 to compare different pedigree structures and select the most plausible explanation for the observed data.

Core Component: The combine_up_dicts() algorithm is fundamental to Bonsai v3's ability to scale from small pedigree structures to larger ones. By systematically merging pedigrees based on IBD sharing and evaluating the likelihood of different connection configurations, this algorithm enables the reconstruction of complex family networks that would be computationally infeasible to optimize directly. The iterative, likelihood-based approach allows Bonsai to build accurate pedigrees even in the presence of noise, missing data, and ambiguous relationships.

Comparing Notebook and Bonsai v3

The Lab15 notebook explores the combine_up_dicts() algorithm through simplified implementations and examples. While the notebook provides an educational introduction to the key concepts, the actual Bonsai v3 implementation includes additional sophistication:

Parallel Hypothesis Tracking: The production code maintains multiple candidate pedigrees in parallel, allowing Bonsai to explore multiple hypotheses simultaneously.
Sophisticated ID Management: The real implementation includes robust mechanisms for shifting and mapping IDs to avoid conflicts during pedigree merging.
Comprehensive Likelihood Evaluation: The production code includes more detailed likelihood calculations that account for various types of genetic and non-genetic evidence.
Optimization Heuristics: The real implementation includes various heuristics to efficiently prune the search space and focus on promising configurations.
Debug Visualization: The production code includes optional visualization features for debugging and analysis of the merging process.

The differences reflect the balance between educational clarity and production-ready robustness. The notebook provides a conceptual understanding of the algorithm, while the production implementation handles the full complexity of real-world pedigree reconstruction tasks.

Interactive Lab Environment

Run the interactive Lab 15 notebook in Google Colab:

Google Colab Environment

Run the notebook in Google Colab for a powerful computing environment with access to Google's resources.

Data will be automatically downloaded from S3 when you run the notebook.

Note: You may need a Google account to save your work in Google Drive.

Open Lab 15 Notebook in Google Colab

Beyond the Code

As you explore the combine_up_dicts() algorithm, consider these broader implications:

Computational Challenges: The combinatorial explosion in pedigree reconstruction and how divide-and-conquer approaches address these challenges
Probabilistic Reasoning: How likelihood-based methods can navigate uncertainty in genetic data and family relationships
Biological Constraints: The importance of incorporating domain knowledge about biological constraints in pedigree reconstruction
Historical Applications: How these methods can be applied to reconstruct historical family networks from limited genetic data
Algorithmic Trade-offs: The balance between exhaustive search and computational efficiency in large-scale applications

These considerations highlight how the combine_up_dicts() algorithm bridges theoretical computer science, statistical genetics, and practical genealogy applications, making it a fascinating example of interdisciplinary computational methods.

This lab is part of the Bonsai v3 Deep Dive track:

Introduction

Lab 01

Architecture

Lab 02

IBD Formats

Lab 03

Statistics

Lab 04

Models

Lab 05

Relationships

Lab 06

PwLogLike

Lab 07

Age Modeling

Lab 08

Data Structures

Lab 09

Up-Node Dict

Lab 10

Connection Points

Lab 11

Relationship Assessment

Lab 12

Small Pedigrees

Lab 13

Optimizing Pedigrees

Lab 14

Combine Up Dicts

Lab 15