Lab 15: The combine_up_dicts() Algorithm
Core Component: This lab explores the combine_up_dicts()
algorithm, the central mechanism in Bonsai v3 for scaling from small pedigree structures to larger ones. This algorithm is essential for reconstructing complex family networks by systematically merging smaller pedigree units through an iterative, likelihood-based approach.
Scaling from Small to Large Pedigrees
The Challenge of Large-Scale Pedigree Reconstruction
Reconstructing large pedigrees presents significant challenges:
- Combinatorial Explosion: The number of possible pedigree configurations grows exponentially with the number of individuals
- Computational Constraints: Directly optimizing large pedigrees would require prohibitive computational resources
- Incomplete Data: Real-world genetic data often contains missing information and measurement errors
- Conflicting Evidence: Different pieces of genetic evidence might suggest incompatible relationships
To address these challenges, Bonsai v3 implements a bottom-up, incremental approach through the combine_up_dicts()
function. This function, found in utils/bonsaitree/bonsaitree/v3/connections.py
, lies at the heart of Bonsai's ability to scale from small pedigree structures to more complex family networks.
def combine_up_dicts(
id_to_up_dct: dict[int, dict[int, dict[int, int]]],
id_to_shared_ibd: dict[tuple[int, int], list[dict]],
id_to_info: dict[int, dict],
pw_ll: Any,
max_up: int = 3,
n_keep: int = 5,
id_set_to_exclude: set[int] = None,
):
"""
Combine all pedigrees in id_to_up_dct incrementally.
Args:
id_to_up_dct: Dictionary mapping IDs to their up-node dictionaries
id_to_shared_ibd: Dictionary mapping pairs of IDs to their IBD segments
id_to_info: Dictionary with demographic information for individuals
pw_ll: PwLogLike instance for likelihood calculation
max_up: Maximum number of generations to extend upward
n_keep: Number of top pedigrees to keep at each step
id_set_to_exclude: Set of IDs to exclude from combination
Returns:
List of optimized pedigrees with their likelihoods
"""
# Core algorithm implementation...
This function orchestrates the pedigree merging process, addressing the scaling challenges through several key strategies:
- Divide and Conquer: Breaking the large problem into smaller, more manageable subproblems (small pedigrees)
- Greedy Merging: Iteratively combining the pedigrees with the strongest genetic connections first
- Hypothesis Tracking: Maintaining multiple candidate pedigrees to avoid local optima
- Intelligent Pruning: Focusing computational resources on the most promising pedigree configurations
By following this incremental approach, Bonsai v3 can reconstruct large-scale pedigrees that would be computationally infeasible to optimize directly.
The Iteration Process: Finding and Merging Pedigrees
The core of combine_up_dicts()
is an iterative process that repeatedly identifies and merges the most closely related pedigrees until all individuals are connected or no more reliable connections can be made. This process involves several key steps:
def combine_up_dicts(...):
# ... initialization code ...
# Main iteration loop
while len(idx_to_up_dict_ll_list) > 1:
# Find closest pedigrees to merge based on IBD sharing
idx1, idx2 = find_closest_pedigrees(
idx_to_id_set=idx_to_id_set,
id_to_shared_ibd=id_to_shared_ibd,
min_cm=min_cm,
)
# If no more pedigrees can be connected, break
if idx1 is None or idx2 is None:
break
# Get list of pedigrees to combine
up_dict_ll_list1 = idx_to_up_dict_ll_list[idx1]
up_dict_ll_list2 = idx_to_up_dict_ll_list[idx2]
# Try all combinations of pedigrees
all_combined = []
for up_dict1, ll1 in up_dict_ll_list1:
for up_dict2, ll2 in up_dict_ll_list2:
# Find optimal way to connect these pedigrees
combined_list = combine_pedigrees(
up_dct1=up_dict1,
up_dct2=up_dict2,
id_to_shared_ibd=id_to_shared_ibd,
id_to_info=id_to_info,
pw_ll=pw_ll,
max_up=max_up,
keep_num=n_keep,
return_many=True,
)
# Add all combined pedigrees to the list
all_combined.extend([(ped, ll1 + ll2 + new_ll)
for ped, new_ll in combined_list])
# Keep only top n_keep combined pedigrees
all_combined.sort(key=lambda x: x[1], reverse=True)
kept_combined = all_combined[:n_keep]
# Merge pedigree tracking structures
# ... update idx_to_up_dict_ll_list, id_to_idx, idx_to_id_set ...
# Return the final list of pedigrees with likelihoods
return idx_to_up_dict_ll_list[0]
The key steps in this iteration process are:
- Finding the Closest Pedigrees: The
find_closest_pedigrees()
function identifies which pedigrees share the most IBD and should be merged next. This prioritizes merging the most closely related pedigrees first. - Trying All Combinations: For each pedigree in the first set and each pedigree in the second set, Bonsai tries to combine them using
combine_pedigrees()
. - Evaluating Connection Options: When combining pedigrees, Bonsai explores various connection points and relationship types to find the optimal configuration.
- Pruning Hypotheses: After all combinations are evaluated, only the top
n_keep
pedigrees with the highest likelihoods are retained for the next iteration. - Updating Tracking Structures: The pedigree tracking data structures are updated to reflect the merged pedigrees, and the process repeats until all pedigrees are connected or no more connections are possible.
This iterative approach allows Bonsai v3 to build complex pedigrees piece by piece, focusing computational resources on the most promising configurations at each step.
Finding the Optimal Connection Points
A critical component of the combine_up_dicts()
algorithm is finding the optimal way to connect two pedigrees. This is handled by the combine_pedigrees()
function, which systematically explores different connection options:
def combine_pedigrees(
up_dct1: dict[int, dict[int, int]],
up_dct2: dict[int, dict[int, int]],
id_to_shared_ibd: dict[tuple[int, int], list[dict]],
id_to_info: dict[int, dict],
pw_ll: Any,
max_up: int = 3,
keep_num: int = 3,
return_many: bool = False,
):
"""
Combine two pedigrees into one, using IBD sharing to guide the connection.
"""
# Find pairs of individuals that connect the pedigrees
con_pairs = find_connecting_pairs(
up_dct1=up_dct1,
up_dct2=up_dct2,
id_to_shared_ibd=id_to_shared_ibd,
)
if not con_pairs:
return None if not return_many else []
# Get all possible connection points in each pedigree
con_pts1 = get_possible_connection_point_set(up_dct1)
con_pts2 = get_possible_connection_point_set(up_dct2)
# Restrict to connection points involving individuals with IBD
con_pts1 = restrict_connection_point_set(up_dct1, con_pts1,
[id1 for id1, _ in con_pairs])
con_pts2 = restrict_connection_point_set(up_dct2, con_pts2,
[id2 for _, id2 in con_pairs])
# Try different ways to connect the pedigrees
all_combined = []
for (id1, id2) in con_pairs:
# Find connection points that involve these IDs
rel_pts1 = [pt for pt in con_pts1 if pt[0] == id1]
rel_pts2 = [pt for pt in con_pts2 if pt[0] == id2]
# Try each pair of connection points
for cp1 in rel_pts1:
for cp2 in rel_pts2:
# Try different relationship configurations
for up in range(max_up + 1):
for down in range(max_up + 1):
if up + down > max_up:
continue
# Try both 1 and 2 ancestors
for num_ancs in [1, 2]:
# Connect the pedigrees
combs = connect_pedigrees_through_points(
id1=cp1[0],
id2=cp2[0],
pid1=cp1[1],
pid2=cp2[1],
up_dct1=up_dct1,
up_dct2=up_dct2,
deg1=up,
deg2=down,
num_ancs=num_ancs,
)
# Evaluate each combination
for comb in combs:
ll = get_ped_like(
up_dct=comb,
id_to_shared_ibd=id_to_shared_ibd,
id_to_info=id_to_info,
pw_ll=pw_ll,
)
all_combined.append((comb, ll))
# Sort by likelihood and return
all_combined.sort(key=lambda x: x[1], reverse=True)
return all_combined[:keep_num] if return_many else all_combined[0][0]
This function systematically explores the space of possible connections between two pedigrees:
- Identifying Connection Candidates: Bonsai identifies pairs of individuals (one from each pedigree) who share IBD segments, making them candidates for connection.
- Finding Connection Points: Using
get_possible_connection_point_set()
, Bonsai identifies all possible points in each pedigree where connections can be made. - Restricting to Relevant Points: Connection points are filtered to focus on those involving individuals who share IBD, reducing the search space.
- Systematic Exploration: Bonsai then systematically tries different connection configurations by varying:
- The number of generations up from the first individual (
up
) - The number of generations down from the second individual (
down
) - The number of common ancestors (
num_ancs
)
- The number of generations up from the first individual (
- Evaluation: Each combined pedigree is evaluated using
get_ped_like()
to calculate its likelihood given the observed IBD data. - Ranking: Combinations are ranked by likelihood, and the top
keep_num
are returned.
This systematic approach allows Bonsai v3 to find the most likely way to connect pedigrees, even when there are multiple plausible connection options.
Tracking and Merging Pedigrees
Pedigree Tracking Data Structures
To efficiently manage the pedigree merging process, Bonsai v3 maintains several data structures that track which individuals belong to which pedigrees and how multiple pedigree hypotheses are evaluated:
def combine_up_dicts(...):
# Initialize pedigree tracking structures
idx_to_up_dict_ll_list = {} # Maps pedigree indices to lists of (pedigree, log-likelihood) pairs
id_to_idx = {} # Maps individual IDs to their pedigree index
idx_to_id_set = {} # Maps pedigree indices to sets of contained individual IDs
# Initialize with individual pedigrees
for i, (id_val, up_dict) in enumerate(id_to_up_dct.items()):
if id_set_to_exclude and id_val in id_set_to_exclude:
continue
idx_to_up_dict_ll_list[i] = [(up_dict, 0.0)] # Initial likelihood = 0
id_to_idx[id_val] = i
idx_to_id_set[i] = {id_val}
# Merging operations update these structures
# ... main iteration loop ...
# After finding closest pedigrees and combining them:
# Create new index for merged pedigree
new_idx = max(idx_to_up_dict_ll_list.keys()) + 1
# Store combined pedigrees under new index
idx_to_up_dict_ll_list[new_idx] = kept_combined
# Update id_to_idx for all individuals in merged pedigrees
id_set1 = idx_to_id_set[idx1]
id_set2 = idx_to_id_set[idx2]
merged_id_set = id_set1.union(id_set2)
for id_val in merged_id_set:
id_to_idx[id_val] = new_idx
# Update idx_to_id_set with merged set
idx_to_id_set[new_idx] = merged_id_set
# Remove old pedigree records
del idx_to_up_dict_ll_list[idx1]
del idx_to_up_dict_ll_list[idx2]
del idx_to_id_set[idx1]
del idx_to_id_set[idx2]
These data structures serve several important purposes:
- Multiple Hypothesis Tracking: For each pedigree index,
idx_to_up_dict_ll_list
maintains multiple candidate pedigrees with their likelihoods, allowing Bonsai to consider alternative hypotheses. - Individual-to-Pedigree Mapping: The
id_to_idx
dictionary enables efficient lookup of which pedigree contains a given individual. - Pedigree Membership Tracking: The
idx_to_id_set
dictionary allows Bonsai to quickly determine which individuals are in a given pedigree.
When pedigrees are merged, these data structures are updated accordingly:
- A new pedigree index is created for the merged pedigree
- The combined pedigrees are stored under this new index
- Individual-to-pedigree mappings are updated to reflect the merger
- The old pedigree records are removed
This efficient tracking system allows Bonsai v3 to manage complex pedigree merging operations while maintaining multiple hypotheses about the optimal pedigree structure.
Merging the Physical Pedigrees
While the tracking data structures manage the logical organization of pedigrees, the actual merging of physical pedigree structures is handled by the connect_pedigrees_through_points()
function:
def connect_pedigrees_through_points(
id1 : int,
id2 : int,
pid1 : Optional[int],
pid2 : Optional[int],
up_dct1 : dict[int, dict[int, int]],
up_dct2 : dict[int, dict[int, int]],
deg1 : int,
deg2 : int,
num_ancs : int,
simple : bool=True,
):
"""
Connect up_dct1 to up_dct2 through points id1 in up_dct1
and id2 in up_dct2. Also connect through partner points
pid1 and pid2, if indicated. Connect id1 to id2 through
a relationship specified by (deg1, deg2, num_ancs).
"""
# Validate connection parameters
if deg1 == deg2 == 0 and (id1 > 0 and id2 > 0) and id1 != id2:
return [] # Can't connect directly on different genotyped individuals
if deg1 == deg2 == 0 and (pid1 != pid2):
if pid1 is None or pid2 is None:
return []
elif pid1 > 0 and pid2 > 0:
return [] # Can't connect on genotyped partners
# Make copies to avoid modifying originals
up_dct1 = copy.deepcopy(up_dct1)
up_dct2 = copy.deepcopy(up_dct2)
# Extend lineages upward if needed
if deg1 > 0:
up_dct1, _, id1, pid1 = extend_up(
iid=id1,
deg=deg1,
num_ancs=num_ancs,
up_dct=up_dct1,
)
if deg2 > 0:
up_dct2, _, id2, pid2 = extend_up(
iid=id2,
deg=deg2,
num_ancs=num_ancs,
up_dct=up_dct2,
)
# Shift IDs in up_dct2 to avoid conflicts
min_id = get_min_id(up_dct1) - 1
up_dct2, id_map = shift_ids(
ped=up_dct2,
shift=min_id,
)
id2 = id_map.get(id2, id2)
pid2 = id_map.get(pid2, pid2)
# Create ID mapping for connection
if simple:
if (pid1 is not None) and (pid2 is not None):
id_map_list = [
{id1 : id2, pid1 : pid2}
]
else:
id_map_list = [
{id1 : id2}
]
else:
id_map_list = get_all_matches(
id1=id1,
id2=id2,
pid1=pid1,
pid2=pid2,
up_dct1=up_dct1,
up_dct2=up_dct2,
)
# Connect pedigrees using ID mapping
connect_dct_list = []
for id_map in id_map_list:
up_dct = connect_on(
id_map=id_map,
up_dct1=up_dct1,
up_dct2=up_dct2,
)
connect_dct_list.append(up_dct)
return connect_dct_list
This function handles the intricate process of physically merging two pedigree structures through specified connection points. The key steps are:
- Validation: Ensuring the connection is biologically plausible (e.g., can't connect directly on different genotyped individuals).
- Lineage Extension: Using
extend_up()
to add ancestors if the connection requires multiple generations. - ID Management: Shifting IDs in the second pedigree to avoid conflicts with the first pedigree.
- Connection Mapping: Creating a mapping between IDs in the two pedigrees that specifies how they should be connected.
- Physical Merging: Using
connect_on()
to physically merge the pedigrees based on the ID mapping.
The actual connection is performed by connect_on()
, which merges two pedigrees based on an ID mapping:
def connect_on(
id_map: dict[int, int],
up_dct1: dict[int, dict[int, int]],
up_dct2: dict[int, dict[int, int]],
):
"""
Connect up_dct1 to up_dct2 based on id_map.
Map values in up_dct1 to keys in up_dct2.
"""
# Create result pedigree starting with all of up_dct1
result = copy.deepcopy(up_dct1)
# Add all nodes from up_dct2 not in the mapping
for node2, parents2 in up_dct2.items():
# Skip nodes that are values in the mapping
if node2 in id_map.values():
continue
# Add node if not already in result
if node2 not in result:
result[node2] = {}
# Add parents, mapping them if necessary
for parent2, deg2 in parents2.items():
mapped_parent = next((k for k, v in id_map.items() if v == parent2), parent2)
result[node2][mapped_parent] = deg2
# Connect nodes that are mapped
for node1, node2 in id_map.items():
# Transfer parents from node2 to node1
for parent2, deg2 in up_dct2.get(node2, {}).items():
mapped_parent = next((k for k, v in id_map.items() if v == parent2), parent2)
result[node1][mapped_parent] = deg2
# Make node1 a parent for all children of node2
for child2, parents2 in up_dct2.items():
if node2 in parents2 and child2 not in id_map.values():
if child2 not in result:
result[child2] = {}
result[child2][node1] = parents2[node2]
return result
This function handles the details of merging pedigrees at the structural level:
- It creates a new pedigree that includes all nodes from the first pedigree
- It adds nodes from the second pedigree that aren't part of the connection mapping
- It transfers parents and children between the mapped nodes to create the merged structure
Through this process, Bonsai v3 can physically merge pedigree structures in various ways, exploring different connection configurations to find the most likely explanation for the observed genetic data.
Evaluating Merged Pedigrees
After merging pedigrees, Bonsai v3 evaluates the likelihood of the resulting structure to determine if it provides a good explanation for the observed genetic data. This evaluation is performed by the get_ped_like()
function:
def get_ped_like(
up_dct: dict[int, dict[int, int]],
id_to_shared_ibd: dict[tuple[int, int], list[dict]],
id_to_info: dict[int, dict],
pw_ll: Any,
):
"""
Calculate the log-likelihood of a pedigree given IBD data.
"""
log_like = 0.0
# Get all genotyped individuals in the pedigree
all_ids = get_all_id_set(up_dct)
gen_ids = [i for i in all_ids if i > 0] # Genotyped IDs are positive
# For each pair of genotyped individuals
for i in range(len(gen_ids)):
id1 = gen_ids[i]
for j in range(i+1, len(gen_ids)):
id2 = gen_ids[j]
# Skip if no IBD data for this pair
pair = (min(id1, id2), max(id1, id2))
if pair not in id_to_shared_ibd:
continue
# Get IBD data for this pair
ibd_segs = id_to_shared_ibd[pair]
# Get relationship from pedigree
rel_tuple = get_simple_rel_tuple(up_dct, id1, id2)
# Calculate likelihood of this relationship given IBD
pair_ll = pw_ll.get_ibd_log_like(
id1=id1,
id2=id2,
rel_tuple=rel_tuple,
ibd_segs=ibd_segs,
)
# Add to total log-likelihood
log_like += pair_ll
# Add age-based likelihood component
age_ll = get_age_log_like(up_dct, id_to_info)
log_like += age_ll
# Add structural likelihood component
struct_ll = get_structural_log_like(up_dct)
log_like += struct_ll
return log_like
This function calculates the total log-likelihood of a pedigree by:
- Identifying Genotyped Pairs: Finding all pairs of genotyped individuals in the pedigree.
- Determining Relationships: Using
get_simple_rel_tuple()
to determine the relationship implied by the pedigree for each pair. - Calculating Genetic Likelihood: Using
pw_ll.get_ibd_log_like()
to calculate how well the relationship explains the observed IBD data. - Adding Age Constraints: Incorporating age-based constraints through
get_age_log_like()
. - Adding Structural Penalties: Including penalties for structurally implausible pedigrees through
get_structural_log_like()
.
The total log-likelihood provides a principled measure of how well a pedigree explains the observed genetic data while respecting biological constraints. This allows Bonsai v3 to compare different pedigree structures and select the most plausible explanation for the observed data.
Core Component: The combine_up_dicts()
algorithm is fundamental to Bonsai v3's ability to scale from small pedigree structures to larger ones. By systematically merging pedigrees based on IBD sharing and evaluating the likelihood of different connection configurations, this algorithm enables the reconstruction of complex family networks that would be computationally infeasible to optimize directly. The iterative, likelihood-based approach allows Bonsai to build accurate pedigrees even in the presence of noise, missing data, and ambiguous relationships.
Comparing Notebook and Bonsai v3
The Lab15 notebook explores the combine_up_dicts()
algorithm through simplified implementations and examples. While the notebook provides an educational introduction to the key concepts, the actual Bonsai v3 implementation includes additional sophistication:
- Parallel Hypothesis Tracking: The production code maintains multiple candidate pedigrees in parallel, allowing Bonsai to explore multiple hypotheses simultaneously.
- Sophisticated ID Management: The real implementation includes robust mechanisms for shifting and mapping IDs to avoid conflicts during pedigree merging.
- Comprehensive Likelihood Evaluation: The production code includes more detailed likelihood calculations that account for various types of genetic and non-genetic evidence.
- Optimization Heuristics: The real implementation includes various heuristics to efficiently prune the search space and focus on promising configurations.
- Debug Visualization: The production code includes optional visualization features for debugging and analysis of the merging process.
The differences reflect the balance between educational clarity and production-ready robustness. The notebook provides a conceptual understanding of the algorithm, while the production implementation handles the full complexity of real-world pedigree reconstruction tasks.
Interactive Lab Environment
Run the interactive Lab 15 notebook in Google Colab:
Google Colab Environment
Run the notebook in Google Colab for a powerful computing environment with access to Google's resources.
Data will be automatically downloaded from S3 when you run the notebook.
Note: You may need a Google account to save your work in Google Drive.
Beyond the Code
As you explore the combine_up_dicts()
algorithm, consider these broader implications:
- Computational Challenges: The combinatorial explosion in pedigree reconstruction and how divide-and-conquer approaches address these challenges
- Probabilistic Reasoning: How likelihood-based methods can navigate uncertainty in genetic data and family relationships
- Biological Constraints: The importance of incorporating domain knowledge about biological constraints in pedigree reconstruction
- Historical Applications: How these methods can be applied to reconstruct historical family networks from limited genetic data
- Algorithmic Trade-offs: The balance between exhaustive search and computational efficiency in large-scale applications
These considerations highlight how the combine_up_dicts()
algorithm bridges theoretical computer science, statistical genetics, and practical genealogy applications, making it a fascinating example of interdisciplinary computational methods.
This lab is part of the Bonsai v3 Deep Dive track:
Introduction
Lab 01
Architecture
Lab 02
IBD Formats
Lab 03
Statistics
Lab 04
Models
Lab 05
Relationships
Lab 06
PwLogLike
Lab 07
Age Modeling
Lab 08
Data Structures
Lab 09
Up-Node Dict
Lab 10
Connection Points
Lab 11
Relationship Assessment
Lab 12
Small Pedigrees
Lab 13
Optimizing Pedigrees
Lab 14
Combine Up Dicts
Lab 15