Lab 17: Incremental Individual Addition Strategies
Core Component: This lab explores how Bonsai v3 efficiently adds new individuals to existing pedigrees one at a time, rather than rebuilding the entire structure. These incremental addition strategies are critical for performance when working with large datasets and for updating pedigrees as new genetic data becomes available.
The Challenge of Incremental Addition
Why Incremental Addition Matters
In real-world genetic genealogy applications, data often arrives incrementally. New individuals may take DNA tests and need to be placed in an existing family structure without recomputing the entire pedigree. Bonsai v3 addresses this challenge through specialized functions that efficiently find optimal placement for new individuals within existing pedigrees.
The key advantages of incremental addition include:
- Computational Efficiency: Adding one individual is much faster than rebuilding an entire pedigree
- Stability: Existing relationships in the pedigree remain stable as new individuals are added
- Scalability: The approach can handle continuous addition of new individuals over time
- Real-time Updates: Results can be updated promptly as new data becomes available
The core of this functionality in Bonsai v3 is the add_individual_to_pedigree
function:
def add_individual_to_pedigree(
id_val: int,
up_dct: dict[int, dict[int, int]],
id_to_shared_ibd: dict[tuple[int, int], list[dict]],
id_to_info: dict[int, dict],
pw_ll: Any,
max_up: int = 3,
n_keep: int = 5,
return_only_best: bool = True,
):
"""
Add a single individual to an existing pedigree.
Args:
id_val: ID of the individual to add
up_dct: Up-node dictionary representing the existing pedigree
id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
id_to_info: Dict with demographic information for individuals
pw_ll: PwLogLike instance for likelihood calculation
max_up: Maximum number of generations to extend upward
n_keep: Number of top pedigrees to keep
return_only_best: Whether to return only the best pedigree
Returns:
List of (pedigree, likelihood) pairs or just the best one
"""
# Actual implementation details...
This function orchestrates the process of finding the optimal way to add a new individual to an existing pedigree. It evaluates different potential connection points and relationship configurations, much like the pedigree merging process we saw previously, but optimized for the single-individual case.
Finding Potential Connection Points
The first step in adding a new individual to a pedigree is to identify potential connection points. This is handled by the find_potential_connection_points
function:
def find_potential_connection_points(
id_val: int,
up_dct: dict[int, dict[int, int]],
id_to_shared_ibd: dict[tuple[int, int], list[dict]],
):
"""
Find potential points for connecting a new individual to a pedigree.
Args:
id_val: ID of the individual to add
up_dct: Up-node dictionary representing the existing pedigree
id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
Returns:
connection_points: List of (existing_id, ibd_amount) tuples
"""
# Find individuals in the pedigree who share IBD with the new individual
connection_points = []
for id_pair, segments in id_to_shared_ibd.items():
# Check if one of the IDs is the new individual
if id_val not in id_pair:
continue
# Get the other ID
other_id = id_pair[0] if id_pair[1] == id_val else id_pair[1]
# Check if the other ID is in the pedigree
if other_id in up_dct:
# Calculate total shared IBD
total_cm = sum(seg['length_cm'] for seg in segments)
# Add to connection points
connection_points.append((other_id, total_cm))
# Sort by amount of IBD sharing (descending)
connection_points.sort(key=lambda x: x[1], reverse=True)
return connection_points
This function identifies individuals in the existing pedigree who share IBD with the new individual. These individuals form the potential connection points through which the new individual can be added to the pedigree. The function sorts these connection points by the amount of IBD sharing, prioritizing stronger genetic connections.
The key insight here is that individuals who share more IBD with the new individual are likely to be more closely related, making them better candidates for connection points. By focusing on these high-IBD connections first, Bonsai can often find the optimal placement more efficiently.
Evaluating Different Placement Options
Once potential connection points are identified, Bonsai v3 evaluates different options for adding the new individual to the pedigree. This involves determining the optimal relationship between the new individual and existing individuals in the pedigree.
The evaluate_placement_options
function handles this process:
def evaluate_placement_options(
id_val: int,
connection_points: list[tuple[int, float]],
up_dct: dict[int, dict[int, int]],
id_to_shared_ibd: dict[tuple[int, int], list[dict]],
id_to_info: dict[int, dict],
pw_ll: Any,
max_up: int = 3,
):
"""
Evaluate different options for placing a new individual in a pedigree.
Args:
id_val: ID of the individual to add
connection_points: List of (existing_id, ibd_amount) tuples
up_dct: Up-node dictionary representing the existing pedigree
id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
id_to_info: Dict with demographic information for individuals
pw_ll: PwLogLike instance for likelihood calculation
max_up: Maximum number of generations to extend upward
Returns:
placement_options: List of (up_dct_new, likelihood) tuples
"""
placement_options = []
# Get demographic information for the new individual
new_info = id_to_info.get(id_val, {})
# For each potential connection point
for existing_id, _ in connection_points:
# Get demographic information for the existing individual
existing_info = id_to_info.get(existing_id, {})
# Get IBD segments between the individuals
pair = (min(id_val, existing_id), max(id_val, existing_id))
ibd_segs = id_to_shared_ibd.get(pair, [])
# Get potential relationship configurations
rel_configs = get_connection_degs_and_log_likes(
id1=id_val,
id2=existing_id,
id_to_shared_ibd=id_to_shared_ibd,
id_to_info=id_to_info,
pw_ll=pw_ll,
max_up=max_up,
)
# For each potential relationship configuration
for up, down, num_ancs, log_like in rel_configs:
# Create a new version of the pedigree with this relationship
up_dct_new = add_with_relationship(
id_val=id_val,
existing_id=existing_id,
up_dct=up_dct,
up=up,
down=down,
num_ancs=num_ancs,
)
# Calculate total likelihood of the new pedigree
total_likelihood = calculate_pedigree_likelihood(
up_dct_new,
id_to_shared_ibd,
id_to_info,
pw_ll,
)
# Add to placement options
placement_options.append((up_dct_new, total_likelihood))
# Sort by likelihood (descending)
placement_options.sort(key=lambda x: x[1], reverse=True)
return placement_options
This function systematically evaluates different ways to add the new individual to the pedigree, considering various relationship configurations with each potential connection point. It calculates the likelihood of each resulting pedigree and ranks them, allowing Bonsai to identify the most probable placement for the new individual.
The key steps in this process are:
- For each potential connection point, identify possible relationship configurations
- For each relationship configuration, create a new version of the pedigree
- Calculate the likelihood of each resulting pedigree
- Rank the pedigrees by likelihood
This approach allows Bonsai to efficiently explore the space of possible placements for the new individual and identify the most probable one based on genetic evidence.
Implementing the Addition
Adding Individuals with Different Relationship Types
Once the optimal relationship configuration is identified, Bonsai v3 needs to physically add the new individual to the pedigree with that relationship. The add_with_relationship
function handles this:
def add_with_relationship(
id_val: int,
existing_id: int,
up_dct: dict[int, dict[int, int]],
up: int,
down: int,
num_ancs: int,
):
"""
Add a new individual to a pedigree with a specific relationship.
Args:
id_val: ID of the individual to add
existing_id: ID of the existing individual to connect to
up_dct: Up-node dictionary representing the existing pedigree
up, down, num_ancs: Relationship parameters
Returns:
up_dct_new: Updated pedigree with the new individual
"""
# Create a copy of the existing pedigree
up_dct_new = copy.deepcopy(up_dct)
# Ensure the new individual exists in the pedigree
if id_val not in up_dct_new:
up_dct_new[id_val] = {}
# Handle different relationship types
if up == 0 and down == 1: # id_val is parent of existing_id
# Add id_val as parent of existing_id
up_dct_new[existing_id][id_val] = 1
elif up == 1 and down == 0: # existing_id is parent of id_val
# Add existing_id as parent of id_val
up_dct_new[id_val][existing_id] = 1
elif up == 1 and down == 1: # Siblings
# Get existing parents of existing_id
existing_parents = list(up_dct_new.get(existing_id, {}).keys())
if num_ancs == 1: # Half siblings
# Need at least one parent to create half siblings
if existing_parents:
shared_parent = existing_parents[0]
up_dct_new[id_val][shared_parent] = 1
else:
# Create a new ungenotyped parent
new_parent_id = get_min_id(up_dct_new) - 1
up_dct_new[existing_id][new_parent_id] = 1
up_dct_new[id_val][new_parent_id] = 1
up_dct_new[new_parent_id] = {}
elif num_ancs == 2: # Full siblings
# Need two parents to create full siblings
if len(existing_parents) >= 2:
parent1, parent2 = existing_parents[:2]
up_dct_new[id_val][parent1] = 1
up_dct_new[id_val][parent2] = 1
elif len(existing_parents) == 1:
# Use the existing parent and create a new one
shared_parent = existing_parents[0]
new_parent_id = get_min_id(up_dct_new) - 1
up_dct_new[existing_id][new_parent_id] = 1
up_dct_new[id_val][shared_parent] = 1
up_dct_new[id_val][new_parent_id] = 1
up_dct_new[new_parent_id] = {}
else:
# Create two new ungenotyped parents
parent1_id = get_min_id(up_dct_new) - 1
parent2_id = parent1_id - 1
up_dct_new[existing_id][parent1_id] = 1
up_dct_new[existing_id][parent2_id] = 1
up_dct_new[id_val][parent1_id] = 1
up_dct_new[id_val][parent2_id] = 1
up_dct_new[parent1_id] = {}
up_dct_new[parent2_id] = {}
else: # More distant relationships
# Handle more complex relationships through common ancestors
# Create ancestry chains as needed
# ... (implementation details for more distant relationships)
return up_dct_new
This function implements the physical addition of a new individual to the pedigree with a specific relationship to an existing individual. It handles various relationship types, including:
- Parent-Child: Adding the new individual as a parent or child of an existing individual
- Siblings: Adding the new individual as a half or full sibling of an existing individual
- More Distant Relationships: Adding the new individual with more complex relationships that involve common ancestors
For sibling relationships, the function attempts to reuse existing parents when available, or creates new ungenotyped parents as needed. This approach ensures that the resulting pedigree remains consistent with all previously established relationships while properly incorporating the new individual.
Maintaining Multiple Hypotheses
An important aspect of Bonsai v3's incremental addition strategy is the maintenance of multiple hypotheses about how a new individual might fit into the pedigree. This approach allows Bonsai to avoid getting stuck in local optima and explore different placement possibilities.
The select_best_placements
function handles this aspect:
def select_best_placements(
placement_options: list[tuple[dict, float]],
n_keep: int = 5,
):
"""
Select the top placement options to keep.
Args:
placement_options: List of (pedigree, likelihood) pairs
n_keep: Number of top options to keep
Returns:
best_placements: List of top (pedigree, likelihood) pairs
"""
# Filter out any invalid pedigrees
valid_options = [(ped, ll) for ped, ll in placement_options if is_valid_pedigree(ped)]
# Sort by likelihood (descending)
valid_options.sort(key=lambda x: x[1], reverse=True)
# Keep only the top n_keep options
best_placements = valid_options[:n_keep]
return best_placements
This function selects the top n_keep
placement options from all the evaluated possibilities. By maintaining multiple hypotheses, Bonsai can:
- Explore different possible placements for the new individual
- Compare these placements based on their likelihoods
- Select the most probable placement while keeping alternative possibilities
This approach is particularly valuable when the genetic evidence is ambiguous or when there are multiple plausible ways to fit the new individual into the pedigree.
The function also includes a validation step (is_valid_pedigree
) to ensure that the resulting pedigrees are biologically and structurally valid. This helps prevent invalid configurations, such as cycles in the pedigree or other inconsistencies.
Optimizing the Addition Process
Efficient Addition Strategies
Bonsai v3 includes several optimizations to make the incremental addition process more efficient, especially for large pedigrees. These optimizations include:
def optimize_addition(
id_val: int,
up_dct: dict[int, dict[int, int]],
id_to_shared_ibd: dict[tuple[int, int], list[dict]],
id_to_info: dict[int, dict],
):
"""
Apply optimizations to make the addition process more efficient.
Args:
id_val: ID of the individual to add
up_dct: Up-node dictionary representing the existing pedigree
id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
id_to_info: Dict with demographic information for individuals
Returns:
optimized_data: Tuple of optimized parameters for addition
"""
# Calculate IBD threshold based on pedigree size
pedigree_size = len(up_dct)
ibd_threshold = max(20, 10 * math.log(pedigree_size))
# Filter connection points by IBD threshold
connection_points = find_potential_connection_points(id_val, up_dct, id_to_shared_ibd)
filtered_points = [(id_i, ibd) for id_i, ibd in connection_points if ibd >= ibd_threshold]
# Limit the number of connection points to evaluate
max_points = min(len(filtered_points), 10)
top_points = filtered_points[:max_points]
# Determine optimal max_up parameter based on IBD amounts
if top_points:
max_ibd = top_points[0][1]
if max_ibd > 1700: # Parent-child level
suggested_max_up = 1
elif max_ibd > 700: # Sibling level
suggested_max_up = 2
elif max_ibd > 200: # First cousin level
suggested_max_up = 3
elif max_ibd > 50: # Second cousin level
suggested_max_up = 4
else:
suggested_max_up = 5
else:
suggested_max_up = 3 # Default
# Filter id_to_shared_ibd to include only relevant pairs
relevant_ids = set([id_val])
relevant_ids.update([id_i for id_i, _ in top_points])
filtered_ibd = {}
for (id1, id2), segments in id_to_shared_ibd.items():
if id1 in relevant_ids or id2 in relevant_ids:
filtered_ibd[(id1, id2)] = segments
return (top_points, suggested_max_up, filtered_ibd)
This function applies several key optimizations to make the addition process more efficient:
- Dynamic IBD Threshold: Adjusts the minimum IBD threshold based on pedigree size
- Connection Point Filtering: Limits the number of connection points to evaluate
- Adaptive max_up Parameter: Determines the optimal maximum number of generations to consider based on IBD amounts
- IBD Data Filtering: Restricts the IBD data to only the relevant pairs
These optimizations significantly reduce the computational cost of adding new individuals to large pedigrees while maintaining accuracy. By focusing on the most promising connection points and using adaptive parameters, Bonsai can efficiently find optimal placements even in complex scenarios.
Integration with the Overall Pedigree Building Process
The incremental addition strategy is integrated into Bonsai v3's overall pedigree building process through the build_pedigree_incremental
function:
def build_pedigree_incremental(
id_to_shared_ibd: dict[tuple[int, int], list[dict]],
id_to_info: dict[int, dict],
pw_ll: Any,
max_up: int = 3,
n_keep: int = 5,
id_order: list[int] = None,
):
"""
Build a pedigree incrementally by adding individuals one at a time.
Args:
id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
id_to_info: Dict with demographic information for individuals
pw_ll: PwLogLike instance for likelihood calculation
max_up: Maximum number of generations to extend upward
n_keep: Number of top pedigrees to keep
id_order: Order in which to add individuals (if None, determined automatically)
Returns:
best_pedigree: The final best pedigree
"""
# Get all individuals from the IBD data
all_ids = set()
for id1, id2 in id_to_shared_ibd:
all_ids.add(id1)
all_ids.add(id2)
# If no specific order provided, order by IBD connectivity
if id_order is None:
id_order = order_by_connectivity(all_ids, id_to_shared_ibd)
# Start with the first individual
first_id = id_order[0]
up_dct = {first_id: {}} # Single individual pedigree
# Add remaining individuals one at a time
for id_val in id_order[1:]:
# Add this individual to the current pedigree
result = add_individual_to_pedigree(
id_val=id_val,
up_dct=up_dct,
id_to_shared_ibd=id_to_shared_ibd,
id_to_info=id_to_info,
pw_ll=pw_ll,
max_up=max_up,
n_keep=n_keep,
return_only_best=True,
)
# Update the current pedigree
if result:
up_dct = result[0]
return up_dct
This function builds a pedigree incrementally by adding individuals one at a time in a specific order. The key steps are:
- Identifying all individuals from the IBD data
- Determining an optimal order for adding individuals (if not specified)
- Starting with a single individual
- Adding the remaining individuals one at a time
The order in which individuals are added can significantly impact the quality of the resulting pedigree. Bonsai typically orders individuals by their IBD connectivity — adding highly connected individuals first to establish a stable core pedigree, and then adding less connected individuals around this core.
This incremental approach allows Bonsai to efficiently build large pedigrees that would be computationally infeasible to optimize directly. By breaking the problem into a series of simpler incremental additions, Bonsai can scale to much larger datasets while maintaining accuracy.
Core Component: Incremental individual addition is a critical optimization in Bonsai v3 that enables efficient pedigree reconstruction at scale. By adding individuals one at a time to an existing pedigree, rather than rebuilding the entire structure, Bonsai can handle large datasets and continuous updates with new genetic data. The sophisticated algorithms for finding optimal placements and efficiently evaluating different options allow Bonsai to maintain accuracy while dramatically reducing computational costs.
Comparing Notebook and Bonsai v3
The Lab17 notebook explores incremental individual addition strategies through simplified implementations and examples. While the notebook provides an educational introduction to the key concepts, the actual Bonsai v3 implementation includes additional sophistication:
- Sophisticated Ordering Strategies: The production code includes advanced algorithms for determining the optimal order in which to add individuals.
- Parallel Evaluation: The real implementation can evaluate multiple placement options in parallel for better performance.
- Metadata Integration: More comprehensive integration of non-genetic information (age, sex, etc.) in the placement process.
- Caching and Memoization: The production code includes sophisticated caching mechanisms to avoid redundant computations.
- Error Recovery: More robust handling of edge cases and ability to recover from sub-optimal placements.
These differences allow the production implementation to handle the full complexity of real-world pedigree reconstruction tasks, while the notebook provides a more accessible introduction to the core concepts.
Interactive Lab Environment
Run the interactive Lab 17 notebook in Google Colab:
Google Colab Environment
Run the notebook in Google Colab for a powerful computing environment with access to Google's resources.
Data will be automatically downloaded from S3 when you run the notebook.
Note: You may need a Google account to save your work in Google Drive.
Beyond the Code
As you explore incremental individual addition strategies, consider these broader implications:
- Growing Datasets: How these techniques enable analysis of ever-growing genetic genealogy databases
- Adaptive Pedigree Building: The potential for continuously updating pedigrees as new data becomes available
- Ordering Strategies: The impact of different individual addition orders on the resulting pedigree structure
- Balancing Accuracy and Performance: The trade-offs involved in optimizing for efficient additions
- Adaptability: How incremental approaches can adjust to new evidence without requiring complete recomputation
These considerations highlight how incremental individual addition strategies are not just technical optimizations but fundamental enablers of large-scale pedigree reconstruction from genetic data.
This lab is part of the Bonsai v3 Deep Dive track:
Introduction
Lab 01
Architecture
Lab 02
IBD Formats
Lab 03
Statistics
Lab 04
Models
Lab 05
Relationships
Lab 06
PwLogLike
Lab 07
Age Modeling
Lab 08
Data Structures
Lab 09
Up-Node Dict
Lab 10
Connection Points
Lab 11
Relationship Assessment
Lab 12
Small Pedigrees
Lab 13
Optimizing Pedigrees
Lab 14
Combine Up Dicts
Lab 15
Merging Pedigrees
Lab 16
Incremental Addition
Lab 17