Lab 18: Optimization Techniques and Performance Enhancements
Core Component: This lab explores the sophisticated optimization techniques used in Bonsai v3 to enable efficient pedigree reconstruction at scale. These optimizations include search space pruning, parallel processing, adaptive parameter selection, and specialized data structures, all of which are essential for handling real-world genetic genealogy datasets with hundreds or thousands of individuals.
The Challenge of Computational Efficiency
Why Optimization Matters
Pedigree reconstruction from genetic data is inherently combinatorial in nature. The number of possible pedigree configurations grows exponentially with the number of individuals, making naive approaches computationally infeasible for real-world datasets. Bonsai v3 addresses this challenge through a suite of sophisticated optimizations that dramatically reduce computational costs without sacrificing accuracy.
The key optimization challenges in pedigree reconstruction include:
- Combinatorial Explosion: The number of possible pedigree configurations grows exponentially with the number of individuals
- Computational Intensity: Likelihood calculations for each potential pedigree are computationally expensive
- Memory Requirements: Naive approaches require storing large numbers of candidate pedigrees
- Real-time Constraints: Applications often require results in a reasonable timeframe
Bonsai v3 addresses these challenges through optimization strategies implemented across all levels of the system, from high-level algorithms to low-level implementation details. Let's explore the key techniques that enable Bonsai to scale to large datasets.
Search Space Pruning
One of the most important optimization strategies in Bonsai v3 is search space pruning — intelligently restricting the space of pedigree configurations to explore. This is implemented through several specialized functions that focus computational resources on the most promising regions of the search space.
The prune_search_space
function manages this process:
def prune_search_space(
id_to_up_dct: dict[int, dict[int, dict[int, int]]],
id_to_shared_ibd: dict[tuple[int, int], list[dict]],
id_to_info: dict[int, dict],
pw_ll: Any,
):
"""
Prune the search space for pedigree reconstruction by identifying
and focusing on the most promising configurations.
Args:
id_to_up_dct: Dict mapping IDs to their pedigrees (up-node dictionaries)
id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
id_to_info: Dict with demographic information for individuals
pw_ll: PwLogLike instance for likelihood calculation
Returns:
pruned_search_space: Data structure defining the pruned search space
"""
# Step 1: Determine IBD connectivity structure
id_pair_to_cm = calculate_id_pair_to_cm(id_to_shared_ibd)
connectivity_graph = build_connectivity_graph(id_pair_to_cm)
# Step 2: Identify clusters of closely related individuals
clusters = identify_related_clusters(connectivity_graph, min_ibd_threshold=400)
# Step 3: Establish constraints based on demographic information
constraints = establish_demographic_constraints(id_to_info)
# Step 4: Generate search space partitions
partitions = generate_search_partitions(clusters, constraints, id_to_up_dct)
# Step 5: Create prioritized search order
search_order = prioritize_search_order(partitions, id_pair_to_cm, id_to_info)
# Assemble pruned search space representation
pruned_search_space = {
'clusters': clusters,
'constraints': constraints,
'partitions': partitions,
'search_order': search_order,
}
return pruned_search_space
This function implements several key optimization techniques:
- Connectivity-Based Clustering: Grouping individuals into clusters based on their IBD connectivity, allowing Bonsai to focus on reconstructing each cluster separately before merging them
- Demographic Constraints: Using age, sex, and other demographic information to rule out impossible relationships
- Search Partitioning: Dividing the search space into partitions that can be explored independently
- Prioritized Ordering: Establishing an optimal order for exploring the search space, focusing on the most promising regions first
By intelligently pruning the search space, Bonsai v3 can reduce the number of pedigree configurations to evaluate by orders of magnitude while still finding optimal or near-optimal solutions.
Parallel Processing
Bonsai v3 takes advantage of modern multi-core processors through parallel processing of independent tasks. This is particularly important for operations that can be naturally parallelized, such as evaluating multiple pedigree configurations or processing different chromosomes independently.
The parallel processing infrastructure is implemented in the parallel_manager.py
module:
class ParallelManager:
"""
Manages parallel execution of tasks in Bonsai v3.
This class provides infrastructure for running multiple independent
tasks in parallel using a thread pool or process pool.
"""
def __init__(self, max_workers=None, use_processes=False):
"""
Initialize the parallel manager.
Args:
max_workers: Maximum number of worker threads/processes
use_processes: Whether to use processes instead of threads
"""
self.max_workers = max_workers or min(32, os.cpu_count() + 4)
self.use_processes = use_processes
self._executor = None
def execute(self, tasks, callback=None):
"""
Execute a list of tasks in parallel.
Args:
tasks: List of (function, args, kwargs) tuples to execute
callback: Function to call with each result
Returns:
List of results from all tasks
"""
executor_cls = concurrent.futures.ProcessPoolExecutor if self.use_processes else concurrent.futures.ThreadPoolExecutor
with executor_cls(max_workers=self.max_workers) as executor:
# Submit all tasks
futures = []
for func, args, kwargs in tasks:
future = executor.submit(func, *args, **kwargs)
if callback:
future.add_done_callback(lambda f: callback(f.result()))
futures.append(future)
# Collect results
results = []
for future in concurrent.futures.as_completed(futures):
try:
result = future.result()
results.append(result)
except Exception as e:
results.append(None)
logging.error(f"Task failed with error: {e}")
return results
Bonsai v3 uses this parallel processing infrastructure for several key operations:
- Evaluating Multiple Pedigree Configurations: When exploring different ways to connect pedigrees or add new individuals, Bonsai can evaluate multiple configurations in parallel
- Processing Multiple Chromosomes: IBD segments on different chromosomes can be processed independently
- Batch Likelihood Calculations: When calculating likelihoods for multiple relationship configurations, these can be done in parallel
This parallel processing capability allows Bonsai v3 to leverage the full power of modern hardware, dramatically reducing computation time for large datasets.
Specialized Optimization Techniques
Adaptive Parameter Selection
Bonsai v3 uses adaptive parameter selection to optimize performance based on dataset characteristics. Rather than using fixed parameters for all situations, Bonsai dynamically adjusts its parameters based on factors like dataset size, IBD density, and computational resources available.
The optimize_parameters
function implements this approach:
def optimize_parameters(
id_to_shared_ibd: dict[tuple[int, int], list[dict]],
id_to_info: dict[int, dict],
execution_context: dict,
):
"""
Dynamically optimize algorithm parameters based on dataset characteristics.
Args:
id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
id_to_info: Dict with demographic information for individuals
execution_context: Information about the execution environment
Returns:
optimized_params: Dict of optimized parameter values
"""
# Get dataset characteristics
num_individuals = get_num_individuals(id_to_shared_ibd)
ibd_density = calculate_ibd_density(id_to_shared_ibd)
avg_ibd_length = calculate_avg_ibd_length(id_to_shared_ibd)
# Get execution environment information
available_memory = execution_context.get('available_memory', 8 * 1024 * 1024 * 1024) # Default 8GB
available_cores = execution_context.get('available_cores', os.cpu_count())
# Initialize parameters with default values
params = {
'max_up': 3, # Maximum generations to extend upward
'n_keep': 5, # Number of top pedigrees to keep
'ibd_threshold': 20, # Minimum IBD amount to consider (cM)
'max_iterations': 100, # Maximum iterations for optimization
'batch_size': 10, # Batch size for parallel processing
'use_threading': True, # Whether to use threading
}
# Adjust max_up based on IBD density and length
if ibd_density > 0.5 and avg_ibd_length > 1000:
# Dense IBD with long segments - likely close relatives
params['max_up'] = 2
elif ibd_density < 0.1 or avg_ibd_length < 100:
# Sparse IBD with short segments - likely distant relatives
params['max_up'] = 4
# Adjust n_keep based on available memory
if available_memory < 4 * 1024 * 1024 * 1024: # Less than 4GB
params['n_keep'] = 3
elif available_memory > 16 * 1024 * 1024 * 1024: # More than 16GB
params['n_keep'] = 10
# Adjust batch_size based on available cores
params['batch_size'] = min(max(available_cores, 2), 32)
# Adjust ibd_threshold based on dataset size
if num_individuals > 100:
params['ibd_threshold'] = 30
elif num_individuals < 20:
params['ibd_threshold'] = 10
# Decide whether to use threading or processes
if available_memory > 8 * 1024 * 1024 * 1024: # More than 8GB
params['use_threading'] = False # Use processes for better parallelism
return params
This adaptive parameter selection allows Bonsai v3 to optimize its behavior for each specific dataset and execution environment. For example:
- With datasets containing mostly close relatives (long IBD segments), Bonsai can use smaller values for
max_up
, reducing the search space - With large datasets on machines with limited memory, Bonsai can reduce
n_keep
to conserve memory - On machines with many CPU cores, Bonsai can increase
batch_size
to leverage parallel processing
This adaptive approach ensures that Bonsai v3 can efficiently handle a wide range of datasets, from small pedigrees with closely related individuals to large pedigrees with distant relationships.
Specialized Data Structures
Bonsai v3 uses specialized data structures optimized for the specific requirements of pedigree reconstruction. These data structures are designed to minimize memory usage and computational overhead while still providing efficient access to the information needed for reconstruction.
The CompactIBDStore
class is an example of such a specialized data structure:
class CompactIBDStore:
"""
Memory-efficient storage for IBD segment data.
This class provides a compact representation of IBD segments,
optimized for the specific access patterns used in Bonsai v3.
"""
def __init__(self, id_to_shared_ibd):
"""
Initialize the compact IBD store from a standard IBD representation.
Args:
id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
"""
# Convert to more efficient representation
self.pairs = []
self.segments = []
self.pair_to_segments = {}
pair_to_idx = {}
for (id1, id2), segs in id_to_shared_ibd.items():
pair_idx = len(self.pairs)
self.pairs.append((id1, id2))
pair_to_idx[(id1, id2)] = pair_idx
# Store segment indices
seg_indices = []
for seg in segs:
seg_idx = len(self.segments)
# Store only essential fields
compact_seg = {
'chrom': seg.get('chrom', 0),
'start_cm': seg.get('start_cm', 0),
'end_cm': seg.get('end_cm', 0),
'length_cm': seg.get('length_cm', 0)
}
self.segments.append(compact_seg)
seg_indices.append(seg_idx)
self.pair_to_segments[pair_idx] = seg_indices
def get_shared_segments(self, id1, id2):
"""
Get the IBD segments shared by two individuals.
Args:
id1, id2: IDs of the individuals
Returns:
List of shared IBD segments
"""
pair = (min(id1, id2), max(id1, id2))
if pair in self.pairs:
pair_idx = self.pairs.index(pair)
seg_indices = self.pair_to_segments.get(pair_idx, [])
return [self.segments[idx] for idx in seg_indices]
else:
return []
def get_total_ibd(self, id1, id2):
"""
Get the total IBD shared by two individuals.
Args:
id1, id2: IDs of the individuals
Returns:
Total shared IBD in centimorgans
"""
segments = self.get_shared_segments(id1, id2)
return sum(seg['length_cm'] for seg in segments)
Bonsai v3 uses several such specialized data structures, each optimized for specific operations:
- CompactIBDStore: Memory-efficient storage for IBD segment data
- SparseRelationshipMatrix: Efficient representation of pairwise relationships
- PedigreeGraph: Optimized graph representation of pedigree structures
- LikelihoodCache: Caching structure for reusing likelihood calculations
These specialized data structures allow Bonsai v3 to minimize memory usage and computational overhead, enabling it to handle much larger datasets than would be possible with general-purpose data structures.
Early Termination and Lazy Evaluation
Bonsai v3 uses early termination and lazy evaluation strategies to avoid unnecessary computation. These techniques allow Bonsai to quickly eliminate unpromising pedigree configurations without fully evaluating them, leading to significant performance improvements.
The evaluate_pedigree
function demonstrates this approach:
def evaluate_pedigree(
up_dct: dict[int, dict[int, int]],
id_to_shared_ibd: dict[tuple[int, int], list[dict]],
id_to_info: dict[int, dict],
pw_ll: Any,
early_term_threshold: float = -1000.0,
):
"""
Evaluate the likelihood of a pedigree with early termination.
Args:
up_dct: Up-node dictionary representing the pedigree
id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
id_to_info: Dict with demographic information for individuals
pw_ll: PwLogLike instance for likelihood calculation
early_term_threshold: Threshold for early termination
Returns:
log_likelihood: Log-likelihood of the pedigree, or None if terminated early
"""
# Initialize log-likelihood
log_like = 0.0
# Get all pairs of individuals in the pedigree
all_ids = list(up_dct.keys())
all_pairs = [(all_ids[i], all_ids[j]) for i in range(len(all_ids)) for j in range(i+1, len(all_ids))]
# First evaluate high-IBD pairs (more informative)
high_ibd_pairs = []
low_ibd_pairs = []
for id1, id2 in all_pairs:
pair = (min(id1, id2), max(id1, id2))
if pair in id_to_shared_ibd:
total_cm = sum(seg.get('length_cm', 0) for seg in id_to_shared_ibd[pair])
if total_cm > 100: # High IBD threshold
high_ibd_pairs.append((id1, id2))
else:
low_ibd_pairs.append((id1, id2))
else:
low_ibd_pairs.append((id1, id2))
# Evaluate high-IBD pairs first
for id1, id2 in high_ibd_pairs:
# Get relationship in the pedigree
rel_tuple = get_relationship_tuple(up_dct, id1, id2)
# Get IBD segments
pair = (min(id1, id2), max(id1, id2))
ibd_segs = id_to_shared_ibd.get(pair, [])
# Calculate likelihood for this pair
pair_ll = pw_ll.get_ibd_log_like(
id1=id1,
id2=id2,
rel_tuple=rel_tuple,
ibd_segs=ibd_segs,
)
# Update total likelihood
log_like += pair_ll
# Check for early termination
if log_like < early_term_threshold:
return None # Terminate early, pedigree is very unlikely
# Only evaluate low-IBD pairs if we haven't terminated early
for id1, id2 in low_ibd_pairs:
# (Similar evaluation as for high-IBD pairs)
# ...
return log_like
This function implements several key optimization techniques:
- Early Termination: If the likelihood falls below a threshold during evaluation, the function returns immediately without evaluating the remaining pairs
- Prioritized Evaluation: High-IBD pairs are evaluated first, as they are more informative and more likely to trigger early termination if the pedigree is incorrect
- Lazy Evaluation: Low-IBD pairs are only evaluated if the pedigree hasn't been rejected based on high-IBD pairs
These techniques allow Bonsai v3 to quickly eliminate unlikely pedigree configurations without fully evaluating them, dramatically reducing computation time for large datasets with many potential configurations.
Performance Profiling and Bottleneck Elimination
Identifying Performance Bottlenecks
Bonsai v3 includes sophisticated performance profiling capabilities to identify and eliminate bottlenecks. These profiling tools help developers understand where the system is spending time and memory, allowing them to focus optimization efforts on the most critical areas.
The performance_profiler.py
module provides these capabilities:
class PerformanceProfiler:
"""
Profiles the performance of Bonsai v3 components.
This class provides tools for measuring execution time,
memory usage, and other performance metrics.
"""
def __init__(self):
"""
Initialize the performance profiler.
"""
self.timers = {}
self.memory_samples = {}
self.function_calls = {}
self.is_active = False
def start(self):
"""
Start the profiler.
"""
self.is_active = True
self.timers = {}
self.memory_samples = {}
self.function_calls = {}
def stop(self):
"""
Stop the profiler.
"""
self.is_active = False
@contextlib.contextmanager
def timer(self, name):
"""
Context manager for timing a block of code.
Args:
name: Name of the timer
"""
if not self.is_active:
yield
return
start_time = time.time()
start_memory = get_memory_usage()
try:
yield
finally:
end_time = time.time()
end_memory = get_memory_usage()
# Record timing information
duration = end_time - start_time
if name not in self.timers:
self.timers[name] = []
self.timers[name].append(duration)
# Record memory usage
memory_delta = end_memory - start_memory
if name not in self.memory_samples:
self.memory_samples[name] = []
self.memory_samples[name].append(memory_delta)
def record_function_call(self, func_name, args_len, kwargs_len):
"""
Record a function call.
Args:
func_name: Name of the function
args_len: Number of positional arguments
kwargs_len: Number of keyword arguments
"""
if not self.is_active:
return
if func_name not in self.function_calls:
self.function_calls[func_name] = 0
self.function_calls[func_name] += 1
def get_summary(self):
"""
Get a summary of the profiling results.
Returns:
summary: Dict containing profiling summary
"""
if not self.timers:
return {"status": "No profiling data available"}
summary = {
"timers": {},
"memory": {},
"function_calls": self.function_calls,
"hotspots": []
}
# Summarize timing information
for name, durations in self.timers.items():
summary["timers"][name] = {
"total": sum(durations),
"calls": len(durations),
"average": sum(durations) / len(durations),
"min": min(durations),
"max": max(durations)
}
# Summarize memory usage
for name, deltas in self.memory_samples.items():
summary["memory"][name] = {
"total": sum(deltas),
"calls": len(deltas),
"average": sum(deltas) / len(deltas),
"min": min(deltas),
"max": max(deltas)
}
# Identify hotspots (operations taking the most time)
hotspots = sorted(
[(name, data["total"]) for name, data in summary["timers"].items()],
key=lambda x: x[1],
reverse=True
)
summary["hotspots"] = hotspots[:10] # Top 10 hotspots
return summary
This profiling infrastructure allows Bonsai v3 developers to identify and eliminate performance bottlenecks through:
- Time Profiling: Identifying which operations take the most time
- Memory Profiling: Identifying which operations use the most memory
- Call Profiling: Identifying which functions are called most frequently
- Hotspot Analysis: Focusing optimization efforts on the most critical areas
By systematically identifying and eliminating bottlenecks, Bonsai v3 developers have achieved significant performance improvements over previous versions, enabling the system to handle much larger datasets with reasonable computational resources.
Regression Testing
To ensure that optimization improvements don't compromise accuracy, Bonsai v3 includes a comprehensive regression testing framework. This framework validates that optimized algorithms produce results that are consistent with reference implementations, allowing developers to confidently improve performance without sacrificing accuracy.
The performance_test.py
module implements this approach:
def run_performance_benchmark(
test_datasets,
reference_implementation,
optimized_implementation,
tolerance=1e-6,
):
"""
Run a performance benchmark comparing reference and optimized implementations.
Args:
test_datasets: List of test datasets to use
reference_implementation: Function implementing the reference algorithm
optimized_implementation: Function implementing the optimized algorithm
tolerance: Tolerance for numerical differences in results
Returns:
benchmark_results: Dict containing benchmark results
"""
results = {
"performance_improvement": {},
"accuracy": {},
"memory_improvement": {},
"overall": {}
}
for dataset_name, dataset in test_datasets.items():
print(f"Benchmarking {dataset_name}...")
# Run reference implementation
reference_start_time = time.time()
reference_start_memory = get_memory_usage()
reference_result = reference_implementation(dataset)
reference_end_memory = get_memory_usage()
reference_end_time = time.time()
reference_time = reference_end_time - reference_start_time
reference_memory = reference_end_memory - reference_start_memory
# Run optimized implementation
optimized_start_time = time.time()
optimized_start_memory = get_memory_usage()
optimized_result = optimized_implementation(dataset)
optimized_end_memory = get_memory_usage()
optimized_end_time = time.time()
optimized_time = optimized_end_time - optimized_start_time
optimized_memory = optimized_end_memory - optimized_start_memory
# Calculate performance improvement
time_improvement = (reference_time - optimized_time) / reference_time * 100
memory_improvement = (reference_memory - optimized_memory) / reference_memory * 100
# Check result accuracy
is_accurate = check_results_match(reference_result, optimized_result, tolerance)
# Record results
results["performance_improvement"][dataset_name] = time_improvement
results["memory_improvement"][dataset_name] = memory_improvement
results["accuracy"][dataset_name] = is_accurate
results["overall"][dataset_name] = {
"reference_time": reference_time,
"optimized_time": optimized_time,
"reference_memory": reference_memory,
"optimized_memory": optimized_memory,
"time_improvement": time_improvement,
"memory_improvement": memory_improvement,
"is_accurate": is_accurate
}
print(f" Time improvement: {time_improvement:.2f}%")
print(f" Memory improvement: {memory_improvement:.2f}%")
print(f" Accurate results: {is_accurate}")
# Calculate overall statistics
avg_time_improvement = sum(results["performance_improvement"].values()) / len(results["performance_improvement"])
avg_memory_improvement = sum(results["memory_improvement"].values()) / len(results["memory_improvement"])
all_accurate = all(results["accuracy"].values())
results["summary"] = {
"avg_time_improvement": avg_time_improvement,
"avg_memory_improvement": avg_memory_improvement,
"all_accurate": all_accurate
}
print(f"Overall time improvement: {avg_time_improvement:.2f}%")
print(f"Overall memory improvement: {avg_memory_improvement:.2f}%")
print(f"All results accurate: {all_accurate}")
return results
This benchmark framework allows Bonsai v3 developers to:
- Measure Performance Improvements: Quantify time and memory improvements from optimizations
- Validate Accuracy: Ensure that optimized implementations produce results consistent with reference implementations
- Test Multiple Datasets: Verify that optimizations are effective across a range of dataset types and sizes
By systematically benchmarking and validating optimizations, Bonsai v3 developers can maintain a balance between computational efficiency and accuracy, ensuring that the system can handle large datasets without sacrificing the quality of results.
Core Component: The optimization techniques employed in Bonsai v3 are what make large-scale pedigree reconstruction computationally feasible. Through a combination of search space pruning, parallel processing, adaptive parameter selection, specialized data structures, and careful performance profiling, Bonsai v3 can handle datasets orders of magnitude larger than would be possible with naive approaches. These optimizations are not just implementation details but fundamental enabling technologies that make it possible to apply pedigree reconstruction to real-world genetic genealogy datasets.
Comparing Notebook and Bonsai v3
The Lab18 notebook explores optimization techniques and performance enhancements through simplified implementations and examples. While the notebook provides an educational introduction to the key concepts, the actual Bonsai v3 implementation includes additional sophistication:
- Advanced Profiling Infrastructure: The production code includes more sophisticated profiling capabilities for identifying bottlenecks.
- Extensive Optimization Tuning: The real implementation includes parameters that have been carefully tuned based on extensive benchmarking.
- Environment-Specific Optimizations: The production code includes optimizations specific to different hardware and software environments.
- Dynamic Load Balancing: More sophisticated approaches to balancing computational load across parallel workers.
- Memory Pool Management: Custom memory management to minimize allocation overhead and fragmentation.
These differences allow the production implementation to achieve maximum performance across a wide range of datasets and execution environments, while the notebook provides a more accessible introduction to the core optimization concepts.
Interactive Lab Environment
Run the interactive Lab 18 notebook in Google Colab:
Google Colab Environment
Run the notebook in Google Colab for a powerful computing environment with access to Google's resources.
Data will be automatically downloaded from S3 when you run the notebook.
Note: You may need a Google account to save your work in Google Drive.
Beyond the Code
As you explore optimization techniques and performance enhancements, consider these broader implications:
- Scaling Challenges: How computational optimizations enable the application of genetic genealogy to increasingly large datasets
- Accuracy vs. Speed Trade-offs: The balance between computational efficiency and accuracy in real-world applications
- Hardware Limitations: How optimization strategies must adapt to different hardware environments and constraints
- Algorithm Engineering: The importance of careful implementation and optimization beyond theoretical algorithmic complexity
- Data-Driven Optimization: How understanding the structure of genetic genealogy data can inform specific optimization strategies
These considerations highlight how optimization techniques are not just about making code run faster, but about enabling new applications and insights that would otherwise be computationally infeasible.
This lab is part of the Bonsai v3 Deep Dive track:
Introduction
Lab 01
Architecture
Lab 02
IBD Formats
Lab 03
Statistics
Lab 04
Models
Lab 05
Relationships
Lab 06
PwLogLike
Lab 07
Age Modeling
Lab 08
Data Structures
Lab 09
Up-Node Dict
Lab 10
Connection Points
Lab 11
Relationship Assessment
Lab 12
Small Pedigrees
Lab 13
Optimizing Pedigrees
Lab 14
Combine Up Dicts
Lab 15
Merging Pedigrees
Lab 16
Incremental Addition
Lab 17
Optimization Techniques
Lab 18