Lab 10: Up-Node Dictionary and Pedigree Representation
Core Component: This lab explores the up-node dictionary, the fundamental data structure at the heart of Bonsai v3's pedigree representation. Understanding this structure is essential for comprehending how Bonsai stores, queries, and manipulates genealogical relationships during pedigree inference.
Structure and Design Philosophy
The Dictionary-Based Approach
At the core of Bonsai v3's architecture is a dictionary-based approach to representing pedigrees. Unlike traditional graph libraries that use specialized data structures, Bonsai leverages Python's native dictionary type for its combination of flexibility, performance, and simplicity.
The up-node dictionary follows this structure:
up_node_dict = {
individual_id: {parent_id1: degree1, parent_id2: degree2, ...},
...
}
Where:
- Each key is an individual ID (positive for genotyped individuals, negative for inferred/ungenotyped individuals)
- Each value is a dictionary mapping parent IDs to their relationship degrees (typically 1 for direct parents)
- Empty dictionaries (
{}
) represent founder individuals with no recorded parents
This structure encodes a directed graph where edges point from child to parent (upward in a pedigree). The design offers several key advantages:
- Efficiency: O(1) lookups for parent-child relationships
- Flexibility: Easy to modify by adding or removing individuals
- Serializability: Simple to convert to/from JSON for storage
- Memory Efficiency: Compact representation for sparse pedigrees
- Simplicity: Built on Python's native data structures
In the Bonsai v3 codebase, the key operations on up-node dictionaries are implemented in the pedigrees.py
module, which contains dozens of specialized functions for creating, querying, and manipulating pedigree structures.
ID Conventions and Pedigree Topology
The Bonsai v3 system uses a consistent convention for individual IDs that encodes important information about the pedigree:
- Positive IDs (1, 2, 3, ...): Represent genotyped individuals for whom genetic data is available
- Negative IDs (-1, -2, -3, ...): Represent inferred/ungenotyped individuals, often ancestors who must exist but lack genetic data
This convention is critical for pedigree inference algorithms, as it distinguishes between observed individuals and those whose existence is inferred from patterns in the data. The actual implementation in pedigrees.py
includes functions to ensure consistent ID assignment:
def get_min_id(dct):
"""Get the minimal ID in a node dict.
Args:
dct: Dictionary mapping nodes to their relatives
Returns:
Minimum ID found (always negative)
"""
# Extract all IDs from the dictionary
all_ids = set(dct.keys())
for relatives in dct.values():
all_ids.update(relatives.keys())
# Find the minimum negative ID, or -1 if no negative IDs exist
neg_ids = [i for i in all_ids if i < 0]
min_id = min(neg_ids) if neg_ids else -1
return min_id
This function is used when adding new ungenotyped individuals to ensure they receive unique negative IDs. The implementation avoids ID collisions by always selecting a value lower than any existing negative ID.
The combination of positive and negative IDs enables Bonsai to represent complex pedigree topologies, including:
- Nuclear families with genotyped and ungenotyped members
- Multi-generational pedigrees with missing ancestors
- Half-sibling relationships (two individuals sharing one parent)
- Consanguineous relationships (marriages between relatives)
- Complex pedigrees with multiple family branches
Bidirectional Navigation with Down-Node Dictionaries
While the up-node dictionary is the primary representation in Bonsai v3, the system also uses a complementary structure called the down-node dictionary, which represents relationships in the opposite direction:
down_node_dict = {
parent_id: {child_id1: degree1, child_id2: degree2, ...},
...
}
The two representations are interconvertible using the reverse_node_dict()
function:
def reverse_node_dict(node_dict):
"""Convert between up_node_dict and down_node_dict.
Args:
node_dict: Dictionary mapping nodes to their relatives
Returns:
Dictionary with the reverse mapping
"""
reverse_dict = {}
# Initialize empty dictionaries for all nodes
all_nodes = set(node_dict.keys()).union(
*[set(relatives.keys()) for relatives in node_dict.values()])
for node in all_nodes:
reverse_dict[node] = {}
# Populate reverse mappings
for node, relatives in node_dict.items():
for relative, degree in relatives.items():
reverse_dict[relative][node] = degree
return reverse_dict
This dual representation allows efficient navigation in both directions through the pedigree:
- Use
up_node_dict
to navigate upward from individuals to ancestors - Use
down_node_dict
to navigate downward from individuals to descendants
Having both representations available is crucial for operations like finding all descendants of an individual or identifying siblings, which would be inefficient using only an up-node dictionary.
Core Operations on Up-Node Dictionaries
Traversing the Pedigree Graph
Pedigree traversal is a fundamental operation in genetic genealogy. Bonsai v3 implements efficient traversal algorithms through the get_rel_set()
function, which finds all relatives of a specific type for an individual:
def get_rel_set(node_dict, i):
"""Find all relatives of i in node_dict.
If node_dict is an up_dict, finds all ancestors.
If node_dict is a down_dict, finds all descendants.
Args:
node_dict: Dictionary mapping nodes to their relatives
i: Individual ID
Returns:
Set of all relatives (including i itself)
"""
# Start with the individual themself
rel_set = {i}
# If not in dictionary, return just the individual
if i not in node_dict:
return rel_set
# Process each direct relative recursively
for j in node_dict[i]:
# Add the direct relative
rel_set.add(j)
# Recursively add their relatives
j_relatives = get_rel_set(node_dict, j)
rel_set.update(j_relatives)
return rel_set
This recursive implementation uses depth-first search to efficiently find all connected individuals. When applied to an up-node dictionary, it returns all ancestors of the specified individual. When applied to a down-node dictionary, it returns all descendants.
The production implementation includes additional optimizations for handling large pedigrees, including caching and cycle detection. These ensure that even complex pedigrees with thousands of individuals can be traversed efficiently.
Finding Paths and Determining Relationships
One of the most important operations in a pedigree is determining the relationship between two individuals. Bonsai v3 implements this through a two-step process:
- Find all paths connecting the individuals using
get_all_paths()
- Analyze these paths to determine the relationship type using
get_simple_rel_tuple()
The path-finding algorithm identifies all possible routes through common ancestors:
def get_all_paths(up_node_dict, i, j):
"""Find all paths between individuals i and j.
Args:
up_node_dict: Dictionary mapping individuals to their ancestors
i, j: Individual IDs
Returns:
Tuple of (paths, common_ancestors)
"""
# Special case for self
if i == j:
return {(i,)}, {i}
# Find all ancestors of i and j
i_ancs = get_rel_set(up_node_dict, i)
j_ancs = get_rel_set(up_node_dict, j)
# Find common ancestors
common_ancs = i_ancs.intersection(j_ancs)
# No relationship if no common ancestors
if not common_ancs:
return set(), set()
# Find all paths through common ancestors
paths = set()
for anc in common_ancs:
i_paths = get_paths_to_anc(up_node_dict, i, anc)
j_paths = get_paths_to_anc(up_node_dict, j, anc)
# Combine paths through this ancestor
for i_path in i_paths:
for j_path in j_paths:
# Join paths (i → ancestor → j)
full_path = i_path + j_path[1:]
paths.add(tuple(full_path))
return paths, common_ancs
Once the paths are found, the get_simple_rel_tuple()
function analyzes them to determine the relationship type, returning a tuple of the form (up, down, num_ancs)
:
def get_simple_rel_tuple(up_node_dict, i, j):
"""Get relationship tuple (up, down, num_ancs) between individuals i and j.
Args:
up_node_dict: Dictionary mapping individuals to their ancestors
i, j: Individual IDs
Returns:
Relationship tuple (up, down, num_ancs) or None if unrelated
"""
# Handle self relationship
if i == j:
return (0, 0, 2)
# Get paths between individuals
paths, common_ancs = get_all_paths(up_node_dict, i, j)
if not paths:
return None # No relationship
# Analyze paths to determine relationship
shortest_up = float('inf')
shortest_down = float('inf')
for path in paths:
# Find position of common ancestor in path
for k, node in enumerate(path):
if node in common_ancs:
up_steps = k
down_steps = len(path) - k - 1
shortest_up = min(shortest_up, up_steps)
shortest_down = min(shortest_down, down_steps)
break
# Count distinct common ancestors that provide shortest paths
num_ancs = sum(1 for anc in common_ancs
if any(anc in path and
path.index(anc) == shortest_up for path in paths))
# Cap at 2 for full relationships
num_ancs = min(num_ancs, 2)
return (shortest_up, shortest_down, num_ancs)
This tuple representation elegantly encodes all possible genealogical relationships using just three integers:
- up: Number of meioses (reproductive events) from individual i up to the common ancestor
- down: Number of meioses from the common ancestor down to individual j
- num_ancs: Number of common ancestors (1 for half relationships like half-siblings, 2 for full relationships like full siblings)
Common relationship patterns include:
- Parent-child:
(0, 1, 1)
or(1, 0, 1)
- Full siblings:
(1, 1, 2)
- Half siblings:
(1, 1, 1)
- Grandparent-grandchild:
(0, 2, 1)
or(2, 0, 1)
- Aunt/uncle-niece/nephew:
(1, 2, 1)
or(2, 1, 1)
- First cousins:
(2, 2, 1)
This compact representation allows Bonsai v3 to efficiently reason about relationships during pedigree construction.
Extracting and Manipulating Subpedigrees
Many pedigree operations require working with just a portion of the full pedigree. Bonsai v3 provides specialized functions for extracting subpedigrees:
def get_subdict(dct, node):
"""Get the cone above/below node in a node dict.
If dct is an up_dict, returns the subdict containing the node and all its ancestors.
If dct is a down_dict, returns the subdict containing the node and all its descendants.
Args:
dct: Dictionary mapping nodes to their relatives
node: Root node for the subdict
Returns:
Dictionary representing the subpedigree
"""
import copy
if node not in dct:
return {}
# Start with the node itself
sub_dct = {}
sub_dct[node] = copy.deepcopy(dct[node])
# Add subdicts for all relatives of the node
for n in dct[node]:
n_dct = get_subdict(dct, n)
if n_dct:
sub_dct.update(n_dct)
return sub_dct
This function extracts the "cone" of individuals connected to a specific node, making it useful for operations like visualizing an individual's ancestry or working with a specific branch of a family tree.
For more complex extraction needs, Bonsai v3 includes the get_sub_up_node_dict()
function, which extracts the minimal subtree connecting a set of individuals:
def get_sub_up_node_dict(up_dct, id_set):
"""Get subtree connecting all IDs in id_set.
Args:
up_dct: Dictionary mapping individuals to their ancestors
id_set: Set of individual IDs to connect
Returns:
Dictionary representing the minimal connecting subpedigree
"""
# Collect all path nodes needed for the subtree
all_nodes = set()
# Include the specified IDs
all_nodes.update(id_set)
# Add all path nodes between pairs of IDs
for i in id_set:
for j in id_set:
if i >= j: # Avoid duplicate pairs
continue
# Get all paths between i and j
paths, _ = get_all_paths(up_dct, i, j)
# Add nodes from all paths
for path in paths:
all_nodes.update(path)
# Create the subpedigree with only necessary connections
sub_dict = {}
for node in all_nodes:
if node in up_dct:
# Only include parents that are in the subtree
sub_dict[node] = {p: d for p, d in up_dct[node].items()
if p in all_nodes}
return sub_dict
This function is particularly valuable for visualizing the relationships between selected individuals without the clutter of the complete pedigree. It's frequently used in Bonsai's visualization components and for extracting the relevant portion of a pedigree during incremental construction.
Advanced Pedigree Operations
Most Recent Common Ancestors (MRCAs)
Identifying common ancestors is crucial for understanding how individuals are related. Bonsai v3 implements an efficient algorithm for finding the Most Recent Common Ancestors (MRCAs) of a set of individuals:
def get_mrca_set(up_dct, id_set):
"""Get the set of most recent common ancestors of id_set.
Args:
up_dct: Dictionary mapping individuals to their ancestors
id_set: Set of individual IDs
Returns:
Set of most recent common ancestors
"""
if len(id_set) == 0:
return set()
if len(id_set) == 1:
return id_set # Individual is its own ancestor
# Find all common ancestors
anc_sets = [get_rel_set(up_dct, i) for i in id_set]
common_ancs = set.intersection(*anc_sets)
if not common_ancs:
return set() # No common ancestors
# For each common ancestor, check if any of its descendants
# is also a common ancestor - if so, it's not most recent
mrca_set = set(common_ancs)
for anc in common_ancs:
# Skip if this ancestor has already been removed
if anc not in mrca_set:
continue
# Get all ancestors of this ancestor
anc_ancs = get_rel_set(up_dct, anc) - {anc}
# Remove higher ancestors from MRCA set
mrca_set -= anc_ancs
return mrca_set
This algorithm first finds all ancestors that the individuals have in common, then filters out those that are more distant than necessary. The result is the set of "most recent" common ancestors—those that don't have descendants who are also common ancestors.
The MRCA concept is important not only for understanding relationships but also for efficient pedigree construction. By identifying MRCAs, Bonsai can focus on the most relevant parts of the pedigree when adding new individuals.
Adding and Removing Individuals
Pedigree manipulation is central to Bonsai v3's operation. The system includes specialized functions for modifying pedigree structures, such as the add_parent()
function for adding ungenotyped parents:
def add_parent(node, up_dct, min_id=None):
"""Add an ungenotyped parent to node in up_dct.
Args:
node: Individual ID
up_dct: Dictionary mapping individuals to their ancestors
min_id: Minimum ID to use for new parent (optional)
Returns:
Tuple of (updated_dict, new_parent_id)
"""
import copy
up_dct = copy.deepcopy(up_dct)
if node not in up_dct:
raise ValueError(f"Node {node} is not in up dct.")
pid_dict = up_dct[node]
if len(pid_dict) >= 2:
return up_dct, None # Already has two parents, can't add more
# Generate a new negative ID for the ungenotyped parent
if min_id is None:
min_id = get_min_id(up_dct)
new_pid = min_id - 1
up_dct[node][new_pid] = 1 # Add parent with degree 1
up_dct[new_pid] = {} # Initialize empty parents dict for new parent
return up_dct, new_pid
This function maintains the integrity of the pedigree structure while adding new individuals. It includes checks to ensure that no individual has more than two parents and that new ungenotyped individuals receive unique negative IDs.
For removing individuals, Bonsai provides the delete_node()
function:
def delete_node(dct, node):
"""Delete node from a node dict.
Args:
dct: Dictionary mapping nodes to their relatives
node: Node to delete
Returns:
Updated dictionary
"""
import copy
new_dct = {}
# Copy all entries except the deleted node
for k, v in dct.items():
if k != node:
# Also remove references to the deleted node in relatives
new_dct[k] = {r: d for r, d in v.items() if r != node}
return new_dct
This function completely removes an individual from the pedigree, including all references to it from other individuals. This is useful for pruning unnecessary individuals from a pedigree or testing alternative hypotheses.
These manipulation functions, combined with the query functions discussed earlier, form the building blocks for Bonsai v3's pedigree construction algorithms.
Connection Points and Pedigree Merging
A critical operation in Bonsai v3 is identifying connection points between separate pedigree fragments. This is implemented through functions like find_connection_points()
, which analyzes two pedigrees to find optimal ways to connect them:
def find_connection_points(pedigree1, pedigree2, pw_ll):
"""Find optimal points to connect two pedigrees.
Args:
pedigree1, pedigree2: Pedigrees to connect
pw_ll: PwLogLike instance with relationship likelihoods
Returns:
List of potential connection points with scores
"""
# Get individuals in each pedigree
ids1 = set(pedigree1.keys())
ids2 = set(pedigree2.keys())
# Evaluate all possible connections
connections = []
for id1 in ids1:
for id2 in ids2:
# Only consider connections between genotyped individuals
if id1 < 0 or id2 < 0:
continue
# Get relationship likelihood
rel_tuple, log_ll = pw_ll.get_most_likely_rel(id1, id2)
# Add to potential connections if sufficiently likely
if log_ll > THRESHOLD:
connections.append((id1, id2, rel_tuple, log_ll))
# Sort by likelihood and return
connections.sort(key=lambda x: x[3], reverse=True)
return connections
This function evaluates potential connections between genotyped individuals in two separate pedigrees, using the PwLogLike
class (which we explored in Lab 7) to assess the likelihood of each relationship. The result is a ranked list of potential connection points.
Once connection points are identified, pedigrees can be merged using specialized functions that combine the up-node dictionaries while maintaining consistency.
Core Component: The up-node dictionary is the cornerstone of Bonsai v3's architecture, providing a flexible and efficient representation of pedigree structures. Its combination of simplicity and power enables the system to represent complex family relationships, manipulate pedigrees, and perform sophisticated relationship inference from genetic data.
Comparing Notebook and Production Code
The Lab10 notebook provides a simplified exploration of up-node dictionaries, while the production implementation in Bonsai v3 includes additional sophistication:
- Performance Optimizations: The production code includes memoization for recursive functions, specialized handling for large pedigrees, and cycle detection algorithms.
- Error Handling: Robust error handling with detailed diagnostic information for common edge cases.
- Support for Complex Topologies: Additional functions for handling complex structures like multiple marriages, adoptions, and special relationship types.
- Consistency Checks: Validation routines that ensure pedigree integrity during modifications.
- Serialization Support: Functions for converting between dictionary representation and other formats like GEDCOM or JSON.
- Integration with Inference: Tight coupling with the PwLogLike system for probability-based pedigree construction.
The notebook provides a valuable introduction to the key concepts, but the production implementation represents years of refinement to handle the complexities of real-world pedigrees and genetic data.
Interactive Lab Environment
Run the interactive Lab 10 notebook in Google Colab:
Google Colab Environment
Run the notebook in Google Colab for a powerful computing environment with access to Google's resources.
Data will be automatically downloaded from S3 when you run the notebook.
Note: You may need a Google account to save your work in Google Drive.
Beyond the Code
As you explore up-node dictionaries and pedigree representation, consider these broader implications:
- Data Structure Design: How the choice of representation impacts the efficiency and expressiveness of algorithms
- Graph Theory: How pedigrees represent a specialized class of directed acyclic graphs with biological constraints
- Genealogical Encoding: How mathematical representations can capture the rich complexity of human family structures
- Cultural Dimensions: How different cultures conceptualize family relationships, and how these can be encoded in data structures
These considerations highlight how the up-node dictionary represents not just a technical implementation but a computational model of human relationships that bridges biology, genealogy, and computer science.
This lab is part of the Bonsai v3 Deep Dive track:
Introduction
Lab 01
Architecture
Lab 02
IBD Formats
Lab 03
Statistics
Lab 04
Models
Lab 05
Relationships
Lab 06
PwLogLike
Lab 07
Age Modeling
Lab 08
Data Structures
Lab 09
Up-Node Dict
Lab 10