Lab 10: Up-Node Dictionary | Computational Genetic Genealogy

Lab 10: Up-Node Dictionary and Pedigree Representation

Core Component: This lab explores the up-node dictionary, the fundamental data structure at the heart of Bonsai v3's pedigree representation. Understanding this structure is essential for comprehending how Bonsai stores, queries, and manipulates genealogical relationships during pedigree inference.

Structure and Design Philosophy

The Dictionary-Based Approach

At the core of Bonsai v3's architecture is a dictionary-based approach to representing pedigrees. Unlike traditional graph libraries that use specialized data structures, Bonsai leverages Python's native dictionary type for its combination of flexibility, performance, and simplicity.

The up-node dictionary follows this structure:

up_node_dict = {
    individual_id: {parent_id1: degree1, parent_id2: degree2, ...},
    ...
}

Where:

Each key is an individual ID (positive for genotyped individuals, negative for inferred/ungenotyped individuals)
Each value is a dictionary mapping parent IDs to their relationship degrees (typically 1 for direct parents)
Empty dictionaries ({}) represent founder individuals with no recorded parents

This structure encodes a directed graph where edges point from child to parent (upward in a pedigree). The design offers several key advantages:

Efficiency: O(1) lookups for parent-child relationships
Flexibility: Easy to modify by adding or removing individuals
Serializability: Simple to convert to/from JSON for storage
Memory Efficiency: Compact representation for sparse pedigrees
Simplicity: Built on Python's native data structures

In the Bonsai v3 codebase, the key operations on up-node dictionaries are implemented in the pedigrees.py module, which contains dozens of specialized functions for creating, querying, and manipulating pedigree structures.

ID Conventions and Pedigree Topology

The Bonsai v3 system uses a consistent convention for individual IDs that encodes important information about the pedigree:

Positive IDs (1, 2, 3, ...): Represent genotyped individuals for whom genetic data is available
Negative IDs (-1, -2, -3, ...): Represent inferred/ungenotyped individuals, often ancestors who must exist but lack genetic data

This convention is critical for pedigree inference algorithms, as it distinguishes between observed individuals and those whose existence is inferred from patterns in the data. The actual implementation in pedigrees.py includes functions to ensure consistent ID assignment:

def get_min_id(dct):
    """Get the minimal ID in a node dict.
    
    Args:
        dct: Dictionary mapping nodes to their relatives
        
    Returns:
        Minimum ID found (always negative)
    """
    # Extract all IDs from the dictionary
    all_ids = set(dct.keys())
    for relatives in dct.values():
        all_ids.update(relatives.keys())
    
    # Find the minimum negative ID, or -1 if no negative IDs exist
    neg_ids = [i for i in all_ids if i < 0]
    min_id = min(neg_ids) if neg_ids else -1
    
    return min_id

This function is used when adding new ungenotyped individuals to ensure they receive unique negative IDs. The implementation avoids ID collisions by always selecting a value lower than any existing negative ID.

The combination of positive and negative IDs enables Bonsai to represent complex pedigree topologies, including:

Nuclear families with genotyped and ungenotyped members
Multi-generational pedigrees with missing ancestors
Half-sibling relationships (two individuals sharing one parent)
Consanguineous relationships (marriages between relatives)
Complex pedigrees with multiple family branches

Bidirectional Navigation with Down-Node Dictionaries

While the up-node dictionary is the primary representation in Bonsai v3, the system also uses a complementary structure called the down-node dictionary, which represents relationships in the opposite direction:

down_node_dict = {
    parent_id: {child_id1: degree1, child_id2: degree2, ...},
    ...
}

The two representations are interconvertible using the reverse_node_dict() function:

def reverse_node_dict(node_dict):
    """Convert between up_node_dict and down_node_dict.
    
    Args:
        node_dict: Dictionary mapping nodes to their relatives
        
    Returns:
        Dictionary with the reverse mapping
    """
    reverse_dict = {}
    
    # Initialize empty dictionaries for all nodes
    all_nodes = set(node_dict.keys()).union(
        *[set(relatives.keys()) for relatives in node_dict.values()])
    
    for node in all_nodes:
        reverse_dict[node] = {}
    
    # Populate reverse mappings
    for node, relatives in node_dict.items():
        for relative, degree in relatives.items():
            reverse_dict[relative][node] = degree
    
    return reverse_dict

This dual representation allows efficient navigation in both directions through the pedigree:

Use up_node_dict to navigate upward from individuals to ancestors
Use down_node_dict to navigate downward from individuals to descendants

Having both representations available is crucial for operations like finding all descendants of an individual or identifying siblings, which would be inefficient using only an up-node dictionary.

Core Operations on Up-Node Dictionaries

Traversing the Pedigree Graph

Pedigree traversal is a fundamental operation in genetic genealogy. Bonsai v3 implements efficient traversal algorithms through the get_rel_set() function, which finds all relatives of a specific type for an individual:

def get_rel_set(node_dict, i):
    """Find all relatives of i in node_dict.
    
    If node_dict is an up_dict, finds all ancestors.
    If node_dict is a down_dict, finds all descendants.
    
    Args:
        node_dict: Dictionary mapping nodes to their relatives
        i: Individual ID
        
    Returns:
        Set of all relatives (including i itself)
    """
    # Start with the individual themself
    rel_set = {i}
    
    # If not in dictionary, return just the individual
    if i not in node_dict:
        return rel_set
    
    # Process each direct relative recursively
    for j in node_dict[i]:
        # Add the direct relative
        rel_set.add(j)
        
        # Recursively add their relatives
        j_relatives = get_rel_set(node_dict, j)
        rel_set.update(j_relatives)
    
    return rel_set

This recursive implementation uses depth-first search to efficiently find all connected individuals. When applied to an up-node dictionary, it returns all ancestors of the specified individual. When applied to a down-node dictionary, it returns all descendants.

The production implementation includes additional optimizations for handling large pedigrees, including caching and cycle detection. These ensure that even complex pedigrees with thousands of individuals can be traversed efficiently.

Finding Paths and Determining Relationships

One of the most important operations in a pedigree is determining the relationship between two individuals. Bonsai v3 implements this through a two-step process:

Find all paths connecting the individuals using get_all_paths()
Analyze these paths to determine the relationship type using get_simple_rel_tuple()

The path-finding algorithm identifies all possible routes through common ancestors:

def get_all_paths(up_node_dict, i, j):
    """Find all paths between individuals i and j.
    
    Args:
        up_node_dict: Dictionary mapping individuals to their ancestors
        i, j: Individual IDs
        
    Returns:
        Tuple of (paths, common_ancestors)
    """
    # Special case for self
    if i == j:
        return {(i,)}, {i}
    
    # Find all ancestors of i and j
    i_ancs = get_rel_set(up_node_dict, i)
    j_ancs = get_rel_set(up_node_dict, j)
    
    # Find common ancestors
    common_ancs = i_ancs.intersection(j_ancs)
    
    # No relationship if no common ancestors
    if not common_ancs:
        return set(), set()
    
    # Find all paths through common ancestors
    paths = set()
    for anc in common_ancs:
        i_paths = get_paths_to_anc(up_node_dict, i, anc)
        j_paths = get_paths_to_anc(up_node_dict, j, anc)
        
        # Combine paths through this ancestor
        for i_path in i_paths:
            for j_path in j_paths:
                # Join paths (i → ancestor → j)
                full_path = i_path + j_path[1:]
                paths.add(tuple(full_path))
    
    return paths, common_ancs

Once the paths are found, the get_simple_rel_tuple() function analyzes them to determine the relationship type, returning a tuple of the form (up, down, num_ancs):

def get_simple_rel_tuple(up_node_dict, i, j):
    """Get relationship tuple (up, down, num_ancs) between individuals i and j.
    
    Args:
        up_node_dict: Dictionary mapping individuals to their ancestors
        i, j: Individual IDs
        
    Returns:
        Relationship tuple (up, down, num_ancs) or None if unrelated
    """
    # Handle self relationship
    if i == j:
        return (0, 0, 2)
    
    # Get paths between individuals
    paths, common_ancs = get_all_paths(up_node_dict, i, j)
    
    if not paths:
        return None  # No relationship
    
    # Analyze paths to determine relationship
    shortest_up = float('inf')
    shortest_down = float('inf')
    
    for path in paths:
        # Find position of common ancestor in path
        for k, node in enumerate(path):
            if node in common_ancs:
                up_steps = k
                down_steps = len(path) - k - 1
                
                shortest_up = min(shortest_up, up_steps)
                shortest_down = min(shortest_down, down_steps)
                break
    
    # Count distinct common ancestors that provide shortest paths
    num_ancs = sum(1 for anc in common_ancs 
                  if any(anc in path and 
                         path.index(anc) == shortest_up for path in paths))
    
    # Cap at 2 for full relationships
    num_ancs = min(num_ancs, 2)
    
    return (shortest_up, shortest_down, num_ancs)

This tuple representation elegantly encodes all possible genealogical relationships using just three integers:

up: Number of meioses (reproductive events) from individual i up to the common ancestor
down: Number of meioses from the common ancestor down to individual j
num_ancs: Number of common ancestors (1 for half relationships like half-siblings, 2 for full relationships like full siblings)

Common relationship patterns include:

Parent-child: (0, 1, 1) or (1, 0, 1)
Full siblings: (1, 1, 2)
Half siblings: (1, 1, 1)
Grandparent-grandchild: (0, 2, 1) or (2, 0, 1)
Aunt/uncle-niece/nephew: (1, 2, 1) or (2, 1, 1)
First cousins: (2, 2, 1)

This compact representation allows Bonsai v3 to efficiently reason about relationships during pedigree construction.

Extracting and Manipulating Subpedigrees

Many pedigree operations require working with just a portion of the full pedigree. Bonsai v3 provides specialized functions for extracting subpedigrees:

def get_subdict(dct, node):
    """Get the cone above/below node in a node dict.
    
    If dct is an up_dict, returns the subdict containing the node and all its ancestors.
    If dct is a down_dict, returns the subdict containing the node and all its descendants.
    
    Args:
        dct: Dictionary mapping nodes to their relatives
        node: Root node for the subdict
        
    Returns:
        Dictionary representing the subpedigree
    """
    import copy
    
    if node not in dct:
        return {}
        
    # Start with the node itself
    sub_dct = {}
    sub_dct[node] = copy.deepcopy(dct[node])
    
    # Add subdicts for all relatives of the node
    for n in dct[node]:
        n_dct = get_subdict(dct, n)
        if n_dct:
            sub_dct.update(n_dct)
            
    return sub_dct

This function extracts the "cone" of individuals connected to a specific node, making it useful for operations like visualizing an individual's ancestry or working with a specific branch of a family tree.

For more complex extraction needs, Bonsai v3 includes the get_sub_up_node_dict() function, which extracts the minimal subtree connecting a set of individuals:

def get_sub_up_node_dict(up_dct, id_set):
    """Get subtree connecting all IDs in id_set.
    
    Args:
        up_dct: Dictionary mapping individuals to their ancestors
        id_set: Set of individual IDs to connect
        
    Returns:
        Dictionary representing the minimal connecting subpedigree
    """
    # Collect all path nodes needed for the subtree
    all_nodes = set()
    
    # Include the specified IDs
    all_nodes.update(id_set)
    
    # Add all path nodes between pairs of IDs
    for i in id_set:
        for j in id_set:
            if i >= j:  # Avoid duplicate pairs
                continue
                
            # Get all paths between i and j
            paths, _ = get_all_paths(up_dct, i, j)
            
            # Add nodes from all paths
            for path in paths:
                all_nodes.update(path)
    
    # Create the subpedigree with only necessary connections
    sub_dict = {}
    for node in all_nodes:
        if node in up_dct:
            # Only include parents that are in the subtree
            sub_dict[node] = {p: d for p, d in up_dct[node].items() 
                              if p in all_nodes}
    
    return sub_dict

This function is particularly valuable for visualizing the relationships between selected individuals without the clutter of the complete pedigree. It's frequently used in Bonsai's visualization components and for extracting the relevant portion of a pedigree during incremental construction.

Advanced Pedigree Operations

Most Recent Common Ancestors (MRCAs)

Identifying common ancestors is crucial for understanding how individuals are related. Bonsai v3 implements an efficient algorithm for finding the Most Recent Common Ancestors (MRCAs) of a set of individuals:

def get_mrca_set(up_dct, id_set):
    """Get the set of most recent common ancestors of id_set.
    
    Args:
        up_dct: Dictionary mapping individuals to their ancestors
        id_set: Set of individual IDs
        
    Returns:
        Set of most recent common ancestors
    """
    if len(id_set) == 0:
        return set()
    if len(id_set) == 1:
        return id_set  # Individual is its own ancestor
    
    # Find all common ancestors
    anc_sets = [get_rel_set(up_dct, i) for i in id_set]
    common_ancs = set.intersection(*anc_sets)
    
    if not common_ancs:
        return set()  # No common ancestors
    
    # For each common ancestor, check if any of its descendants
    # is also a common ancestor - if so, it's not most recent
    mrca_set = set(common_ancs)
    
    for anc in common_ancs:
        # Skip if this ancestor has already been removed
        if anc not in mrca_set:
            continue
            
        # Get all ancestors of this ancestor
        anc_ancs = get_rel_set(up_dct, anc) - {anc}
        
        # Remove higher ancestors from MRCA set
        mrca_set -= anc_ancs
    
    return mrca_set

This algorithm first finds all ancestors that the individuals have in common, then filters out those that are more distant than necessary. The result is the set of "most recent" common ancestors—those that don't have descendants who are also common ancestors.

The MRCA concept is important not only for understanding relationships but also for efficient pedigree construction. By identifying MRCAs, Bonsai can focus on the most relevant parts of the pedigree when adding new individuals.

Adding and Removing Individuals

Pedigree manipulation is central to Bonsai v3's operation. The system includes specialized functions for modifying pedigree structures, such as the add_parent() function for adding ungenotyped parents:

def add_parent(node, up_dct, min_id=None):
    """Add an ungenotyped parent to node in up_dct.
    
    Args:
        node: Individual ID
        up_dct: Dictionary mapping individuals to their ancestors
        min_id: Minimum ID to use for new parent (optional)
        
    Returns:
        Tuple of (updated_dict, new_parent_id)
    """
    import copy
    up_dct = copy.deepcopy(up_dct)
    
    if node not in up_dct:
        raise ValueError(f"Node {node} is not in up dct.")
        
    pid_dict = up_dct[node]
    if len(pid_dict) >= 2:
        return up_dct, None  # Already has two parents, can't add more
        
    # Generate a new negative ID for the ungenotyped parent
    if min_id is None:
        min_id = get_min_id(up_dct)
        
    new_pid = min_id - 1
    up_dct[node][new_pid] = 1  # Add parent with degree 1
    up_dct[new_pid] = {}       # Initialize empty parents dict for new parent
    
    return up_dct, new_pid

This function maintains the integrity of the pedigree structure while adding new individuals. It includes checks to ensure that no individual has more than two parents and that new ungenotyped individuals receive unique negative IDs.

For removing individuals, Bonsai provides the delete_node() function:

def delete_node(dct, node):
    """Delete node from a node dict.
    
    Args:
        dct: Dictionary mapping nodes to their relatives
        node: Node to delete
        
    Returns:
        Updated dictionary
    """
    import copy
    new_dct = {}
    
    # Copy all entries except the deleted node
    for k, v in dct.items():
        if k != node:
            # Also remove references to the deleted node in relatives
            new_dct[k] = {r: d for r, d in v.items() if r != node}
    
    return new_dct

This function completely removes an individual from the pedigree, including all references to it from other individuals. This is useful for pruning unnecessary individuals from a pedigree or testing alternative hypotheses.

These manipulation functions, combined with the query functions discussed earlier, form the building blocks for Bonsai v3's pedigree construction algorithms.

Connection Points and Pedigree Merging

A critical operation in Bonsai v3 is identifying connection points between separate pedigree fragments. This is implemented through functions like find_connection_points(), which analyzes two pedigrees to find optimal ways to connect them:

def find_connection_points(pedigree1, pedigree2, pw_ll):
    """Find optimal points to connect two pedigrees.
    
    Args:
        pedigree1, pedigree2: Pedigrees to connect
        pw_ll: PwLogLike instance with relationship likelihoods
        
    Returns:
        List of potential connection points with scores
    """
    # Get individuals in each pedigree
    ids1 = set(pedigree1.keys())
    ids2 = set(pedigree2.keys())
    
    # Evaluate all possible connections
    connections = []
    for id1 in ids1:
        for id2 in ids2:
            # Only consider connections between genotyped individuals
            if id1 < 0 or id2 < 0:
                continue
                
            # Get relationship likelihood
            rel_tuple, log_ll = pw_ll.get_most_likely_rel(id1, id2)
            
            # Add to potential connections if sufficiently likely
            if log_ll > THRESHOLD:
                connections.append((id1, id2, rel_tuple, log_ll))
    
    # Sort by likelihood and return
    connections.sort(key=lambda x: x[3], reverse=True)
    return connections

This function evaluates potential connections between genotyped individuals in two separate pedigrees, using the PwLogLike class (which we explored in Lab 7) to assess the likelihood of each relationship. The result is a ranked list of potential connection points.

Once connection points are identified, pedigrees can be merged using specialized functions that combine the up-node dictionaries while maintaining consistency.

Core Component: The up-node dictionary is the cornerstone of Bonsai v3's architecture, providing a flexible and efficient representation of pedigree structures. Its combination of simplicity and power enables the system to represent complex family relationships, manipulate pedigrees, and perform sophisticated relationship inference from genetic data.

Comparing Notebook and Production Code

The Lab10 notebook provides a simplified exploration of up-node dictionaries, while the production implementation in Bonsai v3 includes additional sophistication:

Performance Optimizations: The production code includes memoization for recursive functions, specialized handling for large pedigrees, and cycle detection algorithms.
Error Handling: Robust error handling with detailed diagnostic information for common edge cases.
Support for Complex Topologies: Additional functions for handling complex structures like multiple marriages, adoptions, and special relationship types.
Consistency Checks: Validation routines that ensure pedigree integrity during modifications.
Serialization Support: Functions for converting between dictionary representation and other formats like GEDCOM or JSON.
Integration with Inference: Tight coupling with the PwLogLike system for probability-based pedigree construction.

The notebook provides a valuable introduction to the key concepts, but the production implementation represents years of refinement to handle the complexities of real-world pedigrees and genetic data.

Interactive Lab Environment

Run the interactive Lab 10 notebook in Google Colab:

Google Colab Environment

Run the notebook in Google Colab for a powerful computing environment with access to Google's resources.

Data will be automatically downloaded from S3 when you run the notebook.

Note: You may need a Google account to save your work in Google Drive.

Open Lab 10 Notebook in Google Colab

Beyond the Code

As you explore up-node dictionaries and pedigree representation, consider these broader implications:

Data Structure Design: How the choice of representation impacts the efficiency and expressiveness of algorithms
Graph Theory: How pedigrees represent a specialized class of directed acyclic graphs with biological constraints
Genealogical Encoding: How mathematical representations can capture the rich complexity of human family structures
Cultural Dimensions: How different cultures conceptualize family relationships, and how these can be encoded in data structures

These considerations highlight how the up-node dictionary represents not just a technical implementation but a computational model of human relationships that bridges biology, genealogy, and computer science.

This lab is part of the Bonsai v3 Deep Dive track:

Introduction

Lab 01

Architecture

Lab 02

IBD Formats

Lab 03

Statistics

Lab 04

Models

Lab 05

Relationships

Lab 06

PwLogLike

Lab 07

Age Modeling

Lab 08

Data Structures

Lab 09

Up-Node Dict

Lab 10