Lab 09: Pedigree Data Structures

Core Component: This lab explores the fundamental data structures used in Bonsai v3 to represent and manipulate pedigrees. Understanding these structures is essential because they form the foundation for all pedigree operations, from building and querying to optimization and rendering.

Fundamental Pedigree Representations

The Node Dictionary Approach

At the heart of Bonsai v3's pedigree representation is a dictionary-based approach that efficiently encodes genealogical relationships. The system uses two complementary dictionary structures:

up_node_dict: Maps individuals to their ancestors with degrees of relationship
down_node_dict: Maps individuals to their descendants with degrees of relationship

These dictionaries offer several advantages over alternative representations:

Efficient lookup of relatives in O(1) time
Memory-efficient representation of sparse pedigrees
Natural handling of missing data and complex pedigree structures
Support for both genotyped and ungenotyped (inferred) individuals

Let's examine the structure of the up_node_dict, the primary representation used in Bonsai v3:

up_node_dict = {
    3: {1: 1, 2: 1},  # Individual 3 has parents 1 and 2, with degree 1
    4: {1: 1, 2: 1},  # Individual 4 has parents 1 and 2, with degree 1
    1: {},            # Individual 1 has no parents in this pedigree
    2: {}             # Individual 2 has no parents in this pedigree
}

In this dictionary:

Each key is an individual's ID (positive for genotyped, negative for ungenotyped)
Each value is a dictionary mapping parent IDs to relationship degrees
The relationship degree is typically 1 for direct parents

Bonsai v3 also maintains a parallel down_node_dict that provides the reverse mapping, from parents to children. These dictionaries can be converted between each other using the reverse_node_dict() function:

def reverse_node_dict(node_dict):
    """Convert between up_node_dict and down_node_dict.
    
    Args:
        node_dict: Dictionary mapping nodes to their relatives
        
    Returns:
        Dictionary with the reverse mapping
    """
    reverse_dict = {}
    
    # Initialize empty dictionaries for all nodes
    all_nodes = set(node_dict.keys()).union(
        *[set(relatives.keys()) for relatives in node_dict.values()])
    
    for node in all_nodes:
        reverse_dict[node] = {}
    
    # Populate reverse mappings
    for node, relatives in node_dict.items():
        for relative, degree in relatives.items():
            reverse_dict[relative][node] = degree
    
    return reverse_dict

This dual representation allows efficient navigation in both directions through the pedigree, facilitating operations like finding all descendants of an individual or tracing ancestry paths.

Representing Relationships

Beyond the node dictionaries, Bonsai v3 uses a compact tuple representation for relationships between individuals: (up, down, num_ancs)

up: Number of meioses up from individual 1 to common ancestor
down: Number of meioses down from common ancestor to individual 2
num_ancs: Number of common ancestors (1 for half relationships, 2 for full relationships)

This representation has several advantages:

Compact encoding of relationship types
Direct mapping to genetic inheritance patterns
Clear distinction between full and half relationships
Support for arbitrary relationship distances

Common relationship tuples include:

"identical/self": (0, 0, 2)
"parent-child": (0, 1, 1)
"child-parent": (1, 0, 1)
"full sibling": (1, 1, 2)
"half sibling": (1, 1, 1)
"grandparent-grandchild": (0, 2, 1)
"aunt/uncle-niece/nephew": (1, 2, 1)
"first cousin": (2, 2, 1)
"second cousin": (3, 3, 1)

The Bonsai codebase includes functions to convert between relationship tuples and relationship names, facilitating user-friendly output:

def get_relationship_name(relationship_tuple):
    """Convert a relationship tuple to a human-readable name.
    
    Args:
        relationship_tuple: (up, down, num_ancs) tuple
        
    Returns:
        Human-readable relationship name
    """
    up, down, num_ancs = relationship_tuple
    
    # Self relationship
    if up == 0 and down == 0 and num_ancs == 2:
        return "Self"
    
    # Direct lineage
    if up == 0 and down == 1:
        return "Parent"
    if up == 1 and down == 0:
        return "Child"
    if up == 0 and down == 2:
        return "Grandparent"
    if up == 2 and down == 0:
        return "Grandchild"
    
    # Siblings
    if up == 1 and down == 1:
        if num_ancs == 2:
            return "Full Sibling"
        else:
            return "Half Sibling"
    
    # Aunts/Uncles and Nieces/Nephews
    if up == 1 and down == 2:
        return "Aunt/Uncle"
    if up == 2 and down == 1:
        return "Niece/Nephew"
    
    # Cousins
    if up >= 2 and down >= 2:
        degree = min(up, down) - 1  # First cousin = 1, Second cousin = 2, etc.
        removal = abs(up - down)    # 0 = same generation, 1 = once removed, etc.
        
        cousin_type = "Full" if num_ancs == 2 else "Half"
        
        if removal == 0:
            return f"{cousin_type} {degree_to_ordinal(degree)} Cousin"
        else:
            return f"{cousin_type} {degree_to_ordinal(degree)} Cousin {removal} time{'s' if removal > 1 else ''} removed"
    
    # Other complex relationships
    return f"Relationship ({up}, {down}, {num_ancs})"

This bidirectional mapping between compact tuples and human-readable names is crucial for both internal computation and user interface elements.

Navigating and Querying Pedigrees

Finding Relatives and Paths

Bonsai v3 includes a suite of functions for navigating and querying pedigrees. These functions build upon the foundation of the node dictionaries to provide higher-level operations.

The get_rel_set() function finds all relatives of a specific type (ancestors or descendants):

def get_rel_set(node_dict, i):
    """Find all relatives of i in node_dict.
    
    If node_dict is an up_dict, finds all ancestors.
    If node_dict is a down_dict, finds all descendants.
    
    Args:
        node_dict: Dictionary mapping nodes to their relatives
        i: Individual ID
        
    Returns:
        Set of all relatives (including i)
    """
    # Start with the individual themself
    rel_set = {i}
    
    # If not in dictionary, return just the individual
    if i not in node_dict:
        return rel_set
    
    # Process each direct relative
    for j in node_dict[i]:
        # Add the direct relative
        rel_set.add(j)
        
        # Recursively add their relatives
        j_relatives = get_rel_set(node_dict, j)
        rel_set.update(j_relatives)
    
    return rel_set

This recursive approach efficiently traces all paths through the pedigree in a specific direction. For example, when applied to an up_node_dict, it finds all ancestors of an individual, and when applied to a down_node_dict, it finds all descendants.

To find paths between individuals, Bonsai v3 uses the get_all_paths() function:

def get_all_paths(up_node_dict, i, j):
    """Find all paths between individuals i and j.
    
    Args:
        up_node_dict: Dictionary mapping individuals to their ancestors
        i, j: Individual IDs
        
    Returns:
        Tuple of (paths, common_ancestors)
    """
    # Special case for self
    if i == j:
        return {(i,)}, {i}
    
    # Find all ancestors of i and j
    i_ancs = get_rel_set(up_node_dict, i)
    j_ancs = get_rel_set(up_node_dict, j)
    
    # Find common ancestors
    common_ancs = i_ancs.intersection(j_ancs)
    
    # No relationship if no common ancestors
    if not common_ancs:
        return set(), set()
    
    # Find all paths through common ancestors
    paths = set()
    for anc in common_ancs:
        i_paths = get_path_to_anc(up_node_dict, i, anc)
        j_paths = get_path_to_anc(up_node_dict, j, anc)
        
        # Combine paths through this ancestor
        for i_path in i_paths:
            i_path_rev = i_path[::-1]  # Reverse i's path (child → ancestor)
            for j_path in j_paths:
                # Join paths (i → ancestor → j)
                # Remove duplicate ancestor in the middle
                full_path = i_path_rev + j_path[1:]
                paths.add(tuple(full_path))
    
    return paths, common_ancs

This function is central to relationship inference, as it finds all possible paths connecting two individuals through common ancestors. The resulting paths can then be analyzed to determine the genetic relationship between the individuals.

Building on these path-finding capabilities, Bonsai v3 includes the get_simple_rel_tuple() function, which computes the standard relationship tuple between two individuals:

def get_simple_rel_tuple(up_node_dict, i, j):
    """Get relationship tuple (up, down, num_ancs) between individuals i and j.
    
    Args:
        up_node_dict: Dictionary mapping individuals to their ancestors
        i, j: Individual IDs
        
    Returns:
        Relationship tuple (up, down, num_ancs) or None if unrelated
    """
    # Handle self relationship
    if i == j:
        return (0, 0, 2)
    
    # Get paths between individuals
    paths, common_ancs = get_all_paths(up_node_dict, i, j)
    
    if not paths:
        return None  # No relationship
    
    # Analyze paths to determine relationship
    shortest_up = float('inf')
    shortest_down = float('inf')
    
    for path in paths:
        # Find position of common ancestor in path
        for k, node in enumerate(path):
            if node in common_ancs:
                up_steps = k
                down_steps = len(path) - k - 1
                
                shortest_up = min(shortest_up, up_steps)
                shortest_down = min(shortest_down, down_steps)
                break
    
    # Count distinct common ancestors that provide shortest paths
    num_ancs = sum(1 for anc in common_ancs 
                  if any(anc in path and 
                         path.index(anc) == shortest_up for path in paths))
    
    # Cap at 2 for full relationships
    num_ancs = min(num_ancs, 2)
    
    return (shortest_up, shortest_down, num_ancs)

This relationship computation is a core operation in Bonsai, used throughout the system for tasks ranging from relationship inference to pedigree validation and visualization.

Working with Pedigree Subsets

Many pedigree operations require working with subsets of the full pedigree. Bonsai v3 includes functions for extracting and manipulating pedigree subsets:

The get_subdict() function extracts the "cone" above or below a specific individual:

def get_subdict(dct, node):
    """Get the cone above/below node in a node dict.
    
    If dct is an up_dict, returns the subdict containing the node and all its ancestors.
    If dct is a down_dict, returns the subdict containing the node and all its descendants.
    
    Args:
        dct: Dictionary mapping nodes to their relatives
        node: Root node for the subdict
        
    Returns:
        Dictionary representing the subpedigree
    """
    import copy
    
    if node not in dct:
        return {}
        
    # Start with the node itself
    sub_dct = {}
    sub_dct[node] = copy.deepcopy(dct[node])
    
    # Add subdicts for all relatives of the node
    for n in dct[node]:
        n_dct = get_subdict(dct, n)
        if n_dct:
            sub_dct.update(n_dct)
            
    return sub_dct

This function is useful for extracting ancestral or descendant subpedigrees, which are common operations in pedigree exploration and analysis.

For more complex subset operations, Bonsai v3 provides the get_sub_up_node_dict() function, which extracts the minimal subpedigree connecting a set of individuals:

def get_sub_up_node_dict(up_dct, id_set):
    """Get subtree connecting all IDs in id_set.
    
    Args:
        up_dct: Dictionary mapping individuals to their ancestors
        id_set: Set of individual IDs to connect
        
    Returns:
        Dictionary representing the minimal connecting subpedigree
    """
    # Find all nodes needed in the minimal connecting subpedigree
    required_nodes = set()
    
    # First, include all IDs in the set
    required_nodes.update(id_set)
    
    # Find common ancestors between all pairs
    for i in id_set:
        for j in id_set:
            if i >= j:  # Avoid duplicate pairs
                continue
                
            # Get paths between i and j
            paths, common_ancs = get_all_paths(up_dct, i, j)
            
            # Add all nodes in these paths
            for path in paths:
                required_nodes.update(path)
    
    # Create the subpedigree
    sub_dict = {}
    for node in required_nodes:
        if node in up_dct:
            # Only include parents that are in the required nodes
            sub_dict[node] = {p: d for p, d in up_dct[node].items() 
                             if p in required_nodes}
        else:
            sub_dict[node] = {}
    
    return sub_dict

This function is particularly useful for visualizing the relationships between specific individuals without the clutter of the full pedigree. It's also used in pedigree merging operations to identify connection points between separate pedigree fragments.

Finally, Bonsai v3 includes the get_gt_id_set() function to identify all genotyped individuals in a pedigree:

def get_gt_id_set(ped):
    """Get all genotyped IDs (positive) from the pedigree.
    
    Args:
        ped: Dictionary representing a pedigree
        
    Returns:
        Set of genotyped individual IDs
    """
    all_ids = set(ped.keys()).union(*[set(parents.keys()) for parents in ped.values()])
    return {i for i in all_ids if i > 0}

This distinction between genotyped (observed) and ungenotyped (inferred) individuals is crucial in genetic genealogy, where many individuals in a pedigree may not have genetic data available.

Building and Modifying Pedigrees

Adding and Deleting Nodes

Bonsai v3 includes a comprehensive set of functions for building and modifying pedigrees. These functions enable the iterative construction and refinement of pedigree structures based on genetic and demographic evidence.

The add_parent() function adds an ungenotyped parent to an individual:

def add_parent(node, up_dct, min_id=None):
    """Add an ungenotyped parent to node in up_dct.
    
    Args:
        node: Individual ID
        up_dct: Dictionary mapping individuals to their ancestors
        min_id: Minimum ID to use for new parent (optional)
        
    Returns:
        Tuple of (updated_dict, new_parent_id)
    """
    import copy
    up_dct = copy.deepcopy(up_dct)
    
    if node not in up_dct:
        raise ValueError(f"Node {node} is not in up dct.")
        
    pid_dict = up_dct[node]
    if len(pid_dict) >= 2:
        return up_dct, None  # Already has two parents, can't add more
        
    # Generate a new negative ID for the ungenotyped parent
    if min_id is None:
        min_id = get_min_id(up_dct)
        
    new_pid = min_id - 1
    up_dct[node][new_pid] = 1  # Add parent with degree 1
    up_dct[new_pid] = {}       # Initialize empty parents dict for new parent
    
    return up_dct, new_pid

This function is essential for building out pedigrees, allowing the addition of inferred individuals to complete family structures based on genetic evidence.

The delete_node() function removes an individual from a pedigree:

def delete_node(dct, node):
    """Delete node from a node dict.
    
    Args:
        dct: Dictionary mapping nodes to their relatives
        node: Node to delete
        
    Returns:
        Updated dictionary
    """
    import copy
    new_dct = {}
    
    # Copy all entries except the deleted node
    for k, v in dct.items():
        if k != node:
            # Also remove references to the deleted node in relatives
            new_dct[k] = {r: d for r, d in v.items() if r != node}
    
    return new_dct

This function is used for pedigree pruning, error correction, and testing alternative pedigree hypotheses by removing specific individuals.

The replace_ids() function allows for the remapping of individual IDs throughout a pedigree:

def replace_ids(rep_dct, dct):
    """Replace IDs in dct according to mapping in rep_dct.
    
    Args:
        rep_dct: Dictionary mapping old IDs to new IDs
        dct: Dictionary to update
        
    Returns:
        Updated dictionary with new IDs
    """
    if not isinstance(dct, dict):
        return dct
        
    new_dct = {}
    for k, v in dct.items():
        new_k = rep_dct.get(k, k)  # Replace key if in mapping, otherwise keep
        
        if isinstance(v, dict):
            new_v = {}
            for k2, v2 in v.items():
                new_k2 = rep_dct.get(k2, k2)  # Replace nested key
                new_v[new_k2] = v2
        else:
            new_v = v
            
        new_dct[new_k] = new_v
    
    return new_dct

This function is valuable for aligning IDs between different data sources, merging pedigrees, and standardizing ID schemes in complex pedigree operations.

Higher-Level Pedigree Operations

Beyond basic node operations, Bonsai v3 includes functions for higher-level pedigree analysis and manipulation:

The get_rel_dict() function computes all pairwise relationships in a pedigree:

def get_rel_dict(up_dct):
    """Get dict mapping each ID pair to their relationship tuple.
    
    Args:
        up_dct: Dictionary mapping individuals to their ancestors
        
    Returns:
        Dictionary mapping (id1, id2) to relationship tuples
    """
    # Get all IDs in the pedigree
    all_ids = set(up_dct.keys()).union(*[set(parents.keys()) for parents in up_dct.values()])
    all_ids = sorted(all_ids)  # Sort for deterministic ordering
    
    # Initialize the relationship dictionary
    rel_dict = {}
    for i in all_ids:
        rel_dict[i] = {}
    
    # Compute relationships for all pairs
    for i in range(len(all_ids)):
        id1 = all_ids[i]
        
        # Self relationship
        rel_dict[id1][id1] = (0, 0, 2)
        
        for j in range(i+1, len(all_ids)):
            id2 = all_ids[j]
            
            # Get relationship tuple
            rel_tuple = get_simple_rel_tuple(up_dct, id1, id2)
            
            # Store in both directions
            if rel_tuple:
                rel_dict[id1][id2] = rel_tuple
                
                # Reverse for the opposite direction
                up, down, num_ancs = rel_tuple
                rel_dict[id2][id1] = (down, up, num_ancs)
    
    return rel_dict

This function is used for comprehensive pedigree analysis, validation, and visualization. It efficiently computes all relationships in a single pass, avoiding redundant computation.

The get_mrca_set() function finds the most recent common ancestors of a set of individuals:

def get_mrca_set(up_dct, id_set):
    """Get the set of most recent common ancestors of id_set.
    
    Args:
        up_dct: Dictionary mapping individuals to their ancestors
        id_set: Set of individual IDs
        
    Returns:
        Set of most recent common ancestors
    """
    if len(id_set) == 0:
        return set()
    if len(id_set) == 1:
        return id_set  # Individual is own ancestor
    
    # Find all common ancestors
    anc_sets = [get_rel_set(up_dct, i) for i in id_set]
    common_ancs = set.intersection(*anc_sets)
    
    if not common_ancs:
        return set()  # No common ancestors
    
    # For each common ancestor, check if any of its descendants
    # is also a common ancestor
    mrca_set = set(common_ancs)
    
    for anc in common_ancs:
        if anc not in mrca_set:
            continue  # Already removed
            
        # Get ancestors of this ancestor
        higher_ancs = get_rel_set(up_dct, anc) - {anc}
        
        # Remove higher ancestors from MRCA set
        mrca_set -= higher_ancs
    
    return mrca_set

This function is crucial for relationship analysis, particularly for identifying the connecting points between different branches of a family tree.

Finally, the get_sib_set() function finds all siblings of an individual:

def get_sib_set(up_dct, down_dct, node):
    """Get all siblings of node.
    
    Args:
        up_dct: Dictionary mapping individuals to their ancestors
        down_dct: Dictionary mapping individuals to their descendants
        node: Individual ID
        
    Returns:
        Set of sibling IDs
    """
    if node not in up_dct:
        return set()
        
    # Get parents of the node
    parents = set(up_dct[node].keys())
    
    # Get all children of these parents
    sibling_set = set()
    for parent in parents:
        if parent in down_dct:
            sibling_set.update(down_dct[parent].keys())
                
    # Remove the node itself
    sibling_set.discard(node)
    
    return sibling_set

This function is used for identifying siblings, which is a common operation in pedigree analysis and reconstruction. By combining the up_dct and down_dct representations, it efficiently finds all siblings without traversing the entire pedigree.

Core Component: Bonsai v3's pedigree data structures provide a flexible, efficient, and powerful foundation for representing and manipulating genealogical relationships. The node dictionary approach, combined with a rich set of navigation and manipulation functions, enables both simple operations like finding relatives and complex tasks like pedigree merging and optimization.

Comparing Notebook and Production Code

The Lab09 notebook provides a simplified exploration of pedigree data structures, while the production implementation in Bonsai v3 includes additional sophistication:

Performance Optimizations: The production code includes optimized implementations for large pedigrees
Caching Mechanisms: Strategic caching to avoid redundant computation in frequently accessed operations
Error Handling: Comprehensive error checking and validation for robust operation
Memory Efficiency: Careful attention to memory usage in large-scale pedigree operations
Graph Algorithms: Sophisticated graph-theoretic algorithms for complex pedigree analysis
Serialization: Functions for saving and loading pedigrees in various formats

The notebook provides a valuable introduction to the key concepts, but the production implementation represents years of refinement to handle the complexities of real-world pedigrees, including large family structures, missing data, and complex relationship patterns.

Interactive Lab Environment

Run the interactive Lab 09 notebook in Google Colab:

Google Colab Environment

Run the notebook in Google Colab for a powerful computing environment with access to Google's resources.

Data will be automatically downloaded from S3 when you run the notebook.

Note: You may need a Google account to save your work in Google Drive.

Open Lab 09 Notebook in Google Colab

Beyond the Code

As you explore pedigree data structures, consider these broader implications:

Data Structure Design: How the choice of representation impacts the efficiency and expressiveness of operations
Graph Theory: How pedigrees represent a specialized form of directed graph with biological constraints
Computational Genetics: The interplay between genetic inheritance patterns and computational representations
Scalability Challenges: The technical challenges of representing complex family structures across many generations

These considerations highlight how pedigree data structures represent not just a technical implementation but a computational model of biological inheritance and family relationships, bridging the gap between genetics, genealogy, and computer science.

This lab is part of the Bonsai v3 Deep Dive track:

Introduction

Lab 01

Architecture

Lab 02

IBD Formats

Lab 03

Statistics

Lab 04

Models

Lab 05

Relationships

Lab 06

PwLogLike

Lab 07

Age Modeling

Lab 08

Data Structures

Lab 09

Up-Node Dict

Lab 10