Lab 11: Finding Connection Points Between Individuals
Core Component: This lab explores Bonsai v3's algorithms for identifying potential connection points between individuals or pedigrees. Finding optimal ways to connect individuals based on genetic evidence is a central challenge in computational genetic genealogy, and understanding these algorithms is essential for comprehending how Bonsai reconstructs complex family trees.
The Connection Point Problem
Defining Connection Points
In genetic genealogy, we often encounter situations where we have evidence that two individuals or pedigrees are related, but we need to determine precisely how they connect. This is the "connection point problem," and it's central to Bonsai v3's pedigree construction algorithm.
A connection point in Bonsai v3 is represented as a tuple of the form (id1, id2, dir)
, where:
- id1: The primary individual through which the connection is made
- id2: An optional secondary individual (usually a partner of id1)
- dir: The direction of the connection
0
: Connect downward from the node (add as child)1
: Connect upward from the node (add as parent)None
: Replace the node or connect laterally
This tuple representation encapsulates three key aspects of a potential connection:
- Who is being connected (the individuals involved)
- How they're being connected (the direction of the connection)
- What type of relationship is being created (implicitly through the direction)
The connection point approach is powerful because it allows Bonsai to systematically evaluate all possible ways that two pedigree fragments might be connected, and to prioritize the most likely connections based on genetic evidence.
Types of Connection Points
There are three primary types of connection points in Bonsai v3, each representing a different way to connect individuals or pedigrees:
Upward Connections (dir=1
)
An upward connection adds a parent to an individual. This is appropriate when genetic evidence suggests that an individual from one pedigree is descended from someone in the other pedigree.
Example: Adding a previously unknown parent to an individual who was thought to be a founder.
// Upward connection (id, None, 1)
// Before: After:
// 1 1
// / \ / \
// 2 3 2 3
// \
// 4
Downward Connections (dir=0
)
A downward connection adds a child to an individual or couple. This is appropriate when the evidence suggests that someone in one pedigree is an ancestor of someone in the other pedigree.
Example: Identifying that a couple had an additional child who's the ancestor of individuals in another pedigree.
// Downward connection (id1, id2, 0)
// Before: After:
// 1 2 1 2
// \ / \ /
// 3 3
// |
// 4
Lateral/Replace Connections (dir=None
)
A lateral connection either replaces an individual or connects individuals at the same generation level. This is appropriate when the evidence suggests that individuals previously thought to be unrelated are actually the same person or are related laterally (e.g., as siblings).
Example: Discovering that an individual appears in two different pedigrees under different identities.
// Replace connection (id, None, None)
// Before: After:
// 1 1
// / \ / \
// 2 3 X 3
// |
// 4
In the actual implementation, Bonsai v3 uses sophisticated algorithms to identify all possible connection points of these types in a pedigree. The core function for this purpose is get_possible_connection_point_set()
:
def get_possible_connection_point_set(ped):
"""Find all possible points through which a pedigree can be connected to another pedigree.
Args:
ped: Up-node dictionary representing a pedigree
Returns:
Set of connection point tuples (id1, id2, dir)
"""
point_set = set()
# Get all IDs in the pedigree
all_ids = set()
for node, parents in ped.items():
all_ids.add(node)
all_ids.update(parents.keys())
# Evaluate each ID as a potential connection point
for a in all_ids:
# Check if this node can be connected upward (has fewer than 2 parents)
parent_to_deg = ped.get(a, {})
if len(parent_to_deg) < 2:
point_set.add((a, None, 1)) # Can add a parent
# Get partners of this individual
partners = get_partner_id_set(a, ped)
# Add downward connection possibilities
point_set.add((a, None, 0)) # Can add a child to this individual alone
for partner in partners:
if (partner, a, 0) not in point_set: # Avoid duplicates
point_set.add((a, partner, 0)) # Can add a child to this couple
# Add lateral/replace connection possibilities
point_set.add((a, None, None)) # Can replace this individual
for partner in partners:
point_set.add((a, partner, None)) # Can connect laterally to this couple
return point_set
This function systematically evaluates every individual in the pedigree, considering:
- Whether they can have additional parents added (upward connection)
- Whether they can have children added, either alone or with a partner (downward connection)
- Whether they can be replaced or connected laterally (replace/lateral connection)
The result is a comprehensive set of all possible ways the pedigree could be connected to another pedigree, which forms the foundation for the next step: prioritizing the most likely connections based on genetic evidence.
Identifying and Prioritizing Connection Points
Finding Partners for Connection Points
Many connection points involve couples rather than single individuals, so Bonsai v3 needs to identify the partners of each individual in the pedigree. This is accomplished through the get_partner_id_set()
function:
def get_partner_id_set(node, up_dct):
"""Find the set of partners of node in pedigree up_dct.
Args:
node: Individual ID
up_dct: Up-node dictionary representing the pedigree
Returns:
Set of partner IDs
"""
# Convert to down-node dictionary to find children
down_dct = reverse_node_dict(up_dct)
# Get children of the node (direct descendants only)
child_id_set = {c for c, d in down_dct.get(node, {}).items() if d == 1}
# For each child, find the other parent(s)
partner_id_set = set()
for cid in child_id_set:
# Get all direct parents of this child
parent_ids = {p for p, d in up_dct.get(cid, {}).items() if d == 1}
# Add them to the partner set
partner_id_set.update(parent_ids)
# Remove the node itself from the partner set
partner_id_set.discard(node)
return partner_id_set
This function works by first finding all children of the individual, then for each child, identifying all parents. Any parent other than the individual themselves is considered a partner. This approach handles complex scenarios like:
- Multiple marriages/partnerships
- Blended families
- Cases where an individual has children with multiple partners
By accurately identifying partners, Bonsai can generate connection points that respect the full complexity of human family structures.
Restricting Connection Points Based on IBD
When we have genetic evidence like IBD (Identity by Descent) sharing, we can often restrict the set of possible connection points to those that could explain the observed patterns. Bonsai v3 implements this through the restrict_connection_point_set()
function:
def restrict_connection_point_set(up_dct, con_pt_set, id_set):
"""Restrict connection points to those that could connect to individuals in id_set.
Args:
up_dct: Up-node dictionary representing the pedigree
con_pt_set: Set of all possible connection points
id_set: Set of individual IDs that share IBD with another pedigree
Returns:
Restricted set of connection points
"""
# If id_set is empty, there's nothing to restrict
if not id_set:
return con_pt_set
# Get the subtree connecting all IDs in id_set
sub_dct = get_sub_up_node_dict(up_dct, id_set)
# Get all IDs in the subtree
sub_id_set = set(sub_dct.keys())
for parents in sub_dct.values():
sub_id_set.update(parents.keys())
# Restrict connection points to those involving IDs in the subtree
restricted_con_pt_set = set()
for a, b, d in con_pt_set:
if a in sub_id_set or (b is not None and b in sub_id_set):
restricted_con_pt_set.add((a, b, d))
return restricted_con_pt_set
This function works by:
- Finding the minimal subtree containing all individuals that share IBD
- Restricting connection points to those involving individuals in this subtree
The key insight is that if two pedigrees share IBD, the connection between them must involve the ancestors of the individuals showing the IBD sharing. By focusing on this subtree, Bonsai dramatically reduces the search space while ensuring no valid connection points are overlooked.
Ranking Connection Points by Likelihood
After identifying possible connection points and restricting them based on IBD sharing, Bonsai v3 ranks the remaining options based on how well they would explain the observed genetic data. This is implemented in the get_likely_con_pt_set()
function:
def get_likely_con_pt_set(up_dct, id_to_shared_ibd, rel_dict, con_pt_set, max_con_pts=5):
"""Find the most likely connection points based on IBD sharing.
Args:
up_dct: Up-node dictionary representing the pedigree
id_to_shared_ibd: Dict mapping IDs to their amount of IBD sharing
rel_dict: Dict mapping ID pairs to their relationship tuples
con_pt_set: Set of possible connection points
max_con_pts: Maximum number of connection points to return
Returns:
Set of the most likely connection points
"""
# For each connection point, compute a score based on IBD sharing patterns
scored_con_pts = []
for a, b, d in con_pt_set:
# Calculate how well this connection would explain the observed IBD sharing
score = _score_connection_point(a, b, d, up_dct, id_to_shared_ibd, rel_dict)
scored_con_pts.append((score, (a, b, d)))
# Sort connection points by score (highest first)
scored_con_pts.sort(reverse=True)
# Return the top connection points
return {cp for _, cp in scored_con_pts[:max_con_pts]}
The scoring function _score_connection_point()
evaluates each connection point based on how consistent it is with the observed IBD sharing. The algorithm considers factors like:
- Genetic Distance: How many meioses separate the individuals sharing IBD when connected through this point
- IBD Amounts: Whether the observed amounts of IBD are consistent with the proposed relationship
- Distribution Across Relatives: Whether the pattern of IBD sharing across multiple relatives makes sense
- Age Constraints: Whether the connection respects biological constraints on parent-child age differences
The result is a ranked list of the most promising connection points, which can then be evaluated in more detail or implemented in the pedigree.
Implementing and Evaluating Connections
Modifying Pedigrees with Connection Points
Once a promising connection point has been identified, Bonsai v3 needs to implement it by modifying the pedigree structure. This involves adding new individuals, creating new relationships, and potentially merging pedigree fragments.
The implementation depends on the connection type:
- Upward Connections (
dir=1
):def implement_upward_connection(pedigree, id1, id2, other_pedigree): """Implement an upward connection from individual id1.""" # Get founders from other_pedigree to connect as ancestors founders = [node for node, parents in other_pedigree.items() if not parents] # Create modified pedigree modified_pedigree = copy.deepcopy(pedigree) # If there's only one founder, connect directly if len(founders) == 1: modified_pedigree[id1][founders[0]] = 1 else: # Create a connector individual connector_id = get_min_id(modified_pedigree) - 1 # Add connector as parent of id1 modified_pedigree[id1][connector_id] = 1 modified_pedigree[connector_id] = {} # Make founders parents of connector for founder in founders: modified_pedigree[connector_id][founder] = 1 # Merge in the rest of other_pedigree for node, parents in other_pedigree.items(): if node not in modified_pedigree: modified_pedigree[node] = parents return modified_pedigree
- Downward Connections (
dir=0
):def implement_downward_connection(pedigree, id1, id2, other_pedigree): """Implement a downward connection from individual id1 (and id2 if provided).""" # Create modified pedigree modified_pedigree = copy.deepcopy(pedigree) # Create a connector individual connector_id = get_min_id(modified_pedigree) - 1 # Add connector as child of id1 (and id2 if provided) modified_pedigree[connector_id] = {id1: 1} if id2 is not None: modified_pedigree[connector_id][id2] = 1 # Make connector parent of founders in other_pedigree founders = [node for node, parents in other_pedigree.items() if not parents] for founder in founders: if founder not in modified_pedigree: modified_pedigree[founder] = {} modified_pedigree[founder][connector_id] = 1 # Merge in the rest of other_pedigree for node, parents in other_pedigree.items(): if node not in modified_pedigree: modified_pedigree[node] = parents elif node not in founders: # For non-founders, merge parents modified_pedigree[node].update(parents) return modified_pedigree
- Lateral/Replace Connections (
dir=None
):def implement_lateral_connection(pedigree, id1, id2, other_pedigree): """Implement a lateral connection by replacing id1 with a founder from other_pedigree.""" # Create modified pedigree modified_pedigree = copy.deepcopy(pedigree) # Get a founder from other_pedigree to use as replacement founders = [node for node, parents in other_pedigree.items() if not parents] if not founders: return modified_pedigree # Can't implement without a founder replacement_id = founders[0] # Replace all occurrences of id1 with replacement_id for node, parents in modified_pedigree.items(): if id1 in parents: degree = parents.pop(id1) parents[replacement_id] = degree # Transfer the parents of id1 to replacement_id if id1 in modified_pedigree: if replacement_id not in modified_pedigree: modified_pedigree[replacement_id] = {} modified_pedigree[replacement_id].update(modified_pedigree[id1]) del modified_pedigree[id1] # Remove id1 # Merge in the rest of other_pedigree for node, parents in other_pedigree.items(): if node != replacement_id: # Skip the replacement node (already handled) if node not in modified_pedigree: modified_pedigree[node] = parents else: # Merge parents modified_pedigree[node].update(parents) return modified_pedigree
These implementations handle the complexities of merging pedigrees while maintaining the integrity of the pedigree structure. They ensure that all relationships in both pedigrees are preserved, and that the new connections are created in a way that's consistent with the genetic evidence.
Evaluating Connection Quality
After implementing a connection, Bonsai v3 needs to evaluate how well it explains the observed genetic data. This evaluation is based on a comprehensive model of genetic inheritance, taking into account factors like:
- IBD Sharing Consistency: Whether the observed IBD sharing matches what would be expected given the relationship
- Genetic Recombination: Whether the observed patterns of recombination are consistent with the proposed relationship
- Age Consistency: Whether the proposed connections respect biological constraints on parent-child age differences
- Demographic Plausibility: Whether the proposed connections are demographically plausible given historical context
In the production implementation, this evaluation uses sophisticated statistical models based on Mendelian inheritance and population genetics. A simplified version of the evaluation function looks like this:
def evaluate_connection(combined_pedigree, id_to_shared_ibd):
"""Evaluate a connected pedigree based on how well it explains IBD sharing.
Args:
combined_pedigree: Up-node dictionary of the combined pedigree
id_to_shared_ibd: Dict mapping individual IDs to the amount of IBD they share
Returns:
score: A score indicating how well the pedigree explains the IBD sharing
"""
# Calculate relationships in the combined pedigree
rel_dict = get_rel_dict(combined_pedigree)
# Compute correlation between relationship degrees and IBD sharing
individuals = sorted(id_to_shared_ibd.keys())
# For each individual, calculate expected IBD sharing
expected_ibd = {}
for i in individuals:
expected_ibd[i] = 0.0
for j, rel_tuple in rel_dict.get(i, {}).items():
if j not in id_to_shared_ibd or rel_tuple is None:
continue
# Calculate expected sharing based on relationship
up, down, num_ancs = rel_tuple
degree = up + down
expected_coef = num_ancs * (0.5 ** degree) if degree > 0 else 1.0
# Add to expected IBD
expected_ibd[i] += expected_coef * id_to_shared_ibd[j]
# Calculate correlation between expected and observed IBD
observed = [id_to_shared_ibd[i] for i in individuals]
expected = [expected_ibd[i] for i in individuals]
# Return correlation coefficient
return np.corrcoef(observed, expected)[0, 1]
This function evaluates a connection by comparing the observed IBD sharing with what would be expected given the relationships in the combined pedigree. Higher correlation scores indicate better consistency between the genetic evidence and the proposed connections.
In practice, Bonsai will evaluate multiple potential connections and select the one with the highest score, or in some cases, present multiple high-scoring alternatives for human review.
Core Component: Finding connection points is a foundational operation in Bonsai v3's pedigree construction algorithm. By systematically identifying, evaluating, and implementing connection points, Bonsai can construct complex pedigrees that accurately explain patterns of IBD sharing across multiple individuals.
Comparing Notebook and Production Code
The Lab11 notebook provides a simplified exploration of connection point algorithms, while the production implementation in Bonsai v3 includes additional sophistication:
- Performance Optimizations: The production code includes sophisticated caching and pruning strategies to handle large pedigrees efficiently
- Probabilistic Models: More advanced statistical models for scoring connection points that account for IBD length distributions and population-specific parameters
- Cycle Detection: Robust handling of potential cycles in pedigrees that could arise from certain connection patterns
- Multi-way Connections: Support for connecting more than two pedigrees simultaneously in complex scenarios
- Identity Resolution: Sophisticated algorithms for determining when individuals in different pedigrees might actually be the same person
- User Interaction: Support for incorporating user feedback and constraints on connection points
The notebook provides a valuable introduction to the key concepts, but the production implementation represents years of refinement to handle the complexities of real-world genetic data and pedigree structures.
Interactive Lab Environment
Run the interactive Lab 11 notebook in Google Colab:
Google Colab Environment
Run the notebook in Google Colab for a powerful computing environment with access to Google's resources.
Data will be automatically downloaded from S3 when you run the notebook.
Note: You may need a Google account to save your work in Google Drive.
Beyond the Code
As you explore connection point algorithms, consider these broader implications:
- Genealogical Research: How computational methods are transforming traditional genealogical research by identifying connections that might be missed through documentary evidence alone
- Ethical Considerations: The implications of discovering unexpected connections between families, and how to handle sensitive information
- Historical Applications: How these algorithms can help reconstruct historical population structures from fragmentary genetic and documentary evidence
- Privacy and Consent: The importance of informed consent when inferring connections between individuals, especially for living people
These considerations highlight how connection point algorithms are not just technical solutions but tools with significant implications for understanding human history, identity, and relatedness.
This lab is part of the Bonsai v3 Deep Dive track:
Introduction
Lab 01
Architecture
Lab 02
IBD Formats
Lab 03
Statistics
Lab 04
Models
Lab 05
Relationships
Lab 06
PwLogLike
Lab 07
Age Modeling
Lab 08
Data Structures
Lab 09
Up-Node Dict
Lab 10
Connection Points
Lab 11