Minimum Genetic Mutation

Difficulty: Medium
Topics: Hash Table, String, Breadth-First Search
Link: https://leetcode.com/problems/minimum-genetic-mutation/

Problem Description

A gene string can be represented by an 8-character long string, with choices from 'A', 'C', 'G', and 'T'.

Suppose we need to investigate a mutation from a gene string startGene to a gene string endGene where one mutation is defined as one single character changed in the gene string.

For example, "AACCGGTT" --> "AACCGGTA" is one mutation.

There is also a gene bank bank that records all the valid gene mutations. A gene must be in bank to make it a valid gene string.

Given the two gene strings startGene and endGene and the gene bank bank, return the minimum number of mutations needed to mutate from startGene to endGene. If there is no such a mutation, return -1.

Note that the starting point is assumed to be valid, so it might not be included in the bank.

Example 1:


Input: startGene = "AACCGGTT", endGene = "AACCGGTA", bank = ["AACCGGTA"]
Output: 1

Example 2:


Input: startGene = "AACCGGTT", endGene = "AAACGGTA", bank = ["AACCGGTA","AACCGCTA","AAACGGTA"]
Output: 2

Constraints:

0 <= bank.length <= 10
startGene.length == endGene.length == bank[i].length == 8
startGene, endGene, and bank[i] consist of only the characters ['A', 'C', 'G', 'T'].

Solution

1. Problem Deconstruction

Technical Reformulation
We are given:

Start node: 8-character string over alphabet Σ = {A, C, G, T}
Target node: 8-character string over same alphabet
Valid state space: Set of strings (bank) plus start node
Transition rule: Two nodes are connected if they differ by exactly one character position
Objective: Find shortest path length from start to target using only valid states
Edge case: Return -1 if no valid path exists

Beginner-Friendly Version
Imagine you have an 8-letter word where each letter can only be A, C, G, or T. You want to change the starting word into the target word, but you can only change one letter at a time. There’s a list of approved words (bank) that you’re allowed to use along the way. Find the smallest number of changes needed, or say it’s impossible if you can’t reach the target.

Mathematical Formulation
Let:

$S$ = startGene, $T$ = endGene
$B$ = bank (set of valid states)
$d(s_1, s_2)$ = Hamming distance between strings $s_1$ and $s_2$
Define graph $G = (V,E)$ $G = (V, E)$ where:
- $V = \{S\} \cup B$
- $E = \{(u,v) \in V \times V \mid d(u,v) = 1\}$
Find: $\min\limits_{P \in \text{paths}(S,T)} |P| - 1$ where $|P|$ is path length in edges

Constraint Analysis

0 <= bank.length <= 10: Extremely small search space allows exponential approaches. With 8 positions and 4 choices, total possible strings = $4^8 = 65536$ , but bank limits valid states to ~11 nodes.
startGene.length == endGene.length == bank[i].length == 8: Fixed length simplifies distance calculations - no variable-length edge cases.
Character set restricted to {A,C,G,T}: Finite branching factor of 24 possible mutations from any state (8 positions × 3 alternative characters).
Hidden edge cases:
- Start equals end → immediate return 0
- End not in bank → impossible mutation
- Disconnected components → return -1
- Bank contains duplicates → use set semantics

2. Intuition Scaffolding

Real-World Metaphor
Imagine trying to transform the word “READABLE” into “TEACHERS” using only valid English words from a limited dictionary. Each step changes one letter, maintaining word validity. This is exactly the “word ladder” problem with a genetic twist.

Gaming Analogy
Think of a DNA puzzle game where you mutate a creature’s genes through approved evolutionary steps. Each mutation changes one gene (character), and you must reach the target species in the fewest mutations. Invalid mutations (not in bank) are evolutionary dead ends.

Math Analogy
We navigate an 8-dimensional hypercube where each dimension has 4 discrete values. Valid states form a subspace, and we seek the shortest Manhattan path between start and target vertices within this subspace.

Common Pitfalls

Global Min/Max Fallacy: Trying to greedily minimize Hamming distance at each step (counterexample: sometimes we must temporarily increase distance)
Bank Membership Oversight: Forgetting that only the start node can be outside the bank
Cycle Detection: Missing visited state tracking causing infinite loops
Early Termination: Stopping when finding any path instead of shortest path
Implicit Graph Construction: Attempting to precompute all edges ( $O(n^2)$ ) instead of generating neighbors on-demand

3. Approach Encyclopedia

Brute Force (DFS with Depth Limit)

Pseudocode:
function dfs(current, target, bank, visited, depth):
    if current == target: return depth
    if depth > min_found: return ∞
    
    min_steps = ∞
    for each candidate in generate_neighbors(current):
        if candidate in bank and candidate not in visited:
            visited.add(candidate)
            min_steps = min(min_steps, dfs(candidate, target, bank, visited, depth+1))
            visited.remove(candidate)
    return min_steps

Complexity: O(24^d) where d is minimum steps

Visualization

Start: AACCGGTT
       │
    AACCGGTA (bank)
       │
    AAACGGTA (bank)
       │
Target: AAACGGTA

Depth-limited search explores all permutations up to maximum possible depth

BFS (Optimal)

Pseudocode:
function minMutation(start, end, bank):
    bank_set = set(bank)
    if end not in bank_set: return -1
    
    queue = deque([(start, 0)])
    visited = {start}
    
    while queue:
        current, steps = queue.popleft()
        if current == end: return steps
        
        for i in range(8):
            for char in "ACGT":
                if char != current[i]:
                    neighbor = current[:i] + char + current[i+1:]
                    if neighbor in bank_set and neighbor not in visited:
                        visited.add(neighbor)
                        queue.append((neighbor, steps+1))
    return -1

Complexity Proof:
- Each node generates 24 neighbors
- Maximum |V| = 11 (bank + start)
- Time: O(V × 24) = O(264) operations
- Space: O(V) for queue and visited set

Visualization

Level 0: [AACCGGTT]
          ↓
Level 1: [AACCGGTA, AACCGCTT, ...] (valid bank members)
          ↓
Level 2: [AAACGGTA, ...] 
Target found at level 2 → return 2

Bidirectional BFS

Pseudocode:
function bidir_bfs(start, end, bank):
    bank_set = set(bank) | {start}
    if end not in bank_set: return -1
    
    front, back = {start}, {end}
    visited = {start, end}
    steps = 0
    
    while front and back:
        if front & back: return steps
        
        # Always expand smaller frontier
        if len(front) > len(back): front, back = back, front
            
        next_front = set()
        for gene in front:
            for i in range(8):
                for char in "ACGT":
                    neighbor = gene[:i] + char + gene[i+1:]
                    if neighbor in bank_set and neighbor not in visited:
                        next_front.add(neighbor)
                        visited.add(neighbor)
        
        front = next_front
        steps += 1
    
    return -1

Complexity: O(24^(d/2)) - square root reduction of search space

4. Code Deep Dive

from collections import deque

def minMutation(startGene: str, endGene: str, bank: List[str]) -> int:
    # Edge case: target unreachable
    bank_set = set(bank)
    if endGene not in bank_set:
        return -1
    
    # BFS initialization
    queue = deque([(startGene, 0)])  # (current_gene, mutations_count)
    visited = {startGene}
    
    # Character set for mutations
    choices = ['A', 'C', 'G', 'T']
    
    while queue:
        current, steps = queue.popleft()
        
        # Termination check
        if current == endGene:
            return steps
        
        # Generate all possible single mutations
        for i in range(8):  # For each gene position
            for char in choices:  # For each possible nucleotide
                if char != current[i]:  # Skip no-change scenario
                    # Construct neighbor: O(8) string operations
                    neighbor = current[:i] + char + current[i+1:]
                    
                    # Validity checks
                    if neighbor in bank_set and neighbor not in visited:
                        visited.add(neighbor)
                        queue.append((neighbor, steps + 1))
    
    return -1

Line-by-Line Analysis

Line 4-6: Critical early termination when endGene not in bank
Line 9: Queue stores (gene, steps) tuples to track depth
Line 15: Early return ensures shortest path (BFS property)
Line 19-20: Systematic neighbor generation: 8 positions × 3 changes = 24 neighbors
Line 23: String slicing creates new mutation - O(8) but fixed cost
Line 26: Bank membership + visited check in O(1) using sets

Edge Case Handling

Example 1: start = "AACCGGTT", end = "AACCGGTA"
- Line 15 returns 1 immediately when direct neighbor found
Example 2: Multi-step path requires level-by-level exploration
Empty Bank: Line 6 catches case where endGene not reachable
Start == End: Would return 0 at line 15 before any processing

5. Complexity War Room

Hardware-Aware Analysis

With |V| ≤ 11 and 24 edges per node → ~264 operations
Fits entirely in L1 cache (11 nodes × ~50 bytes ≈ 550 bytes)
Execution time: ~1,000 CPU cycles on modern hardware

Industry Comparison Table

Approach	Time Complexity	Space Complexity	Readability	Interview Viability
Brute DFS	$O(24^d)$	$O(d)$	8/10	❌ Exponential risk
BFS	$O(V + E)$ = $O(264)$	$O(V)$ = $O(11)$	9/10	✅ Optimal Choice
Bidir BFS	$O(24^{d/2})$	$O(V)$	7/10	✅ Good alternative
A* Search	$O(V \log V)$	$O(V)$	6/10	⚠️ Overkill for small N

6. Pro Mode Extras

Variants & Extensions

Multiple Targets: Find shortest path to any gene in target set

target_set = set(targets)
if current in target_set: return steps

Weighted Mutations: Different costs for transitions (A→C vs A→T)
- Use Dijkstra instead of BFS
K Mutations: Find if reachable within K steps
- Add depth limit to BFS
All Shortest Paths: Modify to collect all minimum mutation sequences

Interview Cheat Sheet

First Mention: “This is essentially a BFS shortest path problem in implicit graph”
Constraint Insight: “Small bank size allows even exponential approaches, but BFS is optimal”
Edge Cases: Always check: (1) start == end, (2) end not in bank, (3) empty bank
Optimization: “We generate neighbors on-demand rather than precomputing graph”
Extension Ready: “For larger banks, we could use bidirectional BFS or A* with Hamming distance heuristic”

#433 - Minimum Genetic Mutation