#433 - Minimum Genetic Mutation
Minimum Genetic Mutation
- Difficulty: Medium
- Topics: Hash Table, String, Breadth-First Search
- Link: https://leetcode.com/problems/minimum-genetic-mutation/
Problem Description
A gene string can be represented by an 8-character long string, with choices from 'A', 'C', 'G', and 'T'.
Suppose we need to investigate a mutation from a gene string startGene to a gene string endGene where one mutation is defined as one single character changed in the gene string.
- For example,
"AACCGGTT" --> "AACCGGTA"is one mutation.
There is also a gene bank bank that records all the valid gene mutations. A gene must be in bank to make it a valid gene string.
Given the two gene strings startGene and endGene and the gene bank bank, return the minimum number of mutations needed to mutate from startGene to endGene. If there is no such a mutation, return -1.
Note that the starting point is assumed to be valid, so it might not be included in the bank.
Example 1:
Input: startGene = "AACCGGTT", endGene = "AACCGGTA", bank = ["AACCGGTA"]
Output: 1
Example 2:
Input: startGene = "AACCGGTT", endGene = "AAACGGTA", bank = ["AACCGGTA","AACCGCTA","AAACGGTA"]
Output: 2
Constraints:
0 <= bank.length <= 10startGene.length == endGene.length == bank[i].length == 8startGene,endGene, andbank[i]consist of only the characters['A', 'C', 'G', 'T'].
Solution
1. Problem Deconstruction
Technical Reformulation
We are given:
- Start node: 8-character string over alphabet Σ = {A, C, G, T}
- Target node: 8-character string over same alphabet
- Valid state space: Set of strings (
bank) plus start node - Transition rule: Two nodes are connected if they differ by exactly one character position
- Objective: Find shortest path length from start to target using only valid states
- Edge case: Return -1 if no valid path exists
Beginner-Friendly Version
Imagine you have an 8-letter word where each letter can only be A, C, G, or T. You want to change the starting word into the target word, but you can only change one letter at a time. There’s a list of approved words (bank) that you’re allowed to use along the way. Find the smallest number of changes needed, or say it’s impossible if you can’t reach the target.
Mathematical Formulation
Let:
- = startGene, = endGene
- = bank (set of valid states)
- = Hamming distance between strings and
- Define graph where:
- Find: where is path length in edges
Constraint Analysis
0 <= bank.length <= 10: Extremely small search space allows exponential approaches. With 8 positions and 4 choices, total possible strings = , but bank limits valid states to ~11 nodes.startGene.length == endGene.length == bank[i].length == 8: Fixed length simplifies distance calculations - no variable-length edge cases.- Character set restricted to {A,C,G,T}: Finite branching factor of 24 possible mutations from any state (8 positions × 3 alternative characters).
- Hidden edge cases:
- Start equals end → immediate return 0
- End not in bank → impossible mutation
- Disconnected components → return -1
- Bank contains duplicates → use set semantics
2. Intuition Scaffolding
Real-World Metaphor
Imagine trying to transform the word “READABLE” into “TEACHERS” using only valid English words from a limited dictionary. Each step changes one letter, maintaining word validity. This is exactly the “word ladder” problem with a genetic twist.
Gaming Analogy
Think of a DNA puzzle game where you mutate a creature’s genes through approved evolutionary steps. Each mutation changes one gene (character), and you must reach the target species in the fewest mutations. Invalid mutations (not in bank) are evolutionary dead ends.
Math Analogy
We navigate an 8-dimensional hypercube where each dimension has 4 discrete values. Valid states form a subspace, and we seek the shortest Manhattan path between start and target vertices within this subspace.
Common Pitfalls
- Global Min/Max Fallacy: Trying to greedily minimize Hamming distance at each step (counterexample: sometimes we must temporarily increase distance)
- Bank Membership Oversight: Forgetting that only the start node can be outside the bank
- Cycle Detection: Missing visited state tracking causing infinite loops
- Early Termination: Stopping when finding any path instead of shortest path
- Implicit Graph Construction: Attempting to precompute all edges () instead of generating neighbors on-demand
3. Approach Encyclopedia
Brute Force (DFS with Depth Limit)
Pseudocode:
function dfs(current, target, bank, visited, depth):
if current == target: return depth
if depth > min_found: return ∞
min_steps = ∞
for each candidate in generate_neighbors(current):
if candidate in bank and candidate not in visited:
visited.add(candidate)
min_steps = min(min_steps, dfs(candidate, target, bank, visited, depth+1))
visited.remove(candidate)
return min_steps
Complexity: O(24^d) where d is minimum steps
Visualization
Start: AACCGGTT
│
AACCGGTA (bank)
│
AAACGGTA (bank)
│
Target: AAACGGTA
Depth-limited search explores all permutations up to maximum possible depth
BFS (Optimal)
Pseudocode:
function minMutation(start, end, bank):
bank_set = set(bank)
if end not in bank_set: return -1
queue = deque([(start, 0)])
visited = {start}
while queue:
current, steps = queue.popleft()
if current == end: return steps
for i in range(8):
for char in "ACGT":
if char != current[i]:
neighbor = current[:i] + char + current[i+1:]
if neighbor in bank_set and neighbor not in visited:
visited.add(neighbor)
queue.append((neighbor, steps+1))
return -1
Complexity Proof:
- Each node generates 24 neighbors
- Maximum |V| = 11 (bank + start)
- Time: O(V × 24) = O(264) operations
- Space: O(V) for queue and visited set
Visualization
Level 0: [AACCGGTT]
↓
Level 1: [AACCGGTA, AACCGCTT, ...] (valid bank members)
↓
Level 2: [AAACGGTA, ...]
Target found at level 2 → return 2
Bidirectional BFS
Pseudocode:
function bidir_bfs(start, end, bank):
bank_set = set(bank) | {start}
if end not in bank_set: return -1
front, back = {start}, {end}
visited = {start, end}
steps = 0
while front and back:
if front & back: return steps
# Always expand smaller frontier
if len(front) > len(back): front, back = back, front
next_front = set()
for gene in front:
for i in range(8):
for char in "ACGT":
neighbor = gene[:i] + char + gene[i+1:]
if neighbor in bank_set and neighbor not in visited:
next_front.add(neighbor)
visited.add(neighbor)
front = next_front
steps += 1
return -1
Complexity: O(24^(d/2)) - square root reduction of search space
4. Code Deep Dive
from collections import deque
def minMutation(startGene: str, endGene: str, bank: List[str]) -> int:
# Edge case: target unreachable
bank_set = set(bank)
if endGene not in bank_set:
return -1
# BFS initialization
queue = deque([(startGene, 0)]) # (current_gene, mutations_count)
visited = {startGene}
# Character set for mutations
choices = ['A', 'C', 'G', 'T']
while queue:
current, steps = queue.popleft()
# Termination check
if current == endGene:
return steps
# Generate all possible single mutations
for i in range(8): # For each gene position
for char in choices: # For each possible nucleotide
if char != current[i]: # Skip no-change scenario
# Construct neighbor: O(8) string operations
neighbor = current[:i] + char + current[i+1:]
# Validity checks
if neighbor in bank_set and neighbor not in visited:
visited.add(neighbor)
queue.append((neighbor, steps + 1))
return -1
Line-by-Line Analysis
- Line 4-6: Critical early termination when endGene not in bank
- Line 9: Queue stores (gene, steps) tuples to track depth
- Line 15: Early return ensures shortest path (BFS property)
- Line 19-20: Systematic neighbor generation: 8 positions × 3 changes = 24 neighbors
- Line 23: String slicing creates new mutation - O(8) but fixed cost
- Line 26: Bank membership + visited check in O(1) using sets
Edge Case Handling
- Example 1:
start = "AACCGGTT", end = "AACCGGTA"- Line 15 returns 1 immediately when direct neighbor found
- Example 2: Multi-step path requires level-by-level exploration
- Empty Bank: Line 6 catches case where endGene not reachable
- Start == End: Would return 0 at line 15 before any processing
5. Complexity War Room
Hardware-Aware Analysis
- With |V| ≤ 11 and 24 edges per node → ~264 operations
- Fits entirely in L1 cache (11 nodes × ~50 bytes ≈ 550 bytes)
- Execution time: ~1,000 CPU cycles on modern hardware
Industry Comparison Table
| Approach | Time Complexity | Space Complexity | Readability | Interview Viability |
|---|---|---|---|---|
| Brute DFS | 8/10 | ❌ Exponential risk | ||
| BFS | = | = | 9/10 | ✅ Optimal Choice |
| Bidir BFS | 7/10 | ✅ Good alternative | ||
| A* Search | 6/10 | ⚠️ Overkill for small N |
6. Pro Mode Extras
Variants & Extensions
- Multiple Targets: Find shortest path to any gene in target set
target_set = set(targets) if current in target_set: return steps - Weighted Mutations: Different costs for transitions (A→C vs A→T)
- Use Dijkstra instead of BFS
- K Mutations: Find if reachable within K steps
- Add depth limit to BFS
- All Shortest Paths: Modify to collect all minimum mutation sequences
Interview Cheat Sheet
- First Mention: “This is essentially a BFS shortest path problem in implicit graph”
- Constraint Insight: “Small bank size allows even exponential approaches, but BFS is optimal”
- Edge Cases: Always check: (1) start == end, (2) end not in bank, (3) empty bank
- Optimization: “We generate neighbors on-demand rather than precomputing graph”
- Extension Ready: “For larger banks, we could use bidirectional BFS or A* with Hamming distance heuristic”