#28 - Find the Index of the First Occurrence in a String
- Difficulty: Easy
- Topics: Two Pointers, String, String Matching
- Link: https://leetcode.com/problems/find-the-index-of-the-first-occurrence-in-a-string/
Problem Description
Given two strings needle and haystack, return the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.
Example 1:
Input: haystack = "sadbutsad", needle = "sad"
Output: 0
Explanation: "sad" occurs at index 0 and 6.
The first occurrence is at index 0, so we return 0.
Example 2:
Input: haystack = "leetcode", needle = "leeto"
Output: -1
Explanation: "leeto" did not occur in "leetcode", so we return -1.
Constraints:
- 1 <= haystack.length, needle.length <= 10^4
- haystack and needle consist of only lowercase English characters.
Solution
1. Problem Deconstruction
Technical Reformulation: Given two strings S (haystack) and P (needle), identify the smallest integer i (0 ≤ i ≤ |S| - |P|) such that for every integer j (0 ≤ j < |P|), the equality S[i+j] = P[j] holds. If no such i exists, return -1. This is fundamentally a substring search problem requiring efficient pattern matching within a larger text.
Beginner-Friendly Explanation: Imagine you have a long paragraph (haystack) and you’re looking for the first place where a particular word (needle) appears. You need to check each possible starting position in the paragraph to see if the letters match your target word exactly. If you find a match, return the starting position. If you check all possible positions and never find the word, return -1.
Mathematical Formulation: Let:
- S = haystack string of length n
- P = needle string of length m
- S[i:j] = substring of S from index i to j-1

Find the minimal i ∈ [0, n-m] such that:

$$\forall j \in \{0, \dots, m-1\}:\quad S[i+j] = P[j]$$

Or equivalently:

$$\text{answer} = \min\{\, i \in [0,\ n-m] \;:\; S[i:i+m] = P \,\}$$

If the set is empty, return -1.
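This set formulation translates almost literally into Python. The following brute-force one-liner is a sketch for intuition only (the function name first_occurrence is ours), not the optimized solution developed later:

```python
def first_occurrence(S: str, P: str) -> int:
    n, m = len(S), len(P)
    # Minimum over all window starts whose slice equals P; -1 if no window matches.
    return min((i for i in range(n - m + 1) if S[i:i + m] == P), default=-1)
```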
Constraint Analysis:
- 1 <= haystack.length, needle.length <= 10^4: The upper bound of 10,000 characters means O(n²) solutions are theoretically acceptable but suboptimal. Linear or near-linear solutions are preferred.
- haystack and needle consist of only lowercase English characters: This simplifies character comparison to direct equality checks without case-conversion concerns.
- Hidden edge cases (sanity checks follow this list):
  - needle.length > haystack.length: Immediate return -1
  - needle.length == 0: Excluded by the constraints, but most implementations return 0 by convention
  - needle.length == haystack.length: Only one possible position to check
  - Repeated patterns in the needle (e.g., “abcabc” in “abcabcabc”)
  - Needle with all identical characters (“aaaa” in “aaabaaaa”)
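As a quick sanity check, these hidden edge cases can be written as assertions against the strStr implementation from the Code Deep Dive section (the expected values follow from the problem definition):

```python
# Assumes strStr from the Code Deep Dive section is in scope.
assert strStr("ab", "abc") == -1            # needle longer than haystack
assert strStr("abc", "abc") == 0            # equal lengths: only one position to check
assert strStr("abcabcabc", "abcabc") == 0   # repeated pattern inside the needle
assert strStr("aaabaaaa", "aaaa") == 4      # needle made of identical characters
```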
2. Intuition Scaffolding
Real-World Metaphor: Imagine you’re proofreading a long document looking for a specific phrase. You slide a transparent sheet with the target phrase over each position in the document, checking if the letters align perfectly. Experienced proofreaders develop mental shortcuts - they might skip ahead when they know certain letter combinations can’t match.
Gaming Analogy: Think of “Where’s Waldo?” where you’re scanning a crowded scene (haystack) for a specific character pattern (Waldo’s red-striped shirt). Beginners check every position systematically, while experts use pattern recognition to eliminate large sections quickly.
Math Analogy: This is equivalent to finding the earliest offset at which one discrete sequence aligns exactly with a window of another, much like sliding one signal across another in a cross-correlation and looking for the first peak that indicates a perfect character-by-character match.
Common Pitfalls:
- Off-by-one errors: Forgetting that valid positions range from 0 to n-m, not 0 to n-1
- Empty needle handling: Returning -1 instead of 0 when needle is empty
- Early termination: Abandoning the entire search after one failed alignment instead of trying the next starting position
- Inefficient backtracking: When a partial match fails, restarting from the next character instead of leveraging known information
- Case sensitivity: The problem states lowercase only, but in real interviews this might be overlooked
3. Approach Encyclopedia
Approach 1: Naive Brute Force
```
Algorithm NaiveSearch(S, P):
    n = length(S), m = length(P)
    if m == 0: return 0
    if m > n: return -1
    for i = 0 to n-m:
        j = 0
        while j < m AND S[i+j] == P[j]:
            j += 1
        if j == m:
            return i
    return -1
```
Complexity Proof:
- Time: Worst-case O((n-m+1)×m). For n = m, this is O(n²). When P appears at the end of S and shares long prefixes, we get maximum comparisons.
- Space: O(1) - only using loop counters and simple variables.
Visualization:
```
Haystack: "abcabcabcd"
Needle:   "abcabcd"

i=0: abcabcabcd
     abcabcd       ✓✓✓✓✓✓✗  mismatch on the 7th character ('a' vs 'd')
i=1: abcabcabcd
      abcabcd      ✗
i=2: abcabcabcd
       abcabcd     ✗
i=3: abcabcabcd
        abcabcd    ✓✓✓✓✓✓✓  MATCH at index 3
```
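A runnable Python version of the same scan, keeping the explicit inner comparison loop from the pseudocode (the function name naive_str_str is ours):

```python
def naive_str_str(haystack: str, needle: str) -> int:
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    for i in range(n - m + 1):
        j = 0
        # Compare characters until a mismatch or a complete match of the needle.
        while j < m and haystack[i + j] == needle[j]:
            j += 1
        if j == m:
            return i
    return -1

print(naive_str_str("abcabcabcd", "abcabcd"))  # 3, matching the trace above
```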
Approach 2: Knuth-Morris-Pratt (KMP) Algorithm
```
Algorithm KMP(S, P):
    n = length(S), m = length(P)
    if m == 0: return 0
    if m > n: return -1
    lps = ComputeLPS(P)       # Longest Prefix Suffix array
    i = 0, j = 0
    while i < n:
        if S[i] == P[j]:
            i += 1
            j += 1
        if j == m:
            return i - j      # Match found
        elif i < n AND S[i] != P[j]:
            if j != 0:
                j = lps[j-1]  # Use LPS to skip redundant checks
            else:
                i += 1
    return -1
```
```
Function ComputeLPS(P):
    m = length(P)
    lps = array of m zeros
    length = 0
    i = 1
    while i < m:
        if P[i] == P[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                length = lps[length-1]
            else:
                lps[i] = 0
                i += 1
    return lps
```
Complexity Proof:
- Time: O(n + m) - The KMP algorithm never backtracks in the haystack. Each character is examined at most twice.
- Space: O(m) - For storing the LPS array.
Visualization:
```
LPS for "abcabcd": [0,0,0,1,2,3,0]

Haystack: "abcabcabcd"
Needle:   "abcabcd"

Step 1: abcabcabcd
        abcabcd          ✓✓✓✓✓✓✗  mismatch at j = 6 (haystack index i = 6)
        Use LPS: j = lps[5] = 3, i stays at 6

Step 2: abcabcabcd
           abcabcd       ✓✓✓✓  resume comparing from j = 3, reach j = 7 = m
        MATCH at i - j = 10 - 7 = 3
```
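To sanity-check the LPS table above, here is a small standalone Python version of ComputeLPS, mirroring the pseudocode (the name compute_lps is ours):

```python
def compute_lps(pattern: str) -> list[int]:
    m = len(pattern)
    lps = [0] * m
    length = 0  # length of the previous longest proper prefix that is also a suffix
    i = 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            length = lps[length - 1]  # fall back without advancing i
        else:
            lps[i] = 0
            i += 1
    return lps

print(compute_lps("abcabcd"))  # [0, 0, 0, 1, 2, 3, 0]
```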
Approach 3: Rabin-Karp (Rolling Hash)
```
Algorithm RabinKarp(S, P):
    n = length(S), m = length(P)
    if m == 0: return 0
    if m > n: return -1
    base = 26                 # lowercase letters
    mod = 10^9 + 7            # large prime
    # Compute hash for pattern and first window
    hash_p = 0, hash_s = 0
    for i = 0 to m-1:
        hash_p = (hash_p * base + (P[i] - 'a')) % mod
        hash_s = (hash_s * base + (S[i] - 'a')) % mod
    power = base^(m-1) % mod
    for i = 0 to n-m:
        if hash_s == hash_p:
            # Verify actual match (handle hash collisions)
            if S[i:i+m] == P:
                return i
        if i < n-m:
            # Roll the hash
            hash_s = (hash_s - (S[i] - 'a') * power) % mod
            hash_s = (hash_s * base + (S[i+m] - 'a')) % mod
            if hash_s < 0: hash_s += mod
    return -1
```
Complexity Proof:
- Time: Average O(n + m), Worst-case O(n×m) due to hash collisions
- Space: O(1) - constant space for hash values
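A runnable Python sketch of this rolling-hash search, using the same base and modulus as the pseudocode (the function name rabin_karp is ours):

```python
def rabin_karp(haystack: str, needle: str) -> int:
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    base, mod = 26, 10**9 + 7
    hash_p = hash_s = 0
    for k in range(m):
        hash_p = (hash_p * base + (ord(needle[k]) - ord('a'))) % mod
        hash_s = (hash_s * base + (ord(haystack[k]) - ord('a'))) % mod
    power = pow(base, m - 1, mod)  # weight of the character leaving the window
    for i in range(n - m + 1):
        # Verify the slice on a hash hit to rule out collisions.
        if hash_s == hash_p and haystack[i:i + m] == needle:
            return i
        if i < n - m:
            # Roll the window: remove haystack[i], append haystack[i + m].
            hash_s = (hash_s - (ord(haystack[i]) - ord('a')) * power) % mod
            hash_s = (hash_s * base + (ord(haystack[i + m]) - ord('a'))) % mod
    return -1

print(rabin_karp("sadbutsad", "sad"))  # 0
```

Python's % operator already returns a non-negative result for a positive modulus, so the explicit negative-hash correction from the pseudocode is not needed here.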
4. Code Deep Dive
Optimal Solution (KMP):
```python
def strStr(haystack: str, needle: str) -> int:
    # Edge case: empty needle
    if not needle:
        return 0
    n, m = len(haystack), len(needle)
    # Edge case: needle longer than haystack
    if m > n:
        return -1

    # Build LPS (Longest Prefix Suffix) array
    lps = [0] * m
    length = 0  # length of previous longest prefix suffix
    i = 1
    while i < m:
        if needle[i] == needle[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                # Fall back to previous prefix
                length = lps[length - 1]
            else:
                # No matching prefix
                lps[i] = 0
                i += 1

    # KMP search
    i = j = 0  # i for haystack, j for needle
    while i < n:
        if haystack[i] == needle[j]:
            i += 1
            j += 1
        if j == m:
            # Found complete match
            return i - j
        elif i < n and haystack[i] != needle[j]:
            # Mismatch after j matches
            if j != 0:
                # Use LPS to skip redundant comparisons
                j = lps[j - 1]
            else:
                # No progress possible, move to next character
                i += 1
    return -1
```
Code Walkthrough:
- Edge cases: an empty needle returns 0 immediately; a needle longer than the haystack returns -1.
- LPS construction: for each position in the needle, record the length of the longest proper prefix that is also a suffix of the prefix ending there.
- KMP search: scan the haystack once, using the LPS array to avoid redundant comparisons after a mismatch.
- Complete match detection: when j reaches the needle length, the match starts at i - j.
- Mismatch handling: if j > 0, fall back to lps[j - 1]; otherwise advance i to the next haystack character.
Edge Case Handling (demonstrated by the calls below):
- Example 1 ("sadbutsad", "sad"): LPS = [0,0,0]; the first three characters match, so i - j = 0 is returned.
- Example 2 ("leetcode", "leeto"): LPS = [0,0,0,0,0]; the scan mismatches at i = 4 ('c' vs 'o') and eventually returns -1.
- Empty needle: returns 0 immediately.
- Needle longer than haystack: returns -1 immediately.
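A few worked calls, combining the problem's examples with the edge cases above (the expected outputs follow from the problem definition):

```python
print(strStr("sadbutsad", "sad"))   # 0  -> first of the two occurrences
print(strStr("leetcode", "leeto"))  # -1 -> "leeto" never occurs
print(strStr("abc", ""))            # 0  -> empty needle, by convention
print(strStr("a", "aa"))            # -1 -> needle longer than haystack
```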
5. Complexity War Room
Hardware-Aware Analysis:
- For maximum constraint (n=10^4), KMP performs ~20,000 operations
- LPS array uses ~40KB memory (10^4 integers × 4 bytes)
- Fits comfortably in L1/L2 cache (typical sizes 32-256KB)
- Memory access pattern is sequential, cache-friendly
Industry Comparison Table:
| Approach | Time Complexity | Space Complexity | Readability | Interview Viability | Real-World Usage |
| --- | --- | --- | --- | --- | --- |
| Brute Force | O(n×m) | O(1) | 10/10 | ✅ Good for small inputs | Rare in production |
| KMP | O(n+m) | O(m) | 6/10 | ✅ Excellent choice | Text editors, IDEs |
| Rabin-Karp | O(n+m) avg, O(n×m) worst | O(1) | 7/10 | ✅ Good alternative | Plagiarism detection |
| Boyer-Moore | O(n/m) best, O(n×m) worst | O(m) | 5/10 | ❌ Too complex | grep, text search |
6. Pro Mode Extras
Algorithm Variants:
- Multiple Pattern Search: Extend KMP to Aho-Corasick for searching multiple needles simultaneously
- Approximate Matching: Allow k mismatches using dynamic programming
- Case Insensitive: Convert both strings to lowercase first
- Wildcard Support: Handle ‘?’ as single character wildcard
- First and Last Occurrence: If asked to find both the first AND last occurrence, modify the search to continue after the first match (see the sketch after this list)
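A minimal sketch of the continue-after-first-match variant, using Python's built-in str.find for brevity rather than re-deriving KMP (the function name find_all_occurrences is ours):

```python
def find_all_occurrences(haystack: str, needle: str) -> list[int]:
    # Collect every start index by resuming the search one position
    # after each hit, so overlapping occurrences are included.
    hits = []
    start = haystack.find(needle)
    while start != -1:
        hits.append(start)
        start = haystack.find(needle, start + 1)
    return hits

print(find_all_occurrences("sadbutsad", "sad"))  # [0, 6] -> first is 0, last is 6
```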
Interview Cheat Sheet:
- First Mention: Always state time/space complexity upfront
- Progression: Start with brute force, then optimize
- Edge Cases: Explicitly handle empty strings and size constraints
- Trade-offs: Discuss readability vs performance for the given constraints
- Real-world: Mention where these algorithms are actually used
Advanced Optimization: For extremely large haystacks (gigabytes), consider:
- Memory-mapped files to avoid loading entire haystack
- Parallel processing by splitting haystack into chunks
- Bloom filters for quick elimination of impossible matches
- SIMD instructions for vectorized character comparison