
#28 - Find the Index of the First Occurrence in a String

September 29, 2025

Find the Index of the First Occurrence in a String

Problem Description

Given two strings needle and haystack, return the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.

Example 1:


Input: haystack = "sadbutsad", needle = "sad"
Output: 0
Explanation: "sad" occurs at index 0 and 6.
The first occurrence is at index 0, so we return 0.

Example 2:


Input: haystack = "leetcode", needle = "leeto"
Output: -1
Explanation: "leeto" did not occur in "leetcode", so we return -1.

Constraints:

  • 1 <= haystack.length, needle.length <= 10^4
  • haystack and needle consist of only lowercase English characters.

Solution

1. Problem Deconstruction

Technical Reformulation: Given two strings S (haystack) and P (needle), identify the smallest integer i (0 ≤ i ≤ |S| - |P|) such that for every integer j (0 ≤ j < |P|), the equality S[i+j] = P[j] holds. If no such i exists, return -1. This is fundamentally a substring search problem requiring optimal pattern matching within a larger text corpus.

Beginner-Friendly Explanation: Imagine you have a long paragraph (haystack) and you’re looking for the first place where a particular word (needle) appears. You need to check each possible starting position in the paragraph to see if the letters match your target word exactly. If you find a match, return the starting position. If you check all possible positions and never find the word, return -1.

Mathematical Formulation: Let:

  • S = haystack string of length n
  • P = needle string of length m
  • S[i:j] = substring of S from index i to j-1

Find the minimal i ∈ [0, n-m] such that:

S[i:i+m] = P[0:m]

Or equivalently:

min{ i ∈ [0, n-m] | ∀ j ∈ [0, m-1], S[i+j] = P[j] }

If the set is empty, return -1.

Constraint Analysis:

  • 1 <= haystack.length, needle.length <= 10^4: With both lengths up to 10,000, a quadratic O(n×m) scan can require tens of millions of character comparisons in the worst case - workable, but slow in interpreted languages, so linear or near-linear solutions are preferred.
  • haystack and needle consist of only lowercase English characters: This simplifies character comparison to direct equality checks without case conversion concerns.
  • Hidden edge cases:
    • needle.length > haystack.length: Immediate return -1
    • needle.length == 0: Most implementations return 0 by convention
    • needle.length == haystack.length: Only one possible position to check
    • Repeated patterns in needle (e.g., “abcabc” in “abcabcabc”)
    • Needle with all identical characters (“aaaa” in “aaabaaaa”)
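As a quick sanity check on this contract and the edge cases above, Python's built-in str.find already follows the same semantics (index of the first occurrence, -1 if absent, 0 for an empty needle), so it makes a handy reference oracle while developing a manual solution. The function name below is just illustrative:

# Reference oracle using Python's built-in str.find (same contract as the problem).
def str_str_reference(haystack: str, needle: str) -> int:
    return haystack.find(needle)

assert str_str_reference("sadbutsad", "sad") == 0     # first occurrence wins
assert str_str_reference("leetcode", "leeto") == -1   # absent needle
assert str_str_reference("abc", "abcd") == -1         # needle longer than haystack
assert str_str_reference("abc", "") == 0              # empty needle returns 0 by convention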

2. Intuition Scaffolding

Real-World Metaphor: Imagine you’re proofreading a long document looking for a specific phrase. You slide a transparent sheet with the target phrase over each position in the document, checking if the letters align perfectly. Experienced proofreaders develop mental shortcuts - they might skip ahead when they know certain letter combinations can’t match.

Gaming Analogy: Think of “Where’s Waldo?” where you’re scanning a crowded scene (haystack) for a specific character pattern (Waldo’s red-striped shirt). Beginners check every position systematically, while experts use pattern recognition to eliminate large sections quickly.

Math Analogy: This is equivalent to finding the earliest offset at which two discrete sequences align exactly. It can also be framed as cross-correlation: slide the pattern over the text and look for the first offset where the number of matching positions equals m, which is the idea behind FFT-based string matching.

Common Pitfalls:

  1. Off-by-one errors: Forgetting that valid positions range from 0 to n-m, not 0 to n-1
  2. Empty needle handling: Returning -1 instead of 0 when needle is empty
  3. Early termination: Abandoning the whole search after one partial match fails, instead of moving on to the next starting position
  4. Inefficient backtracking: When a partial match fails, restarting from the next character instead of leveraging known information
  5. Case sensitivity: This problem guarantees lowercase input, but interview variants may mix cases, so confirm the assumption before coding

3. Approach Encyclopedia

Approach 1: Naive Brute Force

Algorithm NaiveSearch(S, P):
    n = length(S), m = length(P)
    if m == 0: return 0
    if m > n: return -1
    
    for i = 0 to n-m:
        j = 0
        while j < m AND S[i+j] == P[j]:
            j += 1
        if j == m:
            return i
    return -1

Complexity Proof:

  • Time: Worst-case O((n-m+1)×m), which peaks around m ≈ n/2 at roughly n²/4 comparisons, i.e. O(n²). Highly repetitive inputs (e.g. searching for "aaab" in a long run of 'a's) force almost m comparisons at nearly every shift.
  • Space: O(1) - only using loop counters and simple variables.

Visualization:

Haystack: "abcabcabcd"
Needle:   "abcabcd"

i=0: abcabcabcd
     abcabcd    ✓✓✓✓✓✓✗ (mismatch at j=6: 'a' vs 'd')
     
i=1: abcabcabcd
      abcabcd  ✗
      
i=2: abcabcabcd  
       abcabcd ✗
       
i=3: abcabcabcd
        abcabcd ✓✓✓✓✓✓✓ MATCH!
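
A direct Python translation of the pseudocode above; this is a sketch, and strStr_bruteforce is just an illustrative name:

# Naive brute force: try every alignment, compare character by character.
def strStr_bruteforce(haystack: str, needle: str) -> int:
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    for i in range(n - m + 1):
        j = 0
        while j < m and haystack[i + j] == needle[j]:
            j += 1
        if j == m:          # all m characters matched at shift i
            return i
    return -1

assert strStr_bruteforce("abcabcabcd", "abcabcd") == 3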

Approach 2: Knuth-Morris-Pratt (KMP) Algorithm

Algorithm KMP(S, P):
    n = length(S), m = length(P)
    if m == 0: return 0
    if m > n: return -1
    
    lps = ComputeLPS(P)  # Longest Prefix Suffix array
    
    i = 0, j = 0
    while i < n:
        if S[i] == P[j]:
            i += 1
            j += 1
        if j == m:
            return i - j  # Match found
        elif i < n AND S[i] != P[j]:
            if j != 0:
                j = lps[j-1]  # Use LPS to skip redundant checks
            else:
                i += 1
    return -1

Function ComputeLPS(P):
    m = length(P)
    lps[0] = 0
    length = 0
    i = 1
    while i < m:
        if P[i] == P[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                length = lps[length-1]
            else:
                lps[i] = 0
                i += 1
    return lps

Complexity Proof:

  • Time: O(n + m) - The KMP algorithm never backtracks in the haystack. Each character is examined at most twice.
  • Space: O(m) - For storing the LPS array.

Visualization:

LPS for "abcabcd": [0,0,0,1,2,3,0]

Haystack: "abcabcabcd"
Needle:   "abcabcd"

Step 1: abcabcabcd
        abcabcd       ✓✓✓✓✓✓✗ mismatch at j=6 (haystack index i=6)

Use LPS: j = lps[5] = 3, i stays at 6 (the already-matched prefix "abc" carries over)
Continue: abcabcabcd
             abcabcd  ✓✓✓✓ MATCH! (j reaches 7, return i - j = 10 - 7 = 3)
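
To see where the LPS values in this trace come from, a small helper mirroring the ComputeLPS pseudocode (the name compute_lps is just for this sketch) can print the table for a few patterns:

# Longest proper prefix that is also a suffix, for each prefix of the pattern.
def compute_lps(pattern: str) -> list[int]:
    lps = [0] * len(pattern)
    length = 0                        # length of the current matched prefix
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            length = lps[length - 1]  # fall back to a shorter prefix
        else:
            lps[i] = 0
            i += 1
    return lps

print(compute_lps("abcabcd"))   # [0, 0, 0, 1, 2, 3, 0]
print(compute_lps("aaaa"))      # [0, 1, 2, 3]
print(compute_lps("abab"))      # [0, 0, 1, 2]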

Approach 3: Rabin-Karp (Rolling Hash)

Algorithm RabinKarp(S, P):
    n = length(S), m = length(P)
    if m == 0: return 0
    if m > n: return -1
    
    base = 26  # lowercase letters
    mod = 10^9 + 7  # large prime
    
    # Compute hash for pattern and first window
    hash_p = 0, hash_s = 0
    for i = 0 to m-1:
        hash_p = (hash_p * base + (P[i] - 'a')) % mod
        hash_s = (hash_s * base + (S[i] - 'a')) % mod
    
    power = base^(m-1) % mod
    
    for i = 0 to n-m:
        if hash_s == hash_p:
            # Verify actual match (handle hash collisions)
            if S[i:i+m] == P:
                return i
        if i < n-m:
            # Roll the hash
            hash_s = (hash_s - (S[i] - 'a') * power) % mod
            hash_s = (hash_s * base + (S[i+m] - 'a')) % mod
            if hash_s < 0: hash_s += mod
    
    return -1

Complexity Proof:

  • Time: Average O(n + m), Worst-case O(n×m) due to hash collisions
  • Space: O(1) - constant space for hash values
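
A runnable Python sketch of the rolling-hash search above, using the same base and modulus assumed in the pseudocode (the function name is illustrative; in Python the % operator already returns a non-negative result, so no negative-hash fix-up is needed):

# Rabin-Karp: compare rolling hashes first, verify actual characters on a hash hit.
def strStr_rabin_karp(haystack: str, needle: str) -> int:
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1

    base, mod = 26, 10**9 + 7
    hash_p = hash_s = 0
    for i in range(m):
        hash_p = (hash_p * base + ord(needle[i]) - ord('a')) % mod
        hash_s = (hash_s * base + ord(haystack[i]) - ord('a')) % mod
    power = pow(base, m - 1, mod)       # base^(m-1) mod p, used to drop the leading character

    for i in range(n - m + 1):
        if hash_s == hash_p and haystack[i:i + m] == needle:
            return i                    # verified match (guards against hash collisions)
        if i < n - m:
            hash_s = (hash_s - (ord(haystack[i]) - ord('a')) * power) % mod
            hash_s = (hash_s * base + ord(haystack[i + m]) - ord('a')) % mod
    return -1

assert strStr_rabin_karp("sadbutsad", "sad") == 0
assert strStr_rabin_karp("leetcode", "leeto") == -1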

4. Code Deep Dive

Optimal Solution (KMP):

def strStr(haystack: str, needle: str) -> int:
    # Edge case: empty needle
    if not needle:
        return 0
    
    n, m = len(haystack), len(needle)
    
    # Edge case: needle longer than haystack
    if m > n:
        return -1
    
    # Build LPS (Longest Prefix Suffix) array
    lps = [0] * m
    length = 0  # length of previous longest prefix suffix
    i = 1
    
    while i < m:
        if needle[i] == needle[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                # Fall back to previous prefix
                length = lps[length - 1]
            else:
                # No matching prefix
                lps[i] = 0
                i += 1
    
    # KMP search
    i = j = 0  # i for haystack, j for needle
    while i < n:
        if haystack[i] == needle[j]:
            i += 1
            j += 1
        
        if j == m:
            # Found complete match
            return i - j
        elif i < n and haystack[i] != needle[j]:
            # Mismatch after j matches
            if j != 0:
                # Use LPS to skip redundant comparisons
                j = lps[j - 1]
            else:
                # No progress possible, move to next character
                i += 1
    
    return -1

Line-by-Line Analysis:

  • Lines 2-10: Edge cases - an empty needle returns 0, a needle longer than the haystack returns -1
  • Lines 12-29: LPS construction - for each position, record the length of the longest proper prefix of the needle that is also a suffix ending there
  • Lines 31-48: KMP search - the LPS array determines how far to fall back after a mismatch, so the haystack index never moves backwards
  • Lines 38-40: Complete match detection - when j reaches the needle length, return i - j
  • Lines 41-48: Mismatch handling - fall back to lps[j-1] if any characters matched, otherwise advance i

Edge Case Handling:

  • Example 1 ("sadbutsad", "sad"): LPS = [0,0,0]; j reaches 3 when i = 3, so the function returns i - j = 0
  • Example 2 ("leetcode", "leeto"): LPS = [0,0,0,0,0], fails at i=4, returns -1
  • Empty needle: Returns 0 immediately
  • Needle longer: Returns -1 immediately
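
These behaviors can be confirmed with a few quick assertions, assuming the strStr function defined above is in scope:

# Quick verification of the examples and edge cases (assumes strStr defined above).
assert strStr("sadbutsad", "sad") == 0
assert strStr("leetcode", "leeto") == -1
assert strStr("abcabcabcd", "abcabcd") == 3   # requires the LPS fallback
assert strStr("aaabaaaa", "aaaa") == 4        # repeated characters in the needle
assert strStr("abc", "abcd") == -1            # needle longer than haystack
assert strStr("abc", "") == 0                 # empty needle
print("all checks passed")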

5. Complexity War Room

Hardware-Aware Analysis:

  • For maximum constraint (n=10^4), KMP performs ~20,000 operations
  • LPS array uses ~40KB memory (10^4 integers × 4 bytes)
  • Fits comfortably in L1/L2 cache (typical sizes 32-256KB)
  • Memory access pattern is sequential, cache-friendly
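
To put rough numbers behind these estimates, a small timeit sketch on a deliberately repetitive input can compare the two approaches. This assumes both strStr (above) and a brute-force version like strStr_bruteforce from Approach 1 are defined; the input sizes are chosen to hit the constraint ceiling:

import timeit

# Pathological input: long runs of 'a' make the brute force re-scan long prefixes.
haystack = "a" * 10_000
needle = "a" * 5_000 + "b"     # never matches; worst case for the naive scan

kmp_time = timeit.timeit(lambda: strStr(haystack, needle), number=1)
naive_time = timeit.timeit(lambda: strStr_bruteforce(haystack, needle), number=1)
print(f"KMP: {kmp_time:.3f}s  naive: {naive_time:.3f}s")   # naive is visibly slower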

Industry Comparison Table:

Approach    | Time Complexity           | Space Complexity | Readability | Interview Viability      | Real-World Usage
Brute Force | O(n×m)                    | O(1)             | 10/10       | ✅ Good for small inputs | Rare in production
KMP         | O(n+m)                    | O(m)             | 6/10        | ✅ Excellent choice      | Text editors, IDEs
Rabin-Karp  | O(n+m) avg, O(n×m) worst  | O(1)             | 7/10        | ✅ Good alternative      | Plagiarism detection
Boyer-Moore | O(n/m) best, O(n×m) worst | O(m)             | 5/10        | ❌ Too complex           | grep, text search

6. Pro Mode Extras

Algorithm Variants:

  1. Multiple Pattern Search: Extend KMP to Aho-Corasick for searching multiple needles simultaneously
  2. Approximate Matching: Allow k mismatches using dynamic programming
  3. Case Insensitive: Convert both strings to lowercase first
  4. Wildcard Support: Handle ‘?’ as single character wildcard
  5. First and Last Occurrence: If asked to find the first AND last occurrence, continue the search after the first match instead of returning immediately (see the all-occurrences sketch after this list)
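
For variant 5, the KMP search can keep scanning after a hit by treating a full match like a mismatch and falling back through the LPS table. A sketch, assuming the compute_lps helper from the earlier LPS example (or the equivalent inline logic) is available:

# Report every occurrence by continuing the scan after each match.
def find_all_occurrences(haystack: str, needle: str) -> list[int]:
    if not needle:
        return []
    lps = compute_lps(needle)
    hits, j = [], 0
    for i, ch in enumerate(haystack):
        while j > 0 and ch != needle[j]:
            j = lps[j - 1]            # fall back on mismatch
        if ch == needle[j]:
            j += 1
        if j == len(needle):
            hits.append(i - j + 1)    # record the match...
            j = lps[j - 1]            # ...then fall back to keep scanning
    return hits

print(find_all_occurrences("sadbutsad", "sad"))   # [0, 6]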

Interview Cheat Sheet:

  • First Mention: Always state time/space complexity upfront
  • Progression: Start with brute force, then optimize
  • Edge Cases: Explicitly handle empty strings and size constraints
  • Trade-offs: Discuss readability vs performance for the given constraints
  • Real-world: Mention where these algorithms are actually used

Advanced Optimization: For extremely large haystacks (gigabytes), consider:

  • Memory-mapped files to avoid loading entire haystack
  • Parallel processing by splitting haystack into chunks
  • Bloom filters for quick elimination of impossible matches
  • SIMD instructions for vectorized character comparison
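
As a minimal illustration of the memory-mapped idea, Python's mmap module exposes a find method with the same first-index-or -1 contract, so a file can be searched without reading it fully into memory. The sketch below creates a small demo file purely for the example; the filename is arbitrary:

import mmap

# Demo: search a file without loading it all into memory.
with open("demo_haystack.txt", "wb") as f:
    f.write(b"x" * 1_000_000 + b"needle in a haystack" + b"x" * 1_000_000)

with open("demo_haystack.txt", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        idx = mm.find(b"needle")   # same contract: first index, or -1 if absent
        print(idx)                 # 1000000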