← ~/lc-grind/posts

#274 - H-Index

September 15, 2025

H-Index

Problem Description

Given an array of integers citations where citations[i] is the number of citations a researcher received for their ith paper, return the researcher’s h-index.

According to the definition of h-index on Wikipedia: The h-index is defined as the maximum value of h such that the given researcher has published at least h papers that have each been cited at least h times.

Example 1:


Input: citations = [3,0,6,1,5]
Output: 3
Explanation: [3,0,6,1,5] means the researcher has 5 papers in total and each of them had received 3, 0, 6, 1, 5 citations respectively.
Since the researcher has 3 papers with at least 3 citations each and the remaining two with no more than 3 citations each, their h-index is 3.

Example 2:


Input: citations = [1,3,1]
Output: 1

Constraints:

  • n == citations.length
  • 1 <= n <= 5000
  • 0 <= citations[i] <= 1000

Solution

1. Problem Deconstruction

Rewrite the Problem

  1. Technical Definition:
    Given an array citations of length n, where each element citations[i] represents the citation count of the i-th paper, compute the largest integer h such that at least h papers in the array have citation counts ≥ h. The value h must satisfy 0 ≤ h ≤ n.

  2. Beginner-Friendly Explanation:
    Imagine you have a list of how many times each research paper was cited. The h-index is a number that measures both productivity and impact. It is the biggest number h where the researcher has at least h papers that each have at least h citations. For example, if h=3, they must have 3 papers with 3 or more citations.

  3. Mathematical Formulation:
    Let c be the sorted list of citations in descending order. The h-index is defined as:

    h=max{i{0,1,,n}cii+1i<h}h = \max \left\{ i \in \{0, 1, \dots, n\} \mid c_i \geq i + 1 \quad \forall i < h \right\}

    Alternatively, using cumulative counts:

    h=max{kZi=knI(cik)k}h = \max \left\{ k \in \mathbb{Z} \mid \sum_{i=k}^{n} \mathbb{I}(c_i \geq k) \geq k \right\}

    where \(\mathbb{I}\) is the indicator function.

Constraint Analysis

  • n == citations.length (1 ≤ n ≤ 5000):
    • Limits solutions to O(n²) at worst (25e6 operations), but O(n log n) or O(n) is preferred.
    • Implies edge cases: n=1 (single paper), n=5000 (large but manageable).
  • 0 ≤ citations[i] ≤ 1000:
    • Citations can be zero, reducing possible h (e.g., all zeros → h=0).
    • Values bounded by 1000 allow counting-based approaches (e.g., bucket sort).
  • Hidden Edge Cases:
    • All citations zero → h=0.
    • All citations ≥ n → h=n.
    • Mixed values with duplicates (e.g., [1,1,1] → h=1).
    • Citations exceeding n (e.g., [1000] for n=1h=1).

2. Intuition Scaffolding

Analogies

  1. Real-World Metaphor:
    Like a school with n students. Each student’s score is a citation. The h-index is the largest class size h where every student in that class has a score ≥ h.

  2. Gaming Analogy:
    In a game with n levels, each level has a coin count. The h-index is the maximum number of levels where you collected at least that many coins (e.g., 3 levels with ≥3 coins each).

  3. Math Analogy:
    Finding the largest h where the cumulative distribution function of citations (in descending order) satisfies F(h) ≥ h.

Common Pitfalls

  1. Assuming h is the median: Sorting and taking the median fails for [1,1,1] (median=1, but h=1).
  2. Ignoring zero citations: Papers with zero citations reduce the cumulative count for low h.
  3. Overcounting high citations: A paper with 1000 citations counts for all h ≤ 1000, but h cannot exceed n.
  4. Linear scan without sorting: Checking each h from 0 to n blindly leads to O(n²) inefficiency.
  5. Misinterpreting sorted order: After sorting ascending, the condition citations[i] ≥ n-i must hold for h.

3. Approach Encyclopedia

Approach 1: Brute Force (Check All h)

  • Idea: For each possible h from n down to 0, count papers with citations ≥ h. Return the first h where count ≥ h.
  • Pseudocode:
    n = len(citations)
    for h in range(n, -1, -1):
        count = 0
        for c in citations:
            if c >= h:
                count += 1
        if count >= h:
            return h
    
  • Complexity:
    • Time: O(n²) → Worst-case 25e6 operations for n=5000 (acceptable in Pyton/C++ but suboptimal).
    • Space: O(1).
  • Visualization:
    citations = [3,0,6,1,5]
    Check h=5: Count papers ≥5 → 2 (6,5) → 2 < 5 → skip.
    Check h=4: Count papers ≥4 → 2 (6,5) → 2 < 4 → skip.
    Check h=3: Count papers ≥3 → 3 (3,6,5) → 3 ≥ 3 → return 3.
    

Approach 2: Sorting and Linear Scan

  • Idea: Sort citations descending. Find the last index i where citations[i] ≥ i+1.
  • Pseudocode:
    citations.sort(reverse=True)
    for i in range(len(citations)):
        if citations[i] < i+1:
            return i
    return len(citations)
    
  • Complexity:
    • Time: O(n log n) → Dominated by sorting.
    • Space: O(1) if in-place sort, else O(n).
  • Visualization:
    Sorted: [6,5,3,1,0]
    i=0: 6 ≥ 1 → valid.
    i=1: 5 ≥ 2 → valid.
    i=2: 3 ≥ 3 → valid.
    i=3: 1 < 4 → invalid → return i=3.
    

Approach 3: Counting Sort (Optimal)

  • Idea: Use a bucket array counts of size n+1 to tally papers per citation count (capping citations ≥ n at n). Traverse backwards to compute cumulative counts.
  • Pseudocode:
    n = len(citations)
    counts = [0] * (n+1)
    for c in citations:
        if c >= n:
            counts[n] += 1
        else:
            counts[c] += 1
    total = 0
    for k in range(n, -1, -1):
        total += counts[k]
        if total >= k:
            return k
    
  • Complexity:
    • Time: O(n) → Two passes over array and counts.
    • Space: O(n) → Counts array of size n+1.
  • Visualization:
    citations = [3,0,6,1,5], n=5
    counts[0]=1, counts[1]=1, counts[3]=1, counts[5]=2
    k=5: total=2 → 2 < 5 → skip.
    k=4: total=2 → 2 < 4 → skip.
    k=3: total=3 → 3 ≥ 3 → return 3.
    

4. Code Deep Dive

Optimal Solution (Counting Method)

def hIndex(citations):
    n = len(citations)
    counts = [0] * (n + 1)  # Step 1: Initialize counts array for indices 0 to n.
    
    # Step 2: Populate counts array.
    for c in citations:
        if c >= n:
            counts[n] += 1  # Cap citations ≥ n at n.
        else:
            counts[c] += 1  # Count papers per citation value.
    
    total = 0
    # Step 3: Traverse from n down to 0 to find largest h.
    for k in range(n, -1, -1):
        total += counts[k]  # Cumulative count of papers with ≥ k citations.
        if total >= k:  # Condition for h-index.
            return k
    return 0  # Fallback for all zeros.

Line-by-Line Annotations

  • Line 2: n is the number of papers.
  • Line 3: counts array tracks how many papers have exactly k citations (or n if ≥ n).
  • Lines 6-9: Classify each citation: cap high values at n to avoid index errors.
  • Line 11: total accumulates the number of papers with ≥ k citations.
  • Lines 13-16: Iterate from high k to low. The first k where total ≥ k is the h-index.

Edge Case Handling

  • Example 2 ([1,3,1]):
    n=3, counts[1]=2, counts[3]=1.
    k=3: total=1 < 3 → skip.
    k=2: total=1 < 2 → skip.
    k=1: total=3 ≥ 1 → return 1.
  • All zeros ([0,0,0]):
    counts[0]=3.
    k=3→0: total=3 at k=0 → return 0.
  • Single paper ([1000]):
    n=1, counts[1]=1.
    k=1: total=1 ≥ 1 → return 1.

5. Complexity War Room

Hardware-Aware Analysis

  • The counts array size is n+1 ≈ 5e3 elements → 20 KB (4 bytes/int). Fits in L1/L2 cache, ensuring fast access.
  • Two linear passes: O(n) time → Efficient even for worst-case n=5000 (10k operations).

Industry Comparison Table

Approach Time Space Readability Interview Viability
Brute Force O(n²) O(1) High ❌ (Fails large n)
Sorting O(n log n) O(1) High ✅ (Acceptable)
Counting O(n) O(n) Medium ✅ (Optimal)

6. Pro Mode Extras

Variants Section

  • Dynamic h-index: If citations are streamed, use a Fenwick tree or segment tree to update counts and compute h-index in O(log n) per insertion.
  • h-index with weights: Each paper has a weight; find h such that the sum of weights for papers with citations ≥ h is ≥ h.
  • Multiple researchers: Compute h-index for a group by merging citation lists.

Interview Cheat Sheet

  • First Step: Clarify the h-index definition (e.g., “至少 h 篇论文至少被引用 h 次”).
  • Mention Constraints: “Since n is 5000, we can use O(n log n) sorting.”
  • Optimize: “Counting method is O(n) and leverages bounded citations.”
  • Edge Cases: “Always test with all zeros, all high values, and duplicates.”