H-Index

Difficulty: Medium
Topics: Array, Sorting, Counting Sort
Link: https://leetcode.com/problems/h-index/

Problem Description

Given an array of integers citations where citations[i] is the number of citations a researcher received for their ith paper, return the researcher’s h-index.

According to the definition of h-index on Wikipedia: The h-index is defined as the maximum value of h such that the given researcher has published at least h papers that have each been cited at least h times.

Example 1:


Input: citations = [3,0,6,1,5]
Output: 3
Explanation: [3,0,6,1,5] means the researcher has 5 papers in total and each of them had received 3, 0, 6, 1, 5 citations respectively.
Since the researcher has 3 papers with at least 3 citations each and the remaining two with no more than 3 citations each, their h-index is 3.

Example 2:


Input: citations = [1,3,1]
Output: 1

Constraints:

n == citations.length
1 <= n <= 5000
0 <= citations[i] <= 1000

Solution

1. Problem Deconstruction

Rewrite the Problem

Technical Definition:
Given an array citations of length n, where each element citations[i] represents the citation count of the i-th paper, compute the largest integer h such that at least h papers in the array have citation counts ≥ h. The value h must satisfy 0 ≤ h ≤ n.
Beginner-Friendly Explanation:
Imagine you have a list of how many times each research paper was cited. The h-index is a number that measures both productivity and impact. It is the biggest number h where the researcher has at least h papers that each have at least h citations. For example, if h=3, they must have 3 papers with 3 or more citations.
Mathematical Formulation:
Let c be the sorted list of citations in descending order. The h-index is defined as:
$h = \max \left\{ i \in \{0, 1, \dots, n\} \mid c_i \geq i + 1 \quad \forall i < h \right\}$
Alternatively, using cumulative counts:
$h = \max \left\{ k \in \mathbb{Z} \mid \sum_{i=k}^{n} \mathbb{I}(c_i \geq k) \geq k \right\}$
where $\mathbb{I}$ is the indicator function.

Constraint Analysis

n == citations.length (1 ≤ n ≤ 5000):
- Limits solutions to O(n²) at worst (25e6 operations), but O(n log n) or O(n) is preferred.
- Implies edge cases: n=1 (single paper), n=5000 (large but manageable).
0 ≤ citations[i] ≤ 1000:
- Citations can be zero, reducing possible h (e.g., all zeros → h=0).
- Values bounded by 1000 allow counting-based approaches (e.g., bucket sort).
Hidden Edge Cases:
- All citations zero → h=0.
- All citations ≥ n → h=n.
- Mixed values with duplicates (e.g., [1,1,1] → h=1).
- Citations exceeding n (e.g., [1000] for n=1 → h=1).

2. Intuition Scaffolding

Analogies

Real-World Metaphor:
Like a school with n students. Each student’s score is a citation. The h-index is the largest class size h where every student in that class has a score ≥ h.
Gaming Analogy:
In a game with n levels, each level has a coin count. The h-index is the maximum number of levels where you collected at least that many coins (e.g., 3 levels with ≥3 coins each).
Math Analogy:
Finding the largest h where the cumulative distribution function of citations (in descending order) satisfies F(h) ≥ h.

Common Pitfalls

Assuming h is the median: Sorting and taking the median fails for [1,1,1] (median=1, but h=1).
Ignoring zero citations: Papers with zero citations reduce the cumulative count for low h.
Overcounting high citations: A paper with 1000 citations counts for all h ≤ 1000, but h cannot exceed n.
Linear scan without sorting: Checking each h from 0 to n blindly leads to O(n²) inefficiency.
Misinterpreting sorted order: After sorting ascending, the condition citations[i] ≥ n-i must hold for h.

3. Approach Encyclopedia

Approach 1: Brute Force (Check All h)

Idea: For each possible h from n down to 0, count papers with citations ≥ h. Return the first h where count ≥ h.

Pseudocode:

n = len(citations)
for h in range(n, -1, -1):
    count = 0
    for c in citations:
        if c >= h:
            count += 1
    if count >= h:
        return h

Complexity:
- Time: O(n²) → Worst-case 25e6 operations for n=5000 (acceptable in Pyton/C++ but suboptimal).
- Space: O(1).

Visualization:

citations = [3,0,6,1,5]
Check h=5: Count papers ≥5 → 2 (6,5) → 2 < 5 → skip.
Check h=4: Count papers ≥4 → 2 (6,5) → 2 < 4 → skip.
Check h=3: Count papers ≥3 → 3 (3,6,5) → 3 ≥ 3 → return 3.

Approach 2: Sorting and Linear Scan

Idea: Sort citations descending. Find the last index i where citations[i] ≥ i+1.

Pseudocode:

citations.sort(reverse=True)
for i in range(len(citations)):
    if citations[i] < i+1:
        return i
return len(citations)

Complexity:
- Time: O(n log n) → Dominated by sorting.
- Space: O(1) if in-place sort, else O(n).

Visualization:

Sorted: [6,5,3,1,0]
i=0: 6 ≥ 1 → valid.
i=1: 5 ≥ 2 → valid.
i=2: 3 ≥ 3 → valid.
i=3: 1 < 4 → invalid → return i=3.

Approach 3: Counting Sort (Optimal)

Idea: Use a bucket array counts of size n+1 to tally papers per citation count (capping citations ≥ n at n). Traverse backwards to compute cumulative counts.

Pseudocode:

n = len(citations)
counts = [0] * (n+1)
for c in citations:
    if c >= n:
        counts[n] += 1
    else:
        counts[c] += 1
total = 0
for k in range(n, -1, -1):
    total += counts[k]
    if total >= k:
        return k

Complexity:
- Time: O(n) → Two passes over array and counts.
- Space: O(n) → Counts array of size n+1.

Visualization:

citations = [3,0,6,1,5], n=5
counts[0]=1, counts[1]=1, counts[3]=1, counts[5]=2
k=5: total=2 → 2 < 5 → skip.
k=4: total=2 → 2 < 4 → skip.
k=3: total=3 → 3 ≥ 3 → return 3.

4. Code Deep Dive

Optimal Solution (Counting Method)

def hIndex(citations):
    n = len(citations)
    counts = [0] * (n + 1)  # Step 1: Initialize counts array for indices 0 to n.
    
    # Step 2: Populate counts array.
    for c in citations:
        if c >= n:
            counts[n] += 1  # Cap citations ≥ n at n.
        else:
            counts[c] += 1  # Count papers per citation value.
    
    total = 0
    # Step 3: Traverse from n down to 0 to find largest h.
    for k in range(n, -1, -1):
        total += counts[k]  # Cumulative count of papers with ≥ k citations.
        if total >= k:  # Condition for h-index.
            return k
    return 0  # Fallback for all zeros.

Line-by-Line Annotations

Line 2: n is the number of papers.
Line 3: counts array tracks how many papers have exactly k citations (or n if ≥ n).
Lines 6-9: Classify each citation: cap high values at n to avoid index errors.
Line 11: total accumulates the number of papers with ≥ k citations.
Lines 13-16: Iterate from high k to low. The first k where total ≥ k is the h-index.

Edge Case Handling

Example 2 ([1,3,1]):
n=3, counts[1]=2, counts[3]=1.
k=3: total=1 < 3 → skip.
k=2: total=1 < 2 → skip.
k=1: total=3 ≥ 1 → return 1.
All zeros ([0,0,0]):
counts[0]=3.
k=3→0: total=3 at k=0 → return 0.
Single paper ([1000]):
n=1, counts[1]=1.
k=1: total=1 ≥ 1 → return 1.

5. Complexity War Room

Hardware-Aware Analysis

The counts array size is n+1 ≈ 5e3 elements → 20 KB (4 bytes/int). Fits in L1/L2 cache, ensuring fast access.
Two linear passes: O(n) time → Efficient even for worst-case n=5000 (10k operations).

Industry Comparison Table

Approach	Time	Space	Readability	Interview Viability
Brute Force	O(n²)	O(1)	High	❌ (Fails large n)
Sorting	O(n log n)	O(1)	High	✅ (Acceptable)
Counting	O(n)	O(n)	Medium	✅ (Optimal)

6. Pro Mode Extras

Variants Section

Dynamic h-index: If citations are streamed, use a Fenwick tree or segment tree to update counts and compute h-index in O(log n) per insertion.
h-index with weights: Each paper has a weight; find h such that the sum of weights for papers with citations ≥ h is ≥ h.
Multiple researchers: Compute h-index for a group by merging citation lists.

Interview Cheat Sheet

First Step: Clarify the h-index definition (e.g., “至少 h 篇论文至少被引用 h 次”).
Mention Constraints: “Since n is 5000, we can use O(n log n) sorting.”
Optimize: “Counting method is O(n) and leverages bounded citations.”
Edge Cases: “Always test with all zeros, all high values, and duplicates.”

#274 - H-Index

H-Index

Problem Description

Solution

1. Problem Deconstruction

2. Intuition Scaffolding

3. Approach Encyclopedia

4. Code Deep Dive

5. Complexity War Room

6. Pro Mode Extras