#274 - H-Index
H-Index
- Difficulty: Medium
- Topics: Array, Sorting, Counting Sort
- Link: https://leetcode.com/problems/h-index/
Problem Description
Given an array of integers citations
where citations[i]
is the number of citations a researcher received for their ith
paper, return the researcher’s h-index.
According to the definition of h-index on Wikipedia: The h-index is defined as the maximum value of h
such that the given researcher has published at least h
papers that have each been cited at least h
times.
Example 1:
Input: citations = [3,0,6,1,5]
Output: 3
Explanation: [3,0,6,1,5] means the researcher has 5 papers in total and each of them had received 3, 0, 6, 1, 5 citations respectively.
Since the researcher has 3 papers with at least 3 citations each and the remaining two with no more than 3 citations each, their h-index is 3.
Example 2:
Input: citations = [1,3,1]
Output: 1
Constraints:
n == citations.length
1 <= n <= 5000
0 <= citations[i] <= 1000
Solution
1. Problem Deconstruction
Rewrite the Problem
-
Technical Definition:
Given an arraycitations
of lengthn
, where each elementcitations[i]
represents the citation count of thei
-th paper, compute the largest integerh
such that at leasth
papers in the array have citation counts ≥h
. The valueh
must satisfy0 ≤ h ≤ n
. -
Beginner-Friendly Explanation:
Imagine you have a list of how many times each research paper was cited. The h-index is a number that measures both productivity and impact. It is the biggest numberh
where the researcher has at leasth
papers that each have at leasth
citations. For example, ifh=3
, they must have 3 papers with 3 or more citations. -
Mathematical Formulation:
Letc
be the sorted list of citations in descending order. The h-index is defined as:Alternatively, using cumulative counts:
where
\(\mathbb{I}\)
is the indicator function.
Constraint Analysis
n == citations.length
(1 ≤ n ≤ 5000):- Limits solutions to O(n²) at worst (25e6 operations), but O(n log n) or O(n) is preferred.
- Implies edge cases:
n=1
(single paper),n=5000
(large but manageable).
0 ≤ citations[i] ≤ 1000
:- Citations can be zero, reducing possible
h
(e.g., all zeros →h=0
). - Values bounded by 1000 allow counting-based approaches (e.g., bucket sort).
- Citations can be zero, reducing possible
- Hidden Edge Cases:
- All citations zero →
h=0
. - All citations ≥ n →
h=n
. - Mixed values with duplicates (e.g., [1,1,1] →
h=1
). - Citations exceeding
n
(e.g., [1000] forn=1
→h=1
).
- All citations zero →
2. Intuition Scaffolding
Analogies
-
Real-World Metaphor:
Like a school withn
students. Each student’s score is a citation. The h-index is the largest class sizeh
where every student in that class has a score ≥h
. -
Gaming Analogy:
In a game withn
levels, each level has a coin count. The h-index is the maximum number of levels where you collected at least that many coins (e.g., 3 levels with ≥3 coins each). -
Math Analogy:
Finding the largesth
where the cumulative distribution function of citations (in descending order) satisfiesF(h) ≥ h
.
Common Pitfalls
- Assuming h is the median: Sorting and taking the median fails for [1,1,1] (median=1, but h=1).
- Ignoring zero citations: Papers with zero citations reduce the cumulative count for low
h
. - Overcounting high citations: A paper with 1000 citations counts for all
h ≤ 1000
, buth
cannot exceedn
. - Linear scan without sorting: Checking each
h
from 0 to n blindly leads to O(n²) inefficiency. - Misinterpreting sorted order: After sorting ascending, the condition
citations[i] ≥ n-i
must hold forh
.
3. Approach Encyclopedia
Approach 1: Brute Force (Check All h)
- Idea: For each possible
h
fromn
down to0
, count papers with citations ≥h
. Return the firsth
where count ≥h
. - Pseudocode:
n = len(citations) for h in range(n, -1, -1): count = 0 for c in citations: if c >= h: count += 1 if count >= h: return h
- Complexity:
- Time: O(n²) → Worst-case 25e6 operations for n=5000 (acceptable in Pyton/C++ but suboptimal).
- Space: O(1).
- Visualization:
citations = [3,0,6,1,5] Check h=5: Count papers ≥5 → 2 (6,5) → 2 < 5 → skip. Check h=4: Count papers ≥4 → 2 (6,5) → 2 < 4 → skip. Check h=3: Count papers ≥3 → 3 (3,6,5) → 3 ≥ 3 → return 3.
Approach 2: Sorting and Linear Scan
- Idea: Sort citations descending. Find the last index
i
wherecitations[i] ≥ i+1
. - Pseudocode:
citations.sort(reverse=True) for i in range(len(citations)): if citations[i] < i+1: return i return len(citations)
- Complexity:
- Time: O(n log n) → Dominated by sorting.
- Space: O(1) if in-place sort, else O(n).
- Visualization:
Sorted: [6,5,3,1,0] i=0: 6 ≥ 1 → valid. i=1: 5 ≥ 2 → valid. i=2: 3 ≥ 3 → valid. i=3: 1 < 4 → invalid → return i=3.
Approach 3: Counting Sort (Optimal)
- Idea: Use a bucket array
counts
of sizen+1
to tally papers per citation count (capping citations ≥ n at n). Traverse backwards to compute cumulative counts. - Pseudocode:
n = len(citations) counts = [0] * (n+1) for c in citations: if c >= n: counts[n] += 1 else: counts[c] += 1 total = 0 for k in range(n, -1, -1): total += counts[k] if total >= k: return k
- Complexity:
- Time: O(n) → Two passes over array and counts.
- Space: O(n) → Counts array of size n+1.
- Visualization:
citations = [3,0,6,1,5], n=5 counts[0]=1, counts[1]=1, counts[3]=1, counts[5]=2 k=5: total=2 → 2 < 5 → skip. k=4: total=2 → 2 < 4 → skip. k=3: total=3 → 3 ≥ 3 → return 3.
4. Code Deep Dive
Optimal Solution (Counting Method)
def hIndex(citations):
n = len(citations)
counts = [0] * (n + 1) # Step 1: Initialize counts array for indices 0 to n.
# Step 2: Populate counts array.
for c in citations:
if c >= n:
counts[n] += 1 # Cap citations ≥ n at n.
else:
counts[c] += 1 # Count papers per citation value.
total = 0
# Step 3: Traverse from n down to 0 to find largest h.
for k in range(n, -1, -1):
total += counts[k] # Cumulative count of papers with ≥ k citations.
if total >= k: # Condition for h-index.
return k
return 0 # Fallback for all zeros.
Line-by-Line Annotations
Line 2
:n
is the number of papers.Line 3
:counts
array tracks how many papers have exactlyk
citations (orn
if ≥n
).Lines 6-9
: Classify each citation: cap high values atn
to avoid index errors.Line 11
:total
accumulates the number of papers with ≥k
citations.Lines 13-16
: Iterate from highk
to low. The firstk
wheretotal ≥ k
is the h-index.
Edge Case Handling
- Example 2 ([1,3,1]):
n=3
,counts[1]=2
,counts[3]=1
.
k=3
: total=1 < 3 → skip.
k=2
: total=1 < 2 → skip.
k=1
: total=3 ≥ 1 → return 1. - All zeros ([0,0,0]):
counts[0]=3
.
k=3→0
: total=3 atk=0
→ return 0. - Single paper ([1000]):
n=1
,counts[1]=1
.
k=1
: total=1 ≥ 1 → return 1.
5. Complexity War Room
Hardware-Aware Analysis
- The
counts
array size isn+1
≈ 5e3 elements → 20 KB (4 bytes/int). Fits in L1/L2 cache, ensuring fast access. - Two linear passes: O(n) time → Efficient even for worst-case n=5000 (10k operations).
Industry Comparison Table
Approach | Time | Space | Readability | Interview Viability |
---|---|---|---|---|
Brute Force | O(n²) | O(1) | High | ❌ (Fails large n) |
Sorting | O(n log n) | O(1) | High | ✅ (Acceptable) |
Counting | O(n) | O(n) | Medium | ✅ (Optimal) |
6. Pro Mode Extras
Variants Section
- Dynamic h-index: If citations are streamed, use a Fenwick tree or segment tree to update counts and compute h-index in O(log n) per insertion.
- h-index with weights: Each paper has a weight; find
h
such that the sum of weights for papers with citations ≥h
is ≥h
. - Multiple researchers: Compute h-index for a group by merging citation lists.
Interview Cheat Sheet
- First Step: Clarify the h-index definition (e.g., “至少 h 篇论文至少被引用 h 次”).
- Mention Constraints: “Since n is 5000, we can use O(n log n) sorting.”
- Optimize: “Counting method is O(n) and leverages bounded citations.”
- Edge Cases: “Always test with all zeros, all high values, and duplicates.”