#295 - Find Median from Data Stream
Find Median from Data Stream
- Difficulty: Hard
- Topics: Two Pointers, Design, Sorting, Heap (Priority Queue), Data Stream
- Link: https://leetcode.com/problems/find-median-from-data-stream/
Problem Description
The median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value, and the median is the mean of the two middle values.
- For example, for
arr = [2,3,4]
, the median is3
. - For example, for
arr = [2,3]
, the median is(2 + 3) / 2 = 2.5
.
Implement the MedianFinder class:
MedianFinder()
initializes theMedianFinder
object.void addNum(int num)
adds the integernum
from the data stream to the data structure.double findMedian()
returns the median of all elements so far. Answers within10-5
of the actual answer will be accepted.
Example 1:
Input
["MedianFinder", "addNum", "addNum", "findMedian", "addNum", "findMedian"]
[[], [1], [2], [], [3], []]
Output
[null, null, null, 1.5, null, 2.0]
Explanation
MedianFinder medianFinder = new MedianFinder();
medianFinder.addNum(1); // arr = [1]
medianFinder.addNum(2); // arr = [1, 2]
medianFinder.findMedian(); // return 1.5 (i.e., (1 + 2) / 2)
medianFinder.addNum(3); // arr[1, 2, 3]
medianFinder.findMedian(); // return 2.0
Constraints:
-105 <= num <= 105
- There will be at least one element in the data structure before calling
findMedian
. - At most
5 * 104
calls will be made toaddNum
andfindMedian
.
Follow up:
- If all integer numbers from the stream are in the range
[0, 100]
, how would you optimize your solution? - If
99%
of all integer numbers from the stream are in the range[0, 100]
, how would you optimize your solution?
Solution
1. Problem Deconstruction
Technical Definition
Design a class MedianFinder
that supports two operations:
addNum(int num)
: Inserts integernum
into a dynamic dataset.findMedian()
: Returns the median of the dataset as:- Middle element (if dataset size (n) is odd)
- Average of two middle elements (if (n) is even)
Operations must handle up to (5 \times 10^4) calls efficiently.
Beginner Explanation
Imagine numbers coming one-by-one (like exam scores). After each number, you must instantly report the “middle” score:
- For odd counts: The center score (e.g., 3rd in 5 scores).
- For even counts: Average of two center scores (e.g., average of 2nd and 3rd in 4 scores).
Mathematical Formulation
Let (S) be the multiset of numbers. After (k) insertions, (|S| = k). Sort (S) to get ordered list (s_0, s_1, \dots, s_{k-1}):
- If (k) odd: (\text{median} = s_{\lfloor k/2 \rfloor})
- If (k) even: (\text{median} = \frac{s_{k/2-1} + s_{k/2}}{2})
Constraint Analysis
- (-10^5 \leq \text{num} \leq 10^5):
- Enables non-comparison optimizations (e.g., counting sort for bounded ranges).
- Edge: Negative values break naive positive-only assumptions.
- At least one element before
findMedian
:- Eliminates empty-state handling; simplifies median logic.
- Up to (5 \times 10^4) total calls:
- Limits solutions to (O(n \log n)) or better; brute-force (O(n^2)) fails ((2.5 \times 10^9) ops).
- Edge: High call volume stresses memory; requires (O(n)) space.
2. Intuition Scaffolding
Real-World Analogy (Race Timing)
Track runners finish at different times. Maintain two leaderboards:
- Left board: Fastest half (max-heap for slowest in this group).
- Right board: Slowest half (min-heap for fastest in this group).
Median is either the slowest on the left board or average of slowest (left) and fastest (right).
Gaming Analogy (RPG Inventory)
Items with varying power levels stream in. Organize into:
- Low-tier bag (max-heap): Strongest low-tier item on top.
- High-tier bag (min-heap): Weakest high-tier item on top.
Median is the “bridge” item between bags.
Math Analogy (Partitioned Sequence)
Split sorted sequence (S) into subsequences (L) (lower half) and (R) (upper half):
- (\max(L) \leq \min®)
- (|L| = |R|) or (|L| = |R| + 1)
Median = (\begin{cases} \max(L) & \text{if } |S| \text{ odd} \ \frac{\max(L) + \min®}{2} & \text{otherwise} \end{cases})
Common Pitfalls
- Global sort after each insertion: (O(n \log n)) per op → (O(n^2 \log n)) → fails at scale.
- Binary search + insertion (sorted list): (O(n)) per insertion → (O(n^2)) → too slow.
- Single heap: Only tracks min/max, not median.
- Unbalanced heaps: Failing to rebalance after insertion corrupts median.
- Mixed heap types: Using two min-heaps loses lower-half max.
3. Approach Encyclopedia
Approach 1: Brute-Force Sort
class MedianFinder:
def __init__(self):
self.nums = []
def addNum(self, num):
self.nums.append(num) # O(1)
def findMedian(self):
self.nums.sort() # O(n log n)
mid = len(self.nums) // 2
if len(self.nums) % 2:
return self.nums[mid]
return (self.nums[mid - 1] + self.nums[mid]) / 2
Complexity Proof
- addNum: (O(1)) amortized (list append).
- findMedian: (O(n \log n)) (sorting).
- Total for (n) calls: (\sum_{i=1}^n O(i \log i) \approx O(n^2 \log n)). For (n = 5 \times 10^4), (\approx 2.5 \times 10^9 \times 16 = 40 \times 10^9) ops (infeasible).
Visualization
Insertions: [3] → [3,1] → [3,1,2]
Sorted: [1,2,3] → median = 2
Approach 2: Insertion Sort
import bisect
class MedianFinder:
def __init__(self):
self.nums = []
def addNum(self, num):
bisect.insort(self.nums, num) # O(n)
def findMedian(self): # O(1)
mid = len(self.nums) // 2
if len(self.nums) % 2:
return self.nums[mid]
return (self.nums[mid - 1] + self.nums[mid]) / 2
Complexity Proof
- addNum: (O(n)) (binary search (O(\log n)) + shift (O(n))).
- Total for (n) calls: (\sum_{i=1}^n O(i) = O(n^2)). For (n=5 \times 10^4), (\approx 1.25 \times 10^9) ops (borderline in C++, slow in Python).
Visualization
Insert 2: [2]
Insert 1: [1,2] (shift left)
Insert 3: [1,2,3] → median = 2
Approach 3: Two Heaps (Optimal)
Data Structures:
lo
: Max-heap (Python min-heap storing negatives) for lower half.hi
: Min-heap for upper half.
Invariants:
- (\max(\text{lo}) \leq \min(\text{hi}))
- (|\text{lo}| = |\text{hi}|) or (|\text{lo}| = |\text{hi}| + 1)
import heapq
class MedianFinder:
def __init__(self):
self.lo = [] # Max-heap (negatives)
self.hi = [] # Min-heap
def addNum(self, num):
heapq.heappush(self.lo, -num) # O(log n)
heapq.heappush(self.hi, -heapq.heappop(self.lo)) # O(log n)
if len(self.hi) > len(self.lo): # Rebalance
heapq.heappush(self.lo, -heapq.heappop(self.hi)) # O(log n)
def findMedian(self):
if len(self.lo) > len(self.hi):
return -self.lo[0]
return (-self.lo[0] + self.hi[0]) / 2
Complexity Proof
- addNum: 3–4 heap ops at (O(\log n)) each → (O(\log n)).
- findMedian: (O(1)) (heap top access).
- Total for (n) calls: (O(n \log n)).
Visualization
Insert 1:
lo = [-1] → push to hi: hi=[1] → rebalance: lo=[-1], hi=[]
Insert 2:
lo = [-2,-1] → pop lo: -2 → hi=[2] → sizes equal → median = (1+2)/2 = 1.5
Insert 3:
lo = [-3,-1] → pop lo: -3 → hi=[2,3] → rebalance: pop hi (2) → lo=[-2,-1]
median = -lo[0] = 2 (odd)
4. Code Deep Dive
Optimal Code (Two Heaps)
import heapq
class MedianFinder:
def __init__(self):
self.lo = [] # Max-heap: store negatives for lower half
self.hi = [] # Min-heap: upper half
def addNum(self, num):
# Step 1: Push to lo (as negative)
heapq.heappush(self.lo, -num)
# Step 2: Move largest from lo to hi
heapq.heappush(self.hi, -heapq.heappop(self.lo))
# Step 3: If hi > lo, move smallest from hi to lo
if len(self.hi) > len(self.lo):
heapq.heappush(self.lo, -heapq.heappop(self.hi))
def findMedian(self):
# If lo has extra element, it's median
if len(self.lo) > len(self.hi):
return -self.lo[0]
# Else average of tops
return (-self.lo[0] + self.hi[0]) / 2
Line-by-Line Analysis
__init__
: Initializes empty heaps.addNum
:- Line 6: Push
num
tolo
as negative (max-heap simulation). - Line 8: Pop largest from
lo
(as positive), push tohi
. Ensures (\max(\text{lo}) \leq \min(\text{hi})). - Line 10: If
hi
has more elements, pop smallest fromhi
and push tolo
(as negative). Maintains size invariant.
- Line 6: Push
findMedian
:- Line 13: For odd (n),
lo[0]
(negated) is median. - Line 15: For even (n), average of
-lo[0]
andhi[0]
.
- Line 13: For odd (n),
Edge Case Handling
- Single Element (e.g., [1]):
addNum(1)
→lo=[-1]
→ moved tohi=[1]
→ rebalanced tolo=[-1]
,hi=[]
→findMedian
returns1
. - All Duplicates (e.g., [2,2,2]):
After three inserts:lo=[-2,-2]
,hi=[2]
→ median =-lo[0] = 2
. - Descending Order (e.g., [3,2,1]):
After inserts:lo=[-1,-2]
,hi=[3]
→ median =-lo[0] = 2
(correct).
5. Complexity War Room
Hardware-Aware Analysis
- Memory: Two heaps store (O(n)) integers. At (n=5 \times 10^4), (\approx 5 \times 10^4 \times 28 \text{ bytes} \times 2 = 2.8 \text{ MB}) (fits in CPU L3 cache).
- Time: (O(\log n)) per insertion → (5 \times 10^4 \times \log_2(5 \times 10^4) \approx 5 \times 10^4 \times 16 = 8 \times 10^5) ops (efficient).
Approach Comparison
Approach | Time (addNum) | Time (findMedian) | Space | Readability | Interview Viability |
---|---|---|---|---|---|
Brute-Force | (O(1)) | (O(n \log n)) | (O(n)) | 10/10 | ❌ Fails at scale |
Insertion Sort | (O(n)) | (O(1)) | (O(n)) | 9/10 | ❌ (O(n^2)) total |
Two Heaps | (O(\log n)) | (O(1)) | (O(n)) | 8/10 | ✅ Optimal |
6. Pro Mode Extras
Follow-Up Optimizations
-
All numbers in [0,100]:
- Use frequency array
freq[0..100]
and total count. - addNum: Increment
freq[num]
and count ((O(1))). - findMedian: Traverse
freq
to find (k)-th smallest ((O(101)) per call).
class MedianFinder: def __init__(self): self.freq = [0] * 101 self.n = 0 def addNum(self, num): self.freq[num] += 1 self.n += 1 def findMedian(self): mid = self.n // 2 if self.n % 2: return self._find_kth(mid + 1) return (self._find_kth(mid) + self._find_kth(mid + 1)) / 2 def _find_kth(self, k): count = 0 for i in range(101): count += self.freq[i] if count >= k: return i
Complexity: (O(1)) per
addNum
, (O(1)) perfindMedian
(fixed 101 steps). - Use frequency array
-
99% of numbers in [0,100]:
- Store in-range numbers in
freq[0..100]
, outliers in sorted listsunder
(numbers < 0) andover
(numbers > 100). - addNum: Use
bisect.insort
for outliers ((O(\log m)), (m) = outlier count). - findMedian: Traverse
under
→freq
→over
for (k)-th element.
import bisect class MedianFinder: def __init__(self): self.under = [] # Sorted negatives self.over = [] # Sorted >100 self.freq = [0] * 101 self.n = 0 def addNum(self, num): if num < 0: bisect.insort(self.under, num) elif num > 100: bisect.insort(self.over, num) else: self.freq[num] += 1 self.n += 1 def findMedian(self): if self.n % 2: return self._find_kth(self.n // 2 + 1) k1 = self._find_kth(self.n // 2) k2 = self._find_kth(self.n // 2 + 1) return (k1 + k2) / 2 def _find_kth(self, k): # Check under if k <= len(self.under): return self.under[k - 1] # Check freq count = len(self.under) for i in range(101): count += self.freq[i] if count >= k: return i # Check over return self.over[k - count - 1]
Complexity:
- addNum: (O(\log m)) for outliers ((m \leq 500)), (O(1)) for in-range.
- findMedian: (O(101)) per call.
- Store in-range numbers in
Variant: Sliding Window Median (LeetCode 480)
- Challenge: Support removals from the dataset.
- Solution: Use two heaps with lazy deletion + hash table for pending deletions. Rebalance heaps when top elements are marked for deletion.
Interview Cheat Sheet
- Key Insight: Two heaps for dynamic median (max-heap + min-heap with size balancing).
- Complexity: (O(\log n)) insert, (O(1)) median.
- Edge Cases: Duplicates, negatives, large (n).
- Optimizations: Frequency arrays for bounded ranges.
- Mention First: Time/space complexity and invariants before coding.