← ~/lc-grind/posts

#295 - Find Median from Data Stream

August 1, 2025

Find Median from Data Stream

Problem Description

The median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value, and the median is the mean of the two middle values.

  • For example, for arr = [2,3,4], the median is 3.
  • For example, for arr = [2,3], the median is (2 + 3) / 2 = 2.5.

Implement the MedianFinder class:

  • MedianFinder() initializes the MedianFinder object.
  • void addNum(int num) adds the integer num from the data stream to the data structure.
  • double findMedian() returns the median of all elements so far. Answers within 10-5 of the actual answer will be accepted.

Example 1:


Input
["MedianFinder", "addNum", "addNum", "findMedian", "addNum", "findMedian"]
[[], [1], [2], [], [3], []]
Output
[null, null, null, 1.5, null, 2.0]

Explanation
MedianFinder medianFinder = new MedianFinder();
medianFinder.addNum(1);    // arr = [1]
medianFinder.addNum(2);    // arr = [1, 2]
medianFinder.findMedian(); // return 1.5 (i.e., (1 + 2) / 2)
medianFinder.addNum(3);    // arr[1, 2, 3]
medianFinder.findMedian(); // return 2.0

Constraints:

  • -105 <= num <= 105
  • There will be at least one element in the data structure before calling findMedian.
  • At most 5 * 104 calls will be made to addNum and findMedian.

Follow up:

  • If all integer numbers from the stream are in the range [0, 100], how would you optimize your solution?
  • If 99% of all integer numbers from the stream are in the range [0, 100], how would you optimize your solution?

Solution

1. Problem Deconstruction

Technical Definition
Design a class MedianFinder that supports two operations:

  1. addNum(int num): Inserts integer num into a dynamic dataset.
  2. findMedian(): Returns the median of the dataset as:
    • Middle element (if dataset size (n) is odd)
    • Average of two middle elements (if (n) is even)
      Operations must handle up to (5 \times 10^4) calls efficiently.

Beginner Explanation
Imagine numbers coming one-by-one (like exam scores). After each number, you must instantly report the “middle” score:

  • For odd counts: The center score (e.g., 3rd in 5 scores).
  • For even counts: Average of two center scores (e.g., average of 2nd and 3rd in 4 scores).

Mathematical Formulation
Let (S) be the multiset of numbers. After (k) insertions, (|S| = k). Sort (S) to get ordered list (s_0, s_1, \dots, s_{k-1}):

  • If (k) odd: (\text{median} = s_{\lfloor k/2 \rfloor})
  • If (k) even: (\text{median} = \frac{s_{k/2-1} + s_{k/2}}{2})

Constraint Analysis

  1. (-10^5 \leq \text{num} \leq 10^5):
    • Enables non-comparison optimizations (e.g., counting sort for bounded ranges).
    • Edge: Negative values break naive positive-only assumptions.
  2. At least one element before findMedian:
    • Eliminates empty-state handling; simplifies median logic.
  3. Up to (5 \times 10^4) total calls:
    • Limits solutions to (O(n \log n)) or better; brute-force (O(n^2)) fails ((2.5 \times 10^9) ops).
    • Edge: High call volume stresses memory; requires (O(n)) space.

2. Intuition Scaffolding

Real-World Analogy (Race Timing)
Track runners finish at different times. Maintain two leaderboards:

  • Left board: Fastest half (max-heap for slowest in this group).
  • Right board: Slowest half (min-heap for fastest in this group).
    Median is either the slowest on the left board or average of slowest (left) and fastest (right).

Gaming Analogy (RPG Inventory)
Items with varying power levels stream in. Organize into:

  • Low-tier bag (max-heap): Strongest low-tier item on top.
  • High-tier bag (min-heap): Weakest high-tier item on top.
    Median is the “bridge” item between bags.

Math Analogy (Partitioned Sequence)
Split sorted sequence (S) into subsequences (L) (lower half) and (R) (upper half):

  • (\max(L) \leq \min®)
  • (|L| = |R|) or (|L| = |R| + 1)
    Median = (\begin{cases} \max(L) & \text{if } |S| \text{ odd} \ \frac{\max(L) + \min®}{2} & \text{otherwise} \end{cases})

Common Pitfalls

  1. Global sort after each insertion: (O(n \log n)) per op → (O(n^2 \log n)) → fails at scale.
  2. Binary search + insertion (sorted list): (O(n)) per insertion → (O(n^2)) → too slow.
  3. Single heap: Only tracks min/max, not median.
  4. Unbalanced heaps: Failing to rebalance after insertion corrupts median.
  5. Mixed heap types: Using two min-heaps loses lower-half max.

3. Approach Encyclopedia

Approach 1: Brute-Force Sort

class MedianFinder:  
    def __init__(self):  
        self.nums = []  

    def addNum(self, num):  
        self.nums.append(num)  # O(1)  

    def findMedian(self):  
        self.nums.sort()  # O(n log n)  
        mid = len(self.nums) // 2  
        if len(self.nums) % 2:  
            return self.nums[mid]  
        return (self.nums[mid - 1] + self.nums[mid]) / 2  

Complexity Proof

  • addNum: (O(1)) amortized (list append).
  • findMedian: (O(n \log n)) (sorting).
  • Total for (n) calls: (\sum_{i=1}^n O(i \log i) \approx O(n^2 \log n)). For (n = 5 \times 10^4), (\approx 2.5 \times 10^9 \times 16 = 40 \times 10^9) ops (infeasible).

Visualization

Insertions: [3] → [3,1] → [3,1,2]  
Sorted: [1,2,3] → median = 2  

Approach 2: Insertion Sort

import bisect  
class MedianFinder:  
    def __init__(self):  
        self.nums = []  

    def addNum(self, num):  
        bisect.insort(self.nums, num)  # O(n)  

    def findMedian(self):  # O(1)  
        mid = len(self.nums) // 2  
        if len(self.nums) % 2:  
            return self.nums[mid]  
        return (self.nums[mid - 1] + self.nums[mid]) / 2  

Complexity Proof

  • addNum: (O(n)) (binary search (O(\log n)) + shift (O(n))).
  • Total for (n) calls: (\sum_{i=1}^n O(i) = O(n^2)). For (n=5 \times 10^4), (\approx 1.25 \times 10^9) ops (borderline in C++, slow in Python).

Visualization

Insert 2: [2]  
Insert 1: [1,2] (shift left)  
Insert 3: [1,2,3] → median = 2  

Approach 3: Two Heaps (Optimal)
Data Structures:

  • lo: Max-heap (Python min-heap storing negatives) for lower half.
  • hi: Min-heap for upper half.
    Invariants:
  1. (\max(\text{lo}) \leq \min(\text{hi}))
  2. (|\text{lo}| = |\text{hi}|) or (|\text{lo}| = |\text{hi}| + 1)
import heapq  
class MedianFinder:  
    def __init__(self):  
        self.lo = []  # Max-heap (negatives)  
        self.hi = []  # Min-heap  

    def addNum(self, num):  
        heapq.heappush(self.lo, -num)  # O(log n)  
        heapq.heappush(self.hi, -heapq.heappop(self.lo))  # O(log n)  
        if len(self.hi) > len(self.lo):  # Rebalance  
            heapq.heappush(self.lo, -heapq.heappop(self.hi))  # O(log n)  

    def findMedian(self):  
        if len(self.lo) > len(self.hi):  
            return -self.lo[0]  
        return (-self.lo[0] + self.hi[0]) / 2  

Complexity Proof

  • addNum: 3–4 heap ops at (O(\log n)) each → (O(\log n)).
  • findMedian: (O(1)) (heap top access).
  • Total for (n) calls: (O(n \log n)).

Visualization

Insert 1:  
  lo = [-1] → push to hi: hi=[1] → rebalance: lo=[-1], hi=[]  
Insert 2:  
  lo = [-2,-1] → pop lo: -2 → hi=[2] → sizes equal → median = (1+2)/2 = 1.5  
Insert 3:  
  lo = [-3,-1] → pop lo: -3 → hi=[2,3] → rebalance: pop hi (2) → lo=[-2,-1]  
  median = -lo[0] = 2 (odd)  

4. Code Deep Dive

Optimal Code (Two Heaps)

import heapq  
class MedianFinder:  
    def __init__(self):  
        self.lo = []  # Max-heap: store negatives for lower half  
        self.hi = []  # Min-heap: upper half  

    def addNum(self, num):  
        # Step 1: Push to lo (as negative)  
        heapq.heappush(self.lo, -num)  
        # Step 2: Move largest from lo to hi  
        heapq.heappush(self.hi, -heapq.heappop(self.lo))  
        # Step 3: If hi > lo, move smallest from hi to lo  
        if len(self.hi) > len(self.lo):  
            heapq.heappush(self.lo, -heapq.heappop(self.hi))  

    def findMedian(self):  
        # If lo has extra element, it's median  
        if len(self.lo) > len(self.hi):  
            return -self.lo[0]  
        # Else average of tops  
        return (-self.lo[0] + self.hi[0]) / 2  

Line-by-Line Analysis

  1. __init__: Initializes empty heaps.
  2. addNum:
    • Line 6: Push num to lo as negative (max-heap simulation).
    • Line 8: Pop largest from lo (as positive), push to hi. Ensures (\max(\text{lo}) \leq \min(\text{hi})).
    • Line 10: If hi has more elements, pop smallest from hi and push to lo (as negative). Maintains size invariant.
  3. findMedian:
    • Line 13: For odd (n), lo[0] (negated) is median.
    • Line 15: For even (n), average of -lo[0] and hi[0].

Edge Case Handling

  • Single Element (e.g., [1]):
    addNum(1)lo=[-1] → moved to hi=[1] → rebalanced to lo=[-1], hi=[]findMedian returns 1.
  • All Duplicates (e.g., [2,2,2]):
    After three inserts: lo=[-2,-2], hi=[2] → median = -lo[0] = 2.
  • Descending Order (e.g., [3,2,1]):
    After inserts: lo=[-1,-2], hi=[3] → median = -lo[0] = 2 (correct).

5. Complexity War Room

Hardware-Aware Analysis

  • Memory: Two heaps store (O(n)) integers. At (n=5 \times 10^4), (\approx 5 \times 10^4 \times 28 \text{ bytes} \times 2 = 2.8 \text{ MB}) (fits in CPU L3 cache).
  • Time: (O(\log n)) per insertion → (5 \times 10^4 \times \log_2(5 \times 10^4) \approx 5 \times 10^4 \times 16 = 8 \times 10^5) ops (efficient).

Approach Comparison

Approach Time (addNum) Time (findMedian) Space Readability Interview Viability
Brute-Force (O(1)) (O(n \log n)) (O(n)) 10/10 ❌ Fails at scale
Insertion Sort (O(n)) (O(1)) (O(n)) 9/10 ❌ (O(n^2)) total
Two Heaps (O(\log n)) (O(1)) (O(n)) 8/10 ✅ Optimal

6. Pro Mode Extras

Follow-Up Optimizations

  1. All numbers in [0,100]:

    • Use frequency array freq[0..100] and total count.
    • addNum: Increment freq[num] and count ((O(1))).
    • findMedian: Traverse freq to find (k)-th smallest ((O(101)) per call).
    class MedianFinder:  
        def __init__(self):  
            self.freq = [0] * 101  
            self.n = 0  
    
        def addNum(self, num):  
            self.freq[num] += 1  
            self.n += 1  
    
        def findMedian(self):  
            mid = self.n // 2  
            if self.n % 2:  
                return self._find_kth(mid + 1)  
            return (self._find_kth(mid) + self._find_kth(mid + 1)) / 2  
    
        def _find_kth(self, k):  
            count = 0  
            for i in range(101):  
                count += self.freq[i]  
                if count >= k:  
                    return i  
    

    Complexity: (O(1)) per addNum, (O(1)) per findMedian (fixed 101 steps).

  2. 99% of numbers in [0,100]:

    • Store in-range numbers in freq[0..100], outliers in sorted lists under (numbers < 0) and over (numbers > 100).
    • addNum: Use bisect.insort for outliers ((O(\log m)), (m) = outlier count).
    • findMedian: Traverse underfreqover for (k)-th element.
    import bisect  
    class MedianFinder:  
        def __init__(self):  
            self.under = []  # Sorted negatives  
            self.over = []   # Sorted >100  
            self.freq = [0] * 101  
            self.n = 0  
    
        def addNum(self, num):  
            if num < 0:  
                bisect.insort(self.under, num)  
            elif num > 100:  
                bisect.insort(self.over, num)  
            else:  
                self.freq[num] += 1  
            self.n += 1  
    
        def findMedian(self):  
            if self.n % 2:  
                return self._find_kth(self.n // 2 + 1)  
            k1 = self._find_kth(self.n // 2)  
            k2 = self._find_kth(self.n // 2 + 1)  
            return (k1 + k2) / 2  
    
        def _find_kth(self, k):  
            # Check under  
            if k <= len(self.under):  
                return self.under[k - 1]  
            # Check freq  
            count = len(self.under)  
            for i in range(101):  
                count += self.freq[i]  
                if count >= k:  
                    return i  
            # Check over  
            return self.over[k - count - 1]  
    

    Complexity:

    • addNum: (O(\log m)) for outliers ((m \leq 500)), (O(1)) for in-range.
    • findMedian: (O(101)) per call.

Variant: Sliding Window Median (LeetCode 480)

  • Challenge: Support removals from the dataset.
  • Solution: Use two heaps with lazy deletion + hash table for pending deletions. Rebalance heaps when top elements are marked for deletion.

Interview Cheat Sheet

  • Key Insight: Two heaps for dynamic median (max-heap + min-heap with size balancing).
  • Complexity: (O(\log n)) insert, (O(1)) median.
  • Edge Cases: Duplicates, negatives, large (n).
  • Optimizations: Frequency arrays for bounded ranges.
  • Mention First: Time/space complexity and invariants before coding.