Coordinated Batch Processing: Combining Parallel Work with Join and Reduce
Core Concept
- Focus: bringing parallel work back together
- Opposite of earlier chapters (which split work)
- Used for:
- Aggregation
- Synchronization
- Final results
Think:
“Do work in parallel → then coordinate + combine results”
Why Coordination Is Needed
- Previous patterns:
- Sharding → split work
- Work queues → process independently
But:
- Some tasks need complete dataset
- Some need aggregation (sum, stats)
Need coordination primitives
Core Patterns
1. Join (Barrier Synchronization)
What it does:
- Waits until ALL parallel work is done
- Then releases results
Example:
- Video rendering:
- Need all frames before creating video
Key properties:
- Ensures completeness
- Blocks until everything finishes
Trade-off:
- Slower (less parallelism)
- Correct & complete results
2. Reduce (VERY IMPORTANT)
What it does:
- Incrementally combines outputs
- Reduces many → one
Example: Word Count
Input:
a: 50
the: 17
a: 30
the: 25
Output:
a: 80
the: 42
Key insight:
- Can start before all data is ready
- Works in parallel
Faster than join
Join vs Reduce
| Feature | Join | Reduce |
|---|---|---|
| Waits for all data | Yes | No |
| Parallelism | Low | High |
| Use case | Completeness | Aggregation |
3. Sum (special case of Reduce)
- Add values across dataset
Example:
- Population calculation
(Seattle, 4M) + (Town, 25k) → (Combined, 4.025M)
4. Histogram (advanced Reduce)
- Merge distributions
Example:
0 kids: 15%
1 kid: 25%
Combine using:
- Weighted averages (based on population)
Key Insight
Reduce is composable → can run multiple times in parallel
- Tree-like reduction:
[Many outputs] → [Partial reduces] → [Final result]
Real Pipeline Example
Image Processing Workflow
Steps:
- Shard
- Distribute images across workers
- Multiworker
- Detect plate
- Blur plate
- Join
- Wait for ALL images processed
- Copier
- Duplicate work:
- Delete originals
- Analyze vehicles
- Duplicate work:
- Shard again
- Distribute analysis
- Reduce
- Aggregate:
- Vehicle counts
- Colors
- Aggregate:
Example Output
{
"vehicles": {
"car": 12,
"truck": 7,
"motorcycle": 4
},
"colors": {
"white": 8,
"black": 3,
"blue": 6,
"red": 6
}
}
Trade-offs
Join
- Correctness
- High latency
Reduce
- Fast & parallel
- More complex logic
Mental Model
[ Sharded Work ]
↓
(Join OR Reduce)
↓
[ Final Output ]
Key Insights
- Not all workflows are independent → need coordination
- Join = correctness guarantee
- Reduce = performance optimization
- Combine both for real systems
One-line Summary
Coordinated batch processing introduces join (wait for all) and reduce (incremental aggregation) patterns to safely combine parallel work into final results.
#Distributed Systems #System Design #Batch Processing #MapReduce #Aggregation