Monitoring and Observability: Logs, Metrics, Alerts, and Tracing
Core Concept
- Distributed systems run 24/7 → failures are inevitable
- You must:
- Detect problems before users
- Understand system behavior quickly
Observability = ability to debug from outside
Four Pillars (VERY IMPORTANT)
1. Logging
- Records individual events
- Helps debug specific requests
Key practices:
- Include:
- Timestamp
- Request ID / user ID
- Use log levels:
- Debug, Info, Warning, Error, Fatal
Insight:
- Log errors + unusual behavior, not everything
2. Metrics
- Aggregate view of system over time
- Stored as time series data
Types:
- Counts → total requests
- Rates → requests/sec
- Histograms → latency distribution
- Values → CPU, memory
Most important metrics:
- Request rate
- Error rate
- Latency (especially p95/p99)
3. Alerting
- Notifies you when something is wrong
Key idea:
Alerts = your SLO definition
Examples:
- Error rate > 1%
- Latency > threshold
Important:
- Too many alerts → alert fatigue
- Too few alerts → missed outages
Balance is critical
Advanced:
- Anomaly detection
- Detects unusual patterns
- Useful for partial outages
4. Tracing (VERY IMPORTANT)
- Tracks a single request across services
How:
- Use correlation ID
- Pass it across services
Enables:
- End-to-end debugging
- Finding bottlenecks
Logging Best Practices
- Avoid log spam
- Use:
- Structured logs
- Context (request ID)
- Enable dynamic debug logging
Log what you’ll wish you had during debugging
Metrics Best Practices
Request Monitoring
Track:
- Request count
- Error codes (200, 500, etc.)
- Latency (histogram)
Use labels:
- Response code
- Endpoint
Advanced Metrics
- Request size
- Queue time vs processing time
Helps identify:
- Scaling issues vs code inefficiency
Pull vs Push
- Pull (Prometheus scraping) → long-running services
- Push (Push gateway) → batch jobs
Alerting Insights
- Alerts should reflect:
- User experience
- Continuous tuning required
Bad alerts:
- Noisy
- Irrelevant
Good alerts:
- Actionable
- Rare but meaningful
Tracing Insights
- Distributed systems = many services
- Without tracing:
- You see fragments
With tracing:
- You see complete request journey
Use tools like:
- OpenTelemetry
Aggregation & Storage
Why needed:
- Massive data volume
Techniques:
1. Log aggregation
- Combine logs across services
- Tools:
- Elasticsearch
- ktail
2. Downsampling
- Reduce data:
- Lower frequency
- Average values
3. Tiered storage
- Hot storage → fast access
- Cold storage → cheap, slow
Key Insights
- Observability ≠ just logs
- Need:
- Logs + Metrics + Alerts + Tracing
Together they give:
- Local + global + end-to-end view
Trade-offs
Pros
- Faster debugging
- Better reliability
- Proactive issue detection
Cons
- High data volume
- Cost of storage/infra
- Complexity in setup
One-line Summary
Monitoring and observability combine logs, metrics, alerting, and tracing to detect, understand, and debug issues in distributed systems before users are impacted.
#Distributed Systems #System Design #Observability #Monitoring #Logging #Prometheus #Tracing