Monitoring and Observability: Logs, Metrics, Alerts, and Tracing

14 Mar, 2026

Core Concept

Distributed systems run 24/7 → failures are inevitable
You must:
- Detect problems before users do
- Understand system behavior quickly

Observability = ability to understand a system's internal state from its external outputs (logs, metrics, traces)

Four Pillars (VERY IMPORTANT)

1. Logging

Records individual events
Helps debug specific requests

Key practices:

Include:
- Timestamp
- Request ID / user ID
Use log levels:
- Debug, Info, Warning, Error, Fatal

Insight:

Log errors + unusual behavior, not everything
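
A minimal sketch of these practices using Python's standard `logging` module. The logger name `checkout`, the `request_id` field, and the `handle_request` function are illustrative assumptions, not part of any particular service.

```python
import io
import logging

# Capture output in memory so the example is self-contained; a real service
# would more likely log to stdout or a file.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
# %(asctime)s adds the timestamp; request_id is supplied per call via `extra`.
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s request_id=%(request_id)s %(message)s"
))

logger = logging.getLogger("checkout")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)  # DEBUG lines are dropped at this level
logger.propagate = False


def handle_request(request_id: str, amount: float) -> bool:
    ctx = {"request_id": request_id}
    logger.debug("validating payload", extra=ctx)  # suppressed at INFO
    if amount <= 0:
        # Unusual behavior: worth an ERROR line with enough context to debug.
        logger.error("rejected: non-positive amount %s", amount, extra=ctx)
        return False
    logger.info("payment accepted", extra=ctx)
    return True


handle_request("req-123", 25.0)
handle_request("req-124", -1.0)
log_output = stream.getvalue()
```

Every line carries a timestamp, a level, and the request ID, so one request can be isolated later by searching for its ID.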

2. Metrics

Aggregate view of system over time
Stored as time series data

Types:

Counts → total requests
Rates → requests/sec
Histograms → latency distribution
Gauges (current values that go up and down) → CPU, memory

Most important metrics:

Request rate
Error rate
Latency (especially p95/p99)
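
Why p95/p99 rather than the average: a small share of slow requests barely moves the mean but dominates the tail. The sketch below uses the nearest-rank percentile definition on made-up latency samples; the numbers are assumptions chosen only for illustration.

```python
import math
import statistics

# Hypothetical latency samples in milliseconds: mostly fast, a few very slow.
latencies_ms = [20] * 94 + [900] * 6


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)
```

Here the median stays at the fast value and the mean lands well below the slow value, while p95 and p99 expose the slow requests that some users actually experience.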

3. Alerting

Notifies you when something is wrong

Key idea:

Alerts should be derived from your SLOs (alert when an objective is at risk)

Examples:

Error rate > 1%
Latency (e.g. p99) > SLO threshold

Important:

Too many alerts → alert fatigue
Too few alerts → missed outages

Balance is critical
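
One common way to strike that balance is to alert only when a threshold is breached for several consecutive windows, so a single blip does not page anyone. A minimal sketch, assuming a 1% error-rate threshold and a hypothetical `Window` record of per-interval counts:

```python
from dataclasses import dataclass


@dataclass
class Window:
    requests: int
    errors: int


def error_rate(w: Window) -> float:
    return w.errors / w.requests if w.requests else 0.0


def should_alert(windows, threshold=0.01, sustained=3):
    """Fire only if the last `sustained` windows all exceed the threshold."""
    recent = windows[-sustained:]
    return len(recent) == sustained and all(
        error_rate(w) > threshold for w in recent
    )


# One bad window surrounded by healthy ones: treated as noise.
blip = [Window(1000, 2), Window(1000, 50), Window(1000, 3)]
# Three bad windows in a row: treated as a real problem.
outage = [Window(1000, 2), Window(1000, 30), Window(1000, 40), Window(1000, 55)]

blip_fires = should_alert(blip)
outage_fires = should_alert(outage)
```

Raising `sustained` reduces noise but delays detection, which is the same trade-off in miniature.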

Advanced:

Anomaly detection
- Detects unusual patterns
- Useful for partial outages
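
A simple form of anomaly detection is a z-score check: compare the current value with the recent mean and flag it if it is many standard deviations away. This is only one possible technique, and the baseline values and the threshold of 3 are assumptions for illustration.

```python
import statistics


def is_anomalous(history, current, z_threshold=3.0):
    """Flag `current` if it deviates strongly from the recent baseline."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        # Perfectly flat baseline: any change at all is unusual.
        return current != mean
    return abs(current - mean) / stdev > z_threshold


# Hypothetical requests/sec over recent intervals.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]

normal = is_anomalous(baseline, 101)   # within usual variation
partial = is_anomalous(baseline, 60)   # traffic dropped by ~40%
```

A fixed rule such as "alert when traffic is zero" would miss the second case, which is why this style of check helps with partial outages.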

4. Tracing (VERY IMPORTANT)

Tracks a single request across services

How:

Use correlation ID
Pass it across services

Enables:

End-to-end debugging
Finding bottlenecks
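
A sketch of correlation ID propagation between two services, simulated as plain functions. The header name `X-Correlation-ID`, the service names, and the in-memory `trace_log` are assumptions; real systems would pass the ID in HTTP or RPC metadata.

```python
import uuid

HEADER = "X-Correlation-ID"  # assumed header name; conventions vary

trace_log = []  # stands in for the log output of every service


def inventory_service(headers: dict) -> str:
    # Reuse the caller's ID so this service logs under the same request.
    cid = headers.get(HEADER) or str(uuid.uuid4())
    trace_log.append((cid, "inventory", "reserved item"))
    return "reserved"


def order_service(headers: dict) -> str:
    # Entry point: reuse an incoming ID if present, otherwise mint one.
    cid = headers.get(HEADER) or str(uuid.uuid4())
    trace_log.append((cid, "orders", "order received"))
    # Pass the same ID downstream.
    result = inventory_service({HEADER: cid})
    trace_log.append((cid, "orders", f"inventory said: {result}"))
    return cid


cid = order_service({})
journey = [(svc, msg) for c, svc, msg in trace_log if c == cid]
```

Filtering by the one ID reconstructs the request's path across both services in order.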

Logging Best Practices

Avoid log spam
Use:
- Structured logs
- Context (request ID)
Enable dynamic debug logging (raise the log level at runtime, without redeploying)

Log what you’ll wish you had during debugging
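
A sketch of structured (JSON) logs with request context, plus switching debug logging on at runtime. The `JsonFormatter` class, the `payments` logger name, and the field names are illustrative assumptions; how the level change is triggered (admin endpoint, config reload, signal) is left open.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        })


buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("payments")  # hypothetical service name
log.addHandler(handler)
log.propagate = False
log.setLevel(logging.INFO)

log.debug("cache miss", extra={"request_id": "req-1"})  # dropped at INFO

log.setLevel(logging.DEBUG)  # turned on while investigating an issue
log.debug("cache miss", extra={"request_id": "req-2"})  # now recorded
log.setLevel(logging.INFO)   # turned back off to avoid log spam

records = [json.loads(line) for line in buf.getvalue().splitlines()]
```

Because each line is a JSON object, an aggregation tool can filter on `request_id` or `level` as fields instead of parsing free text.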

Metrics Best Practices

Request Monitoring

Track:

Request count
Response status codes (200, 500, etc.)
Latency (histogram)

Use labels:

Response code
Endpoint
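
A toy, standard-library model of labeled request metrics: a counter keyed by endpoint and response code, and a latency histogram per endpoint. The bucket bounds and the `observe` helper are assumptions; a real service would normally use a metrics client library rather than this hand-rolled structure.

```python
import bisect
from collections import Counter, defaultdict

BUCKETS_MS = [50, 100, 250, 500, 1000]  # assumed bucket upper bounds

request_count = Counter()  # keyed by (endpoint, status_code)
# One list of bucket counts per endpoint; the last slot is the overflow bucket.
latency_buckets = defaultdict(lambda: [0] * (len(BUCKETS_MS) + 1))


def observe(endpoint, status, latency_ms):
    request_count[(endpoint, status)] += 1
    # bisect_left finds the first bucket whose upper bound is >= latency_ms.
    idx = bisect.bisect_left(BUCKETS_MS, latency_ms)
    latency_buckets[endpoint][idx] += 1


def error_ratio(endpoint):
    total = sum(n for (ep, _), n in request_count.items() if ep == endpoint)
    errors = sum(
        n for (ep, code), n in request_count.items()
        if ep == endpoint and code >= 500
    )
    return errors / total if total else 0.0


observe("/login", 200, 40)
observe("/login", 200, 120)
observe("/login", 500, 1500)
observe("/search", 200, 75)

login_error_ratio = error_ratio("/login")
```

The labels are what make questions like "error ratio for /login only" answerable without adding a new metric per endpoint.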

Advanced Metrics

Request size
Queue time vs processing time

Helps identify:

Scaling issues vs code inefficiency
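
A sketch of how separating the two timings points at the cause. The `RequestTiming` record and the `diagnose` rule are hypothetical, and the rule is a rough heuristic rather than a definitive test.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    enqueued_at: float  # request arrived and started waiting
    started_at: float   # a worker picked it up
    finished_at: float  # the handler returned

    @property
    def queue_time(self) -> float:
        return self.started_at - self.enqueued_at

    @property
    def processing_time(self) -> float:
        return self.finished_at - self.started_at


def diagnose(timings) -> str:
    avg_queue = sum(t.queue_time for t in timings) / len(timings)
    avg_proc = sum(t.processing_time for t in timings) / len(timings)
    # Mostly waiting: not enough workers. Mostly working: slow handler code.
    return "scaling" if avg_queue > avg_proc else "code"


# Requests wait 2s for a worker, then finish in 0.25s.
saturated = [RequestTiming(0.0, 2.0, 2.25), RequestTiming(1.0, 3.0, 3.25)]
# Requests start almost immediately, but the handler takes ~2s.
slow_handler = [RequestTiming(0.0, 0.125, 2.0), RequestTiming(1.0, 1.125, 3.0)]

saturated_cause = diagnose(saturated)
slow_handler_cause = diagnose(slow_handler)
```

Both workloads show similar total latency, which is why total latency alone cannot tell the two problems apart.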

Pull vs Push

Pull (Prometheus scraping) → long-running services
Push (Prometheus Pushgateway) → short-lived batch jobs that may exit before they can be scraped
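
A toy in-process model of the two flows. The `Metrics`, `Collector`, and `Gateway` classes are hypothetical stand-ins to show who initiates the transfer; they are not the Prometheus client API.

```python
class Metrics:
    """Minimal in-memory counter store."""

    def __init__(self):
        self.values = {}

    def inc(self, name, by=1):
        self.values[name] = self.values.get(name, 0) + by

    def snapshot(self):
        return dict(self.values)


class Gateway:
    """Holds the last pushed values so they outlive the job that sent them."""

    def __init__(self):
        self._held = Metrics()

    def push(self, metrics):
        self._held.values.update(metrics.snapshot())

    def snapshot(self):
        return self._held.snapshot()


class Collector:
    """Pull side: reads whatever each target currently exposes."""

    def __init__(self):
        self.store = {}

    def scrape(self, job, target):
        self.store[job] = target.snapshot()


collector = Collector()

# Pull: a long-running service stays up, so the collector scrapes it directly.
api_metrics = Metrics()
api_metrics.inc("http_requests_total", 3)
collector.scrape("api", api_metrics)

# Push: a batch job finishes quickly, so it pushes before exiting.
gateway = Gateway()
job_metrics = Metrics()
job_metrics.inc("rows_processed_total", 500)
gateway.push(job_metrics)
del job_metrics  # the job is gone; the gateway still holds its numbers
collector.scrape("nightly_batch", gateway)
```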

Alerting Insights

Alerts should reflect:
- User experience (symptoms users feel, such as errors and slow responses)
Thresholds need continuous tuning as traffic and the system change

Bad alerts:

Noisy
Irrelevant

Good alerts:

Actionable
Rare but meaningful

Tracing Insights

Distributed systems = many services
Without tracing:
- You see fragments

With tracing:

You see the complete request journey

Use tools like:

OpenTelemetry
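
To make "complete request journey" concrete, the sketch below rebuilds a call tree from span records and picks the span with the most self time as the likely bottleneck. The `Span` record and its fields are a simplified assumption, not the OpenTelemetry data model or API, and the self-time calculation assumes child spans do not overlap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration(self) -> int:
        return self.end_ms - self.start_ms


# Hypothetical spans collected for one request.
spans = [
    Span("a", None, "gateway", 0, 480),
    Span("b", "a", "orders", 10, 470),
    Span("c", "b", "inventory", 20, 60),
    Span("d", "b", "payments", 70, 450),
]


def children_of(span, all_spans):
    kids = [s for s in all_spans if s.parent_id == span.span_id]
    return sorted(kids, key=lambda s: s.start_ms)


def self_time(span, all_spans):
    # Time spent in this span itself, excluding time spent in its children.
    return span.duration - sum(c.duration for c in children_of(span, all_spans))


def render(span, all_spans, depth=0):
    lines = [f"{'  ' * depth}{span.service} {span.duration}ms"]
    for child in children_of(span, all_spans):
        lines += render(child, all_spans, depth + 1)
    return lines


root = next(s for s in spans if s.parent_id is None)
journey = render(root, spans)
bottleneck = max(spans, key=lambda s: self_time(s, spans))
```

Looking at total duration alone would blame `gateway`; self time shows that most of the request was spent inside `payments`.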

Aggregation & Storage

Why needed:

Massive data volume

Techniques:

1. Log aggregation

Combine logs across services
Tools:
- Elasticsearch
- ktail
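
What aggregation buys you, in miniature: per-service logs merged into one timeline that can be filtered by request ID. The log tuples and service names below are made up for illustration; the tools above do this at scale with indexing and search.

```python
import heapq

# Each entry: (timestamp, service, request_id, message).
# Each service's log is already sorted by time.
orders_log = [
    (1.0, "orders", "req-1", "order received"),
    (4.0, "orders", "req-1", "order confirmed"),
]
payments_log = [
    (2.0, "payments", "req-1", "charge started"),
    (2.5, "payments", "req-2", "charge started"),
    (3.0, "payments", "req-1", "charge ok"),
]

# heapq.merge lazily interleaves sorted inputs into one sorted stream.
merged = list(heapq.merge(orders_log, payments_log, key=lambda e: e[0]))

# One request's story across services, in time order.
req1 = [(svc, msg) for _, svc, rid, msg in merged if rid == "req-1"]
```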

2. Downsampling

Reduce data:
- Lower frequency
- Average values
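
A minimal downsampling sketch: group raw points into fixed time buckets and keep the average of each. The sample data and the 60-second bucket size are assumptions.

```python
def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-size time buckets."""
    buckets = {}
    for ts, value in points:
        start = ts - ts % bucket_seconds  # start of the bucket this point falls in
        buckets.setdefault(start, []).append(value)
    return [(start, sum(v) / len(v)) for start, v in sorted(buckets.items())]


# Hypothetical CPU readings taken every 10 seconds: (unix_seconds, percent).
raw = [(0, 10.0), (10, 20.0), (20, 30.0), (60, 40.0), (70, 60.0)]

per_minute = downsample(raw, 60)
```

Averaging smooths out short spikes, so it is worth considering whether to keep a max or a percentile per bucket alongside the mean for metrics where peaks matter.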

3. Tiered storage

Hot storage → fast access
Cold storage → cheap, slow
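
A sketch of an age-based tiering rule. The 7-day hot retention window and the `tier_for` function are assumptions; real retention periods depend on how far back people actually query.

```python
from datetime import datetime, timedelta, timezone

HOT_RETENTION = timedelta(days=7)  # assumed cutoff, tune per workload


def tier_for(record_time: datetime, now: datetime) -> str:
    """Recent data stays in fast storage; older data moves to cheap storage."""
    return "hot" if now - record_time <= HOT_RETENTION else "cold"


now = datetime(2026, 3, 14, tzinfo=timezone.utc)
yesterday_tier = tier_for(now - timedelta(days=1), now)
last_month_tier = tier_for(now - timedelta(days=30), now)
```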

Key Insights

Observability ≠ just logs
Need:
- Logs + Metrics + Alerts + Tracing

Together they give:

Local + global + end-to-end view

Trade-offs

Pros

Faster debugging
Better reliability
Proactive issue detection

Cons

High data volume
Cost of storage/infra
Complexity in setup

One-line Summary

Monitoring and observability combine logs, metrics, alerting, and tracing to detect, understand, and debug issues in distributed systems before users are impacted.
