Monitoring and Observability: The DevOps Perspective

The Three Pillars of Observability

Observability is the ability to infer the internal state of a system from the outputs it produces. It is commonly described in terms of three pillars:

1. Metrics

Numerical measurements of system behavior over time (CPU usage, memory consumption, request latency, etc.)

2. Logs

Detailed records of events that occur in your system (application logs, system logs, access logs, etc.)

3. Traces

Records of individual requests as they flow through your system, showing which components each request touches and how long each step takes.

Monitoring vs. Observability

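Monitoring: Collecting, aggregating, and analyzing predefined metrics to detect known failure modes and alert on them
Observability: Being able to infer internal system state from external outputs, which lets you investigate problems you did not anticipate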

Popular Monitoring Solutions

Prometheus

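Time-series database and monitoring tool. It uses a pull-based model: Prometheus scrapes metrics over HTTP from each configured target at a fixed interval. A minimal prometheus.yml: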


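global:
  scrape_interval: 15s  # how often Prometheus scrapes each target

scrape_configs:
  # Prometheus scraping its own metrics endpoint
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Application exposing metrics at the default /metrics path
  - job_name: 'my-app'
    static_configs:
      - targets: ['localhost:8080']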
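
To make the pull model concrete, here is a minimal sketch of what the my-app target above might serve on localhost:8080. It uses only the Python standard library rather than an official client library, and the metric name app_requests_total, the path label, and the helper names are all hypothetical choices for illustration.

```python
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter, keyed by request path.
REQUESTS = Counter()


def record_request(path: str) -> None:
    """Count one handled request for the given path."""
    REQUESTS[path] += 1


def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP app_requests_total Total requests handled.",
        "# TYPE app_requests_total counter",
    ]
    for path, count in sorted(REQUESTS.items()):
        # Sketch only: label values are not escaped here.
        lines.append(f'app_requests_total{{path="{path}"}} {count}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current counter values whenever Prometheus scrapes /metrics."""

    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Block forever, answering scrapes on the given port."""
    HTTPServer(("localhost", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint. The application only records values, and Prometheus decides when to collect them, which is the practical difference from a push-based design. In a real service, an official Prometheus client library would normally replace the hand-written rendering.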

Grafana

Visualization and alerting platform. Displays metrics in dashboards and sends alerts.

ELK Stack

Elasticsearch (storage and search), Logstash (ingestion and processing), and Kibana (visualization), combined for log management.

Jaeger

Distributed tracing system for monitoring microservices.

Key Metrics to Monitor

Application Metrics

Request latency (p50, p95, p99)
Error rate and error types
Request throughput (RPS)
Cache hit rate
Database query performance
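
Latency percentiles are worth a short illustration, because a few slow requests can hide behind an average. The sketch below uses the nearest-rank method on made-up sample data. The function name and the sample values are assumptions for illustration, and real monitoring systems usually estimate percentiles from histograms instead of raw samples.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up request latencies in milliseconds; two requests are very slow.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 900]

summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
mean_ms = sum(latencies_ms) / len(latencies_ms)

print(summary)   # {'p50': 13, 'p95': 900, 'p99': 900}
print(mean_ms)   # 125.6
```

The typical request here takes 13 ms, yet the mean is 125.6 ms and the tail reaches 900 ms, which is why the list above tracks p50, p95, and p99 separately.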

Infrastructure Metrics

CPU usage
Memory usage
Disk space
Network bandwidth
Disk I/O

Business Metrics

User signups
API usage
Feature adoption
Revenue metrics

Setting Up Prometheus and Grafana


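version: '3'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      # Mount the scrape configuration into the container.
      # Inside the container, localhost in prometheus.yml refers to
      # this container itself, not to the host machine.
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      # Default password for local testing only; change it before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus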
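
Once the stack above is running, one way to check that targets are actually being scraped is to run the built-in up query against the Prometheus HTTP API. The sketch below assumes Prometheus is reachable at http://localhost:9090, as in the port mapping above. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def parse_up(payload: dict) -> dict:
    """Map each (job, instance) pair to True if its most recent scrape succeeded."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    return {
        (r["metric"].get("job"), r["metric"].get("instance")): r["value"][1] == "1"
        for r in payload["data"]["result"]
    }


def fetch_up(base_url: str = "http://localhost:9090") -> dict:
    """Run the instant query `up` and return the health of every target."""
    query = urllib.parse.urlencode({"query": "up"})
    url = f"{base_url}/api/v1/query?{query}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_up(json.load(resp))
```

Calling fetch_up() returns a dictionary such as {('prometheus', 'localhost:9090'): True}. A False value means Prometheus knows about the target but its last scrape failed.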

Alerting Strategies

Alert on abnormal behavior, not on thresholds that normal operation routinely crosses
Use alerting rules based on business impact
Create meaningful alert messages
Implement escalation policies
Regularly test your alerts
Track alert volume to spot alert fatigue, then tune or remove noisy alerts
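
One common way to reduce noisy alerts is to fire only when a condition persists rather than on a single bad sample. The sketch below illustrates that idea for an error-rate alert. The class name, the 5% threshold, and the three-evaluation requirement are assumptions for illustration, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class ErrorRateAlert:
    threshold: float        # e.g. 0.05 means 5% of requests failing
    for_evaluations: int    # consecutive breaches required before firing
    _breaches: int = 0      # running count of consecutive breaches

    def evaluate(self, errors: int, requests: int) -> bool:
        """Record one evaluation window and return True if the alert should fire."""
        rate = errors / requests if requests else 0.0
        self._breaches = self._breaches + 1 if rate > self.threshold else 0
        return self._breaches >= self.for_evaluations


alert = ErrorRateAlert(threshold=0.05, for_evaluations=3)

# Made-up (errors, requests) counts for five consecutive evaluation windows.
windows = [(1, 100), (9, 100), (8, 100), (7, 100), (0, 100)]
states = [alert.evaluate(errors, requests) for errors, requests in windows]
print(states)  # [False, False, False, True, False]
```

The alert stays quiet during the first two breaches and fires only on the third consecutive one, then clears as soon as the error rate recovers.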

Best Practices

Instrument your application code
Use structured logging
Implement distributed tracing
Maintain alert hygiene
Regularly review monitoring effectiveness
Document your dashboards
Keep metric label cardinality low (avoid unbounded label values such as user IDs) to prevent performance issues
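
As an illustration of structured logging, the sketch below emits one JSON object per log line using only the Python standard library. The field names (ts, level, logger, message) and the convention of passing extra fields under a context key are assumptions made for this example, not a standard.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} are merged into the output.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str = "my-app") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


logger = build_logger()
logger.info("order placed", extra={"context": {"order_id": "o-123", "latency_ms": 42}})
```

Because every line is machine-parseable, a log pipeline can filter on fields such as order_id directly instead of relying on regular expressions over free text.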

Observability in Microservices

In microservice architectures, observability becomes critical because:

Requests traverse multiple services
Failures can be hard to trace
Latency comes from multiple components

Use distributed tracing to follow requests across services and understand the full request journey.
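
The core mechanism behind distributed tracing is context propagation: each service passes a trace identifier to the services it calls. The sketch below simulates three services as plain functions in one process. The service names, the x-trace-id and x-parent-span-id header names, and the COLLECTED list standing in for a tracing backend are all hypothetical. Real systems typically use a tracing library and standardized headers such as W3C Trace Context.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    service: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])


COLLECTED = []  # stand-in for a tracing backend


def start_span(service: str, headers: dict) -> Span:
    """Continue the caller's trace if the headers carry one, otherwise start a new trace."""
    span = Span(
        service=service,
        trace_id=headers.get("x-trace-id") or uuid.uuid4().hex,
        parent_id=headers.get("x-parent-span-id"),
    )
    COLLECTED.append(span)
    return span


def outgoing_headers(span: Span) -> dict:
    """Headers a service attaches to its downstream calls."""
    return {"x-trace-id": span.trace_id, "x-parent-span-id": span.span_id}


def inventory(headers: dict) -> None:
    start_span("inventory", headers)


def orders(headers: dict) -> None:
    span = start_span("orders", headers)
    inventory(outgoing_headers(span))


def gateway() -> None:
    span = start_span("gateway", {})  # no incoming trace, so a new one begins here
    orders(outgoing_headers(span))


gateway()
for span in COLLECTED:
    print(span.service, span.trace_id, span.parent_id)
```

All three spans share one trace_id, and each parent_id points at the caller's span, so a backend can reassemble the full request journey from gateway to inventory.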