What is Prometheus?

Prometheus is an open-source monitoring tool that collects metrics from systems and applications, stores them in a time series database, and provides query and alerting capabilities.

Prometheus Architecture

Main Components

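  • Prometheus Server: Scrapes targets, stores samples in its time series database, and evaluates rules
  • Exporters: Processes that expose metrics from a system or application over HTTP
  • Pushgateway: Accepts pushed metrics from short-lived jobs so that Prometheus can scrape them
  • Alertmanager: Deduplicates, groups, and routes alerts to receivers
  • Service Discovery: Finds scrape targets automatically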
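
To make the exporter role concrete, here is a minimal sketch of how a process might render samples in the Prometheus text exposition format before serving them over HTTP. The helper names `format_sample` and `render_metrics` are hypothetical, and the sketch omits the optional `# HELP` and `# TYPE` comment lines.

```python
def format_sample(name, labels, value):
    """Render one sample line: name{label="value",...} value."""
    if not labels:
        return f"{name} {value}"
    parts = []
    for key, val in sorted(labels.items()):
        # Label values are double-quoted; backslash, quote, and newline are escaped.
        escaped = val.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        parts.append(f'{key}="{escaped}"')
    return f"{name}{{{','.join(parts)}}} {value}"


def render_metrics(samples):
    """Join sample lines into a response body that ends with a newline."""
    return "\n".join(format_sample(*s) for s in samples) + "\n"


body = render_metrics([
    ("http_requests_total", {"method": "GET", "status": "200"}, 1024),
    ("memory_usage_bytes", {"instance": "server1"}, 1073741824),
])
print(body, end="")
```

In practice an exporter would return this body from an HTTP endpoint (conventionally `/metrics`) that the Prometheus server scrapes.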

Data Flow

Applications → Exporters → Prometheus Server → Alertmanager → Alerts
                                   ↓
                             TSDB Database
                                   ↓
                            PromQL Queries

Metrics and Types

Metric Types

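  • Counter: Cumulative value that only increases (and resets to zero when the process restarts)
  • Gauge: Value that can go up or down
  • Histogram: Observations counted in cumulative buckets, plus a sum and a count
  • Summary: Quantiles calculated by the client, plus a sum and a count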

Metric Examples

# Counter
http_requests_total{method="GET", status="200"} 1024

# Gauge
memory_usage_bytes{instance="server1"} 1073741824

# Histogram
http_request_duration_seconds_bucket{le="0.1"} 100
http_request_duration_seconds_bucket{le="0.5"} 200
http_request_duration_seconds_bucket{le="1.0"} 250
http_request_duration_seconds_bucket{le="+Inf"} 300
http_request_duration_seconds_sum 150.5
http_request_duration_seconds_count 300
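
Histogram buckets are cumulative, so quantiles can be estimated from them by linear interpolation. The sketch below applies that idea to the bucket values above. It is a simplified illustration, not Prometheus's own code: the function name mirrors PromQL's `histogram_quantile`, but it assumes sorted buckets ending in `+Inf` and a lowest bucket that starts at 0.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # position of the requested observation
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # The open-ended bucket has no upper bound to interpolate
                # towards, so report the highest finite bound instead.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


buckets = [(0.1, 100), (0.5, 200), (1.0, 250), (math.inf, 300)]
print("median estimate:", histogram_quantile(0.5, buckets))
print("p95 estimate:", histogram_quantile(0.95, buckets))
print("mean:", 150.5 / 300)  # _sum divided by _count
```

With these buckets the 95th percentile lands in the `+Inf` bucket, so the estimate is capped at 1.0. Choosing bucket boundaries that cover the expected range keeps the estimates meaningful.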

Configuration

Basic Configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

Service Discovery

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

PromQL - Query Language

Basic Queries

# Simple query
up

# Filtering by labels
http_requests_total{method="GET"}

# Aggregations
sum(http_requests_total) by (method)

# Time functions
rate(http_requests_total[5m])

# Mathematical operators
cpu_usage_percent / 100
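
To show what `rate()` does with a counter, the sketch below computes a per-second rate from raw samples and treats any drop in value as a counter reset. It is a simplification under stated assumptions: the real function also extrapolates to the edges of the time window, and the names `increase` and `simple_rate` are chosen here for illustration.

```python
def increase(samples):
    """Total increase across (timestamp, value) samples, handling counter resets."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A lower value means the counter restarted from zero,
        # so the whole current value counts as new increase.
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


# Samples scraped every 60 seconds; the counter resets before t=180.
samples = [(0, 100), (60, 160), (120, 220), (180, 20), (240, 80)]
print(increase(samples))     # 60 + 60 + 20 + 60
print(simple_rate(samples))  # requests per second over 240 seconds
```

This is why counters should be wrapped in `rate()` or `increase()` before aggregation: the raw value depends on when each process last restarted.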

Advanced Queries

# Percentiles (95th percentile latency over the last 5 minutes)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Changes over time
increase(http_requests_total[1h])

# Comparisons
cpu_usage_percent > 80

# Window functions
avg_over_time(cpu_usage_percent[5m])

Alerts

Alert Configuration

groups:
- name: example
  rules:
  - alert: HighCPUUsage
    expr: cpu_usage_percent > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage detected"
      description: "CPU usage is above 80% for more than 5 minutes"
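
The `for: 5m` clause means the alert stays pending until the expression has been true continuously for five minutes. The following sketch simulates that state machine over a series of rule evaluations. The function `alert_states` and the sample CPU readings are made up for illustration.

```python
def alert_states(evaluations, for_seconds):
    """Map (timestamp, condition_true) evaluations to alert states."""
    states = []
    active_since = None
    for timestamp, condition_true in evaluations:
        if not condition_true:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_long_enough = timestamp - active_since >= for_seconds
        states.append("firing" if held_long_enough else "pending")
    return states


# CPU readings evaluated every 60 seconds against the 80% threshold.
readings = [70, 85, 90, 88, 92, 95, 91, 60]
evaluations = [(i * 60, value > 80) for i, value in enumerate(readings)]
for (timestamp, _), state in zip(evaluations, alert_states(evaluations, 300)):
    print(timestamp, state)
```

A single evaluation below the threshold returns the alert to inactive, so short spikes never reach Alertmanager.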

Alertmanager

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
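
On the receiving side, the webhook endpoint gets a JSON document containing the grouped alerts. The sketch below shows one way a receiver might summarize such a payload. `summarize_webhook` is a hypothetical helper, and the sample payload is hand-written with only a few of the fields Alertmanager sends.

```python
import json


def summarize_webhook(body):
    """Return one human-readable line per alert in a webhook payload."""
    payload = json.loads(body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown")
        name = labels.get("alertname", "?")
        severity = labels.get("severity", "none")
        summary = annotations.get("summary", "")
        lines.append(f"[{status}] {name} severity={severity}: {summary}")
    return lines


# Hand-written example payload; real payloads carry additional fields.
sample_body = json.dumps({
    "status": "firing",
    "receiver": "web.hook",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "HighCPUUsage", "severity": "warning"},
        "annotations": {"summary": "High CPU usage detected"},
    }],
})
for line in summarize_webhook(sample_body):
    print(line)
```

A real receiver would wrap this in an HTTP handler listening on the address configured under `webhook_configs`.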

Exporters

System

  • Node Exporter: Operating system metrics
  • Windows Exporter: Windows metrics
  • SNMP Exporter: SNMP metrics

Applications

  • JMX Exporter: Java metrics
  • MySQL Exporter: MySQL metrics
  • PostgreSQL Exporter: PostgreSQL metrics
  • Redis Exporter: Redis metrics

Cloud

  • AWS CloudWatch Exporter: AWS metrics
  • Azure Monitor Exporter: Azure metrics
  • GCP Exporter: Google Cloud metrics

Kubernetes Integration

ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
  - port: metrics
    interval: 30s

PrometheusRule

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules
spec:
  groups:
  - name: example
    rules:
    - alert: PodDown
      expr: up == 0
      for: 5m

Best Practices

Metrics

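  • Naming: Use consistent conventions, such as base units in the name and a _total suffix on counters (as in http_requests_total)
  • Cardinality: Avoid labels with unbounded values such as user or request IDs, because every label combination is a separate time series
  • Retention: Configure retention to match how far back queries need to look
  • Labels: Use labels for the dimensions you filter or aggregate on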
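
The cardinality advice follows from simple arithmetic: the worst-case number of series for a metric is the product of the distinct values of each label. A small sketch, with a hypothetical `worst_case_series` helper and made-up label sets:

```python
from math import prod


def worst_case_series(label_values):
    """Upper bound on time series for one metric: product of label value counts."""
    return prod(len(values) for values in label_values.values())


labels = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "status": ["200", "301", "400", "404", "500"],
    "instance": [f"server{i}" for i in range(10)],
}
print(worst_case_series(labels))  # 4 * 5 * 10

# Adding a per-user label multiplies the total by the number of users.
labels["user_id"] = [str(i) for i in range(10_000)]
print(worst_case_series(labels))
```

One unbounded label can turn a few hundred series into millions, which is why identifiers belong in logs or traces rather than in metric labels.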

Performance

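  • Scrape Interval: Choose intervals that balance resolution against load on targets and storage
  • Query Performance: Narrow label matchers and time ranges, and precompute expensive expressions with recording rules
  • Storage: Size disk for the retention period and ingestion rate
  • Memory: Monitor memory usage, which grows with the number of active series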

Security

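  • Authentication: Require authentication on the Prometheus and Alertmanager endpoints
  • Authorization: Restrict who can query data and change configuration
  • TLS: Encrypt scrape and API traffic with TLS
  • Network: Segment the network so that only trusted hosts can reach metrics endpoints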