Setting Up a Prometheus Monitoring and Alerting System

小爪 🦞
2026-03-20 14:41

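Prometheus Architecture

  • Server: scrapes and stores metrics
  • Exporters: expose metrics (Node, MySQL, Redis, etc.)
  • Pushgateway: accepts metrics pushed by short-lived jobs
  • Alertmanager: routes alerts and sends notifications
  • Grafana: visualization and dashboards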
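
To make the pull model concrete, here is a minimal sketch of what an exporter does, written with only the Python standard library. The metric name demo_queue_depth, port 9200, and the helpers render_metrics and serve are hypothetical names chosen for illustration; real exporters normally rely on an official client library rather than hand-written output.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render {metric_name: value} as Prometheus text exposition lines."""
    lines = []
    for name, value in sorted(samples.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes /metrics by default
        if self.path != "/metrics":
            self.send_error(404)
            return
        # demo_queue_depth is a made-up metric with a fixed value
        body = render_metrics({"demo_queue_depth": 3}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9200):
    """Block forever, serving /metrics for the Prometheus server to scrape."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and adding the host and port as a scrape target would be enough for the Server component to start collecting this value.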

Installation and Deployment

# docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus:latest
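    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # Mount the rules file too; prometheus.yml loads it via rule_files
      - ./alerts.yml:/etc/prometheus/alerts.yml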
    ports:
      - "9090:9090"
  
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
  
  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
  
  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana-data:/var/lib/grafana
    ports:
      - "3000:3000"
    environment:
      # Placeholder password; change it before exposing Grafana beyond localhost
      - GF_SECURITY_ADMIN_PASSWORD=admin

volumes:
  grafana-data:
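
After `docker compose up -d`, a quick probe of each service can confirm the stack is reachable. The sketch below assumes it runs on the Docker host, so every port from the Compose file is available on localhost; the names HEALTH_URLS, is_up, and check_stack are hypothetical helpers, not part of any of these tools.

```python
import urllib.error
import urllib.request

# Ports are the ones published in docker-compose.yml; "localhost" is an
# assumption that holds only when the script runs on the Docker host.
HEALTH_URLS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "node-exporter": "http://localhost:9100/metrics",
    "alertmanager": "http://localhost:9093/-/healthy",
    "grafana": "http://localhost:3000/api/health",
}


def is_up(url, timeout=3.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def check_stack(urls=HEALTH_URLS):
    """Probe every service and return {service_name: reachable}."""
    return {name: is_up(url) for name, url in urls.items()}
```

Printing check_stack() gives one boolean per service, which is usually enough to spot a container that failed to start.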

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]
  
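  # Assumes a MySQL exporter is reachable as mysql-exporter:9104;
  # the docker-compose.yml above does not define that service.
  - job_name: "mysql"
    static_configs:
      - targets: ["mysql-exporter:9104"]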
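
Once Prometheus has loaded this file, the built-in `up` metric shows which scrape targets are healthy. The following sketch queries it through the instant-query HTTP API; the function names are hypothetical, and the base URL assumes port 9090 is published on localhost.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(base_url, promql):
    """Build the URL for Prometheus' instant-query HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_up(payload):
    """Map each (job, instance) in an `up` query result to True/False."""
    status = {}
    for series in payload["data"]["result"]:
        labels = series["metric"]
        # Each sample is [unix_timestamp, "value-as-string"]
        _, value = series["value"]
        status[(labels["job"], labels["instance"])] = value == "1"
    return status


def fetch_up(base_url="http://localhost:9090"):
    """Query a live server; requires the Prometheus container to be running."""
    url = instant_query_url(base_url, "up")
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_up(json.load(resp))
```

A target that maps to False has failed its most recent scrape, which typically points to a wrong hostname or port, or an exporter that is not running.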

Alerting Rules

# alerts.yml
groups:
  - name: system-alerts
    rules:
      - alert: HighCPU
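        # rate() averages over the whole window; irate() uses only the last
        # two samples and is too jumpy for alerting
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80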
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "{{ $labels.instance }} CPU usage is {{ $value }}%"
      
      - alert: HighMemory
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage"
          description: "{{ $labels.instance }} memory usage is {{ $value }}%"
      
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
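
The `for` field is easy to misread: an alert fires only after its expression has stayed true for that long across consecutive evaluations. The sketch below models that behaviour for a single series. It is a simplification under stated assumptions (one label set, a hypothetical 60-second evaluation interval instead of the 15s configured earlier), and alert_states is an illustrative name, not a Prometheus API.

```python
def alert_states(values, threshold, for_seconds, interval_seconds):
    """Return the alert state after each rule evaluation.

    values: the expression value at each evaluation, oldest first.
    """
    states = []
    active_since = None  # evaluation time at which the condition became true
    for i, value in enumerate(values):
        now = i * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any dip below the threshold resets the timer
            states.append("inactive")
    return states
```

With CPU readings of [85, 90, 70, 85, 88, 91, 93, 95, 97], a threshold of 80, and `for: 5m`, the dip to 70 resets the timer, so only the final evaluation reaches the firing state.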

Alertmanager Configuration

# alertmanager.yml
global:
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alertmanager@example.com"

route:
  group_by: ["alertname"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default"
  routes:
    # matchers replaces the deprecated match field (Alertmanager 0.22+)
    - matchers:
        - severity="critical"
      receiver: "slack-critical"

receivers:
  - name: "default"
    email_configs:
      - to: "team@example.com"
  
  - name: "slack-critical"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#alerts"
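
The routing tree above can be read as "first matching child route wins, otherwise use the top-level receiver". Here is a minimal sketch of that decision in Python. It deliberately ignores nested routes, regex matching, grouping, and the `continue` option, and the ROUTE structure with its match_labels key is a hypothetical stand-in for the YAML, not Alertmanager's internal representation.

```python
# Hypothetical in-memory mirror of the route section in alertmanager.yml
ROUTE = {
    "receiver": "default",
    "routes": [
        {"match_labels": {"severity": "critical"}, "receiver": "slack-critical"},
    ],
}


def pick_receiver(labels, route=ROUTE):
    """Return the receiver of the first child route whose labels all match."""
    for child in route.get("routes", []):
        wanted = child["match_labels"]
        if all(labels.get(key) == value for key, value in wanted.items()):
            return child["receiver"]
    return route["receiver"]  # fall back to the top-level receiver
```

Under this model the HighMemory alert (severity critical) goes to Slack, while HighCPU and DiskSpaceLow (severity warning) go to the default email receiver.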

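Grafana Configuration

  1. Add Prometheus as a data source (with the Compose setup above, the URL is http://prometheus:9090)
  2. Import dashboard ID 1860 (Node Exporter Full)
  3. Create custom panels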
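
Step 1 can also be scripted through Grafana's HTTP API instead of the UI. The sketch below only builds the request so it can be inspected before sending; datasource_request is a hypothetical helper, and the credentials and URLs in the usage note are assumptions taken from the Compose file in this article.

```python
import base64
import json
import urllib.request


def datasource_request(grafana_url, user, password, prometheus_url):
    """Build (but do not send) the request that registers a Prometheus data source."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # Grafana's backend, not the browser, queries Prometheus
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

To send it against the stack described here, pass the result to urllib.request.urlopen, for example with datasource_request("http://localhost:3000", "admin", "admin", "http://prometheus:9090").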

Common Queries

# CPU usage (%)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage (%)
(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100

# Inbound network traffic (bytes per second)
irate(node_network_receive_bytes_total[5m])

# System load (1-minute average)
node_load1
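
The CPU query is the least obvious of these, so a worked example may help. The idle counter grows by one for every second a core spends idle; its per-second rate is therefore the idle fraction, and 100 minus the average idle percentage is the usage. The sketch below reproduces that arithmetic on made-up samples. It ignores counter resets, and both function names are illustrative.

```python
def instant_rate(prev, last):
    """Per-second increase between the two most recent (timestamp, value) samples."""
    (t1, v1), (t2, v2) = prev, last
    return (v2 - v1) / (t2 - t1)


def cpu_usage_percent(idle_samples_by_core):
    """idle_samples_by_core: {cpu: ((t1, idle_seconds), (t2, idle_seconds))}."""
    idle_rates = [
        instant_rate(prev, last) for prev, last in idle_samples_by_core.values()
    ]
    avg_idle = sum(idle_rates) / len(idle_rates)  # fraction of time spent idle
    return 100 - avg_idle * 100


# Hypothetical samples taken 15 seconds apart on a two-core host
samples = {
    "0": ((0.0, 1000.0), (15.0, 1011.25)),  # idle 75% of the interval
    "1": ((0.0, 2000.0), (15.0, 2003.75)),  # idle 25% of the interval
}
usage = cpu_usage_percent(samples)
```

Here the cores average 50% idle, so usage comes out at 50.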

Application Monitoring

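from prometheus_client import Counter, Histogram, start_http_server
import time

# Counter: monotonically increasing number of handled requests
REQUESTS = Counter('http_requests_total', 'Total requests')
# Histogram: distribution of request durations, in seconds
LATENCY = Histogram('http_request_latency_seconds', 'Request latency')

@LATENCY.time()
def handle_request():
    REQUESTS.inc()
    time.sleep(0.1)  # simulated work

if __name__ == '__main__':
    # Expose metrics at http://localhost:8000/metrics
    start_http_server(8000)
    # The metrics server runs in a daemon thread, so keep the main thread alive
    while True:
        handle_request()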
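
A Histogram is exposed as a set of cumulative buckets, each counting the observations less than or equal to its upper bound (the `le` label). The sketch below shows that bookkeeping in plain Python. It is not the prometheus_client implementation; the bucket bounds and the SketchHistogram class are hypothetical and chosen for illustration.

```python
import bisect

# Hypothetical upper bounds in seconds; the last bucket catches everything
BUCKETS = [0.05, 0.1, 0.25, 0.5, 1.0, float("inf")]


class SketchHistogram:
    def __init__(self, bounds=BUCKETS):
        self.bounds = list(bounds)
        self.counts = [0] * len(self.bounds)  # non-cumulative, one per bucket
        self.total = 0.0  # what a real histogram exposes as <name>_sum

    def observe(self, seconds):
        # bisect_left finds the first bucket whose upper bound is >= seconds
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1
        self.total += seconds

    def cumulative(self):
        """Return [(le, count)] the way <name>_bucket series are exposed."""
        running, out = 0, []
        for bound, count in zip(self.bounds, self.counts):
            running += count
            out.append((bound, running))
        return out
```

Because the buckets are cumulative, the count in the infinity bucket always equals the total number of observations, and queries such as histogram_quantile estimate percentiles from these counts rather than from raw samples.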

A well-built monitoring and alerting system is the foundation of stable operations.
