sub2api-cn-relay-manager/deploy/monitoring
phamnazage-jpg f6600d663a
feat(monitoring): add complete Prometheus + Grafana monitoring stack
Add production-ready monitoring infrastructure:
- 15 alerting rules (4 Critical + 11 Warning)
- Grafana dashboard with service health panels
- Full documentation with deployment guide

Covers: service availability, error rates, latency,
routing health, database connections, and log metrics
2026-06-02 19:54:38 +08:00

Sub2API Relay Manager Monitoring Setup

Overview

This project ships a complete monitoring and alerting setup: Prometheus metrics, a Grafana dashboard, and Prometheus alerting rules.

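Configured Metrics

HTTP metrics

  • http_requests_total - total number of HTTP requests (labeled by method, path, status)
  • http_request_duration_seconds - distribution of HTTP request latency

Business metrics

  • active_hosts - number of active hosts
  • active_providers - number of active providers
  • route_decisions_total - total number of routing decisions
  • route_failovers_total - total number of route failovers

Database metrics

  • db_connections_active - number of active database connections
  • db_operations_total - total number of database operations

Log metrics

  • log_flush_errors_total - number of log flush errors
  • log_dropped_events_total - number of dropped log events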
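A quick way to confirm that the service really exposes the metrics listed above is to scan a dump of the /metrics endpoint for each name. The sketch below is illustrative only: the function name `missing_metrics` is hypothetical, and it assumes the endpoint returns the standard Prometheus text exposition format.

```shell
#!/usr/bin/env bash
# Metric names copied from the lists above.
EXPECTED_METRICS="http_requests_total http_request_duration_seconds active_hosts active_providers route_decisions_total route_failovers_total db_connections_active db_operations_total log_flush_errors_total log_dropped_events_total"

# Reads Prometheus text exposition from stdin and prints every expected
# metric name that has no sample line in it (hypothetical helper).
missing_metrics() {
  local dump name
  dump="$(cat)"
  for name in $EXPECTED_METRICS; do
    # A sample line starts with the metric name, optionally followed by a
    # histogram/summary suffix, and then either "{" (labels) or a space.
    if ! printf '%s\n' "$dump" | grep -Eq "^${name}(_bucket|_sum|_count)?([{ ])"; then
      printf '%s\n' "$name"
    fi
  done
}

# Usage (assumes the service listens on localhost:8080):
#   curl -s http://localhost:8080/metrics | missing_metrics
```

An empty result means every listed metric has at least one sample. Labeled metrics that have not been observed yet may legitimately be absent right after startup, so a name printed here is a hint to investigate rather than proof of a bug.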

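Alerting Rules

Critical

| Alert | Condition | Description |
|---|---|---|
| ServiceDown | up == 0 for 1 minute | Service is completely down |
| NoActiveProviders | active_providers == 0 for 1 minute | No provider available |
| NoActiveHosts | active_hosts == 0 for 1 minute | No host available |
| HealthCheckFailing | /healthz returns a non-200 status | Health check failing |

Warning

| Alert | Condition | Description |
|---|---|---|
| HighErrorRate | Error rate > 5% for 2 minutes | High HTTP 5xx/4xx error rate |
| HighLatency | P95 latency > 1 second for 3 minutes | Request handling is slow |
| RouteFailoverSpike | Failover rate > 2x the normal level | Routing is unstable |
| HighDBConnections | Active connections > 50 for 5 minutes | Database connection pool under pressure |
| LogFlushErrors | Log flush errors > 0 | Logging system fault |
| LogDroppedEvents | Dropped events > 10/sec | Log buffer overflow |
| BatchImportFailures | Batch failure rate > 10% | Provider import problems |
| AuthFailures | Authentication failures > 10/sec | Credential problems or an attack |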
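To see how close the live system is to the HighErrorRate and HighLatency thresholds, the underlying quantities can be queried by hand through the Prometheus HTTP API. The expressions below are an assumed equivalent, not a copy of the shipped rules: the actual expressions live in prometheus-rules.yml, which is not reproduced here. The sketch assumes the `status` label holds the numeric HTTP status code, that http_request_duration_seconds is a histogram, and that Prometheus listens on localhost:9090. All three function names are hypothetical.

```shell
#!/usr/bin/env bash
# Builds a PromQL expression for the share of 4xx/5xx responses over a window.
# Assumes the `status` label holds the numeric HTTP status code.
error_ratio_query() {
  local window="${1:-5m}"
  printf 'sum(rate(http_requests_total{status=~"[45].."}[%s])) / sum(rate(http_requests_total[%s]))' "$window" "$window"
}

# Builds a PromQL expression for P95 latency.
# Assumes http_request_duration_seconds is exported as a histogram.
p95_latency_query() {
  local window="${1:-5m}"
  printf 'histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[%s])))' "$window"
}

# Runs an instant query against the Prometheus HTTP API.
# PROM_URL defaults to an assumed local Prometheus address.
prom_query() {
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" --data-urlencode "query=$1"
}

# Usage:
#   prom_query "$(error_ratio_query 2m)"    # compare the value against 0.05
#   prom_query "$(p95_latency_query 5m)"    # compare the value against 1 (second)
```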

Deployment

1. Prometheus configuration

Add the following to prometheus.yml:

rule_files:
  - "sub2api-relay-manager-rules.yml"

scrape_configs:
  - job_name: "sub2api-relay-manager"
    static_configs:
      - targets: ["localhost:8080"]
    metrics_path: /metrics
    scrape_interval: 15s

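Copy the alerting rules to the location that rule_files points to. Relative rule_files entries resolve against the directory containing prometheus.yml (assumed here to be /etc/prometheus):

cp deploy/monitoring/prometheus-rules.yml /etc/prometheus/sub2api-relay-manager-rules.yml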
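Before reloading Prometheus it is worth validating both the rules file and the main configuration with promtool, which ships with Prometheus. The wrapper below is a minimal sketch; the function name `validate_prometheus` and the default config path /etc/prometheus/prometheus.yml are assumptions.

```shell
#!/usr/bin/env bash
# Validates a rules file and then the main config with promtool.
# Stops at the first failure and returns promtool's exit status.
validate_prometheus() {
  local rules="$1"
  local config="${2:-/etc/prometheus/prometheus.yml}"  # assumed default path
  promtool check rules "$rules" && promtool check config "$config"
}

# Usage:
#   validate_prometheus deploy/monitoring/prometheus-rules.yml
#
# Once validation passes, ask Prometheus to reload its configuration.
# The HTTP endpoint only works when Prometheus runs with --web.enable-lifecycle:
#   curl -X POST http://localhost:9090/-/reload
```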

2. Grafana configuration

Import the dashboard:

curl -X POST \
  http://admin:admin@localhost:3000/api/dashboards/db \
  -H 'Content-Type: application/json' \
  -d @deploy/monitoring/grafana-dashboard.json
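Grafana's POST /api/dashboards/db endpoint expects the dashboard model wrapped in a `dashboard` field. If grafana-dashboard.json holds a bare dashboard model (an assumption, since the file is not shown here), the request above may be rejected. The sketch below wraps the file on the fly; the function name `wrap_dashboard` is hypothetical, and the admin:admin credentials are taken from the command above.

```shell
#!/usr/bin/env bash
# Reads a bare dashboard model (JSON) from stdin and wraps it in the
# request body shape used by POST /api/dashboards/db.
wrap_dashboard() {
  printf '{"dashboard": '
  cat
  printf ', "overwrite": true}\n'
}

# Usage:
#   wrap_dashboard < deploy/monitoring/grafana-dashboard.json |
#     curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
#       -H 'Content-Type: application/json' \
#       --data-binary @-
```

If the file is already wrapped in a `dashboard` field, use the original command unchanged.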

3. Alertmanager configuration (optional)

Configure the alert notification channels (Slack, email, or PagerDuty):

# alertmanager.yml
global:
  smtp_smarthost: "localhost:587"
  smtp_from: "alerts@example.com"

route:
  receiver: "ops-team"
  group_by: ["alertname", "severity"]

receivers:
  - name: "ops-team"
    email_configs:
      - to: "ops@example.com"
        subject: "[Alert] {{ .GroupLabels.alertname }}"
    slack_configs:
      - api_url: "YOUR_SLACK_WEBHOOK_URL"
        channel: "#alerts"

Verification

Check the metrics endpoint

curl http://localhost:8080/metrics

Verify the alerting rules

# View the loaded rules in Prometheus
http://localhost:9090/rules

# View the alert status
http://localhost:9090/alerts

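Trigger a test alert

# Simulate a high error rate. HighErrorRate only fires after the condition
# has held for 2 minutes, so keep sending failing requests for about 3 minutes.
end=$((SECONDS + 180))
while [ "$SECONDS" -lt "$end" ]; do
  curl -s -o /dev/null http://localhost:8080/api/nonexistent
  sleep 0.1
done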

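Interpreting the Metrics

Reference values for a healthy system

| Metric | Normal range | Alert threshold |
|---|---|---|
| active_providers | >= 2 | < 2 (warning), = 0 (critical) |
| active_hosts | >= 1 | = 0 (critical) |
| Error rate | < 1% | > 5% |
| P95 latency | < 500ms | > 1s |
| DB connections | < 20 | > 50 |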

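Troubleshooting

Service down alert

  1. Check the process status: systemctl status sub2api-relay-manager
  2. Read the logs: journalctl -u sub2api-relay-manager
  3. Check that the port is listening: netstat -tlnp | grep 8080

High latency alert

  1. Check database performance
  2. Check upstream provider response times
  3. Check memory and CPU usage

Route failover alert

  1. Check provider health
  2. Query /api/routing/routes/health
  3. Analyze the provider response logs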