
# Sub2API Relay Manager Monitoring Setup

## Overview

This project ships with a complete monitoring and alerting setup: Prometheus metrics, a Grafana dashboard, and Prometheus alerting rules.

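## Configured Metrics

### HTTP Metrics

- `http_requests_total` - Total number of HTTP requests (labeled by method, path, and status)
- `http_request_duration_seconds` - HTTP request latency distribution

### Business Metrics

- `active_hosts` - Number of active hosts
- `active_providers` - Number of active providers
- `route_decisions_total` - Total number of routing decisions
- `route_failovers_total` - Total number of routing failovers

### Database Metrics

- `db_connections_active` - Number of active database connections
- `db_operations_total` - Total number of database operations

### Logging Metrics

- `log_flush_errors_total` - Number of log flush errors
- `log_dropped_events_total` - Number of dropped log events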
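
As a quick sanity check, the metric names listed above can be compared against what the `/metrics` endpoint actually exposes. The following is a minimal sketch, not part of the project: `missing_metrics` is a hypothetical helper, and it assumes the standard Prometheus text exposition format (including the `_bucket`/`_sum`/`_count` suffixes if the latency metric is a histogram or summary).

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every expected metric name that does not
# appear in Prometheus text-format output read from stdin.
missing_metrics() {
  local text name
  text="$(cat)"
  for name in \
    http_requests_total http_request_duration_seconds \
    active_hosts active_providers \
    route_decisions_total route_failovers_total \
    db_connections_active db_operations_total \
    log_flush_errors_total log_dropped_events_total
  do
    # A sample line starts with the metric name (optionally with a
    # histogram/summary suffix), followed by "{" or a space.
    if ! grep -Eq "^${name}(_bucket|_sum|_count)?[{ ]" <<<"$text"; then
      echo "$name"
    fi
  done
}

# Usage (assumes the service listens on localhost:8080):
#   curl -s http://localhost:8080/metrics | missing_metrics
```

A name reported as missing is not always an error: with many Prometheus client libraries, labeled metrics only appear after the first observation for a given label combination.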

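## Alerting Rules

### Critical

| Alert              | Condition                            | Description                |
| ------------------ | ------------------------------------ | -------------------------- |
| ServiceDown        | `up == 0` for 1 minute               | Service is completely down |
| NoActiveProviders  | `active_providers == 0` for 1 minute | No provider available      |
| NoActiveHosts      | `active_hosts == 0` for 1 minute     | No host available          |
| HealthCheckFailing | `/healthz` returns a non-200 status  | Health check failing       |

### Warning

| Alert               | Condition                             | Description                             |
| ------------------- | ------------------------------------- | --------------------------------------- |
| HighErrorRate       | Error rate > 5% for 2 minutes         | High rate of HTTP 4xx/5xx errors        |
| HighLatency         | P95 latency > 1 s for 3 minutes       | Slow request handling                   |
| RouteFailoverSpike  | Failover rate > 2x the normal level   | Unstable routing                        |
| HighDBConnections   | Active connections > 50 for 5 minutes | Database connection pool under pressure |
| LogFlushErrors      | Log flush errors > 0                  | Logging system fault                    |
| LogDroppedEvents    | Dropped events > 10/sec               | Log buffer overflow                     |
| BatchImportFailures | Batch failure rate > 10%              | Provider import problems                |
| AuthFailures        | Authentication failures > 10/sec      | Credential problems or an attack        |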
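
To see how close the service is to the HighErrorRate threshold, the error ratio can be queried from Prometheus directly. This is a sketch under stated assumptions rather than the expression used in the shipped rules file: it assumes the `status` label carries the numeric HTTP status code and that Prometheus listens on `localhost:9090`. The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Build a PromQL expression for the share of 4xx/5xx responses over a window.
# Assumption: the `status` label holds the numeric HTTP status code.
error_ratio_expr() {
  local window="${1:-2m}"
  printf 'sum(rate(http_requests_total{status=~"[45].."}[%s])) / sum(rate(http_requests_total[%s]))' \
    "$window" "$window"
}

# Evaluate the expression through the Prometheus HTTP API (instant query).
query_error_ratio() {
  local prom="${1:-http://localhost:9090}"
  curl -sG "$prom/api/v1/query" --data-urlencode "query=$(error_ratio_expr 2m)"
}
```

A returned value above `0.05` corresponds to the 5% threshold in the table.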

## Deployment

### 1. Prometheus Configuration

Add the following to `prometheus.yml`:

```yaml
rule_files:
  - "/etc/prometheus/rules/prometheus-rules.yml"

scrape_configs:
  - job_name: "sub2api-relay-manager"
    static_configs:
      - targets: ["localhost:8080"]
    metrics_path: /metrics
    scrape_interval: 15s
```

Copy the alerting rules to the path referenced by `rule_files`:

```bash
cp deploy/monitoring/prometheus-rules.yml /etc/prometheus/rules/
```
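
Before relying on the new configuration, it can help to validate both files with `promtool` and then ask Prometheus to reload. The sketch below assumes the main config lives at `/etc/prometheus/prometheus.yml`, that Prometheus listens on `localhost:9090`, and that it was started with `--web.enable-lifecycle` (otherwise the reload endpoint is disabled and a restart or `SIGHUP` is needed). `validate_and_reload` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Validate the Prometheus config and rules, then request a reload.
validate_and_reload() {
  local config="${1:-/etc/prometheus/prometheus.yml}"
  local rules="${2:-/etc/prometheus/rules/prometheus-rules.yml}"
  local prom="${3:-http://localhost:9090}"

  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found in PATH" >&2
    return 127
  fi

  # Stop at the first failed check so a broken file is never loaded.
  promtool check config "$config" || return 1
  promtool check rules "$rules" || return 1

  # Only works when Prometheus runs with --web.enable-lifecycle.
  curl -fsS -X POST "$prom/-/reload"
}

# Usage:
#   validate_and_reload
```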

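### 2. Grafana Configuration

Import the dashboard. The example uses Grafana's default `admin:admin` credentials; substitute your own. Note that `/api/dashboards/db` expects the dashboard JSON wrapped in a `dashboard` field, so if `grafana-dashboard.json` is a raw dashboard export, wrap it before posting.

```bash
curl -X POST \
  -u admin:admin \
  http://localhost:3000/api/dashboards/db \
  -H 'Content-Type: application/json' \
  -d @deploy/monitoring/grafana-dashboard.json
```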

### 3. Alertmanager Configuration (Optional)

Configure the alert notification channels (Slack/Email/PagerDuty):

```yaml
# alertmanager.yml
global:
  smtp_smarthost: "localhost:587"
  smtp_from: "alerts@example.com"

route:
  receiver: "ops-team"
  group_by: ["alertname", "severity"]

receivers:
  - name: "ops-team"
    email_configs:
      - to: "ops@example.com"
        subject: "[Alert] {{ .GroupLabels.alertname }}"
    slack_configs:
      - api_url: "YOUR_SLACK_WEBHOOK_URL"
        channel: "#alerts"
```
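
To confirm that notifications actually reach the configured receivers, a manual alert can be posted to Alertmanager's v2 API. This is a minimal sketch: it assumes Alertmanager listens on its default port at `localhost:9093`, and the alert name, severity, and helper names are made-up test values rather than anything defined by this project.

```bash
#!/usr/bin/env bash
# Build a minimal Alertmanager v2 alert payload (a JSON array of alerts).
# The alert name and severity are arbitrary test values.
test_alert_payload() {
  local name="${1:-TestAlert}" severity="${2:-warning}"
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"Manual test alert"}}]' \
    "$name" "$severity"
}

# Post the payload to Alertmanager (assumed at localhost:9093).
send_test_alert() {
  local am="${1:-http://localhost:9093}"
  curl -fsS -X POST "$am/api/v2/alerts" \
    -H 'Content-Type: application/json' \
    -d "$(test_alert_payload "${2:-TestAlert}" "${3:-warning}")"
}

# Usage:
#   send_test_alert http://localhost:9093 TestAlert warning
```

Running `amtool check-config alertmanager.yml` first catches syntax errors in the configuration before any alert is sent.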

## Verification

### Check the Metrics Endpoint

```bash
curl http://localhost:8080/metrics
```

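### Verify the Alerting Rules

```bash
# Rules page in the Prometheus UI: http://localhost:9090/rules
curl -s http://localhost:9090/api/v1/rules

# Alerts page in the Prometheus UI: http://localhost:9090/alerts
curl -s http://localhost:9090/api/v1/alerts
```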

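### Trigger a Test Alert

```bash
# Simulate a high error rate by requesting a nonexistent path.
# HighErrorRate only fires if the rate stays above 5% for 2 minutes.
for i in {1..100}; do
  curl -s -o /dev/null http://localhost:8080/api/nonexistent
done
```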
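
Because HighErrorRate requires the condition to hold for 2 minutes, a single burst of 100 requests may finish long before the alert fires. The following sketch keeps the failing traffic going for a fixed duration instead. `sustain_requests` is a hypothetical helper, and the 180-second duration and 0.2-second interval in the usage line are arbitrary choices.

```bash
#!/usr/bin/env bash
# Send requests to a URL for a fixed number of seconds and print how
# many were sent. Request failures are ignored on purpose.
sustain_requests() {
  local url="$1" duration="$2" interval="${3:-0.2}"
  local end=$(( $(date +%s) + duration ))
  local count=0
  while [ "$(date +%s)" -lt "$end" ]; do
    curl -s -o /dev/null "$url" || true
    count=$((count + 1))
    sleep "$interval"
  done
  echo "$count"
}

# Usage: generate failing requests for 3 minutes.
#   sustain_requests http://localhost:8080/api/nonexistent 180
```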

## Interpreting the Metrics

### Reference Values for Normal Operation

| Metric           | Normal range | Alert threshold               |
| ---------------- | ------------ | ----------------------------- |
| active_providers | >= 2         | < 2 (warning), = 0 (critical) |
| active_hosts     | >= 1         | = 0 (critical)                |
| Error Rate       | < 1%         | > 5%                          |
| P95 Latency      | < 500ms      | > 1s                          |
| DB Connections   | < 20         | > 50                          |
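
For a quick manual check without Prometheus, the current `active_providers` value can be read straight from the `/metrics` output and compared with the thresholds in the table above. This is an illustrative sketch: both helpers are hypothetical, and `gauge_value` assumes the gauge is exposed without labels.

```bash
#!/usr/bin/env bash
# Read an unlabeled gauge from Prometheus text-format input on stdin
# and print its value truncated to an integer.
gauge_value() {
  awk -v name="$1" '$1 == name { printf "%d\n", $2; exit }'
}

# Classify an active_providers value using the thresholds in the table.
classify_providers() {
  local n="$1"
  if [ "$n" -eq 0 ]; then
    echo critical
  elif [ "$n" -lt 2 ]; then
    echo warning
  else
    echo ok
  fi
}

# Usage (assumes the service listens on localhost:8080):
#   n="$(curl -s http://localhost:8080/metrics | gauge_value active_providers)"
#   classify_providers "$n"
```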

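## Troubleshooting

### ServiceDown Alert

1. Check the process status: `systemctl status sub2api-relay-manager`
2. Check the logs: `journalctl -u sub2api-relay-manager`
3. Check that the port is listening: `netstat -tlnp | grep 8080`

### High Latency Alert

1. Check database performance
2. Check upstream provider response times
3. Check memory and CPU usage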
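
When narrowing down a latency problem, it can also help to see which endpoints contribute to the P95. The sketch below builds a `histogram_quantile` query for that purpose. It assumes `http_request_duration_seconds` is exposed as a histogram (with `_bucket` series) and carries a `path` label, neither of which is confirmed here, and `p95_expr` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Build a PromQL expression for P95 latency over a window,
# optionally grouped by one extra label (for example "path").
p95_expr() {
  local window="${1:-5m}" by="${2:-}"
  local group="le"
  if [ -n "$by" ]; then
    group="le, $by"
  fi
  printf 'histogram_quantile(0.95, sum by (%s) (rate(http_request_duration_seconds_bucket[%s])))' \
    "$group" "$window"
}

# Usage (assumes Prometheus at localhost:9090):
#   curl -sG http://localhost:9090/api/v1/query \
#     --data-urlencode "query=$(p95_expr 5m path)"
```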

### Route Failover Alert

1. Check provider health
2. Query `/api/routing/routes/health`
3. Analyze the provider response logs