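# Health Checks and Monitoring Guide

This document describes the system's health check endpoints, Prometheus metrics, and alerting rules.

---

## 1. Health Check Endpoints

The system exposes three health check endpoints, each intended for a different scenario:

| Endpoint | Path | Description | Typical use |
|----------|------|-------------|-------------|
| Liveness probe | `/health/live` | Confirms the process is alive | Kubernetes `livenessProbe` |
| Readiness probe | `/health/ready` | Confirms the service is ready | Kubernetes `readinessProbe` |
| Health check | `/health` | Overall health status | Load balancers, health check scripts |

### 1.1 Response Format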
```json
{
  "status": "ok",
  "timestamp": "2026-05-10T13:00:00Z",
  "version": "1.0.0"
}
```
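### 1.2 Response Codes

| Status | HTTP status code | Description |
|--------|------------------|-------------|
| ok | 200 | Service is healthy |
| degraded | 200 | Service is degraded (some dependencies are unavailable, e.g. Redis) |
| unhealthy | 503 | Service is unhealthy (e.g. the database is unreachable) |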
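
Because `degraded` is also served with HTTP 200, the status code alone cannot distinguish a healthy service from a degraded one; the `status` field in the body has to be inspected. The sketch below shows one minimal way to do that from a shell script. The function names `extract_status` and `status_to_exit_code` and the Nagios-style exit codes (0/1/2/3) are illustrative assumptions, not part of the service. The parsing assumes the flat JSON shape shown in 1.1 and deliberately avoids a `jq` dependency.

```bash
#!/bin/bash
# Extract the "status" field from a health response read on stdin.
# Assumes the flat JSON object shown in section 1.1 (no nested "status" keys).
extract_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Map a status string to an exit code (assumed convention:
# 0 = ok, 1 = degraded, 2 = unhealthy, 3 = unknown).
status_to_exit_code() {
  case "$1" in
    ok)        echo 0 ;;
    degraded)  echo 1 ;;
    unhealthy) echo 2 ;;
    *)         echo 3 ;;
  esac
}

# Usage (assumes the service listens on localhost:8080):
#   status=$(curl -s http://localhost:8080/health | extract_status)
#   exit "$(status_to_exit_code "$status")"
```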
---
## 2. Prometheus Metrics

### 2.1 Exposure

Metrics endpoint: `GET /metrics`

Returns text in the Prometheus exposition format.

### 2.2 Core Metrics

#### HTTP Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `http_requests_total` | Counter | method, path, status | Total number of HTTP requests |
| `http_request_duration_seconds` | Histogram | method, path | Request latency distribution |

#### Authentication Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `login_attempts_total` | Counter | result, method | Number of login attempts (success/failure) |
| `active_sessions_total` | Gauge | (none) | Number of currently active sessions |
| `refresh_tokens_total` | Counter | (none) | Number of token refreshes |

#### Database Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `db_query_duration_seconds` | Histogram | operation, table | Database query latency |
| `db_connections_open` | Gauge | type | Number of currently open connections |
| `db_connections_in_use` | Gauge | type | Number of connections in use |

#### Cache Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `cache_hits_total` | Counter | cache_level | Number of cache hits |
| `cache_misses_total` | Counter | cache_level | Number of cache misses |
| `cache_operations_total` | Counter | operation | Total number of cache operations |

#### Rate Limiting Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `ratelimit_rejections_total` | Counter | endpoint, algorithm | Number of requests rejected by the rate limiter |

### 2.3 Viewing Current Metrics
```bash
curl http://localhost:8080/metrics
```
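
The raw output can also be post-processed for a quick spot check without a Prometheus server. The sketch below sums the `cache_hits_total` and `cache_misses_total` counters across all `cache_level` values and prints the hit ratio. The helper names `sum_metric` and `cache_hit_ratio` are hypothetical, and the parsing assumes samples carry no trailing timestamp and that label values contain no spaces. Note that this yields the ratio since process start, not a windowed rate as PromQL `rate()` would.

```bash
#!/bin/bash
# Sum all samples of one metric (across every label set) from Prometheus text on stdin.
# Assumes samples carry no trailing timestamp, so the value is the last field.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                        # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/\{.*/, "", metric)            # strip the label set, if any
      if (metric == name) total += $NF
    }
    END { printf "%d\n", total }
  '
}

# Print hits / (hits + misses) with two decimals, or "n/a" when there is no data.
cache_hit_ratio() {
  local metrics hits misses
  metrics=$(cat)
  hits=$(printf '%s\n' "$metrics" | sum_metric cache_hits_total)
  misses=$(printf '%s\n' "$metrics" | sum_metric cache_misses_total)
  awk -v h="$hits" -v m="$misses" \
    'BEGIN { if (h + m == 0) print "n/a"; else printf "%.2f\n", h / (h + m) }'
}

# Usage:
#   curl -s http://localhost:8080/metrics | cache_hit_ratio
```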
---
## 3. Alerting Rules

### 3.1 Recommended Alerting Rules (Prometheus / Alertmanager Format)
```yaml
groups:
  - name: user-management
    rules:
      # Service unavailable
      - alert: ServiceDown
        expr: up{job="user-management"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "User management service is down"
      # High error rate (sum over all series so the ratio is service-wide;
      # without sum(), each 5xx series is divided only by itself)
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "HTTP 5xx error rate is above 5%"
      # High login failure rate (possible brute-force attack)
      - alert: HighLoginFailureRate
        expr: |
          sum(rate(login_attempts_total{result="fail"}[5m])) /
          sum(rate(login_attempts_total[5m])) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Login failure rate is above 80%, possible brute-force attack"
      # High response latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 response latency is above 1 second"
      # Database connection pool nearly exhausted
      - alert: DatabaseConnectionPoolExhausted
        expr: db_connections_in_use / db_connections_open > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool usage is above 90%"
      # Abnormal drop in active sessions
      - alert: ActiveSessionsDropped
        expr: |
          active_sessions_total < 10
          and
          delta(active_sessions_total[10m]) < -5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Active session count dropped sharply"
      # Frequent rate limit rejections
      - alert: RateLimitRejectionsHigh
        expr: |
          rate(ratelimit_rejections_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate limit rejections are too frequent"
```
---
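## 4. Grafana Dashboards

We recommend setting up the following dashboards:

### 4.1 Core Dashboard Metrics

**Overview dashboard**

- Request rate (QPS)
- P50/P90/P99 latency
- Error rate
- Active sessions

**Auth dashboard**

- Login attempts (success/failure)
- Token refresh count
- Active session trend
- TOTP enablement rate

**Database dashboard**

- Query latency P99
- Connection pool usage
- Slow query count

**Cache dashboard**

- Hit rate
- Miss rate
- L1 vs. L2 cache comparison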
---
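## 5. Log Keyword Monitoring

We recommend configuring alerts on the following keywords in your log aggregation system (e.g. Loki or ELK):

| Keyword | Severity | Description |
|---------|----------|-------------|
| `auth: increment login attempts failed` | warning | Redis / L1 cache unavailable |
| `goroutine leak` | critical | Potential goroutine leak |
| `token blacklisted but refresh failed` | critical | Failed to write to the token blacklist |
| `password reset code replay` | warning | Possible replay of a verification code |
| `temporary login token cleanup failed` | warning | Temporary login token cleanup failed |
| `cache.Set failed` | warning | Cache write failed |
| `failed to send email` | warning | Email delivery failed |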
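
Where no log platform is available yet, the same keywords can be checked directly against a log file. The following is a minimal sketch under stated assumptions: the function name `scan_log_keywords` and its return-code convention (2 = critical found, 1 = warnings only, 0 = clean) are hypothetical, and the log path in the usage comment is the one used by the health check script in section 6. Keywords are matched as fixed strings, one count per matching line.

```bash
#!/bin/bash
# Scan a log file for the keywords listed above and print one line per keyword
# that occurs, in the form "<severity> <matching-line-count> <keyword>".
# Returns 2 if any critical keyword was found, 1 if only warnings, 0 otherwise.
scan_log_keywords() {
  local logfile=$1 rc=0 severity keyword count
  while IFS='|' read -r severity keyword; do
    # grep -c exits non-zero when nothing matches, so tolerate that explicitly
    count=$(grep -c -F -- "$keyword" "$logfile" 2>/dev/null) || true
    count=${count:-0}
    [ "$count" -gt 0 ] || continue
    echo "$severity $count $keyword"
    if [ "$severity" = "critical" ]; then
      rc=2
    elif [ "$rc" -lt 2 ]; then
      rc=1
    fi
  done <<'EOF'
warning|auth: increment login attempts failed
critical|goroutine leak
critical|token blacklisted but refresh failed
warning|password reset code replay
warning|temporary login token cleanup failed
warning|cache.Set failed
warning|failed to send email
EOF
  return "$rc"
}

# Usage:
#   scan_log_keywords logs/app.log
```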
---
## 6. Health Check Script Example

```bash
#!/bin/bash
# health_check.sh: service health check script

HEALTH_URL="http://localhost:8080/health"
READY_URL="http://localhost:8080/health/ready"
METRICS_URL="http://localhost:8080/metrics"

# Print [OK] or [FAIL] for an endpoint; succeeds only on HTTP 200.
check_endpoint() {
  local url=$1
  local name=$2
  local status
  # --max-time keeps a hung service from blocking the script;
  # curl reports 000 when the connection fails.
  status=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" "$url")
  if [ "$status" = "200" ]; then
    echo "[OK] $name: $status"
    return 0
  else
    echo "[FAIL] $name: $status"
    return 1
  fi
}

# Run the checks
failed=0
check_endpoint "$HEALTH_URL" "Health" || failed=$((failed + 1))
check_endpoint "$READY_URL" "Ready" || failed=$((failed + 1))

# Check the Prometheus metrics endpoint (a failure here is only a warning)
status=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" "$METRICS_URL")
if [ "$status" = "200" ]; then
  echo "[OK] Metrics: $status"
else
  echo "[WARN] Metrics: $status"
fi

# Check the database connection (via the log file)
if grep -q "database opened" logs/app.log 2>/dev/null; then
  echo "[OK] Database: connected"
else
  echo "[FAIL] Database: not connected"
  failed=$((failed + 1))
fi

# The exit code is the number of failed checks (0 means all passed)
exit "$failed"
```
---
## 7. Kubernetes Deployment Configuration Example

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: user-management
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          ports:
            - name: http
              containerPort: 8080
            - name: metrics
              containerPort: 9090
          resources:
            requests:
              memory: "256Mi"
              cpu: "200m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
```
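
When tuning these probe values, it helps to estimate how long a fault can go unnoticed. The sketch below computes a rough upper bound as `periodSeconds * failureThreshold + timeoutSeconds`: the required number of consecutive failed probes, plus the time the last one may take to time out. The function name `probe_reaction_window` is hypothetical and the formula is an approximation that ignores `initialDelaySeconds` (which only matters at startup) and kubelet scheduling jitter.

```bash
#!/bin/bash
# Rough upper bound, in seconds, between a container becoming unhealthy and the
# kubelet acting on it (restart for liveness, removal from endpoints for readiness).
# Arguments: periodSeconds timeoutSeconds failureThreshold
probe_reaction_window() {
  local period=$1 timeout=$2 threshold=$3
  echo $(( period * threshold + timeout ))
}

probe_reaction_window 15 5 3   # livenessProbe values from the manifest: prints 50
probe_reaction_window 10 3 3   # readinessProbe values from the manifest: prints 33
```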
---
*Last updated: 2026-05-10*