# Health Check and Monitoring Guide

This document describes the system's health check endpoints, Prometheus metrics, and alerting rules.

---

## 1. Health Check Endpoints

The system exposes three health check endpoints, each suited to a different scenario:

| Endpoint | Path | Description | Use case |
|----------|------|-------------|----------|
| Liveness probe | `/health/live` | Confirms the process is alive | Kubernetes `livenessProbe` |
| Readiness probe | `/health/ready` | Confirms the service is ready | Kubernetes `readinessProbe` |
| Health check | `/health` | Overall health status | Load balancers, health check scripts |

### 1.1 Response Format

```json
{
  "status": "ok",
  "timestamp": "2026-05-10T13:00:00Z",
  "version": "1.0.0"
}
```

### 1.2 Response Codes

| Status | HTTP status code | Description |
|--------|------------------|-------------|
| ok | 200 | Service is healthy |
| degraded | 200 | Service is degraded (some dependencies, such as Redis, are unavailable) |
| unhealthy | 503 | Service is unhealthy (for example, the database is unreachable) |
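
The mapping in the table can be sketched in a few lines of Python. This is only an illustration, not the service's actual implementation: the function name `aggregate_health`, the dependency names, and the assumption that the database is the only critical dependency are hypothetical, inferred from the examples in the table. Note that `degraded` still maps to 200, so a caller that inspects only the HTTP status code cannot tell it apart from `ok` and has to read the `status` field.

```python
# Hypothetical sketch: derive the overall status and HTTP code from
# per-dependency check results. Treating only the database as critical
# is an assumption based on the examples in the table.
CRITICAL = {"database"}
HTTP_CODES = {"ok": 200, "degraded": 200, "unhealthy": 503}


def aggregate_health(checks):
    """Map {dependency_name: is_up} to a (status, http_code) tuple."""
    failed = {name for name, is_up in checks.items() if not is_up}
    if failed & CRITICAL:
        status = "unhealthy"   # a critical dependency is down
    elif failed:
        status = "degraded"    # only optional dependencies are down
    else:
        status = "ok"
    return status, HTTP_CODES[status]


print(aggregate_health({"database": True, "redis": False}))
```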
---

## 2. Prometheus Metrics

### 2.1 Exposure

Metrics endpoint: `GET /metrics`

The endpoint returns text in the Prometheus exposition format.
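
To make the format concrete, the sketch below parses a few lines of exposition text with the Python standard library. The sample payload and its values are invented for illustration. The parser handles only the simple `name{labels} value` form and skips comment lines, so it is not a complete implementation of the format (for example, it ignores lines that carry a timestamp).

```python
import re

# Invented sample; real output from /metrics will differ.
SAMPLE = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/health",status="200"} 42
active_sessions_total 7
"""

LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for sample in parse_metrics(SAMPLE):
    print(sample)
```

In practice a Prometheus server scrapes and parses this endpoint itself; a parser like this is mainly useful for ad hoc checks.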
### 2.2 Core Metrics

#### HTTP Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `http_requests_total` | Counter | method, path, status | Total number of HTTP requests |
| `http_request_duration_seconds` | Histogram | method, path | Request latency distribution |
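
A Prometheus histogram such as `http_request_duration_seconds` is exported as cumulative buckets, so percentiles like P99 are estimated from the buckets rather than measured exactly. The sketch below approximates a quantile by linear interpolation inside the bucket that contains the target rank, which is the same general approach PromQL's `histogram_quantile` takes. The bucket boundaries and counts are invented for illustration, and `estimate_quantile` is a hypothetical helper, not part of any library.

```python
import math

# Invented cumulative buckets: (upper bound in seconds, cumulative count).
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest
                # finite bound, since there is nothing to interpolate to.
                return prev_bound
            share = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * share
        prev_bound, prev_count = bound, count
    return prev_bound


print(estimate_quantile(0.99, BUCKETS))
```

With these sample numbers the P99 estimate lands at the upper edge of the 0.5 to 1.0 second bucket, which shows that the result can only be as precise as the bucket layout.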
#### Authentication Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `login_attempts_total` | Counter | result, method | Number of login attempts (success/failure) |
| `active_sessions_total` | Gauge | none | Current number of active sessions |
| `refresh_tokens_total` | Counter | none | Number of token refreshes |

#### Database Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `db_query_duration_seconds` | Histogram | operation, table | Database query latency |
| `db_connections_open` | Gauge | type | Number of currently open connections |
| `db_connections_in_use` | Gauge | type | Number of connections in use |

#### Cache Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `cache_hits_total` | Counter | cache_level | Number of cache hits |
| `cache_misses_total` | Counter | cache_level | Number of cache misses |
| `cache_operations_total` | Counter | operation | Total number of cache operations |
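
The hit and miss counters are usually combined into a hit ratio per cache level. The sketch below shows the arithmetic, assuming the `cache_level` label takes the values `L1` and `L2`; the counter increases are invented numbers and `hit_ratio` is a hypothetical helper.

```python
def hit_ratio(hits, misses):
    """Return hits / (hits + misses), or None when there were no lookups."""
    lookups = hits + misses
    if lookups == 0:
        return None  # avoid dividing by zero on an idle cache
    return hits / lookups


# Invented counter increases over some time window, keyed by cache_level.
hits = {"L1": 900, "L2": 80}
misses = {"L1": 100, "L2": 20}

for level in sorted(hits):
    print(level, hit_ratio(hits[level], misses[level]))
```

The miss ratio is one minus the hit ratio, so a dashboard needs only one of the two to convey the same information.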
#### Rate Limiting Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `ratelimit_rejections_total` | Counter | endpoint, algorithm | Number of requests rejected by the rate limiter |

### 2.3 Viewing Current Metrics

```bash
curl http://localhost:8080/metrics
```

---

## 3. Alerting Rules

### 3.1 Recommended Alerting Rules (Prometheus / Alertmanager Format)

```yaml
groups:
  - name: user-management
    rules:
      # Service unavailable
      - alert: ServiceDown
        expr: up{job="user-management"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "User management service is unavailable"

      # Error rate too high
      # sum() is required: "/" matches series with identical labels, so
      # without aggregation each 5xx series would be divided by itself.
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "HTTP 5xx error rate above 5%"

      # Login failure rate too high (possible brute-force attack)
      - alert: HighLoginFailureRate
        expr: |
          sum(rate(login_attempts_total{result="fail"}[5m])) /
          sum(rate(login_attempts_total[5m])) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Login failure rate above 80%, possible brute-force attack"

      # Response latency too high
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 response latency above 1 second"

      # Database connection pool exhausted
      - alert: DatabaseConnectionPoolExhausted
        expr: db_connections_in_use / db_connections_open > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool utilization above 90%"

      # Abnormal drop in active sessions
      - alert: ActiveSessionsDropped
        expr: |
          active_sessions_total < 10
          and
          delta(active_sessions_total[10m]) < -5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Active sessions dropped sharply"

      # Frequent rate-limit rejections
      - alert: RateLimitRejectionsHigh
        expr: |
          rate(ratelimit_rejections_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate-limit rejection frequency is too high"
```
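
A ratio alert such as `HighErrorRate` is only meaningful when the 5xx rate and the total rate are each summed before dividing, because PromQL's `/` operator pairs up series that have identical label sets. The Python sketch below illustrates the difference with invented per-series rates; the function names are hypothetical and the code only mimics the arithmetic, not the PromQL engine.

```python
# Invented per-second request rates, keyed by (method, path, status).
rates = {
    ("GET", "/users", "200"): 95.0,
    ("GET", "/users", "500"): 3.0,
    ("POST", "/login", "200"): 1.0,
    ("POST", "/login", "503"): 1.0,
}


def error_ratio(rates):
    """Aggregate first, then divide: sum(5xx rates) / sum(all rates)."""
    total = sum(rates.values())
    errors = sum(v for (_, _, status), v in rates.items()
                 if status.startswith("5"))
    return errors / total if total else 0.0


def per_series_ratio(rates):
    """Divide each 5xx series by the series with the same labels (itself)."""
    return {key: value / rates[key]
            for key, value in rates.items() if key[2].startswith("5")}


print(error_ratio(rates))       # aggregated ratio across all series
print(per_series_ratio(rates))  # every entry is 1.0, whatever the traffic
```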
---

## 4. Grafana Dashboards

We recommend importing the following dashboard configurations:

### 4.1 Core Dashboard Panels

**Overview dashboard**:
- Request rate (QPS)
- P50/P90/P99 latency
- Error rate
- Active sessions

**Auth dashboard**:
- Login attempts (success/failure)
- Token refresh count
- Active session trend
- TOTP enablement rate

**Database dashboard**:
- P99 query latency
- Connection pool utilization
- Slow query count

**Cache dashboard**:
- Hit rate
- Miss rate
- L1/L2 cache comparison

---

## 5. Log Keyword Monitoring

We recommend configuring alerts on the following keywords in your log aggregation system (such as Loki or ELK):

| Keyword | Severity | Description |
|---------|----------|-------------|
| `auth: increment login attempts failed` | warning | Redis/L1 cache unavailable |
| `goroutine leak` | critical | Potential goroutine leak |
| `token blacklisted but refresh failed` | critical | Token blacklist write failed |
| `password reset code replay` | warning | Possible replay of a verification code |
| `temporary login token cleanup failed` | warning | Temporary token cleanup failed |
| `cache.Set failed` | warning | Cache write failed |
| `failed to send email` | warning | Email delivery failed |
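
For a quick local check before wiring these keywords into Loki or ELK, the same matching can be sketched in Python. The keyword-to-severity mapping is copied from the table; the `scan` helper and the sample log lines are hypothetical and exist only to show the idea of plain substring matching.

```python
# Keyword-to-severity mapping copied from the table.
KEYWORDS = {
    "auth: increment login attempts failed": "warning",
    "goroutine leak": "critical",
    "token blacklisted but refresh failed": "critical",
    "password reset code replay": "warning",
    "temporary login token cleanup failed": "warning",
    "cache.Set failed": "warning",
    "failed to send email": "warning",
}


def scan(lines):
    """Yield (severity, keyword, line) for each line containing a keyword."""
    for line in lines:
        for keyword, severity in KEYWORDS.items():
            if keyword in line:
                yield severity, keyword, line.rstrip("\n")


# Invented log lines for illustration.
sample = [
    "2026-05-10T13:00:01Z WARN cache.Set failed: connection refused",
    "2026-05-10T13:00:02Z INFO request completed",
    "2026-05-10T13:00:03Z ERROR goroutine leak suspected in worker pool",
]

for severity, keyword, line in scan(sample):
    print(f"[{severity}] {keyword}: {line}")
```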
---

## 6. Example Health Check Script

```bash
#!/bin/bash
# health_check.sh: service health check script

HEALTH_URL="http://localhost:8080/health"
READY_URL="http://localhost:8080/health/ready"
METRICS_URL="http://localhost:8080/metrics"

# Print [OK] or [FAIL] for a URL; return non-zero unless it answers 200.
check_endpoint() {
    local url=$1
    local name=$2
    local status
    # curl reports 000 as the status code when the request itself fails
    status=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$url")

    if [ "$status" -eq 200 ]; then
        echo "[OK] $name: $status"
        return 0
    else
        echo "[FAIL] $name: $status"
        return 1
    fi
}

# Run the checks
failed=0

check_endpoint "$HEALTH_URL" "Health" || failed=$((failed + 1))
check_endpoint "$READY_URL" "Ready" || failed=$((failed + 1))

# Check the Prometheus metrics endpoint (a failure here is only a warning)
status=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$METRICS_URL")
if [ "$status" -eq 200 ]; then
    echo "[OK] Metrics: $status"
else
    echo "[WARN] Metrics: $status"
fi

# Check the database connection (via the application log)
if grep -q "database opened" logs/app.log 2>/dev/null; then
    echo "[OK] Database: connected"
else
    echo "[FAIL] Database: not connected"
    failed=$((failed + 1))
fi

exit "$failed"
```
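
A check that looks only at the HTTP status code treats a `degraded` service as healthy, because both `ok` and `degraded` return 200. If that distinction matters, the response body has to be inspected as well. The sketch below shows one way to classify a `/health` response; `classify` and its OK/WARN/FAIL labels are hypothetical, and the body is assumed to follow the JSON format from section 1.1.

```python
import json


def classify(http_code, body):
    """Return 'OK', 'WARN', or 'FAIL' for a /health response."""
    if http_code != 200:
        return "FAIL"
    try:
        status = json.loads(body).get("status")
    except (ValueError, TypeError, AttributeError):
        return "FAIL"  # body is missing or is not a JSON object
    if status == "ok":
        return "OK"
    if status == "degraded":
        return "WARN"  # still serving, but a dependency is unavailable
    return "FAIL"


print(classify(200, '{"status": "degraded"}'))
```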
---

## 7. Example Kubernetes Deployment Configuration

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: user-management
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3

          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3

          ports:
            - name: http
              containerPort: 8080
            - name: metrics
              containerPort: 9090

          resources:
            requests:
              memory: "256Mi"
              cpu: "200m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
```
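
The probe settings determine how quickly Kubernetes reacts to a failing container. As a rough estimate that ignores kubelet scheduling jitter, the time between a container starting to fail and the failure threshold being reached lies between `periodSeconds * (failureThreshold - 1)` and `periodSeconds * failureThreshold + timeoutSeconds`. The helper below is a hypothetical back-of-the-envelope calculation, not a Kubernetes API.

```python
def detection_window(period_seconds, failure_threshold, timeout_seconds):
    """Return rough (fastest, slowest) seconds until the threshold is hit."""
    # Fastest case: a probe fires right as the failure starts and fails at once.
    fastest = period_seconds * (failure_threshold - 1)
    # Slowest case: the failure starts just after a probe, and the last
    # failing probe runs into its full timeout.
    slowest = period_seconds * failure_threshold + timeout_seconds
    return fastest, slowest


# Values taken from the probe configuration in this section.
print("liveness:", detection_window(15, 3, 5))
print("readiness:", detection_window(10, 3, 3))
```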
---

*Last updated: 2026-05-10*