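# UMS Site Reliability Engineering (SRE) Comprehensive Solution

> Version: v1.0 | Date: 2026-04-05 | Reviewer: SRE Engineer

---

## Executive Summary

This report presents a full SRE review of the User Management System (UMS) across six dimensions: **reliability baseline, observability maturity, alerting, chaos engineering capability, capacity planning, and operations automation**.

**Current overall reliability rating: ⚠️ 4.5/10 (development-ready, not production-ready)**

| Dimension | Current | Target | Priority |
|------|--------|--------|--------|
| SLO definition | 0/10 | 8/10 | 🔴 P0 |
| Observability maturity | 3/10 | 8/10 | 🔴 P0 |
| Alerting | 4/10 | 8/10 | 🔴 P0 |
| Error budget management | 0/10 | 7/10 | 🔴 P0 |
| Chaos engineering | 1/10 | 6/10 | 🟡 P1 |
| Capacity planning | 2/10 | 7/10 | 🟡 P1 |
| Operations automation | 3/10 | 8/10 | 🟡 P1 |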

---

## 1. System Architecture Review

### 1.1 Architecture Topology

```
┌─────────────────────────────────────────────────┐
│                  Frontend Layer                 │
│  React 18 + TypeScript + Ant Design 5           │
│  (Vite build, no SSR)                           │
└──────────────────────┬──────────────────────────┘
                       │ HTTP/REST
┌──────────────────────▼──────────────────────────┐
│                    API Layer                    │
│  Gin HTTP Server (port 8080)                    │
│  • Auth middleware       • Rate-limit middleware│
│  • IP filter middleware  • Audit-log middleware │
└──────────┬──────────────────────┬───────────────┘
           │                      │
┌──────────▼────────┐  ┌─────────▼──────────────┐
│ Service Layer     │  │ Cache Layer            │
│  • AuthService    │  │ L1: in-memory LRU      │
│  • UserService    │  │     (10000 entries)    │
│  • DeviceService  │  │ L2: Redis (optional,   │
│  • AnomalyDetector│  │     not enabled)       │
└──────────┬────────┘  └────────────────────────┘
           │
┌──────────▼────────────────────────────────────┐
│                   Data Layer                  │
│  SQLite (current runtime; production must     │
│  migrate to PostgreSQL)                       │
│  GORM ORM                                     │
└───────────────────────────────────────────────┘
```

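### 1.2 Existing Reliability Capabilities (Strengths)

| Capability | Status |
|------|------|
| Health check endpoints | ✅ `/health`, `/health/live`, `/health/ready` |
| Prometheus metrics | ✅ Defined in metrics.go, but **not exposed through the router** |
| Alertmanager configuration | ✅ Alert rule files exist, but rely on placeholders |
| Grafana dashboards | ✅ JSON files exist |
| Graceful shutdown | ✅ 15s timeout, plus a dedicated 5s for Webhook |
| Rate limiting | ✅ Three tiers: login / registration / API |
| Anomaly detection | ✅ AnomalyDetector is wired in |
| Token rotation | ✅ Rolling Refresh Token rotation |
| Operation logs | ✅ Middleware-level audit logging |
| Database backup drill | ✅ Script exists |

### 1.3 Critical Reliability Problems (Weaknesses)

The findings are itemized in Section 2 below.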

---

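## 2. Critical Issue Checklist

### 🔴 CRIT-01: Prometheus metrics endpoint is not registered in the router

**Problem:** `metrics.go` defines a full set of Prometheus metrics, but neither `main.go` nor `router.go` **registers a `/metrics` endpoint**. The monitoring system therefore collects no data at all.

```go
// Missing from main.go:
// engine.GET("/metrics", gin.WrapH(promhttp.HandlerFor(registry, promhttp.HandlerOpts{})))
// Today /health only returns {"status":"ok"}; nothing is served in Prometheus format
```

**Impact:** The Alertmanager rules are dead letters, the Grafana dashboards show no data, and every monitoring alert is ineffective.

**Fix priority:** P0, must be fixed immediately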

---

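### 🔴 CRIT-02: PrometheusMiddleware is not mounted on the router

**Problem:** `monitoring/middleware.go` defines `PrometheusMiddleware`, but the `Setup()` method in `router.go` **never calls it**, so HTTP request counts and latency metrics all stay at zero.

**Impact:** The three core alerts `HighErrorRate`, `HighResponseTime`, and `UnusualAPIRequestRate` can never fire.

**Fix priority:** P0

---

### 🔴 CRIT-03: No SLOs at all

**Problem:** The system defines no SLO (service level objective). Without SLOs:
- Nobody knows what error rate counts as "acceptable"
- The error budget cannot be computed, so it cannot guide release decisions
- Alert thresholds have no business justification (the current 5% error-rate threshold was chosen arbitrarily)

**Impact:** The whole reliability engineering effort has no foundation.

**Fix priority:** P0

---

### 🔴 CRIT-04: Email-only alerting, no on-call escalation path

**Problem:** `alertmanager.yml` only configures email_configs, and every recipient address is a placeholder such as `${ALERTMANAGER_CRITICAL_TO}`. In production:
- There is no instant notification channel (DingTalk / Feishu / PagerDuty / WeCom)
- There is no on-call rotation
- Critical and Warning alerts both go out as email, with no differentiated response

**Impact:** If the system goes down at 3 a.m., the on-call engineer is not woken up in time.

**Fix priority:** P0

---

### 🔴 CRIT-05: SQLite as the runtime database (single point of failure)

**Problem:** `config.yaml` currently selects SQLite, which means:
- No primary/replica replication and no read/write splitting
- Writes are serialized (concurrency stays limited even in WAL mode)
- No horizontal scaling
- A single file is the single point of failure

**Impact:** Any disk failure or process crash makes the service completely unavailable (SPOF).

**Fix priority:** P0 (migration to PostgreSQL is mandatory before production launch)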

---

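### 🟡 WARN-01: L1 Cache updateAccessOrder runs in O(n)

**Problem:** The `updateAccessOrder` method in `l1.go` does a linear scan, so it runs in O(n). When the cache is close to its 10000-entry limit, every cache read can cost up to 10000 comparisons.

```go
// Current implementation: O(n) linear scan
func (c *L1Cache) updateAccessOrder(key string) {
    for i, k := range c.accessOrder {  // up to 10000 iterations in the worst case
        if k == key { ... }
    }
}
```

**Impact:** Under high concurrency the cache layer becomes the bottleneck and P99 latency rises noticeably.

**Fix priority:** P1. Switch to a container/list doubly linked list plus a map for an O(1) LRU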
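
The recommended structure can be sketched as follows. This is a minimal illustration, not the project's `L1Cache`: the names `LRU`, `entry`, `Get`, and `Put` are hypothetical, and the sketch leaves out locking and TTL handling that the real cache presumably needs.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is the value stored in each list element.
type entry struct {
	key   string
	value any
}

// LRU is a fixed-capacity cache; the list front is the most recently used entry.
// It is not safe for concurrent use; guard it with a sync.Mutex if shared.
type LRU struct {
	capacity int
	order    *list.List
	items    map[string]*list.Element
}

func NewLRU(capacity int) *LRU {
	return &LRU{capacity: capacity, order: list.New(), items: make(map[string]*list.Element)}
}

// Get returns the value and marks the key as most recently used in O(1).
func (c *LRU) Get(key string) (any, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// Put inserts or updates a key and evicts the least recently used entry when full.
func (c *LRU) Put(key string, value any) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	if c.capacity > 0 && c.order.Len() >= c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value})
}

func main() {
	cache := NewLRU(2)
	cache.Put("a", 1)
	cache.Put("b", 2)
	cache.Get("a")    // "a" becomes the most recently used key
	cache.Put("c", 3) // evicts "b", the least recently used key
	_, okA := cache.Get("a")
	_, okB := cache.Get("b")
	fmt.Println(okA, okB) // true false
}
```

The map gives O(1) lookup of the list element, and `MoveToFront` relinks pointers without scanning, which removes the per-read linear pass described above.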

---

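### 🟡 WARN-02: Health check does not probe the Redis connection

**Problem:** The `Check()` method in `health.go` only checks the database and never checks the Redis connection (relevant once the L2 cache is enabled). A Redis failure degrades caching, yet the health check still reports UP.

**Fix priority:** P1

---

### 🟡 WARN-03: Webhook service has Enabled hard-coded to false

**Problem:** In `main.go`:
```go
webhookService := service.NewWebhookService(db.DB, service.WebhookServiceConfig{
    Enabled: false,  // ← hard-coded; webhook.enabled=true in config.yaml is ignored
})
```
**Impact:** Webhooks are in fact completely disabled, which contradicts the configuration file.

**Fix priority:** P1

---

### 🟡 WARN-04: No distributed tracing

**Problem:** `config.yaml` sets `monitoring.tracing.enabled: false`, so the system has no tracing capability. When a request passes through several services, its path cannot be followed.

**Impact:** Mean time to recovery (MTTR) grows substantially when debugging cross-service problems.

**Fix priority:** P1

---

### 🟡 WARN-05: Structured logging is only partially implemented

**Problem:** `config.yaml` specifies JSON log output, but the code makes heavy use of `log.Printf` (Go standard library), which carries no context fields such as trace_id, request_id, or user_id.

**Impact:** Logs cannot be aggregated or queried effectively, which makes troubleshooting hard.

**Fix priority:** P1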

---

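### 🟢 INFO-01: Unbounded growth of the rate-limiter map (legacy)

**Problem:** An earlier code review noted a risk that the rate limiter's map grows without bound. Confirm whether the current implementation has fixed this.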

---

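## 3. SLO Definition and Error Budget

### 3.1 SLO Framework

```yaml
# ums-slo.yaml - service level objectives for the User Management System
service: user-management-system
owner: platform-team
review_cycle: 30d

slos:
  # SLO-1: API availability
  - name: api-availability
    description: "Proportion of valid HTTP requests that return a non-5xx response"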
    sli:
      metric: |
        (
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
        )
    target: 99.9%          # allows about 43.2 minutes of unavailability per 30 days
    window: 30d
    error_budget_minutes: 43.2  # 0.1% of the 30d window (43,200 minutes)
    burn_rate_alerts:
      - name: fast-burn-critical
        severity: critical
        short_window: 5m
        long_window: 1h
        burn_rate_factor: 14.4   # consumes 2% of the error budget in 1 hour
        page: true
      - name: slow-burn-warning
        severity: warning
        short_window: 30m
        long_window: 6h
        burn_rate_factor: 6      # consumes 5% of the error budget in 6 hours
        page: false

  # SLO-2: API response latency
  - name: api-latency
    description: "Proportion of requests that complete in under 500ms"
    sli:
      metric: |
        (
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count[5m]))
        )
    target: 99%
    window: 30d
    critical_paths:
      - path: "/api/v1/auth/login"
        target: 99.5%
        latency_p99: 300ms
      - path: "/api/v1/auth/refresh"
        target: 99.9%
        latency_p99: 100ms
    burn_rate_alerts:
      - name: latency-fast-burn
        severity: warning
        short_window: 5m
        long_window: 1h
        burn_rate_factor: 14.4

  # SLO-3: login success rate
  - name: login-success-rate
    description: "Proportion of login requests that succeed (no system error)"
    sli:
      metric: |
        (
          sum(rate(user_logins_total{status="success"}[5m]))
          /
          sum(rate(user_logins_total[5m]))
        )
    target: 99%
    window: 30d
    notes: "Expected failures caused by brute-force attempts do not count as SLO violations"

  # SLO-4: database query latency
  - name: db-query-latency
    description: "Proportion of database queries that complete in under 100ms"
    sli:
      metric: |
        (
          sum(rate(db_query_duration_seconds_bucket{le="0.1"}[5m]))
          /
          sum(rate(db_query_duration_seconds_count[5m]))
        )
    target: 95%
    window: 30d
```

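### 3.2 Error Budget Policy

```
┌──────────────────────────────────────────────────────────────────┐
│                 Error Budget Consumption Policy                  │
├──────────────────────────────────────────────────────────────────┤
│ > 50% remaining:  release normally, iterate quickly              │
│ 25-50% remaining: review each release's risk, add testing        │
│ 10-25% remaining: freeze non-critical features, fix reliability  │
│ < 10% remaining:  reliability fixes only, start a postmortem     │
│ exhausted:        halt all feature releases until next cycle     │
└──────────────────────────────────────────────────────────────────┘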
```
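
If the policy is enforced by a release gate, the tiers reduce to a small lookup. The sketch below is one possible encoding; `releasePolicy` is a hypothetical name, and which tier owns the exact boundaries (50%, 25%, 10%) is an assumption, since the policy table does not say.

```go
package main

import "fmt"

// releasePolicy maps the remaining error budget (0.0 to 1.0) to a policy tier.
// Boundary ownership is an assumption of this sketch.
func releasePolicy(remaining float64) string {
	switch {
	case remaining <= 0:
		return "halt all feature releases until the next cycle"
	case remaining < 0.10:
		return "reliability fixes only; start a postmortem"
	case remaining < 0.25:
		return "freeze non-critical feature releases"
	case remaining <= 0.50:
		return "review the risk of every release"
	default:
		return "release normally"
	}
}

func main() {
	for _, r := range []float64{0.8, 0.4, 0.2, 0.05, 0} {
		fmt.Printf("%3.0f%% remaining -> %s\n", r*100, releasePolicy(r))
	}
}
```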

---

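## 4. Observability Improvement Plan

### 4.1 The Three Pillars: Current State vs. Target

| Pillar | Current | Target | Gap |
|------|------|------|------|
| **Metrics** | Defined but not exposed | Full Prometheus + Grafana | Register the route + add business metrics |
| **Logs** | Standard-library log.Printf | Structured JSON + context fields | Adopt slog/zap + standardize fields |
| **Traces** | Absent | OpenTelemetry tracing | Full rollout |

### 4.2 Metrics to Add

**Key metrics that are currently missing:**

```go
// Prometheus metrics that need to be added
var (
    // Error budget burn rate (derived directly from the SLO)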
    errorBudgetBurnRate = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "error_budget_burn_rate",
            Help: "Current error budget burn rate multiplier",
        },
        []string{"slo"},
    )

    // Cache hit rate (referenced by the alert rules but not defined yet)
    cacheHitsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "cache_hits_total",
            Help: "Total cache hits",
        },
        []string{"level", "operation"},  // level: l1/l2
    )

    cacheOperationsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "cache_operations_total",
            Help: "Total cache operations",
        },
        []string{"level", "operation"},
    )

    // Database connection pool state (referenced by alerts but not defined)
    dbConnectionsActive = prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "db_connections_active",
            Help: "Active database connections",
        },
    )

    dbConnectionsMax = prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "db_connections_max",
            Help: "Maximum database connections",
        },
    )

    // Token refresh operations
    tokenRefreshTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "token_refresh_total",
            Help: "Total token refresh attempts",
        },
        []string{"status"},  // success/failure/rate_limited
    )

    // Account lockout events
    accountLockTotal = prometheus.NewCounter(
        prometheus.CounterOpts{
            Name: "account_lock_total",
            Help: "Total account lockout events",
        },
    )

    // Anomalous login detections
    anomalyDetectedTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "anomaly_detected_total",
            Help: "Total anomaly login detections",
        },
        []string{"type"},  // geo_anomaly/device_anomaly/brute_force
    )
)
```

### 4.3 Structured Logging Plan

**Log field standard:**

```go
// Context fields that every log entry must carry
type LogContext struct {
    TraceID   string `json:"trace_id"`    // OpenTelemetry trace
    SpanID    string `json:"span_id"`
    RequestID string `json:"request_id"`  // X-Request-ID header
    UserID    string `json:"user_id,omitempty"`
    IP        string `json:"ip"`
    Method    string `json:"method"`
    Path      string `json:"path"`
    Duration  int64  `json:"duration_ms"`
    Status    int    `json:"status"`
    Error     string `json:"error,omitempty"`
}

// Fields specific to security events
type SecurityLogEvent struct {
    EventType   string `json:"event_type"`   // login_failed/brute_force/anomaly
    Severity    string `json:"severity"`      // low/medium/high/critical
    UserID      string `json:"user_id,omitempty"`
    IP          string `json:"ip"`
    DeviceID    string `json:"device_id,omitempty"`
    Details     string `json:"details"`
}
```

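**Recommended: adopt `log/slog` (Go 1.21+):**

```go
// Replace log.Printf with slog
import (
    "log/slog"
    "os"
    "time"

    "github.com/gin-gonic/gin"
    "github.com/google/uuid" // assumed source of uuid.New()
)

// initLogger installs a JSON slog handler as the process-wide default logger.
func initLogger() {
    logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
        Level:     slog.LevelInfo,
        AddSource: false,
    }))
    slog.SetDefault(logger)
}

// StructuredLogger is a Gin middleware that injects a request_id and
// writes one structured log entry per request.
func StructuredLogger() gin.HandlerFunc {
    return func(c *gin.Context) {
        // Reuse the caller's X-Request-ID when present, otherwise generate one
        requestID := c.GetHeader("X-Request-ID")
        if requestID == "" {
            requestID = uuid.New().String()
        }
        c.Set("request_id", requestID)
        c.Header("X-Request-ID", requestID)

        start := time.Now()
        c.Next()

        slog.Info("http_request",
            "request_id", requestID,
            "method", c.Request.Method,
            "path", c.FullPath(), // route template, empty for unmatched routes
            "status", c.Writer.Status(),
            "duration_ms", time.Since(start).Milliseconds(),
            "ip", c.ClientIP(),
            "user_id", c.GetString("user_id"),
        )
    }
}
```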

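### 4.4 OpenTelemetry Distributed Tracing Integration

```go
// Minimal tracing setup
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
)

// initTracing configures an OTLP/HTTP exporter and returns a shutdown function.
// endpoint is host:port without a scheme (for example "localhost:4318").
func initTracing(endpoint string, serviceName string) (func(), error) {
    exporter, err := otlptracehttp.New(context.Background(),
        otlptracehttp.WithEndpoint(endpoint),
        otlptracehttp.WithInsecure(), // plain HTTP; use TLS outside local setups
    )
    if err != nil {
        return nil, err
    }

    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        // Tag every span with the service name so backends can group traces
        trace.WithResource(resource.NewSchemaless(
            attribute.String("service.name", serviceName),
        )),
        trace.WithSampler(trace.ParentBased(trace.TraceIDRatioBased(0.1))), // 10% sampling
    )
    otel.SetTracerProvider(tp)

    return func() { _ = tp.Shutdown(context.Background()) }, nil
}
```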

---

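## 5. Alerting Improvements

### 5.1 Alert Severity Matrix

| Level | Definition | Response time | Channels | Examples |
|------|------|----------|----------|------|
| **P0-CRITICAL** | Service fully unavailable, all users affected | Within 5 minutes | Phone + Feishu + SMS | Health check failing, database down |
| **P1-CRITICAL** | Core functionality degraded, error budget burning fast | Within 15 minutes | Feishu + SMS | Login success rate < 95%, P99 > 2s |
| **P2-WARNING** | Performance degraded, error budget burning slowly | Within 1 hour | Feishu | Low cache hit rate, memory > 80% |
| **P3-INFO** | Abnormal trend worth watching | During business hours | Email | Unusual online-user count, unusual API volume |

### 5.2 Burn-Rate Alerts Based on the Error Budget (Replacing the Current Threshold Alerts)

**Current problem:** The alerts in `alerts.yml` use fixed thresholds (for example "error rate > 5%"), which has two drawbacks:
1. **Many false positives**: a brief traffic blip fires an alert, which leads to alert fatigue
2. **Many false negatives**: a small but sustained overshoot exhausts the error budget without ever firing an alert

**Improvement: use burn-rate alerts**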

```yaml
# Improved alerts.yml - based on SLO burn rate
groups:
  - name: ums-slo-burn-rate
    rules:
      # === SLO-1: API availability burn-rate alerts ===
      # Fast burn: 2% of the monthly error budget consumed in 1 hour → alert immediately
      - alert: APIAvailability_FastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > (1 - 0.999) * 14.4
          AND
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (1 - 0.999) * 14.4
        for: 2m
        labels:
          severity: critical
          slo: api-availability
          page: "true"
        annotations:
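          summary: "🔴 API availability SLO fast burn: respond immediately"
          description: |
            The error budget is burning at 14.4x the sustainable rate
            Current error rate: {{ $value | humanizePercentage }}
            If this continues for 1 hour, 2% of the monthly error budget is consumed
            Remaining error budget: see the Grafana dashboard
            Runbook: https://docs/runbook/api-availability

      # Slow burn: 5% of the monthly error budget consumed in 6 hours → warning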
      - alert: APIAvailability_SlowBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          ) > (1 - 0.999) * 6
          AND
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > (1 - 0.999) * 6
        for: 15m
        labels:
          severity: warning
          slo: api-availability
          page: "false"
        annotations:
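          summary: "🟡 API availability SLO slow burn: needs attention"
          description: |
            The error budget is burning at 6x the sustainable rate
            If this continues for 6 hours, 5% of the monthly error budget is consumed

      # === SLO-2: latency burn-rate alert ===
      # The burn rate applies to the share of slow requests (> 500ms), not to the latency value
      - alert: APILatency_FastBurn
        expr: |
          (
            1 - sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
                /
                sum(rate(http_request_duration_seconds_count[5m]))
          ) > (1 - 0.99) * 14.4
          AND
          (
            1 - sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h]))
                /
                sum(rate(http_request_duration_seconds_count[1h]))
          ) > (1 - 0.99) * 14.4
        for: 2m
        labels:
          severity: critical
          slo: api-latency
          page: "true"
        annotations:
          summary: "🔴 API latency SLO fast burn"
          description: "Share of requests slower than 500ms: {{ $value | humanizePercentage }} (SLO budget: 1%)"

      # === Infrastructure alerts (threshold-based, kept as is) ===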
      - alert: ServiceDown
        expr: up{job="user-management"} == 0
        for: 1m
        labels:
          severity: critical
          page: "true"
        annotations:
          summary: "🚨 Service instance down"
          description: "{{ $labels.instance }} has been offline for more than 1 minute"

      - alert: DatabaseDown
        expr: |
          sum(rate(http_requests_total{status="503"}[2m])) > 0
        for: 1m
        labels:
          severity: critical
          page: "true"
        annotations:
          summary: "🚨 Database connection failure"

      - alert: HighLoginFailureRate_BruteForce
        expr: |
          sum(rate(user_logins_total{status="failed"}[5m]))
          /
          sum(rate(user_logins_total[5m])) > 0.5
        for: 3m
        labels:
          severity: critical
          category: security
        annotations:
          summary: "🔐 Suspected brute-force attack"
          description: "Login failure rate: {{ $value | humanizePercentage }}, above 50%"

      - alert: TokenRefreshFailureSpike
        expr: |
          sum(rate(token_refresh_total{status="failure"}[5m])) > 10
        for: 2m
        labels:
          severity: warning
          category: auth
        annotations:
          summary: "Spike in token refresh failures"

      - alert: AnomalyDetectionSpike
        expr: |
          sum(rate(anomaly_detected_total[5m])) > 5
        for: 2m
        labels:
          severity: warning
          category: security
        annotations:
          summary: "Spike in anomalous login detections, possible attack"
```
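
The two-window condition used by the rules above can be restated in a few lines of code. The sketch below illustrates the logic only; it is not how Prometheus evaluates the rules, and the function names are hypothetical.

```go
package main

import "fmt"

// burnRate converts an observed error ratio into a multiple of the sustainable
// rate for the given SLO target (1.0 means the budget lasts exactly one window).
func burnRate(errorRatio, target float64) float64 {
	return errorRatio / (1 - target)
}

// shouldAlert fires only when both the short and the long window exceed factor,
// so a brief spike (short window only) or an incident that has already
// recovered (long window only) stays quiet.
func shouldAlert(shortRatio, longRatio, target, factor float64) bool {
	return burnRate(shortRatio, target) > factor && burnRate(longRatio, target) > factor
}

func main() {
	const target = 0.999
	fmt.Println(shouldAlert(0.05, 0.001, target, 14.4))  // 5m spike, 1h window healthy: false
	fmt.Println(shouldAlert(0.02, 0.02, target, 14.4))   // sustained 2% errors (20x): true
	fmt.Println(shouldAlert(0.004, 0.004, target, 14.4)) // 4x burn, below the fast threshold: false
	fmt.Println(shouldAlert(0.007, 0.007, target, 6))    // 7x burn trips the slow-burn rule: true
}
```

The last case shows the gap a fixed 5% threshold leaves open: a steady 0.7% error rate never reaches 5%, yet at 7x burn it would use up a 30-day budget in a little over four days.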

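### 5.3 Multi-Channel Alert Receiver Configuration

```yaml
# Optimized alertmanager.yml (Feishu webhooks + email; no WeCom receiver is configured yet)
# Alertmanager does not expand ${...} itself; render the placeholders (for example with envsubst) before loading.
# webhook_configs posts Alertmanager's own JSON payload, so a Feishu bot URL needs a relay that converts the format.
global:
  resolve_timeout: 5m
  slack_api_url: '${ALERTMANAGER_SLACK_API_URL}'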

route:
  group_by: ['alertname', 'slo', 'category']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # P0: page on-call immediately (this receiver sends Feishu + email)
    - match:
        page: "true"
      receiver: 'oncall-page'
      group_wait: 10s
      repeat_interval: 1h
      continue: true

    # Security events: dedicated channel for the security team
    - match:
        category: security
      receiver: 'security-team'
      group_wait: 30s
      continue: true

    # Warning: alert group chat
    - match:
        severity: warning
      receiver: 'warning-channel'
      continue: false

receivers:
  - name: 'oncall-page'
    webhook_configs:
      - url: '${FEISHU_WEBHOOK_URL}'
        send_resolved: true
        http_config:
          bearer_token: '${FEISHU_TOKEN}'
    email_configs:
      - to: '${ONCALL_EMAIL}'
        from: '${ALERT_FROM}'
        smarthost: '${SMTP_HOST}'

  - name: 'security-team'
    webhook_configs:
      - url: '${SECURITY_FEISHU_WEBHOOK_URL}'
        send_resolved: true

  - name: 'warning-channel'
    webhook_configs:
      - url: '${WARNING_FEISHU_WEBHOOK_URL}'
        send_resolved: true

  - name: 'default'
    email_configs:
      - to: '${ALERTMANAGER_DEFAULT_TO}'
        from: '${ALERTMANAGER_FROM}'
        smarthost: '${ALERTMANAGER_SMARTHOST}'

inhibit_rules:
  # A critical alert suppresses warning alerts for the same SLO.
  # Alerts that both lack the slo label also count as equal.
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['slo']
```

---

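## 6. Chaos Engineering Plan

### 6.1 Chaos Engineering Roadmap

```
Phase 1 (now): Game Day
  └── Manual fault injection + observe system behavior
  └── Goal: discover unknown failure modes

Phase 2 (in 1 month): scripted fault injection
  └── PowerShell/Shell scripts
  └── Goal: repeatable verification

Phase 3 (in 3 months): Continuous Chaos
  └── Scheduled, automated fault injection
  └── Goal: regression protection
```

### 6.2 Fault Injection Experiment List

| Experiment ID | Fault type | Injection method | Expected behavior | Verification metric |
|---------|----------|----------|----------|----------|
| CE-001 | Database unavailable | Close the SQLite file handle | Returns 503, health check drops to DOWN | `health_check_status == DOWN` |
| CE-002 | Redis unavailable | Stop the Redis service | Falls back to the L1 cache, business continues | No significant rise in error rate |
| CE-003 | High memory pressure | Inject a leaking goroutine | GC keeps running, no OOM | `system_goroutines`, memory alerts |
| CE-004 | Network latency | Add an artificial sleep | P99 latency alert fires | `APILatency_FastBurn` fires |
| CE-005 | Massive concurrent logins | Load-testing tool | Rate limiting works correctly | Share of 429 responses on the login endpoint |
| CE-006 | JWT secret rotation | Change the config and restart | Existing tokens are invalidated gracefully | 401 rate rises briefly, then recovers |
| CE-007 | Crash recovery | SIGKILL the process | State is restored after restart | Time to restore availability |
| CE-008 | Brute-force attack | High-frequency failed logins via ab/wrk | Account lockout + IP ban | `HighLoginFailureRate_BruteForce` |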
### 6.3 Chaos Experiment Script (CE-005: Concurrent Login Load Test)

```powershell
# scripts/chaos/ce-005-concurrent-login.ps1
# Goal: verify that rate limiting works under high concurrency

param(
    [string]$BaseURL = "http://localhost:8080",
    [int]$Concurrency = 50,
    [int]$Duration = 30
)

Write-Host "=== CE-005: concurrent login load test ==="
Write-Host "Target: $BaseURL"
Write-Host "Concurrency: $Concurrency"

$results = @{
    total = 0
    success = 0
    rate_limited = 0
    other_error = 0
}

$jobs = 1..$Concurrency | ForEach-Object {
    Start-Job -ScriptBlock {
        param($BaseURL, $Duration)
        $end = (Get-Date).AddSeconds($Duration)
        $local_results = @{ total=0; success=0; rate_limited=0; error=0 }
        
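        while ((Get-Date) -lt $end) {
            $local_results.total++
            try {
                $body = @{
                    account = "testuser_$((Get-Random -Max 1000))"
                    password = "wrongpassword"
                } | ConvertTo-Json

                # Non-2xx responses raise an exception, so 401/429 are handled in catch
                $resp = Invoke-WebRequest -Uri "$BaseURL/api/v1/auth/login" `
                    -Method POST -Body $body -ContentType "application/json" `
                    -UseBasicParsing -ErrorAction Stop

                if ($resp.StatusCode -eq 200) { $local_results.success++ }
                else { $local_results.error++ }
            } catch {
                # Read the HTTP status from the exception (0 when no response arrived)
                $code = 0
                if ($_.Exception.Response) { $code = [int]$_.Exception.Response.StatusCode }
                if ($code -eq 429) { $local_results.rate_limited++ }
                else { $local_results.error++ }
            }
        }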
        return $local_results
    } -ArgumentList $BaseURL, $Duration
}

$jobs | Wait-Job | ForEach-Object {
    $r = Receive-Job $_
    $results.total += $r.total
    $results.success += $r.success
    $results.rate_limited += $r.rate_limited
    $results.other_error += $r.error
}

Write-Host "`n=== Load test results ==="
Write-Host "Total requests: $($results.total)"
Write-Host "Succeeded: $($results.success)"
Write-Host "Rate limited (429): $($results.rate_limited)"
Write-Host "Other errors: $($results.other_error)"
Write-Host "Rate-limited share: $([math]::Round($results.rate_limited / [math]::Max($results.total,1) * 100, 2))%"

# Verification: rate limiting should have kicked in
if ($results.rate_limited -gt 0) {
    Write-Host "`n✅ Experiment passed: rate limiting works" -ForegroundColor Green
} else {
    Write-Host "`n❌ Experiment failed: rate limiting never triggered, check the configuration" -ForegroundColor Red
    exit 1
}
```

---

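## 7. Capacity Planning

### 7.1 Current Resource Baseline

| Resource | Current configuration | Estimated capacity | Bottleneck risk |
|------|----------|----------|----------|
| Concurrent users | Not measured | ~500 (estimate) | Database write lock (SQLite) |
| Memory | Not monitored | <500MB | High |
| L1 Cache | 10000 entries | ~100MB | Low |
| Rate limit | 1000 req/min | 16.7 req/s | Depends on the workload |
| DB connection pool | Not configured (GORM default) | 10 concurrent | High |

### 7.2 Scaling Roadmap

```
Current state (single-node SQLite)
    ↓ Migration trigger: concurrent users > 100 or write QPS > 50
PostgreSQL single primary
    ↓ Scaling trigger: read/write ratio > 4:1 or primary CPU > 60%
PostgreSQL primary/replica (read/write splitting)
    ↓ Scaling trigger: a single node cannot sustain peak load
PostgreSQL connection pooling (PgBouncer) + read replicas
```

### 7.3 Recommended Database Connection Pool Settings

```yaml
# Recommended config.yaml settings (after migrating to PostgreSQL)
database:
  postgresql:
    max_open_conns: 50      # about 1/3 of PostgreSQL max_connections
    max_idle_conns: 10      # keep at 20% of max_open_conns
    conn_max_lifetime: 1h   # guards against connection leaks
    conn_max_idle_time: 5m  # reclaims idle connections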
```
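
The two sizing rules in the comments (one third of `max_connections`, 20% idle) can be checked against the server's actual limit with a few lines of arithmetic. `poolSizes` is a hypothetical helper written for this illustration.

```go
package main

import "fmt"

// poolSizes applies the two rules of thumb from the config comments:
// max_open_conns is about 1/3 of PostgreSQL max_connections, and
// max_idle_conns is 20% of max_open_conns. Both are kept at 1 or more.
func poolSizes(pgMaxConnections int) (maxOpen, maxIdle int) {
	maxOpen = pgMaxConnections / 3
	if maxOpen < 1 {
		maxOpen = 1
	}
	maxIdle = maxOpen / 5
	if maxIdle < 1 {
		maxIdle = 1
	}
	return maxOpen, maxIdle
}

func main() {
	for _, n := range []int{100, 150, 300} {
		open, idle := poolSizes(n)
		fmt.Printf("max_connections=%d -> max_open_conns=%d, max_idle_conns=%d\n", n, open, idle)
	}
}
```

Under these rules the recommended 50/10 pair corresponds to a server with `max_connections` of 150; a server left at 100 would come out at 33/6. With GORM, the values are typically applied on the `*sql.DB` returned by `db.DB()` through `SetMaxOpenConns`, `SetMaxIdleConns`, `SetConnMaxLifetime`, and `SetConnMaxIdleTime`.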

---

## 8. P0 Fix Implementation Plan

### 8.1 Fix Immediately (This Week)

#### Fix-1: Expose the Prometheus metrics endpoint

Register the `/metrics` endpoint on the router used by `cmd/server/main.go`. The change goes into `Setup()` in `router.go`:

```go
// Add to Setup() in router.go, before the v1 group is created
import (
    "github.com/gin-gonic/gin"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    "github.com/user-management-system/internal/monitoring"
)

// New in Setup(): mount the middleware first so every later route is measured
metrics := monitoring.GetGlobalMetrics()
r.engine.Use(monitoring.PrometheusMiddleware(metrics))
r.engine.GET("/metrics", gin.WrapH(
    promhttp.HandlerFor(metrics.GetRegistry(), promhttp.HandlerOpts{
        EnableOpenMetrics: true,
    }),
))
```

#### Fix-2: Add a Redis check to the health check

```go
// health.go: add a Redis check
func (h *HealthCheck) Check() *Status {
    status := &Status{
        Status: HealthStatusUP,
        Checks: make(map[string]CheckResult),
    }

    dbResult := h.checkDatabase()
    status.Checks["database"] = dbResult
    if dbResult.Status != HealthStatusUP {
        status.Status = HealthStatusDOWN
    }

    // New: Redis check (only when Redis is enabled)
    if h.redisClient != nil {
        redisResult := h.checkRedis()
        status.Checks["redis"] = redisResult
        // A Redis outage is treated as degraded and does not change the main service status,
        // but it is recorded as WARN
    }

    return status
}
```
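
The snippet above records the Redis result but leaves the overall status untouched. If an explicit degraded state is wanted, the aggregation could look like the sketch below. It is a standalone illustration: `StatusDegraded`, `check`, and `aggregate` are hypothetical and not part of the existing `health.go`.

```go
package main

import "fmt"

type HealthStatus string

const (
	StatusUp       HealthStatus = "UP"
	StatusDegraded HealthStatus = "DEGRADED" // hypothetical value, not in the original health.go
	StatusDown     HealthStatus = "DOWN"
)

// check describes the result of probing one dependency.
type check struct {
	name     string
	healthy  bool
	critical bool // a failed critical check takes the whole service DOWN
}

// aggregate folds the individual results into one overall status:
// any failed critical check gives DOWN, any failed optional check gives DEGRADED.
func aggregate(checks []check) HealthStatus {
	overall := StatusUp
	for _, c := range checks {
		switch {
		case c.healthy:
			continue
		case c.critical:
			return StatusDown
		default:
			overall = StatusDegraded
		}
	}
	return overall
}

func main() {
	fmt.Println(aggregate([]check{{"database", true, true}, {"redis", true, false}}))  // UP
	fmt.Println(aggregate([]check{{"database", true, true}, {"redis", false, false}})) // DEGRADED
	fmt.Println(aggregate([]check{{"database", false, true}, {"redis", true, false}})) // DOWN
}
```

Whether a readiness probe should treat the degraded state as ready is a separate decision; with an L1 fallback in place, keeping the instance in rotation is the more common choice.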

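#### Fix-3: Read the Webhook service Enabled flag from configuration

```go
// main.go fix
webhookService := service.NewWebhookService(db.DB, service.WebhookServiceConfig{
    Enabled: cfg.Webhook.Enabled,  // read from configuration, no longer hard-coded
})
```

### 8.2 Complete This Month

1. Introduce structured logging (slog) in place of log.Printf
2. Add the missing Prometheus metrics (cache_hits_total and others)
3. Configure the Feishu Webhook alert channel
4. Convert alerts.yml to burn-rate alerts
5. Run chaos experiments CE-001 through CE-005 and record the results

### 8.3 Complete Next Quarter

1. Migrate SQLite → PostgreSQL (mandatory for production)
2. Integrate OpenTelemetry distributed tracing
3. Build the SLO dashboard (Grafana)
4. Put the error budget policy into effect and make it part of the release process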

---

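## 9. Runbooks

### Runbook-01: API Availability Drop

**Trigger:** the `APIAvailability_FastBurn` alert fires

**Response steps:**
1. Check readiness: `curl http://<service-address>/health/ready`
2. Check recent deployments: `git log --oneline -10`
3. Check the database: `curl http://<service-address>/health | jq .checks.database`
4. Check the error log: `tail -100 logs/app.log | grep "ERROR"`
5. If the database is unhealthy → run the database recovery procedure
6. If there was a recent deployment → evaluate a rollback: `git revert HEAD`
7. Publish a status update to users (if the impact lasts > 5 minutes)

**Recovery target:** MTTR < 30 minutes

---

### Runbook-02: Suspected Brute-Force Attack

**Trigger:** the `HighLoginFailureRate_BruteForce` alert fires

**Response steps:**
1. Identify the attacking IPs: check the login log via `GET /api/v1/logs/login`
2. Confirm that the IP ban has taken effect: look at `anomaly_detected_total{type="brute_force"}`
3. If the IP ban has not taken effect: add the IP to the blacklist manually (ip_security configuration)
4. Notify the security team
5. Evaluate whether rate limiting should be tightened temporarily (a lower allowed request threshold)

---

### Runbook-03: Database Unavailable

**Trigger:** the `DatabaseDown` alert fires

**Response steps:**
1. Check immediately: `sqlite3 data/user_management.db ".tables"`
2. If the file is corrupted, restore from backup:
   ```powershell
   powershell -ExecutionPolicy Bypass -File scripts/ops/drill-sqlite-backup-restore.ps1
   ```
3. If the file is locked by a process: check for orphaned processes holding the file
4. Migration plan: the SQLite single point of failure is a known risk; raise the priority of the PostgreSQL migration immediately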

---

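## 10. SRE Metrics (Quarterly Review)

| Metric | Target | How it is measured |
|------|------|----------|
| **MTTR** (mean time to recovery) | < 30 minutes | Incident records |
| **MTBF** (mean time between failures) | > 720 hours | Runtime logs |
| **Error budget consumption** | < 50% per month | Prometheus |
| **Alert noise ratio** | < 10% (share of alerts that were not real problems) | Manual review |
| **Chaos experiment pass rate** | > 80% | Experiment records |
| **Runbook coverage** | Every P0 alert has a runbook | Documentation check |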

---

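## Appendix A: Recommended SRE Toolchain

| Tool | Purpose | Current state |
|------|------|----------|
| Prometheus | Metrics collection | ✅ Configured (route still needs wiring) |
| Grafana | Metrics visualization | ✅ Dashboards exist |
| Alertmanager | Alert routing | ✅ Configured (real channels still needed) |
| OpenTelemetry | Distributed tracing | ❌ Missing |
| Feishu/WeCom Webhook | Instant alerts | ❌ Missing |
| PagerDuty/oncall | On-call management | ❌ Missing |
| k6/wrk | Load testing | ❌ Missing |
| Log aggregation (Loki/ELK) | Log queries | ❌ Missing |

---

## Appendix B: Quick Health Check Commands

```powershell
# Overall system health
Invoke-RestMethod -Uri "http://localhost:8080/health/ready"

# Check the metrics endpoint (after the fix)
Invoke-RestMethod -Uri "http://localhost:8080/metrics"

# Measure auth endpoint latency (uses /api/v1/auth/capabilities, not the login route)
Measure-Command { Invoke-RestMethod -Uri "http://localhost:8080/api/v1/auth/capabilities" }

# Check rate limiting (non-2xx responses raise an exception, so read the status in catch)
1..10 | ForEach-Object {
    $n = $_
    try {
        $resp = Invoke-WebRequest -Uri "http://localhost:8080/api/v1/auth/login" `
            -Method POST -Body '{"account":"x","password":"x"}' `
            -ContentType "application/json" -ErrorAction Stop
        $code = [int]$resp.StatusCode
    } catch {
        $code = 0
        if ($_.Exception.Response) { $code = [int]$_.Exception.Response.StatusCode }
    }
    Write-Host "Request ${n}: HTTP $code"
}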
```

---

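*This report is the result of a full review by the SRE engineer; issue severity follows the Google SRE Book. All P0 issues must be fixed before launch, and P1 issues must be fixed within the next Sprint.*

*Next SLO review date: 2026-05-05*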