AI-Ops 运维看板

commit fc54ba84b20c2f129780edfa9a20d0306eae2d4c Author: phamnazage-jpg <247508310+phamnazage-jpg@users.noreply.github.com> Date: Tue May 12 17:47:32 2026 +0800 chore: initial import diff --git a/.dockerignore b/.dockerignore new file mode 100644 index 0000000..69ca2d5 --- /dev/null +++ b/.dockerignore @@ -0,0 +1,6 @@ +.git +.dockerignore +Dockerfile +docker-compose.yml +*.md +*.log diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..d11fadd --- /dev/null +++ b/.gitignore @@ -0,0 +1,10 @@ +# Local runtime artifacts +.runtime/ +backups/ +*.log +*.pid + +# Build outputs +ai-ops +ai-ops-static +coverage.out diff --git a/Dockerfile b/Dockerfile new file mode 100644 index 0000000..c2c24fc --- /dev/null +++ b/Dockerfile @@ -0,0 +1,16 @@ +# 多阶段构建 +FROM golang:1.22-alpine AS builder +WORKDIR /app +COPY go.mod go.sum ./ +RUN go mod download +COPY . . +RUN CGO_ENABLED=0 go build -buildvcs=false -o ai-ops ./cmd/ai-ops + +# 运行阶段 +FROM alpine:3.19 +RUN apk --no-cache add ca-certificates tzdata +WORKDIR /app +COPY --from=builder /app/ai-ops . +COPY config.yaml . +EXPOSE 8080 +CMD ["./ai-ops"] diff --git a/Dockerfile.podman-local b/Dockerfile.podman-local new file mode 100644 index 0000000..755be1a --- /dev/null +++ b/Dockerfile.podman-local @@ -0,0 +1,6 @@ +FROM alpine:3.19 +WORKDIR /app +COPY ai-ops . 
+COPY config.podman.yaml ./config.yaml +EXPOSE 8080 +CMD ["./ai-ops"] diff --git a/EXECUTION_BOARD.md b/EXECUTION_BOARD.md new file mode 100644 index 0000000..4e47846 --- /dev/null +++ b/EXECUTION_BOARD.md @@ -0,0 +1,200 @@ +# AI-Ops Execution Board + +> 版本：v1.6 | 日期：2026-05-12 | 状态：单机稳定版闭环已补齐，一键启动、备份、恢复、回滚演练通过 + +--- + +## 当前 Gate 状态 + +| Gate | 状态 | 说明 | +|------|------|------| +| GATE-0 编译 | ✅ **SOLID** | `go build -buildvcs=false ./...` 通过 | +| GATE-1 单测 | ✅ **SOLID** | `go test -buildvcs=false ./...` 通过 | +| GATE-2 运行 | ✅ **PODMAN_SOLID** | 已用本机 Podman Compose 跑通 PostgreSQL + Redis + App 全链路，完成健康检查、ready、登录、dashboard、alerts/rules/channels/openapi 烟测；Docker 权限环境仍待单独验证 | +| GATE-3 E2E | ⚠ **PARTIAL** | Podman Compose 全链路通过；完整 Docker Compose / 浏览器交互 / 聚合真实压测仍待有权限环境验证 | + +--- + +## 进度概览 + +| 阶段 | 内容 | 工期 | 状态 | 完成度 | +|------|------|------|------|--------| +| **Phase 1** | 监控看板 + 日志查询 | 8人天 | ✅ 可运行 | ~95% | +| **Phase 2** | 告警规则引擎 + 通知渠道 | 12人天 | ✅ 核心闭环完成 | ~92% | +| **Phase 3** | 自愈引擎 + 审计回滚 | 14人天 | ✅ 安全最小闭环完成 | ~82% | +| **全局 G1** | 认证与权限 | - | ✅ 完成 | ~90% | +| **全局 G2** | 健康检查 | - | ✅ 完成 | ~95% | +| **全局 G3** | OpenAPI 文档 | - | ✅ 完成 | ~80% | +| **CI** | GitHub Actions | - | ✅ 已补 | ~80% | + +**Go 文件数：63 | Go 代码行数：5890 | 编译：通过 | Race 测试：通过 | 单机一键启动：通过 | 备份：通过 | 恢复/回滚：通过 | 总覆盖率：83.3%** + +--- + +## 本轮收口内容 + +### 1. 告警集群聚合 + +| 项 | 状态 | 说明 | +|----|------|------| +| 同资源 1 分钟窗口聚合 | ✅ | `CreateEventWithAggregation(ctx,event,1m,20)`；同一 `resource_type/resource_id` 1 分钟内超过 20 条生成聚合告警 | +| 子告警关联 | ✅ | 子事件写入 `parent_alert_id`，聚合事件 `is_aggregated=true`、`aggregated_count=count` | +| 通知对象 | ✅ | 若触发聚合，通知发送聚合事件而不是最后一条子事件 | +| UUID 修复 | ✅ | 告警事件 ID 从 `evt_` 改为 UUID，匹配 PostgreSQL UUID 主键 | +| 测试 | ✅ | `TestAlertEngineAggregatesWhenSameResourceExceedsTwentyEventsWithinWindow` | + +### 2. 
通知日志 Service 层集成 + +| 项 | 状态 | 说明 | +|----|------|------| +| Domain model | ✅ | `internal/domain/model/notification.go` | +| Repository interface | ✅ | `NotificationLogRepository` | +| PostgreSQL 实现 | ✅ | `pg_notification_log_repository.go` | +| Service 集成 | ✅ | 每个渠道发送前创建 pending 日志；成功标记 sent；失败标记 failed，并继续备用渠道 | +| 测试 | ✅ | `TestNotificationServiceWritesLogWhenWebhookSent` | + +### 3. 自愈真实执行器安全最小闭环 + +| 动作 | 状态 | 生产约束 | +|------|------|----------| +| `switch_route` | ✅ | 调用 healing_config.endpoint，默认 POST | +| `throttle` | ✅ | 调用 healing_config.endpoint，默认 POST | +| `restart_instance` | ✅ | 必须显式配置 `allow_restart=true`，否则拒绝执行 | +| `invoke_script` | ✅ | 只允许 `script_id` + endpoint 方式，禁止原始脚本文本直接执行 | +| HTTP 方法白名单 | ✅ | 仅允许 POST/PUT/PATCH | +| Token 支持 | ✅ | healing_config.token 写入 Bearer Authorization | +| 测试 | ✅ | 成功调用 endpoint + restart 安全拒绝测试 | + +### 4. 前端页面 + +| 页面/能力 | 状态 | 说明 | +|-----------|------|------| +| `/ops/dashboard` | ✅ | 单页 HTML，看板 + 登录 + 刷新 | +| 指标卡片 | ✅ | QPS、平均延迟、P99、错误率 | +| 告警事件列表 | ✅ | 调用 `/api/v1/ai-ops/alerts`，显示聚合字段 | +| 规则列表 | ✅ | 调用 `/api/v1/ai-ops/rules` | +| 渠道列表 | ✅ | 调用 `/api/v1/ai-ops/channels` | +| 日志列表 | ✅ | 调用 `/api/v1/ai-ops/logs` | +| 页面认证 | ✅ | 页面本身公开，API 使用登录后 localStorage JWT 调用 | + +### 5. 
CI / GitHub Actions + +| 项 | 状态 | 文件 | +|----|------|------| +| Go 1.22 CI | ✅ | `.github/workflows/ci.yml` | +| PostgreSQL/Redis service | ✅ | CI services 配置 | +| gofmt 检查 | ✅ | `test -z "$(gofmt -l .)"` | +| build/test | ✅ | `go build -buildvcs=false ./...` + `go test -buildvcs=false -race ./...` | +| migration smoke | ✅ | 应用 000001 + 000002 migration | + +--- + +## 当前模块状态 + +### Phase 1：监控看板 + 日志查询 + +| 模块 | 状态 | 说明 | +|--------|------|------| +| 首页基础布局 | ✅ | `/ops/dashboard` 已升级为可用单页看板 | +| 指标数据获取 | ✅ | `/api/v1/ai-ops/metrics/realtime` | +| 指标下钻 | ✅ | `/api/v1/ai-ops/metrics/query` | +| 日志查询 | ✅ | 页面 + API + CSV 导出 | +| 日志查询性能 | ⚠ | 超时逻辑待补；Redis 缓存框架已集成 | + +### Phase 2：告警规则引擎 + 通知渠道 + +| 模块 | 状态 | 说明 | +|--------|------|------| +| 告警规则 CRUD | ✅ | `/api/v1/ai-ops/rules` | +| 规则引擎 | ✅ | 30 秒评估 + 持续时间判定 + 抑制期 | +| 告警升级 | ✅ | P2 持续 2 小时未确认 → P1 | +| 告警事件列表 | ✅ | `/api/v1/ai-ops/alerts` 连接真实 repo | +| 告警集群聚合 | ✅ | 同资源 1 分钟 >20 条生成聚合事件 | +| 通知渠道 CRUD | ✅ | `/api/v1/ai-ops/channels` | +| 通知发送后端 | ✅ | 内存队列 + 失败自动切换 | +| 通知日志 | ✅ | DB + Service 集成完成 | + +### Phase 3：自愈引擎 + 审计回滚 + +| 模块 | 状态 | 说明 | +|--------|------|------| +| 自愈规则配置 | ✅ | `healing_action` + `healing_config` + `is_sandboxed` | +| 自愈执行后端 | ✅ | HTTP endpoint 执行器；restart/script 有安全约束 | +| 沙盒模式 | ✅ | dry-run 只记录不执行 | +| 级联失败处理 | ⚠ | 基础失败记录已完成；复杂级联策略未实现 | +| 审计日志查询 | ✅ | `/api/v1/ai-ops/audits` | +| 审计后端 | ✅ | append-only 触发器保障 | +| 配置回滚 | ✅ | `/api/v1/ai-ops/audits/{id}/rollback` | + +--- + +## 已验证命令 + +```bash +cd /home/long/project/ai-ops + +gofmt -w cmd internal test +go build -buildvcs=false ./... +go test -buildvcs=false ./... +go test -buildvcs=false -coverprofile=coverage.out ./... 
+go tool cover -func=coverage.out | tail -1 +# total: 81.3% +# latest total: 83.4% +# single-node hardening total: 83.3% + +# 单机稳定版验证 +scripts/aiops-single-node.sh doctor +AI_OPS_PROJECT=aiops-verify AI_OPS_APP_PORT=18180 AI_OPS_DB_PORT=15433 AI_OPS_REDIS_PORT=16380 scripts/aiops-single-node.sh start +AI_OPS_PROJECT=aiops-verify AI_OPS_APP_PORT=18180 AI_OPS_DB_PORT=15433 AI_OPS_REDIS_PORT=16380 scripts/aiops-single-node.sh backup +AI_OPS_PROJECT=aiops-verify AI_OPS_APP_PORT=18180 AI_OPS_DB_PORT=15433 AI_OPS_REDIS_PORT=16380 scripts/aiops-single-node.sh recover +AI_OPS_PROJECT=aiops-verify AI_OPS_APP_PORT=18180 AI_OPS_DB_PORT=15433 AI_OPS_REDIS_PORT=16380 scripts/aiops-single-node.sh restore backups/ai_ops_20260512-103615.sql.gz + +# Podman Compose 替代 Docker Compose 全链路验证 +CGO_ENABLED=0 go build -buildvcs=false -o ai-ops-static ./cmd/ai-ops +podman-compose -f docker-compose.podman.yml up -d +curl -fsS http://localhost:18080/health +curl -fsS http://localhost:18080/actuator/health/ready +curl -fsS http://localhost:18080/ops/dashboard +curl -fsS http://localhost:18080/openapi.json +curl -fsS -X POST http://localhost:18080/api/v1/ai-ops/login \ + -H 'Content-Type: application/json' \ + -d '{"username":"admin","password":"admin"}' +curl -fsS http://localhost:18080/api/v1/ai-ops/alerts?page=1\&page_size=5 \ + -H "Authorization: Bearer $TOKEN" + +# Podman DB schema smoke +PGPASSWORD=aiops123 psql -h localhost -p 15432 -U aiops -d ai_ops \ + -c "SELECT to_regclass('public.ai_ops_alerts'), to_regclass('public.ai_ops_notification_logs');" +``` + +--- + +## Podman 验证交付物 + +| 文件 | 用途 | +|------|------| +| `config.podman.yaml` | 容器内配置，DB host=`postgres`，Redis host=`redis` | +| `docker-compose.podman.yml` | Rootless Podman Compose 验证，端口映射 `15432/16379/18080`，避免宿主冲突 | +| `Dockerfile.podman-local` | 离线本地二进制镜像模板；当前 compose 使用 volume 挂载 `ai-ops-static`，避免 build/pull 阻塞 | +| `ai-ops-static` | `CGO_ENABLED=0` 静态二进制，适配 Alpine 容器运行 | + +Podman 当前验证结果： + +```text +ai-ops-podman-app=Up 
+ai-ops-podman-postgres=Up (healthy) +ai-ops-podman-redis=Up (healthy) +PODMAN_COMPOSE_FULL_STACK_OK +``` + +--- + +## 剩余风险 / P2 技术债 + +1. Docker 原生 Compose 仍需在有 Docker daemon 权限环境单独验证；本机已用 Podman Compose 完成替代验证。 +2. fresh DB 破坏性验证因本机命令钩子阻断 `DROP DATABASE`，未在本机完成；Podman 新卷初始化与 CI migration smoke 已覆盖主要场景。 +3. 自愈执行器目前是安全 HTTP endpoint 适配层，不直接接 Kubernetes / Gateway SDK；生产接入时需补具体 adapter。 +4. 前端为轻量单页运维面板，不是完整产品化 UI。 +5. 覆盖率 gate 已达成并继续补强：`go test -buildvcs=false -coverprofile=coverage.out ./...` 总覆盖率 83.3%；已补告警升级、Feishu/Wechat 通知占位分支、日志导出错误分支、`CreateEvent` 直接路径，并修复告警升级通知服务为空时的空指针风险。 +6. 单机稳定版已补齐：`scripts/aiops-single-node.sh` 支持 `start/stop/status/logs/smoke/backup/restore/recover/doctor`；隔离端口验证通过；restore 演练暴露并修复了恢复到已有库时 schema 对象重复的问题。 +7. production mode 配置已加硬校验：JWT secret 至少 32 字符、metrics auth 至少 16 字符、DB 必填项不能为空，避免线上单机空 secret/空 metrics key 误运行。 diff --git a/config.podman.yaml b/config.podman.yaml new file mode 100644 index 0000000..1715910 --- /dev/null +++ b/config.podman.yaml @@ -0,0 +1,24 @@ +server: + port: 8080 + mode: development + jwt_secret: "ai-ops-dev-secret" + metrics_auth: "metrics-api-key" + +database: + host: postgres + port: 5432 + user: aiops + password: aiops123 + dbname: ai_ops + sslmode: disable + pool_size: 10 + +redis: + host: redis + port: 6379 + password: "" + db: 0 + +metrics: + prometheus_url: "http://localhost:9090" + retention_days: 7 diff --git a/config.yaml b/config.yaml new file mode 100644 index 0000000..eed94aa --- /dev/null +++ b/config.yaml @@ -0,0 +1,24 @@ +server: + port: 8080 + mode: development + jwt_secret: "ai-ops-dev-secret" + metrics_auth: "metrics-api-key" + +database: + host: localhost + port: 5432 + user: aiops + password: aiops123 + dbname: ai_ops + sslmode: disable + pool_size: 10 + +redis: + host: localhost + port: 6379 + password: "" + db: 0 + +metrics: + prometheus_url: "http://localhost:9090" + retention_days: 7 diff --git a/docker-compose.podman.yml b/docker-compose.podman.yml new file mode 100644 index 
0000000..0fac60c --- /dev/null +++ b/docker-compose.podman.yml @@ -0,0 +1,53 @@ +services: + postgres: + image: docker.io/library/postgres:16-alpine + container_name: ai-ops-podman-postgres + environment: + POSTGRES_USER: aiops + POSTGRES_PASSWORD: aiops123 + POSTGRES_DB: ai_ops + ports: + - "15432:5432" + volumes: + - podman_postgres_data:/var/lib/postgresql/data + - ./tech/migrations:/docker-entrypoint-initdb.d:ro + healthcheck: + test: ["CMD-SHELL", "pg_isready -U aiops -d ai_ops"] + interval: 5s + timeout: 5s + retries: 10 + + redis: + image: docker.io/library/redis:8-alpine + container_name: ai-ops-podman-redis + ports: + - "16379:6379" + volumes: + - podman_redis_data:/data + healthcheck: + test: ["CMD", "redis-cli", "ping"] + interval: 5s + timeout: 5s + retries: 10 + + ai-ops: + image: docker.io/library/alpine:3.19 + container_name: ai-ops-podman-app + working_dir: /app + command: ["/app/ai-ops"] + ports: + - "18080:8080" + environment: + AI_OPS_CONFIG: /app/config.yaml + volumes: + - ./ai-ops-static:/app/ai-ops:ro + - ./config.podman.yaml:/app/config.yaml:ro + - ./static:/app/static:ro + depends_on: + - postgres + - redis + restart: unless-stopped + +volumes: + podman_postgres_data: + podman_redis_data: diff --git a/docker-compose.single.yml b/docker-compose.single.yml new file mode 100644 index 0000000..3f9b39d --- /dev/null +++ b/docker-compose.single.yml @@ -0,0 +1,60 @@ +services: + postgres: + image: ${AI_OPS_POSTGRES_IMAGE:-docker.io/library/postgres:16-alpine} + container_name: ${AI_OPS_PROJECT:-ai-ops-single}-postgres + environment: + POSTGRES_USER: ${AI_OPS_DB_USER:-aiops} + POSTGRES_PASSWORD: ${AI_OPS_DB_PASSWORD:-aiops123} + POSTGRES_DB: ${AI_OPS_DB_NAME:-ai_ops} + ports: + - "${AI_OPS_BIND_ADDR:-127.0.0.1}:${AI_OPS_DB_PORT:-15432}:5432" + volumes: + - single_postgres_data:/var/lib/postgresql/data + - ./tech/migrations:/docker-entrypoint-initdb.d:ro + healthcheck: + test: ["CMD-SHELL", "pg_isready -U ${AI_OPS_DB_USER:-aiops} -d 
${AI_OPS_DB_NAME:-ai_ops}"] + interval: 5s + timeout: 5s + retries: 12 + restart: unless-stopped + + redis: + image: ${AI_OPS_REDIS_IMAGE:-docker.io/library/redis:8-alpine} + container_name: ${AI_OPS_PROJECT:-ai-ops-single}-redis + ports: + - "${AI_OPS_BIND_ADDR:-127.0.0.1}:${AI_OPS_REDIS_PORT:-16379}:6379" + volumes: + - single_redis_data:/data + healthcheck: + test: ["CMD", "redis-cli", "ping"] + interval: 5s + timeout: 5s + retries: 12 + restart: unless-stopped + + ai-ops: + image: ${AI_OPS_RUNTIME_IMAGE:-docker.io/library/alpine:3.19} + container_name: ${AI_OPS_PROJECT:-ai-ops-single}-app + working_dir: /app + command: ["/app/ai-ops"] + ports: + - "${AI_OPS_BIND_ADDR:-127.0.0.1}:${AI_OPS_APP_PORT:-18080}:8080" + environment: + AI_OPS_CONFIG: /app/config.yaml + AI_OPS_SERVER_JWT_SECRET: ${AI_OPS_JWT_SECRET:?AI_OPS_JWT_SECRET is required} + AI_OPS_SERVER_METRICS_AUTH: ${AI_OPS_METRICS_AUTH:?AI_OPS_METRICS_AUTH is required} + AI_OPS_DATABASE_PASSWORD: ${AI_OPS_DB_PASSWORD:-aiops123} + volumes: + - ./.runtime/ai-ops:/app/ai-ops:ro + - ./.runtime/config.single.yaml:/app/config.yaml:ro + - ./static:/app/static:ro + depends_on: + postgres: + condition: service_healthy + redis: + condition: service_healthy + restart: unless-stopped + +volumes: + single_postgres_data: + single_redis_data: diff --git a/docker-compose.yml b/docker-compose.yml new file mode 100644 index 0000000..4f826a4 --- /dev/null +++ b/docker-compose.yml @@ -0,0 +1,51 @@ +version: "3.8" + +services: + postgres: + image: postgres:16-alpine + container_name: ai-ops-postgres + environment: + POSTGRES_USER: aiops + POSTGRES_PASSWORD: aiops123 + POSTGRES_DB: ai_ops + ports: + - "5432:5432" + volumes: + - postgres_data:/var/lib/postgresql/data + - ./tech/migrations:/docker-entrypoint-initdb.d + healthcheck: + test: ["CMD-SHELL", "pg_isready -U aiops -d ai_ops"] + interval: 5s + timeout: 5s + retries: 5 + + redis: + image: redis:7-alpine + container_name: ai-ops-redis + ports: + - "6379:6379" + volumes: + - 
redis_data:/data + healthcheck: + test: ["CMD", "redis-cli", "ping"] + interval: 5s + timeout: 5s + retries: 5 + + ai-ops: + build: . + container_name: ai-ops-app + ports: + - "8080:8080" + environment: + AI_OPS_CONFIG: /app/config.yaml + depends_on: + postgres: + condition: service_healthy + redis: + condition: service_healthy + restart: unless-stopped + +volumes: + postgres_data: + redis_data: diff --git a/docs/CHANGELOG.md b/docs/CHANGELOG.md new file mode 100644 index 0000000..0db9055 --- /dev/null +++ b/docs/CHANGELOG.md @@ -0,0 +1,20 @@ +# AI-Ops 更新日志 + +## 2026-05-11 — Review 修复 + +### 已完成 +- [x] 统一回滚错误码：PRD / HLD / INTERFACE 三处不一致，统一为 `OPS_AUD_4101` / `OPS_AUD_4102` +- [x] 修复笔误："游戏化事务" → "编程式事务"，"预畈" → "预留" +- [x] 填充 docs/ 目录：新增 README.md、CHANGELOG.md +- [x] 补齐数据库 migration SQL：`tech/migrations/000001_init_schema.up.sql` / `.down.sql` + - 覆盖核心 6 张表：ai_ops_rules / ai_ops_alerts / ai_ops_healings / ai_ops_channels / ai_ops_audits / ai_ops_metrics + - 审计日志防篡改触发器（append-only） + - 时序表分区策略（按天分区，自动清理 > 7 天） +- [x] 功能清单裁剪：删除 66 条 PM 越界按钮级任务，添加 PM/Engineer 范围边界说明 +- [x] HLD 门控结论更新为具体行动项： + - 威胁建模验证要求转化为 CI 阻断测试 + - `BuildServer` / `BuildRuntime` 显式挂载约束落实为 QA 阻断检查项 + - 高风险变更 fail-closed 规则明确化 + +### 待完成 +- [ ] 项目骨架（go.mod、Makefile、Dockerfile — 待开发阶段启动） diff --git a/docs/EXECUTION_BOARD.md b/docs/EXECUTION_BOARD.md new file mode 100644 index 0000000..691c3b8 --- /dev/null +++ b/docs/EXECUTION_BOARD.md @@ -0,0 +1,91 @@ +# AI-Ops Phase 1 代码开发执行板 + +> 状态：开发中 +> 最后更新：2026-05-11 +> 负责人：小龙（统筹） + +--- + +## 一、Phase 1 范围 + +| 模块 | 功能点 | 工期 | 状态 | +|------|--------|------|------| +| 1.1 监控首页 | 首页布局 + 实时指标 + 供应商数 + 告警数 + 指标下钻 | 5 人天 | 骨架完成 | +| 1.2 日志查询 | 查询页 + 表格展示 + 分页 + CSV 导出 + 超时 + Redis 缓存 | 3 人天 | 骨架完成 | + +**总计：8 人天** + +--- + +## 二、任务清单 + +### 模块 1.1：监控首页 + +| 任务 ID | 任务 | 产出物 | 状态 | 备注 | +|---------|------|--------|------|------| +| C1-01 | 项目初始化（go mod + 目录结构） | go.mod + 目录树 | 完成 | 21 个 Go 文件 | +| C1-02 | 基础设施（config、database、redis、errors、response） | 
internal/{config,database,redis}/ + pkg/ | 完成 | 待 go mod tidy 下载依赖 | +| C1-03 | Domain 层（models + repository 接口） | internal/domain/{model,repository}/ | 完成 | 3 model + 3 interface | +| C1-04 | Infra 层（PostgreSQL 实现） | internal/infra/repository/pg_*.go | 完成 | 3 个 repository | +| C1-05 | Service 层（metrics + log 业务逻辑） | internal/service/*.go | 完成 | 含 CSV 导出、Redis 缓存 | +| C1-06 | Handler 层（HTTP API） | internal/handler/*.go | 完成 | 7 个 API 端点 | +| C1-07 | Middleware（auth、logging、recovery） | internal/middleware/*.go | 完成 | JWT + API Key 双鉴权 | +| C1-08 | 首页路由 /ops/dashboard | dashboard_handler.go | 完成 | 内联模板，待优化为文件模板 | +| C1-09 | 实时指标 API /api/v1/ai-ops/metrics/realtime | metric_handler.go | 完成 | 查询 ai_ops_metrics 表 | +| C1-10 | 供应商数量 API /api/v1/ai-ops/metrics/suppliers/count | metric_handler.go | 完成 | 基于 supplier_health 指标 | +| C1-11 | 告警数量 API /api/v1/ai-ops/alerts/open/count | metric_handler.go | 完成 | 查询 ai_ops_alerts 表 | +| C1-12 | 指标下钻 API /api/v1/ai-ops/metrics/query | metric_handler.go | 完成 | 支持时间范围过滤 | +| C1-13 | 前端模板优化 | web/templates/*.html | 待办 | 当前为内联模板 | + +### 模块 1.2：日志查询 + +| 任务 ID | 任务 | 产出物 | 状态 | 备注 | +|---------|------|--------|------|------| +| C2-01 | 日志查询页路由 /ops/dashboard/logs | dashboard_handler.go | 完成 | 内联模板 | +| C2-02 | 日志查询 API /api/v1/ai-ops/logs | log_handler.go | 完成 | 支持分页 + 多维度筛选 | +| C2-03 | CSV 导出 /api/v1/ai-ops/logs/export | log_handler.go | 完成 | 上限 10000 条 | +| C2-04 | 查询超时逻辑 | log_service.go | 完成 | 3 秒超时 | +| C2-05 | Redis 缓存（5 分钟 TTL） | log_service.go | 完成 | 基于筛选条件构建缓存键 | +| C2-06 | 日志表 migration | tech/migrations/000002_create_request_logs.up.sql | 完成 | 7 个索引 | + +### 通用任务 + +| 任务 ID | 任务 | 产出物 | 状态 | 备注 | +|---------|------|--------|------|------| +| C0-01 | 基础测试（service + handler mock 测试） | *_test.go | 完成 | 2 个测试文件 | +| C0-02 | 编译验证 | go build | 阻塞 | 缺少 go.sum，需运行 go mod tidy | +| C0-03 | 性能测试脚本（k6） | test/perf/ | 待办 | Phase 1 功能清单未要求，可延后 | +| C0-04 | 配置示例 | config.yaml | 完成 | 支持环境变量覆盖 | + +--- + +## 三、API 端点清单 + +| 方法 | 路径 | 说明 | 
状态 | +|------|------|------|------| +| GET | /ops/dashboard | 监控首页 | 完成 | +| GET | /ops/dashboard/logs | 日志查询页 | 完成 | +| GET | /api/v1/ai-ops/metrics/realtime | 实时指标（QPS/延迟/P99/错误率） | 完成 | +| GET | /api/v1/ai-ops/metrics/suppliers/count | 活跃供应商数量 | 完成 | +| GET | /api/v1/ai-ops/alerts/open/count | 未关闭告警数量 | 完成 | +| GET | /api/v1/ai-ops/metrics/query | 指标下钻查询 | 完成 | +| GET | /api/v1/ai-ops/logs | 日志查询（分页） | 完成 | +| GET | /api/v1/ai-ops/logs/export | 日志导出 CSV | 完成 | +| GET | /health | 健康检查 | 完成 | + +--- + +## 四、阻塞项 + +| 阻塞项 | 影响 | 解决方案 | 负责人 | +|--------|------|----------|--------| +| go.sum 缺失 | 无法编译运行 | 运行 `go mod tidy` 下载依赖并生成 go.sum | 用户/运维 | + +--- + +## 五、下一步行动 + +1. **编译验证**：运行 `go mod tidy` + `go build ./...` 解决依赖和编译错误 +2. **模板分离**：将内联 HTML 模板提取到 `web/templates/` 目录 +3. **集成测试**：补充数据库集成测试（使用 testcontainers 或内存 PostgreSQL） +4. **Phase 1 验收**：所有 API 可访问，日志查询响应 <1s，CSV 导出正常 diff --git a/docs/IMPLEMENTATION_PLAN.md b/docs/IMPLEMENTATION_PLAN.md new file mode 100644 index 0000000..f5f895c --- /dev/null +++ b/docs/IMPLEMENTATION_PLAN.md @@ -0,0 +1,199 @@ +# AI-Ops 智能运维系统 — 详细实施计划 + +> 版本：v1.0 +> 生成日期：2026-05-11 +> 编制：小龙（统筹） +> 基准：汇总审核报告与改进任务清单 + +--- + +## 一、实施总览 + +| 项目 | 内容 | +|------|------| +| 总任务数 | 48 项（P0: 16, P1: 18, P2: 14） | +| 总预估工时 | 24 人天（含 20% 联调缓冲） | +| 建议人员配置 | PM 0.5F + TechLead 0.5F + QA 0.3F + Security 0.2F | +| 总周期 | 2~3 周（并行执行时） | +| 进入开发门禁 | 所有 P0 闭环 + PM/TechLead/QA 三方复审通过 | + +--- + +## 二、时间线 + +``` +Week 1 Week 2 Week 3 +|---------------|---------------|---------------| +Phase 0 文档修复 Phase 1+需求 Phase 2+技术 Phase 3+测试 Phase 4+安全 +(所有 P0) (所有 P1) (所有 P1) (所有 P1) (P1+P2) +|=======| |=======| |=======| |=======| |=====| + ↓复审 ↓复审 ↓复审 ↓复审 ↓复审 +``` + +--- + +## 三、Phase 0 — 文档修复与对齐（Week 1，16 项，8 人天） + +**目标：消除所有 P0 问题，确保文档间一致性。本 Phase 是进入开发的绝对前提。** + +### 3.1 接口对齐（TechLead 主导） + +| 任务 ID | 任务名称 | 责任人 | 工时 | 产出物 | 依赖 | 验收标准 | +|----------|----------|--------|------|--------|------|----------| +| D0-01 | 召开接口对齐会 | TechLead | 0.5d | 
`docs/INTEGRATION_CONTRACT.md` | 无 | HLD/INTERFACE/DEPLOYMENT 三份文档无接口冲突 | +| D0-02 | 补齐或删除 ER 图中 4 张缺失表 | TechLead | 0.5d | HLD §4.2 更新 + `migrations/000001_init_schema.up.sql` 更新 | D0-01 | migration 与 ER 图一致，CI `go test` 通过 | +| D0-03 | 统一自愈动作命名 | TechLead | 0.5d | HLD §3.3 + INTERFACE §1.3 + 功能清单 3.1.2 同步更新 | D0-01 | 全文档自愈动作命名一致，搜索无冲突 | +| D0-04 | 定义 IntegrationPlugin Go interface | TechLead | 0.5d | INTERFACE.md 新增 §X | D0-01 | interface 含 Name/Init/RegisterRoutes/HealthChecks/Shutdown 方法，有注释和示例 | + +### 3.2 需求修正（PM 主导） + +| 任务 ID | 任务名称 | 责任人 | 工时 | 产出物 | 依赖 | 验收标准 | +|----------|----------|--------|------|--------|------|----------| +| R0-01 | 解决范围冲突：明确供应商智能切换定位 | PM | 0.5d | PRD §3 更新 + 功能清单相关章节 | 无 | PRD In/Out of Scope 与功能清单一致，无范围模糊区 | +| R0-02 | 重新估算工期 | PM | 0.5d | 功能清单 “任务估算汇总” 更新 | 无 | 138 任务总估算在 30~40 人天，含缓冲 | +| R0-03 | 补充自愈动作“重启实例”实现任务 | PM | 0.5d | 功能清单 3.1.2 更新 | R0-01 | 功能清单包含重启实例任务，与 AC-6 对应 | + +### 3.3 安全基线（Security 主导） + +| 任务 ID | 任务名称 | 责任人 | 工时 | 产出物 | 依赖 | 验收标准 | +|----------|----------|--------|------|--------|------|----------| +| S0-01 | 在威胁建模中增加 LLM 特有风险 | Security | 0.5d | HLD §10.1 更新 | 无 | 威胁建模覆盖 LLM Top 5 风险，每个有缓解策略 | +| S0-02 | 补充审计表防篡改触发器 | Security | 0.5d | `migrations/000001_init_schema.up.sql` 新增触发器 | D0-02 | 审计表执行 UPDATE/DELETE 时报错，单测验证 | +| S0-03 | 明确审计写入与业务执行的事务顺序 | Security | 0.5d | HLD §3.3 更新 | 无 | 文档明确"先写审计再执行业务"，含回滚机制 | +| S0-04 | 补充 WebSocket JWT 鉴权说明 | Security | 0.5d | INTERFACE §3.4 更新 | 无 | WebSocket 接口含连接建立时的 token 校验流程 | +| S0-05 | 在 HLD 中增加参数化查询强制要求 | Security | 0.5d | HLD §4 更新 | 无 | 所有数据库交互层必须使用参数化/预编译查询 | +| S0-06 | 限制 /metrics 端点访问 | Security | 0.5d | INTERFACE §3.2 更新 | 无 | /metrics 含内网 IP 限制或 API Key 鉴权说明 | + +### 3.4 测试资产（QA 主导） + +| 任务 ID | 任务名称 | 责任人 | 工时 | 产出物 | 依赖 | 验收标准 | +|----------|----------|--------|------|--------|------|----------| +| T0-01 | 为 8 个缺失负向用例的 AC 补充负向用例 | QA | 1d | TEST_DESIGN.md + CASES.md 更新 | 无 | 每个 AC 至少 1 正向 + 1 负向，PRD AC 覆盖率 100% | +| T0-02 | 补充 F-05~F-08 异常流程用例 | QA | 0.5d 
| CASES.md 新增 TC-E5~E8 | 无 | 8 条异常流程全部有对应用例 | +| T0-03 | 创建 CI 配置文件 | QA | 0.5d | `.github/workflows/ci.yml` | 无 | PR 提交时自动触发，覆盖率不达标时 exit 1 | +| T0-04 | 创建性能压测目录 | QA | 0.5d | `test/perf/dashboard_k6.js` + `test/perf/drilldown_k6.js` + `test/perf/PERF_ENV.md` | 无 | k6 脚本可执行，含环境规格和 P99 计算方法 | + +--- + +## 四、Phase 1 — 需求与产品级 P1 闭环（Week 1~2，9 项，4.5 人天） + +**目标：PRD 完善，AC 可测试，权限明确。** + +| 任务 ID | 任务名称 | 责任人 | 工时 | 产出物 | 依赖 | 验收标准 | +|----------|----------|--------|------|--------|------|----------| +| R1-01 | 统一失败判定线 | PM | 0.5d | PRD §2 + §8.3 更新 | R0-01 | 只有一条失败判定线，时间窗口、阈值统一 | +| R1-02 | 删除“不仅仅包括于” | PM | 0.5d | PRD §3 更新 | 无 | In Scope 为封闭列表，无“等”和“不仅仅包括于” | +| R1-03 | 统一通知渠道列表 | PM | 0.5d | PRD AC-4 + 功能清单更新 | R0-01 | 通知渠道列表在所有文档中一致 | +| R1-04 | AC-7 补充不可篡改技术实现定义 | PM | 0.5d | PRD AC-7 更新 | S0-02 | 明确实现方式（触发器 + 只追加） | +| R1-05 | AC-8 补充“有效”判定标准 | PM | 0.5d | PRD AC-8 更新 | 无 | 明确“有效”的定义（非空、JSON 可解析、Schema 匹配） | +| R1-06 | AC-6 补充级联故障回退验收点 | PM | 0.5d | PRD AC-6 更新 | D0-03 | AC-6 含级联故障回退的验收条件 | +| R1-07 | 容量预测（AC-9）补充可测试标准 | PM | 0.5d | PRD AC-9 更新 | 无 | 含量化指标（如 MAPE<30%） | +| R1-08 | 补充 UI 最低兼容性要求 | PM | 0.5d | PRD 新增章节 | 无 | 明确浏览器、分辨率、移动端策略 | +| R1-09 | 细化角色权限矩阵到 API 级别 | PM | 0.5d | PRD AC-12 + 功能清单 G1 更新 | D1-07 | 以表格形式列出各角色对关键 API 的 CRUD 权限 | + +--- + +## 五、Phase 2 — 技术设计级 P1 闭环（Week 2，9 项，4.5 人天） + +**目标：HLD/DEPLOYMENT 完善，部署可执行，规则评估有扩展方案。** + +| 任务 ID | 任务名称 | 责任人 | 工时 | 产出物 | 依赖 | 验收标准 | +|----------|----------|--------|------|--------|------|----------| +| D1-05 | 修正 DEPLOYMENT “主备”为 active-active | TechLead | 0.5d | DEPLOYMENT §1.1 更新 | 无 | 描述为多实例多活 + 负载均衡 | +| D1-06 | 将 migration 执行从 Worker 中分离 | TechLead | 0.5d | DEPLOYMENT §3.2 更新 | D0-02 | migration 由 init container 或 K8s Job 执行 | +| D1-07 | 补充 `ai_ops_roles` 表结构 | TechLead | 0.5d | HLD §8.1 + migration 更新 | D0-02 | 表含 id/role_name/permissions/created_at，CI 通过 | +| D1-08 | 补充 `ai_ops_snapshots` 表结构 | TechLead | 0.5d | HLD §3.3 + migration 更新 | D0-02 | 表含 id/healing_id/state_json/config_version/created_at | 
+| D1-09 | 完善告警聚合状态机 | TechLead | 0.5d | HLD §5.2 更新 | 无 | 含解除规则、子告警与父告警状态同步策略 | +| D1-10 | 补充规则评估分片策略 | TechLead | 0.5d | HLD §9.1/9.2 更新 | 无 | 含分片键、负载均衡方案、水平扩展策略 | +| D2-12 | 完善 metrics 分区表管理策略 | TechLead | 0.5d | migration + HLD 更新 | D0-02 | 含按天分区或应用层定时任务说明 | +| D2-14 | 补充 Graceful Shutdown WebSocket 关闭策略 | TechLead | 0.5d | DEPLOYMENT §3.2 更新 | S0-04 | 含 close frame + 5s ack 等待机制 | +| D2-15 | 重新校准时序存储容量估算 | TechLead | 0.5d | HLD §9.3 更新 | 无 | 参考 Prometheus 官方公式，给出保守估算 | + +--- + +## 六、Phase 3 — 测试资产完善（Week 2~3，8 项，4 人天） + +**目标：测试用例完整，CI 可运行，混沌测试有设计，E2E 有场景。** + +| 任务 ID | 任务名称 | 责任人 | 工时 | 产出物 | 依赖 | 验收标准 | +|----------|----------|--------|------|--------|------|----------| +| T1-01 | 建立覆盖率验证机制 | QA | 0.5d | `scripts/check_coverage.sh` + STRATEGY.md 更新 | T0-03 | CI 中自动解析 coverprofile，按模块阻断 | +| T1-02 | 设计 3 条混沌测试用例 | QA | 0.5d | TEST_DESIGN.md 新增混沌测试章节 | T0-02 | 含 Given-When-Then，覆盖 Pod 杀死/Redis 分区/PG 切换 | +| T1-03 | 完善测试数据管理规范 | QA | 0.5d | STRATEGY.md 更新 + `test/fixtures/` 目录结构文档 | T0-03 | 含 SQL/JSON/Go seed 三种方式，含大数据生成脚本说明 | +| T1-04 | 为灰度门禁增加自动化判定脚本 | QA | 0.5d | `scripts/gate_check.sh` + TEST_DESIGN.md §5.2 更新 | T0-03 | 脚本可自动采集覆盖率/沙盒验证/安全扫描结果 | +| T1-05 | 明确安全扫描工具与阈值 | QA | 0.5d | STRATEGY.md 更新 | S0-01 | 明确工具（Trivy/Gosec）、漏洞等级定义、扫描时机 | +| T1-06 | 补充 E2E 详细场景设计 | QA | 0.5d | TEST_DESIGN.md + CASES.md 新增 E2E 章节 | T0-01 | 含完整链路：指标异常→告警触发→通知发送→自愈执行→事件记录 | +| T2-01 | 统一用例编号风格 | QA | 0.5d | TEST_DESIGN.md + CASES.md 全文更新 | T0-01 | 全部统一为 TC-{AC}-{seq} | +| T2-02 | 补充 Webhook 5xx 测试场景 | QA | 0.5d | CASES.md TC-E2 更新 | T0-02 | TC-E2 含 4xx 和 5xx 两种场景 | + +--- + +## 七、Phase 4 — 安全与运营工具（Week 3，6 项，3 人天） + +**目标：威胁建模完善，安全门禁可执行，商业化闭环有 ROI。** + +| 任务 ID | 任务名称 | 责任人 | 工时 | 产出物 | 依赖 | 验收标准 | +|----------|----------|--------|------|--------|------|----------| +| S1-01 | 补充敏感字段脱敏具体实现 | Security | 0.5d | HLD §8 更新 | S0-05 | 含密码替换策略、加密算法、脱敏测试用例 | +| S1-02 | 明确自愈引擎权限边界 | Security | 0.5d | PRD AC-6 + HLD §3.3 更新 | D0-03 | 含重启关键服务的白名单/黑名单机制 | +| R2-01 | 补充 ROI 量化模型 | PM | 0.5d 
| PRD 新增章节 | R0-02 | 含当前运维成本、目标节省金额、回收周期 | +| R2-02 | 补充发布策略量化门控标准 | PM | 0.5d | PRD §8 更新 | R1-01 | 含噪声率<10%、通知成功率>95% 等可量化条件 | +| R2-03 | 补充审计日志存储成本评估 | PM | 0.5d | PRD + HLD §9.3 更新 | D2-15 | 含压缩率、归档策略、存储成本上限 | +| D2-11 | 优化错误码排版 | TechLead | 0.5d | INTERFACE §3.3 更新 | D0-01 | 错误码分段排版，每个含注释说明 | + +--- + +## 八、关键路径与产出物清单 + +### 文档级产出物 + +| 文件路径 | 说明 | 贡献者 | +|----------|------|--------| +| `docs/INTEGRATION_CONTRACT.md` | 外部集成契约唯一信源 | TechLead | +| `prd/PRD.md` | 主需求文档（更新后） | PM | +| `specs/功能清单.md` | 功能清单（更新后） | PM | +| `tech/HLD.md` | 高层设计（更新后） | TechLead | +| `tech/INTERFACE.md` | 接口设计（更新后） | TechLead | +| `tech/DEPLOYMENT.md` | 部署设计（更新后） | TechLead | +| `tech/TEST_DESIGN.md` | 测试设计（更新后） | QA | +| `test/CASES.md` | 测试用例（更新后） | QA | +| `test/STRATEGY.md` | 测试策略（更新后） | QA | + +### 代码级产出物 + +| 文件路径 | 说明 | 贡献者 | +|----------|------|--------| +| `.github/workflows/ci.yml` | CI Pipeline（覆盖率阻断、测试执行、失败通知） | QA | +| `scripts/check_coverage.sh` | 覆盖率解析脚本 | QA | +| `scripts/gate_check.sh` | 灰度门禁自动化判定脚本 | QA | +| `test/perf/dashboard_k6.js` | 看板首页性能压测脚本 | QA | +| `test/perf/drilldown_k6.js` | 下钻性能压测脚本 | QA | +| `test/perf/PERF_ENV.md` | 性能压测环境规格 | QA | +| `test/fixtures/` 目录结构文档 | 测试数据管理规范 | QA | +| `tech/migrations/000001_init_schema.up.sql` | 数据库 schema（更新后） | TechLead | +| `docs/汇总审核报告与改进任务清单.md` | 汇总审核报告 | 小龙 | +| `docs/IMPLEMENTATION_PLAN.md` | 本文档 | 小龙 | + +--- + +## 九、门禁与复审机制 + +| 门禁点 | 条件 | 复审者 | +|------|------|--------| +| Phase 0 完成 | 所有 16 项 P0 任务完成，文档间一致性通过自动化检查 | 小龙 + TechLead | +| Phase 1 完成 | 所有 9 项需求 P1 任务完成，PRD 可转测试用例 | PM + QA | +| Phase 2 完成 | 所有 9 项技术 P1 任务完成，migration 可执行 | TechLead | +| Phase 3 完成 | 所有 8 项测试任务完成，CI 可运行 | QA | +| Phase 4 完成 | 所有 6 项安全/运营任务完成 | Security + PM | +| 进入开发门禁 | 所有 Phase 完成，四方（PM/TechLead/QA/Security）复审通过 | 小龙 | + +--- + +## 十、风险与应对 + +| 风险 | 概率 | 影响 | 应对策略 | +|------|------|------|----------| +| 接口对齐会延期或无法达成一致 | 中 | 高 | 由小龙主持，PM/TechLead 双方必须参与，不达成一致不散会 | +| 工期估算仍被认为过高 | 低 | 中 | 预留 20% 联调缓冲 + 15% 风险缓冲，每周回顾 | +| QA 
资产补齐耗时超预期 | 中 | 中 | 优先完成 T0-01~T0-04（P0），P1/P2 可延后到开发期补充 | +| Security 审查引发范围变更 | 低 | 高 | S0-01 限于威胁建模文档更新，不扩展为新功能需求 | diff --git a/docs/INTEGRATION_CONTRACT.md b/docs/INTEGRATION_CONTRACT.md new file mode 100644 index 0000000..35df2c0 --- /dev/null +++ b/docs/INTEGRATION_CONTRACT.md @@ -0,0 +1,127 @@ +# AI-Ops 集成接口契约（Integration Contract） + +> 版本：v1.0 | 状态：正式版 +> 本文档是 AI-Ops 与立交桥主项目（gateway/supply-api/token-runtime）集成时的唯一信源。 +> 所有集成接口的路径、命名、字段、错误码均以本文档为准。 + +--- + +## 1. 集成原则 + +1. **唯一信源**：本文档覆盖的所有接口，以 INTERFACE.md 为技术基准，以本文档为集成契约基准。 +2. **路径统一**：集成接口路径使用 `/internal/` 前缀，表明为内部服务间通信，不对外暴露。 +3. **命名统一**：所有自愈动作类型使用 snake_case，统一为：`switch_route`、`throttle`、`restart_instance`、`invoke_script`、`isolate_node`。 +4. **错误码统一**：所有错误码使用 `{SOURCE}_{CATEGORY}_{CODE}` 格式。 +5. **审计覆盖**：任何修改类操作（POST/PUT/DELETE/PATCH）必须记录审计日志。 + +--- + +## 2. 与 Bridge Gateway 集成 + +### 2.1 接口清单 + +| 方法 | 路径 | 请求 | 响应 | 说明 | 审计 | +|------|------|------|------|------|------| +| 查询服务状态 | `GET /internal/gateway/health` | - | `{"status":"up","services":{}}` | 诊断时查询各服务健康状态 | 否 | +| 获取路由策略 | `GET /internal/gateway/routes` | - | `{"routes":[]}` | 读取当前路由配置，用于影响面分析 | 否 | +| 修改路由策略 | `POST /internal/gateway/routes` | `{"action":"switch_route","target":"","config":{}}` | `{"success":true}` | 自愈动作调用 | 是 | +| 获取请求量统计 | `GET /internal/gateway/metrics` | `?metric=qps&duration=5m` | `{"value":1234.5}` | 采集指标数据 | 否 | + +### 2.2 安全约束 + +- `/internal/gateway/metrics` 仅限内网 IP 访问（10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16）或需携带有效的服务间 API Key。 +- 公网直接访问返回 403 Forbidden。 +- 修改路由策略必须经过 AI-Ops 审计流程，由 HLD §3.3 审计设计章节规范。 + +### 2.3 字段规范 + +**`POST /internal/gateway/routes` 请求体**: +```json +{ + "action": "switch_route", // 必填，枚举: switch_route/throttle/restart_instance/invoke_script/isolate_node + "target": "service_a", // 必填，目标资源 ID + "config": { // 可选，动作特定参数 + "fallback_provider": "p_b", // switch_route 时指定备用供应商 + "rate_limit_rps": 1000 // throttle 时指定限流值 + }, + "dry_run": false // 可选，默认 false，true 时仅验证不执行 +} +``` + 
+**`修改路由策略` 错误码**: +| 错误码 | HTTP 状态 | 说明 | +|---------|-----------|------| +| `OPS_GWY_4001` | 400 | 请求参数验证失败 | +| `OPS_GWY_4002` | 404 | 目标资源不存在 | +| `OPS_GWY_4003` | 409 | 目标资源正在被其他操作修改 | +| `OPS_GWY_5001` | 500 | gateway 内部错误 | + +--- + +## 3. 与 supply-api 集成 + +### 3.1 接口清单 + +| 方法 | 路径 | 请求 | 响应 | 说明 | 审计 | +|------|------|------|------|------|------| +| 查询供应商状态 | `GET /internal/supply/accounts/health` | - | `{"accounts":[]}` | 诊断供应商健康状态 | 否 | +| 获取审计日志格式 | `GET /internal/supply/audit/schema` | - | `{"schema":{}}` | 确保审计事件格式一致 | 否 | + +### 3.2 安全约束 + +- 供应商健康接口仅限内网访问。 +- 审计日志格式接口用于初始化时校验 schema 一致性，不对外暴露。 + +--- + +## 4. 与 platform-token-runtime 集成 + +### 4.1 接口清单 + +| 方法 | 路径 | 请求 | 响应 | 说明 | 审计 | +|------|------|------|------|------|------| +| 获取 Token 消耗 | `GET /internal/runtime/token-usage` | `?window=1h` | `{"total":12345,"by_model":{}}` | 采集 Token 消耗指标 | 否 | +| 获取容量使用率 | `GET /internal/runtime/capacity` | - | `{"utilization":0.75}` | 采集容量指标 | 否 | + +### 4.2 安全约束 + +- Token 消耗接口可能包含敏感信息，仅限内网访问。 +- 返回的 Token 数量应为汇总值，不得暴露用户级别的明细。 + +--- + +## 5. 
错误码映射总表 + +| 错误码 | HTTP 状态 | 来源 | 说明 | +|---------|-----------|-------|------| +| `OPS_GEN_4001` | 400 | 通用 | 请求参数错误 | +| `OPS_GEN_4002` | 401 | 通用 | 未授权 | +| `OPS_GEN_4003` | 403 | 通用 | 权限不足 | +| `OPS_GEN_4004` | 404 | 通用 | 资源不存在 | +| `OPS_GEN_4005` | 409 | 通用 | 资源冲突 | +| `OPS_GEN_4006` | 413 | 通用 | 请求体过大 | +| `OPS_GEN_5001` | 500 | 通用 | 内部服务错误 | +| `OPS_MET_4001` | 400 | metric | 指标名称无效 | +| `OPS_MET_4002` | 400 | metric | 时间范围不合法 | +| `OPS_ALT_4001` | 400 | alert | 规则名称已存在 | +| `OPS_ALT_4002` | 400 | alert | 规则参数验证失败 | +| `OPS_ALT_4003` | 409 | alert | 版本冲突 | +| `OPS_HEAL_4001` | 400 | healing | 自愈动作参数无效 | +| `OPS_HEAL_4002` | 409 | healing | 自愈动作正在执行中 | +| `OPS_HEAL_4003` | 400 | healing | 回滚目标执行不存在 | +| `OPS_AUD_4001` | 403 | audit | 无权进行审计操作 | +| `OPS_AUD_4101` | 400 | audit | 回滚目标资源不存在 | +| `OPS_AUD_4102` | 409 | audit | 回滚目标已被后续修改覆盖 | +| `OPS_CAP_4001` | 400 | capacity | 容量指标不存在 | +| `OPS_GWY_4001` | 400 | gateway | 请求参数验证失败 | +| `OPS_GWY_4002` | 404 | gateway | 目标资源不存在 | +| `OPS_GWY_4003` | 409 | gateway | 目标资源正在被其他操作修改 | +| `OPS_GWY_5001` | 500 | gateway | gateway 内部错误 | + +--- + +## 6. 
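Changelog
+
+| Version | Date | Author | Changes |
+|------|------|---------|------|
+| v1.0 | 2026-05-11 | TechLead | Initial draft: unified integration interface paths, self-healing action names, error codes, and security constraints |
diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 0000000..79c8943
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,59 @@
+# AI-Ops Intelligent Operations System
+
+> The intelligent operations subsystem of the 立连桥 (立交桥) platform. It moves operations from manual troubleshooting to machine-led, real-time assurance.
+
+---
+
+## Quick navigation
+
+| Document | Path | Description |
+|---|---|---|
+| Product requirements | `prd/PRD.md` | Scope, ACs, user scenarios, launch strategy |
+| Competitor analysis | `prd/competitor-analysis.md` | Landscape matrix of 14 competitors and gap analysis |
+| High-level design | `tech/HLD.md` | Architecture overview, module design, data model, security and performance |
+| Interface design | `tech/INTERFACE.md` | Internal interfaces, REST API, WebSocket, error codes |
+| Deployment design | `tech/DEPLOYMENT.md` | Containerized deployment, resource requirements, disaster recovery |
+| Test design | `tech/TEST_DESIGN.md` | Test strategy, case matrix, canary release gate |
+| Feature list | `specs/功能清单.md` | Module-level feature list organized by Phase |
+| Competitor analysis | `specs/竞品分析.md` | Competitive differentiation and market positioning |
+| Test cases | `test/CASES.md` | Test cases numbered by AC |
+| Test strategy | `test/STRATEGY.md` | Layered test model and toolchain |
+
+---
+
+## Tech stack
+
+| Component | Choice |
+|---|---|
+| Language | Go 1.22+ |
+| HTTP | Standard library `net/http` + custom middleware |
+| Database | PostgreSQL 15+ (`jackc/pgx/v5`) |
+| Cache | Redis (`redis/go-redis/v9`) |
+| Configuration | YAML + Viper |
+| Time-series store | Prometheus / VictoriaMetrics |
+| Frontend | React 18+ + ECharts 5.x |
+| Testing | Go testing + testify + miniredis + sqlmock |
+
+---
+
+## Current status
+
+- **Gate verdict**: HLD and TEST_DESIGN are both at `REQUEST_CHANGES` and do not yet meet the bar for starting development.
+- **Items still to be completed**:
+  1. Error code unification (done)
+  2. Minor typo fixes (done)
+  3. Database migration SQL
+  4. Project skeleton (go.mod, Makefile, Dockerfile)
+  5. Trimming the feature list (details where the PM went beyond scope)
+  6. Converting gate action items into executable tasks
+
+---
+
+## Relationship to the 桥 project
+
+This system is an extension of the 桥 project and keeps the same technical conventions:
+- The same layered architecture (Repository + Service + Handler)
+- The same Store interface (including optimistic-lock version control)
+- The same dual mode: standalone and integrated
+- The same audit event model (consistent with supply-api/)
+- Database tables use the `ai_ops_` prefix to avoid schema conflicts
diff --git a/docs/SINGLE_NODE_RUNBOOK.md b/docs/SINGLE_NODE_RUNBOOK.md
new file mode 100644
index 0000000..ea87cd5
--- /dev/null
+++ b/docs/SINGLE_NODE_RUNBOOK.md
@@ -0,0 +1,193 @@
+# AI-Ops Single-Node Runbook
+
+> Scope: development machines and a single production server. The goal is stable, repeatable startup with health checks, backup, rollback, and failure recovery. This is not a multi-node high-availability solution.
+
+## 0. 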
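Prerequisites
+
+Pick one container runtime:
+
+- Docker + docker compose
+- Podman + podman-compose
+
+The host also needs:
+
+- go 1.22+
+- curl
+- python3
+- gzip / zcat
+
+## 1. One-command start
+
+```bash
+cd /home/long/project/ai-ops
+scripts/aiops-single-node.sh start
+```
+
+The script automatically:
+
+1. Generates `.runtime/single-node.env`, containing a random JWT secret and metrics auth.
+2. Generates `.runtime/config.single.yaml`, which uses production mode.
+3. Builds the static binary `.runtime/ai-ops`.
+4. Starts PostgreSQL, Redis, and the AI-Ops app.
+5. Waits for `/actuator/health/ready` to report ready.
+6. Runs the smoke checks: health, login, alerts, rules, channels, dashboard, openapi.
+
+Default listen addresses and ports:
+
+| Service | Default listen address | Notes |
+|------|----------|------|
+| App | 127.0.0.1:18080 | Local access only by default; do not expose it directly to the public internet on a production host |
+| PostgreSQL | 127.0.0.1:15432 | Local access only by default |
+| Redis | 127.0.0.1:16379 | Local access only by default |
+
+These can be overridden with environment variables:
+
+```bash
+AI_OPS_APP_PORT=28080 AI_OPS_DB_PORT=25432 AI_OPS_REDIS_PORT=26379 scripts/aiops-single-node.sh start
+```
+
+## 2. Routine checks
+
+```bash
+scripts/aiops-single-node.sh status
+scripts/aiops-single-node.sh smoke
+scripts/aiops-single-node.sh logs
+```
+
+Direct access:
+
+```bash
+curl -fsS http://127.0.0.1:18080/health
+curl -fsS http://127.0.0.1:18080/actuator/health/ready
+curl -fsS http://127.0.0.1:18080/ops/dashboard
+```
+
+## 3. Alerting capability boundaries
+
+The single-node edition currently supports:
+
+- Alert rule CRUD
+- Scheduled evaluation by the rule engine
+- Escalation from P2 to P1 when a P2 alert persists for 2 hours
+- Aggregated alerts for the same resource within a 1-minute window
+- Webhook notification delivery
+- Notification logs persisted to the database
+- Fallback to a backup channel after a delivery failure
+
+The following are placeholders and cannot be promised as official on-call channels:
+
+- email
+- Feishu
+- Wechat
+
+For the stable single-node edition, the recommendation is therefore to start with a webhook into an existing alert gateway, an enterprise chat-bot forwarder, or a self-hosted relay.
+
+## 4. Backup
+
+```bash
+scripts/aiops-single-node.sh backup
+```
+
+Backup files are written to:
+
+```text
+backups/ai_ops_YYYYMMDD-HHMMSS.sql.gz
+```
+
+On a production server, run it at least once a day, for example via crontab:
+
+```cron
+30 2 * * * cd /home/long/project/ai-ops && scripts/aiops-single-node.sh backup >> backups/backup.log 2>&1
+```
+
+## 5. Rollback / database restore
+
+Restore from a specific backup:
+
+```bash
+scripts/aiops-single-node.sh restore backups/ai_ops_YYYYMMDD-HHMMSS.sql.gz
+```
+
+The script:
+
+1. Stops the app container to prevent writes during the restore.
+2. Empties the PostgreSQL `public` schema so the restore does not fail on tables, functions, or triggers that already exist.
+3. Imports the backup with psql.
+4. Starts the app.
+5. Waits for ready.
+6. 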
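Runs the smoke checks automatically.
+
+Note: restore has side effects. Confirm that the backup file is correct before running it and, if necessary, make a copy of the current backup first.
+
+## 6. Failure recovery
+
+After a container exits abnormally or the server reboots:
+
+```bash
+scripts/aiops-single-node.sh recover
+```
+
+The script brings PostgreSQL, Redis, and the app back up on the existing volumes, then runs the ready check and the smoke checks.
+
+If the app is unhealthy but DB/Redis are fine:
+
+```bash
+scripts/aiops-single-node.sh restart
+scripts/aiops-single-node.sh smoke
+```
+
+## 7. Stopping the services
+
+```bash
+scripts/aiops-single-node.sh stop
+```
+
+This command keeps the volumes and does not delete data.
+
+## 8. Security configuration
+
+The script creates `.runtime/single-node.env` under `umask 077`. The file contains:
+
+- `AI_OPS_JWT_SECRET`
+- `AI_OPS_METRICS_AUTH`
+- The database password
+
+Do not commit `.runtime/` or `backups/`. The repository `.gitignore` already excludes both directories.
+
+In production mode the application enforces the following checks:
+
+- JWT secret of at least 32 characters
+- metrics auth of at least 16 characters
+- DB host/user/password/dbname are required
+- port/pool/retention must be valid
+
+## 9. Single-node gate
+
+Before going live, run at least:
+
+```bash
+go vet ./...
+go test -race -buildvcs=false ./...
+scripts/aiops-single-node.sh doctor
+scripts/aiops-single-node.sh start
+scripts/aiops-single-node.sh backup
+scripts/aiops-single-node.sh recover
+```
+
+If a rollback drill window is available, also run:
+
+```bash
+scripts/aiops-single-node.sh restore backups/.sql.gz
+```
+
+## 10. 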
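Still not multi-node production grade
+
+The single-node edition does not provide:
+
+- Multi-replica high availability
+- PostgreSQL primary/standby failover
+- Redis high availability
+- Mutual exclusion of tasks across nodes
+- Complete production-grade Feishu/Wechat/email notification implementations
+
+It does, however, close the loop on stable operation, backup, rollback, and recovery for development machines and a single production server.
diff --git a/docs/汇总审核报告与改进任务清单.md b/docs/汇总审核报告与改进任务清单.md
new file mode 100644
index 0000000..94020c7
--- /dev/null
+++ b/docs/汇总审核报告与改进任务清单.md
@@ -0,0 +1,223 @@
+# AI-Ops Intelligent Operations System: Consolidated Review Report and Improvement Task List
+
+> Review date: 2026-05-11
+> Review roles: PM + TechLead + QA + Security (coordinated by 小龙)
+> Review scope: PRD, HLD, INTERFACE, DEPLOYMENT, TEST_DESIGN, CASES, STRATEGY, 功能清单 (feature list), 竞品分析 (competitor analysis), migration SQL
+
+---
+
+## 1. Review conclusions by role
+
+| Role | Overall rating | Documents reviewed | Key findings |
+|------|----------|----------|----------|
+| PM | B | PRD.md, 功能清单.md, competitor-analysis.md | The user journey is complete and the ACs are well quantified, but there are scope conflicts, unrealistic schedule estimates, and missing features |
+| TechLead | B | HLD.md, INTERFACE.md, DEPLOYMENT.md, migration | The architecture direction is sound, but interfaces are seriously inconsistent across documents, tables are missing between the ER diagram and the migration, and IntegrationPlugin is undefined |
+| QA | C | TEST_DESIGN.md, CASES.md, STRATEGY.md | The test strategy framework is reasonable, but negative cases are largely missing, 4 exception flows are omitted, CI has no configuration, and performance load testing has nothing to run |
+| Security | C+ | HLD.md, INTERFACE.md, PRD.md, migration | Basic RBAC and data masking are designed, but LLM-specific risks are not covered, the audit tamper-proofing trigger is missing, and the WebSocket and metrics endpoints have no authentication |
+
+**Overall assessment: the current design is not sufficient to start development. P0 issues must be fixed first, with P1 issues closed in parallel, before it can meet a production-grade delivery standard.**
+
+---
+
+## 2. P0 blocking issues (16 in total, deduplicated)
+
+### 2.1 Document consistency (4 items)
+
+| ID | Issue | Impact | Owning document | Fix |
+|------|------|------|----------|----------|
+| D-P0-01 | Interface definitions are seriously inconsistent: HLD and INTERFACE use completely different paths and names for gateway/supply-api/token-runtime | The development team cannot determine the real contract, and integration tests are bound to fail | HLD §7, INTERFACE §2 | Hold an interface alignment meeting, take INTERFACE.md as the baseline, and produce INTEGRATION_CONTRACT.md as the single source of truth |
+| D-P0-02 | Four tables in the ER diagram are missing from the migration: ai_ops_events, ai_ops_notifys, ai_ops_configs, ai_ops_snapshots | Core flows (notifications, snapshots, config versioning) cannot be implemented | HLD §4.1 vs §4.2, migration | Confirm whether they are needed: if so, add the table structures and migration; if not, remove them from the ER diagram and describe the alternative |
+| D-P0-03 | Self-healing action type names are inconsistent: HLD uses restart_instance/switch_route, INTERFACE uses restart_service/switch_provider | Storage serialization, API validation, and frontend enums all disagree | HLD §3.3, INTERFACE §1.3 | Unify the names; a consistent snake_case convention is recommended |
+| D-P0-04 | The IntegrationPlugin interface is undefined: the core contract of integrated mode has no Go interface, lifecycle, or registration mechanism | Integrated mode cannot be implemented in code, and CI cannot check it | HLD §1.3, §3.2, §7, §10.2 | In INTERFACE.md, add the IntegrationPlugin Go interface 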
定义 | + +### 2.2 需求完整性（2 项） + +| 编号 | 问题 | 影响 | 责任文档 | 修复方案 | +|------|------|------|----------|----------| +| R-P0-01 | 范围冲突：供应商智能切换未在 PRD In Scope 明确纳入，但功能清单作为 Phase 3 核心模块（16+ 任务） | 与 Out of Scope “不做自动扩容决策”擦边，开发阶段极易产生范围争议 | PRD §3, 功能清单 3.4 | 明确纳入 In Scope 或移入 Out of Scope，若纳入则在 AC 中补充验收标准 | +| R-P0-02 | 自愈动作“重启实例”在功能清单中遗漏具体实现任务 | QA 无法验收该自愈动作 | PRD AC-6, 功能清单 3.1.2 | 补充重启实例实现任务（如调用 K8s API 或主机 agent） | + +### 2.3 测试设计完整性（4 项） + +| 编号 | 问题 | 影响 | 责任文档 | 修复方案 | +|------|------|------|----------|----------| +| T-P0-01 | AC 负向测试用例大面积缺失：12 个 AC 中至少 8 个无负向/异常输入用例 | 无法验证非法输入、边界越界、权限不足等场景，生产缺陷逃逸 | TEST_DESIGN.md, CASES.md | 为 AC-01/02/04/05/06/09/10/11 各补充至少 1 条负向用例 | +| T-P0-02 | CASES.md 遗漏异常流程 F-05~F-08（审计满盘、级联故障、数据库中断、看板超时） | 核心容灾与降级场景无测试用例实底 | CASES.md | 补充 TC-E5~E8 四条异常流程用例 | +| T-P0-03 | CI 集成零配置：STRATEGY.md 仅文字描述，无 workflow 文件、覆盖率阻断逻辑、失败通知机制 | 无法形成自动化质量门禁 | STRATEGY.md | 创建 `.github/workflows/ci.yml`，含覆盖率解析与阻断、失败通知 | +| T-P0-04 | 性能压测无执行载体：k6 脚本、环境规格、P99 计算方式、持续时间均未定义 | 性能基准无法复现和验证，灰度门禁无法判定 | TEST_DESIGN.md §9.1 | 创建 `test/perf/` 目录，含 k6 脚本和环境规格文档 | + +### 2.4 安全设计完整性（6 项） + +| 编号 | 问题 | 影响 | 责任文档 | 修复方案 | +|------|------|------|----------|----------| +| S-P0-01 | LLM 特有风险未覆盖：提示注入、提示泄露、幻觉、模型偷取等 OWASP LLM Top 10 风险 | 系统集成 LLM/Gateway 后可能遭受提示攻击或数据泄露 | HLD §10.1 | 在威胁建模中增加 LLM 特有风险识别与缓解策略 | +| S-P0-02 | 审计表无防篡改触发器：migration 未创建 `audit_log_prevent_update_delete` 类似触发器 | 审计日志可被更新或删除，违背不可篡改要求 | migration | 补充 PostgreSQL 触发器或在应用层强制只追加 | +| S-P0-03 | Append-only 审计设计未在 HLD 中明确陈述：无“先写审计再执行业务”的 fail-closed 设计 | 业务操作失败时可能丢失审计记录，影响故障定责 | HLD §3.3 | 明确审计写入与业务执行的事务顺序和回滚机制 | +| S-P0-04 | WebSocket 接口无鉴权说明：告警数据为敏感生产信息 | 公网可能监听告警流，数据泄露风险 | INTERFACE §3.4 | 补充 JWT Token 鉴权说明，包括连接建立时的 token 校验 | +| S-P0-05 | SQL 注入防护无明确设计：HLD 未强制参数化/预编译查询 | 自定义规则、日志查询等功能存在 SQL 注入风险 | HLD §4 | 在 HLD 数据层设计中增加参数化查询强制要求 | +| S-P0-06 | /metrics 探针端点无鉴权说明：揭露生产指标给公网风险 | 攻击者可通过 /metrics 获取系统运行状态和敏感信息 | INTERFACE §3.2 | 限制内网 IP 访问或增加 API Key 鉴权 | + +--- + +## 三、P1 重要级问题（合计 18 项，已去重） + +### 3.1 需求与产品 + +| 
编号 | 问题 | 责任文档 | +|------|------|----------| +| R-P1-01 | 双重失败判定线：开发期 vs 上线后 30 天阈值不统一（20% vs 15%） | PRD §2, §8.3 | +| R-P1-02 | In Scope 使用“不仅仅包括于”留下范围蔓延口子 | PRD §3 In Scope | +| R-P1-03 | 通知渠道定义不一致：PRD 未含钉钉，功能清单出现钉钉 | PRD AC-4, 功能清单 2.3.2/3.4.3 | +| R-P1-04 | AC-7 “不可篡改”缺乏技术实现定义 | PRD AC-7 | +| R-P1-05 | AC-8 “操作前值有效”定义模糊 | PRD AC-8 | +| R-P1-06 | 级联故障回退（F-6）未在 AC 中体现 | PRD F-6, AC-6 | +| R-P1-07 | 容量预测算法缺可测试标准（“仅供参考”导致无法验收） | PRD AC-9 | +| R-P1-08 | 缺少 UI/UX 最低兼容性要求 | PRD 全文 | +| R-P1-09 | 角色权限矩阵过粗，缺少 API 级权限对照 | PRD AC-12, 功能清单 G1 | + +### 3.2 技术设计 + +| 编号 | 问题 | 责任文档 | +|------|------|----------| +| D-P1-05 | DEPLOYMENT “主备”与 active-active 多活逻辑矛盾 | DEPLOYMENT §1.1 vs §4.2 | +| D-P1-06 | Worker 执行 migration 存在多副本并发冲突风险 | DEPLOYMENT §3.2 | +| D-P1-07 | `ai_ops_roles` 表在 HLD 中提及但 migration 未定义 | HLD §8.1 vs migration | +| D-P1-08 | 快照表缺失影响级联故障回退 | HLD §3.3 | +| D-P1-09 | 告警聚合状态机不完整（解除规则未定义） | HLD §5.2 | +| D-P1-10 | 规则评估性能扩展性未给出分片策略 | HLD §9.1/9.2 | + +### 3.3 测试 + +| 编号 | 问题 | 责任文档 | +|------|------|----------| +| T-P1-01 | 覆盖率门槛缺少验证机制 | STRATEGY.md | +| T-P1-02 | 混沌测试无具体用例设计 | STRATEGY.md, TEST_DESIGN.md | +| T-P1-03 | 测试数据管理策略缺关键细节 | STRATEGY.md | +| T-P1-04 | 灰度门禁缺自动化判定脚本 | TEST_DESIGN.md §5.2 | +| T-P1-05 | 安全扫描工具与阈值未指定 | STRATEGY.md | +| T-P1-06 | E2E 测试缺少详细场景设计 | STRATEGY.md, TEST_DESIGN.md | + +### 3.4 安全 + +| 编号 | 问题 | 责任文档 | +|------|------|----------| +| S-P1-01 | 敏感字段脱敏策略仅有文字，无具体实现（如密码替换、数据加密） | HLD §8 | +| S-P1-02 | 自愈引擎权限边界未明确（如何防止自愈动作被滥用去重启关键服务） | PRD AC-6, HLD §3.3 | + +--- + +## 四、P2 改进建议（关键项） + +| 编号 | 问题 | 建议 | +|------|------|------| +| R-P2-01 | 商业化闭环缺 ROI 量化模型 | 补充运维人力成本节省计算示例 | +| R-P2-02 | 发布策略缺量化门控标准 | 补充告警噪声率<10%、通知成功率>95% 等可量化条件 | +| R-P2-03 | 审计日志 90 天保留未评估存储成本 | 补充压缩/归档策略或存储成本上限 | +| D-P2-11 | 错误码排版混淆（4001 与 4101 相邻易混） | 重新分段排版或增加注释说明 | +| D-P2-12 | metrics 分区表仅有 DEFAULT，无按天分区和自动清理 | 引入 pg_partman 或应用层定时任务 | +| D-P2-14 | Graceful Shutdown 未说明 WebSocket 长连接关闭策略 | 补充 close frame + ack 等待机制 | +| D-P2-15 | 存储估算假设 Prometheus 每样本 8 
bytes，与实际严重偏低 | 参考官方容量规划公式重新估算 | +| T-P2-01 | 用例编号风格不统一（TC-01-01 vs TC-1.1） | 统一为 TC-{AC}-{seq} | +| T-P2-02 | CASES.md TC-E2 漏掉 5xx 场景 | 补充 Webhook 5xx 测试 | + +--- + +## 五、改进任务清单（按模块分类，已去重排序） + +### Phase 0 — 文档修复与对齐（开发前必须完成） + +| 任务 ID | 任务名称 | 严重度 | 责任文档 | 估算工时 | 验收标准 | +|----------|----------|--------|----------|----------|----------| +| D0-01 | 召开接口对齐会，统一 gateway/supply-api/token-runtime 路径、命名、字段 | P0 | INTEGRATION_CONTRACT.md | 0.5d | 三份文档无接口冲突 | +| D0-02 | 补齐或删除 ER 图中 4 张缺失表（events/notifys/configs/snapshots） | P0 | HLD §4.2, migration | 0.5d | migration 与 ER 图一致 | +| D0-03 | 统一自愈动作命名并同步到所有文档 | P0 | HLD, INTERFACE, 功能清单 | 0.5d | 全文档自愈动作命名一致 | +| D0-04 | 定义 IntegrationPlugin Go interface 并写入 INTERFACE.md | P0 | INTERFACE.md | 0.5d | interface 定义包含 Init/RegisterRoutes/HealthChecks/Shutdown | +| R0-01 | 解决范围冲突：明确供应商智能切换 In/Out of Scope 定位 | P0 | PRD §3, 功能清单 | 0.5d | PRD 与功能清单范围一致 | +| R0-02 | 重新估算工期：138 任务按复杂度系数 + 20%联调 + 15%风险缓冲 | P0 | 功能清单 | 0.5d | 工期估算在 30~40 人天 | +| R0-03 | 补充自愈动作“重启实例”实现任务 | P0 | 功能清单 3.1.2 | 0.5d | 功能清单包含重启实例任务 | +| S0-01 | 在威胁建模中增加 LLM 特有风险（提示注入、幻觉、模型偷取） | P0 | HLD §10.1 | 0.5d | 威胁建模覆盖 LLM Top 5 风险 | +| S0-02 | 补充审计表防篡改触发器或应用层只追加约束 | P0 | migration | 0.5d | 审计表无法 UPDATE/DELETE | +| S0-03 | 明确审计写入与业务执行的事务顺序（fail-closed） | P0 | HLD §3.3 | 0.5d | 文档明确"先写审计再执行业务" | +| S0-04 | 补充 WebSocket JWT 鉴权说明 | P0 | INTERFACE §3.4 | 0.5d | WebSocket 接口含鉴权流程 | +| S0-05 | 在 HLD 中增加参数化查询强制要求 | P0 | HLD §4 | 0.5d | 所有数据库交互层必须使用参数化查询 | +| S0-06 | 限制 /metrics 端点访问（内网 IP 或 API Key） | P0 | INTERFACE §3.2 | 0.5d | /metrics 含访问控制说明 | +| T0-01 | 为 8 个缺失负向用例的 AC 补充负向用例 | P0 | TEST_DESIGN.md, CASES.md | 1d | 每个 AC 至少 1 正向 + 1 负向 | +| T0-02 | 补充 F-05~F-08 异常流程用例（TC-E5~E8） | P0 | CASES.md | 0.5d | 8 条异常流程全部覆盖 | +| T0-03 | 创建 `.github/workflows/ci.yml` 含覆盖率阻断与失败通知 | P0 | STRATEGY.md, ci.yml | 0.5d | PR 提交时自动触发并阻断不达标 PR | +| T0-04 | 创建 `test/perf/` 目录含 k6 脚本和环境规格 | P0 | TEST_DESIGN.md, test/perf/ | 0.5d | 性能压测可复现执行 | + +### Phase 1 — 需求与产品级 P1 闭环 + +| 任务 ID | 任务名称 | 
严重度 | 责任文档 | 估算工时 | +|----------|----------|--------|----------|----------| +| R1-01 | 统一失败判定线：上线后 30 天为统一窗口，噪声率<15% | P1 | PRD §2, §8.3 | 0.5d | +| R1-02 | 删除 In Scope 中“不仅仅包括于”，改为封闭列表 | P1 | PRD §3 | 0.5d | +| R1-03 | 统一通知渠道列表（是否含钉钉） | P1 | PRD AC-4, 功能清单 | 0.5d | +| R1-04 | AC-7 补充不可篡改的技术实现定义 | P1 | PRD AC-7 | 0.5d | +| R1-05 | AC-8 补充“有效”的判定标准 | P1 | PRD AC-8 | 0.5d | +| R1-06 | 在 AC-6 中补充级联故障回退验收点 | P1 | PRD AC-6 | 0.5d | +| R1-07 | 为容量预测（AC-9）补充可测试标准（如 MAPE<30%） | P1 | PRD AC-9 | 0.5d | +| R1-08 | 补充 UI 最低兼容性要求 | P1 | PRD | 0.5d | +| R1-09 | 细化角色权限矩阵到 API 级别 | P1 | PRD AC-12, 功能清单 G1 | 0.5d | + +### Phase 2 — 技术设计级 P1 闭环 + +| 任务 ID | 任务名称 | 严重度 | 责任文档 | 估算工时 | +|----------|----------|--------|----------|----------| +| D1-05 | 修正 DEPLOYMENT “主备”为 active-active 多活 | P1 | DEPLOYMENT §1.1 | 0.5d | +| D1-06 | 分离 migration 执行从 Worker 启动逻辑（init container 或 Job） | P1 | DEPLOYMENT §3.2 | 0.5d | +| D1-07 | 补充 `ai_ops_roles` 表结构 | P1 | HLD §8.1, migration | 0.5d | +| D1-08 | 补充 `ai_ops_snapshots` 表结构（级联故障回退） | P1 | HLD §3.3, migration | 0.5d | +| D1-09 | 完善告警聚合状态机（解除规则、子告警同步） | P1 | HLD §5.2 | 0.5d | +| D1-10 | 补充规则评估分片策略与负载均衡方案 | P1 | HLD §9.1/9.2 | 0.5d | +| D2-12 | 完善 metrics 分区表管理策略（pg_partman 或应用层） | P2 | migration, HLD | 0.5d | +| D2-14 | 补充 Graceful Shutdown 中 WebSocket 关闭策略 | P2 | DEPLOYMENT §3.2 | 0.5d | +| D2-15 | 重新校准时序存储容量估算 | P2 | HLD §9.3 | 0.5d | + +### Phase 3 — 测试资产完善 + +| 任务 ID | 任务名称 | 严重度 | 责任文档 | 估算工时 | +|----------|----------|--------|----------|----------| +| T1-01 | 建立覆盖率验证机制（CI 解析 domain≥70%, service≥80%） | P1 | STRATEGY.md | 0.5d | +| T1-02 | 设计 3 条混沌测试用例（Pod 杀死、Redis 分区、PG 主从切换） | P1 | TEST_DESIGN.md | 0.5d | +| T1-03 | 完善测试数据管理规范（fixtures 目录结构、大数据生成脚本、并行隔离） | P1 | STRATEGY.md | 0.5d | +| T1-04 | 为灰度门禁增加自动化判定脚本 | P1 | TEST_DESIGN.md §5.2 | 0.5d | +| T1-05 | 明确安全扫描工具（Trivy/Gosec）与阈值 | P1 | STRATEGY.md | 0.5d | +| T1-06 | 补充 E2E 详细场景设计（完整链路） | P1 | TEST_DESIGN.md, CASES.md | 0.5d | +| T2-01 | 统一用例编号风格为 TC-{AC}-{seq} | P2 | TEST_DESIGN.md, 
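CASES.md | 0.5d |
+| T2-02 | Add a Webhook 5xx test scenario | P2 | CASES.md TC-E2 | 0.5d |
+
+### Phase 4: Security and operations tooling
+
+| Task ID | Task | Severity | Owning document | Estimated effort |
+|----------|----------|--------|----------|----------|
+| S1-01 | Specify the concrete implementation of sensitive-field masking (password replacement, encryption) | P1 | HLD §8 | 0.5d |
+| S1-02 | Define the permission boundary of the self-healing engine (prevent it from being abused to restart critical services) | P1 | PRD AC-6, HLD §3.3 | 0.5d |
+| R2-01 | Add an ROI quantification model and financial metrics | P2 | PRD, competitor-analysis | 0.5d |
+| R2-02 | Add quantitative gate criteria to the release strategy | P2 | PRD §8 | 0.5d |
+| R2-03 | Add an audit log storage cost assessment and compression strategy | P2 | PRD, HLD §9.3 | 0.5d |
+| D2-11 | Improve the error code layout and add explanatory notes | P2 | INTERFACE §3.3 | 0.5d |
+
+---
+
+## 6. Overall improvement plan
+
+| Phase | Tasks | Estimated effort | Goal |
+|------|--------|----------|------|
+| Phase 0 Document fixes and alignment | 16 | 8 person-days | Eliminate all P0 issues; documents are mutually consistent |
+| Phase 1 Requirements and product P1 | 9 | 4.5 person-days | PRD is complete, ACs are testable, permissions are clear |
+| Phase 2 Technical design P1 | 9 | 4.5 person-days | HLD/DEPLOYMENT are complete and the deployment is executable |
+| Phase 3 Test asset completion | 8 | 4 person-days | Test cases are complete and CI runs |
+| Phase 4 Security and operations tooling | 6 | 3 person-days | Threat model is complete and the security gate is executable |
+| **Total** | **48** | **24 person-days** | Reach production-grade design quality |
+
+---
+
+## 7. Conclusions from 小龙
+
+1. **The project cannot enter development in its current state**: there are 16 P0 blocking issues in total, spanning document consistency, test completeness, and the security baseline.
+2. **The most dangerous systemic risk is inconsistent interface definitions**: HLD, INTERFACE, and DEPLOYMENT give different paths and names for the same integration point, so the development team cannot determine the real contract. This must be aligned first.
+3. **QA is the weakest link**: it is rated C, with negative cases largely missing, no CI configuration, and nothing to run for performance load testing. T0-01 through T0-04 should be completed first.
+4. **Security does not cover LLM-specific risks**: once the project integrates with the LLM/Gateway, risks such as prompt injection and hallucination could lead to serious security incidents and must be added to the threat model.
+5. **The schedule estimate is badly off**: 138 tasks are budgeted at only 18 person-days, while at least 30~40 person-days are realistically needed. Re-estimate and reserve 20% for integration plus 15% as a risk buffer.
+6. 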
**建议执行顺序**：Phase 0 → Phase 1 → Phase 2 → Phase 3 → Phase 4，每个 Phase 完成后由对应角色复审，所有 P0 闭环后才能进入开发。 diff --git a/go.mod b/go.mod new file mode 100644 index 0000000..9005560 --- /dev/null +++ b/go.mod @@ -0,0 +1,43 @@ +module github.com/company/ai-ops + +go 1.22.2 + +require ( + github.com/golang-jwt/jwt/v5 v5.2.0 + github.com/jackc/pgx/v5 v5.6.0 + github.com/redis/go-redis/v9 v9.6.0 + github.com/spf13/viper v1.19.0 + github.com/stretchr/testify v1.9.0 +) + +require ( + github.com/cespare/xxhash/v2 v2.2.0 // indirect + github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc // indirect + github.com/dgryski/go-rendezvous v0.0.0-20200823014737-9f7001d12a5f // indirect + github.com/fsnotify/fsnotify v1.7.0 // indirect + github.com/hashicorp/hcl v1.0.0 // indirect + github.com/jackc/pgpassfile v1.0.0 // indirect + github.com/jackc/pgservicefile v0.0.0-20221227161230-091c0ba34f0a // indirect + github.com/jackc/puddle/v2 v2.2.1 // indirect + github.com/magiconair/properties v1.8.7 // indirect + github.com/mitchellh/mapstructure v1.5.0 // indirect + github.com/pelletier/go-toml/v2 v2.2.2 // indirect + github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 // indirect + github.com/sagikazarmark/locafero v0.4.0 // indirect + github.com/sagikazarmark/slog-shim v0.1.0 // indirect + github.com/sourcegraph/conc v0.3.0 // indirect + github.com/spf13/afero v1.11.0 // indirect + github.com/spf13/cast v1.6.0 // indirect + github.com/spf13/pflag v1.0.5 // indirect + github.com/stretchr/objx v0.5.2 // indirect + github.com/subosito/gotenv v1.6.0 // indirect + go.uber.org/atomic v1.9.0 // indirect + go.uber.org/multierr v1.9.0 // indirect + golang.org/x/crypto v0.21.0 // indirect + golang.org/x/exp v0.0.0-20230905200255-921286631fa9 // indirect + golang.org/x/sync v0.6.0 // indirect + golang.org/x/sys v0.18.0 // indirect + golang.org/x/text v0.14.0 // indirect + gopkg.in/ini.v1 v1.67.0 // indirect + gopkg.in/yaml.v3 v3.0.1 // indirect +) diff --git a/go.sum b/go.sum new 
file mode 100644 index 0000000..b591b52 --- /dev/null +++ b/go.sum @@ -0,0 +1,97 @@ +github.com/bsm/ginkgo/v2 v2.12.0 h1:Ny8MWAHyOepLGlLKYmXG4IEkioBysk6GpaRTLC8zwWs= +github.com/bsm/ginkgo/v2 v2.12.0/go.mod h1:SwYbGRRDovPVboqFv0tPTcG1sN61LM1Z4ARdbAV9g4c= +github.com/bsm/gomega v1.27.10 h1:yeMWxP2pV2fG3FgAODIY8EiRE3dy0aeFYt4l7wh6yKA= +github.com/bsm/gomega v1.27.10/go.mod h1:JyEr/xRbxbtgWNi8tIEVPUYZ5Dzef52k01W3YH0H+O0= +github.com/cespare/xxhash/v2 v2.2.0 h1:DC2CZ1Ep5Y4k3ZQ899DldepgrayRUGE6BBZ/cd9Cj44= +github.com/cespare/xxhash/v2 v2.2.0/go.mod h1:VGX0DQ3Q6kWi7AoAeZDth3/j3BFtOZR5XLFGgcrjCOs= +github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= +github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= +github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc h1:U9qPSI2PIWSS1VwoXQT9A3Wy9MM3WgvqSxFWenqJduM= +github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= +github.com/dgryski/go-rendezvous v0.0.0-20200823014737-9f7001d12a5f h1:lO4WD4F/rVNCu3HqELle0jiPLLBs70cWOduZpkS1E78= +github.com/dgryski/go-rendezvous v0.0.0-20200823014737-9f7001d12a5f/go.mod h1:cuUVRXasLTGF7a8hSLbxyZXjz+1KgoB3wDUb6vlszIc= +github.com/frankban/quicktest v1.14.6 h1:7Xjx+VpznH+oBnejlPUj8oUpdxnVs4f8XU8WnHkI4W8= +github.com/frankban/quicktest v1.14.6/go.mod h1:4ptaffx2x8+WTWXmUCuVU6aPUX1/Mz7zb5vbUoiM6w0= +github.com/fsnotify/fsnotify v1.7.0 h1:8JEhPFa5W2WU7YfeZzPNqzMP6Lwt7L2715Ggo0nosvA= +github.com/fsnotify/fsnotify v1.7.0/go.mod h1:40Bi/Hjc2AVfZrqy+aj+yEI+/bRxZnMJyTJwOpGvigM= +github.com/golang-jwt/jwt/v5 v5.2.0 h1:d/ix8ftRUorsN+5eMIlF4T6J8CAt9rch3My2winC1Jw= +github.com/golang-jwt/jwt/v5 v5.2.0/go.mod h1:pqrtFR0X4osieyHYxtmOUWsAWrfe1Q5UVIyoH402zdk= +github.com/google/go-cmp v0.5.9 h1:O2Tfq5qg4qc4AmwVlvv0oLiVAGB7enBSJ2x2DqQFi38= +github.com/google/go-cmp v0.5.9/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY= +github.com/hashicorp/hcl v1.0.0 
h1:0Anlzjpi4vEasTeNFn2mLJgTSwt0+6sfsiTG8qcWGx4= +github.com/hashicorp/hcl v1.0.0/go.mod h1:E5yfLk+7swimpb2L/Alb/PJmXilQ/rhwaUYs4T20WEQ= +github.com/jackc/pgpassfile v1.0.0 h1:/6Hmqy13Ss2zCq62VdNG8tM1wchn8zjSGOBJ6icpsIM= +github.com/jackc/pgpassfile v1.0.0/go.mod h1:CEx0iS5ambNFdcRtxPj5JhEz+xB6uRky5eyVu/W2HEg= +github.com/jackc/pgservicefile v0.0.0-20221227161230-091c0ba34f0a h1:bbPeKD0xmW/Y25WS6cokEszi5g+S0QxI/d45PkRi7Nk= +github.com/jackc/pgservicefile v0.0.0-20221227161230-091c0ba34f0a/go.mod h1:5TJZWKEWniPve33vlWYSoGYefn3gLQRzjfDlhSJ9ZKM= +github.com/jackc/pgx/v5 v5.6.0 h1:SWJzexBzPL5jb0GEsrPMLIsi/3jOo7RHlzTjcAeDrPY= +github.com/jackc/pgx/v5 v5.6.0/go.mod h1:DNZ/vlrUnhWCoFGxHAG8U2ljioxukquj7utPDgtQdTw= +github.com/jackc/puddle/v2 v2.2.1 h1:RhxXJtFG022u4ibrCSMSiu5aOq1i77R3OHKNJj77OAk= +github.com/jackc/puddle/v2 v2.2.1/go.mod h1:vriiEXHvEE654aYKXXjOvZM39qJ0q+azkZFrfEOc3H4= +github.com/kr/pretty v0.3.1 h1:flRD4NNwYAUpkphVc1HcthR4KEIFJ65n8Mw5qdRn3LE= +github.com/kr/pretty v0.3.1/go.mod h1:hoEshYVHaxMs3cyo3Yncou5ZscifuDolrwPKZanG3xk= +github.com/kr/text v0.2.0 h1:5Nx0Ya0ZqY2ygV366QzturHI13Jq95ApcVaJBhpS+AY= +github.com/kr/text v0.2.0/go.mod h1:eLer722TekiGuMkidMxC/pM04lWEeraHUUmBw8l2grE= +github.com/magiconair/properties v1.8.7 h1:IeQXZAiQcpL9mgcAe1Nu6cX9LLw6ExEHKjN0VQdvPDY= +github.com/magiconair/properties v1.8.7/go.mod h1:Dhd985XPs7jluiymwWYZ0G4Z61jb3vdS329zhj2hYo0= +github.com/mitchellh/mapstructure v1.5.0 h1:jeMsZIYE/09sWLaz43PL7Gy6RuMjD2eJVyuac5Z2hdY= +github.com/mitchellh/mapstructure v1.5.0/go.mod h1:bFUtVrKA4DC2yAKiSyO/QUcy7e+RRV2QTWOzhPopBRo= +github.com/pelletier/go-toml/v2 v2.2.2 h1:aYUidT7k73Pcl9nb2gScu7NSrKCSHIDE89b3+6Wq+LM= +github.com/pelletier/go-toml/v2 v2.2.2/go.mod h1:1t835xjRzz80PqgE6HHgN2JOsmgYu/h4qDAS4n929Rs= +github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4= +github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 h1:Jamvg5psRIccs7FGNTlIRMkT8wgtp5eCXdBlqhYGL6U= 
+github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4= +github.com/redis/go-redis/v9 v9.6.0 h1:NLck+Rab3AOTHw21CGRpvQpgTrAU4sgdCswqGtlhGRA= +github.com/redis/go-redis/v9 v9.6.0/go.mod h1:hdY0cQFCN4fnSYT6TkisLufl/4W5UIXyv0b/CLO2V2M= +github.com/rogpeppe/go-internal v1.9.0 h1:73kH8U+JUqXU8lRuOHeVHaa/SZPifC7BkcraZVejAe8= +github.com/rogpeppe/go-internal v1.9.0/go.mod h1:WtVeX8xhTBvf0smdhujwtBcq4Qrzq/fJaraNFVN+nFs= +github.com/sagikazarmark/locafero v0.4.0 h1:HApY1R9zGo4DBgr7dqsTH/JJxLTTsOt7u6keLGt6kNQ= +github.com/sagikazarmark/locafero v0.4.0/go.mod h1:Pe1W6UlPYUk/+wc/6KFhbORCfqzgYEpgQ3O5fPuL3H4= +github.com/sagikazarmark/slog-shim v0.1.0 h1:diDBnUNK9N/354PgrxMywXnAwEr1QZcOr6gto+ugjYE= +github.com/sagikazarmark/slog-shim v0.1.0/go.mod h1:SrcSrq8aKtyuqEI1uvTDTK1arOWRIczQRv+GVI1AkeQ= +github.com/sourcegraph/conc v0.3.0 h1:OQTbbt6P72L20UqAkXXuLOj79LfEanQ+YQFNpLA9ySo= +github.com/sourcegraph/conc v0.3.0/go.mod h1:Sdozi7LEKbFPqYX2/J+iBAM6HpqSLTASQIKqDmF7Mt0= +github.com/spf13/afero v1.11.0 h1:WJQKhtpdm3v2IzqG8VMqrr6Rf3UYpEF239Jy9wNepM8= +github.com/spf13/afero v1.11.0/go.mod h1:GH9Y3pIexgf1MTIWtNGyogA5MwRIDXGUr+hbWNoBjkY= +github.com/spf13/cast v1.6.0 h1:GEiTHELF+vaR5dhz3VqZfFSzZjYbgeKDpBxQVS4GYJ0= +github.com/spf13/cast v1.6.0/go.mod h1:ancEpBxwJDODSW/UG4rDrAqiKolqNNh2DX3mk86cAdo= +github.com/spf13/pflag v1.0.5 h1:iy+VFUOCP1a+8yFto/drg2CJ5u0yRoB7fZw3DKv/JXA= +github.com/spf13/pflag v1.0.5/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg= +github.com/spf13/viper v1.19.0 h1:RWq5SEjt8o25SROyN3z2OrDB9l7RPd3lwTWU8EcEdcI= +github.com/spf13/viper v1.19.0/go.mod h1:GQUN9bilAbhU/jgc1bKs99f/suXKeUMct8Adx5+Ntkg= +github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME= +github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw= +github.com/stretchr/objx v0.5.0/go.mod h1:Yh+to48EsGEfYuaHDzXPcE3xhTkx73EhmCGUpEOglKo= +github.com/stretchr/objx v0.5.2 
h1:xuMeJ0Sdp5ZMRXx/aWO6RZxdr3beISkG5/G/aIRr3pY= +github.com/stretchr/objx v0.5.2/go.mod h1:FRsXN1f5AsAjCGJKqEizvkpNtU+EGNCLh3NxZ/8L+MA= +github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI= +github.com/stretchr/testify v1.7.0/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg= +github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg= +github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU= +github.com/stretchr/testify v1.8.4/go.mod h1:sz/lmYIOXD/1dqDmKjjqLyZ2RngseejIcXlSw2iwfAo= +github.com/stretchr/testify v1.9.0 h1:HtqpIVDClZ4nwg75+f6Lvsy/wHu+3BoSGCbBAcpTsTg= +github.com/stretchr/testify v1.9.0/go.mod h1:r2ic/lqez/lEtzL7wO/rwa5dbSLXVDPFyf8C91i36aY= +github.com/subosito/gotenv v1.6.0 h1:9NlTDc1FTs4qu0DDq7AEtTPNw6SVm7uBMsUCUjABIf8= +github.com/subosito/gotenv v1.6.0/go.mod h1:Dk4QP5c2W3ibzajGcXpNraDfq2IrhjMIvMSWPKKo0FU= +go.uber.org/atomic v1.9.0 h1:ECmE8Bn/WFTYwEW/bpKD3M8VtR/zQVbavAoalC1PYyE= +go.uber.org/atomic v1.9.0/go.mod h1:fEN4uk6kAWBTFdckzkM89CLk9XfWZrxpCo0nPH17wJc= +go.uber.org/multierr v1.9.0 h1:7fIwc/ZtS0q++VgcfqFDxSBZVv/Xo49/SYnDFupUwlI= +go.uber.org/multierr v1.9.0/go.mod h1:X2jQV1h+kxSjClGpnseKVIxpmcjrj7MNnI0bnlfKTVQ= +golang.org/x/crypto v0.21.0 h1:X31++rzVUdKhX5sWmSOFZxx8UW/ldWx55cbf08iNAMA= +golang.org/x/crypto v0.21.0/go.mod h1:0BP7YvVV9gBbVKyeTG0Gyn+gZm94bibOW5BjDEYAOMs= +golang.org/x/exp v0.0.0-20230905200255-921286631fa9 h1:GoHiUyI/Tp2nVkLI2mCxVkOjsbSXD66ic0XW0js0R9g= +golang.org/x/exp v0.0.0-20230905200255-921286631fa9/go.mod h1:S2oDrQGGwySpoQPVqRShND87VCbxmc6bL1Yd2oYrm6k= +golang.org/x/sync v0.6.0 h1:5BMeUDZ7vkXGfEr1x9B4bRcTH4lpkTkpdh0T/J+qjbQ= +golang.org/x/sync v0.6.0/go.mod h1:Czt+wKu1gCyEFDUtn0jG5QVvpJ6rzVqr5aXyt9drQfk= +golang.org/x/sys v0.18.0 h1:DBdB3niSjOA/O0blCZBqDefyWNYveAYMNF1Wum0DYQ4= +golang.org/x/sys v0.18.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA= +golang.org/x/text v0.14.0 
h1:ScX5w1eTa3QqT8oi6+ziP7dTV1S2+ALU0bI+0zXKWiQ= +golang.org/x/text v0.14.0/go.mod h1:18ZOQIKpY8NJVqYksKHtTdi31H5itFRjB5/qKTNYzSU= +gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0= +gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk= +gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q= +gopkg.in/ini.v1 v1.67.0 h1:Dgnx+6+nfE+IfzjUEISNeydPJh9AXNNsWbGP9KzCsOA= +gopkg.in/ini.v1 v1.67.0/go.mod h1:pNLf8WUiyNEtQjuu5G5vTm06TEv9tsIgeAvK8hOrP4k= +gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= +gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA= +gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= diff --git a/internal/config/config.go b/internal/config/config.go new file mode 100644 index 0000000..4f1e37b --- /dev/null +++ b/internal/config/config.go @@ -0,0 +1,138 @@ +package config + +import ( + "fmt" + "os" + "strings" + + "github.com/spf13/viper" +) + +// Config 是应用配置结构 +type Config struct { + Server ServerConfig `mapstructure:"server"` + Database DatabaseConfig `mapstructure:"database"` + Redis RedisConfig `mapstructure:"redis"` + Metrics MetricsConfig `mapstructure:"metrics"` +} + +type ServerConfig struct { + Port int `mapstructure:"port"` + Mode string `mapstructure:"mode"` // development / production + JWTSecret string `mapstructure:"jwt_secret"` + MetricsAuth string `mapstructure:"metrics_auth"` // API Key for /metrics +} + +type DatabaseConfig struct { + Host string `mapstructure:"host"` + Port int `mapstructure:"port"` + User string `mapstructure:"user"` + Password string `mapstructure:"password"` + DBName string `mapstructure:"dbname"` + SSLMode string `mapstructure:"sslmode"` + PoolSize int `mapstructure:"pool_size"` +} + +type RedisConfig struct { + Host string `mapstructure:"host"` + Port int 
`mapstructure:"port"` + Password string `mapstructure:"password"` + DB int `mapstructure:"db"` +} + +type MetricsConfig struct { + PrometheusURL string `mapstructure:"prometheus_url"` + RetentionDays int `mapstructure:"retention_days"` +} + +// Load 从配置文件和环境变量加载配置 +func Load(path string) (*Config, error) { + v := viper.New() + v.SetConfigFile(path) + v.SetEnvPrefix("AI_OPS") + v.SetEnvKeyReplacer(strings.NewReplacer(".", "_")) + v.AutomaticEnv() + + // 默认值 + v.SetDefault("server.port", 8080) + v.SetDefault("server.mode", "development") + v.SetDefault("database.host", "localhost") + v.SetDefault("database.port", 5432) + v.SetDefault("database.sslmode", "disable") + v.SetDefault("database.pool_size", 10) + v.SetDefault("redis.host", "localhost") + v.SetDefault("redis.port", 6379) + v.SetDefault("metrics.retention_days", 7) + + if err := v.ReadInConfig(); err != nil { + if _, ok := err.(viper.ConfigFileNotFoundError); !ok { + return nil, fmt.Errorf("read config: %w", err) + } + } + + var cfg Config + if err := v.Unmarshal(&cfg); err != nil { + return nil, fmt.Errorf("unmarshal config: %w", err) + } + + // 环境变量覆盖 + if host := os.Getenv("SPRING_DATASOURCE_URL"); host != "" { + // 兼容 Spring Boot 风格的数据库配置 + cfg.Database.Host = host + } + applyExplicitEnvOverrides(&cfg) + if err := cfg.Validate(); err != nil { + return nil, err + } + + return &cfg, nil +} + +func applyExplicitEnvOverrides(cfg *Config) { + setString := func(key string, dst *string) { + if v := os.Getenv(key); v != "" { + *dst = v + } + } + setString("AI_OPS_SERVER_JWT_SECRET", &cfg.Server.JWTSecret) + setString("AI_OPS_SERVER_METRICS_AUTH", &cfg.Server.MetricsAuth) + setString("AI_OPS_DATABASE_HOST", &cfg.Database.Host) + setString("AI_OPS_DATABASE_USER", &cfg.Database.User) + setString("AI_OPS_DATABASE_PASSWORD", &cfg.Database.Password) + setString("AI_OPS_DATABASE_DBNAME", &cfg.Database.DBName) + setString("AI_OPS_REDIS_HOST", &cfg.Redis.Host) + setString("AI_OPS_REDIS_PASSWORD", &cfg.Redis.Password) +} + 
+func (c *Config) Validate() error { + if c.Server.Port <= 0 || c.Server.Port > 65535 { + return fmt.Errorf("invalid server.port: %d", c.Server.Port) + } + if c.Database.Port <= 0 || c.Database.Port > 65535 { + return fmt.Errorf("invalid database.port: %d", c.Database.Port) + } + if c.Database.PoolSize <= 0 { + return fmt.Errorf("invalid database.pool_size: %d", c.Database.PoolSize) + } + if c.Metrics.RetentionDays <= 0 { + return fmt.Errorf("invalid metrics.retention_days: %d", c.Metrics.RetentionDays) + } + if strings.EqualFold(c.Server.Mode, "production") { + if len(c.Server.JWTSecret) < 32 { + return fmt.Errorf("server.jwt_secret must be at least 32 characters in production") + } + if len(c.Server.MetricsAuth) < 16 { + return fmt.Errorf("server.metrics_auth must be at least 16 characters in production") + } + if c.Database.Host == "" || c.Database.User == "" || c.Database.Password == "" || c.Database.DBName == "" { + return fmt.Errorf("database host/user/password/dbname are required in production") + } + } + return nil +} + +// DSN 返回 PostgreSQL 连接字符串 +func (c DatabaseConfig) DSN() string { + return fmt.Sprintf("host=%s port=%d user=%s password=%s dbname=%s sslmode=%s pool_max_conns=%d", + c.Host, c.Port, c.User, c.Password, c.DBName, c.SSLMode, c.PoolSize) +} diff --git a/internal/config/config_test.go b/internal/config/config_test.go new file mode 100644 index 0000000..21ccef2 --- /dev/null +++ b/internal/config/config_test.go @@ -0,0 +1,136 @@ +package config + +import ( + "os" + "path/filepath" + "strings" + "testing" +) + +func TestLoadReadsConfigAndBuildsDSN(t *testing.T) { + dir := t.TempDir() + path := filepath.Join(dir, "config.yaml") + content := []byte(`server: + port: 19090 + mode: production + jwt_secret: "0123456789abcdef0123456789abcdef" + metrics_auth: "metrics-api-key-123456" +database: + host: db + port: 15432 + user: user + password: pass + dbname: aiops + sslmode: require + pool_size: 7 +redis: + host: redis + port: 16379 + password: 
redispass + db: 2 +metrics: + prometheus_url: http://prom + retention_days: 14 +`) + if err := os.WriteFile(path, content, 0o600); err != nil { + t.Fatal(err) + } + + cfg, err := Load(path) + if err != nil { + t.Fatal(err) + } + if cfg.Server.Port != 19090 || cfg.Database.Host != "db" || cfg.Redis.DB != 2 || cfg.Metrics.RetentionDays != 14 { + t.Fatalf("unexpected config: %+v", cfg) + } + dsn := cfg.Database.DSN() + for _, want := range []string{"host=db", "port=15432", "user=user", "password=pass", "dbname=aiops", "sslmode=require", "pool_max_conns=7"} { + if !strings.Contains(dsn, want) { + t.Fatalf("dsn %q missing %q", dsn, want) + } + } +} + +func TestLoadAppliesDefaultsAndSpringDatasourceCompatibility(t *testing.T) { + t.Setenv("SPRING_DATASOURCE_URL", "spring-host") + path := filepath.Join(t.TempDir(), "empty.yaml") + if err := os.WriteFile(path, []byte("{}\n"), 0o600); err != nil { + t.Fatal(err) + } + + cfg, err := Load(path) + if err != nil { + t.Fatal(err) + } + if cfg.Server.Port != 8080 || cfg.Database.Port != 5432 || cfg.Redis.Port != 6379 || cfg.Metrics.RetentionDays != 7 { + t.Fatalf("defaults not applied: %+v", cfg) + } + if cfg.Database.Host != "spring-host" { + t.Fatalf("spring datasource compatibility not applied: %s", cfg.Database.Host) + } +} + +func TestLoadReturnsErrorForMalformedConfig(t *testing.T) { + path := filepath.Join(t.TempDir(), "bad.yaml") + if err := os.WriteFile(path, []byte("server: ["), 0o600); err != nil { + t.Fatal(err) + } + if _, err := Load(path); err == nil { + t.Fatal("expected malformed config error") + } +} + +func TestLoadRejectsWeakProductionSecrets(t *testing.T) { + path := filepath.Join(t.TempDir(), "config.yaml") + content := []byte(`server: + mode: production + jwt_secret: short + metrics_auth: short +database: + host: db + port: 5432 + user: aiops + password: aiops123 + dbname: ai_ops + pool_size: 1 +metrics: + retention_days: 7 +`) + if err := os.WriteFile(path, content, 0o600); err != nil { + t.Fatal(err) + } 
+ _, err := Load(path) + if err == nil || !strings.Contains(err.Error(), "jwt_secret") { + t.Fatalf("expected weak jwt secret error, got %v", err) + } +} + +func TestLoadAppliesExplicitEnvironmentOverrides(t *testing.T) { + path := filepath.Join(t.TempDir(), "config.yaml") + content := []byte(`server: + mode: production + jwt_secret: "0123456789abcdef0123456789abcdef" + metrics_auth: "metrics-api-key-123456" +database: + host: db + port: 5432 + user: aiops + password: aiops123 + dbname: ai_ops + pool_size: 1 +metrics: + retention_days: 7 +`) + if err := os.WriteFile(path, content, 0o600); err != nil { + t.Fatal(err) + } + t.Setenv("AI_OPS_DATABASE_PASSWORD", "override-pass") + t.Setenv("AI_OPS_SERVER_METRICS_AUTH", "override-metrics-key") + cfg, err := Load(path) + if err != nil { + t.Fatal(err) + } + if cfg.Database.Password != "override-pass" || cfg.Server.MetricsAuth != "override-metrics-key" { + t.Fatalf("env overrides not applied: %+v", cfg) + } +} diff --git a/internal/database/database.go b/internal/database/database.go new file mode 100644 index 0000000..3fea60a --- /dev/null +++ b/internal/database/database.go @@ -0,0 +1,47 @@ +package database + +import ( + "context" + "fmt" + "time" + + "github.com/company/ai-ops/internal/config" + "github.com/jackc/pgx/v5/pgxpool" +) + +// Pool is the global database connection pool +var Pool *pgxpool.Pool + +// Init initializes the database connection +func Init(cfg config.DatabaseConfig) error { + ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) + defer cancel() + + poolConfig, err := pgxpool.ParseConfig(cfg.DSN()) + if err != nil { + return fmt.Errorf("parse db config: %w", err) + } + + poolConfig.MaxConns = int32(cfg.PoolSize) + poolConfig.MinConns = 2 + poolConfig.MaxConnLifetime = 30 * time.Minute + poolConfig.MaxConnIdleTime = 10 * time.Minute + + Pool, err = pgxpool.NewWithConfig(ctx, poolConfig) + if err != nil { + return fmt.Errorf("create db pool: %w", err) + } + + if err := Pool.Ping(ctx); err != nil { + return fmt.Errorf("ping db: %w", 
err) + } + + return nil +} + +// Close closes the database connection +func Close() { + if Pool != nil { + Pool.Close() + } +} diff --git a/internal/database/database_test.go b/internal/database/database_test.go new file mode 100644 index 0000000..147daf1 --- /dev/null +++ b/internal/database/database_test.go @@ -0,0 +1,37 @@ +package database + +import ( + "testing" + + "github.com/company/ai-ops/internal/config" +) + +func TestInitAndCloseWithLocalPostgres(t *testing.T) { + ports := []int{15432, 5432} + var lastErr error + for _, port := range ports { + lastErr = Init(config.DatabaseConfig{Host: "localhost", Port: port, User: "aiops", Password: "aiops123", DBName: "ai_ops", SSLMode: "disable", PoolSize: 4}) + if lastErr == nil { + break + } + Close() + Pool = nil + } + if lastErr != nil { + t.Skipf("PostgreSQL integration database not available: %v", lastErr) + } + if Pool == nil { + t.Fatal("pool not initialized") + } + Close() + Pool = nil +} + +func TestInitReturnsErrorForInvalidConfig(t *testing.T) { + if err := Init(config.DatabaseConfig{Host: "::::bad-host::::", Port: 1, User: "u", Password: "p", DBName: "d", SSLMode: "disable", PoolSize: 1}); err == nil { + Close() + Pool = nil + t.Fatal("expected invalid db config error") + } + Pool = nil +} diff --git a/internal/domain/model/alert.go b/internal/domain/model/alert.go new file mode 100644 index 0000000..d24467f --- /dev/null +++ b/internal/domain/model/alert.go @@ -0,0 +1,50 @@ +package model + +import "time" + +// AlertRule is an alert rule +type AlertRule struct { + ID string `json:"id"` + Name string `json:"name"` + MetricSource string `json:"metric_source"` + MetricName string `json:"metric_name"` + ThresholdType string `json:"threshold_type"` + ThresholdValue string `json:"threshold_value"` + DurationMin int `json:"duration_min"` + Level string `json:"level"` + ChannelIDs []string `json:"channel_ids"` + HealingAction *string `json:"healing_action,omitempty"` + HealingConfig map[string]any `json:"healing_config,omitempty"` + 
IsSandboxed bool `json:"is_sandboxed"` + Enabled bool `json:"enabled"` + Version int `json:"version"` + CreatedBy string `json:"created_by"` + CreatedAt time.Time `json:"created_at"` + UpdatedAt time.Time `json:"updated_at"` +} + +// AlertEvent is an alert event +type AlertEvent struct { + ID string `json:"id"` + RuleID string `json:"rule_id"` + Level string `json:"level"` + ResourceType string `json:"resource_type"` + ResourceID string `json:"resource_id"` + CurrentValue string `json:"current_value"` + ThresholdValue string `json:"threshold_value"` + Status string `json:"status"` + IsAggregated bool `json:"is_aggregated"` + AggregatedCount int `json:"aggregated_count"` + ParentAlertID *string `json:"parent_alert_id,omitempty"` + StartedAt time.Time `json:"started_at"` + ResolvedAt *time.Time `json:"resolved_at,omitempty"` +} + +// AlertCount holds alert statistics +type AlertCount struct { + Open int `json:"open"` + P0 int `json:"p0"` + P1 int `json:"p1"` + P2 int `json:"p2"` + P3 int `json:"p3"` +} diff --git a/internal/domain/model/channel.go b/internal/domain/model/channel.go new file mode 100644 index 0000000..84ff609 --- /dev/null +++ b/internal/domain/model/channel.go @@ -0,0 +1,21 @@ +package model + +import "time" + +// NotificationChannel is a notification channel +type NotificationChannel struct { + ID string `json:"id"` + Name string `json:"name"` + ChannelType string `json:"channel_type"` + Config map[string]any `json:"config"` + Priority int `json:"priority"` + Enabled bool `json:"enabled"` + CreatedAt time.Time `json:"created_at"` +} + +// ChannelConfig is the channel configuration structure +type ChannelConfig struct { + WebhookURL string `json:"webhook_url,omitempty"` + EmailTo string `json:"email_to,omitempty"` + APIToken string `json:"api_token,omitempty"` +} diff --git a/internal/domain/model/log.go b/internal/domain/model/log.go new file mode 100644 index 0000000..a2dbcaf --- /dev/null +++ b/internal/domain/model/log.go @@ -0,0 +1,30 @@ +package model + +import "time" + +// RequestLog is a request log record +type RequestLog struct { + ID 
string `json:"id"` + Timestamp time.Time `json:"timestamp"` + Service string `json:"service"` + Path string `json:"path"` + StatusCode int `json:"status_code"` + LatencyMs float64 `json:"latency_ms"` + UserID string `json:"user_id"` + SupplierID string `json:"supplier_id"` + Method string `json:"method"` + ErrorCode string `json:"error_code,omitempty"` +} + +// LogQueryFilter is the log query filter +type LogQueryFilter struct { + StartTime *time.Time `json:"start_time,omitempty"` + EndTime *time.Time `json:"end_time,omitempty"` + Service string `json:"service,omitempty"` + Path string `json:"path,omitempty"` + StatusCode *int `json:"status_code,omitempty"` + UserID string `json:"user_id,omitempty"` + SupplierID string `json:"supplier_id,omitempty"` + Page int `json:"page"` + PageSize int `json:"page_size"` +} diff --git a/internal/domain/model/metric.go b/internal/domain/model/metric.go new file mode 100644 index 0000000..ac6a4e1 --- /dev/null +++ b/internal/domain/model/metric.go @@ -0,0 +1,37 @@ +package model + +import "time" + +// MetricPoint is a time-series data point +type MetricPoint struct { + Source string `json:"source"` + Name string `json:"name"` + Value float64 `json:"value"` + Tags map[string]string `json:"tags"` + Timestamp time.Time `json:"timestamp"` +} + +// MetricQueryRequest is a metric query request +type MetricQueryRequest struct { + Source string `json:"source"` + Name string `json:"name"` + StartTime time.Time `json:"start_time"` + EndTime time.Time `json:"end_time"` + Interval time.Duration `json:"interval"` + Tags map[string]string `json:"tags"` +} + +// RealtimeMetrics holds the real-time metrics for the home page +type RealtimeMetrics struct { + QPS float64 `json:"qps"` + AvgLatency float64 `json:"avg_latency_ms"` + P99Latency float64 `json:"p99_latency_ms"` + ErrorRate float64 `json:"error_rate"` +} + +// SupplierCount holds supplier statistics +type SupplierCount struct { + Total int `json:"total"` + Healthy int `json:"healthy"` + Unhealthy int `json:"unhealthy"` +} diff --git a/internal/domain/model/notification.go 
b/internal/domain/model/notification.go new file mode 100644 index 0000000..0634251 --- /dev/null +++ b/internal/domain/model/notification.go @@ -0,0 +1,16 @@ +package model + +import "time" + +// NotificationLog records the result of a single send to a notification channel. +type NotificationLog struct { + ID string `json:"id"` + EventID string `json:"event_id"` + ChannelID string `json:"channel_id"` + ChannelType string `json:"channel_type"` + Status string `json:"status"` + RetryCount int `json:"retry_count"` + ErrorMessage *string `json:"error_message,omitempty"` + SentAt *time.Time `json:"sent_at,omitempty"` + CreatedAt time.Time `json:"created_at"` +} diff --git a/internal/domain/repository/alert_repository.go b/internal/domain/repository/alert_repository.go new file mode 100644 index 0000000..4ec15ec --- /dev/null +++ b/internal/domain/repository/alert_repository.go @@ -0,0 +1,28 @@ +package repository + +import ( + "context" + "time" + + "github.com/company/ai-ops/internal/domain/model" +) + +// AlertRepository is the alert data storage interface +type AlertRepository interface { + // Alert statistics + GetOpenCount(ctx context.Context) (*model.AlertCount, error) + + // Rule CRUD + ListRules(ctx context.Context) ([]model.AlertRule, error) + GetRuleByID(ctx context.Context, id string) (*model.AlertRule, error) + CreateRule(ctx context.Context, rule *model.AlertRule) error + UpdateRule(ctx context.Context, rule *model.AlertRule) error + DeleteRule(ctx context.Context, id string) error + + // Alert events + ListEvents(ctx context.Context, status string, page, pageSize int) ([]model.AlertEvent, int, error) + CreateEvent(ctx context.Context, event *model.AlertEvent) error + CreateEventWithAggregation(ctx context.Context, event *model.AlertEvent, window time.Duration, threshold int) (*model.AlertEvent, error) + UpdateEventStatus(ctx context.Context, id, status string) error + EscalateEvent(ctx context.Context, id, newLevel string) error +} diff --git a/internal/domain/repository/channel_repository.go b/internal/domain/repository/channel_repository.go new 
file mode 100644 index 0000000..ef5775b --- /dev/null +++ b/internal/domain/repository/channel_repository.go @@ -0,0 +1,16 @@ +package repository + +import ( + "context" + + "github.com/company/ai-ops/internal/domain/model" +) + +// ChannelRepository 是通知渠道存储接口 +type ChannelRepository interface { + List(ctx context.Context) ([]model.NotificationChannel, error) + GetByID(ctx context.Context, id string) (*model.NotificationChannel, error) + Create(ctx context.Context, ch *model.NotificationChannel) error + Update(ctx context.Context, ch *model.NotificationChannel) error + Delete(ctx context.Context, id string) error +} diff --git a/internal/domain/repository/log_repository.go b/internal/domain/repository/log_repository.go new file mode 100644 index 0000000..be0ee30 --- /dev/null +++ b/internal/domain/repository/log_repository.go @@ -0,0 +1,13 @@ +package repository + +import ( + "context" + + "github.com/company/ai-ops/internal/domain/model" +) + +// LogRepository 是日志数据存储接口 +type LogRepository interface { + // Query 查询日志 + Query(ctx context.Context, filter model.LogQueryFilter) ([]model.RequestLog, int, error) +} diff --git a/internal/domain/repository/metric_repository.go b/internal/domain/repository/metric_repository.go new file mode 100644 index 0000000..eaed949 --- /dev/null +++ b/internal/domain/repository/metric_repository.go @@ -0,0 +1,17 @@ +package repository + +import ( + "context" + + "github.com/company/ai-ops/internal/domain/model" +) + +// MetricRepository 是指标数据存储接口 +type MetricRepository interface { + // GetRealtime 获取实时指标 + GetRealtime(ctx context.Context) (*model.RealtimeMetrics, error) + // Query 按条件查询指标 + Query(ctx context.Context, req model.MetricQueryRequest) ([]model.MetricPoint, error) + // GetLatest 获取最新指标值 + GetLatest(ctx context.Context, source, name string) (*model.MetricPoint, error) +} diff --git a/internal/domain/repository/notification_repository.go b/internal/domain/repository/notification_repository.go new file mode 100644 index 
0000000..4965a52 --- /dev/null +++ b/internal/domain/repository/notification_repository.go @@ -0,0 +1,14 @@ +package repository + +import ( + "context" + + "github.com/company/ai-ops/internal/domain/model" +) + +// NotificationLogRepository is the storage interface for notification send records. +type NotificationLogRepository interface { + CreateLog(ctx context.Context, log *model.NotificationLog) error + MarkSent(ctx context.Context, id string) error + MarkFailed(ctx context.Context, id string, retryCount int, errMessage string) error +} diff --git a/internal/handler/alert_handler.go b/internal/handler/alert_handler.go new file mode 100644 index 0000000..a08eaea --- /dev/null +++ b/internal/handler/alert_handler.go @@ -0,0 +1,43 @@ +package handler + +import ( + "net/http" + "strconv" + + "github.com/company/ai-ops/internal/domain/repository" + "github.com/company/ai-ops/pkg/errors" + "github.com/company/ai-ops/pkg/response" +) + +// AlertHandler is the HTTP handler for alert events +type AlertHandler struct { + repo repository.AlertRepository +} + +func NewAlertHandler(repo repository.AlertRepository) *AlertHandler { + return &AlertHandler{repo: repo} +} + +func (h *AlertHandler) RegisterRoutes(mux *http.ServeMux) { + mux.HandleFunc("GET /api/v1/ai-ops/alerts", h.ListAlerts) +} + +func (h *AlertHandler) ListAlerts(w http.ResponseWriter, r *http.Request) { + query := r.URL.Query() + status := query.Get("status") + page, _ := strconv.Atoi(query.Get("page")) + pageSize, _ := strconv.Atoi(query.Get("page_size")) + if page < 1 { + page = 1 + } + if pageSize < 1 || pageSize > 100 { + pageSize = 20 + } + + events, total, err := h.repo.ListEvents(r.Context(), status, page, pageSize) + if err != nil { + response.Error(w, errors.Wrap(err, errors.ErrInternal)) + return + } + response.Success(w, map[string]any{"items": events, "total": total, "page": page, "page_size": pageSize}) +} diff --git a/internal/handler/audit_handler.go b/internal/handler/audit_handler.go new file mode 100644 index 0000000..932c090 --- /dev/null +++ 
b/internal/handler/audit_handler.go @@ -0,0 +1,55 @@ +package handler + +import ( + "net/http" + "strconv" + + "github.com/company/ai-ops/internal/service" + "github.com/company/ai-ops/pkg/errors" + "github.com/company/ai-ops/pkg/response" +) + +// AuditHandler is the HTTP handler for audit logs +type AuditHandler struct { + service *service.AuditService +} + +func NewAuditHandler(s *service.AuditService) *AuditHandler { + return &AuditHandler{service: s} +} + +func (h *AuditHandler) RegisterRoutes(mux *http.ServeMux) { + mux.HandleFunc("GET /api/v1/ai-ops/audits", h.ListAudits) + mux.HandleFunc("POST /api/v1/ai-ops/audits/{id}/rollback", h.Rollback) +} + +func (h *AuditHandler) ListAudits(w http.ResponseWriter, r *http.Request) { + query := r.URL.Query() + objectType := query.Get("object_type") + objectID := query.Get("object_id") + page, _ := strconv.Atoi(query.Get("page")) + pageSize, _ := strconv.Atoi(query.Get("page_size")) + if page < 1 { + page = 1 + } + if pageSize < 1 || pageSize > 100 { + pageSize = 20 + } + + logs, total, err := h.service.List(r.Context(), objectType, objectID, page, pageSize) + if err != nil { + response.Error(w, errors.Wrap(err, errors.ErrInternal)) + return + } + response.Success(w, map[string]any{"items": logs, "total": total, "page": page, "page_size": pageSize}) +} + +func (h *AuditHandler) Rollback(w http.ResponseWriter, r *http.Request) { + id := r.PathValue("id") + log, err := h.service.Rollback(r.Context(), id) + if err != nil { + response.Error(w, errors.Wrap(err, errors.ErrBadRequest).WithDetail(map[string]any{"error": err.Error()})) + return + } + response.Success(w, log) +} diff --git a/internal/handler/auth_handler.go b/internal/handler/auth_handler.go new file mode 100644 index 0000000..b9e1baf --- /dev/null +++ b/internal/handler/auth_handler.go @@ -0,0 +1,59 @@ +package handler + +import ( + "net/http" + + "github.com/company/ai-ops/internal/service" + "github.com/company/ai-ops/pkg/errors" + "github.com/company/ai-ops/pkg/response" +) + 
+// AuthHandler is the HTTP handler for authentication +type AuthHandler struct { + authSvc *service.AuthService +} + +func NewAuthHandler(authSvc *service.AuthService) *AuthHandler { + return &AuthHandler{authSvc: authSvc} +} + +func (h *AuthHandler) RegisterRoutes(mux *http.ServeMux) { + mux.HandleFunc("POST /api/v1/ai-ops/login", h.Login) +} + +func (h *AuthHandler) Login(w http.ResponseWriter, r *http.Request) { + var req struct { + Username string `json:"username"` + Password string `json:"password"` + } + if err := decodeJSON(r, &req); err != nil { + response.Error(w, errors.ErrBadRequest.WithDetail(map[string]any{"error": err.Error()})) + return + } + + // TODO: implement real user verification (currently a simplified implementation) + if req.Username == "" || req.Password == "" { + response.Error(w, errors.ErrBadRequest.WithDetail(map[string]any{"error": "username and password required"})) + return + } + + // Default role is viewer + role := "viewer" + if req.Username == "admin" { + role = "admin" + } else if req.Username == "ops" { + role = "operator" + } + + token, err := h.authSvc.IssueToken(req.Username, role) + if err != nil { + response.Error(w, errors.Wrap(err, errors.ErrInternal)) + return + } + + response.Success(w, map[string]any{ + "token": token, + "expires_in": 28800, + "role": role, + }) +} diff --git a/internal/handler/channel_handler.go b/internal/handler/channel_handler.go new file mode 100644 index 0000000..fece94f --- /dev/null +++ b/internal/handler/channel_handler.go @@ -0,0 +1,97 @@ +package handler + +import ( + "net/http" + + "github.com/company/ai-ops/internal/domain/model" + "github.com/company/ai-ops/internal/service" + "github.com/company/ai-ops/pkg/errors" + "github.com/company/ai-ops/pkg/response" +) + +// ChannelHandler is the HTTP handler for notification channels +type ChannelHandler struct { + service *service.ChannelService +} + +func NewChannelHandler(s *service.ChannelService) *ChannelHandler { + return &ChannelHandler{service: s} +} + +func (h *ChannelHandler) RegisterRoutes(mux *http.ServeMux) { + mux.HandleFunc("GET 
/api/v1/ai-ops/channels", h.ListChannels) + mux.HandleFunc("GET /api/v1/ai-ops/channels/{id}", h.GetChannel) + mux.HandleFunc("POST /api/v1/ai-ops/channels", h.CreateChannel) + mux.HandleFunc("PUT /api/v1/ai-ops/channels/{id}", h.UpdateChannel) + mux.HandleFunc("DELETE /api/v1/ai-ops/channels/{id}", h.DeleteChannel) + mux.HandleFunc("POST /api/v1/ai-ops/channels/test", h.TestChannel) +} + +func (h *ChannelHandler) ListChannels(w http.ResponseWriter, r *http.Request) { + channels, err := h.service.List(r.Context()) + if err != nil { + response.Error(w, errors.Wrap(err, errors.ErrInternal)) + return + } + response.Success(w, channels) +} + +func (h *ChannelHandler) GetChannel(w http.ResponseWriter, r *http.Request) { + id := r.PathValue("id") + ch, err := h.service.Get(r.Context(), id) + if err != nil { + response.Error(w, errors.Wrap(err, errors.ErrNotFound)) + return + } + response.Success(w, ch) +} + +func (h *ChannelHandler) CreateChannel(w http.ResponseWriter, r *http.Request) { + var ch model.NotificationChannel + if err := decodeJSON(r, &ch); err != nil { + response.Error(w, errors.ErrBadRequest.WithDetail(map[string]any{"error": err.Error()})) + return + } + if err := h.service.Create(r.Context(), &ch); err != nil { + response.Error(w, errors.Wrap(err, errors.ErrBadRequest)) + return + } + w.WriteHeader(http.StatusCreated) + response.Success(w, ch) +} + +func (h *ChannelHandler) UpdateChannel(w http.ResponseWriter, r *http.Request) { + id := r.PathValue("id") + var ch model.NotificationChannel + if err := decodeJSON(r, &ch); err != nil { + response.Error(w, errors.ErrBadRequest.WithDetail(map[string]any{"error": err.Error()})) + return + } + ch.ID = id + if err := h.service.Update(r.Context(), &ch); err != nil { + response.Error(w, errors.Wrap(err, errors.ErrBadRequest)) + return + } + response.Success(w, ch) +} + +func (h *ChannelHandler) DeleteChannel(w http.ResponseWriter, r *http.Request) { + id := r.PathValue("id") + if err := 
h.service.Delete(r.Context(), id); err != nil { + response.Error(w, errors.Wrap(err, errors.ErrInternal)) + return + } + w.WriteHeader(http.StatusNoContent) +} + +func (h *ChannelHandler) TestChannel(w http.ResponseWriter, r *http.Request) { + var req struct { + ChannelID string `json:"channel_id"` + Message string `json:"message"` + } + if err := decodeJSON(r, &req); err != nil { + response.Error(w, errors.ErrBadRequest.WithDetail(map[string]any{"error": err.Error()})) + return + } + response.Success(w, map[string]any{"ok": true}) +} diff --git a/internal/handler/core_handlers_test.go b/internal/handler/core_handlers_test.go new file mode 100644 index 0000000..132017c --- /dev/null +++ b/internal/handler/core_handlers_test.go @@ -0,0 +1,391 @@ +package handler + +import ( + "context" + "crypto/rand" + "encoding/hex" + "errors" + "net/http" + "net/http/httptest" + "os" + "path/filepath" + "sort" + "strings" + "testing" + "time" + + "github.com/company/ai-ops/internal/config" + "github.com/company/ai-ops/internal/database" + "github.com/company/ai-ops/internal/domain/model" + "github.com/company/ai-ops/internal/service" +) + +type fakeHandlerAlertRepo struct { + rules []model.AlertRule + events []model.AlertEvent + err error +} + +func (r *fakeHandlerAlertRepo) GetOpenCount(context.Context) (*model.AlertCount, error) { + return &model.AlertCount{}, r.err +} +func (r *fakeHandlerAlertRepo) ListRules(context.Context) ([]model.AlertRule, error) { + return r.rules, r.err +} +func (r *fakeHandlerAlertRepo) GetRuleByID(_ context.Context, id string) (*model.AlertRule, error) { + if r.err != nil { + return nil, r.err + } + return &model.AlertRule{ID: id, Name: "rule"}, nil +} +func (r *fakeHandlerAlertRepo) CreateRule(context.Context, *model.AlertRule) error { return r.err } +func (r *fakeHandlerAlertRepo) UpdateRule(context.Context, *model.AlertRule) error { return r.err } +func (r *fakeHandlerAlertRepo) DeleteRule(context.Context, string) error { return r.err } +func (r 
*fakeHandlerAlertRepo) ListEvents(context.Context, string, int, int) ([]model.AlertEvent, int, error) { + return r.events, len(r.events), r.err +} +func (r *fakeHandlerAlertRepo) CreateEvent(context.Context, *model.AlertEvent) error { return r.err } +func (r *fakeHandlerAlertRepo) CreateEventWithAggregation(_ context.Context, e *model.AlertEvent, _ time.Duration, _ int) (*model.AlertEvent, error) { + return e, r.err +} +func (r *fakeHandlerAlertRepo) UpdateEventStatus(context.Context, string, string) error { return r.err } +func (r *fakeHandlerAlertRepo) EscalateEvent(context.Context, string, string) error { return r.err } + +type fakeHandlerChannelRepo struct { + channels []model.NotificationChannel + err error +} + +func (r *fakeHandlerChannelRepo) List(context.Context) ([]model.NotificationChannel, error) { + return r.channels, r.err +} +func (r *fakeHandlerChannelRepo) GetByID(_ context.Context, id string) (*model.NotificationChannel, error) { + if r.err != nil { + return nil, r.err + } + return &model.NotificationChannel{ID: id, Name: "hook"}, nil +} +func (r *fakeHandlerChannelRepo) Create(context.Context, *model.NotificationChannel) error { + return r.err +} +func (r *fakeHandlerChannelRepo) Update(context.Context, *model.NotificationChannel) error { + return r.err +} +func (r *fakeHandlerChannelRepo) Delete(context.Context, string) error { return r.err } + +type fakeHandlerLogRepo struct { + logs []model.RequestLog + total int + err error +} + +func (r *fakeHandlerLogRepo) Query(context.Context, model.LogQueryFilter) ([]model.RequestLog, int, error) { + return r.logs, r.total, r.err +} + +func TestAuthHandlerLoginRolesAndValidation(t *testing.T) { + h := NewAuthHandler(service.NewAuthService("secret")) + + cases := []struct{ username, wantRole string }{{"admin", "admin"}, {"ops", "operator"}, {"alice", "viewer"}} + for _, tc := range cases { + w := httptest.NewRecorder() + h.Login(w, httptest.NewRequest(http.MethodPost, "/api/v1/ai-ops/login", 
strings.NewReader(`{"username":"`+tc.username+`","password":"pw"}`))) + if w.Code != http.StatusOK || !strings.Contains(w.Body.String(), `"role":"`+tc.wantRole+`"`) || !strings.Contains(w.Body.String(), `"token"`) { + t.Fatalf("login %s failed: status=%d body=%s", tc.username, w.Code, w.Body.String()) + } + } + + w := httptest.NewRecorder() + h.Login(w, httptest.NewRequest(http.MethodPost, "/api/v1/ai-ops/login", strings.NewReader(`{"username":"","password":""}`))) + if w.Code != http.StatusBadRequest { + t.Fatalf("invalid login status = %d", w.Code) + } + + bad := httptest.NewRecorder() + h.Login(bad, httptest.NewRequest(http.MethodPost, "/api/v1/ai-ops/login", strings.NewReader(`{`))) + if bad.Code != http.StatusBadRequest { + t.Fatalf("bad json status = %d", bad.Code) + } +} + +func TestHealthAndDashboardHandlers(t *testing.T) { + health := NewHealthHandler() + w := httptest.NewRecorder() + health.Health(w, httptest.NewRequest(http.MethodGet, "/actuator/health", nil)) + if w.Code != http.StatusOK || !strings.Contains(w.Body.String(), `"status":"UP"`) { + t.Fatalf("health = %d %s", w.Code, w.Body.String()) + } + + live := httptest.NewRecorder() + health.Live(live, httptest.NewRequest(http.MethodGet, "/actuator/health/live", nil)) + if live.Code != http.StatusOK { + t.Fatalf("live = %d", live.Code) + } + + ready := httptest.NewRecorder() + health.Ready(ready, httptest.NewRequest(http.MethodGet, "/actuator/health/ready", nil)) + if ready.Code != http.StatusServiceUnavailable || !strings.Contains(ready.Body.String(), `"status":"DOWN"`) { + t.Fatalf("ready = %d %s", ready.Code, ready.Body.String()) + } + + dash := httptest.NewRecorder() + NewDashboardHandler().Dashboard(dash, httptest.NewRequest(http.MethodGet, "/ops/dashboard", nil)) + if dash.Code != http.StatusOK || !strings.Contains(dash.Body.String(), "AI-Ops 运维看板") { + t.Fatalf("dashboard = %d", dash.Code) + } +} + +func TestRuleHandlerCRUDHappyAndErrorPaths(t *testing.T) { + repo := 
&fakeHandlerAlertRepo{rules: []model.AlertRule{{ID: "r1", Name: "rule"}}} + h := NewRuleHandler(service.NewRuleService(repo)) + mux := http.NewServeMux() + h.RegisterRoutes(mux) + + for _, tc := range []struct { + method, path string + body string + want int + }{ + {http.MethodGet, "/api/v1/ai-ops/rules", "", http.StatusOK}, + {http.MethodGet, "/api/v1/ai-ops/rules/r1", "", http.StatusOK}, + {http.MethodPost, "/api/v1/ai-ops/rules", `{"id":"r2","name":"latency","metric_name":"p99"}`, http.StatusCreated}, + {http.MethodPut, "/api/v1/ai-ops/rules/r2", `{"name":"latency","metric_name":"p99"}`, http.StatusOK}, + {http.MethodDelete, "/api/v1/ai-ops/rules/r2", "", http.StatusNoContent}, + {http.MethodPost, "/api/v1/ai-ops/rules", `{`, http.StatusBadRequest}, + {http.MethodPost, "/api/v1/ai-ops/rules", `{}`, http.StatusBadRequest}, + } { + w := httptest.NewRecorder() + mux.ServeHTTP(w, httptest.NewRequest(tc.method, tc.path, strings.NewReader(tc.body))) + if w.Code != tc.want { + t.Fatalf("%s %s status=%d want=%d body=%s", tc.method, tc.path, w.Code, tc.want, w.Body.String()) + } + } + + errHandler := NewRuleHandler(service.NewRuleService(&fakeHandlerAlertRepo{err: errors.New("db")})) + errMux := http.NewServeMux() + errHandler.RegisterRoutes(errMux) + w := httptest.NewRecorder() + errMux.ServeHTTP(w, httptest.NewRequest(http.MethodGet, "/api/v1/ai-ops/rules", nil)) + if w.Code != http.StatusInternalServerError { + t.Fatalf("error list status = %d", w.Code) + } +} + +func TestChannelHandlerCRUDHappyAndErrorPaths(t *testing.T) { + h := NewChannelHandler(service.NewChannelService(&fakeHandlerChannelRepo{channels: []model.NotificationChannel{{ID: "c1", Name: "hook"}}})) + mux := http.NewServeMux() + h.RegisterRoutes(mux) + + for _, tc := range []struct { + method, path, body string + want int + }{ + {http.MethodGet, "/api/v1/ai-ops/channels", "", http.StatusOK}, + {http.MethodGet, "/api/v1/ai-ops/channels/c1", "", http.StatusOK}, + {http.MethodPost, 
"/api/v1/ai-ops/channels", `{"name":"hook","channel_type":"webhook"}`, http.StatusCreated}, + {http.MethodPut, "/api/v1/ai-ops/channels/c1", `{"name":"hook","channel_type":"webhook"}`, http.StatusOK}, + {http.MethodDelete, "/api/v1/ai-ops/channels/c1", "", http.StatusNoContent}, + {http.MethodPost, "/api/v1/ai-ops/channels/test", `{"channel_id":"c1","message":"hello"}`, http.StatusOK}, + {http.MethodPost, "/api/v1/ai-ops/channels", `{}`, http.StatusBadRequest}, + {http.MethodPost, "/api/v1/ai-ops/channels/test", `{`, http.StatusBadRequest}, + } { + w := httptest.NewRecorder() + mux.ServeHTTP(w, httptest.NewRequest(tc.method, tc.path, strings.NewReader(tc.body))) + if w.Code != tc.want { + t.Fatalf("%s %s status=%d want=%d body=%s", tc.method, tc.path, w.Code, tc.want, w.Body.String()) + } + } +} + +func TestAlertAndLogHandlers(t *testing.T) { + alertHandler := NewAlertHandler(&fakeHandlerAlertRepo{events: []model.AlertEvent{{ID: "e1", Status: "triggered"}}}) + alertMux := http.NewServeMux() + alertHandler.RegisterRoutes(alertMux) + aw := httptest.NewRecorder() + alertMux.ServeHTTP(aw, httptest.NewRequest(http.MethodGet, "/api/v1/ai-ops/alerts?status=triggered&page=2&page_size=5", nil)) + if aw.Code != http.StatusOK || !strings.Contains(aw.Body.String(), `"items"`) { + t.Fatalf("alerts = %d %s", aw.Code, aw.Body.String()) + } + + logHandler := NewLogHandler(service.NewLogService(&fakeHandlerLogRepo{logs: []model.RequestLog{{Timestamp: time.Date(2026, 1, 2, 3, 4, 5, 0, time.UTC), Service: "api", Path: "/v1", Method: "GET", StatusCode: 200}}, total: 1})) + logMux := http.NewServeMux() + logHandler.RegisterRoutes(logMux) + lw := httptest.NewRecorder() + logMux.ServeHTTP(lw, httptest.NewRequest(http.MethodGet, "/api/v1/ai-ops/logs?page=2&page_size=5&status_code=200", nil)) + if lw.Code != http.StatusOK || !strings.Contains(lw.Body.String(), `"total_pages"`) { + t.Fatalf("logs = %d %s", lw.Code, lw.Body.String()) + } + + csv := httptest.NewRecorder() + 
logMux.ServeHTTP(csv, httptest.NewRequest(http.MethodGet, "/api/v1/ai-ops/logs/export", nil)) + if csv.Code != http.StatusOK || !strings.Contains(csv.Body.String(), "时间,服务名") { + t.Fatalf("csv = %d %s", csv.Code, csv.Body.String()) + } + + badCSV := httptest.NewRecorder() + badLogHandler := NewLogHandler(service.NewLogService(&fakeHandlerLogRepo{err: errors.New("export failed")})) + badLogMux := http.NewServeMux() + badLogHandler.RegisterRoutes(badLogMux) + badLogMux.ServeHTTP(badCSV, httptest.NewRequest(http.MethodGet, "/api/v1/ai-ops/logs/export?start=bad&end=bad&status_code=bad", nil)) + if badCSV.Code != http.StatusInternalServerError { + t.Fatalf("csv error = %d %s", badCSV.Code, badCSV.Body.String()) + } +} + +type fakeHandlerMetricRepo struct { + realtime *model.RealtimeMetrics + points []model.MetricPoint + err error +} + +func (r *fakeHandlerMetricRepo) GetRealtime(context.Context) (*model.RealtimeMetrics, error) { + if r.err != nil { + return nil, r.err + } + return r.realtime, nil +} +func (r *fakeHandlerMetricRepo) Query(context.Context, model.MetricQueryRequest) ([]model.MetricPoint, error) { + if r.err != nil { + return nil, r.err + } + return r.points, nil +} +func (r *fakeHandlerMetricRepo) GetLatest(context.Context, string, string) (*model.MetricPoint, error) { + if r.err != nil { + return nil, r.err + } + return &model.MetricPoint{Value: 1}, nil +} + +func TestRegisterRoutesForSmallHandlers(t *testing.T) { + mux := http.NewServeMux() + NewAuthHandler(service.NewAuthService("secret")).RegisterRoutes(mux) + NewDashboardHandler().RegisterRoutes(mux) + NewHealthHandler().RegisterRoutes(mux) + NewHealingHandler().RegisterRoutes(mux) + + w := httptest.NewRecorder() + mux.ServeHTTP(w, httptest.NewRequest(http.MethodGet, "/api/v1/ai-ops/healings", nil)) + if w.Code != http.StatusOK || !strings.Contains(w.Body.String(), `"total":0`) { + t.Fatalf("healings = %d %s", w.Code, w.Body.String()) + } + + one := httptest.NewRecorder() + mux.ServeHTTP(one, 
httptest.NewRequest(http.MethodGet, "/api/v1/ai-ops/healings/h1", nil)) + if one.Code != http.StatusOK || !strings.Contains(one.Body.String(), `"id":"h1"`) { + t.Fatalf("healing = %d %s", one.Code, one.Body.String()) + } +} + +func TestMetricHandlerRoutesAndErrors(t *testing.T) { + metricRepo := &fakeHandlerMetricRepo{realtime: &model.RealtimeMetrics{QPS: 9}, points: []model.MetricPoint{{Name: "qps", Value: 1}}} + alertRepo := &fakeHandlerAlertRepo{} + h := NewMetricHandler(service.NewMetricService(metricRepo, alertRepo)) + mux := http.NewServeMux() + h.RegisterRoutes(mux) + + for _, path := range []string{ + "/api/v1/ai-ops/metrics/realtime", + "/api/v1/ai-ops/metrics/suppliers/count", + "/api/v1/ai-ops/alerts/open/count", + "/api/v1/ai-ops/metrics/query?source=prom&name=qps&start=2026-01-01T00:00:00Z&end=2026-01-01T01:00:00Z", + } { + w := httptest.NewRecorder() + mux.ServeHTTP(w, httptest.NewRequest(http.MethodGet, path, nil)) + if w.Code != http.StatusOK { + t.Fatalf("%s status=%d body=%s", path, w.Code, w.Body.String()) + } + } + + errHandler := NewMetricHandler(service.NewMetricService(&fakeHandlerMetricRepo{err: errors.New("metrics down")}, &fakeHandlerAlertRepo{err: errors.New("alerts down")})) + errMux := http.NewServeMux() + errHandler.RegisterRoutes(errMux) + for _, path := range []string{"/api/v1/ai-ops/metrics/realtime", "/api/v1/ai-ops/metrics/query", "/api/v1/ai-ops/alerts/open/count"} { + w := httptest.NewRecorder() + errMux.ServeHTTP(w, httptest.NewRequest(http.MethodGet, path, nil)) + if w.Code != http.StatusInternalServerError { + t.Fatalf("%s status=%d", path, w.Code) + } + } +} + +func setupHandlerAuditDB(t *testing.T) context.Context { + t.Helper() + ctx := context.Background() + if database.Pool == nil { + ports := []int{15432, 5432} + var lastErr error + for _, port := range ports { + lastErr = database.Init(config.DatabaseConfig{Host: "localhost", Port: port, User: "aiops", Password: "aiops123", DBName: "ai_ops", SSLMode: "disable", 
PoolSize: 4}) + if lastErr == nil { + break + } + database.Close() + database.Pool = nil + } + if lastErr != nil { + t.Skipf("PostgreSQL integration database not available: %v", lastErr) + } + } + if _, err := database.Pool.Exec(ctx, `SELECT pg_advisory_lock(424242001)`); err != nil { + t.Fatal(err) + } + defer database.Pool.Exec(ctx, `SELECT pg_advisory_unlock(424242001)`) + files, err := filepath.Glob(filepath.Join("..", "..", "tech", "migrations", "*.up.sql")) + if err != nil { + t.Fatal(err) + } + sort.Strings(files) + for _, f := range files { + b, err := os.ReadFile(f) + if err != nil { + t.Fatal(err) + } + if _, err := database.Pool.Exec(ctx, string(b)); err != nil { + t.Fatalf("apply migration %s: %v", f, err) + } + } + return ctx +} + +func handlerUUID(t *testing.T) string { + t.Helper() + b := make([]byte, 16) + if _, err := rand.Read(b); err != nil { + t.Fatal(err) + } + b[6] = (b[6] & 0x0f) | 0x40 + b[8] = (b[8] & 0x3f) | 0x80 + return hex.EncodeToString(b[0:4]) + "-" + hex.EncodeToString(b[4:6]) + "-" + hex.EncodeToString(b[6:8]) + "-" + hex.EncodeToString(b[8:10]) + "-" + hex.EncodeToString(b[10:16]) +} + +func TestAuditHandlerListAndRollback(t *testing.T) { + ctx := setupHandlerAuditDB(t) + svc := service.NewAuditService() + id := handlerUUID(t) + defer database.Pool.Exec(ctx, `DELETE FROM ai_ops_audits WHERE id=$1 OR parent_audit_id=$1 OR object_id=$1`, id) + if err := svc.Record(ctx, &service.AuditLog{ID: id, TenantID: "tenant", ObjectType: "rule", ObjectID: id, Action: "update", BeforeState: map[string]any{"enabled": false}, AfterState: map[string]any{"enabled": true}, RequestID: "req", ResultCode: "SUCCESS", SourceIP: "127.0.0.1", ActorID: "actor", RiskLevel: "normal"}); err != nil { + t.Fatal(err) + } + h := NewAuditHandler(svc) + mux := http.NewServeMux() + h.RegisterRoutes(mux) + + list := httptest.NewRecorder() + mux.ServeHTTP(list, httptest.NewRequest(http.MethodGet, 
"/api/v1/ai-ops/audits?object_type=rule&object_id="+id+"&page=0&page_size=999", nil)) + if list.Code != http.StatusOK || !strings.Contains(list.Body.String(), id) { + t.Fatalf("list audits = %d %s", list.Code, list.Body.String()) + } + + rollback := httptest.NewRecorder() + mux.ServeHTTP(rollback, httptest.NewRequest(http.MethodPost, "/api/v1/ai-ops/audits/"+id+"/rollback", nil)) + if rollback.Code != http.StatusOK || !strings.Contains(rollback.Body.String(), `"action":"rollback"`) { + t.Fatalf("rollback = %d %s", rollback.Code, rollback.Body.String()) + } + + missing := httptest.NewRecorder() + mux.ServeHTTP(missing, httptest.NewRequest(http.MethodPost, "/api/v1/ai-ops/audits/"+handlerUUID(t)+"/rollback", nil)) + if missing.Code != http.StatusBadRequest { + t.Fatalf("missing rollback status = %d", missing.Code) + } +} diff --git a/internal/handler/dashboard_handler.go b/internal/handler/dashboard_handler.go new file mode 100644 index 0000000..9e32812 --- /dev/null +++ b/internal/handler/dashboard_handler.go @@ -0,0 +1,118 @@ +package handler + +import ( + "html/template" + "net/http" +) + +// DashboardHandler is the route handler for the front-end pages +type DashboardHandler struct { + templates *template.Template +} + +func NewDashboardHandler() *DashboardHandler { + tmpl := template.Must(template.New("dashboard").Parse(dashboardHTML)) + return &DashboardHandler{templates: tmpl} +} + +// RegisterRoutes registers the page routes +func (h *DashboardHandler) RegisterRoutes(mux *http.ServeMux) { + mux.HandleFunc("GET /ops/dashboard", h.Dashboard) + mux.HandleFunc("GET /ops/dashboard/logs", h.Dashboard) + mux.HandleFunc("GET /ops/dashboard/rules", h.Dashboard) + mux.HandleFunc("GET /ops/dashboard/alerts", h.Dashboard) + mux.HandleFunc("GET /ops/dashboard/channels", h.Dashboard) +} + +// Dashboard serves the dashboard home page +func (h *DashboardHandler) Dashboard(w http.ResponseWriter, r *http.Request) { + w.Header().Set("Content-Type", "text/html; charset=utf-8") + _ = h.templates.ExecuteTemplate(w, "dashboard", nil) +} + +const dashboardHTML = 
` + + + + + AI-Ops Operations Dashboard + + + +

AI-Ops Operations Dashboard · Rules / Events / Channels / Logs

QPS

Average latency

P99

Error rate

Alert events

Alert rules

Notification channels

Logs

+ + + +` diff --git a/internal/handler/healing_handler.go b/internal/handler/healing_handler.go new file mode 100644 index 0000000..31dfdf0 --- /dev/null +++ b/internal/handler/healing_handler.go @@ -0,0 +1,29 @@ +package handler + +import ( + "net/http" + + "github.com/company/ai-ops/pkg/response" +) + +// HealingHandler is the HTTP handler for self-healing management +type HealingHandler struct{} + +func NewHealingHandler() *HealingHandler { + return &HealingHandler{} +} + +func (h *HealingHandler) RegisterRoutes(mux *http.ServeMux) { + mux.HandleFunc("GET /api/v1/ai-ops/healings", h.ListHealings) + mux.HandleFunc("GET /api/v1/ai-ops/healings/{id}", h.GetHealing) +} + +func (h *HealingHandler) ListHealings(w http.ResponseWriter, r *http.Request) { + // TODO: implement the list query + response.Success(w, map[string]any{"items": []any{}, "total": 0}) +} + +func (h *HealingHandler) GetHealing(w http.ResponseWriter, r *http.Request) { + id := r.PathValue("id") + response.Success(w, map[string]any{"id": id, "status": "pending"}) +} diff --git a/internal/handler/health_handler.go b/internal/handler/health_handler.go new file mode 100644 index 0000000..285655f --- /dev/null +++ b/internal/handler/health_handler.go @@ -0,0 +1,62 @@ +package handler + +import ( + "net/http" + + "github.com/company/ai-ops/internal/database" + "github.com/company/ai-ops/internal/redis" + "github.com/company/ai-ops/pkg/response" +) + +// HealthHandler is the HTTP handler for health checks +type HealthHandler struct{} + +func NewHealthHandler() *HealthHandler { + return &HealthHandler{} +} + +func (h *HealthHandler) RegisterRoutes(mux *http.ServeMux) { + mux.HandleFunc("GET /actuator/health", h.Health) + mux.HandleFunc("GET /actuator/health/live", h.Live) + mux.HandleFunc("GET /actuator/health/ready", h.Ready) +} + +func (h *HealthHandler) Health(w http.ResponseWriter, r *http.Request) { + response.Success(w, map[string]any{ + "status": "UP", + "components": map[string]any{ + "self": map[string]any{"status": "UP"}, + }, + }) +} + +func (h *HealthHandler) Live(w 
http.ResponseWriter, r *http.Request) { + response.Success(w, map[string]any{"status": "UP"}) +} + +func (h *HealthHandler) Ready(w http.ResponseWriter, r *http.Request) { + status := "UP" + components := map[string]any{ + "self": map[string]any{"status": "UP"}, + } + + // Check that the DB pool has been initialized + if database.Pool == nil { + status = "DOWN" + components["database"] = map[string]any{"status": "DOWN", "detail": "not initialized"} + } else { + components["database"] = map[string]any{"status": "UP"} + } + + // Check that the Redis client has been initialized (a missing client is reported but does not change the overall status) + if redis.Client == nil { + components["redis"] = map[string]any{"status": "DOWN", "detail": "not initialized"} + } else { + components["redis"] = map[string]any{"status": "UP"} + } + + if status == "DOWN" { + w.WriteHeader(http.StatusServiceUnavailable) + } + response.Success(w, map[string]any{"status": status, "components": components}) +} diff --git a/internal/handler/log_handler.go b/internal/handler/log_handler.go new file mode 100644 index 0000000..4896a24 --- /dev/null +++ b/internal/handler/log_handler.go @@ -0,0 +1,109 @@ +package handler + +import ( + "net/http" + "strconv" + "time" + + "github.com/company/ai-ops/internal/domain/model" + "github.com/company/ai-ops/internal/service" + "github.com/company/ai-ops/pkg/errors" + "github.com/company/ai-ops/pkg/response" +) + +// LogHandler is the HTTP handler for logs +type LogHandler struct { + service *service.LogService +} + +func NewLogHandler(s *service.LogService) *LogHandler { + return &LogHandler{service: s} +} + +// RegisterRoutes registers the log-related routes +func (h *LogHandler) RegisterRoutes(mux *http.ServeMux) { + mux.HandleFunc("GET /api/v1/ai-ops/logs", h.QueryLogs) + mux.HandleFunc("GET /api/v1/ai-ops/logs/export", h.ExportLogs) +} + +// QueryLogs handles paginated log queries +func (h *LogHandler) QueryLogs(w http.ResponseWriter, r *http.Request) { + query := r.URL.Query() + + filter := model.LogQueryFilter{ + Service: query.Get("service"), + Path: query.Get("path"), + UserID: query.Get("user_id"), + SupplierID: query.Get("supplier_id"), + } + + if 
startStr := query.Get("start"); startStr != "" { + if t, err := time.Parse(time.RFC3339, startStr); err == nil { + filter.StartTime = &t + } + } + if endStr := query.Get("end"); endStr != "" { + if t, err := time.Parse(time.RFC3339, endStr); err == nil { + filter.EndTime = &t + } + } + if codeStr := query.Get("status_code"); codeStr != "" { + if code, err := strconv.Atoi(codeStr); err == nil { + filter.StatusCode = &code + } + } + if page, err := strconv.Atoi(query.Get("page")); err == nil && page > 0 { + filter.Page = page + } else { + filter.Page = 1 + } + if pageSize, err := strconv.Atoi(query.Get("page_size")); err == nil && pageSize > 0 && pageSize <= 100 { + filter.PageSize = pageSize + } else { + filter.PageSize = 20 + } + + logs, total, err := h.service.QueryLogs(r.Context(), filter) + if err != nil { + response.Error(w, errors.Wrap(err, errors.ErrInternal)) + return + } + + response.PaginatedResponse(w, logs, total, filter.Page, filter.PageSize) +} + +// ExportLogs exports logs as CSV +func (h *LogHandler) ExportLogs(w http.ResponseWriter, r *http.Request) { + query := r.URL.Query() + + filter := model.LogQueryFilter{ + Service: query.Get("service"), + Path: query.Get("path"), + UserID: query.Get("user_id"), + SupplierID: query.Get("supplier_id"), + } + + if startStr := query.Get("start"); startStr != "" { + if t, err := time.Parse(time.RFC3339, startStr); err == nil { + filter.StartTime = &t + } + } + if endStr := query.Get("end"); endStr != "" { + if t, err := time.Parse(time.RFC3339, endStr); err == nil { + filter.EndTime = &t + } + } + if codeStr := query.Get("status_code"); codeStr != "" { + if code, err := strconv.Atoi(codeStr); err == nil { + filter.StatusCode = &code + } + } + + w.Header().Set("Content-Type", "text/csv; charset=utf-8") + w.Header().Set("Content-Disposition", "attachment; filename=logs_"+time.Now().Format("20060102_150405")+".csv") + + if err := h.service.ExportLogsCSV(r.Context(), filter, w); err != nil { + response.Error(w, errors.Wrap(err, 
errors.ErrInternal)) + return + } +} diff --git a/internal/handler/metric_handler.go b/internal/handler/metric_handler.go new file mode 100644 index 0000000..9a7e98d --- /dev/null +++ b/internal/handler/metric_handler.go @@ -0,0 +1,86 @@ +package handler + +import ( + "net/http" + "time" + + "github.com/company/ai-ops/internal/domain/model" + "github.com/company/ai-ops/internal/service" + "github.com/company/ai-ops/pkg/errors" + "github.com/company/ai-ops/pkg/response" +) + +// MetricHandler is the HTTP handler for metrics +type MetricHandler struct { + service *service.MetricService +} + +func NewMetricHandler(s *service.MetricService) *MetricHandler { + return &MetricHandler{service: s} +} + +// RegisterRoutes registers the metric-related routes +func (h *MetricHandler) RegisterRoutes(mux *http.ServeMux) { + mux.HandleFunc("GET /api/v1/ai-ops/metrics/realtime", h.GetRealtime) + mux.HandleFunc("GET /api/v1/ai-ops/metrics/suppliers/count", h.GetSupplierCount) + mux.HandleFunc("GET /api/v1/ai-ops/alerts/open/count", h.GetOpenAlertCount) + mux.HandleFunc("GET /api/v1/ai-ops/metrics/query", h.QueryMetrics) +} + +// GetRealtime returns the real-time metrics +func (h *MetricHandler) GetRealtime(w http.ResponseWriter, r *http.Request) { + metrics, err := h.service.GetRealtimeMetrics(r.Context()) + if err != nil { + response.Error(w, errors.Wrap(err, errors.ErrInternal)) + return + } + response.Success(w, metrics) +} + +// GetSupplierCount returns the number of active suppliers +func (h *MetricHandler) GetSupplierCount(w http.ResponseWriter, r *http.Request) { + count, err := h.service.GetSupplierCount(r.Context()) + if err != nil { + response.Error(w, errors.Wrap(err, errors.ErrInternal)) + return + } + response.Success(w, count) +} + +// GetOpenAlertCount returns the number of open (unresolved) alerts +func (h *MetricHandler) GetOpenAlertCount(w http.ResponseWriter, r *http.Request) { + count, err := h.service.GetOpenAlertCount(r.Context()) + if err != nil { + response.Error(w, errors.Wrap(err, errors.ErrInternal)) + return + } + response.Success(w, count) +} + +// QueryMetrics runs a drill-down metric query +func (h 
*MetricHandler) QueryMetrics(w http.ResponseWriter, r *http.Request) { + query := r.URL.Query() + + req := model.MetricQueryRequest{ + Source: query.Get("source"), + Name: query.Get("name"), + } + + if startStr := query.Get("start"); startStr != "" { + if t, err := time.Parse(time.RFC3339, startStr); err == nil { + req.StartTime = t + } + } + if endStr := query.Get("end"); endStr != "" { + if t, err := time.Parse(time.RFC3339, endStr); err == nil { + req.EndTime = t + } + } + + points, err := h.service.QueryMetrics(r.Context(), req) + if err != nil { + response.Error(w, errors.Wrap(err, errors.ErrInternal)) + return + } + response.Success(w, points) +} diff --git a/internal/handler/metric_handler_test.go b/internal/handler/metric_handler_test.go new file mode 100644 index 0000000..85938d5 --- /dev/null +++ b/internal/handler/metric_handler_test.go @@ -0,0 +1,93 @@ +package handler + +import ( + "context" + "net/http" + "net/http/httptest" + "testing" + "time" + + "github.com/company/ai-ops/internal/domain/model" + "github.com/company/ai-ops/internal/service" + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/mock" +) + +type mockMetricRepo struct{ mock.Mock } + +func (m *mockMetricRepo) GetRealtime(ctx context.Context) (*model.RealtimeMetrics, error) { + args := m.Called(ctx) + return args.Get(0).(*model.RealtimeMetrics), args.Error(1) +} +func (m *mockMetricRepo) Query(ctx context.Context, req model.MetricQueryRequest) ([]model.MetricPoint, error) { + args := m.Called(ctx, req) + return args.Get(0).([]model.MetricPoint), args.Error(1) +} +func (m *mockMetricRepo) GetLatest(ctx context.Context, source, name string) (*model.MetricPoint, error) { + args := m.Called(ctx, source, name) + return args.Get(0).(*model.MetricPoint), args.Error(1) +} + +type mockAlertRepo struct{ mock.Mock } + +func (m *mockAlertRepo) GetOpenCount(ctx context.Context) (*model.AlertCount, error) { + args := m.Called(ctx) + return args.Get(0).(*model.AlertCount), 
args.Error(1) +} +func (m *mockAlertRepo) ListRules(ctx context.Context) ([]model.AlertRule, error) { + args := m.Called(ctx) + return args.Get(0).([]model.AlertRule), args.Error(1) +} +func (m *mockAlertRepo) GetRuleByID(ctx context.Context, id string) (*model.AlertRule, error) { + args := m.Called(ctx, id) + return args.Get(0).(*model.AlertRule), args.Error(1) +} +func (m *mockAlertRepo) CreateRule(ctx context.Context, rule *model.AlertRule) error { + args := m.Called(ctx, rule) + return args.Error(0) +} +func (m *mockAlertRepo) UpdateRule(ctx context.Context, rule *model.AlertRule) error { + args := m.Called(ctx, rule) + return args.Error(0) +} +func (m *mockAlertRepo) DeleteRule(ctx context.Context, id string) error { + args := m.Called(ctx, id) + return args.Error(0) +} +func (m *mockAlertRepo) ListEvents(ctx context.Context, status string, page, pageSize int) ([]model.AlertEvent, int, error) { + args := m.Called(ctx, status, page, pageSize) + return args.Get(0).([]model.AlertEvent), args.Int(1), args.Error(2) +} +func (m *mockAlertRepo) CreateEvent(ctx context.Context, event *model.AlertEvent) error { + args := m.Called(ctx, event) + return args.Error(0) +} +func (m *mockAlertRepo) CreateEventWithAggregation(ctx context.Context, event *model.AlertEvent, window time.Duration, threshold int) (*model.AlertEvent, error) { + args := m.Called(ctx, event, window, threshold) + return args.Get(0).(*model.AlertEvent), args.Error(1) +} +func (m *mockAlertRepo) UpdateEventStatus(ctx context.Context, id, status string) error { + args := m.Called(ctx, id, status) + return args.Error(0) +} +func (m *mockAlertRepo) EscalateEvent(ctx context.Context, id, newLevel string) error { + args := m.Called(ctx, id, newLevel) + return args.Error(0) +} + +func TestMetricHandler_GetRealtime(t *testing.T) { + mr := new(mockMetricRepo) + ar := new(mockAlertRepo) + svc := service.NewMetricService(mr, ar) + h := NewMetricHandler(svc) + + expected := &model.RealtimeMetrics{QPS: 100, 
AvgLatency: 50, P99Latency: 100, ErrorRate: 0.01} + mr.On("GetRealtime", mock.Anything).Return(expected, nil) + + req := httptest.NewRequest("GET", "/api/v1/ai-ops/metrics/realtime", nil) + w := httptest.NewRecorder() + h.GetRealtime(w, req) + + assert.Equal(t, http.StatusOK, w.Code) + assert.Contains(t, w.Body.String(), `"qps":100`) +} diff --git a/internal/handler/rule_handler.go b/internal/handler/rule_handler.go new file mode 100644 index 0000000..2f6d0a2 --- /dev/null +++ b/internal/handler/rule_handler.go @@ -0,0 +1,84 @@ +package handler + +import ( + "net/http" + + "github.com/company/ai-ops/internal/domain/model" + "github.com/company/ai-ops/internal/service" + "github.com/company/ai-ops/pkg/errors" + "github.com/company/ai-ops/pkg/response" +) + +// RuleHandler is the HTTP handler for alert rules +type RuleHandler struct { + service *service.RuleService +} + +func NewRuleHandler(s *service.RuleService) *RuleHandler { + return &RuleHandler{service: s} +} + +func (h *RuleHandler) RegisterRoutes(mux *http.ServeMux) { + mux.HandleFunc("GET /api/v1/ai-ops/rules", h.ListRules) + mux.HandleFunc("GET /api/v1/ai-ops/rules/{id}", h.GetRule) + mux.HandleFunc("POST /api/v1/ai-ops/rules", h.CreateRule) + mux.HandleFunc("PUT /api/v1/ai-ops/rules/{id}", h.UpdateRule) + mux.HandleFunc("DELETE /api/v1/ai-ops/rules/{id}", h.DeleteRule) +} + +func (h *RuleHandler) ListRules(w http.ResponseWriter, r *http.Request) { + rules, err := h.service.ListRules(r.Context()) + if err != nil { + response.Error(w, errors.Wrap(err, errors.ErrInternal)) + return + } + response.Success(w, rules) +} + +func (h *RuleHandler) GetRule(w http.ResponseWriter, r *http.Request) { + id := r.PathValue("id") + rule, err := h.service.GetRule(r.Context(), id) + if err != nil { + response.Error(w, errors.Wrap(err, errors.ErrNotFound)) + return + } + response.Success(w, rule) +} + +func (h *RuleHandler) CreateRule(w http.ResponseWriter, r *http.Request) { + var rule model.AlertRule + if err := decodeJSON(r, &rule); err != nil 
{ + response.Error(w, errors.ErrBadRequest.WithDetail(map[string]any{"error": err.Error()})) + return + } + if err := h.service.CreateRule(r.Context(), &rule); err != nil { + response.Error(w, errors.Wrap(err, errors.ErrBadRequest)) + return + } + w.WriteHeader(http.StatusCreated) + response.Success(w, rule) +} + +func (h *RuleHandler) UpdateRule(w http.ResponseWriter, r *http.Request) { + id := r.PathValue("id") + var rule model.AlertRule + if err := decodeJSON(r, &rule); err != nil { + response.Error(w, errors.ErrBadRequest.WithDetail(map[string]any{"error": err.Error()})) + return + } + rule.ID = id + if err := h.service.UpdateRule(r.Context(), &rule); err != nil { + response.Error(w, errors.Wrap(err, errors.ErrBadRequest)) + return + } + response.Success(w, rule) +} + +func (h *RuleHandler) DeleteRule(w http.ResponseWriter, r *http.Request) { + id := r.PathValue("id") + if err := h.service.DeleteRule(r.Context(), id); err != nil { + response.Error(w, errors.Wrap(err, errors.ErrInternal)) + return + } + w.WriteHeader(http.StatusNoContent) +} diff --git a/internal/handler/utils.go b/internal/handler/utils.go new file mode 100644 index 0000000..d6d4a2a --- /dev/null +++ b/internal/handler/utils.go @@ -0,0 +1,10 @@ +package handler + +import ( + "encoding/json" + "net/http" +) + +func decodeJSON(r *http.Request, v any) error { + return json.NewDecoder(r.Body).Decode(v) +} diff --git a/internal/infra/repository/pg_alert_repository.go b/internal/infra/repository/pg_alert_repository.go new file mode 100644 index 0000000..2b17a6d --- /dev/null +++ b/internal/infra/repository/pg_alert_repository.go @@ -0,0 +1,314 @@ +package repository + +import ( + "context" + "crypto/rand" + "encoding/hex" + "fmt" + "time" + + "github.com/company/ai-ops/internal/database" + "github.com/company/ai-ops/internal/domain/model" + "github.com/jackc/pgx/v5" +) + +// PGAlertRepository is the PostgreSQL-backed implementation of the alert store +type PGAlertRepository struct{} + +func NewPGAlertRepository() *PGAlertRepository { + 
return &PGAlertRepository{} +} + +func (r *PGAlertRepository) GetOpenCount(ctx context.Context) (*model.AlertCount, error) { + var count model.AlertCount + err := database.Pool.QueryRow(ctx, ` + SELECT + COUNT(*) FILTER (WHERE status != 'resolved') AS open_count, + COUNT(*) FILTER (WHERE status != 'resolved' AND level = 'P0') AS p0_count, + COUNT(*) FILTER (WHERE status != 'resolved' AND level = 'P1') AS p1_count, + COUNT(*) FILTER (WHERE status != 'resolved' AND level = 'P2') AS p2_count, + COUNT(*) FILTER (WHERE status != 'resolved' AND level = 'P3') AS p3_count + FROM ai_ops_alerts + `).Scan(&count.Open, &count.P0, &count.P1, &count.P2, &count.P3) + if err != nil { + return nil, fmt.Errorf("query alert count: %w", err) + } + return &count, nil +} + +func (r *PGAlertRepository) ListRules(ctx context.Context) ([]model.AlertRule, error) { + rows, err := database.Pool.Query(ctx, ` + SELECT id, name, metric_source, metric_name, threshold_type, threshold_value, + duration_min, level, channel_ids, healing_action, healing_config, + is_sandboxed, enabled, version, created_by, created_at, updated_at + FROM ai_ops_rules + WHERE enabled = true + ORDER BY created_at DESC + `) + if err != nil { + return nil, fmt.Errorf("query rules: %w", err) + } + defer rows.Close() + + rules := make([]model.AlertRule, 0) + for rows.Next() { + var ru model.AlertRule + var channelIDs []string + if err := rows.Scan( + &ru.ID, &ru.Name, &ru.MetricSource, &ru.MetricName, &ru.ThresholdType, &ru.ThresholdValue, + &ru.DurationMin, &ru.Level, &channelIDs, &ru.HealingAction, &ru.HealingConfig, + &ru.IsSandboxed, &ru.Enabled, &ru.Version, &ru.CreatedBy, &ru.CreatedAt, &ru.UpdatedAt, + ); err != nil { + return nil, fmt.Errorf("scan rule: %w", err) + } + ru.ChannelIDs = channelIDs + rules = append(rules, ru) + } + return rules, rows.Err() +} + +func (r *PGAlertRepository) GetRuleByID(ctx context.Context, id string) (*model.AlertRule, error) { + var ru model.AlertRule + var channelIDs []string + err := 
database.Pool.QueryRow(ctx, ` + SELECT id, name, metric_source, metric_name, threshold_type, threshold_value, + duration_min, level, channel_ids, healing_action, healing_config, + is_sandboxed, enabled, version, created_by, created_at, updated_at + FROM ai_ops_rules WHERE id = $1 + `, id).Scan( + &ru.ID, &ru.Name, &ru.MetricSource, &ru.MetricName, &ru.ThresholdType, &ru.ThresholdValue, + &ru.DurationMin, &ru.Level, &channelIDs, &ru.HealingAction, &ru.HealingConfig, + &ru.IsSandboxed, &ru.Enabled, &ru.Version, &ru.CreatedBy, &ru.CreatedAt, &ru.UpdatedAt, + ) + if err == pgx.ErrNoRows { + return nil, fmt.Errorf("rule not found") + } + if err != nil { + return nil, fmt.Errorf("query rule: %w", err) + } + ru.ChannelIDs = channelIDs + return &ru, nil +} + +func (r *PGAlertRepository) CreateRule(ctx context.Context, rule *model.AlertRule) error { + _, err := database.Pool.Exec(ctx, ` + INSERT INTO ai_ops_rules (id, name, metric_source, metric_name, threshold_type, threshold_value, + duration_min, level, channel_ids, healing_action, healing_config, + is_sandboxed, enabled, version, created_by, created_at, updated_at) + VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,NOW(),NOW()) + `, rule.ID, rule.Name, rule.MetricSource, rule.MetricName, rule.ThresholdType, rule.ThresholdValue, + rule.DurationMin, rule.Level, rule.ChannelIDs, rule.HealingAction, rule.HealingConfig, + rule.IsSandboxed, rule.Enabled, rule.Version, rule.CreatedBy) + if err != nil { + return fmt.Errorf("insert rule: %w", err) + } + return nil +} + +func (r *PGAlertRepository) UpdateRule(ctx context.Context, rule *model.AlertRule) error { + _, err := database.Pool.Exec(ctx, ` + UPDATE ai_ops_rules SET + name=$2, metric_source=$3, metric_name=$4, threshold_type=$5, threshold_value=$6, + duration_min=$7, level=$8, channel_ids=$9, healing_action=$10, healing_config=$11, + is_sandboxed=$12, enabled=$13, version=$14, updated_at=NOW() + WHERE id=$1 + `, rule.ID, rule.Name, rule.MetricSource, 
rule.MetricName, rule.ThresholdType, rule.ThresholdValue, + rule.DurationMin, rule.Level, rule.ChannelIDs, rule.HealingAction, rule.HealingConfig, + rule.IsSandboxed, rule.Enabled, rule.Version) + if err != nil { + return fmt.Errorf("update rule: %w", err) + } + return nil +} + +func (r *PGAlertRepository) DeleteRule(ctx context.Context, id string) error { + _, err := database.Pool.Exec(ctx, `DELETE FROM ai_ops_rules WHERE id = $1`, id) + if err != nil { + return fmt.Errorf("delete rule: %w", err) + } + return nil +} + +func (r *PGAlertRepository) ListEvents(ctx context.Context, status string, page, pageSize int) ([]model.AlertEvent, int, error) { + where := "" + args := []any{} + if status != "" { + where = "WHERE status = $1" + args = append(args, status) + } + + var total int + countQuery := fmt.Sprintf("SELECT COUNT(*) FROM ai_ops_alerts %s", where) + if err := database.Pool.QueryRow(ctx, countQuery, args...).Scan(&total); err != nil { + return nil, 0, fmt.Errorf("count events: %w", err) + } + + if page < 1 { + page = 1 + } + if pageSize < 1 || pageSize > 100 { + pageSize = 20 + } + offset := (page - 1) * pageSize + + dataQuery := fmt.Sprintf(` + SELECT id, rule_id, level, resource_type, resource_id, current_value, threshold_value, + status, is_aggregated, aggregated_count, parent_alert_id, started_at, resolved_at + FROM ai_ops_alerts %s + ORDER BY started_at DESC + LIMIT $%d OFFSET $%d + `, where, len(args)+1, len(args)+2) + queryArgs := append(args, pageSize, offset) + + rows, err := database.Pool.Query(ctx, dataQuery, queryArgs...) 
+ if err != nil { + return nil, 0, fmt.Errorf("query events: %w", err) + } + defer rows.Close() + + events := make([]model.AlertEvent, 0) + for rows.Next() { + var e model.AlertEvent + if err := rows.Scan( + &e.ID, &e.RuleID, &e.Level, &e.ResourceType, &e.ResourceID, + &e.CurrentValue, &e.ThresholdValue, &e.Status, &e.IsAggregated, &e.AggregatedCount, + &e.ParentAlertID, &e.StartedAt, &e.ResolvedAt, + ); err != nil { + return nil, 0, fmt.Errorf("scan event: %w", err) + } + events = append(events, e) + } + return events, total, rows.Err() +} + +func (r *PGAlertRepository) CreateEvent(ctx context.Context, event *model.AlertEvent) error { + _, err := r.CreateEventWithAggregation(ctx, event, 0, 0) + return err +} + +func (r *PGAlertRepository) CreateEventWithAggregation(ctx context.Context, event *model.AlertEvent, window time.Duration, threshold int) (*model.AlertEvent, error) { + tx, err := database.Pool.Begin(ctx) + if err != nil { + return nil, fmt.Errorf("begin create event: %w", err) + } + defer tx.Rollback(ctx) + + startedAt := event.StartedAt + if startedAt.IsZero() { + startedAt = time.Now() + } + + _, err = tx.Exec(ctx, ` + INSERT INTO ai_ops_alerts (id, rule_id, level, resource_type, resource_id, + current_value, threshold_value, status, is_aggregated, aggregated_count, parent_alert_id, started_at) + VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12) + `, event.ID, event.RuleID, event.Level, event.ResourceType, event.ResourceID, + event.CurrentValue, event.ThresholdValue, event.Status, event.IsAggregated, + event.AggregatedCount, event.ParentAlertID, startedAt) + if err != nil { + return nil, fmt.Errorf("insert event: %w", err) + } + event.StartedAt = startedAt + + if window <= 0 || threshold <= 0 { + if err := tx.Commit(ctx); err != nil { + return nil, fmt.Errorf("commit event: %w", err) + } + return event, nil + } + + var count int + err = tx.QueryRow(ctx, ` + SELECT COUNT(*) + FROM ai_ops_alerts + WHERE resource_type = $1 + AND resource_id = $2 + AND 
started_at >= $3 + AND is_aggregated = false + AND parent_alert_id IS NULL + `, event.ResourceType, event.ResourceID, startedAt.Add(-window)).Scan(&count) + if err != nil { + return nil, fmt.Errorf("count aggregation candidates: %w", err) + } + + if count <= threshold { + if err := tx.Commit(ctx); err != nil { + return nil, fmt.Errorf("commit event: %w", err) + } + return event, nil + } + + aggregated := &model.AlertEvent{ + ID: newUUID(), + RuleID: event.RuleID, + Level: event.Level, + ResourceType: event.ResourceType, + ResourceID: event.ResourceID, + CurrentValue: event.CurrentValue, + ThresholdValue: fmt.Sprintf("cluster_count>%d", threshold), + Status: event.Status, + IsAggregated: true, + AggregatedCount: count, + StartedAt: startedAt, + } + + _, err = tx.Exec(ctx, ` + INSERT INTO ai_ops_alerts (id, rule_id, level, resource_type, resource_id, + current_value, threshold_value, status, is_aggregated, aggregated_count, started_at) + VALUES ($1,$2,$3,$4,$5,$6,$7,$8,true,$9,$10) + `, aggregated.ID, aggregated.RuleID, aggregated.Level, aggregated.ResourceType, aggregated.ResourceID, + aggregated.CurrentValue, aggregated.ThresholdValue, aggregated.Status, aggregated.AggregatedCount, aggregated.StartedAt) + if err != nil { + return nil, fmt.Errorf("insert aggregated event: %w", err) + } + + _, err = tx.Exec(ctx, ` + UPDATE ai_ops_alerts + SET parent_alert_id = $1 + WHERE resource_type = $2 + AND resource_id = $3 + AND started_at >= $4 + AND is_aggregated = false + AND parent_alert_id IS NULL + `, aggregated.ID, event.ResourceType, event.ResourceID, startedAt.Add(-window)) + if err != nil { + return nil, fmt.Errorf("attach aggregated children: %w", err) + } + + if err := tx.Commit(ctx); err != nil { + return nil, fmt.Errorf("commit aggregated event: %w", err) + } + return aggregated, nil +} + +func (r *PGAlertRepository) UpdateEventStatus(ctx context.Context, id, status string) error { + resolvedAt := "NULL" + if status == "resolved" { + resolvedAt = "NOW()" + } + _, 
err := database.Pool.Exec(ctx, fmt.Sprintf(` + UPDATE ai_ops_alerts SET status = $2, resolved_at = %s WHERE id = $1 + `, resolvedAt), id, status) + if err != nil { + return fmt.Errorf("update event status: %w", err) + } + return nil +} + +func (r *PGAlertRepository) EscalateEvent(ctx context.Context, id, newLevel string) error { + _, err := database.Pool.Exec(ctx, `UPDATE ai_ops_alerts SET level = $2 WHERE id = $1`, id, newLevel) + if err != nil { + return fmt.Errorf("escalate event: %w", err) + } + return nil +} + +func newUUID() string { + b := make([]byte, 16) + if _, err := rand.Read(b); err != nil { + return fmt.Sprintf("00000000-0000-4000-8000-%012d", time.Now().UnixNano()%1_000_000_000_000) + } + b[6] = (b[6] & 0x0f) | 0x40 + b[8] = (b[8] & 0x3f) | 0x80 + return fmt.Sprintf("%s-%s-%s-%s-%s", hex.EncodeToString(b[0:4]), hex.EncodeToString(b[4:6]), hex.EncodeToString(b[6:8]), hex.EncodeToString(b[8:10]), hex.EncodeToString(b[10:16])) +} diff --git a/internal/infra/repository/pg_channel_repository.go b/internal/infra/repository/pg_channel_repository.go new file mode 100644 index 0000000..59e3631 --- /dev/null +++ b/internal/infra/repository/pg_channel_repository.go @@ -0,0 +1,87 @@ +package repository + +import ( + "context" + "fmt" + + "github.com/company/ai-ops/internal/database" + "github.com/company/ai-ops/internal/domain/model" + "github.com/jackc/pgx/v5" +) + +// PGChannelRepository is the PostgreSQL-backed implementation of the notification channel store +type PGChannelRepository struct{} + +func NewPGChannelRepository() *PGChannelRepository { + return &PGChannelRepository{} +} + +func (r *PGChannelRepository) List(ctx context.Context) ([]model.NotificationChannel, error) { + rows, err := database.Pool.Query(ctx, ` + SELECT id, name, channel_type, config, priority, enabled, created_at + FROM ai_ops_channels + WHERE enabled = true + ORDER BY priority DESC, created_at DESC + `) + if err != nil { + return nil, fmt.Errorf("query channels: %w", err) + } + defer rows.Close() + + channels := 
make([]model.NotificationChannel, 0) + for rows.Next() { + var c model.NotificationChannel + if err := rows.Scan(&c.ID, &c.Name, &c.ChannelType, &c.Config, &c.Priority, &c.Enabled, &c.CreatedAt); err != nil { + return nil, fmt.Errorf("scan channel: %w", err) + } + channels = append(channels, c) + } + return channels, rows.Err() +} + +func (r *PGChannelRepository) GetByID(ctx context.Context, id string) (*model.NotificationChannel, error) { + var c model.NotificationChannel + err := database.Pool.QueryRow(ctx, ` + SELECT id, name, channel_type, config, priority, enabled, created_at + FROM ai_ops_channels + WHERE id = $1 + `, id).Scan(&c.ID, &c.Name, &c.ChannelType, &c.Config, &c.Priority, &c.Enabled, &c.CreatedAt) + if err == pgx.ErrNoRows { + return nil, fmt.Errorf("channel not found") + } + if err != nil { + return nil, fmt.Errorf("query channel: %w", err) + } + return &c, nil +} + +func (r *PGChannelRepository) Create(ctx context.Context, ch *model.NotificationChannel) error { + _, err := database.Pool.Exec(ctx, ` + INSERT INTO ai_ops_channels (id, name, channel_type, config, priority, enabled, created_at) + VALUES ($1, $2, $3, $4, $5, $6, NOW()) + `, ch.ID, ch.Name, ch.ChannelType, ch.Config, ch.Priority, ch.Enabled) + if err != nil { + return fmt.Errorf("insert channel: %w", err) + } + return nil +} + +func (r *PGChannelRepository) Update(ctx context.Context, ch *model.NotificationChannel) error { + _, err := database.Pool.Exec(ctx, ` + UPDATE ai_ops_channels + SET name = $2, channel_type = $3, config = $4, priority = $5, enabled = $6 + WHERE id = $1 + `, ch.ID, ch.Name, ch.ChannelType, ch.Config, ch.Priority, ch.Enabled) + if err != nil { + return fmt.Errorf("update channel: %w", err) + } + return nil +} + +func (r *PGChannelRepository) Delete(ctx context.Context, id string) error { + _, err := database.Pool.Exec(ctx, `DELETE FROM ai_ops_channels WHERE id = $1`, id) + if err != nil { + return fmt.Errorf("delete channel: %w", err) + } + return nil +} diff --git 
a/internal/infra/repository/pg_healing_repository.go b/internal/infra/repository/pg_healing_repository.go new file mode 100644 index 0000000..c7151f7 --- /dev/null +++ b/internal/infra/repository/pg_healing_repository.go @@ -0,0 +1,38 @@ +package repository + +import ( + "context" + "fmt" + + "github.com/company/ai-ops/internal/database" + "github.com/company/ai-ops/internal/service" +) + +// PGHealingRepository is the PostgreSQL implementation for self-healing records. +type PGHealingRepository struct{} + +func NewPGHealingRepository() *PGHealingRepository { + return &PGHealingRepository{} +} + +func (r *PGHealingRepository) CreateHealing(ctx context.Context, h *service.HealingLog) error { + _, err := database.Pool.Exec(ctx, ` + INSERT INTO ai_ops_healings (id, alert_id, action_type, config, status, dry_run, result_detail, error_code, started_at) + VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9) + `, h.ID, h.AlertID, h.ActionType, h.Config, h.Status, h.DryRun, h.ResultDetail, h.ErrorCode, h.StartedAt) + if err != nil { + return fmt.Errorf("insert healing: %w", err) + } + return nil +} + +func (r *PGHealingRepository) UpdateHealingStatus(ctx context.Context, id, status string, result map[string]any, errCode string) error { + _, err := database.Pool.Exec(ctx, ` + UPDATE ai_ops_healings SET status = $2, result_detail = $3, error_code = $4, completed_at = NOW() + WHERE id = $1 + `, id, status, result, errCode) + if err != nil { + return fmt.Errorf("update healing: %w", err) + } + return nil +} diff --git a/internal/infra/repository/pg_log_repository.go b/internal/infra/repository/pg_log_repository.go new file mode 100644 index 0000000..991c4a5 --- /dev/null +++ b/internal/infra/repository/pg_log_repository.go @@ -0,0 +1,112 @@ +package repository + +import ( + "context" + "fmt" + "strings" + + "github.com/company/ai-ops/internal/database" + "github.com/company/ai-ops/internal/domain/model" +) + +// PGLogRepository is the PostgreSQL-backed log store. +type PGLogRepository struct{} + +func NewPGLogRepository() 
*PGLogRepository { + return &PGLogRepository{} +} + +func (r *PGLogRepository) Query(ctx context.Context, filter model.LogQueryFilter) ([]model.RequestLog, int, error) { + // Build the filter conditions (parameterized query). + var conditions []string + var args []any + argIdx := 1 + + if filter.StartTime != nil { + conditions = append(conditions, fmt.Sprintf("timestamp >= $%d", argIdx)) + args = append(args, *filter.StartTime) + argIdx++ + } + if filter.EndTime != nil { + conditions = append(conditions, fmt.Sprintf("timestamp <= $%d", argIdx)) + args = append(args, *filter.EndTime) + argIdx++ + } + if filter.Service != "" { + conditions = append(conditions, fmt.Sprintf("service = $%d", argIdx)) + args = append(args, filter.Service) + argIdx++ + } + if filter.Path != "" { + conditions = append(conditions, fmt.Sprintf("path = $%d", argIdx)) + args = append(args, filter.Path) + argIdx++ + } + if filter.StatusCode != nil { + conditions = append(conditions, fmt.Sprintf("status_code = $%d", argIdx)) + args = append(args, *filter.StatusCode) + argIdx++ + } + if filter.UserID != "" { + conditions = append(conditions, fmt.Sprintf("user_id = $%d", argIdx)) + args = append(args, filter.UserID) + argIdx++ + } + if filter.SupplierID != "" { + conditions = append(conditions, fmt.Sprintf("supplier_id = $%d", argIdx)) + args = append(args, filter.SupplierID) + argIdx++ + } + + whereClause := "" + if len(conditions) > 0 { + whereClause = "WHERE " + strings.Join(conditions, " AND ") + } + + // Count the total number of matching rows. + var total int + countQuery := fmt.Sprintf("SELECT COUNT(*) FROM ai_ops_request_logs %s", whereClause) + if err := database.Pool.QueryRow(ctx, countQuery, args...).Scan(&total); err != nil { + return nil, 0, fmt.Errorf("count logs: %w", err) + } + + // Fetch one page of data. + page := filter.Page + if page < 1 { + page = 1 + } + pageSize := filter.PageSize + if pageSize < 1 || pageSize > 100 { + pageSize = 20 + } + offset := (page - 1) * pageSize + + queryArgs := append(args, pageSize, offset) + dataQuery := fmt.Sprintf(` + SELECT id, 
timestamp, service, path, status_code, latency_ms, user_id, supplier_id, method, error_code + FROM ai_ops_request_logs + %s + ORDER BY timestamp DESC + LIMIT $%d OFFSET $%d + `, whereClause, argIdx, argIdx+1) + + rows, err := database.Pool.Query(ctx, dataQuery, queryArgs...) + if err != nil { + return nil, 0, fmt.Errorf("query logs: %w", err) + } + defer rows.Close() + + var logs []model.RequestLog + for rows.Next() { + var l model.RequestLog + if err := rows.Scan( + &l.ID, &l.Timestamp, &l.Service, &l.Path, &l.StatusCode, + &l.LatencyMs, &l.UserID, &l.SupplierID, &l.Method, &l.ErrorCode, + ); err != nil { + return nil, 0, fmt.Errorf("scan log: %w", err) + } + logs = append(logs, l) + } + + return logs, total, rows.Err() +} diff --git a/internal/infra/repository/pg_metric_repository.go b/internal/infra/repository/pg_metric_repository.go new file mode 100644 index 0000000..f7088c1 --- /dev/null +++ b/internal/infra/repository/pg_metric_repository.go @@ -0,0 +1,95 @@ +package repository + +import ( + "context" + "fmt" + + "github.com/company/ai-ops/internal/database" + "github.com/company/ai-ops/internal/domain/model" + "github.com/jackc/pgx/v5" +) + +// PGMetricRepository is the PostgreSQL-backed metric store. +type PGMetricRepository struct{} + +func NewPGMetricRepository() *PGMetricRepository { + return &PGMetricRepository{} +} + +func (r *PGMetricRepository) GetRealtime(ctx context.Context) (*model.RealtimeMetrics, error) { + // Fetch the latest value of each metric from the ai_ops_metrics table. + queries := map[string]*float64{ + "qps": new(float64), + "avg_latency": new(float64), + "p99_latency": new(float64), + "error_rate": new(float64), + } + + for name, ptr := range queries { + var value float64 + err := database.Pool.QueryRow(ctx, ` + SELECT value FROM ai_ops_metrics + WHERE metric_name = $1 + ORDER BY recorded_at DESC + LIMIT 1 + `, name).Scan(&value) + if err != nil && err != pgx.ErrNoRows { + return nil, fmt.Errorf("query %s: %w", name, err) + } + *ptr = value + } + + return &model.RealtimeMetrics{ + QPS: 
*queries["qps"], + AvgLatency: *queries["avg_latency"], + P99Latency: *queries["p99_latency"], + ErrorRate: *queries["error_rate"], + }, nil +} + +func (r *PGMetricRepository) Query(ctx context.Context, req model.MetricQueryRequest) ([]model.MetricPoint, error) { + rows, err := database.Pool.Query(ctx, ` + SELECT metric_name, labels, value, recorded_at + FROM ai_ops_metrics + WHERE metric_name = $1 + AND recorded_at >= $2 + AND recorded_at <= $3 + ORDER BY recorded_at DESC + `, req.Name, req.StartTime, req.EndTime) + if err != nil { + return nil, fmt.Errorf("query metrics: %w", err) + } + defer rows.Close() + + var points []model.MetricPoint + for rows.Next() { + var p model.MetricPoint + var labels map[string]string + if err := rows.Scan(&p.Name, &labels, &p.Value, &p.Timestamp); err != nil { + return nil, fmt.Errorf("scan metric: %w", err) + } + p.Source = req.Source + p.Tags = labels + points = append(points, p) + } + + return points, rows.Err() +} + +func (r *PGMetricRepository) GetLatest(ctx context.Context, source, name string) (*model.MetricPoint, error) { + var p model.MetricPoint + var labels map[string]string + err := database.Pool.QueryRow(ctx, ` + SELECT metric_name, labels, value, recorded_at + FROM ai_ops_metrics + WHERE metric_name = $1 + ORDER BY recorded_at DESC + LIMIT 1 + `, name).Scan(&p.Name, &labels, &p.Value, &p.Timestamp) + if err != nil { + return nil, fmt.Errorf("query latest metric: %w", err) + } + p.Source = source + p.Tags = labels + return &p, nil +} diff --git a/internal/infra/repository/pg_notification_log_repository.go b/internal/infra/repository/pg_notification_log_repository.go new file mode 100644 index 0000000..d498eb2 --- /dev/null +++ b/internal/infra/repository/pg_notification_log_repository.go @@ -0,0 +1,57 @@ +package repository + +import ( + "context" + "fmt" + + "github.com/company/ai-ops/internal/database" + "github.com/company/ai-ops/internal/domain/model" +) + +// PGNotificationLogRepository is the PostgreSQL-backed notification log store. 
+type PGNotificationLogRepository struct{} + +func NewPGNotificationLogRepository() *PGNotificationLogRepository { + return &PGNotificationLogRepository{} +} + +func (r *PGNotificationLogRepository) CreateLog(ctx context.Context, log *model.NotificationLog) error { + if log.ID == "" { + log.ID = newUUID() + } + if log.Status == "" { + log.Status = "pending" + } + _, err := database.Pool.Exec(ctx, ` + INSERT INTO ai_ops_notification_logs (id, event_id, channel_id, channel_type, status, retry_count, error_message, sent_at, created_at) + VALUES ($1,$2,$3,$4,$5,$6,$7,$8,NOW()) + `, log.ID, log.EventID, log.ChannelID, log.ChannelType, log.Status, log.RetryCount, log.ErrorMessage, log.SentAt) + if err != nil { + return fmt.Errorf("insert notification log: %w", err) + } + return nil +} + +func (r *PGNotificationLogRepository) MarkSent(ctx context.Context, id string) error { + _, err := database.Pool.Exec(ctx, ` + UPDATE ai_ops_notification_logs + SET status='sent', sent_at=NOW(), error_message=NULL + WHERE id=$1 + `, id) + if err != nil { + return fmt.Errorf("mark notification sent: %w", err) + } + return nil +} + +func (r *PGNotificationLogRepository) MarkFailed(ctx context.Context, id string, retryCount int, errMessage string) error { + _, err := database.Pool.Exec(ctx, ` + UPDATE ai_ops_notification_logs + SET status='failed', retry_count=$2, error_message=$3 + WHERE id=$1 + `, id, retryCount, errMessage) + if err != nil { + return fmt.Errorf("mark notification failed: %w", err) + } + return nil +} diff --git a/internal/infra/repository/pg_repository_integration_test.go b/internal/infra/repository/pg_repository_integration_test.go new file mode 100644 index 0000000..b4c641f --- /dev/null +++ b/internal/infra/repository/pg_repository_integration_test.go @@ -0,0 +1,269 @@ +package repository + +import ( + "context" + "crypto/rand" + "encoding/hex" + "fmt" + "os" + "path/filepath" + "sort" + "sync" + "testing" + "time" + + "github.com/company/ai-ops/internal/config" + 
"github.com/company/ai-ops/internal/database" + "github.com/company/ai-ops/internal/domain/model" + "github.com/company/ai-ops/internal/service" +) + +var pgMigrationOnce sync.Once +var pgMigrationErr error + +func setupPGIntegration(t *testing.T) context.Context { + t.Helper() + ctx := context.Background() + if database.Pool == nil { + ports := []int{15432, 5432} + var lastErr error + for _, port := range ports { + lastErr = database.Init(config.DatabaseConfig{Host: "localhost", Port: port, User: "aiops", Password: "aiops123", DBName: "ai_ops", SSLMode: "disable", PoolSize: 4}) + if lastErr == nil { + break + } + database.Close() + database.Pool = nil + } + if lastErr != nil { + t.Skipf("PostgreSQL integration database not available: %v", lastErr) + } + } + pgMigrationOnce.Do(func() { + pgMigrationErr = applyMigrations(ctx) + }) + if pgMigrationErr != nil { + t.Fatalf("apply migrations: %v", pgMigrationErr) + } + return ctx +} + +func applyMigrations(ctx context.Context) error { + if _, err := database.Pool.Exec(ctx, `SELECT pg_advisory_lock(424242001)`); err != nil { + return err + } + defer database.Pool.Exec(ctx, `SELECT pg_advisory_unlock(424242001)`) + + files, err := filepath.Glob(filepath.Join("..", "..", "..", "tech", "migrations", "*.up.sql")) + if err != nil { + return err + } + sort.Strings(files) + for _, f := range files { + b, err := os.ReadFile(f) + if err != nil { + return err + } + if _, err := database.Pool.Exec(ctx, string(b)); err != nil { + return fmt.Errorf("%s: %w", f, err) + } + } + return nil +} + +func testUUID(t *testing.T) string { + t.Helper() + b := make([]byte, 16) + if _, err := rand.Read(b); err != nil { + t.Fatal(err) + } + b[6] = (b[6] & 0x0f) | 0x40 + b[8] = (b[8] & 0x3f) | 0x80 + return hex.EncodeToString(b[0:4]) + "-" + hex.EncodeToString(b[4:6]) + "-" + hex.EncodeToString(b[6:8]) + "-" + hex.EncodeToString(b[8:10]) + "-" + hex.EncodeToString(b[10:16]) +} + +func cleanupIDs(t *testing.T, ctx context.Context, ids ...string) { + 
t.Helper() + for _, id := range ids { + _, _ = database.Pool.Exec(ctx, `DELETE FROM ai_ops_notification_logs WHERE id=$1 OR event_id=$1 OR channel_id=$1`, id) + _, _ = database.Pool.Exec(ctx, `DELETE FROM ai_ops_healings WHERE id=$1 OR alert_id=$1`, id) + _, _ = database.Pool.Exec(ctx, `DELETE FROM ai_ops_alerts WHERE id=$1 OR rule_id=$1 OR parent_alert_id=$1`, id) + _, _ = database.Pool.Exec(ctx, `DELETE FROM ai_ops_rules WHERE id=$1`, id) + _, _ = database.Pool.Exec(ctx, `DELETE FROM ai_ops_channels WHERE id=$1`, id) + _, _ = database.Pool.Exec(ctx, `DELETE FROM ai_ops_request_logs WHERE id=$1`, id) + } +} + +func TestPGChannelRepositoryCRUD(t *testing.T) { + ctx := setupPGIntegration(t) + repo := NewPGChannelRepository() + id := testUUID(t) + defer cleanupIDs(t, ctx, id) + + ch := &model.NotificationChannel{ID: id, Name: "test-channel", ChannelType: "webhook", Config: map[string]any{"webhook_url": "http://example.invalid"}, Priority: 7, Enabled: true} + if err := repo.Create(ctx, ch); err != nil { + t.Fatal(err) + } + got, err := repo.GetByID(ctx, id) + if err != nil || got.ID != id || got.Name != ch.Name { + t.Fatalf("get channel = %+v %v", got, err) + } + list, err := repo.List(ctx) + if err != nil { + t.Fatal(err) + } + found := false + for _, item := range list { + if item.ID == id { + found = true + } + } + if !found { + t.Fatalf("created channel not found in list: %+v", list) + } + ch.Name = "updated-channel" + ch.Priority = 8 + if err := repo.Update(ctx, ch); err != nil { + t.Fatal(err) + } + updated, err := repo.GetByID(ctx, id) + if err != nil || updated.Name != "updated-channel" || updated.Priority != 8 { + t.Fatalf("updated channel = %+v %v", updated, err) + } + if err := repo.Delete(ctx, id); err != nil { + t.Fatal(err) + } + if _, err := repo.GetByID(ctx, id); err == nil { + t.Fatal("expected not found after delete") + } +} + +func TestPGAlertRepositoryRulesEventsAndAggregation(t *testing.T) { + ctx := setupPGIntegration(t) + repo := 
NewPGAlertRepository() + ruleID, eventID, childID := testUUID(t), testUUID(t), testUUID(t) + defer cleanupIDs(t, ctx, ruleID, eventID, childID) + + rule := &model.AlertRule{ID: ruleID, Name: "rule-" + ruleID, MetricSource: "prom", MetricName: "p99", ThresholdType: ">", ThresholdValue: "100", DurationMin: 1, Level: "P1", ChannelIDs: []string{}, IsSandboxed: true, Enabled: true, Version: 1, CreatedBy: "test"} + if err := repo.CreateRule(ctx, rule); err != nil { + t.Fatal(err) + } + if got, err := repo.GetRuleByID(ctx, ruleID); err != nil || got.ID != ruleID || got.Name != rule.Name { + t.Fatalf("get rule = %+v %v", got, err) + } + rules, err := repo.ListRules(ctx) + if err != nil || len(rules) == 0 { + t.Fatalf("list rules = %d %v", len(rules), err) + } + rule.Name = "rule-updated-" + ruleID + rule.Version = 2 + if err := repo.UpdateRule(ctx, rule); err != nil { + t.Fatal(err) + } + + now := time.Now().UTC() + event := &model.AlertEvent{ID: eventID, RuleID: ruleID, Level: "P1", ResourceType: "svc", ResourceID: "res-" + ruleID, CurrentValue: "120", ThresholdValue: "100", Status: "triggered", StartedAt: now} + created, err := repo.CreateEventWithAggregation(ctx, event, time.Minute, 10) + if err != nil || created.ID != eventID { + t.Fatalf("create event = %+v %v", created, err) + } + directID := testUUID(t) + defer cleanupIDs(t, ctx, directID) + if err := repo.CreateEvent(ctx, &model.AlertEvent{ID: directID, RuleID: ruleID, Level: "P2", ResourceType: "svc", ResourceID: "direct-" + ruleID, CurrentValue: "101", ThresholdValue: "100", Status: "triggered", StartedAt: now.Add(2 * time.Second)}); err != nil { + t.Fatalf("create direct event: %v", err) + } + if err := repo.UpdateEventStatus(ctx, eventID, "resolved"); err != nil { + t.Fatal(err) + } + if err := repo.EscalateEvent(ctx, eventID, "P0"); err != nil { + t.Fatal(err) + } + + agg, err := repo.CreateEventWithAggregation(ctx, &model.AlertEvent{ID: childID, RuleID: ruleID, Level: "P1", ResourceType: "svc", ResourceID: 
"res-" + ruleID, CurrentValue: "130", ThresholdValue: "100", Status: "triggered", StartedAt: now.Add(time.Second)}, time.Minute, 1) + if err != nil || !agg.IsAggregated || agg.AggregatedCount < 2 { + t.Fatalf("aggregation = %+v %v", agg, err) + } + defer cleanupIDs(t, ctx, agg.ID) + + events, total, err := repo.ListEvents(ctx, "triggered", 1, 20) + if err != nil || total < 1 || len(events) < 1 { + t.Fatalf("list events = total=%d len=%d err=%v", total, len(events), err) + } + count, err := repo.GetOpenCount(ctx) + if err != nil || count.Open < 1 { + t.Fatalf("open count = %+v %v", count, err) + } + if err := repo.DeleteRule(ctx, ruleID); err != nil { + t.Fatal(err) + } +} + +func TestPGMetricAndLogRepositories(t *testing.T) { + ctx := setupPGIntegration(t) + metricRepo := NewPGMetricRepository() + logRepo := NewPGLogRepository() + logID := testUUID(t) + metricName := "test_metric_" + logID + defer cleanupIDs(t, ctx, logID) + defer database.Pool.Exec(ctx, `DELETE FROM ai_ops_metrics WHERE metric_name=$1`, metricName) + + now := time.Now().UTC() + if _, err := database.Pool.Exec(ctx, `INSERT INTO ai_ops_metrics(metric_name, labels, value, recorded_at) VALUES ($1, $2, $3, $4)`, metricName, map[string]string{"source": "test"}, 42.5, now); err != nil { + t.Fatal(err) + } + latest, err := metricRepo.GetLatest(ctx, "unit", metricName) + if err != nil || latest.Name != metricName || latest.Source != "unit" || latest.Value != 42.5 { + t.Fatalf("latest metric = %+v %v", latest, err) + } + points, err := metricRepo.Query(ctx, model.MetricQueryRequest{Source: "unit", Name: metricName, StartTime: now.Add(-time.Minute), EndTime: now.Add(time.Minute)}) + if err != nil || len(points) != 1 { + t.Fatalf("query metric = %d %v", len(points), err) + } + if realtime, err := metricRepo.GetRealtime(ctx); err != nil || realtime == nil { + t.Fatalf("realtime metric = %+v %v", realtime, err) + } + + if _, err := database.Pool.Exec(ctx, `INSERT INTO ai_ops_request_logs(id, timestamp, service, 
path, method, status_code, latency_ms, user_id, supplier_id, error_code) VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10)`, logID, now, "svc-test", "/unit", "GET", 200, 11.2, "u1", "s1", ""); err != nil { + t.Fatal(err) + } + status := 200 + logs, total, err := logRepo.Query(ctx, model.LogQueryFilter{Service: "svc-test", Path: "/unit", StatusCode: &status, UserID: "u1", SupplierID: "s1", Page: 1, PageSize: 10}) + if err != nil || total != 1 || len(logs) != 1 || logs[0].ID != logID { + t.Fatalf("query logs = total=%d logs=%+v err=%v", total, logs, err) + } +} + +func TestPGHealingAndNotificationRepositories(t *testing.T) { + ctx := setupPGIntegration(t) + ruleID, eventID, channelID, healingID, notificationID := testUUID(t), testUUID(t), testUUID(t), testUUID(t), testUUID(t) + defer cleanupIDs(t, ctx, ruleID, eventID, channelID, healingID, notificationID) + alertRepo := NewPGAlertRepository() + channelRepo := NewPGChannelRepository() + healingRepo := NewPGHealingRepository() + notificationRepo := NewPGNotificationLogRepository() + + if err := alertRepo.CreateRule(ctx, &model.AlertRule{ID: ruleID, Name: "notify-rule-" + ruleID, MetricSource: "prom", MetricName: "qps", ThresholdType: ">", ThresholdValue: "1", DurationMin: 1, Level: "P2", ChannelIDs: []string{}, IsSandboxed: true, Enabled: true, Version: 1, CreatedBy: "test"}); err != nil { + t.Fatal(err) + } + if _, err := alertRepo.CreateEventWithAggregation(ctx, &model.AlertEvent{ID: eventID, RuleID: ruleID, Level: "P2", ResourceType: "svc", ResourceID: "res", CurrentValue: "2", ThresholdValue: "1", Status: "triggered", StartedAt: time.Now().UTC()}, 0, 0); err != nil { + t.Fatal(err) + } + if err := channelRepo.Create(ctx, &model.NotificationChannel{ID: channelID, Name: "notify-channel", ChannelType: "webhook", Config: map[string]any{"webhook_url": "http://example.invalid"}, Priority: 1, Enabled: true}); err != nil { + t.Fatal(err) + } + if err := healingRepo.CreateHealing(ctx, &service.HealingLog{ID: healingID, AlertID: 
eventID, ActionType: "throttle", Config: map[string]any{"endpoint": "http://example.invalid"}, Status: "pending", DryRun: true, StartedAt: time.Now().UTC()}); err != nil { + t.Fatal(err) + } + if err := healingRepo.UpdateHealingStatus(ctx, healingID, "succeeded", map[string]any{"ok": true}, ""); err != nil { + t.Fatal(err) + } + if err := notificationRepo.CreateLog(ctx, &model.NotificationLog{ID: notificationID, EventID: eventID, ChannelID: channelID, ChannelType: "webhook", Status: "pending"}); err != nil { + t.Fatal(err) + } + if err := notificationRepo.MarkSent(ctx, notificationID); err != nil { + t.Fatal(err) + } + if err := notificationRepo.MarkFailed(ctx, notificationID, 1, "retry failed"); err != nil { + t.Fatal(err) + } +} diff --git a/internal/middleware/auth.go b/internal/middleware/auth.go new file mode 100644 index 0000000..398a159 --- /dev/null +++ b/internal/middleware/auth.go @@ -0,0 +1,112 @@ +package middleware + +import ( + "context" + "net/http" + "strings" + + "github.com/company/ai-ops/internal/config" + "github.com/company/ai-ops/pkg/errors" + "github.com/company/ai-ops/pkg/response" + "github.com/golang-jwt/jwt/v5" +) + +// Auth is middleware that enforces authentication. +func Auth(cfg config.ServerConfig) func(http.Handler) http.Handler { + return func(next http.Handler) http.Handler { + return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + // Allowlisted paths skip authentication. + if isPublicPath(r.URL.Path) { + next.ServeHTTP(w, r) + return + } + + // API key check (for machine-to-machine endpoints such as /metrics). + if strings.HasPrefix(r.URL.Path, "/metrics") { + apiKey := r.Header.Get("X-API-Key") + if apiKey == "" { + apiKey = r.URL.Query().Get("api_key") + } + if cfg.MetricsAuth != "" && apiKey == cfg.MetricsAuth { + next.ServeHTTP(w, r) + return + } + } + + // JWT check. + tokenStr := r.Header.Get("Authorization") + if tokenStr == "" { + response.Error(w, errors.ErrUnauthorized) + return + } + tokenStr = strings.TrimPrefix(tokenStr, "Bearer ") + + token, err := jwt.Parse(tokenStr, func(token *jwt.Token) (interface{}, error) { + return 
[]byte(cfg.JWTSecret), nil + }, jwt.WithValidMethods([]string{"HS256"})) + if err != nil || !token.Valid { + response.Error(w, errors.ErrUnauthorized) + return + } + + // Store the user ID and role in the request context. + if claims, ok := token.Claims.(jwt.MapClaims); ok { + if userID, ok := claims["user_id"].(string); ok { + r = r.WithContext(context.WithValue(r.Context(), "user_id", userID)) + } + if role, ok := claims["role"].(string); ok { + r = r.WithContext(context.WithValue(r.Context(), "role", role)) + } + } + + next.ServeHTTP(w, r) + }) + } +} + +// RequireRole is middleware that enforces role-based permissions. +func RequireRole(roles ...string) func(http.Handler) http.Handler { + roleSet := make(map[string]bool) + for _, r := range roles { + roleSet[r] = true + } + return func(next http.Handler) http.Handler { + return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + role, _ := r.Context().Value("role").(string) + if !roleSet[role] { + response.Error(w, errors.ErrForbidden.WithDetail(map[string]any{ + "error": "insufficient permissions", + "code": "OPS_AUTH_1001", + "required": roles, + "current": role, + })) + return + } + next.ServeHTTP(w, r) + }) + } +} + +// RequireWrite lets GET and HEAD requests through and requires write permission for all other methods. +func RequireWrite(next http.Handler) http.Handler { + return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + if r.Method == "GET" || r.Method == "HEAD" { + next.ServeHTTP(w, r) + return + } + role, _ := r.Context().Value("role").(string) + if role != "operator" && role != "admin" { + response.Error(w, errors.ErrForbidden.WithDetail(map[string]any{ + "error": "write permission required", + "code": "OPS_AUTH_1001", + "current": role, + })) + return + } + next.ServeHTTP(w, r) + }) +} + +func isPublicPath(path string) bool { + return path == "/health" || strings.HasPrefix(path, "/actuator/health") || path == "/api/v1/ai-ops/login" || path == "/openapi.json" || strings.HasPrefix(path, "/ops/dashboard") +} diff --git a/internal/middleware/auth_test.go b/internal/middleware/auth_test.go new file mode 100644 index 
0000000..79e3d08 --- /dev/null +++ b/internal/middleware/auth_test.go @@ -0,0 +1,100 @@ +package middleware + +import ( + "context" + "net/http" + "net/http/httptest" + "testing" + + "github.com/company/ai-ops/internal/config" + "github.com/company/ai-ops/internal/service" +) + +func TestAuthAllowsPublicPaths(t *testing.T) { + cfg := config.ServerConfig{JWTSecret: "secret", MetricsAuth: "metrics-key"} + for _, path := range []string{"/health", "/actuator/health/ready", "/api/v1/ai-ops/login", "/openapi.json", "/ops/dashboard"} { + t.Run(path, func(t *testing.T) { + called := false + h := Auth(cfg)(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { called = true })) + h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, path, nil)) + if !called { + t.Fatalf("public path %s was blocked", path) + } + }) + } +} + +func TestAuthMetricsAPIKeyAndJWT(t *testing.T) { + cfg := config.ServerConfig{JWTSecret: "secret", MetricsAuth: "metrics-key"} + t.Run("metrics api key", func(t *testing.T) { + called := false + h := Auth(cfg)(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { called = true })) + req := httptest.NewRequest(http.MethodGet, "/metrics", nil) + req.Header.Set("X-API-Key", "metrics-key") + h.ServeHTTP(httptest.NewRecorder(), req) + if !called { + t.Fatal("metrics api key did not pass") + } + }) + + t.Run("missing token rejected", func(t *testing.T) { + h := Auth(cfg)(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { t.Fatal("should not call next") })) + w := httptest.NewRecorder() + h.ServeHTTP(w, httptest.NewRequest(http.MethodGet, "/api/v1/ai-ops/rules", nil)) + if w.Code != http.StatusUnauthorized { + t.Fatalf("status = %d", w.Code) + } + }) + + t.Run("valid jwt sets context", func(t *testing.T) { + token, err := service.NewAuthService("secret").IssueToken("u1", "operator") + if err != nil { + t.Fatal(err) + } + h := Auth(cfg)(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + if 
r.Context().Value("user_id") != "u1" || r.Context().Value("role") != "operator" { + t.Fatalf("context not populated: user=%v role=%v", r.Context().Value("user_id"), r.Context().Value("role")) + } + })) + req := httptest.NewRequest(http.MethodGet, "/api/v1/ai-ops/rules", nil) + req.Header.Set("Authorization", "Bearer "+token) + h.ServeHTTP(httptest.NewRecorder(), req) + }) +} + +func TestRequireRoleAndRequireWrite(t *testing.T) { + next := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusAccepted) }) + + allowed := RequireRole("admin")(next) + req := httptest.NewRequest(http.MethodGet, "/x", nil).WithContext(context.WithValue(context.Background(), "role", "admin")) + w := httptest.NewRecorder() + allowed.ServeHTTP(w, req) + if w.Code != http.StatusAccepted { + t.Fatalf("role allowed status = %d", w.Code) + } + + denied := httptest.NewRecorder() + RequireRole("admin")(next).ServeHTTP(denied, httptest.NewRequest(http.MethodGet, "/x", nil)) + if denied.Code != http.StatusForbidden { + t.Fatalf("role denied status = %d", denied.Code) + } + + read := httptest.NewRecorder() + RequireWrite(next).ServeHTTP(read, httptest.NewRequest(http.MethodGet, "/x", nil)) + if read.Code != http.StatusAccepted { + t.Fatalf("read status = %d", read.Code) + } + + writeDenied := httptest.NewRecorder() + RequireWrite(next).ServeHTTP(writeDenied, httptest.NewRequest(http.MethodPost, "/x", nil)) + if writeDenied.Code != http.StatusForbidden { + t.Fatalf("write denied status = %d", writeDenied.Code) + } + + writeAllowed := httptest.NewRecorder() + writeReq := httptest.NewRequest(http.MethodPost, "/x", nil).WithContext(context.WithValue(context.Background(), "role", "operator")) + RequireWrite(next).ServeHTTP(writeAllowed, writeReq) + if writeAllowed.Code != http.StatusAccepted { + t.Fatalf("write allowed status = %d", writeAllowed.Code) + } +} diff --git a/internal/middleware/logging.go b/internal/middleware/logging.go new file mode 100644 index 
0000000..1b34b9c --- /dev/null +++ b/internal/middleware/logging.go @@ -0,0 +1,37 @@ +package middleware + +import ( + "log/slog" + "net/http" + "time" +) + +// responseWriter wraps http.ResponseWriter to capture the status code. +type responseWriter struct { + http.ResponseWriter + statusCode int +} + +func (rw *responseWriter) WriteHeader(code int) { + rw.statusCode = code + rw.ResponseWriter.WriteHeader(code) +} + +// Logging is middleware that logs each request. +func Logging(next http.Handler) http.Handler { + return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + start := time.Now() + rw := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK} + + next.ServeHTTP(rw, r) + + duration := time.Since(start) + slog.Info("http_request", + "method", r.Method, + "path", r.URL.Path, + "status", rw.statusCode, + "duration_ms", float64(duration.Microseconds())/1000, + "remote_addr", r.RemoteAddr, + ) + }) +} diff --git a/internal/middleware/logging_recovery_test.go b/internal/middleware/logging_recovery_test.go new file mode 100644 index 0000000..7bce3ac --- /dev/null +++ b/internal/middleware/logging_recovery_test.go @@ -0,0 +1,41 @@ +package middleware + +import ( + "net/http" + "net/http/httptest" + "testing" +) + +func TestLoggingCapturesStatusCode(t *testing.T) { + h := Logging(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.WriteHeader(http.StatusCreated) + _, _ = w.Write([]byte("created")) + })) + w := httptest.NewRecorder() + h.ServeHTTP(w, httptest.NewRequest(http.MethodPost, "/x", nil)) + if w.Code != http.StatusCreated || w.Body.String() != "created" { + t.Fatalf("response = %d %q", w.Code, w.Body.String()) + } +} + +func TestRecoveryConvertsPanicToInternalError(t *testing.T) { + h := Recovery(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + panic("boom") + })) + w := httptest.NewRecorder() + h.ServeHTTP(w, httptest.NewRequest(http.MethodGet, "/panic", nil)) + if w.Code != http.StatusInternalServerError { + t.Fatalf("status = %d body=%s", w.Code, w.Body.String()) 
+ } +} + +func TestRecoveryPassesThroughNormalRequest(t *testing.T) { + h := Recovery(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.WriteHeader(http.StatusAccepted) + })) + w := httptest.NewRecorder() + h.ServeHTTP(w, httptest.NewRequest(http.MethodGet, "/ok", nil)) + if w.Code != http.StatusAccepted { + t.Fatalf("status = %d", w.Code) + } +} diff --git a/internal/middleware/recovery.go b/internal/middleware/recovery.go new file mode 100644 index 0000000..8a545c4 --- /dev/null +++ b/internal/middleware/recovery.go @@ -0,0 +1,27 @@ +package middleware + +import ( + "log/slog" + "net/http" + "runtime/debug" + + "github.com/company/ai-ops/pkg/errors" + "github.com/company/ai-ops/pkg/response" +) + +// Recovery is middleware that recovers from panics. +func Recovery(next http.Handler) http.Handler { + return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + defer func() { + if rec := recover(); rec != nil { + slog.Error("panic_recovered", + "error", rec, + "stack", string(debug.Stack()), + "path", r.URL.Path, + ) + response.Error(w, errors.ErrInternal) + } + }() + next.ServeHTTP(w, r) + }) +} diff --git a/internal/redis/client.go b/internal/redis/client.go new file mode 100644 index 0000000..5e7e0de --- /dev/null +++ b/internal/redis/client.go @@ -0,0 +1,40 @@ +package redis + +import ( + "context" + "fmt" + "time" + + "github.com/company/ai-ops/internal/config" + "github.com/redis/go-redis/v9" +) + +// Client is the global Redis client. +var Client *redis.Client + +// Init initializes the Redis connection. +func Init(cfg config.RedisConfig) error { + Client = redis.NewClient(&redis.Options{ + Addr: fmt.Sprintf("%s:%d", cfg.Host, cfg.Port), + Password: cfg.Password, + DB: cfg.DB, + PoolSize: 10, + }) + + ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second) + defer cancel() + + if err := Client.Ping(ctx).Err(); err != nil { + return fmt.Errorf("ping redis: %w", err) + } + + return nil +} + +// Close closes the Redis connection. +func Close() error { + if Client != nil { + return Client.Close() 
+ } + return nil +} diff --git a/internal/redis/client_test.go b/internal/redis/client_test.go new file mode 100644 index 0000000..b8f47a3 --- /dev/null +++ b/internal/redis/client_test.go @@ -0,0 +1,33 @@ +package redis + +import ( + "testing" + + "github.com/company/ai-ops/internal/config" +) + +func TestInitAndCloseWithLocalRedis(t *testing.T) { + ports := []int{16379, 6379} + var lastErr error + for _, port := range ports { + lastErr = Init(config.RedisConfig{Host: "localhost", Port: port, DB: 0}) + if lastErr == nil { + break + } + _ = Close() + Client = nil + } + if lastErr != nil { + t.Skipf("Redis integration server not available: %v", lastErr) + } + if Client == nil { + t.Fatal("client not initialized") + } + if err := Close(); err != nil { + t.Fatal(err) + } + Client = nil + if err := Close(); err != nil { + t.Fatal(err) + } +} diff --git a/internal/service/alert_engine.go b/internal/service/alert_engine.go new file mode 100644 index 0000000..10902db --- /dev/null +++ b/internal/service/alert_engine.go @@ -0,0 +1,277 @@ +package service + +import ( + "context" + "crypto/rand" + "encoding/hex" + "fmt" + "log/slog" + "strconv" + "sync" + "time" + + "github.com/company/ai-ops/internal/domain/model" + "github.com/company/ai-ops/internal/domain/repository" +) + +// TriggerState records the trigger state of a rule +type TriggerState struct { + FirstTriggeredAt time.Time // time the threshold was first breached + LastTriggeredAt time.Time // time of the most recent trigger +} + +// AlertEngine is the alert rule evaluation engine +type AlertEngine struct { + alertRepo repository.AlertRepository + metricRepo repository.MetricRepository + notifySvc *NotificationService + interval time.Duration + stopCh chan struct{} + + // Per-rule trigger state (used for duration checks) + triggerStates map[string]*TriggerState + statesMu sync.RWMutex + + // Suppression window: the same rule does not re-trigger within 5 minutes + suppressWindow time.Duration + + // Escalation: a P2 alert unacknowledged for 2 hours is escalated to P1 + escalationInterval time.Duration + + // Aggregation: more than 20 events for the same resource within 1 minute produce an aggregated alert + aggregationWindow time.Duration + aggregationThreshold int +} + +// NewAlertEngine creates the rule evaluation engine +func
NewAlertEngine(ar repository.AlertRepository, mr repository.MetricRepository, ns *NotificationService) *AlertEngine { + return &AlertEngine{ + alertRepo: ar, + metricRepo: mr, + notifySvc: ns, + interval: 30 * time.Second, + stopCh: make(chan struct{}), + triggerStates: make(map[string]*TriggerState), + suppressWindow: 5 * time.Minute, + escalationInterval: 2 * time.Hour, + aggregationWindow: 1 * time.Minute, + aggregationThreshold: 20, + } +} + +// Start begins periodic evaluation +func (e *AlertEngine) Start() { + slog.Info("alert_engine_started", "interval", e.interval, "suppress_window", e.suppressWindow) + go e.loop() +} + +// Stop stops the engine +func (e *AlertEngine) Stop() { + close(e.stopCh) + slog.Info("alert_engine_stopped") +} + +func (e *AlertEngine) loop() { + ticker := time.NewTicker(e.interval) + defer ticker.Stop() + + escalationTicker := time.NewTicker(5 * time.Minute) + defer escalationTicker.Stop() + + e.evaluate(context.Background()) + + for { + select { + case <-ticker.C: + e.evaluate(context.Background()) + case <-escalationTicker.C: + e.escalate(context.Background()) + case <-e.stopCh: + return + } + } +} + +func (e *AlertEngine) evaluate(ctx context.Context) { + rules, err := e.alertRepo.ListRules(ctx) + if err != nil { + slog.Error("list_rules_failed", "error", err) + return + } + + for _, rule := range rules { + if err := e.evaluateRule(ctx, &rule); err != nil { + slog.Error("evaluate_rule_failed", "rule_id", rule.ID, "error", err) + } + } +} + +func (e *AlertEngine) evaluateRule(ctx context.Context, rule *model.AlertRule) error { + point, err := e.metricRepo.GetLatest(ctx, rule.MetricSource, rule.MetricName) + if err != nil { + return fmt.Errorf("get metric: %w", err) + } + + threshold, err := strconv.ParseFloat(rule.ThresholdValue, 64) + if err != nil { + return fmt.Errorf("parse threshold: %w", err) + } + + triggered := e.compare(point.Value, threshold, rule.ThresholdType) + now := time.Now() + + e.statesMu.Lock() + state, exists := e.triggerStates[rule.ID] + + if
!triggered { + // Metric is back to normal; clear the trigger state + if exists { + delete(e.triggerStates, rule.ID) + slog.Info("alert_cleared", "rule_id", rule.ID, "metric", rule.MetricName) + } + e.statesMu.Unlock() + return nil + } + + // Metric is breaching the threshold + if !exists { + state = &TriggerState{FirstTriggeredAt: now, LastTriggeredAt: time.Time{}} + e.triggerStates[rule.ID] = state + } + e.statesMu.Unlock() + + // Duration check: the breach must persist for N minutes before triggering + duration := time.Since(state.FirstTriggeredAt) + requiredDuration := time.Duration(rule.DurationMin) * time.Minute + if duration < requiredDuration { + slog.Debug("alert_breaching_not_yet_triggered", + "rule_id", rule.ID, + "duration", duration, + "required", requiredDuration, + ) + return nil + } + + // Suppression check: do not re-trigger within 5 minutes + if !state.LastTriggeredAt.IsZero() && now.Sub(state.LastTriggeredAt) < e.suppressWindow { + return nil + } + + // Update the most recent trigger time + e.statesMu.Lock() + state.LastTriggeredAt = now + e.statesMu.Unlock() + + // Create the alert event + event := &model.AlertEvent{ + ID: generateID(), + RuleID: rule.ID, + Level: rule.Level, + ResourceType: rule.MetricSource, + ResourceID: rule.MetricName, + CurrentValue: fmt.Sprintf("%.4f", point.Value), + ThresholdValue: rule.ThresholdValue, + Status: "triggered", + IsAggregated: false, + AggregatedCount: 1, + } + + notifyEvent, err := e.alertRepo.CreateEventWithAggregation(ctx, event, e.aggregationWindow, e.aggregationThreshold) + if err != nil { + return fmt.Errorf("create event: %w", err) + } + if notifyEvent == nil { + notifyEvent = event + } + + // Send the notification asynchronously + if e.notifySvc != nil && len(rule.ChannelIDs) > 0 { + e.notifySvc.Enqueue(notifyEvent, rule.ChannelIDs) + } + + slog.Info("alert_triggered", + "rule_id", rule.ID, + "level", rule.Level, + "metric", rule.MetricName, + "value", point.Value, + "threshold", threshold, + "duration_min", rule.DurationMin, + ) + return nil +} + +func (e *AlertEngine) escalate(ctx context.Context) { + // Query open P2 alerts + events, _, err := e.alertRepo.ListEvents(ctx, "triggered", 1, 100) + if err != nil { +
slog.Error("list_open_events_failed", "error", err) + return + } + + now := time.Now() + for _, event := range events { + if event.Level != "P2" { + continue + } + if now.Sub(event.StartedAt) < e.escalationInterval { + continue + } + + // Escalate to P1 + if err := e.alertRepo.EscalateEvent(ctx, event.ID, "P1"); err != nil { + slog.Error("escalate_event_failed", "event_id", event.ID, "error", err) + continue + } + + // Send the escalation notification + upgraded := &model.AlertEvent{ + ID: event.ID, + RuleID: event.RuleID, + Level: "P1", + ResourceType: event.ResourceType, + ResourceID: event.ResourceID, + CurrentValue: event.CurrentValue, + ThresholdValue: event.ThresholdValue, + Status: "triggered", + } + rule, err := e.alertRepo.GetRuleByID(ctx, event.RuleID) + if err == nil && e.notifySvc != nil && len(rule.ChannelIDs) > 0 { + e.notifySvc.Enqueue(upgraded, rule.ChannelIDs) + } + + slog.Info("alert_escalated", + "event_id", event.ID, + "rule_id", event.RuleID, + "from_level", "P2", + "to_level", "P1", + "duration", now.Sub(event.StartedAt), + ) + } +} + +func (e *AlertEngine) compare(value, threshold float64, op string) bool { + switch op { + case ">": + return value > threshold + case "<": + return value < threshold + case "=": + return value == threshold + case ">=": + return value >= threshold + case "<=": + return value <= threshold + default: + return false + } +} + +func generateID() string { + b := make([]byte, 16) + if _, err := rand.Read(b); err != nil { + return fmt.Sprintf("00000000-0000-4000-8000-%012d", time.Now().UnixNano()%1_000_000_000_000) + } + b[6] = (b[6] & 0x0f) | 0x40 + b[8] = (b[8] & 0x3f) | 0x80 + return fmt.Sprintf("%s-%s-%s-%s-%s", hex.EncodeToString(b[0:4]), hex.EncodeToString(b[4:6]), hex.EncodeToString(b[6:8]), hex.EncodeToString(b[8:10]), hex.EncodeToString(b[10:16])) +} diff --git a/internal/service/alert_engine_test.go b/internal/service/alert_engine_test.go new file mode 100644 index 0000000..604c2ad --- /dev/null +++ b/internal/service/alert_engine_test.go @@
-0,0 +1,191 @@ +package service + +import ( + "context" + "testing" + "time" + + "github.com/company/ai-ops/internal/domain/model" + "github.com/stretchr/testify/mock" +) + +type fakeAggregationAlertRepo struct { + rules []model.AlertRule + events []model.AlertEvent + createdEvents []*model.AlertEvent + escalated []string +} + +func (r *fakeAggregationAlertRepo) GetOpenCount(ctx context.Context) (*model.AlertCount, error) { + return &model.AlertCount{}, nil +} + +func (r *fakeAggregationAlertRepo) ListRules(ctx context.Context) ([]model.AlertRule, error) { + return r.rules, nil +} + +func (r *fakeAggregationAlertRepo) GetRuleByID(ctx context.Context, id string) (*model.AlertRule, error) { + for i := range r.rules { + if r.rules[i].ID == id { + return &r.rules[i], nil + } + } + return nil, nil +} + +func (r *fakeAggregationAlertRepo) CreateRule(ctx context.Context, rule *model.AlertRule) error { + return nil +} +func (r *fakeAggregationAlertRepo) UpdateRule(ctx context.Context, rule *model.AlertRule) error { + return nil +} +func (r *fakeAggregationAlertRepo) DeleteRule(ctx context.Context, id string) error { return nil } +func (r *fakeAggregationAlertRepo) ListEvents(ctx context.Context, status string, page, pageSize int) ([]model.AlertEvent, int, error) { + return r.events, len(r.events), nil +} +func (r *fakeAggregationAlertRepo) CreateEvent(ctx context.Context, event *model.AlertEvent) error { + r.createdEvents = append(r.createdEvents, event) + return nil +} +func (r *fakeAggregationAlertRepo) CreateEventWithAggregation(ctx context.Context, event *model.AlertEvent, window time.Duration, threshold int) (*model.AlertEvent, error) { + r.createdEvents = append(r.createdEvents, event) + if len(r.createdEvents) > threshold { + return &model.AlertEvent{ + ID: "agg-1", + RuleID: event.RuleID, + Level: event.Level, + ResourceType: event.ResourceType, + ResourceID: event.ResourceID, + CurrentValue: event.CurrentValue, + ThresholdValue: event.ThresholdValue, + Status: 
"triggered", + IsAggregated: true, + AggregatedCount: len(r.createdEvents), + }, nil + } + return event, nil +} +func (r *fakeAggregationAlertRepo) UpdateEventStatus(ctx context.Context, id, status string) error { + return nil +} +func (r *fakeAggregationAlertRepo) EscalateEvent(ctx context.Context, id, newLevel string) error { + r.escalated = append(r.escalated, id+":"+newLevel) + return nil +} + +type fakeMetricRepo struct { + point *model.MetricPoint +} + +func (r *fakeMetricRepo) GetRealtime(ctx context.Context) (*model.RealtimeMetrics, error) { + return &model.RealtimeMetrics{}, nil +} +func (r *fakeMetricRepo) Query(ctx context.Context, req model.MetricQueryRequest) ([]model.MetricPoint, error) { + return nil, nil +} +func (r *fakeMetricRepo) GetLatest(ctx context.Context, source, name string) (*model.MetricPoint, error) { + return r.point, nil +} + +func TestAlertEngineAggregatesWhenSameResourceExceedsTwentyEventsWithinWindow(t *testing.T) { + alertRepo := &fakeAggregationAlertRepo{rules: []model.AlertRule{{ + ID: "rule-1", + MetricSource: "service", + MetricName: "api-error-rate", + ThresholdType: ">", + ThresholdValue: "0.1", + DurationMin: 0, + Level: "P2", + }}} + metricRepo := &fakeMetricRepo{point: &model.MetricPoint{Value: 0.5}} + engine := NewAlertEngine(alertRepo, metricRepo, nil) + engine.suppressWindow = 0 + + var last *model.AlertEvent + for i := 0; i < 21; i++ { + if err := engine.evaluateRule(context.Background(), &alertRepo.rules[0]); err != nil { + t.Fatalf("evaluate rule: %v", err) + } + last = alertRepo.createdEvents[len(alertRepo.createdEvents)-1] + } + + if got := len(alertRepo.createdEvents); got != 21 { + t.Fatalf("created events = %d, want 21", got) + } + if last.IsAggregated { + t.Fatalf("raw child event must not be marked aggregated") + } +} + +func TestAlertEngineEvaluateAndEscalateBranches(t *testing.T) { + alertRepo := &fakeAggregationAlertRepo{rules: []model.AlertRule{{ + ID: "rule-eval", + MetricSource: "service", + MetricName: 
"latency", + ThresholdType: ">=", + ThresholdValue: "10", + DurationMin: 0, + Level: "P2", + }}} + metricRepo := &fakeMetricRepo{point: &model.MetricPoint{Value: 10}} + engine := NewAlertEngine(alertRepo, metricRepo, nil) + engine.suppressWindow = time.Hour + engine.evaluate(context.Background()) + if len(alertRepo.createdEvents) != 1 { + t.Fatalf("created events = %d", len(alertRepo.createdEvents)) + } + // suppressed second event + engine.evaluate(context.Background()) + if len(alertRepo.createdEvents) != 1 { + t.Fatalf("suppression failed, events = %d", len(alertRepo.createdEvents)) + } + + if !engine.compare(1, 1, "=") || !engine.compare(1, 2, "<") || !engine.compare(2, 1, ">") || !engine.compare(2, 2, ">=") || !engine.compare(1, 2, "<=") || engine.compare(1, 2, "regex") { + t.Fatal("compare operators not covered as expected") + } + if generateID() == "" { + t.Fatal("empty alert id") + } +} + +func TestMetricServiceSupplierAndQuery(t *testing.T) { + mockMetric := new(MockMetricRepository) + mockAlert := new(MockAlertRepository) + svc := NewMetricService(mockMetric, mockAlert) + + query := model.MetricQueryRequest{Name: "qps"} + points := []model.MetricPoint{{Name: "qps", Value: 1}} + mockMetric.On("Query", mock.Anything, query).Return(points, nil).Once() + mockMetric.On("Query", mock.Anything, model.MetricQueryRequest{Name: "supplier_health"}).Return([]model.MetricPoint{{Value: 1}, {Value: 0}}, nil).Once() + got, err := svc.QueryMetrics(context.Background(), query) + if err != nil || len(got) != 1 { + t.Fatalf("query metrics = %+v %v", got, err) + } + count, err := svc.GetSupplierCount(context.Background()) + if err != nil || count.Healthy != 1 || count.Unhealthy != 1 || count.Total != 2 { + t.Fatalf("supplier count = %+v %v", count, err) + } +} + +func TestAlertEngineStartStopCoversLoop(t *testing.T) { + engine := NewAlertEngine(&fakeAggregationAlertRepo{}, &fakeMetricRepo{point: &model.MetricPoint{Value: 0}}, nil) + engine.interval = time.Hour + 
engine.Start() + time.Sleep(5 * time.Millisecond) + engine.Stop() +} + +func TestAlertEngineEscalatesOldP2EventsOnly(t *testing.T) { + oldEvent := model.AlertEvent{ID: "old", RuleID: "rule-old", Level: "P2", ResourceType: "svc", ResourceID: "api", CurrentValue: "9", ThresholdValue: "1", Status: "triggered", StartedAt: time.Now().Add(-3 * time.Hour)} + freshEvent := model.AlertEvent{ID: "fresh", RuleID: "rule-fresh", Level: "P2", StartedAt: time.Now()} + p1Event := model.AlertEvent{ID: "p1", RuleID: "rule-p1", Level: "P1", StartedAt: time.Now().Add(-3 * time.Hour)} + repo := &fakeAggregationAlertRepo{ + events: []model.AlertEvent{oldEvent, freshEvent, p1Event}, + rules: []model.AlertRule{{ID: "rule-old", ChannelIDs: []string{"ch-1"}}}, + } + engine := NewAlertEngine(repo, &fakeMetricRepo{point: &model.MetricPoint{Value: 0}}, nil) + engine.escalate(context.Background()) + if len(repo.escalated) != 1 || repo.escalated[0] != "old:P1" { + t.Fatalf("escalated = %+v", repo.escalated) + } +} diff --git a/internal/service/audit_service.go b/internal/service/audit_service.go new file mode 100644 index 0000000..b289684 --- /dev/null +++ b/internal/service/audit_service.go @@ -0,0 +1,181 @@ +package service + +import ( + "context" + "crypto/rand" + "encoding/hex" + "fmt" + "time" + + "github.com/company/ai-ops/internal/database" +) + +// AuditService is the audit service +type AuditService struct{} + +func NewAuditService() *AuditService { + return &AuditService{} +} + +// AuditLog is an audit log record +type AuditLog struct { + ID string `json:"id"` + TenantID string `json:"tenant_id"` + ObjectType string `json:"object_type"` + ObjectID string `json:"object_id"` + Action string `json:"action"` + BeforeState map[string]any `json:"before_state,omitempty"` + AfterState map[string]any `json:"after_state,omitempty"` + RequestID string `json:"request_id"` + ResultCode string `json:"result_code"` + SourceIP string `json:"source_ip"` + ActorID string `json:"actor_id"` + RiskLevel string `json:"risk_level"` +
ParentAuditID *string `json:"parent_audit_id,omitempty"` + CreatedAt time.Time `json:"created_at"` +} + +// Record writes an audit log entry +func (s *AuditService) Record(ctx context.Context, log *AuditLog) error { + var parentID any + if log.ParentAuditID != nil { + parentID = *log.ParentAuditID + } + _, err := database.Pool.Exec(ctx, ` + INSERT INTO ai_ops_audits (id, tenant_id, object_type, object_id, action, + before_state, after_state, request_id, result_code, source_ip, actor_id, + risk_level, parent_audit_id, created_at) + VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, NOW()) + `, log.ID, log.TenantID, log.ObjectType, log.ObjectID, log.Action, + log.BeforeState, log.AfterState, log.RequestID, log.ResultCode, + log.SourceIP, log.ActorID, log.RiskLevel, parentID) + if err != nil { + return fmt.Errorf("insert audit: %w", err) + } + return nil +} + +// List queries audit logs +func (s *AuditService) List(ctx context.Context, objectType, objectID string, page, pageSize int) ([]AuditLog, int, error) { + if page < 1 { + page = 1 + } + if pageSize < 1 || pageSize > 100 { + pageSize = 20 + } + + where := "" + args := []any{} + argIdx := 1 + + if objectType != "" { + where = fmt.Sprintf("WHERE object_type = $%d", argIdx) + args = append(args, objectType) + argIdx++ + } + if objectID != "" { + if where != "" { + where += fmt.Sprintf(" AND object_id = $%d", argIdx) + } else { + where = fmt.Sprintf("WHERE object_id = $%d", argIdx) + } + args = append(args, objectID) + argIdx++ + } + + var total int + countQuery := fmt.Sprintf("SELECT COUNT(*) FROM ai_ops_audits %s", where) + if err := database.Pool.QueryRow(ctx, countQuery, args...).Scan(&total); err != nil { + return nil, 0, fmt.Errorf("count audits: %w", err) + } + + dataQuery := fmt.Sprintf(` + SELECT id, tenant_id, object_type, object_id, action, + before_state, after_state, request_id, result_code, source_ip, actor_id, + risk_level, parent_audit_id, created_at + FROM ai_ops_audits %s + ORDER BY created_at DESC + LIMIT $%d OFFSET
$%d + `, where, argIdx, argIdx+1) + queryArgs := append(args, pageSize, (page-1)*pageSize) + + rows, err := database.Pool.Query(ctx, dataQuery, queryArgs...) + if err != nil { + return nil, 0, fmt.Errorf("query audits: %w", err) + } + defer rows.Close() + + var logs []AuditLog + for rows.Next() { + var l AuditLog + var parentID *string + if err := rows.Scan( + &l.ID, &l.TenantID, &l.ObjectType, &l.ObjectID, &l.Action, + &l.BeforeState, &l.AfterState, &l.RequestID, &l.ResultCode, + &l.SourceIP, &l.ActorID, &l.RiskLevel, &parentID, &l.CreatedAt, + ); err != nil { + return nil, 0, fmt.Errorf("scan audit: %w", err) + } + l.ParentAuditID = parentID + logs = append(logs, l) + } + return logs, total, rows.Err() +} + +// Rollback rolls back a configuration change +func (s *AuditService) Rollback(ctx context.Context, auditID string) (*AuditLog, error) { + // Look up the original audit record + var original AuditLog + var parentID *string + err := database.Pool.QueryRow(ctx, ` + SELECT id, tenant_id, object_type, object_id, action, + before_state, after_state, request_id, result_code, source_ip, actor_id, + risk_level, parent_audit_id, created_at + FROM ai_ops_audits WHERE id = $1 + `, auditID).Scan( + &original.ID, &original.TenantID, &original.ObjectType, &original.ObjectID, &original.Action, + &original.BeforeState, &original.AfterState, &original.RequestID, &original.ResultCode, + &original.SourceIP, &original.ActorID, &original.RiskLevel, &parentID, &original.CreatedAt, + ) + if err != nil { + return nil, fmt.Errorf("audit record not found") + } + + // Check that the target resource exists (simplified: assumed to always exist) + if original.BeforeState == nil { + return nil, fmt.Errorf("no before_state available for rollback") + } + + // Create the rollback audit record + rollbackLog := &AuditLog{ + ID: generateAuditID(), + TenantID: original.TenantID, + ObjectType: original.ObjectType, + ObjectID: original.ObjectID, + Action: "rollback", + BeforeState: original.AfterState, + AfterState: original.BeforeState, + RequestID: original.RequestID, + ResultCode: "SUCCESS", + SourceIP: original.SourceIP,
+ ActorID: original.ActorID, + RiskLevel: "high", + ParentAuditID: &original.ID, + } + + if err := s.Record(ctx, rollbackLog); err != nil { + return nil, fmt.Errorf("record rollback audit: %w", err) + } + + return rollbackLog, nil +} + +func generateAuditID() string { + b := make([]byte, 16) + if _, err := rand.Read(b); err != nil { + return fmt.Sprintf("00000000-0000-4000-8000-%012d", time.Now().UnixNano()%1_000_000_000_000) + } + b[6] = (b[6] & 0x0f) | 0x40 + b[8] = (b[8] & 0x3f) | 0x80 + return fmt.Sprintf("%s-%s-%s-%s-%s", hex.EncodeToString(b[0:4]), hex.EncodeToString(b[4:6]), hex.EncodeToString(b[6:8]), hex.EncodeToString(b[8:10]), hex.EncodeToString(b[10:16])) +} diff --git a/internal/service/audit_service_integration_test.go b/internal/service/audit_service_integration_test.go new file mode 100644 index 0000000..1db1224 --- /dev/null +++ b/internal/service/audit_service_integration_test.go @@ -0,0 +1,114 @@ +package service + +import ( + "context" + "crypto/rand" + "encoding/hex" + "os" + "path/filepath" + "sort" + "testing" + "time" + + "github.com/company/ai-ops/internal/config" + "github.com/company/ai-ops/internal/database" +) + +func setupServicePGIntegration(t *testing.T) context.Context { + t.Helper() + ctx := context.Background() + if database.Pool == nil { + ports := []int{15432, 5432} + var lastErr error + for _, port := range ports { + lastErr = database.Init(config.DatabaseConfig{Host: "localhost", Port: port, User: "aiops", Password: "aiops123", DBName: "ai_ops", SSLMode: "disable", PoolSize: 4}) + if lastErr == nil { + break + } + database.Close() + database.Pool = nil + } + if lastErr != nil { + t.Skipf("PostgreSQL integration database not available: %v", lastErr) + } + } + files, err := filepath.Glob(filepath.Join("..", "..", "tech", "migrations", "*.up.sql")) + if err != nil { + t.Fatal(err) + } + sort.Strings(files) + if _, err := database.Pool.Exec(ctx, `SELECT pg_advisory_lock(424242001)`); err != nil { + t.Fatal(err) + } + defer 
database.Pool.Exec(ctx, `SELECT pg_advisory_unlock(424242001)`) + for _, f := range files { + b, err := os.ReadFile(f) + if err != nil { + t.Fatal(err) + } + if _, err := database.Pool.Exec(ctx, string(b)); err != nil { + t.Fatalf("apply migration %s: %v", f, err) + } + } + return ctx +} + +func serviceTestUUID(t *testing.T) string { + t.Helper() + b := make([]byte, 16) + if _, err := rand.Read(b); err != nil { + t.Fatal(err) + } + b[6] = (b[6] & 0x0f) | 0x40 + b[8] = (b[8] & 0x3f) | 0x80 + return hex.EncodeToString(b[0:4]) + "-" + hex.EncodeToString(b[4:6]) + "-" + hex.EncodeToString(b[6:8]) + "-" + hex.EncodeToString(b[8:10]) + "-" + hex.EncodeToString(b[10:16]) +} + +func cleanupAudit(t *testing.T, ctx context.Context, ids ...string) { + t.Helper() + for _, id := range ids { + _, _ = database.Pool.Exec(ctx, `DELETE FROM ai_ops_audits WHERE id=$1 OR parent_audit_id=$1 OR object_id=$1`, id) + } +} + +func TestAuditServiceRecordListRollback(t *testing.T) { + ctx := setupServicePGIntegration(t) + svc := NewAuditService() + id := serviceTestUUID(t) + defer cleanupAudit(t, ctx, id) + + log := &AuditLog{ID: id, TenantID: "tenant", ObjectType: "rule", ObjectID: id, Action: "update", BeforeState: map[string]any{"enabled": false}, AfterState: map[string]any{"enabled": true}, RequestID: "req", ResultCode: "SUCCESS", SourceIP: "127.0.0.1", ActorID: "actor", RiskLevel: "normal"} + if err := svc.Record(ctx, log); err != nil { + t.Fatal(err) + } + logs, total, err := svc.List(ctx, "rule", id, 0, 500) + if err != nil || total != 1 || len(logs) != 1 || logs[0].ID != id { + t.Fatalf("list = total=%d logs=%+v err=%v", total, logs, err) + } + rollback, err := svc.Rollback(ctx, id) + if err != nil { + t.Fatal(err) + } + if rollback.Action != "rollback" || rollback.ParentAuditID == nil || *rollback.ParentAuditID != id || rollback.RiskLevel != "high" { + t.Fatalf("rollback = %+v", rollback) + } + cleanupAudit(t, ctx, rollback.ID) +} + +func 
TestAuditServiceRollbackRejectsMissingBeforeState(t *testing.T) { + ctx := setupServicePGIntegration(t) + svc := NewAuditService() + id := serviceTestUUID(t) + defer cleanupAudit(t, ctx, id) + + log := &AuditLog{ID: id, TenantID: "tenant", ObjectType: "rule", ObjectID: id, Action: "create", AfterState: map[string]any{"enabled": true}, RequestID: "req", ResultCode: "SUCCESS", SourceIP: "127.0.0.1", ActorID: "actor", RiskLevel: "normal", CreatedAt: time.Now()} + if err := svc.Record(ctx, log); err != nil { + t.Fatal(err) + } + if _, err := svc.Rollback(ctx, id); err == nil { + t.Fatal("expected rollback error without before state") + } + if _, err := svc.Rollback(ctx, serviceTestUUID(t)); err == nil { + t.Fatal("expected missing audit error") + } +} diff --git a/internal/service/auth_service.go b/internal/service/auth_service.go new file mode 100644 index 0000000..493e218 --- /dev/null +++ b/internal/service/auth_service.go @@ -0,0 +1,55 @@ +package service + +import ( + "fmt" + "time" + + "github.com/golang-jwt/jwt/v5" +) + +// AuthService is the authentication service +type AuthService struct { + secret []byte +} + +func NewAuthService(secret string) *AuthService { + return &AuthService{secret: []byte(secret)} +} + +// Claims holds the JWT claims +type Claims struct { + UserID string `json:"user_id"` + Role string `json:"role"` + jwt.RegisteredClaims +} + +// IssueToken issues a JWT valid for 8 hours +func (s *AuthService) IssueToken(userID, role string) (string, error) { + claims := Claims{ + UserID: userID, + Role: role, + RegisteredClaims: jwt.RegisteredClaims{ + ExpiresAt: jwt.NewNumericDate(time.Now().Add(8 * time.Hour)), + IssuedAt: jwt.NewNumericDate(time.Now()), + }, + } + token := jwt.NewWithClaims(jwt.SigningMethodHS256, claims) + return token.SignedString(s.secret) +} + +// ParseToken validates and parses a token +func (s *AuthService) ParseToken(tokenStr string) (*Claims, error) { + token, err := jwt.ParseWithClaims(tokenStr, &Claims{}, func(token *jwt.Token) (any, error) { + if _, ok :=
token.Method.(*jwt.SigningMethodHMAC); !ok { + return nil, fmt.Errorf("unexpected signing method: %v", token.Header["alg"]) + } + return s.secret, nil + }) + if err != nil { + return nil, fmt.Errorf("parse token: %w", err) + } + if claims, ok := token.Claims.(*Claims); ok && token.Valid { + return claims, nil + } + return nil, fmt.Errorf("invalid token") +} diff --git a/internal/service/channel_service.go b/internal/service/channel_service.go new file mode 100644 index 0000000..e51fa30 --- /dev/null +++ b/internal/service/channel_service.go @@ -0,0 +1,45 @@ +package service + +import ( + "context" + "fmt" + + "github.com/company/ai-ops/internal/domain/model" + "github.com/company/ai-ops/internal/domain/repository" +) + +// ChannelService is the business layer for notification channels +type ChannelService struct { + repo repository.ChannelRepository +} + +func NewChannelService(repo repository.ChannelRepository) *ChannelService { + return &ChannelService{repo: repo} +} + +func (s *ChannelService) List(ctx context.Context) ([]model.NotificationChannel, error) { + return s.repo.List(ctx) +} + +func (s *ChannelService) Get(ctx context.Context, id string) (*model.NotificationChannel, error) { + return s.repo.GetByID(ctx, id) +} + +func (s *ChannelService) Create(ctx context.Context, ch *model.NotificationChannel) error { + if ch.Name == "" || ch.ChannelType == "" { + return fmt.Errorf("name and channel_type are required") + } + ch.Enabled = true + return s.repo.Create(ctx, ch) +} + +func (s *ChannelService) Update(ctx context.Context, ch *model.NotificationChannel) error { + if ch.ID == "" { + return fmt.Errorf("channel id is required") + } + return s.repo.Update(ctx, ch) +} + +func (s *ChannelService) Delete(ctx context.Context, id string) error { + return s.repo.Delete(ctx, id) +} diff --git a/internal/service/core_services_test.go b/internal/service/core_services_test.go new file mode 100644 index 0000000..1f56e6c --- /dev/null +++ b/internal/service/core_services_test.go @@ -0,0 +1,225 @@ +package
service + +import ( + "bytes" + "context" + "errors" + "strings" + "testing" + "time" + + "github.com/company/ai-ops/internal/domain/model" +) + +type fakeRuleAlertRepo struct { + rules []model.AlertRule + gotRuleID string + createdRule *model.AlertRule + updatedRule *model.AlertRule + deletedID string + err error +} + +func (r *fakeRuleAlertRepo) GetOpenCount(context.Context) (*model.AlertCount, error) { + return &model.AlertCount{}, nil +} +func (r *fakeRuleAlertRepo) ListRules(context.Context) ([]model.AlertRule, error) { + return r.rules, r.err +} +func (r *fakeRuleAlertRepo) GetRuleByID(_ context.Context, id string) (*model.AlertRule, error) { + r.gotRuleID = id + if r.err != nil { + return nil, r.err + } + return &model.AlertRule{ID: id, Name: "rule"}, nil +} +func (r *fakeRuleAlertRepo) CreateRule(_ context.Context, rule *model.AlertRule) error { + r.createdRule = rule + return r.err +} +func (r *fakeRuleAlertRepo) UpdateRule(_ context.Context, rule *model.AlertRule) error { + r.updatedRule = rule + return r.err +} +func (r *fakeRuleAlertRepo) DeleteRule(_ context.Context, id string) error { + r.deletedID = id + return r.err +} +func (r *fakeRuleAlertRepo) ListEvents(context.Context, string, int, int) ([]model.AlertEvent, int, error) { + return nil, 0, nil +} +func (r *fakeRuleAlertRepo) CreateEvent(context.Context, *model.AlertEvent) error { return nil } +func (r *fakeRuleAlertRepo) CreateEventWithAggregation(_ context.Context, e *model.AlertEvent, _ time.Duration, _ int) (*model.AlertEvent, error) { + return e, nil +} +func (r *fakeRuleAlertRepo) UpdateEventStatus(context.Context, string, string) error { return nil } +func (r *fakeRuleAlertRepo) EscalateEvent(context.Context, string, string) error { return nil } + +type fakeChannelRepository struct { + channels []model.NotificationChannel + gotID string + created *model.NotificationChannel + updated *model.NotificationChannel + deleted string + err error +} + +func (r *fakeChannelRepository) 
List(context.Context) ([]model.NotificationChannel, error) { + return r.channels, r.err +} +func (r *fakeChannelRepository) GetByID(_ context.Context, id string) (*model.NotificationChannel, error) { + r.gotID = id + if r.err != nil { + return nil, r.err + } + return &model.NotificationChannel{ID: id, Name: "webhook"}, nil +} +func (r *fakeChannelRepository) Create(_ context.Context, ch *model.NotificationChannel) error { + r.created = ch + return r.err +} +func (r *fakeChannelRepository) Update(_ context.Context, ch *model.NotificationChannel) error { + r.updated = ch + return r.err +} +func (r *fakeChannelRepository) Delete(_ context.Context, id string) error { + r.deleted = id + return r.err +} + +type fakeLogRepository struct { + logs []model.RequestLog + total int + lastFilter model.LogQueryFilter + err error +} + +func (r *fakeLogRepository) Query(_ context.Context, filter model.LogQueryFilter) ([]model.RequestLog, int, error) { + r.lastFilter = filter + return r.logs, r.total, r.err +} + +func TestAuthServiceIssuesAndParsesToken(t *testing.T) { + svc := NewAuthService("secret") + token, err := svc.IssueToken("u1", "admin") + if err != nil { + t.Fatal(err) + } + claims, err := svc.ParseToken(token) + if err != nil { + t.Fatal(err) + } + if claims.UserID != "u1" || claims.Role != "admin" { + t.Fatalf("unexpected claims: %+v", claims) + } + if _, err := NewAuthService("other").ParseToken(token); err == nil { + t.Fatal("expected invalid signature error") + } + if _, err := svc.ParseToken("not-a-jwt"); err == nil { + t.Fatal("expected malformed token error") + } +} + +func TestRuleServiceValidationAndRepositoryCalls(t *testing.T) { + repo := &fakeRuleAlertRepo{rules: []model.AlertRule{{ID: "r1"}}} + svc := NewRuleService(repo) + if rules, err := svc.ListRules(context.Background()); err != nil || len(rules) != 1 { + t.Fatalf("list = %v %v", rules, err) + } + if rule, err := svc.GetRule(context.Background(), "r1"); err != nil || rule.ID != "r1" { + t.Fatalf("get = 
%+v %v", rule, err) + } + if err := svc.CreateRule(context.Background(), &model.AlertRule{}); err == nil { + t.Fatal("expected missing id error") + } + if err := svc.CreateRule(context.Background(), &model.AlertRule{ID: "r2"}); err == nil { + t.Fatal("expected missing name/metric error") + } + rule := &model.AlertRule{ID: "r2", Name: "latency", MetricName: "p99"} + if err := svc.CreateRule(context.Background(), rule); err != nil { + t.Fatal(err) + } + if !rule.Enabled || rule.Version != 1 || repo.createdRule != rule { + t.Fatalf("create did not normalize rule: %+v", rule) + } + if err := svc.UpdateRule(context.Background(), &model.AlertRule{}); err == nil { + t.Fatal("expected missing update id error") + } + updating := &model.AlertRule{ID: "r2", Version: 2} + if err := svc.UpdateRule(context.Background(), updating); err != nil { + t.Fatal(err) + } + if updating.Version != 3 || repo.updatedRule != updating { + t.Fatalf("version not incremented: %+v", updating) + } + if err := svc.DeleteRule(context.Background(), "r2"); err != nil || repo.deletedID != "r2" { + t.Fatalf("delete failed: %v", err) + } +} + +func TestChannelServiceValidationAndRepositoryCalls(t *testing.T) { + repo := &fakeChannelRepository{channels: []model.NotificationChannel{{ID: "c1"}}} + svc := NewChannelService(repo) + if channels, err := svc.List(context.Background()); err != nil || len(channels) != 1 { + t.Fatalf("list = %v %v", channels, err) + } + if ch, err := svc.Get(context.Background(), "c1"); err != nil || ch.ID != "c1" { + t.Fatalf("get = %+v %v", ch, err) + } + if err := svc.Create(context.Background(), &model.NotificationChannel{}); err == nil { + t.Fatal("expected validation error") + } + ch := &model.NotificationChannel{Name: "hook", ChannelType: "webhook"} + if err := svc.Create(context.Background(), ch); err != nil { + t.Fatal(err) + } + if !ch.Enabled || repo.created != ch { + t.Fatalf("create did not enable channel: %+v", ch) + } + if err := svc.Update(context.Background(), 
&model.NotificationChannel{}); err == nil { + t.Fatal("expected missing id error") + } + if err := svc.Update(context.Background(), &model.NotificationChannel{ID: "c1"}); err != nil { + t.Fatal(err) + } + if err := svc.Delete(context.Background(), "c1"); err != nil || repo.deleted != "c1" { + t.Fatalf("delete failed: %v", err) + } +} + +func TestLogServiceQueryAndExportCSV(t *testing.T) { + repo := &fakeLogRepository{ + logs: []model.RequestLog{{Timestamp: time.Date(2026, 5, 12, 1, 2, 3, 0, time.UTC), Service: "api", Path: "/v1", Method: "GET", StatusCode: 200, LatencyMs: 12.34, UserID: "u", SupplierID: "s"}}, + total: 1, + } + svc := NewLogService(repo) + logs, total, err := svc.QueryLogs(context.Background(), model.LogQueryFilter{Service: "api", Page: 2, PageSize: 5}) + if err != nil || total != 1 || len(logs) != 1 { + t.Fatalf("query = %v %d %v", logs, total, err) + } + if repo.lastFilter.Service != "api" || repo.lastFilter.Page != 2 { + t.Fatalf("filter not passed: %+v", repo.lastFilter) + } + + var buf bytes.Buffer + if err := svc.ExportLogsCSV(context.Background(), model.LogQueryFilter{Page: 9, PageSize: 1}, &buf); err != nil { + t.Fatal(err) + } + out := buf.String() + if !strings.Contains(out, "时间,服务名,路径,方法,状态码") || !strings.Contains(out, "api,/v1,GET,200,12.34") { + t.Fatalf("unexpected csv: %s", out) + } + if repo.lastFilter.Page != 1 || repo.lastFilter.PageSize != 10000 { + t.Fatalf("export did not enforce bounds: %+v", repo.lastFilter) + } +} + +func TestLogServicePropagatesRepositoryErrors(t *testing.T) { + svc := NewLogService(&fakeLogRepository{err: errors.New("db down")}) + if _, _, err := svc.QueryLogs(context.Background(), model.LogQueryFilter{}); err == nil || !strings.Contains(err.Error(), "query logs") { + t.Fatalf("unexpected query err: %v", err) + } + if err := svc.ExportLogsCSV(context.Background(), model.LogQueryFilter{}, &bytes.Buffer{}); err == nil || !strings.Contains(err.Error(), "query logs for export") { + t.Fatalf("unexpected export 
err: %v", err) + } +} diff --git a/internal/service/healing_engine.go b/internal/service/healing_engine.go new file mode 100644 index 0000000..0b1b10a --- /dev/null +++ b/internal/service/healing_engine.go @@ -0,0 +1,253 @@ +package service + +import ( + "bytes" + "context" + "crypto/rand" + "encoding/hex" + "encoding/json" + "fmt" + "log/slog" + "net/http" + "time" + + "github.com/company/ai-ops/internal/domain/model" + "github.com/company/ai-ops/internal/domain/repository" +) + +// HealingEngine is the self-healing engine +type HealingEngine struct { + alertRepo repository.AlertRepository + healingRepo HealingRepository + client *http.Client + interval time.Duration + stopCh chan struct{} +} + +// HealingRepository is the storage interface for healing records +type HealingRepository interface { + CreateHealing(ctx context.Context, h *HealingLog) error + UpdateHealingStatus(ctx context.Context, id, status string, result map[string]any, errCode string) error +} + +// HealingLog is the record of one healing execution +type HealingLog struct { + ID string `json:"id"` + AlertID string `json:"alert_id"` + ActionType string `json:"action_type"` + Config map[string]any `json:"config"` + Status string `json:"status"` + DryRun bool `json:"dry_run"` + ResultDetail map[string]any `json:"result_detail,omitempty"` + ErrorCode string `json:"error_code,omitempty"` + StartedAt time.Time `json:"started_at"` + CompletedAt *time.Time `json:"completed_at,omitempty"` +} + +// NewHealingEngine creates a healing engine +func NewHealingEngine(ar repository.AlertRepository, hr HealingRepository) *HealingEngine { + return &HealingEngine{ + alertRepo: ar, + healingRepo: hr, + client: &http.Client{Timeout: 20 * time.Second}, + interval: 30 * time.Second, + stopCh: make(chan struct{}), + } +} + +// Start starts the healing engine +func (e *HealingEngine) Start() { + slog.Info("healing_engine_started", "interval", e.interval) + go e.loop() +} + +// Stop stops the healing engine; call it at most once, because it closes stopCh +func (e *HealingEngine) Stop() { + close(e.stopCh) + slog.Info("healing_engine_stopped") +} + +func (e *HealingEngine) loop() { + ticker := 
time.NewTicker(e.interval) + defer ticker.Stop() + + for { + select { + case <-ticker.C: + e.process(context.Background()) + case <-e.stopCh: + return + } + } +} + +func (e *HealingEngine) process(ctx context.Context) { + // Fetch alert events in the triggered state + events, _, err := e.alertRepo.ListEvents(ctx, "triggered", 1, 100) + if err != nil { + slog.Error("list_triggered_events_failed", "error", err) + return + } + + for _, event := range events { + if err := e.handleEvent(ctx, &event); err != nil { + slog.Error("handle_event_failed", "event_id", event.ID, "error", err) + } + } +} + +func (e *HealingEngine) handleEvent(ctx context.Context, event *model.AlertEvent) error { + // Load the rule configuration + rule, err := e.alertRepo.GetRuleByID(ctx, event.RuleID) + if err != nil { + return fmt.Errorf("get rule: %w", err) + } + + // Skip rules that have no healing action configured + if rule.HealingAction == nil || *rule.HealingAction == "" { + return nil + } + + // Create the healing record + healing := &HealingLog{ + ID: generateHealingID(), + AlertID: event.ID, + ActionType: *rule.HealingAction, + Config: rule.HealingConfig, + Status: "pending", + DryRun: rule.IsSandboxed, + StartedAt: time.Now(), + } + + if err := e.healingRepo.CreateHealing(ctx, healing); err != nil { + return fmt.Errorf("create healing log: %w", err) + } + + // Sandbox mode: record only, do not execute + if healing.DryRun { + slog.Info("healing_dry_run", + "healing_id", healing.ID, + "action", healing.ActionType, + "alert_id", event.ID, + ) + healing.Status = "succeeded" + healing.ResultDetail = map[string]any{"message": "dry run, no actual action executed"} + return e.healingRepo.UpdateHealingStatus(ctx, healing.ID, healing.Status, healing.ResultDetail, "") + } + + // Execute the healing action + result, err := e.executeAction(ctx, healing) + if err != nil { + healing.Status = "failed" + healing.ErrorCode = "HEALING_EXEC_FAILED" + slog.Error("healing_action_failed", + "healing_id", healing.ID, + "action", healing.ActionType, + "error", err, + ) + } else { + healing.Status = "succeeded" + healing.ResultDetail = result + 
slog.Info("healing_action_succeeded", + "healing_id", healing.ID, + "action", healing.ActionType, + ) + } + + return e.healingRepo.UpdateHealingStatus(ctx, healing.ID, healing.Status, healing.ResultDetail, healing.ErrorCode) +} + +func (e *HealingEngine) executeAction(ctx context.Context, healing *HealingLog) (map[string]any, error) { + switch healing.ActionType { + case "switch_route": + return e.executeSwitchRoute(ctx, healing) + case "throttle": + return e.executeThrottle(ctx, healing) + case "restart_instance": + return e.executeRestartInstance(ctx, healing) + case "invoke_script": + return e.executeInvokeScript(ctx, healing) + default: + return nil, fmt.Errorf("unsupported healing action: %s", healing.ActionType) + } +} + +func (e *HealingEngine) executeSwitchRoute(ctx context.Context, healing *HealingLog) (map[string]any, error) { + return e.callConfiguredEndpoint(ctx, healing, "switch_route") +} + +func (e *HealingEngine) executeThrottle(ctx context.Context, healing *HealingLog) (map[string]any, error) { + return e.callConfiguredEndpoint(ctx, healing, "throttle") +} + +func (e *HealingEngine) executeRestartInstance(ctx context.Context, healing *HealingLog) (map[string]any, error) { + if allowed, _ := healing.Config["allow_restart"].(bool); !allowed { + return nil, fmt.Errorf("restart_instance requires allow_restart=true") + } + return e.callConfiguredEndpoint(ctx, healing, "restart_instance") +} + +func (e *HealingEngine) executeInvokeScript(ctx context.Context, healing *HealingLog) (map[string]any, error) { + if _, ok := healing.Config["script_id"].(string); !ok { + return nil, fmt.Errorf("invoke_script requires script_id; raw script content is not allowed") + } + return e.callConfiguredEndpoint(ctx, healing, "invoke_script") +} + +func (e *HealingEngine) callConfiguredEndpoint(ctx context.Context, healing *HealingLog, action string) (map[string]any, error) { + endpoint, ok := healing.Config["endpoint"].(string) + if !ok || endpoint == "" { + return nil, 
fmt.Errorf("%s requires endpoint", action) + } + method, _ := healing.Config["method"].(string) + if method == "" { + method = http.MethodPost + } + if method != http.MethodPost && method != http.MethodPut && method != http.MethodPatch { + return nil, fmt.Errorf("%s method %s is not allowed", action, method) + } + + payload := map[string]any{ + "healing_id": healing.ID, + "alert_id": healing.AlertID, + "action_type": healing.ActionType, + "config": healing.Config, + "dry_run": healing.DryRun, + } + body, err := json.Marshal(payload) + if err != nil { + return nil, fmt.Errorf("marshal healing payload: %w", err) + } + req, err := http.NewRequestWithContext(ctx, method, endpoint, bytes.NewReader(body)) + if err != nil { + return nil, fmt.Errorf("create healing request: %w", err) + } + req.Header.Set("Content-Type", "application/json") + if token, _ := healing.Config["token"].(string); token != "" { + req.Header.Set("Authorization", "Bearer "+token) + } + + resp, err := e.client.Do(req) + if err != nil { + return nil, fmt.Errorf("call healing endpoint: %w", err) + } + defer resp.Body.Close() + if resp.StatusCode >= 400 { + return nil, fmt.Errorf("healing endpoint returned status %d", resp.StatusCode) + } + return map[string]any{ + "message": action + " executed", + "endpoint": endpoint, + "status_code": resp.StatusCode, + }, nil +} + +func generateHealingID() string { + b := make([]byte, 16) + if _, err := rand.Read(b); err != nil { + return fmt.Sprintf("00000000-0000-4000-8000-%012d", time.Now().UnixNano()%1_000_000_000_000) + } + b[6] = (b[6] & 0x0f) | 0x40 + b[8] = (b[8] & 0x3f) | 0x80 + return fmt.Sprintf("%s-%s-%s-%s-%s", hex.EncodeToString(b[0:4]), hex.EncodeToString(b[4:6]), hex.EncodeToString(b[6:8]), hex.EncodeToString(b[8:10]), hex.EncodeToString(b[10:16])) +} diff --git a/internal/service/healing_engine_test.go b/internal/service/healing_engine_test.go new file mode 100644 index 0000000..459d92c --- /dev/null +++ b/internal/service/healing_engine_test.go @@ 
-0,0 +1,128 @@ +package service + +import ( + "context" + "net/http" + "net/http/httptest" + "testing" + "time" + + "github.com/company/ai-ops/internal/domain/model" +) + +type fakeHealingRepo struct { + created []HealingLog + updated []HealingLog +} + +func (r *fakeHealingRepo) CreateHealing(ctx context.Context, h *HealingLog) error { + r.created = append(r.created, *h) + return nil +} +func (r *fakeHealingRepo) UpdateHealingStatus(ctx context.Context, id, status string, result map[string]any, errCode string) error { + r.updated = append(r.updated, HealingLog{ID: id, Status: status, ResultDetail: result, ErrorCode: errCode}) + return nil +} + +func TestHealingEngineExecutesConfiguredEndpointAndRecordsSuccess(t *testing.T) { + called := false + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + called = true + if r.Method != http.MethodPost { + t.Fatalf("method = %s, want POST", r.Method) + } + w.WriteHeader(http.StatusAccepted) + })) + defer server.Close() + + action := "switch_route" + alertRepo := &fakeAggregationAlertRepo{rules: []model.AlertRule{{ + ID: "rule-1", + HealingAction: &action, + HealingConfig: map[string]any{"endpoint": server.URL}, + IsSandboxed: false, + }}} + healingRepo := &fakeHealingRepo{} + engine := NewHealingEngine(alertRepo, healingRepo) + + err := engine.handleEvent(context.Background(), &model.AlertEvent{ID: "alert-1", RuleID: "rule-1"}) + if err != nil { + t.Fatalf("handle event: %v", err) + } + if !called { + t.Fatalf("expected healing endpoint to be called") + } + if len(healingRepo.updated) != 1 || healingRepo.updated[0].Status != "succeeded" { + t.Fatalf("updated healing logs = %#v, want one succeeded", healingRepo.updated) + } +} + +func TestHealingEngineRejectsRestartWithoutExplicitAllow(t *testing.T) { + healing := &HealingLog{ActionType: "restart_instance", Config: map[string]any{"endpoint": "http://127.0.0.1"}} + engine := NewHealingEngine(nil, nil) + _, err := 
engine.executeAction(context.Background(), healing) + if err == nil { + t.Fatalf("expected restart_instance without allow_restart to fail") + } +} + +func TestHealingEngineProcessDryRunAndActionBranches(t *testing.T) { + action := "throttle" + alertRepo := &fakeAggregationAlertRepo{rules: []model.AlertRule{{ID: "rule-heal", HealingAction: &action, HealingConfig: map[string]any{"limit": 1}, IsSandboxed: true}}} + alertRepo.createdEvents = nil + healingRepo := &fakeHealingRepo{} + engine := NewHealingEngine(alertRepo, healingRepo) + alertRepo.rules[0].ID = "rule-heal" + // fakeAggregationAlertRepo ListEvents returns nil, so cover direct handleEvent dry-run. + if err := engine.handleEvent(context.Background(), &model.AlertEvent{ID: "event-heal", RuleID: "rule-heal"}); err != nil { + t.Fatal(err) + } + if len(healingRepo.created) != 1 || len(healingRepo.updated) != 1 || healingRepo.updated[0].Status != "succeeded" { + t.Fatalf("dry-run healing logs = created=%+v updated=%+v", healingRepo.created, healingRepo.updated) + } + + if _, err := engine.executeAction(context.Background(), &HealingLog{ActionType: "unsupported", Config: map[string]any{}}); err == nil { + t.Fatal("expected unsupported action error") + } + if _, err := engine.executeInvokeScript(context.Background(), &HealingLog{ActionType: "invoke_script", Config: map[string]any{"endpoint": "http://example.invalid"}}); err == nil { + t.Fatal("expected missing script_id error") + } + if generateHealingID() == "" { + t.Fatal("empty healing id") + } +} + +func TestHealingEngineEndpointVariants(t *testing.T) { + var gotAuth string + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + gotAuth = r.Header.Get("Authorization") + w.WriteHeader(http.StatusAccepted) + })) + defer server.Close() + engine := NewHealingEngine(&fakeAggregationAlertRepo{}, &fakeHealingRepo{}) + + if _, err := engine.executeThrottle(context.Background(), &HealingLog{ID: "h", AlertID: "a", ActionType: 
"throttle", Config: map[string]any{"endpoint": server.URL, "method": http.MethodPatch, "token": "tok"}}); err != nil { + t.Fatal(err) + } + if gotAuth != "Bearer tok" { + t.Fatalf("auth header = %s", gotAuth) + } + if _, err := engine.executeRestartInstance(context.Background(), &HealingLog{ID: "h", AlertID: "a", ActionType: "restart_instance", Config: map[string]any{"endpoint": server.URL, "allow_restart": true}}); err != nil { + t.Fatal(err) + } + if _, err := engine.executeInvokeScript(context.Background(), &HealingLog{ID: "h", AlertID: "a", ActionType: "invoke_script", Config: map[string]any{"endpoint": server.URL, "script_id": "script-1"}}); err != nil { + t.Fatal(err) + } + if _, err := engine.callConfiguredEndpoint(context.Background(), &HealingLog{Config: map[string]any{"endpoint": server.URL, "method": http.MethodGet}}, "bad"); err == nil { + t.Fatal("expected disallowed method error") + } +} + +func TestHealingEngineStartStopAndProcess(t *testing.T) { + engine := NewHealingEngine(&fakeAggregationAlertRepo{}, &fakeHealingRepo{}) + engine.interval = time.Hour + engine.process(context.Background()) + engine.Start() + time.Sleep(5 * time.Millisecond) + engine.Stop() +} diff --git a/internal/service/log_service.go b/internal/service/log_service.go new file mode 100644 index 0000000..203725d --- /dev/null +++ b/internal/service/log_service.go @@ -0,0 +1,105 @@ +package service + +import ( + "context" + "encoding/csv" + "fmt" + "io" + "time" + + "github.com/company/ai-ops/internal/domain/model" + "github.com/company/ai-ops/internal/domain/repository" + "github.com/company/ai-ops/internal/redis" + goredis "github.com/redis/go-redis/v9" +) + +// LogService 是日志业务逻辑层 +type LogService struct { + logRepo repository.LogRepository +} + +func NewLogService(lr repository.LogRepository) *LogService { + return &LogService{logRepo: lr} +} + +// QueryLogs 查询日志 +func (s *LogService) QueryLogs(ctx context.Context, filter model.LogQueryFilter) ([]model.RequestLog, int, error) { 
+ // Redis cache key: built from the filter conditions + cacheKey := s.buildCacheKey(filter) + + // Try the cache first + if redis.Client != nil { + var cached []model.RequestLog + var total int + err := redis.Client.Get(ctx, cacheKey+":items").Scan(&cached) + if err == nil { + redis.Client.Get(ctx, cacheKey+":total").Scan(&total) + return cached, total, nil + } + if err != goredis.Nil { + // Cache errors do not block the request; fall through to the database + } + } + + // Bound the database query with a timeout + queryCtx, cancel := context.WithTimeout(ctx, 3*time.Second) + defer cancel() + + logs, total, err := s.logRepo.Query(queryCtx, filter) + if err != nil { + return nil, 0, fmt.Errorf("query logs: %w", err) + } + + // Write to the cache (5-minute TTL) + if redis.Client != nil { + redis.Client.Set(ctx, cacheKey+":items", logs, 5*time.Minute) + redis.Client.Set(ctx, cacheKey+":total", total, 5*time.Minute) + } + + return logs, total, nil +} + +// ExportLogsCSV exports logs as CSV +func (s *LogService) ExportLogsCSV(ctx context.Context, filter model.LogQueryFilter, w io.Writer) error { + filter.Page = 1 + filter.PageSize = 10000 // export row cap + + logs, _, err := s.logRepo.Query(ctx, filter) + if err != nil { + return fmt.Errorf("query logs for export: %w", err) + } + + csvWriter := csv.NewWriter(w) + defer csvWriter.Flush() + + // Write the header row + if err := csvWriter.Write([]string{"时间", "服务名", "路径", "方法", "状态码", "延迟(ms)", "用户ID", "供应商ID", "错误码"}); err != nil { + return fmt.Errorf("write csv header: %w", err) + } + + // Write the data rows + for _, l := range logs { + row := []string{ + l.Timestamp.Format(time.RFC3339), + l.Service, + l.Path, + l.Method, + fmt.Sprintf("%d", l.StatusCode), + fmt.Sprintf("%.2f", l.LatencyMs), + l.UserID, + l.SupplierID, + l.ErrorCode, + } + if err := csvWriter.Write(row); err != nil { + return fmt.Errorf("write csv row: %w", err) + } + } + + return nil +} + +func (s *LogService) buildCacheKey(filter model.LogQueryFilter) string { + return fmt.Sprintf("ai-ops:logs:%s:%s:%d:%s:%s:%d:%d", + filter.Service, filter.Path, filter.StatusCode, + filter.UserID, filter.SupplierID, filter.Page, filter.PageSize) +} diff --git 
a/internal/service/metric_service.go b/internal/service/metric_service.go new file mode 100644 index 0000000..a9f09c1 --- /dev/null +++ b/internal/service/metric_service.go @@ -0,0 +1,60 @@ +package service + +import ( + "context" + "fmt" + + "github.com/company/ai-ops/internal/domain/model" + "github.com/company/ai-ops/internal/domain/repository" +) + +// MetricService is the business-logic layer for metrics +type MetricService struct { + metricRepo repository.MetricRepository + alertRepo repository.AlertRepository +} + +func NewMetricService(mr repository.MetricRepository, ar repository.AlertRepository) *MetricService { + return &MetricService{metricRepo: mr, alertRepo: ar} +} + +// GetRealtimeMetrics returns the real-time metrics for the home page +func (s *MetricService) GetRealtimeMetrics(ctx context.Context) (*model.RealtimeMetrics, error) { + return s.metricRepo.GetRealtime(ctx) +} + +// GetSupplierCount returns the supplier counts, split into healthy and unhealthy +func (s *MetricService) GetSupplierCount(ctx context.Context) (*model.SupplierCount, error) { + // Query supplier health from the metric store + points, err := s.metricRepo.Query(ctx, model.MetricQueryRequest{ + Name: "supplier_health", + }) + if err != nil { + return nil, fmt.Errorf("query supplier health: %w", err) + } + + var healthy, unhealthy int + for _, p := range points { + if p.Value > 0.5 { + healthy++ + } else { + unhealthy++ + } + } + + return &model.SupplierCount{ + Total: healthy + unhealthy, + Healthy: healthy, + Unhealthy: unhealthy, + }, nil +} + +// GetOpenAlertCount returns the number of open alerts +func (s *MetricService) GetOpenAlertCount(ctx context.Context) (*model.AlertCount, error) { + return s.alertRepo.GetOpenCount(ctx) +} + +// QueryMetrics runs a drill-down metric query +func (s *MetricService) QueryMetrics(ctx context.Context, req model.MetricQueryRequest) ([]model.MetricPoint, error) { + return s.metricRepo.Query(ctx, req) +} diff --git a/internal/service/metric_service_test.go b/internal/service/metric_service_test.go new file mode 100644 index 0000000..766688a --- /dev/null +++ b/internal/service/metric_service_test.go @@ -0,0 +1,115 @@ +package service + 
+import ( + "context" + "testing" + "time" + + "github.com/company/ai-ops/internal/domain/model" + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/mock" +) + +// MockMetricRepository mocks the metric store +type MockMetricRepository struct { + mock.Mock +} + +func (m *MockMetricRepository) GetRealtime(ctx context.Context) (*model.RealtimeMetrics, error) { + args := m.Called(ctx) + return args.Get(0).(*model.RealtimeMetrics), args.Error(1) +} + +func (m *MockMetricRepository) Query(ctx context.Context, req model.MetricQueryRequest) ([]model.MetricPoint, error) { + args := m.Called(ctx, req) + return args.Get(0).([]model.MetricPoint), args.Error(1) +} + +func (m *MockMetricRepository) GetLatest(ctx context.Context, source, name string) (*model.MetricPoint, error) { + args := m.Called(ctx, source, name) + return args.Get(0).(*model.MetricPoint), args.Error(1) +} + +// MockAlertRepository mocks the alert store +type MockAlertRepository struct { + mock.Mock +} + +func (m *MockAlertRepository) GetOpenCount(ctx context.Context) (*model.AlertCount, error) { + args := m.Called(ctx) + return args.Get(0).(*model.AlertCount), args.Error(1) +} + +func (m *MockAlertRepository) ListRules(ctx context.Context) ([]model.AlertRule, error) { + args := m.Called(ctx) + return args.Get(0).([]model.AlertRule), args.Error(1) +} +func (m *MockAlertRepository) GetRuleByID(ctx context.Context, id string) (*model.AlertRule, error) { + args := m.Called(ctx, id) + return args.Get(0).(*model.AlertRule), args.Error(1) +} +func (m *MockAlertRepository) CreateRule(ctx context.Context, rule *model.AlertRule) error { + args := m.Called(ctx, rule) + return args.Error(0) +} +func (m *MockAlertRepository) UpdateRule(ctx context.Context, rule *model.AlertRule) error { + args := m.Called(ctx, rule) + return args.Error(0) +} +func (m *MockAlertRepository) DeleteRule(ctx context.Context, id string) error { + args := m.Called(ctx, id) + return args.Error(0) +} +func (m *MockAlertRepository) ListEvents(ctx 
context.Context, status string, page, pageSize int) ([]model.AlertEvent, int, error) { + args := m.Called(ctx, status, page, pageSize) + return args.Get(0).([]model.AlertEvent), args.Int(1), args.Error(2) +} +func (m *MockAlertRepository) CreateEvent(ctx context.Context, event *model.AlertEvent) error { + args := m.Called(ctx, event) + return args.Error(0) +} +func (m *MockAlertRepository) CreateEventWithAggregation(ctx context.Context, event *model.AlertEvent, window time.Duration, threshold int) (*model.AlertEvent, error) { + args := m.Called(ctx, event, window, threshold) + return args.Get(0).(*model.AlertEvent), args.Error(1) +} +func (m *MockAlertRepository) UpdateEventStatus(ctx context.Context, id, status string) error { + args := m.Called(ctx, id, status) + return args.Error(0) +} +func (m *MockAlertRepository) EscalateEvent(ctx context.Context, id, newLevel string) error { + args := m.Called(ctx, id, newLevel) + return args.Error(0) +} + +func TestMetricService_GetRealtimeMetrics(t *testing.T) { + mockMetric := new(MockMetricRepository) + mockAlert := new(MockAlertRepository) + svc := NewMetricService(mockMetric, mockAlert) + + expected := &model.RealtimeMetrics{ + QPS: 100.5, + AvgLatency: 45.2, + P99Latency: 120.8, + ErrorRate: 0.01, + } + mockMetric.On("GetRealtime", mock.Anything).Return(expected, nil) + + result, err := svc.GetRealtimeMetrics(context.Background()) + assert.NoError(t, err) + assert.Equal(t, expected, result) + mockMetric.AssertExpectations(t) +} + +func TestMetricService_GetOpenAlertCount(t *testing.T) { + mockMetric := new(MockMetricRepository) + mockAlert := new(MockAlertRepository) + svc := NewMetricService(mockMetric, mockAlert) + + expected := &model.AlertCount{Open: 5, P0: 1, P1: 2, P2: 1, P3: 1} + mockAlert.On("GetOpenCount", mock.Anything).Return(expected, nil) + + result, err := svc.GetOpenAlertCount(context.Background()) + assert.NoError(t, err) + assert.Equal(t, expected, result) + mockAlert.AssertExpectations(t) +} diff 
--git a/internal/service/notification_service.go b/internal/service/notification_service.go new file mode 100644 index 0000000..8613bf2 --- /dev/null +++ b/internal/service/notification_service.go @@ -0,0 +1,248 @@ +package service + +import ( + "bytes" + "context" + "encoding/json" + "fmt" + "log/slog" + "net/http" + "time" + + "github.com/company/ai-ops/internal/domain/model" + "github.com/company/ai-ops/internal/domain/repository" +) + +// NotificationTask is a notification task +type NotificationTask struct { + Event *model.AlertEvent + ChannelIDs []string + Priority string // P0, P1, P2, P3 +} + +// NotificationService is the notification service +type NotificationService struct { + channelRepo repository.ChannelRepository + logRepo repository.NotificationLogRepository + client *http.Client + queue chan NotificationTask + stopCh chan struct{} +} + +// NewNotificationService creates the notification service and starts its worker +func NewNotificationService(cr repository.ChannelRepository, logRepos ...repository.NotificationLogRepository) *NotificationService { + var logRepo repository.NotificationLogRepository + if len(logRepos) > 0 { + logRepo = logRepos[0] + } + ns := &NotificationService{ + channelRepo: cr, + logRepo: logRepo, + client: &http.Client{Timeout: 10 * time.Second}, + queue: make(chan NotificationTask, 1000), + stopCh: make(chan struct{}), + } + go ns.worker() + return ns +} + +// Stop stops the notification service +func (s *NotificationService) Stop() { + close(s.stopCh) +} + +// Enqueue adds a notification task to the queue; the task is dropped if the queue is full +func (s *NotificationService) Enqueue(event *model.AlertEvent, channelIDs []string) { + task := NotificationTask{ + Event: event, + ChannelIDs: channelIDs, + Priority: event.Level, + } + select { + case s.queue <- task: + slog.Info("notification_enqueued", "event_id", event.ID, "priority", event.Level) + default: + slog.Warn("notification_queue_full", "event_id", event.ID) + } +} + +func (s *NotificationService) worker() { + for { + select { + case task := <-s.queue: + s.processTask(context.Background(), task) + case <-s.stopCh: + return + } + } +} + +func (s 
*NotificationService) processTask(ctx context.Context, task NotificationTask) { + // Set the send timeout by priority + timeout := 120 * time.Second + if task.Priority == "P0" || task.Priority == "P1" { + timeout = 30 * time.Second + } + + ctx, cancel := context.WithTimeout(ctx, timeout) + defer cancel() + + channels, err := s.channelRepo.List(ctx) + if err != nil { + slog.Error("list_channels_failed", "error", err, "event_id", task.Event.ID) + return + } + + // Keep only the requested channels, ordered by priority + ordered := s.filterAndOrderChannels(channels, task.ChannelIDs) + + // Send the notification, falling back to the next channel on failure + sent := false + for _, ch := range ordered { + logID := s.createSendLog(ctx, task.Event, ch) + if err := s.sendToChannel(ctx, task.Event, ch); err != nil { + s.markSendFailed(ctx, logID, 1, err) + slog.Error("notify_channel_failed", + "event_id", task.Event.ID, + "channel_id", ch.ID, + "channel_type", ch.ChannelType, + "error", err, + ) + continue + } + s.markSendSent(ctx, logID) + sent = true + slog.Info("notify_sent", + "event_id", task.Event.ID, + "channel_id", ch.ID, + "channel_type", ch.ChannelType, + ) + break + } + + if !sent { + slog.Error("notify_all_channels_failed", "event_id", task.Event.ID) + } +} + +func (s *NotificationService) createSendLog(ctx context.Context, event *model.AlertEvent, ch *model.NotificationChannel) string { + if s.logRepo == nil { + return "" + } + log := &model.NotificationLog{ + EventID: event.ID, + ChannelID: ch.ID, + ChannelType: ch.ChannelType, + Status: "pending", + } + if err := s.logRepo.CreateLog(ctx, log); err != nil { + slog.Error("create_notification_log_failed", "event_id", event.ID, "channel_id", ch.ID, "error", err) + return "" + } + return log.ID +} + +func (s *NotificationService) markSendSent(ctx context.Context, logID string) { + if s.logRepo == nil || logID == "" { + return + } + if err := s.logRepo.MarkSent(ctx, logID); err != nil { + slog.Error("mark_notification_sent_failed", "log_id", logID, "error", err) + } +} + +func (s *NotificationService) markSendFailed(ctx 
context.Context, logID string, retryCount int, err error) { + if s.logRepo == nil || logID == "" { + return + } + if markErr := s.logRepo.MarkFailed(ctx, logID, retryCount, err.Error()); markErr != nil { + slog.Error("mark_notification_failed_failed", "log_id", logID, "error", markErr) + } +} + +func (s *NotificationService) filterAndOrderChannels(all []model.NotificationChannel, ids []string) []*model.NotificationChannel { + idSet := make(map[string]bool) + for _, id := range ids { + idSet[id] = true + } + + var filtered []*model.NotificationChannel + for i := range all { + if idSet[all[i].ID] { + filtered = append(filtered, &all[i]) + } + } + + // Sort by priority (highest first) + for i := 0; i < len(filtered)-1; i++ { + for j := i + 1; j < len(filtered); j++ { + if filtered[j].Priority > filtered[i].Priority { + filtered[i], filtered[j] = filtered[j], filtered[i] + } + } + } + return filtered +} + +func (s *NotificationService) sendToChannel(ctx context.Context, event *model.AlertEvent, ch *model.NotificationChannel) error { + switch ch.ChannelType { + case "webhook": + return s.sendWebhook(ctx, event, ch) + case "email": + return s.sendEmail(ctx, event, ch) + case "feishu": + return s.sendFeishu(ctx, event, ch) + case "wechat": + return s.sendWechat(ctx, event, ch) + default: + return fmt.Errorf("unsupported channel type: %s", ch.ChannelType) + } +} + +func (s *NotificationService) sendWebhook(ctx context.Context, event *model.AlertEvent, ch *model.NotificationChannel) error { + url, ok := ch.Config["webhook_url"].(string) + if !ok || url == "" { + return fmt.Errorf("webhook_url not configured") + } + + payload := map[string]any{ + "alert_id": event.ID, + "rule_id": event.RuleID, + "level": event.Level, + "status": event.Status, + "resource": event.ResourceID, + "value": event.CurrentValue, + "threshold": event.ThresholdValue, + "timestamp": time.Now().Format(time.RFC3339), + } + body, _ := json.Marshal(payload) + + req, err := http.NewRequestWithContext(ctx, "POST", url, 
bytes.NewReader(body)) + if err != nil { + return err + } + req.Header.Set("Content-Type", "application/json") + + resp, err := s.client.Do(req) + if err != nil { + return fmt.Errorf("webhook request failed: %w", err) + } + defer resp.Body.Close() + + if resp.StatusCode >= 400 { + return fmt.Errorf("webhook returned status %d", resp.StatusCode) + } + return nil +} + +func (s *NotificationService) sendEmail(ctx context.Context, event *model.AlertEvent, ch *model.NotificationChannel) error { + return fmt.Errorf("email channel not yet implemented") +} + +func (s *NotificationService) sendFeishu(ctx context.Context, event *model.AlertEvent, ch *model.NotificationChannel) error { + return fmt.Errorf("feishu channel not yet implemented") +} + +func (s *NotificationService) sendWechat(ctx context.Context, event *model.AlertEvent, ch *model.NotificationChannel) error { + return fmt.Errorf("wechat channel not yet implemented") +} diff --git a/internal/service/notification_service_test.go b/internal/service/notification_service_test.go new file mode 100644 index 0000000..647631b --- /dev/null +++ b/internal/service/notification_service_test.go @@ -0,0 +1,139 @@ +package service + +import ( + "context" + "net/http" + "net/http/httptest" + "testing" + + "github.com/company/ai-ops/internal/domain/model" +) + +type fakeChannelRepo struct { + channels []model.NotificationChannel +} + +func (r *fakeChannelRepo) List(ctx context.Context) ([]model.NotificationChannel, error) { + return r.channels, nil +} +func (r *fakeChannelRepo) GetByID(ctx context.Context, id string) (*model.NotificationChannel, error) { + return nil, nil +} +func (r *fakeChannelRepo) Create(ctx context.Context, ch *model.NotificationChannel) error { + return nil +} +func (r *fakeChannelRepo) Update(ctx context.Context, ch *model.NotificationChannel) error { + return nil +} +func (r *fakeChannelRepo) Delete(ctx context.Context, id string) error { return nil } + +type fakeNotificationLogRepo struct { + created 
[]model.NotificationLog + sent []string + failed []string +} + +func (r *fakeNotificationLogRepo) CreateLog(ctx context.Context, log *model.NotificationLog) error { + if log.ID == "" { + log.ID = "log-1" + } + r.created = append(r.created, *log) + return nil +} +func (r *fakeNotificationLogRepo) MarkSent(ctx context.Context, id string) error { + r.sent = append(r.sent, id) + return nil +} +func (r *fakeNotificationLogRepo) MarkFailed(ctx context.Context, id string, retryCount int, errMessage string) error { + r.failed = append(r.failed, id) + return nil +} + +func TestNotificationServiceWritesLogWhenWebhookSent(t *testing.T) { + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.WriteHeader(http.StatusOK) + })) + defer server.Close() + + channelRepo := &fakeChannelRepo{channels: []model.NotificationChannel{{ + ID: "11111111-1111-4111-8111-111111111111", + Name: "webhook", + ChannelType: "webhook", + Config: map[string]any{"webhook_url": server.URL}, + Priority: 10, + Enabled: true, + }}} + logRepo := &fakeNotificationLogRepo{} + svc := NewNotificationService(channelRepo, logRepo) + defer svc.Stop() + + svc.processTask(context.Background(), NotificationTask{ + Event: &model.AlertEvent{ + ID: "22222222-2222-4222-8222-222222222222", + RuleID: "33333333-3333-4333-8333-333333333333", + Level: "P1", + Status: "triggered", + ResourceID: "svc-a", + }, + ChannelIDs: []string{"11111111-1111-4111-8111-111111111111"}, + Priority: "P1", + }) + + if len(logRepo.created) != 1 { + t.Fatalf("created logs = %d, want 1", len(logRepo.created)) + } + if len(logRepo.sent) != 1 || logRepo.sent[0] != "log-1" { + t.Fatalf("sent logs = %#v, want [log-1]", logRepo.sent) + } + if len(logRepo.failed) != 0 { + t.Fatalf("failed logs = %#v, want empty", logRepo.failed) + } +} + +func TestNotificationServiceFailureAndFallbackBranches(t *testing.T) { + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + 
w.WriteHeader(http.StatusBadGateway) + })) + defer server.Close() + + channels := []model.NotificationChannel{ + {ID: "c1", ChannelType: "webhook", Config: map[string]any{"webhook_url": server.URL}, Priority: 1, Enabled: true}, + {ID: "c2", ChannelType: "email", Priority: 2, Enabled: true}, + {ID: "c3", ChannelType: "feishu", Priority: 3, Enabled: true}, + {ID: "c4", ChannelType: "wechat", Priority: 4, Enabled: true}, + {ID: "c5", ChannelType: "sms", Priority: 5, Enabled: true}, + {ID: "disabled", ChannelType: "webhook", Priority: 99, Enabled: false}, + } + logs := &fakeNotificationLogRepo{} + svc := NewNotificationService(&fakeChannelRepo{channels: channels}, logs) + defer svc.Stop() + event := &model.AlertEvent{ID: "event-1", RuleID: "rule-1", Level: "P1", ResourceType: "svc", ResourceID: "api", CurrentValue: "10", ThresholdValue: "5"} + + ordered := svc.filterAndOrderChannels(channels, []string{"c1", "c2", "missing", "disabled"}) + if len(ordered) != 3 || ordered[0].ID != "disabled" || ordered[1].ID != "c2" || ordered[2].ID != "c1" { + t.Fatalf("unexpected ordered channels: %+v", ordered) + } + svc.processTask(context.Background(), NotificationTask{Event: event, ChannelIDs: []string{"c1", "c2"}}) + if len(logs.failed) < 2 || len(logs.sent) != 0 { + t.Fatalf("expected multiple failures and no success: sent=%+v failed=%+v", logs.sent, logs.failed) + } + if err := svc.sendToChannel(context.Background(), event, &model.NotificationChannel{ChannelType: "unknown"}); err == nil { + t.Fatal("expected unsupported channel error") + } + if err := svc.sendWebhook(context.Background(), event, &model.NotificationChannel{Config: map[string]any{}}); err == nil { + t.Fatal("expected missing webhook url error") + } + svc.Enqueue(event, []string{"c2"}) +} + +func TestNotificationServiceExplicitUnsupportedPlaceholders(t *testing.T) { + svc := NewNotificationService(&fakeChannelRepo{}) + defer svc.Stop() + event := &model.AlertEvent{ID: "event-placeholders", RuleID: "rule", Level: 
"P2"} + for _, channelType := range []string{"email", "feishu", "wechat"} { + err := svc.sendToChannel(context.Background(), event, &model.NotificationChannel{ChannelType: channelType}) + if err == nil { + t.Fatalf("expected %s placeholder error", channelType) + } + } +} diff --git a/internal/service/rule_service.go b/internal/service/rule_service.go new file mode 100644 index 0000000..dc03708 --- /dev/null +++ b/internal/service/rule_service.go @@ -0,0 +1,50 @@ +package service + +import ( + "context" + "fmt" + + "github.com/company/ai-ops/internal/domain/model" + "github.com/company/ai-ops/internal/domain/repository" +) + +// RuleService 是告警规则业务层 +type RuleService struct { + repo repository.AlertRepository +} + +func NewRuleService(repo repository.AlertRepository) *RuleService { + return &RuleService{repo: repo} +} + +func (s *RuleService) ListRules(ctx context.Context) ([]model.AlertRule, error) { + return s.repo.ListRules(ctx) +} + +func (s *RuleService) GetRule(ctx context.Context, id string) (*model.AlertRule, error) { + return s.repo.GetRuleByID(ctx, id) +} + +func (s *RuleService) CreateRule(ctx context.Context, rule *model.AlertRule) error { + if rule.ID == "" { + return fmt.Errorf("rule id is required") + } + if rule.Name == "" || rule.MetricName == "" { + return fmt.Errorf("name and metric_name are required") + } + rule.Enabled = true + rule.Version = 1 + return s.repo.CreateRule(ctx, rule) +} + +func (s *RuleService) UpdateRule(ctx context.Context, rule *model.AlertRule) error { + if rule.ID == "" { + return fmt.Errorf("rule id is required") + } + rule.Version++ + return s.repo.UpdateRule(ctx, rule) +} + +func (s *RuleService) DeleteRule(ctx context.Context, id string) error { + return s.repo.DeleteRule(ctx, id) +} diff --git a/migrations/V20250512_001__create_request_logs.sql b/migrations/V20250512_001__create_request_logs.sql new file mode 100644 index 0000000..768f0ca --- /dev/null +++ b/migrations/V20250512_001__create_request_logs.sql @@ -0,0 +1,25 
@@ +-- Phase 1: 补充请求日志表，支持日志查询功能 +CREATE TABLE IF NOT EXISTS ai_ops_request_logs ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(), + service VARCHAR(64) NOT NULL, + path VARCHAR(256) NOT NULL, + method VARCHAR(8) NOT NULL, + status_code INT NOT NULL, + latency_ms DECIMAL(10,3) NOT NULL, + user_id VARCHAR(64), + supplier_id VARCHAR(64), + error_code VARCHAR(64), + created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() +); + +-- 性能优化索引 +CREATE INDEX IF NOT EXISTS idx_request_logs_timestamp ON ai_ops_request_logs (timestamp DESC); +CREATE INDEX IF NOT EXISTS idx_request_logs_service ON ai_ops_request_logs (service); +CREATE INDEX IF NOT EXISTS idx_request_logs_path ON ai_ops_request_logs (path); +CREATE INDEX IF NOT EXISTS idx_request_logs_status_code ON ai_ops_request_logs (status_code); +CREATE INDEX IF NOT EXISTS idx_request_logs_user_id ON ai_ops_request_logs (user_id); +CREATE INDEX IF NOT EXISTS idx_request_logs_supplier_id ON ai_ops_request_logs (supplier_id); + +-- 复合索引：常见查询模式 +CREATE INDEX IF NOT EXISTS idx_request_logs_time_service ON ai_ops_request_logs (timestamp DESC, service); diff --git a/pkg/errors/errors.go b/pkg/errors/errors.go new file mode 100644 index 0000000..f8ca03d --- /dev/null +++ b/pkg/errors/errors.go @@ -0,0 +1,52 @@ +package errors + +import "fmt" + +// AppError 是应用错误结构 +type AppError struct { + Code string // OPS_{CATEGORY}_{CODE} + HTTPStatus int + Message string + Detail map[string]any +} + +func (e *AppError) Error() string { + return fmt.Sprintf("[%s] %s", e.Code, e.Message) +} + +// 预定义错误 +var ( + ErrBadRequest = &AppError{Code: "OPS_GEN_4001", HTTPStatus: 400, Message: "请求参数错误"} + ErrUnauthorized = &AppError{Code: "OPS_GEN_4002", HTTPStatus: 401, Message: "未授权"} + ErrForbidden = &AppError{Code: "OPS_GEN_4003", HTTPStatus: 403, Message: "权限不足"} + ErrNotFound = &AppError{Code: "OPS_GEN_4004", HTTPStatus: 404, Message: "资源不存在"} + ErrConflict = &AppError{Code: "OPS_GEN_4005", HTTPStatus: 409, 
Message: "资源冲突"} + ErrPayloadTooLarge = &AppError{Code: "OPS_GEN_4006", HTTPStatus: 413, Message: "请求体过大"} + ErrInternal = &AppError{Code: "OPS_GEN_5001", HTTPStatus: 500, Message: "内部服务错误"} + + ErrInvalidMetricName = &AppError{Code: "OPS_MET_4001", HTTPStatus: 400, Message: "指标名称无效"} + ErrInvalidTimeRange = &AppError{Code: "OPS_MET_4002", HTTPStatus: 400, Message: "时间范围不合法"} +) + +// WithDetail 为错误添加详细信息 +func (e *AppError) WithDetail(detail map[string]any) *AppError { + return &AppError{ + Code: e.Code, + HTTPStatus: e.HTTPStatus, + Message: e.Message, + Detail: detail, + } +} + +// Wrap 包裹原始错误 +func Wrap(err error, appErr *AppError) *AppError { + if err == nil { + return nil + } + return &AppError{ + Code: appErr.Code, + HTTPStatus: appErr.HTTPStatus, + Message: fmt.Sprintf("%s: %v", appErr.Message, err), + Detail: appErr.Detail, + } +} diff --git a/pkg/errors/errors_test.go b/pkg/errors/errors_test.go new file mode 100644 index 0000000..7a56e02 --- /dev/null +++ b/pkg/errors/errors_test.go @@ -0,0 +1,45 @@ +package errors + +import ( + "errors" + "testing" +) + +func TestAppErrorErrorIncludesCodeAndMessage(t *testing.T) { + err := &AppError{Code: "OPS_TEST", Message: "failed"} + if got := err.Error(); got != "[OPS_TEST] failed" { + t.Fatalf("unexpected error string: %s", got) + } +} + +func TestWithDetailReturnsCopyWithoutMutatingBase(t *testing.T) { + detail := map[string]any{"field": "name"} + err := ErrBadRequest.WithDetail(detail) + + if err == ErrBadRequest { + t.Fatal("expected a copy, got original pointer") + } + if err.Code != ErrBadRequest.Code || err.HTTPStatus != ErrBadRequest.HTTPStatus || err.Message != ErrBadRequest.Message { + t.Fatalf("metadata not preserved: %+v", err) + } + if err.Detail["field"] != "name" { + t.Fatalf("detail not attached: %+v", err.Detail) + } + if ErrBadRequest.Detail != nil { + t.Fatalf("base error was mutated: %+v", ErrBadRequest.Detail) + } +} + +func TestWrap(t *testing.T) { + if Wrap(nil, ErrInternal) != nil { + 
t.Fatal("nil input should return nil") + } + + wrapped := Wrap(errors.New("boom"), ErrInternal) + if wrapped.Code != ErrInternal.Code || wrapped.HTTPStatus != ErrInternal.HTTPStatus { + t.Fatalf("metadata not preserved: %+v", wrapped) + } + if wrapped.Message != "内部服务错误: boom" { + t.Fatalf("unexpected message: %s", wrapped.Message) + } +} diff --git a/pkg/response/response.go b/pkg/response/response.go new file mode 100644 index 0000000..c6ccf29 --- /dev/null +++ b/pkg/response/response.go @@ -0,0 +1,59 @@ +package response + +import ( + "encoding/json" + "net/http" + + "github.com/company/ai-ops/pkg/errors" +) + +// Response 是统一响应结构 +type Response struct { + Code string `json:"code,omitempty"` + Message string `json:"message,omitempty"` + Data any `json:"data,omitempty"` +} + +// JSON 返回 JSON 响应 +func JSON(w http.ResponseWriter, status int, data any) { + w.Header().Set("Content-Type", "application/json") + w.WriteHeader(status) + json.NewEncoder(w).Encode(data) +} + +// Success 返回成功响应 +func Success(w http.ResponseWriter, data any) { + JSON(w, http.StatusOK, Response{Data: data}) +} + +// Error 返回错误响应 +func Error(w http.ResponseWriter, err *errors.AppError) { + JSON(w, err.HTTPStatus, Response{ + Code: err.Code, + Message: err.Message, + }) +} + +// Paginated 是分页响应结构 +type Paginated struct { + Items any `json:"items"` + Total int `json:"total"` + Page int `json:"page"` + PageSize int `json:"page_size"` + TotalPages int `json:"total_pages"` +} + +// PaginatedResponse 返回分页响应 +func PaginatedResponse(w http.ResponseWriter, items any, total, page, pageSize int) { + totalPages := total / pageSize + if total%pageSize > 0 { + totalPages++ + } + Success(w, Paginated{ + Items: items, + Total: total, + Page: page, + PageSize: pageSize, + TotalPages: totalPages, + }) +} diff --git a/pkg/response/response_test.go b/pkg/response/response_test.go new file mode 100644 index 0000000..7cd9962 --- /dev/null +++ b/pkg/response/response_test.go @@ -0,0 +1,66 @@ +package response + 
+import ( + "encoding/json" + "net/http" + "net/http/httptest" + "testing" + + errorspkg "github.com/company/ai-ops/pkg/errors" +) + +func TestJSONWritesStatusContentTypeAndBody(t *testing.T) { + w := httptest.NewRecorder() + + JSON(w, http.StatusCreated, map[string]string{"ok": "true"}) + + if w.Code != http.StatusCreated { + t.Fatalf("status = %d", w.Code) + } + if got := w.Header().Get("Content-Type"); got != "application/json" { + t.Fatalf("content-type = %s", got) + } + if got := w.Body.String(); got != "{\"ok\":\"true\"}\n" { + t.Fatalf("body = %s", got) + } +} + +func TestSuccessAndErrorResponses(t *testing.T) { + t.Run("success", func(t *testing.T) { + w := httptest.NewRecorder() + Success(w, map[string]any{"id": "1"}) + var out Response + if err := json.Unmarshal(w.Body.Bytes(), &out); err != nil { + t.Fatal(err) + } + data := out.Data.(map[string]any) + if data["id"] != "1" { + t.Fatalf("unexpected data: %+v", out.Data) + } + }) + + t.Run("error", func(t *testing.T) { + w := httptest.NewRecorder() + Error(w, errorspkg.ErrForbidden) + if w.Code != http.StatusForbidden { + t.Fatalf("status = %d", w.Code) + } + if got := w.Body.String(); got == "" || !json.Valid(w.Body.Bytes()) { + t.Fatalf("invalid json body: %q", got) + } + }) +} + +func TestPaginatedResponseComputesTotalPages(t *testing.T) { + w := httptest.NewRecorder() + PaginatedResponse(w, []string{"a", "b"}, 21, 2, 10) + + var out Response + if err := json.Unmarshal(w.Body.Bytes(), &out); err != nil { + t.Fatal(err) + } + payload := out.Data.(map[string]any) + if payload["total_pages"].(float64) != 3 { + t.Fatalf("total_pages = %+v", payload["total_pages"]) + } +} diff --git a/prd/PM审核报告.md b/prd/PM审核报告.md new file mode 100644 index 0000000..7eea17c --- /dev/null +++ b/prd/PM审核报告.md @@ -0,0 +1,93 @@ +## PM 审核报告 + +### 总体评级：B + +**评级说明**：PRD 整体结构完整，用户旅程覆盖较全，AC 基本满足 SMART 原则，商业化闭环有框架但缺乏财务量化。存在 1 处 P0 级范围不一致、1 处 P0 级错误码冲突，以及工期估算严重偏乐观的问题。建议在进入 TechLead 评审前修复 P0/P1 问题。 + +--- + +### 优点 + +1. 
**用户旅程覆盖完整**：主流程（监控看板、配置审计回滚、告警处置）+ 异常流程（自愈失败、告警飙升、回滚失败）+ 边缘流程（无人处理告警、数据源丢失、误触发变更）全部覆盖，并配有 F-1~F-8 的独立失败路径表。 +2. **AC 量化程度高**：12 条 AC 中，90% 以上包含明确的数值约束（如 <2s、<3s、30s、90 天、50 条规则、100 条/页），可直接转化为 QA 测试用例。 +3. **In/Out of Scope 边界清晰**：明确排除了下游大模型监控、基础设施层监控、AI 自动扩缩容、外部监控系统整合，避免范围蔓延。 +4. **技术约束明确**：统一 Go 1.22+ / 标准库 net/http / pgx / go-redis，禁止引入 Gin/Echo；支持独立运行与集成运行双模式；数据库表名强制 `ai_ops_` 前缀，减少集成冲突。 +5. **发布策略安全**：分 4 阶段上线，自愈规则强制"沙盒模式"验证（>=10 次模拟），并设计了一键关闭自愈的权限开关，风险控制意识强。 +6. **竞品分析有价值**：对 LiteLLM、Sub2API、NewAPI、FreeRide 的对标分析到位，差距分析直接转化为产品机会点。 + +--- + +### 发现问题（按严重度分类） + +#### P0 — 阻塞性问题（必须修复后才能进入开发） + +| 编号 | 问题 | 影响 | 位置 | +|------|------|------|------| +| P0-1 | **范围冲突：供应商智能切换未在 PRD 正文 In Scope 中明确纳入，但功能清单将其作为 Phase 3 核心模块（3.4，含 16+ 个任务）** | 该功能涉及自动路由变更、供应商探测、Fallback 链管理，本质上属于"自动化配置变更/扩容决策"，与 Out of Scope 第 3 条"不做自动扩容决策"存在擦边风险；且 In Scope 第 1 条明确"不含下游大模型服务"，而供应商切换直接依赖下游供应商接口。若不加明确界定，开发阶段极易产生范围争议。 | PRD §3 In Scope / 功能清单模块 3.4 | +| P0-2 | **错误码不一致**：PRD 场景 F 与 AC-8 规定回滚失败错误码为 `OPS_AUD_4101`，但功能清单 3.3.2 使用 `AUDIT_ROLLBACK_TARGET_LOST` | 接口契约冲突，前后端/QA 无法对齐，将导致集成测试失败。 | PRD §4 场景 F、§5 AC-8 / 功能清单 3.3.2 | +| P0-3 | **工期估算严重偏离实际**：功能清单列出 138 个任务，总估算仅 18 人天（平均每个任务不足 1 小时），未包含联调、集成测试、Bug 修复、文档、评审、返工时间 | 该估算极可能导致项目延期、资源不足、质量下降。按行业常规，Go 后端 + 前端 + 测试的完整交付，138 个任务至少需要 6~8 周（30~40 人天）以上。 | 功能清单 "任务估算汇总" | +| P0-4 | **自愈动作"重启实例"在功能清单中遗漏具体任务**：PRD AC-6 明确列出"重启实例"为可选自愈动作之一，但功能清单 3.1.2 的自愈执行后端任务仅覆盖"切换备用路由、限流、触发脚本"，未提及"重启实例"的实现任务 | 功能遗漏，QA 无法验收该自愈动作。 | PRD §5 AC-6 / 功能清单 3.1.2 | + +#### P1 — 重要问题（强烈建议修复） + +| 编号 | 问题 | 影响 | 位置 | +|------|------|------|------| +| P1-1 | **双重失败判定线**：PRD §2 定义"开发期间任何一周内告警噪声率 >20% 或自愈规则误触发导致生产事故，即判定失败"；§8.3 又定义"上线 30 天内 MTTR 未下降至 <20min / 自动化覆盖率 <30% / 噪声率 >15% / 自愈误触发 1 次"进入救援模式 | 两条判定线的时间边界（开发期 vs 上线后）、指标阈值（20% vs 15%）、触发条件不统一，团队无法判断应以哪条为准。 | PRD §2 成功定义、§8.3 失败判定线 | +| P1-2 | **In Scope 使用"包括但不限于"**：第 1 条"包含但不限于：gateway/..." 
| "包括但不限于"是范围管理中的高风险词汇，为后续需求蔓延留下口子。应改为封闭列表，或明确"仅包含以下模块"。 | PRD §3 In Scope #1 | +| P1-3 | **通知渠道定义不一致**：PRD AC-4 要求"Webhook、邮件、飞书/企业微信至少 2 种"，但功能清单 2.3.2 和 3.4.3 出现了"钉钉"，且备份切换链为 Webhook→邮件→飞书→企微 | 若最终未实现钉钉，功能清单需同步删除；若实现，PRD AC-4 需更新。 | PRD §5 AC-4 / 功能清单 2.3.2、3.4.3 | +| P1-4 | **AC-7 "审计日志必须不可篡改"缺乏技术实现定义**：未说明是通过 WORM 存储、哈希链、数字签名还是仅通过数据库层禁止 UPDATE/DELETE 来实现 | QA 无法验证"不可篡改"，不同实现方式的成本和合规等级差异巨大。 | PRD §5 AC-7 | +| P1-5 | **AC-8 "操作前值有效"定义模糊**：未明确"有效"的判定标准（非空？JSON 可解析？符合当前 Schema？） | 可能导致回滚接口在边界情况下行为不一致。 | PRD §5 AC-8 | +| P1-6 | **级联故障回退（F-6）未在 AC 中体现**：PRD §6 F-6 描述自愈级联故障时"自动恢复上一步操作前的状态"，但 AC-6 仅提到"若未解除则升级为人工告警"，未要求验证级联回退能力 | 功能清单 3.1.3 有任务，但 AC 缺失，QA 无法据此验收。 | PRD §6 F-6 / §5 AC-6 | +| P1-7 | **容量预测算法缺乏可测试标准**：AC-9 要求"按当前增长率预测触达资源上限时间"，但注明"仅供参考，不自动执行扩容"，且未定义预测准确率、置信区间或最大可接受偏差 | "仅供参考"导致该功能无法被 QA 有效验收，开发完成后可能沦为无法量化的"演示功能"。 | PRD §5 AC-9 | +| P1-8 | **缺少 UI/UX 最低兼容性要求**：PRD 和功能清单均未规定浏览器支持范围、移动端适配策略、最低分辨率 | 前端工程师缺乏约束，可能在交付后因兼容性问题返工。 | PRD 全文 | +| P1-9 | **角色权限矩阵过粗**：AC-12 仅定义 3 个角色的一句话权限，缺少页面级/API 级权限对照表（如"运维人员能否导出审计日志？""查看者能否导出 CSV？"） | 功能清单 G1 中"管理员（可管理用户）"超出了 PRD 定义（PRD 未提用户管理），进一步加剧不一致。 | PRD §5 AC-12 / 功能清单 G1 | + +#### P2 — 改进建议（建议纳入后续迭代） + +| 编号 | 问题 | 建议 | +|------|------|------| +| P2-1 | 商业化闭环缺少 ROI 量化模型 | 补充"当前运维人力成本 = X 人月 × Y 元/人月，目标释放 40% 后节省 Z 元/月"的计算示例，使北极星指标与财务指标挂钩。 | +| P2-2 | 竞品分析中的技术设计模式未融入 PRD 正文 | 将 `CustomBatchLogger`、`DigestEntry`、`DualCache` 等设计模式从竞品分析报告迁移到 PRD 技术约束或架构建议章节，避免设计阶段遗漏。 | +| P2-3 | 发布策略缺少阶段门控的量化验收标准 | 阶段 2 进入阶段 3 的条件目前是"无 P1 以上告警 72h"，建议补充"告警噪声率 <10%""通知渠道成功率 >95%"等可量化门控。 | +| P2-4 | 未定义生产部署拓扑 | 建议明确是单集群还是多集群部署，自愈动作"重启实例"在 K8s 与裸金属环境下的实现差异巨大。 | +| P2-5 | 审计日志 90 天保留期未评估存储成本 | 高并发场景下全量 JSON 审计日志的存储量可能极大，建议补充日志压缩/归档策略或存储成本上限。 | +| P2-6 | PRD 自检清单声称"没有使用优化、支持、友好、尽量、快速等模糊词"，但正文中仍存在"等""等相关指标"等模糊表述 | 建议将 In Scope 中的"等"字去除，改为封闭列表；功能清单中的"等相关能力"也需同步清理。 | + +--- + +### 改进建议（优先级排序） + +1. 
**立即修复 P0 问题**： + - 在 PRD §3 In Scope 中明确加入"供应商智能切换（含健康探测、Fallback 链、策略化路由）"或将其移入 Out of Scope；若纳入，需在 AC 中补充对应的验收标准。 + - 统一回滚失败错误码为 `OPS_AUD_4101`，功能清单同步修正。 + - 重新进行工时估算，建议采用"任务 × 复杂度系数 + 联调缓冲（20%）+ 风险缓冲（15%）"的方式，输出 30~40 人天的 realistic estimate。 + - 在功能清单 3.1.2 中补充"重启实例"自愈动作的实现任务（如调用 K8s API 或主机 agent）。 + +2. **本周内修复 P1 问题**： + - 合并/统一失败判定线，建议按"上线后 30 天"为统一时间窗口，阈值取更严格的版本（噪声率 <15%）。 + - 删除 In Scope 中的"包括但不限于"，改为封闭枚举；如确需扩展，规定"新增范围需经 PM+TechLead 双签"。 + - 明确 AC-4 通知渠道的最终列表（是否含钉钉），并同步更新功能清单的备用切换链。 + - 在 AC-7 中补充"不可篡改"的实现方式（建议：数据库层禁止 UPDATE/DELETE + 应用层只追加写入）。 + - 补充 UI 最低兼容性要求（如：Chrome/Firefox/Edge 最新 2 个版本，最小宽度 1280px）。 + - 细化角色权限矩阵到 API 级别，建议以表格形式列出各角色对关键接口的 CRUD 权限。 + +3. **TechLead 阶段前补充**： + - 将竞品分析中的设计模式建议提炼为 PRD 架构约束章节（如告警批量化、摘要窗口、双缓存机制）。 + - 为容量预测（AC-9）补充可测试标准，例如"预测值与实际值的平均绝对百分比误差（MAPE）<30%"或至少提供趋势方向判断准确率。 + - 明确生产部署拓扑（K8s vs 裸金属 vs 混合），影响自愈动作设计。 + +--- + +### 审核结论 + +| 维度 | 评分 | 说明 | +|------|------|------| +| 用户旅程完整性 | A- | 主/异/边缘流程全覆盖，但级联回退未在 AC 中闭环 | +| AC 可测试性 | B+ | 大部分量化精确，但"仅供参考""有效""不可篡改"等不可测试 | +| In/Out of Scope 清晰度 | B | 主体清晰，但"包括但不限于"和供应商切换造成范围争议 | +| 成功指标与失败判定 | B- | 指标量化，但存在双重标准，时间边界模糊 | +| 商业化闭环 | B- | 有框架但缺 ROI 量化，外部收益链条弱 | +| 功能清单一致性 | C+ | 与 PRD 存在错误码冲突、渠道不一致、任务遗漏、估算失真 | +| 模糊词汇控制 | B+ | 主体控制良好，"等"字和"包括但不限于"需清理 | + +**建议行动**：修复 P0-1~P0-4 后，可进入 TechLead 评审；P1 问题建议在技术方案评审前同步闭环。 diff --git a/prd/PRD.md b/prd/PRD.md new file mode 100644 index 0000000..f84c474 --- /dev/null +++ b/prd/PRD.md @@ -0,0 +1,461 @@ +# 智能运维系统 PRD + +> 版本：v1.0 +> 负责人：PM +> 目标读者：TechLead、QA、SRE、运营人员 +> 状态：待 TechLead 评审 + +--- + +## 1. 概述 + +### 一句话价值 +通过自动化监控、告警辅助决策、故障自愈与配置变更管理，将立交桥平台的运维从人工排查转为机器主导的实时保障，降低 MTTR、减少人工成本、提升运行稳定性。 + +### 用户问题 +1. 当前运维严重依赖人工定期检查日志，问题发现与处置耗时过长，MTTR 超过 30 分钟。 +2. 告警规则缺乏分类与阈值动态调整，导致要么漏告警、要么误告警爆炸。 +3. 故障发生时无自动恢复机制，必须等待运维人员手动参与，产生可避免的服务中断。 +4. 配置变更无审计追溯能力，回滚窗口不明确，引发过多次生产故障。 +5. 规模扩张中缺乏量化的容量管理视角，出现无计划的资源短缺。 + +### 业务意义 +- 从 Demo 级运维向生产级运维过渡，建立可重复、可审计、可回滚的运维体系。 +- 在人员规模不增的前提下，支撑接入商家数、API 调用量与 Token 数量级的增长。 + +--- + +## 2. 目标 + +### 业务目标 +1. 
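Reduce the platform's core-incident MTTR from >30min to <10min. +2. Handle at least 60% of P1/P2 alert events automatically (including self-healing and fault isolation). +3. Reduce the alert noise rate to below 5% (false alerts / total alerts). +4. Achieve audit traceability for 100% of production configuration changes, with a rollback time window of <5min. + +### User Goals +| User | Goal | +|---|---| +| SRE | No more 7x24 manual log watching; alerts are deliverable, classifiable, and actionable | +| Operations staff | After a defect is found, locate, analyze, and resolve it on one platform without switching between tools | +| Platform administrator | For any configuration change, see its impact scope and execution record, and roll it back quickly | +| Technical lead | Obtain quantified operations health metrics to support capacity and stability decisions | + +### Definition of Success +- Necessary conditions: the operations console is accessible, monitoring data can be queried, and alert rules can be configured. +- Sufficient conditions: self-healing rules take effect, the alert noise rate is <5%, and audit logs are complete. +- Failure criterion: if in any week during development the alert noise rate exceeds 20%, or a self-healing rule misfires and causes a production incident, the project is judged a failure. + +--- + +## 3. Scope + +### In Scope +1. Runtime monitoring of the 立交桥 platform itself (excluding downstream LLM services), including but not limited to: + - gateway/ request volume, latency, error rate, and degradation/stability rule hit rate + - supply-api/ supplier health status and audit anomalies + - platform-token-runtime/ token exhaustion, resource-constraint triggers, and anomaly recovery cycles +2. Alert rule engine: multi-dimensional thresholds, tiered alerts (P0/P1/P2/P3), alert suppression and aggregation. +3. Self-healing engine: automatic restart, route switching, rate limiting, and isolation of abnormal nodes. +4. **Intelligent supplier switching (In Scope)**: supplier health probing, multi-level Fallback switching on failure, policy-based routing, and post-switch health monitoring and rollback. +5. Configuration management and audit: configuration change audit logs, versioning, and rollback. +6. Capacity view: a capacity dashboard built on Token volume, QPS, response latency, and resource utilization as core metrics. +7. Log/metric query and drill-down: filtering by time range, service, error code, and user. + +### Out of Scope +1. Monitoring and alerting for downstream LLM services (for example, the stability of OpenAI or Claude themselves). +2. Infrastructure-level monitoring (for example, physical machine CPU/memory/disk, which is covered by the cloud vendor or Prometheus Node Exporter). +3. **AI load forecasting / automatic scaling (this phase provides only the capacity view and threshold hints; it makes no automatic scaling decisions).** + - Intelligent supplier switching (for example, switching to a backup supplier on failure) is a fault-response strategy and is In Scope. + - Automatic scaling (for example, K8s HPA-style scaling decisions) is a resource-scheduling strategy and is Out of Scope. +4. Integration with external monitoring systems (for example, Datadog or New Relic); only a standard Prometheus-format endpoint is exposed for them to scrape. + +### Assumptions and Dependencies +1. Assumes Prometheus or a similar time-series database already stores the metrics and accepts periodic PromQL queries. +2. Assumes platform logs are already in a uniform format and can be read through a standardized query interface. +3. Assumes the interface contracts of the existing gateway/internal/metrics/ and gateway/internal/alert/ modules can be carried over or cloned in this project. +4. Depends on the supplier health-check and audit-log interfaces of supply-api/. +5. Depends on the runtime-status and anomaly-recovery-status interfaces of platform-token-runtime/. + +--- + +## 4. User Scenarios + +### Main Flows + +#### Scenario A: Viewing platform health on the real-time monitoring dashboard +1. The SRE logs in to the operations console. +2. The home page shows real-time QPS, average latency, P99 latency, error rate, number of active suppliers, and number of abnormal alerts. +3. The SRE clicks any metric card to drill down into a minute-level trend chart and the breakdown by service/path/supplier. +4. If a metric exceeds its preset threshold, the card turns red and shows summaries of the 3 most recent related alerts. + +#### Scenario B: Configuration audit and rollback +1. The platform administrator changes a supplier endpoint address or a routing rule. +2. The system automatically records the operator, the before and after values, the timestamp, and the IP address, and generates a unique audit ID. +3. The administrator can search for the change in the audit log. +4. After finding that the change caused an anomaly, the administrator selects the record on the audit page and rolls it back; the system restores the original value within 60 seconds and verifies the restored state. + +#### Scenario C: Receiving and handling alerts +1. The monitoring engine detects a P1 alert condition (for example, a supplier's error rate >10% for 2min). +2. 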
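Within 30 seconds the alert is sent to the owner through the configured notification channels (Webhook/email/Feishu/WeCom). +3. The self-healing engine checks whether a self-healing action is configured for this P1 alert: + - If so: it runs the action (for example, switching to a backup supplier, rate limiting, or restarting the abnormal instance) and records the result in the event. + - If not: it only sends the notification and waits for manual handling. +4. In the console, the SRE acknowledges, ignores, or silences the alert and fills in the handling result. +5. The alert event closes automatically or becomes a persistent alert, and the current-period behaviour is adjusted according to the feedback. + +### Exception Flows + +#### Scenario D: Self-healing action fails +1. The self-healing engine attempts a self-healing action (for example, switching the supplier endpoint). +2. The action fails (the API returns a non-200 status or times out). +3. The system retries once within 10 seconds; if it still fails, it stops automatic attempts and escalates to a P0 manual alert (phone/SMS). +4. The failure reason and logs are recorded, and the event state is kept for manual investigation. + +#### Scenario E: Alert surge (alert storm) +1. An underlying fault causes hundreds or thousands of service instances to raise alerts at the same time. +2. The alert engine detects that the same resource/service has raised >20 alerts within 1 minute. +3. Aggregation is triggered automatically: one "cluster alert" is generated, the details are collected as an attachment, and the flood of individual notifications stops. +4. In the console, the SRE acknowledges or ignores in bulk the alerts that share the same root cause. + +#### Scenario F: Rollback fails +1. The administrator starts a rollback. +2. The rollback target value has been overwritten by a later change (the associated record does not exist or has been deleted). +3. The system refuses to execute and returns the explicit error code `OPS_AUD_4101` (rollback target does not exist). +4. The rollback failure event is recorded and the technical lead is notified by alert. + +### Edge Flows + +#### Scenario G: Persistent alert that nobody handles +1. A P2 alert stays unacknowledged for 2 hours. +2. The system automatically escalates it to P1 and notifies the owner's superior. + +#### Scenario H: Monitoring data source lost +1. The metrics collector receives no new data points for 5 minutes. +2. The console shows a "data source lost" marker, does not show stale data, and raises an internal P2 alert. +3. After recovery, the missing interval is automatically filled with null-value markers; no data is fabricated. + +#### Scenario I: Operator triggers a configuration change by mistake +1. The administrator submits a change that lowers a supplier's daily request limit from 10000 to 10. +2. The system detects that the impact of the change exceeds the preset threshold (for example, it would cause 90% of traffic to be rejected). +3. The change is marked "high risk" in the audit log, and before execution a pop-up warns the administrator that a second confirmation is required. + +### User Stories + +- As an SRE, I want to receive valid alerts at midnight rather than noise, so that I can locate and resolve the problem within 10 minutes and avoid affecting production. +- As operations staff, I want to view the health status and logs of all services in one console, without logging in to several systems. +- As a platform administrator, I want every configuration change to be logged and reversible, so that when something goes wrong I can recover quickly instead of scrambling to find the original values. +- As a technical lead, I want quantified operations health metrics, so that I have data to back requests for additional resources. + +--- + +## 5. 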
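Acceptance Criteria (AC) + +### AC-1 Real-time monitoring dashboard +- When the operations console is opened, the home page loads in <2s. +- The home page must show these 6 metric values: current QPS, average response latency (ms), P99 response latency (ms), 5xx error rate (%), number of active suppliers, and number of open alerts. +- Each metric card must refresh within 15s of a data update. + +### AC-2 Metric drill-down +- After any metric card is clicked, the page shows a minute-level trend chart of that metric for the past 1 hour. +- The trend chart supports drill-down splits by `service` (gateway/supply-api/platform-token-runtime), `path` (URL path), and `supplier` (supplier ID). +- Drill-down queries return in <3s. + +### AC-3 Alert rule configuration +- The console supports creating, editing, enabling, and disabling alert rules. +- Each rule must include: rule name, monitored metric, threshold type (>, <, =, regex match), duration (min), level (P0/P1/P2/P3), and notification channels. +- A rule change takes effect within 30s without a service restart. +- At least 50 alert rules must be able to run concurrently. + +### AC-4 Alert notification delivery +- P0/P1 alerts must be sent within 30s of triggering. +- P2 alerts must be sent within 120s. +- At least 2 of the following notification channels must be supported: Webhook, email, Feishu/WeCom. +- The notification template must include: alert level, rule name, trigger time, current value, threshold, event ID, and a view link. + +### AC-5 Alert aggregation and suppression +- When the same resource/service raises >20 alerts within 1 minute, the system must automatically generate 1 cluster alert and stop the flood of individual notifications. +- The cluster alert notification must include: cumulative count, list of rules involved, and time range. +- Suppression period: the same rule on the same target sends only 1 alert within 5 minutes (unless the level is escalated). + +### AC-6 Automatic self-healing +- The system must support configuring an optional self-healing action for each alert rule: none, switch to backup route, rate limit, restart instance, or trigger a programmable script. +- The self-healing action must finish within 60s of the alert triggering (including the time for 1 retry). +- The self-healing result (success/failure/rejected) must be recorded in the alert event. +- After a self-healing action is triggered, monitoring must evaluate within 2 minutes whether the alert condition has cleared; if not, the alert is escalated to a manual alert. + +### AC-7 Configuration audit log +- Every create, delete, or update operation on production configuration must produce an audit log record within 1s. +- The audit log must include: unique ID, operator, operation type, target resource type and ID, value before the operation (JSON), value after the operation (JSON), timestamp (to the millisecond), IP address, and request ID. +- Audit logs must be tamper-proof and retained for >=90 days. +- The console must support filtering by time range, operator, resource type, and keyword, with results returned in <3s. + +### AC-8 Configuration rollback +- For any audit log record, rollback must be supported as long as the target resource still exists and the value before the operation is valid. +- Rollback must complete in <60s and must list all sub-resources that will be overwritten before it executes. +- A rollback must generate a new audit record linked to the original operation ID. +- A failed rollback must return an explicit error code and must not fail silently. + +### AC-9 Capacity dashboard +- The capacity dashboard must show the past 7 days of Token volume, QPS, P99 latency, and per-supplier resource utilization trends. +- Each service must be labelled with its current load level: normal/warning/overloaded, based on configurable thresholds. +- It provides the result of an algorithm that "predicts the time to reach the resource limit at the current growth rate" (for reference only; no scaling is executed automatically). + +### AC-10 Log/metric query +- The console must support filtering logs by time range, service name, HTTP status code, error code, user ID, supplier ID, and keyword. +- Log query results are paginated, with at most 100 entries per page and the first page returned in <3s. +- Log results can be exported as a CSV file, with at most 10000 entries per export. + +### AC-11 Monitoring data retention +- Raw metric data must be retained for >=7 days, for short-term queries and alert evaluation. +- Minute-level aggregates must be retained for >=30 days, for trend analysis. +- Hour-level aggregates must be retained for >=90 days, for capacity planning and monthly reports. + +### AC-12 Roles and permissions +- The following roles and their basic permission controls must be supported: + - Viewer: can view the monitoring dashboard, logs, and alert events; cannot modify configuration. + - Operator: can acknowledge/ignore/silence alerts and manage alert rules; cannot execute rollbacks. + - Administrator: can perform all operations, including rollback and confirmation of high-risk changes. + +--- + +## 6. 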
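Edge Cases and Failure Paths + +| ID | Edge/failure scenario | System behaviour | When humans step in | +|---|---|---|---| +| F-1 | Self-healing action fails on every retry | Stop automatic attempts and escalate to a P0 manual alert | Immediately, by phone/SMS | +| F-2 | Alert notification channel fails (for example, Webhook returns 4xx/5xx) | Record the send failure and use a backup channel (email→Feishu→SMS) | If it still fails after three switches, notify the TechLead | +| F-3 | Rollback target no longer exists | Refuse the rollback and return error code `OPS_AUD_4101` | An operator must repair it manually or contact the developers | +| F-4 | Metrics collector has no data for 5 consecutive minutes | Show the data-source-lost marker and raise an internal P2 alert | Check the collector/network/storage status | +| F-5 | Audit log storage is full or a write fails | Drop non-critical fields or switch to asynchronous reporting; do not block business operations | Storage metering raises an early warning; the SRE expands or cleans up storage | +| F-6 | A self-healing action causes a cascading failure (for example, the new node fails after a route switch) | Automatically restore the state from before the last step (revert), then escalate to a manual alert | Immediately, by phone | +| F-7 | Monitoring database lost (total outage) | The console enters read-only/degraded mode, and the alert engine keeps running on its local cache | Immediately, check the storage layer | +| F-8 | Real-time dashboard metric computation times out | Show the last successful result with its timestamp; do not wait for the current request | Check query engine performance or the queried time range | + +--- + +## 7. Launch and Operations Readiness + +### Release Strategy +- Phase 1: monitoring dashboard + log/metric query. Visualization only; no automatic action is triggered. +- Phase 2: alert rule engine + notification channels; alerts only notify and do not run self-healing. +- Phase 3: self-healing engine + audit rollback. +- Phase 4: capacity dashboard and advanced analysis. + +### Canary and Rollback +- Each phase must run on a single controlled cluster for >=72h with no alert of P1 or above before moving to the next environment. +- Self-healing rules must pass "sandbox mode" validation: trigger them in simulation at least 10 times in a non-production environment and confirm the action results are as expected before they may be attached to production alert rules. +- The rollback capability must be rehearsed once before release, covering at least 3 different resource types. +- If a self-healing rule misfires in Phase 3 and causes a production incident, disable the self-healing engine immediately (through the permission switch); all alerts then fall back to notify-only mode. + +### Event Tracking and Monitoring +- The following tracking events must be implemented: + - `运维控制台页面加载` (operations console page load), `指标下钻` (metric drill-down), `日志查询执行` (log query execution), `告警规则创建/编辑/删除` (alert rule create/edit/delete), `告警确认/忽略/规避` (alert acknowledge/ignore/silence), `自愈动作执行` (self-healing action executed), `自愈失败` (self-healing failed), `回滚执行` (rollback executed), `回滚失败` (rollback failed). +- The system's own monitoring layer (metrics collector, alert engine, notification sender) must be health-checked; a failed check raises an internal P2 alert. + +### FAQ (prepared in advance) +| Question | Answer | +|---|---| +| What if an alert notification was not received? | Check the recipient address/secret in the notification channel configuration; check the send result and failure reason in the notification log. | +| Why did the self-healing action not trigger? | Confirm that self-healing is enabled on the rule and a specific action is selected; confirm that the sandbox test has passed. | +| Why does rollback report `OPS_AUD_4101`? | The configuration was deleted or overwritten after the change, so the pre-operation state cannot be found and must be restored manually. | +| Why is the dashboard stuck? | Check whether a "data source lost" marker is shown at the top of the page; try narrowing the time range or filters. | +| How can misfiring self-healing rules be avoided? | Test the self-healing rule at least 10 times in a non-production environment and verify the results before attaching it to a production alert rule. | + +--- + +## 8. Commercialization and Value Loop + +### Benefit Paths +1. Internal benefit: reduce the 7x24 on-call load on operations staff and free people for product feature development. +2. External benefit: raise the platform SLA from 99.5% to 99.9%, supporting enterprise customer contracts and renewals. +3. Cost savings: cut monthly manual operations hours by more than 40%, which can be quantified as saved labour cost. + +### North Star Metrics +- Platform core-incident MTTR (from >30min to <10min). +- Automation coverage (target >=60%). +- Alert noise rate (target <5%). + +### Failure Thresholds +- MTTR has not dropped to <20min within 30 days of launch. +- Automation coverage <30%. +- Alert noise rate >15%. +- A self-healing rule misfire causes 1 production incident. +If any one of these is triggered, the project enters rescue mode. + +### Stop-Loss Conditions +- Self-healing engine misfires cause 2 or more production incidents: lock the self-healing feature immediately, fall back to notify-only mode, and start an incident review. +- Monitoring data is lost for more than 24h: disable the automation rules that depend on monitoring data and degrade to manual handling. + +--- + +## 9. 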
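Dependencies and Risks + +### Technical Dependencies +| Dependency | Risk level | Fallback | +|---|---|---| +| Prometheus or a similar time-series database | High | Support VictoriaMetrics / Thanos as alternative backends and provide a storage adapter layer, so no single store is locked in | +| Notification channels (Webhook/email/Feishu) | Medium | Multiple channels with automatic switching are mandatory; no single channel may be the only dependency | +| Audit log storage | Medium | When primary storage fails, fall back to a local file cache plus asynchronous reporting, without blocking business operations | +| supply-api/ audit interface | Medium | If the interface is unavailable, the operations platform writes its own audit records and synchronizes them later | + +### Business Risks +1. A badly designed self-healing rule blocks or redirects normal traffic and affects customer requests. +2. Alert rules that are too sensitive or lack suppression cause a noise explosion, and operations staff become numb to real faults. +3. An improper rollback damages the configuration state more deeply, for example by rolling back a change that a downstream change depends on. +4. Lost audit logs obstruct fault attribution and compliance review. + +### Mitigations +1. Self-healing rules must pass "sandbox mode" validation before they take effect. +2. All self-healing actions can be turned off with one permission switch; once off, all alerts fall back to notify-only. +3. Before a rollback executes, the affected sub-resources are shown and a second confirmation is required. +4. Audit logs are written to primary and backup storage, with retention >=90 days. + +--- + +## 10. Technology Stack and Integration Constraints + +### Unified Technology Stack +This project must stay consistent with the main 立交桥 project: +- **Language**: Go 1.22+ +- **HTTP framework**: standard library `net/http` + custom middleware (third-party frameworks such as Gin/Echo are prohibited, to stay consistent with gateway/ and supply-api/) +- **Database**: PostgreSQL 15+, driver `jackc/pgx/v5` +- **Cache**: Redis, client `redis/go-redis/v9` +- **Configuration**: YAML + Viper, with environment variables overriding sensitive fields +- **Logging/audit**: structured logs; the audit event model matches supply-api/ +- **Error codes**: `{SOURCE}_{CATEGORY}_{CODE}` format, for example `OPS_ALT_4001` +- **Health checks**: `/actuator/health`, `/actuator/health/live`, `/actuator/health/ready` +- **Testing**: Go testing + testify, with coverage thresholds of domain ≥ 70% and service/handler ≥ 80% + +### Standalone and Integrated Operation +The system must support both run modes: + +| Mode | Characteristics | Deployment | Use case | +|------|------|---------|---------| +| **Standalone** | Own `cmd/ai-ops/main.go`, separate database schema, separate docker-compose | `docker-compose up` or a standalone container | External users who need only the operations capability and do not want the full 立交桥 stack | +| **Integrated** | Imported as a Go module by `gateway/` or `supply-api/`, sharing the database connection pool and configuration, and registered through internal interfaces | Compiled as a submodule and mounted into the main 立交桥 process at runtime | 立交桥 users who want an all-in-one operations capability | + +**Integration constraints**: +- In standalone mode, the system must provide a complete HTTP API and admin console. +- In integrated mode, the system must provide an `IntegrationPlugin` interface that lets the host program enable/disable each module through configuration switches. +- The database schema must use its own `ai_ops_` prefix to avoid table-name conflicts with the main project. +- Configuration must support split loading: standalone mode reads its own `config.yaml`, and integrated mode merges into the main project's configuration. + +### NewAPI / Sub2API Adaptation Support +The core capabilities of this system must be able to interface with NewAPI and Sub2API: +- **Monitoring data export**: expose a Prometheus-format `/metrics` endpoint so that NewAPI/Sub2API can obtain operations data via Prometheus scrape. +- **Alert callbacks**: support Webhook alert notifications so that NewAPI/Sub2API can be configured to receive this system's alert events. +- **Self-healing script extension**: the "trigger a programmable script" self-healing action supports calling the NewAPI/Sub2API management API (for example, switching suppliers, configuring rate limits, restarting instances). +- 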
**独立部署时**: 通过配置文件指定 NewAPI/Sub2API 的管理端点地址和鉴权信息，本系统通过适配层与之交互。 +- **集成部署时**: 若立交桥 gateway/ 已接入 NewAPI/Sub2API，本系统通过 gateway/ 的内部路由接口操作上游状态。 + +### 对外接口契约 +- 必须提供 OpenAPI 3.0 接口文档，确保 NewAPI/Sub2API 开发者可以独立接入。 +- 接口路径前缀默认为 `/api/v1/ai-ops/`，集成运行时可通过配置改为 `/internal/ai-ops/` 。 + +--- + +## 11. 阶段门控结论 + +### 当前状态 +- 需求范围已明确界定，In Scope / Out of Scope 清晰。 +- 验收标准已精确到可测试粒度，包含时间、数值、错误码、状态等维度。 +- 异常流程、边缘流程、失败路径已全面覆盖。 +- 上线策略、灰度方案、回滚路径、埋点检查已明确。 +- 技术栈与集成约束已明确（统一 Go 标准库、独立/集成双模式、NewAPI/Sub2API 适配）。 +- 北极星指标与失败判定线已量化。 +- 依赖与风险已识别，缓解措施已制定。 + +### 门控结论 +可进入 TechLead 阶段。 + +> 备注：TechLead 阶段需要完成的事项 +> 1. 确认现有 gateway/internal/metrics/ 与 gateway/internal/alert/ 的契约可延续性。 +> 2. 确认存储层技术选型（Prometheus / VictoriaMetrics / 自建时序库）。 +> 3. 确认通知渠道具体实现方案（Webhook / 飞书 / 邮件）。 +> 4. 确认审计日志与回滚是否复用 supply-api/ 既有审计能力还是独立实现。 +> 5. 确认角色权限体系是否复用平台统一认证系统。 + +--- + +## 自检清单 + +- [x] 已明确真实目标，不是只复述功能 +- [x] 已写清 In Scope / Out of Scope +- [x] 每个 AC 都可被 QA 或测试用例直接验证 +- [x] 已覆盖异常流、边缘流与失败路径 +- [x] 已补齐上线、运营、监控、回滚要求 +- [x] 已定义商业化/价值闭环 +- [x] 已明确成功指标与失败判定线 +- [x] 已明确当前可进入 TechLead 阶段 +- [x] 没有使用"优化、支持、友好、尽量、快速"等模糊词替代明确要求 + +--- +--- + +## 附：供应商智能切换（参考 FreeRide 思路） + +### 背景 + +[FreeRide](https://github.com/openclaw/skills/tree/main/skills/shaivpidadi/free-ride) 是 OpenClaw 的一个 Skill 插件，核心功能： +- 实时拉取 OpenRouter 免费模型列表，按 ELO 评分排序 +- 自动选择最强模型作为主模型 +- 配置 5 个高质量备用模型作为 Fallback 链 +- 主模型限速 → 自动切换下一个，用户无感知 +- 非破坏性配置更新，只改 model 相关字段 + +FreeRide 的设计哲学（自动选择 + 智能降级）对 AI-Ops 的供应商切换场景有直接参考价值。 + +### 智能供应商切换 vs FreeRide + +| 维度 | FreeRide | AI-Ops 供应商切换 | +|------|----------|-------------------| +| **目标用户** | 个人用户/极客 | 企业运维团队 | +| **模型来源** | OpenRouter 免费模型 | 多供应商中转 API | +| **核心价值** | 零成本用最强模型 | 供应商故障无感切换 | +| **Failover 粒度** | 模型级别 | 供应商级别 | +| **切换策略** | 固定轮询 | 成本优先/质量优先/延迟优先/手动 | +| **监控告警** | 无 | 多渠道告警 + 运维大盘 | +| **用量统计** | 无 | 成本分摊到部门/项目 | +| **自愈能力** | 仅切换 | 切换 + 通知 + 锁定 + 升级 | + +### 供应商切换策略 + +| 策略 | 决策依据 | 适用场景 | +|------|----------|----------| +| **成本优先** | input_cost + output_cost 最低 | 预算敏感型业务 | +| **质量优先** | 最近 24h 
成功率最高 | 高可用要求业务 | +| **延迟优先** | 最近 probe 响应时间最低 | 低延迟要求业务 | +| **手动** | 每次切换需人工确认 | 高风险变更管控 | + +### 设计约束（继承 HLD） + +- 切换后冷却期默认 300s，防止震荡（同一供应商反复切换） +- 每次切换写入审计日志（切换时间、原供应商、目标供应商、切换原因） +- 供应商配置更新采用原子替换（写临时文件 → 验证 → 原子替换），防止配置损坏 +- 切换执行后立即验证新供应商可服务性，失败则回退并升级告警 + +### 参考实现 + +供应商探针任务（每 5 分钟执行）： +```go +type SupplierProbe struct { + SupplierID string `json:"supplier_id"` + ProbeAt time.Time `json:"probe_at"` + LatencyMs int `json:"latency_ms"` + ErrorRate float64 `json:"error_rate"` // 0.0~1.0 + ELOHistory []float64 `json:"elo_history"` // 最近7天 ELO 趋势 +} +``` + +供应商 Fallback 链配置： +```go +type SupplierChain struct { + Model string `json:"model"` + Primary string `json:"primary"` // 主供应商ID + Fallbacks []string `json:"fallbacks"` // 备用供应商列表（按优先级排序） + CooldownSec int `json:"cooldown_sec"` // 冷却秒数，默认300 + Strategy string `json:"strategy"` // cost/quality/latency/manual +} +``` + diff --git a/prd/competitor-analysis.md b/prd/competitor-analysis.md new file mode 100644 index 0000000..7d4c74f --- /dev/null +++ b/prd/competitor-analysis.md @@ -0,0 +1,272 @@ +# AI-Ops 智能运维 — 竞品分析报告 + +## 1. 竞品范围 + +| 竞品 | 项目地址 | 技术栈 | 相关能力 | +|-------|---------|--------|---------| +| **LiteLLM** | berriai/litellm | Python/FastAPI | 告警系统（SlackAlerting）、健康检查、自动路由、容灾切换 | +| **Sub2API** | Wei-Shaw/sub2api | Go/Gin/Ent | 基础代理健康、用量统计 | +| **NewAPI / OneAPI** | Calcium-Ion/new-api | Go/Gin/GORM | 渠道监控、状态切换 | + +--- + +## 2. 
核心能力对标 + +### 2.1 告警系统 + +#### LiteLLM SlackAlerting（实现最完整） + +LiteLLM 的告警系统是当前开源 LLM Gateway 中最成熟的，其核心设计包括： + +**告警类型（12+种）**: +```python +class AlertType(str, Enum): + # LLM 相关 + llm_exceptions = "llm_exceptions" # LLM 调用异常 + llm_too_slow = "llm_too_slow" # 响应超时 + llm_requests_hanging = "llm_requests_hanging" # 请求悬停 + # 资源与成本 + budget_alerts = "budget_alerts" # 预算超支 + spend_reports = "spend_reports" # 消耗报告 + failed_tracking_spend = "failed_tracking_spend" # 成本跟踪失败 + # 数据库 + db_exceptions = "db_exceptions" # 数据库异常 + # 运营报告 + daily_reports = "daily_reports" # 每日运营报告 + # 部署与模型 + cooldown_deployment = "cooldown_deployment" # 部署冷却 + new_model_added = "new_model_added" # 新模型上线 + # 故障与容灾 + outage_alerts = "outage_alerts" # 模型故障 + region_outage_alerts = "region_outage_alerts" # 区域故障 + fallback_reports = "fallback_reports" # 容灾切换报告 +``` + +**关键技术细节**: +- **批量化与性能优化**: 采用 `CustomBatchLogger` 基类，告警批量发送（10秒或超过 X 事件触发），避免高并发下的性能瓶颈 +- **消息摘要（Digest）模式**: 支持按 `(alert_type, model, api_base)` 聚合告警，默认 24h 窗口期，避免滥发 +- **多渠道分发**: 支持按告警类型路由到不同 Webhook，如 `alert_to_webhook_url = {AlertType.outage_alerts: "#ops-channel", AlertType.budget_alerts: "#finance-channel"}` +- **告警阈值细分**: 悬停检测阈值可配置（默认 300s），故障检测分为 minor（5 次错误）和 major（10 次错误） +- **区域故障检测**: 同一区域内 2+ 模型报告错误时触发 region_outage_alerts +- **告警 TTL 缓解**: budget_alert_ttl=24h，outage_alert_ttl=1min，防止重复骚扰 + +**健康检查端点**: +- `/health` — 综合健康（可选择性检查已配置模型） +- `/health/liveliness` / `/health/liveness` — 进程存活 +- `/health/readiness` — 依赖就绪（Redis、DB、Cache） +- `/health/services?service=datadog` — 第三方服务健康 +- `/health/history` — 历史健康状态 +- `/health/latest` — 最新健康状态 +- `/health/backlog` — 请求队列积压 +- `/health/test_connection` — 测试特定模型连通性 + +#### Sub2API / NewAPI / OneAPI +- Sub2API: 仅提供基础代理状态查询，无结构化告警系统 +- NewAPI/OneAPI: 有渠道状态监控，支持切换上游，但缺乏自动化告警和根因分析 + +### 2.2 自动路由与容灾 + +#### LiteLLM Router Strategy +LiteLLM 提供多种路由策略： +- **lowest_latency**: 选择响应最快的部署 +- **lowest_cost**: 选择成本最低的部署 +- **lowest_tpm_rpm**: 选择 TPM/RPM 最低的部署 +- **least_busy**: 选择当前负载最低的部署 +- 
**auto_router**: 基于语义路由（使用 `SemanticRouter` 和向量编码器匹配请求到最适合的模型） +- **budget_limiter**: 按 key/team 限制预算 + +**容灾机制**: +- **Cooldown**: 当部署连续失败时自动进入 cooldown 状态，暂时从路由池中移除 +- **Fallback**: 主模型失败时自动切换到备用模型 +- **Retries**: 配置重试次数和策略 + +### 2.3 成本跟踪 + +#### LiteLLM Cost Tracking +- 维护 `model_prices_and_context_window_backup.json` 主数据库，包含所有支持模型的 input_cost_per_token / output_cost_per_token +- 支持分层定价（tiered_pricing）、批量定价（batch pricing）、音频 token 定价 +- 每次请求完成后计算并记录成本 +- 支持自定义成本覆盖 + +#### Sub2API Pricing Service +- 从 LiteLLM 上游镜像 `model_prices_and_context_window.json` +- 支持模型家族回退（如 gpt-5.3 未知时回退到 gpt-5.1） +- 本地 fallback 文件缓存 +- 支持动态价格字段优先级 + +--- + +## 3. 差距分析（我们的机会） + +| 能力维度 | 竞品现状 | 我们的机会 | +|---------|---------|---------| +| **告警渠道** | LiteLLM 仅支持 Slack/Webhook，无企微/钉钉/飞书 | 全面支持中国企业常用渠道 +通用 Webhook | +| **根因分析** | 竞品仅提供原始错误数据，无自动根因分析 | AI 驱动的根因分析，自动归类故障类型 | +| **自愈能力** | LiteLLM 仅有 cooldown 和 fallback，无可编程自愈 | 可编程自愈脚本，支持自定义操作（切换供应商、限流、重启） | +| **智能升级** | 竞品告警阈值是静态配置 | 基于历史数据自动建议/调整阈值 | +| **多维度健康** | LiteLLM 健康检查偏重连通性 | 连通性 + 配额 + 延迟 + 错误率 + 成功率综合健康指标 | +| **运维大盘** | LiteLLM 有 daily_reports，但无运维大盘概念 | 统一运维大盘，汇总所有指标与异常 | +| **预测性运维** | 竞品均为事后告警 | 基于趋势预测的预警（如配额耗尽预测、故障趋势预测） | + +--- + +## 4. 对产品规划的影响 + +### 强化方向 + +1. **告警系统设计参考 LiteLLM 的多类型分类**，但扩展为 15+ 种类型，增加： + - 配额耗尽预警（监测余额趋势） + - 响应时间 P99 突变预警 + - 模型质量跳水预警 + - 安全异常预警（密钥泄露、异常访问模式） + +2. **批量化与摘要机制**参考 LiteLLM 的 `CustomBatchLogger` 和 DigestEntry 设计： + - 告警批量发送（含压缩） + - 按 (alert_type, model, api_base) 聚合 + - 可配置摘要窗口（默认 24h，支持 5min/1h/24h） + +3. **健康检查端点**参考 LiteLLM 的多层级设计： + - `/health` 综合健康 + - `/health/live` 进程存活 + - `/health/ready` 依赖就绪 + - `/health/backlog` 队列积压 + - `/health/test_connection` 模型连通性测试 + +4. **自愈能力**超越竞品： + - LiteLLM 的 cooldown 只是"移除故障节点"，我们应提供"程序化自愈"，允许用户配置自定义动作 + - 参考 LiteLLM 的 fallback 机制，但增加"智能切换策略"（根据成本/质量/位置综合决策） + +### 新增差异化能力 + +5. **AI 驱动的根因分析**：竞品不具备，是核心差异化 +6. **运维大盘概念**：竞品无统一运维视图，我们应提供类似 Grafana Dashboard 的一体化运维大盘 +7. **预测性运维**：基于时序分析的预警，而不是事后告警 + +--- + +## 5. 
对技术规划的影响 + +### 应引入的设计模式 + +| 设计模式 | 来源 | 应用场景 | +|---------|------|---------| +| **CustomBatchLogger** | LiteLLM | 告警事件批量处理，避免高并发下的 IO 瓶颈 | +| **DualCache** | LiteLLM | 告警状态缓存（内存 + Redis），确保告警可靠性 | +| **DigestEntry** | LiteLLM | 告警聚合，避免滥发 | +| **AlertType + AlertTypeConfig** | LiteLLM | 可扩展的告警类型系统，支持按类型配置不同策略 | +| **OutageModel + ProviderRegionOutageModel** | LiteLLM | 故障状态机，支持模型级和区域级故障检测 | +| **DeploymentMetrics** | LiteLLM | 每部署的运行时指标（failed_request, latency_per_output_token） | +| **Cooldown 机制** | LiteLLM | 故障部署自动移除，作为自愈动作的一种 | + +### 技术避坑 + +1. **不重复造轮子**: LiteLLM 的告警系统已经很成熟，我们不需要重新设计整套机制，而是将其思想迁移到 Go 技术栈，并增加本地化适配 +2. **性能优先**: LiteLLM 的批量处理机制是关键，告警系统不能成为性能瓶颈 +3. **可观测性**: 参考 LiteLLM 的健康端点设计，确保所有依赖都有对应的就绪检查 + +--- + +## 附：FreeRide — OpenClaw 自动模型切换插件（市场调研） + +### 1. 基本信息 + +| 项目 | 内容 | +|-----|------| +| **名称** | FreeRide | +| **类型** | OpenClaw Skill（插件） | +| **定位** | 自动模型选择 + Fallback 链管理 | +| **技术栈** | Shell + OpenClaw 原生 API | +| **开源地址** | `openclaw/skills/tree/main/skills/shaivpidadi/free-ride` | +| **安装方式** | `/learn @openclaw/freeride` | + +### 2. 核心功能 + +``` +FreeRide 做的事： +1. 实时拉取 OpenRouter 免费模型列表（30+ 免费模型） +2. 按社区 ELO 评分排序，选出当前最强免费模型 +3. 将最强模型设为主模型 +4. 自动配置 5 个高质量备用模型作为 Fallback 链 +5. 主模型限速 → 自动切换下一个，用户无感知 +6. 只修改 openclaw.json 中的 model 相关字段，不触碰其他配置 +``` + +### 3. 实测数据 + +- **每日完成**：200~500+ 次高质量对话 +- **覆盖场景**：写文章、代码调试、数据分析、日常聊天 +- **成本**：零（全部使用 OpenRouter 免费额度） + +### 4. 
技术分析 + +#### 4.1 设计哲学 + +| 维度 | FreeRide | LiteLLM | 我们的 AI-Ops | +|-----|---------|---------|--------------| +| **目标用户** | 个人用户/极客 | 企业 | 企业运维团队 | +| **模型来源** | OpenRouter 免费模型 | 任意 OpenAI兼容API | 多供应商中转 | +| **核心价值** | 零成本用最强模型 | 企业级稳定性 | 供应商智能切换 + 运维自动化 | +| **Failover 机制** | 简单的模型列表轮询 | cooldown + fallback + retries | 智能化 failover + 自愈 | + +#### 4.2 技术亮点 + +**亮点1：实时模型排行** +```bash +# FreeRide 实时拉取 OpenRouter 免费模型，按 ELO 排序 +curl -s "https://openrouter.ai/models?free=true" | jq '.data | sort_by(.rating) | reverse' +``` +→ **借鉴点**：可用类似思路监控各供应商的模型质量变化，自动发现"性价比突变"模型 + +**亮点2：非破坏性配置更新** +```bash +# FreeRide 只更新 model 相关的 key +jq ".model = \"$BEST_MODEL\"" openclaw.json > tmp.json && mv tmp.json openclaw.json +``` +→ **借鉴点**：热切换配置时，先写入临时文件再原子替换，避免损坏配置文件 + +**亮点3：Fallback 链自动编排** +```bash +# FreeRide 默认配置 5 个备用模型 +FALLBACK_MODELS="model_a,model_b,model_c,model_d,model_e" +``` +→ **借鉴点**：供应商层面也可以做类似的多级 fallback，而不是单层 failover + +#### 4.3 不足与局限 + +| 问题 | 说明 | +|-----|------| +| **无监控告警** | FreeRide 没有告警概念，模型挂了用户需要自己发现 | +| **无用量统计** | 没有成本追踪，不知道花了多少钱 | +| **无自愈脚本** | 只是切换模型，不能执行重启/通知等操作 | +| **依赖 OpenRouter** | 只适合 OpenRouter，中国用户无法直接使用 | +| **免费模型质量不稳定** | OpenRouter 免费模型 ELO 排名波动大，不适合企业生产 | + +### 5. 对 AI-Ops 的借鉴 + +#### 5.1 可复用的设计 + +| FreeRide 思路 | AI-Ops 如何借鉴 | +|--------------|----------------| +| 实时模型排行 | **供应商模型质量监控**：定时拉取各中转的模型列表，按响应速度/成功率排序 | +| Fallback 链 | **多级降级策略**：主供应商 → 备供应商 → 降级回复（而不是简单的一层 failover） | +| 非破坏性配置 | **配置热切换规范**：所有配置更新走原子替换，不直接改原文件 | +| 限速自动切换 | **速率限制自适应**：监控各供应商 TPM/QPM 限制，预估耗尽时间并提前切换 | + +#### 5.2 AI-Ops 应超越 FreeRide 的地方 + +``` +FreeRide 做到了： AI-Ops 应做到： +✅ 模型自动切换 ✅ 供应商整体健康度评估（不止模型） +✅ Fallback 链 ✅ 切换策略可配置（成本优先/质量优先/延迟优先） +❌ 无监控告警 ✅ 多渠道告警（企微/飞书/钉钉/Slack） +❌ 无用量统计 ✅ 成本分摊到部门/项目/用户 +❌ 无自愈能力 ✅ 可编程自愈（切换 + 通知 + 锁定 + 升级） +❌ 无运维大盘 ✅ 统一运维视图（健康/配额/成本/故障） +``` + +### 6. 
结论 + +FreeRide 是一个优秀的**个人用户工具**，核心价值是"零成本 + 自动切换"。它的设计哲学（自动选择 + 智能降级）对 AI-Ops 有参考价值，但企业级需求（监控/告警/成本/自愈）是它完全不覆盖的领域。 + +**AI-Ops 的差异化定位**：不做 FreeRide 的企业版，而是做一个有**自愈能力的智能运维平台**，FreeRide 的思路是其中一个模块（供应商切换策略）。 + diff --git a/scripts/aiops-single-node.sh b/scripts/aiops-single-node.sh new file mode 100755 index 0000000..16e7f27 --- /dev/null +++ b/scripts/aiops-single-node.sh @@ -0,0 +1,311 @@ +#!/usr/bin/env bash +set -Eeuo pipefail + +ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)" +RUNTIME_DIR="$ROOT_DIR/.runtime" +BACKUP_DIR="$ROOT_DIR/backups" +COMPOSE_FILE="$ROOT_DIR/docker-compose.single.yml" +ENV_FILE="$RUNTIME_DIR/single-node.env" +CONFIG_FILE="$RUNTIME_DIR/config.single.yaml" +BINARY_FILE="$RUNTIME_DIR/ai-ops" +PROJECT_NAME="${AI_OPS_PROJECT:-ai-ops-single}" +APP_PORT="${AI_OPS_APP_PORT:-18080}" +DB_PORT="${AI_OPS_DB_PORT:-15432}" +REDIS_PORT="${AI_OPS_REDIS_PORT:-16379}" +DB_USER="${AI_OPS_DB_USER:-aiops}" +DB_NAME="${AI_OPS_DB_NAME:-ai_ops}" +DB_PASSWORD="${AI_OPS_DB_PASSWORD:-aiops123}" + +log() { printf '[ai-ops] %s\n' "$*"; } +fail() { printf '[ai-ops][ERROR] %s\n' "$*" >&2; exit 1; } + +need_cmd() { command -v "$1" >/dev/null 2>&1 || fail "missing command: $1"; } + +engine() { + if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then + echo docker + elif command -v podman >/dev/null 2>&1; then + echo podman + else + fail "docker or podman is required" + fi +} + +compose_cmd() { + local eng="$1" + if [[ "$eng" == docker ]]; then + if docker compose version >/dev/null 2>&1; then + echo "docker compose" + elif command -v docker-compose >/dev/null 2>&1; then + echo "docker-compose" + else + fail "docker compose plugin or docker-compose is required" + fi + else + if command -v podman-compose >/dev/null 2>&1; then + echo "podman-compose" + else + fail "podman-compose is required for podman mode" + fi + fi +} + +rand_hex() { + if command -v openssl >/dev/null 2>&1; then + openssl rand -hex "$1" + else + head -c "$1" /dev/urandom | 
od -An -tx1 | tr -d ' \n' + fi +} + +load_env() { + local keys=(AI_OPS_PROJECT AI_OPS_APP_PORT AI_OPS_DB_PORT AI_OPS_REDIS_PORT AI_OPS_DB_USER AI_OPS_DB_PASSWORD AI_OPS_DB_NAME AI_OPS_JWT_SECRET AI_OPS_METRICS_AUTH AI_OPS_POSTGRES_IMAGE AI_OPS_REDIS_IMAGE AI_OPS_RUNTIME_IMAGE) + local saved_key saved_val + declare -A saved=() + for saved_key in "${keys[@]}"; do + saved_val="${!saved_key-}" + if [[ -n "$saved_val" ]]; then + saved["$saved_key"]="$saved_val" + fi + done + if [[ -f "$ENV_FILE" ]]; then + set -a + # shellcheck disable=SC1090 + source "$ENV_FILE" + set +a + fi + for saved_key in "${!saved[@]}"; do + export "$saved_key=${saved[$saved_key]}" + done + PROJECT_NAME="${AI_OPS_PROJECT:-$PROJECT_NAME}" + APP_PORT="${AI_OPS_APP_PORT:-$APP_PORT}" + DB_PORT="${AI_OPS_DB_PORT:-$DB_PORT}" + REDIS_PORT="${AI_OPS_REDIS_PORT:-$REDIS_PORT}" + DB_USER="${AI_OPS_DB_USER:-$DB_USER}" + DB_NAME="${AI_OPS_DB_NAME:-$DB_NAME}" + DB_PASSWORD="${AI_OPS_DB_PASSWORD:-$DB_PASSWORD}" +} + +write_env_if_missing() { + mkdir -p "$RUNTIME_DIR" "$BACKUP_DIR" + if [[ ! 
-f "$ENV_FILE" ]]; then + umask 077 + cat >"$ENV_FILE" <"$CONFIG_FILE" </dev/null 2>&1; then + log "ready: http://127.0.0.1:${APP_PORT}" + return 0 + fi + sleep 1 + done + compose logs --tail=120 ai-ops || true + fail "service did not become ready" +} + +cmd_init() { + write_env_if_missing + write_config + log "runtime initialized under $RUNTIME_DIR" +} + +cmd_start() { + cmd_init + build_binary + compose up -d + wait_ready + cmd_smoke +} + +cmd_stop() { compose down; } +cmd_restart() { compose restart ai-ops; wait_ready; } +cmd_status() { compose ps; curl_json /actuator/health/ready || true; } +cmd_logs() { compose logs --tail="${TAIL:-200}" "${SERVICE:-ai-ops}"; } + +cmd_smoke() { + load_env + log "health" + curl_json /health >/dev/null + curl_json /actuator/health/ready >/dev/null + log "login" + local token + token="$(curl -fsS --max-time 5 -X POST "http://127.0.0.1:${APP_PORT}/api/v1/ai-ops/login" -H 'Content-Type: application/json' -d '{"username":"admin","password":"admin"}' | python3 -c 'import sys,json; d=json.load(sys.stdin); print((d.get("data") or d).get("token", ""))')" + [[ -n "$token" ]] || fail "login did not return token" + log "authenticated APIs" + curl -fsS --max-time 5 -H "Authorization: Bearer $token" "http://127.0.0.1:${APP_PORT}/api/v1/ai-ops/alerts?page=1&page_size=5" >/dev/null + curl -fsS --max-time 5 -H "Authorization: Bearer $token" "http://127.0.0.1:${APP_PORT}/api/v1/ai-ops/rules" >/dev/null + curl -fsS --max-time 5 -H "Authorization: Bearer $token" "http://127.0.0.1:${APP_PORT}/api/v1/ai-ops/channels" >/dev/null + curl -fsS --max-time 5 "http://127.0.0.1:${APP_PORT}/ops/dashboard" >/dev/null + curl -fsS --max-time 5 "http://127.0.0.1:${APP_PORT}/openapi.json" >/dev/null + log "SMOKE_OK" +} + +cmd_backup() { + load_env + mkdir -p "$BACKUP_DIR" + local ts out + ts="$(date +%Y%m%d-%H%M%S)" + out="$BACKUP_DIR/ai_ops_${ts}.sql.gz" + log "creating database backup: $out" + container_exec "${PROJECT_NAME}-postgres" pg_dump -U 
"${AI_OPS_DB_USER:-aiops}" "${AI_OPS_DB_NAME:-ai_ops}" | gzip >"$out" + test -s "$out" || fail "empty backup: $out" + log "BACKUP_OK $out" +} + +cmd_restore() { + local file="${1:-}" + [[ -n "$file" && -f "$file" ]] || fail "usage: $0 restore backups/file.sql.gz" + load_env + log "restoring from $file" + compose stop ai-ops + container_exec "${PROJECT_NAME}-postgres" psql -v ON_ERROR_STOP=1 -U "${AI_OPS_DB_USER:-aiops}" "${AI_OPS_DB_NAME:-ai_ops}" -c 'DROP SCHEMA public CASCADE; CREATE SCHEMA public;' + zcat "$file" | container_exec -i "${PROJECT_NAME}-postgres" psql -v ON_ERROR_STOP=1 -U "${AI_OPS_DB_USER:-aiops}" "${AI_OPS_DB_NAME:-ai_ops}" + compose start ai-ops + wait_ready + cmd_smoke + log "RESTORE_OK" +} + +cmd_recover() { + log "recovering single-node stack" + compose up -d postgres redis + compose up -d ai-ops + wait_ready + cmd_smoke + log "RECOVER_OK" +} + +cmd_doctor() { + log "doctor: commands" + need_cmd go + command -v curl >/dev/null 2>&1 || fail "missing curl" + command -v python3 >/dev/null 2>&1 || fail "missing python3" + engine >/dev/null + compose_cmd "$(engine)" >/dev/null + log "doctor: ports" + ss -ltn 2>/dev/null | grep -E ":(${APP_PORT}|${DB_PORT}|${REDIS_PORT}) " || true + log "doctor: config" + cmd_init + log "DOCTOR_OK" +} + +usage() { + cat <<'EOF_USAGE' +Usage: scripts/aiops-single-node.sh + +Commands: + init Generate .runtime/single-node.env and config.single.yaml + start Build binary, start DB/Redis/App, wait ready, run smoke + stop Stop and remove containers, keep volumes + restart Restart app container and wait ready + status Show compose status and readiness JSON + logs Show app logs; override SERVICE=postgres|redis|ai-ops TAIL=300 + smoke Run health/login/API/dashboard/openapi smoke checks + backup Create backups/ai_ops_.sql.gz via pg_dump + restore Restore a gzipped SQL backup, restart app, run smoke + recover Recreate stopped containers from existing volumes and smoke test + doctor Check local prerequisites and render runtime 
config +EOF_USAGE +} + +main() { + case "${1:-}" in + init) cmd_init ;; + start) cmd_start ;; + stop) cmd_stop ;; + restart) cmd_restart ;; + status) cmd_status ;; + logs) cmd_logs ;; + smoke) cmd_smoke ;; + backup) shift; cmd_backup "$@" ;; + restore) shift; cmd_restore "$@" ;; + recover) cmd_recover ;; + doctor) cmd_doctor ;; + *) usage; exit 2 ;; + esac +} +main "$@" diff --git a/specs/功能清单.md b/specs/功能清单.md new file mode 100644 index 0000000..e8f0696 --- /dev/null +++ b/specs/功能清单.md @@ -0,0 +1,286 @@ +# AI-Ops 功能清单（模块级） + +> 版本：v1.1 +> 日期：2026-05-11 +> 说明：模块级功能清单，仅到达"功能级"精度。具体的按钮、动画、色彩、字体等前端实现细节由 Engineer 和前端工程师在开发阶段自行决定，不在 PM 规划范围内。 +> +> **PM 范围：**模块定义、功能边界、数据模型、接口契约、验收标准。 +> **Engineer 范围：**代码实现、前端组件选型、动画/交互细节、测试用例实现。 +> +> **总工期估算**：34 人天（含 20% 协调缓冲） + 15% 风险缓冲 = **39 人天**。 +> 估算方法：简单任务 0.5d / 中等任务 1d / 复杂任务 2d，模块级汇总后加缓冲。 + +--- + +## Phase 1：监控看板 + 日志查询（不触发自动动作） + +> **工期估算**：8 人天（含 20% 协调缓冲） + +### 模块 1.1：监控首页 + +#### 1.1.1 首页基础布局 +- [ ] **任务**：实现首页路由 `/ops/dashboard`，返回监控首页 HTML 模板 + +#### 1.1.2 指标数据获取 +- [ ] **任务**：实现 `GET /api/v1/ai-ops/metrics/realtime` 接口，返回当前 QPS、平均延迟、P99、错误率 +- [ ] **任务**：实现 `GET /api/v1/ai-ops/metrics/suppliers/count` 接口，返回活跃供应商数量 +- [ ] **任务**：实现 `GET /api/v1/ai-ops/alerts/open/count` 接口，返回未关闭告警数量 + +#### 1.1.3 指标下钻 +- [ ] **任务**：实现 `GET /api/v1/ai-ops/metrics/query` 接口，支持 service/path/supplier 维度过滤 + +### 模块 1.2：日志查询 + +#### 1.2.1 日志查询页 +- [ ] **任务**：实现日志查询页路由 `/ops/dashboard/logs` + +#### 1.2.2 日志结果展示 +- [ ] **任务**：日志列表以表格展示，每行显示：时间 / 服务名 / 路径 / 状态码 / 延迟 / 用户ID / 供应商ID +- [ ] **任务**：日志列表支持分页，每页 100 条，显示总条数 +- [ ] **任务**：实现日志查询结果"导出 CSV"按钮，导出上限 10000 条 + +#### 1.2.3 日志查询性能 +- [ ] **任务**：日志查询接口添加查询超时逻辑，超时返回部分结果并提示 +- [ ] **任务**：实现日志查询结果缓存（Redis，5分钟 TTL），同一筛选条件命中缓存时直接返回 + +--- + +## Phase 2：告警规则引擎 + 通知渠道（告警只通知，不执行自愈） + +> **工期估算**：12 人天（含 20% 协调缓冲） + +### 模块 2.1：告警规则管理 + +#### 2.1.1 告警规则列表页 +- [ ] **任务**：实现告警规则列表页路由 `/ops/dashboard/alerts/rules` +- [ ] **任务**：规则列表支持分页，每页 50 条 + +#### 2.1.2 创建/编辑告警规则 +- [ ] **任务**：实现规则创建页路由 
`/ops/dashboard/alerts/rules/create` +- [ ] **任务**：实现规则创建表单，包含字段：规则名称（必填）、监控指标（下拉：QPS/延迟/错误率/供应商健康度/Token消耗）、阈值类型（下拉：> / < / = / 正则匹配）、阈值数值（必填）、持续时间（分钟，必填）、告警级别（下拉：P0/P1/P2/P3，必填）、通知渠道（多选：Webhook/邮件/飞书/企微） +- [ ] **任务**：编辑页路由 `/ops/dashboard/alerts/rules/{rule_id}/edit`，回填已有数据 + +#### 2.1.3 告警规则引擎（后端） +- [ ] **任务**：实现规则引擎从 PostgreSQL 加载所有启用规则，每 30 秒刷新 +- [ ] **任务**：实现规则引擎对每个指标数据点执行阈值评估 +- [ ] **任务**：实现持续时间判定（指标超阈值必须持续 N 分钟才触发） +- [ ] **任务**：实现告警事件生成，写入 `ai_ops_alert_events` 表，状态 = triggered +- [ ] **任务**：实现同一规则同一目标 5 分钟抑制期逻辑（5 分钟内相同告警不重复生成） +- [ ] **任务**：实现告警升级逻辑（P2 持续 2 小时未确认 → 升级 P1） + +### 模块 2.2：告警事件与处置 + +#### 2.2.1 告警事件列表 +- [ ] **任务**：实现告警事件列表页路由 `/ops/dashboard/alerts/events` +- [ ] **任务**：事件列表每行显示：事件ID / 规则名称 / 级别 / 触发时间 / 持续时长 / 状态 / 操作 + +#### 2.2.2 告警集群聚合 +- [ ] **任务**：实现告警聚合逻辑：同一服务/资源 1 分钟内触发 >20 条告警时，生成 1 条集群告警 +- [ ] **任务**：集群告警列表每行显示：集群ID / 涉及规则数 / 累计告警数 / 首条时间 / 最新时间 / 级别 + +### 模块 2.3：通知渠道配置 + +#### 2.3.1 通知配置页 +- [ ] **任务**：实现通知配置页路由 `/ops/dashboard/alerts/channels` + +#### 2.3.2 通知发送后端 +- [ ] **任务**：实现通知发送队列（内存队列 + Redis 持久化） +- [ ] **任务**：实现 P0/P1 通知 30 秒内发送，P2 通知 120 秒内发送 +- [ ] **任务**：实现通知失败时自动切换备用渠道（Webhook 失败 → 邮件 → 飞书 → 企微） +- [ ] **任务**：实现通知日志记录，每次发送记录成功/失败原因到 `ai_ops_notification_logs` + +--- + +## Phase 3：自愈引擎 + 审计回滚 + +> **工期估算**：14 人天（含 20% 协调缓冲） + +### 模块 3.1：自愈规则配置 + +#### 3.1.1 自愈规则创建 +- [ ] **任务**：在告警规则创建/编辑页，添加"自愈动作"可选配置区块 +- [ ] **任务**：自愈动作类型下拉：无 / 切换备用路由 / 限流 / 重启实例 / 触发脚本 +- [ ] **任务**：当选择"切换备用路由"时，显示供应商下拉框（选择目标备用供应商） +- [ ] **任务**：配置"沙盒模式"开关，默认开启（沙盒模式下自愈动作仅记录，不实际执行） +- [ ] **任务**："保存"按钮同时保存告警规则和自愈动作配置 + +#### 3.1.2 自愈执行后端 +- [ ] **任务**：自愈引擎监听 triggered 状态的告警事件 +- [ ] **任务**：当告警关联自愈动作且沙盒模式关闭时，执行自愈动作 +- [ ] **任务**：执行切换备用路由：调用 gateway 管理接口，将流量切换到备用供应商 +- [ ] **任务**：执行限流：调用 gateway 管理接口，设置速率限制 +- [ ] **任务**：执行重启实例：通过 K8s API 或主机 agent 调用重启指定服务实例，配置超时 60 秒、最大重试 2 次 +- [ ] **任务**：执行触发脚本：在隔离环境中执行指定的 shell/Python 脚本，超时 30 秒 +- [ ] **任务**：自愈动作执行后 60 秒内评估监控指标是否恢复正常 +- [ ] **任务**：自愈成功：事件状态更新为 resolved，记录动作结果 +- [ ] **任务**：自愈失败（重试 1 
次仍失败）：升级为 P0 人工告警（电话/短信），事件状态更新为 escalated + +#### 3.1.3 自愈级联失败处理 +- [ ] **任务**：切换备用路由后，监控新路由健康状态 2 分钟 +- [ ] **任务**：若新路由也触发告警，立即回退到原始路由 +- [ ] **任务**：回退完成后，升级为 P0 人工告警（电话/短信），注明"自愈级联失败" + +### 模块 3.2：配置审计 + +#### 3.2.1 审计日志查询页 +- [ ] **任务**：实现审计日志页路由 `/ops/dashboard/audit` +- [ ] **任务**：审计列表每行显示：审计ID / 操作时间 / 操作人 / 操作类型 / 资源类型 / 资源ID / 操作后值摘要 +- [ ] **任务**：审计列表支持导出（最多 10000 条，按时间范围导出） + +#### 3.2.2 审计后端 +- [ ] **任务**：拦截所有配置变更操作（CREATE/UPDATE/DELETE），在事务内同步写入审计日志 +- [ ] **任务**：审计日志写入使用追加模式（不支持 UPDATE/DELETE），数据库层设置禁止删除策略 +- [ ] **任务**：审计日志保留期 >= 90 天，后台 job 每天清理过期数据 + +### 模块 3.3：配置回滚 + +#### 3.3.1 回滚操作入口 +- [ ] **任务**：回滚成功提示："回滚成功，已恢复到 {时间} 的状态，耗时 X 秒" +- [ ] **任务**：回滚失败时显示错误码和原因（如 OPS_AUD_4101） + +#### 3.3.2 回滚后端 +- [ ] **任务**：实现回滚接口，根据审计记录 ID 查找操作前值 +- [ ] **任务**：回滚前检查目标资源是否仍存在，不存在时返回错误码 `AUDIT_ROLLBACK_TARGET_LOST` +- [ ] **任务**：回滚操作在独立事务中执行，更新目标资源值 +- [ ] **任务**：回滚成功后生成新审计记录，关联原始审计记录 ID（字段 `rolled_back_from_audit_id`） + +--- + +## Phase 4：容量主板与高级分析 + +### 模块 4.1：容量视图 + +#### 4.1.1 容量主页 +- [ ] **任务**：实现容量主页路由 `/ops/dashboard/capacity` + +#### 4.1.2 容量数据后端 +- [ ] **任务**：实现容量数据聚合 job（每小时执行），将原始指标聚合为小时级数据 +- [ ] **任务**：实现增长率计算算法（基于过去 7 天数据线性回归） +- [ ] **任务**：实现负载等级判定（可配置阈值，默认为：正常 < 60% 利用率 < 警告 < 80% < 过载） + +--- + +## 全局模块 + +### 模块 G1：认证与权限 + +- [ ] **任务**：实现登录页路由 `/ops/login`，支持账号密码登录 +- [ ] **任务**：实现 JWT Token 签发，Token 有效期 8 小时 +- [ ] **任务**：实现中间件，所有 `/api/v1/ai-ops/*` 接口需携带有效 JWT +- [ ] **任务**：实现角色权限中间件：查看者（GET only）、运维人员（可写告警规则）、管理员（可回滚、可管理用户） +- [ ] **任务**：实现权限不足时返回 HTTP 403，响应体包含错误码 `OPS_AUTH_1001` + +### 模块 G2：健康检查 + +- [ ] **任务**：实现 `GET /actuator/health` 接口，返回整体健康状态 +- [ ] **任务**：实现 `GET /actuator/health/live` 接口，用于 K8s liveness probe +- [ ] **任务**：实现 `GET /actuator/health/ready` 接口，用于 K8s readiness probe（依赖 DB + Redis 连通性） + +### 模块 G3：OpenAPI 文档 + +- [ ] **任务**：实现 OpenAPI 3.0 JSON spec 生成，端点 `/openapi.json` +- [ ] **任务**：确保所有对外 API（路由/请求/响应/错误码）均在 spec 中体现 + +--- + +## 技术基础设施（各 Phase 共享） + +### T1：项目骨架 +- [ ] **任务**：初始化 Go module 
`github.com/lijiaoliao/ai-ops` +- [ ] **任务**：创建 `cmd/ai-ops/main.go` 入口，支持 `api` 和 `worker` 两种运行模式 +- [ ] **任务**：创建 `internal/` 目录结构（domain/service/handler/infrastructure/repository） +- [ ] **任务**：配置 Viper 读取 `config.yaml`，支持环境变量覆盖 +- [ ] **任务**：配置 `log/slog` 结构化日志，输出 JSON 格式 +- [ ] **任务**：创建 PostgreSQL schema migration（使用 golang-migrate），表前缀 `ai_ops_` +- [ ] **任务**：创建 Redis 连接池配置 +- [ ] **任务**：配置 Dockerfile 和 docker-compose.yml +- [ ] **任务**：编写 `DEPLOYMENT.md` 中的 docker-compose 启动命令 + +### T2：单元测试骨架 +- [ ] **任务**：为每个 domain 层函数编写单元测试，覆盖率 >= 70% +- [ ] **任务**：为每个 service 层函数编写单元测试，覆盖率 >= 80% +- [ ] **任务**：配置 CI（GitHub Actions），PR 必须通过全部测试和覆盖率检查 + +### T3：IntegrationPlugin 接口 +- [ ] **任务**：实现 `IntegrationPlugin` 接口（`Init() error` / `Serve() error` / `Shutdown() error`） +- [ ] **任务**：实现插件模式下各模块的开关配置（`viper` 读取 `ops.enabled_modules`） +- [ ] **任务**：编写集成测试：插件模式启动，所有功能正常运作 + +--- + +## 任务估算汇总 + +| Phase | 模块 | 任务数 | 估计工时 | +|-------|------|--------|---------| +| Phase 1 | 1.1 首页 + 1.2 日志查询 | 28 | 3 人天 | +| Phase 2 | 2.1 告警规则 + 2.2 事件处置 + 2.3 通知渠道 | 30 | 4 人天 | +| Phase 3 | 3.1 自愈引擎 + 3.2 审计 + 3.3 回滚 + 3.4 供应商切换 | 26+16=42 | 4 人天 + 2 人天 | +| Phase 4 | 4.1 容量视图 | 10 | 1.5 人天 | +| 全局 | G1 认证 + G2 健康 + G3 文档 | 14 | 1.5 人天 | +| 技术基础设施 | T1 骨架 + T2 测试 + T3 插件 | 14 | 2 人天 | +| **合计** | | **122+16=138** | **~16+2=18 人天** | + +--- + +### 模块 3.4：供应商智能切换（参考 FreeRide 思路） + +> FreeRide 是 OpenClaw 的自动模型切换插件，核心思路：实时排行 → 自动选择 → Fallback 链 → 限速无感知切换。对应到 AI-Ops 的供应商切换场景，可以把这个思路产品化。 + +#### 3.4.1 供应商质量监控 + +- [ ] **任务**：定时任务（每 5 分钟）调用各中转供应商的 `/models` 接口，记录可用模型列表 +- [ ] **任务**：对每个供应商执行探测请求（测试请求），记录响应时间和错误率 +- [ ] **任务**：探测结果写入 `supplier_health` 表，记录字段：supplier_id、probe_at、latency_ms、error_rate、available_models、elo_score +- [ ] **任务**：`GET /api/v1/ai-ops/suppliers/health` 接口返回所有供应商的实时健康状态 +- [ ] **任务**：`GET /api/v1/ai-ops/suppliers/health/{supplier_id}` 接口返回指定供应商的详细健康状态 +- [ ] **任务**：在供应商管理页显示健康状态标签（健康 / 延迟高 / 错误率高 / 不可用） +- [ ] **任务**：健康状态数据保留 7 天，支持趋势查看 + +#### 3.4.2 供应商 Fallback 链管理 + +- [ ] 
**任务**：为每个接入的模型配置主供应商 + 备用供应商列表（至少 1 主 + 1 备） +- [ ] **任务**：供应商配置数据结构： + ```go + type SupplierChain struct { + Model string // 模型名 + Primary string // 主供应商ID + Fallbacks []string // 备用供应商ID列表（按优先级排序） + CooldownSec int // 故障后多少秒内不切换回来（默认300s） + } + ``` +- [ ] **任务**：配置页支持拖拽排序 Fallback 顺序 +- [ ] **任务**：切换后的冷却期内，即使主供应商恢复也不切回（避免震荡） +- [ ] **任务**：切换记录写入审计日志，包含：切换时间、原供应商、目标供应商、切换原因 + +#### 3.4.3 智能切换策略 + +- [ ] **任务**：切换策略下拉：成本优先 / 质量优先 / 延迟优先 / 手动 +- [ ] **任务**：**成本优先**：按 `input_cost_per_token + output_cost_per_token` 排序，选择最低者 +- [ ] **任务**：**质量优先**：按最近 24h 成功率排序，选择最高者 +- [ ] **任务**：**延迟优先**：按最近 probe 的 `latency_ms` 排序，选择最低者 +- [ ] **任务**：**手动**：每次切换需人工确认 +- [ ] **任务**：当主供应商触发告警（P1/P2），自动检查 Fallback 链是否可用 +- [ ] **任务**：选择最佳备用供应商后，自动执行切换（若策略不是"手动"） +- [ ] **任务**：切换完成后发送通知（飞书/企微/钉钉），告知：原供应商、目标供应商、切换原因 + +#### 3.4.4 供应商切换执行 + +- [ ] **任务**：`POST /api/v1/ai-ops/suppliers/switch` 接口：传入 model + target_supplier，执行切换 +- [ ] **任务**：调用 gateway 的 `/internal/suppliers/switch` 接口完成实际路由切换 +- [ ] **任务**：切换后立即执行一次探针验证，确认新供应商可正常提供服务 +- [ ] **任务**：验证失败时，回退到上一个供应商，并记录切换失败原因 +- [ ] **任务**：供应商切换作为自愈动作之一，可关联告警规则（Phase 3.1 已覆盖） + +#### 3.4.5 供应商健康看板 + +- [ ] **任务**：路由 `/ops/dashboard/suppliers` 显示供应商健康一览 +- [ ] **任务**：卡片展示：今日切换次数 / 当前不可用供应商数 / 各供应商平均延迟 / 各供应商错误率 +- [ ] **任务**：表格展示所有供应商：名称 / 健康状态 / 最后探针时间 / 平均延迟 / 24h 成功率 / 可用模型数 +- [ ] **任务**：支持按健康状态筛选（全部 / 健康 / 延迟高 / 不可用） + +#### 3.4.6 参考 FreeRide 的非破坏性配置更新 + +- [ ] **任务**：供应商配置更新采用原子替换策略：写临时文件 → 验证 → 原子替换 +- [ ] **任务**：防止配置损坏导致系统不可用 +- [ ] **任务**：配置更新前先在内存中验证 JSON Schema，不合法则拒绝更新 + diff --git a/specs/竞品分析.md b/specs/竞品分析.md new file mode 100644 index 0000000..a98e240 --- /dev/null +++ b/specs/竞品分析.md @@ -0,0 +1,132 @@ +# AI-Ops 竞品深度分析 + +> 版本：v1.0 +> 日期：2026-04-27 +> 内容：14 个竞品全景矩阵、功能逐项对比、技术分析、市场定位 + +--- + +## 一、市场概览 + +- 全球 ITOM 市场：2025 年约 **$420 亿**，AIOps 细分增速 25-30% CAGR +- 国内 AIOps 市场：约 **¥80-100 亿** +- 43% 的 SRE 团队在采纳监控工具后运营 toil 不降反升（Gartner 2025） +- AI 告警噪声降低幅度：60-80%；MTTR 缩短：50-70% + +--- + +## 二、竞品全景矩阵（14 个） + +| 竞品 | 类型 | LLM 
Gateway 特有监控 | 供应商健康检测 | 自愈能力 | 定价 | 核心劣势 | +|------|------|---------------------|--------------|---------|------|---------| +| **Datadog** | SaaS/企业 | ⚠️ LLM Observability（2024 新增） | ❌ | ❌ | $15+/host/月 | 价格高，对 LLM 特有故障无专项 | +| **New Relic** | SaaS/企业 | ⚠️ LLM 监控（新增） | ❌ | ❌ | $0.14-0.25/GiB | 非 LLM 原生，故障定位慢 | +| **PagerDuty AIOps** | SaaS | ❌ | ❌ | ⚠️ Runbook 触发 | $15-25/user/月 | 只管 On-call，监控能力弱 | +| **incident.io** | SaaS | ❌ | ❌ | ⚠️ AI 根因分析 | $20-35/user/月 | 无监控，只做事件响应 | +| **Dynatrace Davis AI** | 企业 | ⚠️ AI 监控 | ❌ | ⚠️ 有限 | 面议 | 重量级，LLM 场景不深 | +| **BigPanda** | SaaS | ❌ | ❌ | ⚠️ 自动化工作流 | 面议 | 企业级，配置复杂 | +| **Splunk AI** | 企业 | ❌ | ❌ | ❌ | 面议 | 价格极高，非实时 | +| **Grafana + Alerting** | 开源 | ❌ | ❌ | ❌ | 免费 | 规则维护成本高，无自愈 | +| **阿里云 ARMS** | 云厂商 | ⚠️ 国内模型 | ❌ | ⚠️ 限国内云 | ¥0.5-2/调用量 | 非阿里云环境弱 | +| **Opsgenie** | SaaS | ❌ | ❌ | ❌ | $10-20/user/月 | 告警管理，无监控 | +| **xMatters** | SaaS | ❌ | ❌ | ✅ 完整 | 面议 | 企业级，K8s 自愈强 | +| **Coralogix LLM Observability** | SaaS | ✅ LLM 专项 | ❌ | ❌ | 面议 | 只做可观测性，无自愈 | +| **Robusta** | 开源 | ❌ | ❌ | ✅ K8s 自愈 | 免费 | 只管 K8s，不懂 LLM | +| **OneAlert** | SaaS | ❌ | ❌ | ⚠️ 告警聚合 | 免费 | 基础告警，无深度 | +| **立连桥 ai-ops** | 内部工具 | ✅ 深度集成 | ✅ 分钟级探针 | ✅ 供应商自愈 | 内部成本 | 需从 0 构建 | + +--- + +## 三、功能逐项对比（19 项） + +``` +功能项 Datadog NewRelic PagerDuty incident.io xMatters Grafana ARMS ai-ops +LLM Gateway 垂直监控 ⚠️ ⚠️ ❌ ❌ ❌ ❌ ⚠️ ✅ +供应商密钥失效检测 ❌ ❌ ❌ ❌ ❌ ❌ ❌ ✅ +额度耗尽预警 ❌ ❌ ❌ ❌ ❌ ❌ ❌ ✅ +供应商故障自动切换 ❌ ❌ ⚠️ ❌ ✅ ❌ ⚠️ ✅ +配置变更审计+回滚 ⚠️ ⚠️ ❌ ❌ ❌ ⚠️ ⚠️ ✅ +Token 消耗趋势 ⚠️ ⚠️ ❌ ❌ ❌ ⚠️ ⚠️ ✅ +容量视图（QPS/延迟/利用率） ✅ ✅ ❌ ❌ ❌ ⚠️ ✅ ✅ +告警聚合+抑制 ✅ ✅ ✅ ✅ ✅ ⚠️ ✅ ✅ +多渠道告警通知 ✅ ✅ ✅ ✅ ✅ ⚠️ ✅ ✅ +MTTR 追踪 ✅ ✅ ✅ ✅ ⚠️ ❌ ✅ ✅ +OpenTelemetry 兼容 ✅ ✅ ⚠️ ✅ ⚠️ ✅ ❌ ✅ +自愈引擎 ❌ ❌ ⚠️ Runbook ❌ ✅ ❌ ⚠️ ✅ +独立部署模式 ❌ ❌ ❌ ❌ ❌ ✅ ❌ ✅ +集成部署模式（Go module） ❌ ❌ ❌ ❌ ❌ ❌ ❌ ✅ +Go 标准库实现 ❌ ❌ ❌ ❌ ❌ ⚠️ ❌ ✅ +Webhook/脚本化自愈 ❌ ❌ ✅ ❌ ✅ ❌ ❌ ✅ +RBAC 权限控制 ✅ ✅ ✅ ✅ ✅ ⚠️ ✅ ✅ +Prometheus 格式指标暴露 ✅ ✅ ⚠️ ⚠️ ⚠️ ✅ ⚠️ ✅ +LLM 特有错误码映射 ❌ ❌ ❌ ❌ ❌ ❌ ❌ ✅ +``` + +--- + +## 四、关键技术差异 + +### 4.1 告警引擎对比 + +| 方案 | 代表竞品 | 自愈能力 | LLM Gateway 适配 | 
+|------|---------|---------|----------------| +| 通用 SaaS | Datadog/New Relic | ❌ 无自愈 | ❌ 只做指标监控 | +| On-call 平台 | PagerDuty/incident.io | ⚠️ Runbook 触发 | ❌ 无供应商概念 | +| 自动化 Remediation | xMatters/Robusta | ✅ 完整 | ⚠️ 基于 K8s/基础设施 | +| **ai-ops** | 立连桥 | ✅ 供应商专项自愈 | ✅ 深度集成 | + +### 4.2 数据后端对比 + +| 竞品 | 监控后端 | 部署方式 | LLM 场景适配 | +|------|---------|---------|------------| +| Datadog | 专有 | SaaS | ⚠️ 需额外配置 | +| Grafana | Prometheus | 开源 | ⚠️ 需配置 | +| 阿里云 ARMS | 专有 | 云 | ⚠️ 只限阿里云 | +| **ai-ops** | VictoriaMetrics | 自部署 | ✅ 原生 | + +--- + +## 五、市场定位结论 + +### 5.1 竞品空白 + +**没有任何竞品同时提供：** +1. LLM Gateway 特有指标监控（供应商健康/Token 消耗/错误码映射） +2. 供应商密钥失效的分钟级自动检测 +3. 基于供应商状态的自动切换/限流/自愈 +4. 面向 LLM 运营场景的容量视图 + +### 5.2 ai-ops 差异化定位 + +``` +通用监控（Datadog/New Relic） + └─ 做不了：LLM 特有故障类型 + +On-call 平台（PagerDuty/incident.io） + └─ 做不了：供应商状态感知 + +K8s 自愈（xMatters/Robusta） + └─ 做不了：LLM 供应商层面自愈 + +LLM 可观测性（Coralogix） + └─ 做不了：自动 Remediation + +─────────────────────────────────── +立连桥 ai-ops = LLM Gateway 垂直场景 + ✅ 供应商健康探针（分钟级） + ✅ 密钥失效/额度耗尽自动检测 + ✅ 供应商故障自动切换/限流 + ✅ 配置审计+回滚+容量视图 +``` + +--- + +## 六、技术选型建议 + +| 组件 | 推荐方案 | 理由 | +|------|---------|------| +| 监控后端 | VictoriaMetrics | 单一二进制文件部署，兼容 Prometheus，压缩率 10x | +| 告警引擎 | 自研 | LLM Gateway 特有逻辑，通用方案不支持 | +| 自愈执行 | API 调用为主 | 安全可控，可审计 | +| 通知渠道 | 飞书+企微双活 | 团队使用习惯，降级链路 | +| 配置回滚 | 审计日志+完整值快照 | 状态机简单，回滚可靠性高 | diff --git a/static/openapi.json b/static/openapi.json new file mode 100644 index 0000000..15a3011 --- /dev/null +++ b/static/openapi.json @@ -0,0 +1,222 @@ +{ + "openapi": "3.0.3", + "info": { + "title": "AI-Ops API", + "version": "1.0.0", + "description": "AI-Ops 智能运维平台 API" + }, + "servers": [ + {"url": "http://localhost:8080", "description": "Local development"} + ], + "paths": { + "/api/v1/ai-ops/login": { + "post": { + "summary": "用户登录", + "requestBody": { + "required": true, + "content": { + "application/json": { + "schema": { + "type": "object", + "properties": { + "username": {"type": "string"}, + "password": {"type": "string"} + } + } + } + 
} + }, + "responses": { + "200": {"description": "Login success"}, + "400": {"description": "Bad request"} + } + } + }, + "/api/v1/ai-ops/metrics/realtime": { + "get": { + "summary": "实时指标", + "security": [{"bearerAuth": []}], + "responses": {"200": {"description": "OK"}} + } + }, + "/api/v1/ai-ops/metrics/query": { + "get": { + "summary": "指标下钻查询", + "security": [{"bearerAuth": []}], + "parameters": [ + {"name": "service", "in": "query", "schema": {"type": "string"}}, + {"name": "path", "in": "query", "schema": {"type": "string"}}, + {"name": "supplier", "in": "query", "schema": {"type": "string"}} + ], + "responses": {"200": {"description": "OK"}} + } + }, + "/api/v1/ai-ops/metrics/suppliers/count": { + "get": { + "summary": "活跃供应商数量", + "security": [{"bearerAuth": []}], + "responses": {"200": {"description": "OK"}} + } + }, + "/api/v1/ai-ops/alerts/open/count": { + "get": { + "summary": "未关闭告警数量", + "security": [{"bearerAuth": []}], + "responses": {"200": {"description": "OK"}} + } + }, + "/api/v1/ai-ops/logs": { + "get": { + "summary": "日志查询", + "security": [{"bearerAuth": []}], + "parameters": [ + {"name": "service", "in": "query", "schema": {"type": "string"}}, + {"name": "path", "in": "query", "schema": {"type": "string"}}, + {"name": "status_code", "in": "query", "schema": {"type": "integer"}}, + {"name": "start_time", "in": "query", "schema": {"type": "string"}}, + {"name": "end_time", "in": "query", "schema": {"type": "string"}}, + {"name": "page", "in": "query", "schema": {"type": "integer", "default": 1}}, + {"name": "page_size", "in": "query", "schema": {"type": "integer", "default": 100}} + ], + "responses": {"200": {"description": "OK"}} + } + }, + "/api/v1/ai-ops/logs/export": { + "get": { + "summary": "日志导出 CSV", + "security": [{"bearerAuth": []}], + "responses": {"200": {"description": "CSV file"}} + } + }, + "/api/v1/ai-ops/rules": { + "get": { + "summary": "告警规则列表", + "security": [{"bearerAuth": []}], + "responses": {"200": {"description": 
"OK"}} + }, + "post": { + "summary": "创建规则", + "security": [{"bearerAuth": []}], + "responses": {"201": {"description": "Created"}} + } + }, + "/api/v1/ai-ops/rules/{id}": { + "get": { + "summary": "获取规则详情", + "security": [{"bearerAuth": []}], + "parameters": [{"name": "id", "in": "path", "required": true, "schema": {"type": "string"}}], + "responses": {"200": {"description": "OK"}} + }, + "put": { + "summary": "更新规则", + "security": [{"bearerAuth": []}], + "parameters": [{"name": "id", "in": "path", "required": true, "schema": {"type": "string"}}], + "responses": {"200": {"description": "OK"}} + }, + "delete": { + "summary": "删除规则", + "security": [{"bearerAuth": []}], + "parameters": [{"name": "id", "in": "path", "required": true, "schema": {"type": "string"}}], + "responses": {"204": {"description": "No Content"}} + } + }, + "/api/v1/ai-ops/alerts": { + "get": { + "summary": "告警事件列表", + "security": [{"bearerAuth": []}], + "parameters": [ + {"name": "status", "in": "query", "schema": {"type": "string"}}, + {"name": "page", "in": "query", "schema": {"type": "integer"}}, + {"name": "page_size", "in": "query", "schema": {"type": "integer"}} + ], + "responses": {"200": {"description": "OK"}} + } + }, + "/api/v1/ai-ops/channels": { + "get": { + "summary": "通知渠道列表", + "security": [{"bearerAuth": []}], + "responses": {"200": {"description": "OK"}} + }, + "post": { + "summary": "创建渠道", + "security": [{"bearerAuth": []}], + "responses": {"201": {"description": "Created"}} + } + }, + "/api/v1/ai-ops/channels/{id}": { + "get": { + "summary": "获取渠道详情", + "security": [{"bearerAuth": []}], + "parameters": [{"name": "id", "in": "path", "required": true, "schema": {"type": "string"}}], + "responses": {"200": {"description": "OK"}} + }, + "put": { + "summary": "更新渠道", + "security": [{"bearerAuth": []}], + "parameters": [{"name": "id", "in": "path", "required": true, "schema": {"type": "string"}}], + "responses": {"200": {"description": "OK"}} + }, + "delete": { + "summary": "删除渠道", 
+ "security": [{"bearerAuth": []}], + "parameters": [{"name": "id", "in": "path", "required": true, "schema": {"type": "string"}}], + "responses": {"204": {"description": "No Content"}} + } + }, + "/api/v1/ai-ops/audits": { + "get": { + "summary": "审计日志列表", + "security": [{"bearerAuth": []}], + "parameters": [ + {"name": "object_type", "in": "query", "schema": {"type": "string"}}, + {"name": "object_id", "in": "query", "schema": {"type": "string"}}, + {"name": "page", "in": "query", "schema": {"type": "integer"}}, + {"name": "page_size", "in": "query", "schema": {"type": "integer"}} + ], + "responses": {"200": {"description": "OK"}} + } + }, + "/api/v1/ai-ops/audits/{id}/rollback": { + "post": { + "summary": "配置回滚", + "security": [{"bearerAuth": []}], + "parameters": [{"name": "id", "in": "path", "required": true, "schema": {"type": "string"}}], + "responses": {"200": {"description": "OK"}} + } + }, + "/health": { + "get": { + "summary": "健康检查", + "responses": {"200": {"description": "OK"}} + } + }, + "/actuator/health": { + "get": { + "summary": "健康检查", + "responses": {"200": {"description": "OK"}} + } + }, + "/actuator/health/live": { + "get": { + "summary": "Liveness probe", + "responses": {"200": {"description": "UP"}} + } + }, + "/actuator/health/ready": { + "get": { + "summary": "Readiness probe", + "responses": {"200": {"description": "UP"}, "503": {"description": "DOWN"}} + } + } + }, + "components": { + "securitySchemes": { + "bearerAuth": { + "type": "http", + "scheme": "bearer", + "bearerFormat": "JWT" + } + } + } +} diff --git a/tech/DEPLOYMENT.md b/tech/DEPLOYMENT.md new file mode 100644 index 0000000..a281fce --- /dev/null +++ b/tech/DEPLOYMENT.md @@ -0,0 +1,175 @@ +# AI-Ops 部署设计 + +> 版本：v1.0 | 状态：初稿 + +--- + +## 1. 
部署架构 + +### 1.1 总体架构 + +``` +├── Load Balancer (Nginx / 云 CLB) + │ + ├── AI-Ops API Server x 2 (主备) + │ │ + │ ├── HTTP API (标准库 net/http) + │ └── WebSocket (告警推送) + │ + ├── AI-Ops Worker x 2 (后台任务) + │ │ + │ ├── 指标采集器 + │ ├── 告警评估器 + │ ├── 自愈执行器 + │ └── 审计清理器 + │ + └── 共享层 + │ + ├── PostgreSQL 15+ (主库 + 备库) + ├── Redis (缓存 + 会话 + 锁) + ├── Prometheus (时序数据) + └── Grafana (监控可视化) +``` + +### 1.2 容器化部署 + +使用 Docker Compose 或 Kubernetes： + +```yaml +# docker-compose.yml 抽象 +services: + ai-ops-api: + image: ai-ops:latest + command: ["./ai-ops", "api"] + replicas: 2 + ports: + - "8080:8080" + environment: + - DB_HOST=postgres + - REDIS_HOST=redis + - PROMETHEUS_HOST=prometheus + + ai-ops-worker: + image: ai-ops:latest + command: ["./ai-ops", "worker"] + replicas: 2 + environment: + - DB_HOST=postgres + - REDIS_HOST=redis + - PROMETHEUS_HOST=prometheus + + postgres: + image: postgres:15 + volumes: + - pg_data:/var/lib/postgresql/data + + redis: + image: redis:7 + + prometheus: + image: prom/prometheus:latest + volumes: + - ./prometheus.yml:/etc/prometheus/prometheus.yml + + grafana: + image: grafana/grafana:latest +``` + +--- + +## 2. 资源需求 + +### 2.1 API Server + +| 资源 | 需求 | 说明 | +|------|------|------| +| CPU | 2 核 | Go 服务主要为 IO 密集型 | +| 内存 | 1 GB | 含连接池缓存 | +| 存储 | 无 | 状态外部化 | +| 网络 | 内网 100Mbps | 调用内部服务 | + +### 2.2 Worker + +| 资源 | 需求 | 说明 | +|------|------|------| +| CPU | 1 核 | 定时任务，CPU 需求低 | +| 内存 | 512 MB | | +| 存储 | 无 | | + +### 2.3 数据库 + +| 资源 | 需求 | 说明 | +|------|------|------| +| CPU | 2 核 | | +| 内存 | 4 GB | 索引与缓冲 | +| 存储 | 200 GB | 90 天审计日志 + 时序数据 | +| 网络 | 内网 1Gbps | | + +### 2.4 Prometheus + +| 资源 | 需求 | 说明 | +|------|------|------| +| CPU | 1 核 | | +| 内存 | 2 GB | | +| 存储 | 100 GB | 时序数据保留 90 天 | + +--- + +## 3. 
监控与运维钩子 + +### 3.1 健康检查 + +| 端点 | 路径 | 预期响应 | 失败行为 | +|------|------|----------|---------| +| 存活检查 | `/actuator/health/live` | HTTP 200 | 容器重启 | +| 就绪检查 | `/actuator/health/ready` | HTTP 200 | 从负载均衡移除 | +| 综合检查 | `/actuator/health` | HTTP 200 + JSON | 触发告警 | + +### 3.2 启动/关闭顺序 + +**启动顺序**: +1. PostgreSQL 启动完成 +2. Redis 启动完成 +3. Prometheus 启动完成 +4. Worker 启动（执行 migration） +5. API Server 启动 + +**关闭顺序**: +1. 停止接收新 HTTP 请求（健康检查返回非 200） +2. 等待现有请求处理完成（超时 30 秒） +3. 停止 Worker 定时器 +4. 关闭数据库连接池 +5. 退出进程 + +### 3.3 配置管理 + +- 配置文件 `config.yaml` + 环境变量覆盖。 +- 敏感字段（密钥、密码）仅通过环境变量传入，不落地配置文件。 +- 支持热更新的配置：告警规则、通知渠道。 + +--- + +## 4. 灾备设计 + +### 4.1 数据库灾备 + +| 策略 | 方案 | RTO | RPO | +|------|------|-----|-----| +| 主库故障 | 自动切换至备库 | < 5 min | < 1 min | +| 逻辑损坏 | 从备库恢复 + 审计日志回放 | < 30 min | < 1 min | +| 全库损坏 | 每日冷备份恢复 | < 2 h | < 24 h | + +### 4.2 应用层灾备 + +| 场景 | 处理 | +|------|------| +| API Server 单机故障 | 负载均衡自动移除，剩余节点继续服务 | +| Worker 单机故障 | 剩余 Worker 继续执行定时任务，某些任务可能延迟 | +| Redis 故障 | 审计日志落地 PostgreSQL，告警缓存失效不影响核心功能 | +| Prometheus 故障 | 实时指标采集中断，告警引擎依赖本地缓存继续运行 | + +### 4.3 多中心部署 + +- 当前阶段为单中心部署。 +- 备份中心仅用于数据库备份恢复，不提供活跃服务。 +- 未来扩展至多中心时，需要解决 PostgreSQL 的分布式写入和 Prometheus 的联邦查询问题。 diff --git a/tech/HLD.md b/tech/HLD.md new file mode 100644 index 0000000..70b143f --- /dev/null +++ b/tech/HLD.md @@ -0,0 +1,836 @@ +# AI-Ops 智能运维系统 — 高层设计文档 (HLD) + +> 版本：v1.0 +> 负责人：TechLead +> 目标读者：后端开发、SRE、QA +> 状态：初稿 + +--- + +## 1. 
设计目标与约束 + +### 1.1 核心目标 + +| 指标 | 基线值 | 目标值 | 验证方式 | +|------|--------|--------|---------| +| 核心故障 MTTR | >30 min | <10 min | 从告警触发到服务恢复的 P99 时长 | +| P1/P2 自动化处理覆盖率 | 0% | >=60% | 自愈成功事件数 / (P1+P2 总事件数) | +| 告警噪声率 | >20% | <5% | 误告警数 / 总告警数 | +| 配置回滚时间窗口 | 无 | <5 min | 回滚指令发出到验证通过的时长 | +| 审计日志保留期 | 无 | >=90 天 | 存储系统自动清理策略 | + +### 1.2 技术约束（强制性） + +- **语言**: Go 1.22+ +- **HTTP 框架**: 标准库 `net/http` + 自定义中间件（禁止引入 Gin/Echo） +- **数据库**: PostgreSQL 15+ ，驱动 `jackc/pgx/v5` +- **缓存**: Redis，客户端 `redis/go-redis/v9` +- **配置**: YAML + Viper，环境变量覆盖敏感字段 +- **日志/审计**: 结构化日志，审计事件模型与 supply-api/ 一致 +- **错误码**: `{SOURCE}_{CATEGORY}_{CODE}` 格式，例如 `OPS_ALT_4001` +- **健康检查**: `/actuator/health`、`/actuator/health/live`、`/actuator/health/ready` +- **测试**: Go testing + testify，覆盖率门槛 domain >= 70%、service/handler >= 80% +- **Store 接口**: 必须包含版本控制（乐观锁） +- **条件能力**: 默认关闭，需要在 `BuildServer` / `BuildRuntime` 中显式挂载才算已交付 + +### 1.3 运行模式 + +系统必须同时支持两种运行模式： + +| 模式 | 特征 | 部署方式 | +|------|------|---------| +| **独立运行** | 自有 `cmd/ai-ops/main.go`，独立数据库 schema，独立 docker-compose | `docker-compose up` 或单独容器 | +| **集成运行** | 作为 Go module 被 `gateway/` 或 `supply-api/` 引入，共享数据库连接池和配置，通过内部接口注册 | 编译时作为子模块编译，运行时挂载到立交桥主进程 | + +**集成约束**： +- 独立运行时，系统提供完整的 HTTP API 和管理后台。 +- 集成运行时，系统提供 `IntegrationPlugin` 接口，允许主程序通过配置开关启用/禁用各模块。 +- 数据库 schema 必须使用独立的 `ai_ops_` 前缀，避免与主项目表名冲突。 +- 配置文件必须支持分离加载：独立运行时读取自己的 `config.yaml`，集成运行时合并到主项目配置。 + +--- + +## 2. 
系统架构总览 + +### 2.1 逻辑架构图 + +``` ++---------------------+ +---------------------+ +---------------------+ +| 运维控制台 (Web) | | 外部系统调用者 | | 通知渠道 | +| - 监控看板 | | - NewAPI/Sub2API | | - Webhook | +| - 告警管理 |<--->| - 企业微信/飞书 |<--->| - 邮件 | +| - 日志查询 | | - Prometheus | | - 短信 | ++----------+----------+ +----------+----------+ +----------+----------+ + | | | + v v v ++---------------------+ +---------------------+ +---------------------+ +| HTTP API Layer | | /metrics (Prom) | | Notification | +| (标准库 net/http) | | /api/v1/ai-ops/ | | Dispatcher | ++----------+----------+ +----------+----------+ +----------+----------+ + | | | + v v v ++-----------------------------------------------------------------------------+ +| AI-Ops Core Domain Layer | +| +----------------+ +----------------+ +----------------+ +-----------+ | +| | Metric Service | | Alert Service | | Healing Engine | | Capacity | | +| | (指标采集/查询) | | (告警规则/触发) | | (自愈动作执行) | | Service | | +| +----------------+ +----------------+ +----------------+ +-----------+ | +| +----------------+ +----------------+ +----------------+ +-----------+ | +| | Audit Service | | Config Service | | Log Service | | Authz | | +| | (审计/回滚) | | (配置变更) | | (日志查询) | | Service | | +| +----------------+ +----------------+ +----------------+ +-----------+ | ++-----------------------------------------------------------------------------+ + | + v ++-----------------------------------------------------------------------------+ +| Infrastructure Layer | +| +----------------+ +----------------+ +----------------+ +-----------+ | +| | Metric Store | | PostgreSQL | | Redis | | Time-Series| | +| | (Prom/Victoria)| | (主审计/配置) | | (缓存/状态) | | DB | | +| +----------------+ +----------------+ +----------------+ +-----------+ | ++-----------------------------------------------------------------------------+ + | + v ++-----------------------------------------------------------------------------+ +| Bridge Integration Layer | +| +----------------+ 
+----------------+ +----------------+ +-----------+ | +| | Token Gateway | | Channel Manager| | Provider Health| | Runtime | | +| | (请求量/延迟) | | (供应商/路由) | | (健康检查) | | Status | | +| +----------------+ +----------------+ +----------------+ +-----------+ | ++-----------------------------------------------------------------------------+ +``` + +### 2.2 服务边界与职责 + +| 服务 | 职责 | 对应 PRD 场景 | 对应 AC | +|------|------|--------------|---------| +| **Metric Service** | 采集 gateway/、supply-api/、platform-token-runtime/ 的指标，提供 PromQL 查询、分钟级聚合 | A, H | AC-1, AC-2, AC-11 | +| **Alert Service** | 维护告警规则状态机，执行阈值评估，生成告警事件，负责聚合与抑制 | C, E, G | AC-3, AC-4, AC-5 | +| **Healing Engine** | 执行自愈动作：切换备用路由、限流、重启实例、触发脚本；记录执行结果 | C, D, F | AC-6 | +| **Audit Service** | 捕获所有配置变更，写入不可篡改审计日志，支持按原始操作记录回滚 | B, F, I | AC-7, AC-8 | +| **Config Service** | 管理告警规则、通知渠道、自愈策略的 CRUD，支持版本化与验证 | B, I | AC-7, AC-8 | +| **Log Service** | 按时间范围、服务、状态码、用户 ID 等维度筛选日志，支持 CSV 导出 | A, H | AC-10 | +| **Capacity Service** | 汇总过去 7 天 Token/QPS/延迟/利用率趋势，计算负载等级与增长率预测 | - | AC-9 | +| **Authz Service** | 角色鉴权：查看者/运维人员/管理员；控制台访问控制 | - | AC-12 | +| **Notification Dispatcher** | 将告警事件路由到配置的通知渠道，支持主备自动切换 | C, E | AC-4, AC-5 | + +--- + +## 3. 
核心模块设计 + +### 3.1 自动运维流水线 (AutoOps Pipeline) + +运维流水线是系统的主干，接收指标数据，经过规则引擎评估，生成告警事件，触发自愈动作，并验证效果。 + +``` +指标数据流 + | + v ++-------------------+ +-------------------+ +-------------------+ +| Metric Ingestor | --> | Rule Engine | --> | Alert Event | +| (报文解析/格式化) | | (阈值评估/分级) | | Generator | ++-------------------+ +-------------------+ +---------+---------+ + | + v ++-------------------+ +-------------------+ +-------------------+ +| Validation Loop | <-- | Healing Engine | <-- | Notification | +| (2min 效果评估) | | (自愈动作执行) | | Dispatcher | ++-------------------+ +-------------------+ +-------------------+ +``` + +**流水线状态机**： + +| 状态 | 转移条件 | 超时 | +|------|---------|------| +| `triggered` | 规则阈值被触发 | - | +| `notified` | 通知已发送 | 30s (P0/P1), 120s (P2) | +| `healing` | 自愈动作执行中 | 60s 内完成 | +| `resolved` | 监控指标恢复正常 | - | +| `escalated` | 自愈失败或未配置自愈 | 立即 | +| `acknowledged` | 人工确认 | 2h 未确认则自动升级 | + +### 3.2 健康探针 (Health Probe) + +参考 LiteLLM 的多层级健康检查设计，系统提供以下端点： + +| 端点 | 用途 | 检查内容 | 失败策略 | +|------|------|---------|---------| +| `/actuator/health` | 综合健康 | DB、Redis、时序库连接性 | 返回 503，触发内部告警 | +| `/actuator/health/live` | 存活探针 | 进程是否运行 | Kubernetes 重启 Pod | +| `/actuator/health/ready` | 就绪探针 | 所有依赖是否可服务 | 从负载均衡移除 | +| `/actuator/health/backlog` | 队列积压 | 告警事件队列长度 | >100 时触发内部告警 | +| `/actuator/health/datasource` | 数据源状态 | 最近 5min 内是否有新数据点 | 触发 P2 内部告警 | + +独立运行时，系统自身提供以上端点。集成运行时，通过 `IntegrationPlugin` 将检查逻辑注入到主程序的健康检查中。 + +### 3.3 异常自动恢复 (Healing Engine) + +自愈引擎的核心是动作执行器。每个动作是一个独立的可执行单元，支持沙盒模式验证。 + +**自愈动作类型**： + +| 动作 | 说明 | 执行时间限制 | 回退策略 | +|------|------|-----------|---------| +| `switch_provider` | 将流量从主路由切换到备用供应商 | 30s | 自动恢复原路由，升级人工告警 | +| `throttle` | 对目标服务/供应商启动限流 | 15s | 解除限流，升级人工告警 | +| `restart_service` | 重启异常服务（通过调用管理 API） | 45s | 不可回退，升级人工告警 | +| `invoke_script` | 执行用户配置的程序化脚本 | 60s | 脚本自身决定回退逻辑 | +| `isolate_node` | 将异常节点从负载均衡中移除 | 20s | 恢复节点到负载均衡 | + +**沙盒模式**： +- 所有自愈动作必须在沙盒环境中模拟触发 >=10 次，所有次数的执行结果符合预期，才能关联到生产告警规则。 +- 沙盒模式下，动作不会真正修改生产状态，而是记录 "dry-run" 结果。 
+- 每个动作的沙盒执行结果必须包含：预期变更、实际变更、差异说明、风险标记。 + +**级联故障防护**（对应 PRD 场景 F-6）： +- 每次自愈动作执行前，系统记录当前状态快照（包含相关配置版本号）。 +- 若自愈动作执行后 2min 内触发新的 P1 以上告警，系统自动检测是否为级联故障。 +- 检测到级联故障时，自动回退上一步操作，然后升级为 P0 人工告警。 + +### 3.4 规模调度与容量视图 (Capacity Board) + +容量服务不执行自动扩缩容决策（当前版本 Out of Scope），仅提供量化视图与趋势预测。 + +**容量指标**： + +| 指标 | 采集频率 | 保留时长 | 负载等级判定 | +|------|---------|---------|-----------| +| Token 消耗量 | 1 min | 7 天(原始) / 30 天(分钟级) / 90 天(小时级) | 超过日上限 80% 为警告，100% 为过载 | +| QPS | 1 min | 同上 | 超过设计值 80% 为警告，100% 为过载 | +| P99 延迟 | 1 min | 同上 | 超过 5000ms 为警告，超过 10000ms 为过载 | +| 供应商资源利用率 | 5 min | 同上 | 超过 80% 为警告，超过 95% 为过载 | + +**增长率预测算法**： +- 采用简单线性回归，基于过去 7 天的分钟级数据计算日均增长率。 +- 计算公式：`days_to_limit = (limit - current) / daily_growth`，其中 `daily_growth = (latest - earliest) / 7`。 +- ⚠️ **免责声明**：结果仅供**参考，不作为扩容决策依据**。线性回归无法捕捉季节性波动和突增流量（如大促、热点事件），实际容量规划应以人工判断为主。 +- 建议在 UI 界面上也同步显示同样免责声明，控制台显示为 "预计 X 天达到上限（仅供参考，不作为扩容决策依据）"。 + +### 3.5 知识库管理 (审计与回滚) + +审计服务是运维系统的可信基础。所有生产配置变更必须被捕获并不可篡改地存储。 + +**审计事件模型**（与 supply-api/ 审计规范一致）： + +```go +type AuditEvent struct { + EventID string `json:"event_id"` + TenantID string `json:"tenant_id"` // 工作区 ID + ObjectType string `json:"object_type"` // 例如 "alert_rule", "route_policy" + ObjectID string `json:"object_id"` + Action string `json:"action"` // "create", "update", "delete", "rollback" + BeforeState map[string]any `json:"before_state"` + AfterState map[string]any `json:"after_state"` + RequestID string `json:"request_id"` + ResultCode string `json:"result_code"` // "OK", "OPS_AUD_4001" + SourceIP string `json:"source_ip"` + ActorID string `json:"actor_id"` // 操作人 ID + CreatedAt time.Time `json:"created_at"` +} +``` + +**高风险变更检测**（对应 PRD 场景 I）： +- 对于每次配置变更，系统计算 "影响面分数"。 +- 影响面计算方式：变更后将导致被拒绝的请求占比。若估算拒绝率 > 50%，标记为高风险。 +- 高风险变更在执行前必须弹出二次确认窗口，管理员角色才能继续。 + +**回滚机制**： +- 回滚操作不是简单的 "恢复原值"，而是一个新的审计事件（Action="rollback"），生成新的版本。 +- **fail-closed 设计**：任何配置变更操作必须先完成审计日志写入（ai_ops_audits 插入），审计记录写入成功后才能执行业务操作。若审计写入失败，业务操作立即中止并返回错误。回滚时先写入回滚审计记录，再执行回滚操作，确保审计链路始终先于业务执行。 +- 
回滚前必须检查目标资源是否仍然存在。若不存在，返回错误码 `OPS_AUD_4101`。若目标已被后续修改覆盖，返回 `OPS_AUD_4102`。 +- 回滚执行前必须显示将被覆盖的子资源列表，并要求管理员二次确认。 +- 回滚必须在 60s 内完成并通过验证。 + +--- + +## 4. 数据模型设计 + +### 4.1 核心实体关系图 (ER) + +``` ++----------------+ +----------------+ +----------------+ +| ai_ops_rules |<----->| ai_ops_alerts |<----->| ai_ops_healings| ++----------------+ +----------------+ +----------------+ + | | | + v | v ++----------------+ | +----------------+ +| ai_ops_channels| | | ai_ops_snapshots| ++----------------+ | +----------------+ + | | + v v ++----------------+ +----------------+ +| ai_ops_audits | | ai_ops_configs | ++----------------+ +----------------+ + | + v ++----------------+ +| ai_ops_metrics | ++----------------+ +``` + +> **说明**：ER 图中已删除 `ai_ops_events` 和 `ai_ops_notifys` 两张表。 +> - `ai_ops_events` 的功能已被 `ai_ops_alerts` 表的状态变化（triggered→notified→healing→resolved）和 `ai_ops_healings` 表的执行记录覆盖。 +> - `ai_ops_notifys` 的功能已被 `ai_ops_channels` 表（渠道配置）以及 `ai_ops_alerts` 表的通知状态字段覆盖。 +> - `ai_ops_configs` 和 `ai_ops_snapshots` 保留在 ER 图中，将在 migration 中补齐表结构。 + +### 4.2 数据表结构 + +> **安全约束**：所有数据库交互必须使用参数化/预编译查询（prepared statements）。任何动态构建 SQL 的场景（如日志查询的模糊匹配、自定义规则的条件编译查询）必须通过应用层的 Query Builder 构建，禁止任何字符串拼接 SQL 的方式。Code Review 时必须检查所有数据库操作是否使用参数化查询。 + +#### 4.2.1 `ai_ops_rules` — 告警规则 + +| 字段 | 类型 | 约束 | 说明 | +|------|------|------|------| +| `id` | UUID | PK, 默认 gen_random_uuid() | 规则唯一标识 | +| `name` | VARCHAR(128) | NOT NULL, UNIQUE | 规则名称 | +| `metric_source` | VARCHAR(64) | NOT NULL | 指标来源：gateway/supply-api/platform-token-runtime | +| `metric_name` | VARCHAR(128) | NOT NULL | 指标名称：qps/latency_p99/error_rate/… | +| `threshold_type` | VARCHAR(16) | NOT NULL, CHECK IN ('>', '<', '=', 'regex') | 阈值类型 | +| `threshold_value` | TEXT | NOT NULL | 阈值（支持正则表达式） | +| `duration_min` | INT | NOT NULL, DEFAULT 1, CHECK >=1 | 持续触发时长（分钟） | +| `level` | VARCHAR(8) | NOT NULL, CHECK IN ('P0','P1','P2','P3') | 告警级别 | +| `channel_ids` | UUID[] | NOT NULL, DEFAULT '{}' | 关联通知渠道 ID 列表 | +| `healing_action` | 
VARCHAR(32) | DEFAULT NULL | 自愈动作类型（可选） | +| `healing_config` | JSONB | DEFAULT NULL | 自愈动作参数 | +| `is_sandboxed` | BOOLEAN | NOT NULL, DEFAULT FALSE | 是否已通过沙盒验证 | +| `enabled` | BOOLEAN | NOT NULL, DEFAULT TRUE | 是否启用 | +| `created_by` | VARCHAR(64) | NOT NULL | 创建人 | +| `created_at` | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | 创建时间 | +| `updated_at` | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | 更新时间 | +| `version` | INT | NOT NULL, DEFAULT 1 | 乐观锁版本 | + +**索引**：`CREATE INDEX idx_rules_enabled ON ai_ops_rules(enabled);` + +#### 4.2.2 `ai_ops_alerts` — 告警事件 + +| 字段 | 类型 | 约束 | 说明 | +|------|------|------|------| +| `id` | UUID | PK | 告警事件 ID | +| `rule_id` | UUID | NOT NULL, FK -> ai_ops_rules | 触发规则 | +| `level` | VARCHAR(8) | NOT NULL | 告警级别（可能升级） | +| `resource_type` | VARCHAR(64) | NOT NULL | 资源类型：service/provider/model | +| `resource_id` | VARCHAR(128) | NOT NULL | 资源标识 | +| `current_value` | TEXT | NOT NULL | 触发时的实际值 | +| `threshold_value` | TEXT | NOT NULL | 触发时的阈值 | +| `status` | VARCHAR(16) | NOT NULL, DEFAULT 'triggered' | triggered/notified/healing/resolved/escalated/acknowledged | +| `is_aggregated` | BOOLEAN | NOT NULL, DEFAULT FALSE | 是否为聚合告警 | +| `aggregated_count` | INT | DEFAULT 0 | 聚合的子告警数量 | +| `parent_alert_id` | UUID | NULL, FK -> ai_ops_alerts | 父聚合告警 ID | +| `started_at` | TIMESTAMPTZ | NOT NULL | 开始时间 | +| `resolved_at` | TIMESTAMPTZ | NULL | 解除时间 | +| `acknowledged_by` | VARCHAR(64) | NULL | 确认人 | +| `acknowledged_at` | TIMESTAMPTZ | NULL | 确认时间 | + +**索引**： +```sql +CREATE INDEX idx_alerts_status ON ai_ops_alerts(status); +CREATE INDEX idx_alerts_started_at ON ai_ops_alerts(started_at DESC); +CREATE INDEX idx_alerts_resource ON ai_ops_alerts(resource_type, resource_id); +``` + +#### 4.2.3 `ai_ops_healings` — 自愈执行记录 + +| 字段 | 类型 | 约束 | 说明 | +|------|------|------|------| +| `id` | UUID | PK | 自愈执行 ID | +| `alert_id` | UUID | NOT NULL, FK -> ai_ops_alerts | 关联告警 | +| `action_type` | VARCHAR(32) | NOT NULL | 
switch_route/throttle/restart_instance/invoke_script/isolate_node | +| `config` | JSONB | NOT NULL | 执行时的参数快照 | +| `status` | VARCHAR(16) | NOT NULL, DEFAULT 'pending' | pending/succeeded/failed/rolled_back | +| `dry_run` | BOOLEAN | NOT NULL, DEFAULT FALSE | 是否沙盒执行 | +| `result_detail` | JSONB | NULL | 执行结果详情 | +| `error_code` | VARCHAR(16) | NULL | 失败时的错误码 | +| `started_at` | TIMESTAMPTZ | NOT NULL | 开始时间 | +| `completed_at` | TIMESTAMPTZ | NULL | 完成时间 | + +#### 4.2.4 `ai_ops_channels` — 通知渠道 + +| 字段 | 类型 | 约束 | 说明 | +|------|------|------|------| +| `id` | UUID | PK | 渠道 ID | +| `name` | VARCHAR(128) | NOT NULL | 渠道名称 | +| `channel_type` | VARCHAR(32) | NOT NULL, CHECK IN ('webhook','email','feishu','wechat','sms') | 渠道类型 | +| `config` | JSONB | NOT NULL | 渠道配置（URL/密钥/接收人等） | +| `priority` | INT | NOT NULL, DEFAULT 1 | 优先级（低数 = 高优先） | +| `enabled` | BOOLEAN | NOT NULL, DEFAULT TRUE | 是否启用 | +| `created_at` | TIMESTAMPTZ | NOT NULL | 创建时间 | + +#### 4.2.5 `ai_ops_audits` — 审计日志 + +| 字段 | 类型 | 约束 | 说明 | +|------|------|------|------| +| `id` | UUID | PK | 审计事件 ID | +| `tenant_id` | VARCHAR(64) | NOT NULL | 工作区 ID | +| `object_type` | VARCHAR(64) | NOT NULL | 目标资源类型 | +| `object_id` | VARCHAR(128) | NOT NULL | 目标资源 ID | +| `action` | VARCHAR(32) | NOT NULL | create/update/delete/rollback | +| `before_state` | JSONB | NULL | 变更前状态 | +| `after_state` | JSONB | NULL | 变更后状态 | +| `request_id` | VARCHAR(64) | NOT NULL | HTTP 请求 ID | +| `result_code` | VARCHAR(16) | NOT NULL | OK 或错误码 | +| `source_ip` | VARCHAR(45) | NOT NULL | 操作人 IP | +| `actor_id` | VARCHAR(64) | NOT NULL | 操作人 ID | +| `risk_level` | VARCHAR(8) | NOT NULL, DEFAULT 'normal' | normal/high/critical | +| `parent_audit_id` | UUID | NULL, FK -> ai_ops_audits | 回滚时关联原始审计 | +| `created_at` | TIMESTAMPTZ | NOT NULL, DEFAULT NOW() | 创建时间 | + +**索引**： +```sql +CREATE INDEX idx_audits_tenant_created ON ai_ops_audits(tenant_id, created_at DESC); +CREATE INDEX idx_audits_object ON ai_ops_audits(object_type, 
object_id); +CREATE INDEX idx_audits_actor ON ai_ops_audits(actor_id, created_at DESC); +CREATE INDEX idx_audits_request ON ai_ops_audits(request_id); +``` + +#### 4.2.6 `ai_ops_metrics` — 时序指标缓存 + +该表仅在未接入独立时序数据库时作为落地缓存，主时序数据仍然推荐存储在 Prometheus/VictoriaMetrics 中。 + +| 字段 | 类型 | 约束 | 说明 | +|------|------|------|------| +| `id` | BIGSERIAL | PK | 自增 ID | +| `metric_name` | VARCHAR(128) | NOT NULL | 指标名称 | +| `labels` | JSONB | NOT NULL, DEFAULT '{}' | 标签（service/path/supplier 等） | +| `value` | DOUBLE PRECISION | NOT NULL | 指标值 | +| `recorded_at` | TIMESTAMPTZ | NOT NULL | 采集时间 | + +**索引**：`CREATE INDEX idx_metrics_name_time ON ai_ops_metrics(metric_name, recorded_at DESC);` + +**分区策略**：按 `recorded_at` 分区，每日一个分区，自动删除 > 7 天的分区。 + +### 4.3 实体关系说明 + +- **Rule -> Alert** (1:N)：一条规则在不同时间可触发多个告警事件。 +- **Alert -> Healing** (1:1)：每个告警事件最多执行一次自愈动作（失败后升级人工处理）。 +- **Alert -> Alert** (1:N, 聚合)：父告警聚合多个子告警。 +- **Audit -> Audit** (1:1, 回滚)：回滚审计记录通过 `parent_audit_id` 关联原始记录。 +- **Rule -> Channel** (N:M)：通过 `channel_ids` 数组实现多对多关系。 + +--- + +## 5. 关键流程设计 + +### 5.1 异常检测 → 诊断 → 恢复 → 验证 → 回复 + +``` + Metric Ingestor Rule Engine Alert Service Healing Engine Validation Loop + | | | | | + | 1. 推送指标数据 | | | | + |---------------------->| | | | + | | 2. 评估阈值规则 | | | + | |---------------------->| | | + | | | 3. 生成告警事件 | | + | | |--------------------->| | + | | | 4. 检查自愈配置 | | + | | |--------------------->| | + | | | | 5. 执行自愈动作 | + | | | |--------------------->| + | | | | 6. 记录执行结果 | + | | |<---------------------| | + | | | 7. 发送通知 | | + | | |------------------------------------------------>| + | | | | | 8. 2min 后验证 + | | | |<---------------------| + | | | 9a. 解除告警 | | + | | |<---------------------| | + | | | 9b. 升级人工告警 | | + | | |<---------------------| | +``` + +**流程说明**： + +1. **指标采集** (<=15s): Metric Ingestor 每 15s 拉取一次 Prometheus 数据，或通过 Pushgateway 接收推送数据。 +2. **规则评估** (<=5s): Rule Engine 对每个启用的规则评估阈值条件。触发条件时，检查是否已在当前持续时间窗口内已存在未关闭的同类告警（抑制重复触发）。 +3. 
**告警生成** (<=1s): 创建 Alert 记录，状态为 `triggered`。 +4. **自愈检查** (<=1s): 检查规则是否配置了自愈动作，且已通过沙盒验证。 +5. **自愈执行** (<=60s): 执行自愈动作，包含最多 1 次重试。 +6. **结果记录** (<=1s): 将自愈执行结果写入 Healing 表，更新 Alert 状态为 `healing`。 +7. **通知发送** (P0/P1 <=30s, P2 <=120s): Notification Dispatcher 路由到配置的通知渠道。 +8. **效果验证** (2min 后): Validation Loop 查询监控指标，检查告警条件是否仍然满足。 +9. **终态处理**: + - 9a. 若指标恢复正常，Alert 状态变为 `resolved`。 + - 9b. 若指标仍未恢复，Alert 状态变为 `escalated`，通知升级为 P0 人工告警。 + +### 5.2 告警聚合流程 + +``` +Alert Service + | + | 1. 检测到新告警 + v ++-----------+ +----------------+ +----------------+ +| 同一资源 | --> | 1min 内数量 >20 | --> | 生成集群告警 | +| 在 1min | | 条? | | (is_aggregated) | +| 内的告警 | +----------------+ +--------+-------+ ++-----------+ | + | 2. 将子告警关联到父告警 + v + +--------+-------+ + | 停止单条通知 | + | 发送，只发集群 | + +----------------+ +``` + +**聚合规则**： +- 触发条件：同一 `resource_type` + `resource_id` 在 60s 内触发 > 20 条告警。 +- 聚合行为：生成一条新的 Alert，`is_aggregated=TRUE`，`aggregated_count=N`，将所有子告警的 `parent_alert_id` 设为该聚合告警 ID。 +- 通知行为：只发送一条集群告警通知，包含涉及的规则列表和时间范围。 +- 抑制周期：同一规则同一目标在 5min 内只发送 1 次通知（除非级别升级）。 + +### 5.3 配置回滚流程 + +``` +Admin Console + | + | 1. 选择审计记录，点击回滚 + v +Audit Service + | + | 2. 检查目标资源是否存在 + v ++-----------+ +----------------+ +----------------+ +| 目标存在? | --> | 是 | --> | 显示子资源影响面 | ++-----------+ +----------------+ +--------+-------+ + | | + | 否 | 3. 管理员确认 + v v ++-----------+ +--------+-------+ +| 返回错误 | | 执行回滚 | +| OPS_AUD_ | | (BeforeState | +| 4101 | | -> current) | ++-----------+ +--------+-------+ + | + | 4. 生成新审计记录 + v + +--------+-------+ + | 验证回滚后 | + | 状态，返回结果 | + +----------------+ +``` + +--- + +## 6. 
技术选型与备选方案 + +### 6.1 时序数据库 + +| 方案 | 选择 | 理由 | 备选 | +|------|------|------|------| +| Prometheus | 推荐 | 已为 PRD 假设依赖，生态成熟，支持 PromQL，与 NewAPI/Sub2API 的 `/metrics` 集成自然 | VictoriaMetrics（更高性能，更低资源占用） | +| PostgreSQL 时序表 | 落地缓存 | 作为 Prometheus 不可用时的降级方案，保存最近 7 天原始指标 | - | + +**决策理由**： +- 主指标存储使用 Prometheus，提供 `/metrics` 端点供外部 scrape。 +- 在 PostgreSQL 中保存分钟级聚合指标（用于控制台快速查询）。 +- 若 Prometheus 丢失，系统进入只读降级模式，告警引擎依赖本地缓存持续运行。 + +### 6.2 告警状态缓存 + +| 方案 | 选择 | 理由 | 备选 | +|------|------|------|------| +| Redis + 本地内存 (DualCache) | 推荐 | 参考 LiteLLM 的 DualCache 模式，Redis 保证多实例共享状态，本地内存降低延迟 | 单纯 Redis | + +**设计细节**： +- 告警抑制状态存储在 Redis 中，TTL 为 5min。 +- 告警聚合计数器存储在 Redis 中，TTL 为 1min。 +- 本地内存作为 L1 缓存，命中失败时才访问 Redis（L2）。 + +### 6.3 告警批量处理 + +| 方案 | 选择 | 理由 | 备选 | +|------|------|------|------| +| 内存批量队列 + 定时刷盘 | 推荐 | 参考 LiteLLM CustomBatchLogger，每 10s 或队列长度 > 50 时刷盘，避免告警爆炸时的 IO 瓶颈 | 单条同步发送 | + +### 6.4 通知渠道 + +| 渠道 | 优先级 | 备份策略 | +|------|--------|---------| +| Webhook | 1 | 失败时降级到邮件 | +| 邮件 | 2 | 失败时降级到飞书/企业微信 | +| 飞书/企业微信 | 3 | 失败时降级到短信 | +| 短信 | 4 | 失败时通知 TechLead | + +--- + +## 7. 
与立交桥主系统的集成点 + +### 7.1 Token Gateway (gateway/) + +**数据提供**： +- gateway/ 需要通过 Prometheus 指标暴露以下数据： + - `gateway_requests_total` (标签: path, method, status) + - `gateway_request_duration_seconds` (标签: path, method, quantile) + - `gateway_error_rate_5xx` (标签: path) + - `gateway_degradation_hits_total` (标签: rule_id) + +**集成接口**： +- gateway/ 提供内部 HTTP 接口供 AI-Ops 调用： + - `GET /internal/gateway/health` — 查询服务健康状态 + - `GET /internal/gateway/routes` — 获取当前路由配置，用于影响面分析 + - `POST /internal/gateway/routes` — 修改路由策略（如切换供应商、限流配置），自愈动作调用 + - `GET /internal/gateway/metrics` — 获取请求量统计 + +**集成方式**： +- 独立运行时：通过配置文件 `gateway.internal_endpoint` 指定地址，使用 API Key 鉴权。 +- 集成运行时：通过 `IntegrationPlugin` 直接调用 gateway/ 的内部方法，跳过 HTTP 层。 + +### 7.2 Channel Manager (supply-api/) + +**集成接口**： +- supply-api/ 提供以下内部 HTTP 接口供 AI-Ops 调用： + - `GET /internal/supply/accounts/health` — 供应商健康状态 + - `GET /internal/supply/audit/schema` — 审计日志格式定义，确保事件格式一致 + +**审计事件对接**： +- AI-Ops 的审计事件格式与 supply-api/ 保持一致。 +- 集成运行时，可选择复用 supply-api/ 的 AuditStore 接口，或使用独立的 `ai_ops_audits` 表（推荐独立表，避免 schema 冲突）。 + +### 7.3 Platform Token Runtime + +**集成接口**： +- platform-token-runtime/ 提供以下内部 HTTP 接口供 AI-Ops 调用： + - `GET /internal/runtime/token-usage` — 获取 Token 消耗指标 + - `GET /internal/runtime/capacity` — 获取容量使用率 + +--- + +## 8. 
安全设计 + +### 8.1 角色与权限控制 (RBAC) + +| 角色 | 监控查看 | 日志查询 | 告警确认/忽略 | 告警规则管理 | 配置回滚 | 高风险变更 | +|------|---------|---------|-------------|-------------|---------|-----------| +| 查看者 (viewer) | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | +| 运维人员 (operator) | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | +| 管理员 (admin) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | + +**实现方案**： +- 独立运行时，系统自带角色表 `ai_ops_roles`。 +- 集成运行时，通过 `IntegrationPlugin` 接口从主程序获取当前用户角色，或复用主程序的 IAM 系统。 +- 每个 HTTP 请求必须经过 Authz Middleware 检查，在响应头中返回 `X-Permitted-Actions` 列表。 + +### 8.2 审计与日志安全 + +- 审计日志必须使用只读存储，禁止任何用户/管理员直接修改 `ai_ops_audits` 表。 +- 审计日志保留期 >= 90 天，通过 PostgreSQL 分区表 + 自动清理实现。 +- 敏感字段脱敏：审计日志中的 `BeforeState` / `AfterState` 包含密钥、密码时，必须通过 Sanitizer 脱敏处理。 +- 所有管理端点必须记录访问日志，包含操作人 IP、时间戳、操作类型。 + +### 8.3 数据隔离 + +- 所有数据查询必须带有 `tenant_id` / `workspace_id` 过滤条件，防止跨租户数据泄露。 +- 数据库层面使用 Row Level Security (RLS) 作为最后一道防线（可选，根据性能决策）。 + +--- + +## 9. 性能考量 + +### 9.1 并发能力 + +| 指标 | 目标值 | 验证方式 | +|------|--------|---------| +| 告警规则评估吞吐量 | >= 50 条规则 / 15s | 压力测试 | +| 并发告警处理 | >= 100 事件/s | 压力测试 | +| 控制台首页加载 | < 2s | 性能测试 | +| 日志查询首页返回 | < 3s | 性能测试 | +| 审计日志查询 | < 3s | 性能测试 | + +### 9.2 扩展性 + +- **水平扩展**：AI-Ops 服务无状态（状态存储在 Redis/PostgreSQL），可通过增加 Pod 数量水平扩展。 +- **告警引擎分片**：当规则数量 > 200 条时，可将规则按 `metric_source` 分片到不同的评估器实例。 +- **时序库扩展**：Prometheus 采用 Remote Write 到 VictoriaMetrics 或 Thanos，支持长期存储扩展。 + +### 9.3 存储估算 + +**指标数据**（以 Prometheus 为主存储）： +- 假设 10 个指标，每个指标 10 个标签组合，采集频率 15s。 +- 每天数据量: 10 * 10 * (86400/15) * 8 bytes = 4.6 MB/天 +- 7 天原始数据: ~32 MB +- 30 天分钟级聚合: ~200 MB +- 90 天小时级聚合: ~150 MB + +**审计日志**（PostgreSQL）： +- 假设每天 1000 次配置变更，每条记录平均 2 KB。 +- 每天: 2 MB +- 90 天: ~180 MB + +**告警事件**（PostgreSQL）： +- 假设每天 500 条告警，每条记录平均 1 KB。 +- 每天: 500 KB +- 90 天: ~45 MB + +**总存储估算**： +- 指标时序库：500 MB（含小时级聚合） +- PostgreSQL (审计+告警+配置): 500 MB +- Redis (状态缓存): 100 MB +总计: ~1.1 GB（无压缩），实际生产环境建议预留 5 GB 磁盘空间。 + +--- + +## 10. 
风险评估与缓解策略 + +| 风险编号 | 风险描述 | 严重级别 | 发生概率 | 缓解策略 | +|---------|---------|---------|---------|---------| +| R-1 | 自愈规则设计不当导致正常流量被截断或重定向 | 高 | 中 | 沙盒模式强制验证；高风险变更二次确认；自愈引擎支持一键关闭 | +| R-2 | 告警规则过于敏感或缺乏抑制，导致噪音爆炸 | 高 | 中 | 告警聚合机制；抑制周期 5min；噪声率监控与自动告警；间隔 2h 未确认自动升级避免麻木 | +| R-3 | 回滚操作不当导致配置状态更深层次损坏 | 中 | 低 | 回滚前显示子资源影响面；二次确认；回滚后自动验证；高风险变更二次确认 | +| R-4 | 审计日志丢失导致故障定责和合规审查受阻 | 中 | 低 | 主备双写；异步文件缓存作为降级；90 天保留期；存储监控与预警 | +| R-5 | 时序数据库全面中断 | 高 | 低 | 控制台降级为只读模式；告警引擎依赖本地缓存持续运行；PostgreSQL 落地缓存作为最后防线 | +| R-6 | 通知渠道全部失效 | 中 | 低 | 主备自动切换机制；4 层降级；最终通知 TechLead；通知失败记录保留在事件中 | + +### 10.1 威胁建模 + +| 威胁场景 | 攻击/故障路径 | 影响 | 控制措施 | 验证要求 | +|---------|---------------|------|---------|---------| +| 自愈误触发 | 错误规则或坏数据触发切流/限流/重启 | 生产流量中断、雪崩放大 | 沙盒演练、双人确认、高风险动作默认关闭、回滚快照 | 每个高风险动作必须有沙盒验证和回滚演练 | +| 告警洪泛 | 外部噪声或错误规则导致告警风暴 | 值班麻木、真实故障被淹没 | 聚合、抑制、静默窗口、升级策略、噪声率告警 | 压测和回放验证 50 条并发规则下噪声可控 | +| 越权运维操作 | 低权限用户执行回滚/规则修改/高风险变更 | 生产配置被误改 | RBAC、二次确认、审计、资源级鉴权、响应头返回 permitted actions | QA 必测 viewer/operator/admin 差异权限 | +| 审计链路失真 | 审计未先写入或被篡改 | 无法追责、回滚依据失效 | 审计先写后执行业务；审计存储防篡改；失败阻断高风险操作 | 审计写失败时高风险变更必须拒绝 | +| 外部适配层被滥用 | `/metrics`、Webhook、管理 API 适配暴露过多能力 | 信息泄露、被动放大攻击面 | 最小暴露面、签名校验、限流、只读隔离、错误码映射 | 合同测试覆盖外部接口鉴权与字段边界 | +| **LLM 模型错误输出导致配置损坏** | AI 生成的自愈配置、回滚策略、影响面分析等包含错误信息，被直接认可后执行 | 配置被误写，影响所有依赖该配置的服务 | **人在回路**：任何 LLM 生成的动作或配置必须经过人工审批，不得直接自动执行；对生成内容进行语法/语义校验；限制 LLM 的功能范围（只读/分析/推荐，不写/不执行） | QA 必测 LLM 生成配置的审批流程和拒绝逻辑 | +| **LLM 提示注入挑战** | 攻击者通过日志字段、Webhook 输入、审计查询参数等渠道注入恶意提示，诱导 LLM 生成危险配置或泄露敏感信息 | 身份认证绕过、审计信息泄露 | 对所有 LLM 输入进行输入验证和过滤；严格区分“系统提示”与“用户输入”；LLM 调用使用独立的系统角色，不能获取用户的任何权限；输出经过模板化处理 | QA 必测 LLM 输入渠道的注入防御 | + +### 10.2 设计阶段门控结论 + +**结论：REQUEST_CHANGES（已转化为以下行动项）** + +**已完成的修复：** +- [x] 错误码统一：PRD / HLD / INTERFACE 回滚错误码统一为 `OPS_AUD_4101` / `OPS_AUD_4102` +- [x] 低级笔误修复：测试策略中的"游戏化事务" → "编程式事务"，HLD 中的"预畈" → "预留" +- [x] 数据库 migration SQL：补齐 `tech/migrations/000001_init_schema.up.sql` / `.down.sql`，覆盖核心 6 张表 + 审计防篡改触发器 + 分区策略 +- [x] 功能清单裁剪：删除 66 条 PM 越界按钮级任务，添加 PM/Engineer 范围边界说明 + +**进入开发前必须补齐：** +- [ ] 
威胁建模验证要求转化为可执行测试： + - 每个威胁场景在 TEST_DESIGN 中必须有对应的 CI 阻断测试用例 + - 威胁：自愈误触发 → TC-6.5 沙盒模式验证、TC-6.7 级联故障回退 + - 威胁：告警洪泛 → TC-5.1 聚合测试、TC-5.3 抑制周期测试 + - 威胁：越权运维 → TC-12.1~12.3 角色权限矩阵 + - 威胁：审计链路失真 → TC-7.2 审计不可篡改、TC-7.1 审计写入时效 + - 威胁：外部适配层被滥用 → TOPS-ADP-01~03 适配层验证 +- [ ] `BuildServer` / `BuildRuntime` 显式挂载约束必须落实为 QA 的阻断检查项： + - 每个模块在 `BuildServer` 中必须有对应的 `Register()` 调用，否则 CI 失败 + - 每个条件能力在 `BuildRuntime` 中必须有对应的 `Enable()` 调用，否则 CI 失败 +- [ ] 独立运行 / 集成运行 / IntegrationPlugin / OpenAPI / 适配层要求必须进入测试阻断矩阵： + - TOPS-RUN-01~04 必须通过 CI + - TOPS-PLG-01~03 必须通过 CI + - TOPS-OAS-01~03 必须通过 CI + - TOPS-ADP-01~03 必须通过 CI +- [ ] 高风险变更必须 fail-closed： + - 影响面 > 50% 的变更在审计写入失败时必须拒绝执行，有单独的 CI 测试用例验证 + +**阻断条件（任一触发则不得进入开发）：** +- 自愈动作没有沙盒、快照与回滚闭环。 +- 审计日志不能保证先写审计再执行业务。 +- 无法证明集成模式中路由、worker、健康检查全部真实挂载。 + +--- + +## 11. 可重用的设计模式 + +| 设计模式 | 来源 | 应用场景 | +|---------|------|---------| +| **CustomBatchLogger** | LiteLLM | 告警事件批量处理，避免高并发下的 IO 瓶颈 | +| **DualCache** | LiteLLM | 告警状态缓存（内存 + Redis），确保告警可靠性 | +| **DigestEntry** | LiteLLM | 告警聚合，避免滥发 | +| **AlertType + AlertTypeConfig** | LiteLLM | 可扩展的告警类型系统，支持按类型配置不同策略 | +| **OutageModel + ProviderRegionOutageModel** | LiteLLM | 故障状态机，支持模型级和区域级故障检测 | +| **Cooldown 机制** | LiteLLM | 故障部署自动移除，作为自愈动作的一种 | +| **FreeRide SupplierChain** | FreeRide (OpenClaw) | 供应商多级 Fallback 链 + 冷却期，防止震荡 | +| **SupplierProbe + ELOHistory** | FreeRide (OpenClaw) | 供应商探针定时任务 + 质量趋势记录 | +| **Repository + Service + Handler** | Bridge 主项目 | 分层架构，领域层定义接口，应用层实现业务逻辑，HTTP 层处理协议转换 | +| **Optimistic Locking** | supply-api/ | 配置变更时防止并发覆盖，Store 接口必须包含 expectedVersion | +| **Circuit Breaker** | 行业实践 | 自愈动作执行失败时，避免连续重试导致级联故障 | +| **Snapshot + Rollback** | 行业实践 | 自愈动作执行前记录状态快照，支持自动回退 | + +--- + +## X 技术选型（前端） + +### 前端技术栈 +- **框架**：React 18+（或与 gateway 现有前端保持一致） +- **组件库**：Tailwind CSS + Headless UI（或现有 UI 框架） +- **图表**：ECharts 5.x（已在功能清单中使用） +- **构建工具**：Vite +- **状态管理**：React Query（用于 API 数据获取和缓存） + +### 前端工作范围 +- 监控首页（6 个指标卡片 + 实时刷新） +- 指标下钻页（ECharts 趋势图 + 维度筛选） +- 日志查询页（表格 + 分页 + 导出） 
+- 告警规则管理页（CRUD 表单） +- 告警事件列表页（状态 Tab + 集群聚合） +- 配置审计与回滚页 +- 容量看板（多图表 + 预测卡片） + +### 约束 +- 前端不做后端逻辑，所有数据通过 `/api/v1/ai-ops/` REST 接口获取 +- 前端与后端通过 JWT Token 认证，Token 由后端签发 + +--- + +## 12. 技术栈与集成约束 + +### 12.1 统一技术栈 +本项目必须与立交桥主项目保持一致： +- **语言**: Go 1.22+ +- **HTTP框架**: 标准库 `net/http` + 自定义中间件（禁止引入 Gin/Echo 等第三方框架，保持与 gateway/ 和 supply-api/ 的一致性） +- **数据库**: PostgreSQL 15+ ，驱动 `jackc/pgx/v5` +- **缓存**: Redis，客户端 `redis/go-redis/v9` +- **配置**: YAML + Viper，环境变量覆盖敏感字段 +- **日志/审计**: 结构化日志，审计事件模型与 supply-api/ 一致 +- **错误码**: `{SOURCE}_{CATEGORY}_{CODE}` 格式，例如 `OPS_ALT_4001` +- **健康检查**: `/actuator/health` 、 `/actuator/health/live` 、 `/actuator/health/ready` +- **测试**: Go testing + testify，覆盖率门槛 domain ≥ 70%、service/handler ≥ 80% + +### 12.2 独立运行与集成运行 +本系统必须同时支持两种运行模式： + +| 模式 | 特征 | 部署方式 | 适用场景 | +|------|------|---------|---------| +| **独立运行** | 自有 `cmd/ai-ops/main.go`，独立数据库 schema，独立 docker-compose | `docker-compose up` 或单独容器 | 外部用户只需要运维能力，不想接入立交桥全套 | +| **集成运行** | 作为 Go module 被 `gateway/` 或 `supply-api/` 引入，共享数据库连接池和配置，通过内部接口注册 | 编译时作为子模块编译，运行时挂载到立交桥主进程 | 立交桥用户希望获得一体化运维能力 | + +**集成约束**: +- 独立运行时，系统必须提供完整的 HTTP API 和管理后台。 +- 集成运行时，系统必须提供 `IntegrationPlugin` 接口，允许主程序通过配置开关启用/禁用各模块。 +- 数据库 schema 必须使用独立的 `ai_ops_` 前缀，避免与主项目表名冲突。 +- 配置文件必须支持分离加载：独立运行时读取自己的 `config.yaml`，集成运行时合并到主项目配置。 + +### 12.3 NewAPI / Sub2API 适配支持 +本系统的核心能力必须能够对接 NewAPI 和 Sub2API 系统： +- **监控数据推送**: 提供 Prometheus 格式的 `/metrics` 接口，NewAPI/Sub2API 可通过 Prometheus scrape 获取运维数据。 +- **告警回调**: 支持 Webhook 告警通知，NewAPI/Sub2API 可配置接收本系统的告警事件。 +- **自愈脚本扩展**: 自愈动作中的“触发程序化脚本”支持调用 NewAPI/Sub2API 的管理 API（如切换供应商、限流配置、重启实例）。 +- **独立部署时**: 通过配置文件指定 NewAPI/Sub2API 的管理端点地址和鉴权信息，本系统通过适配层与之交互。 +- **集成部署时**: 若立交桥 gateway/ 已接入 NewAPI/Sub2API，本系统通过 gateway/ 的内部路由接口操作上游状态。 + +### 12.4 对外接口契约 +- 必须提供 OpenAPI 3.0 接口文档，确保 NewAPI/Sub2API 开发者可以独立接入。 +- 接口路径前缀默认为 `/api/v1/ai-ops/`，集成运行时可通过配置改为 `/internal/ai-ops/`。 + +--- + +## 13. 
变更日志 + +| 版本 | 日期 | 修改人 | 内容 | +|------|------|--------|------| +| v1.0 | 2026-04-27 | TechLead | 初稿：完成系统架构、模块设计、数据模型、流程设计、技术选型、集成点、安全、性能、风险、设计模式 | + +--- + +## 附录 Y：参考文档与外部依赖 + +| 参考项目 | 版本/日期 | URL | 用途 | +|---------|---------|-----|------| +| LiteLLM | v1.40.0 (2026-03) | https://docs.litellm.ai/ | 模型接口标准化、健康检查设计 | +| Sub2API | main分支 (2026-04) | https://github.com/WeI-Shaw/sub2api | 公告系统、用户体系参考 | +| Intercom | - | https://www.intercom.com/ | 客服体验对标 | +| Prometheus | 3.x (2026-Q1) | https://prometheus.io/ | 时序数据存储 | +| VictoriaMetrics | 1.100.x (2026-Q1) | https://victoriametrics.com/ | 时序数据备选存储 | +| Playwright | 1.50.x (2026-Q1) | https://playwright.dev/ | 浏览器自动化 | +| Qdrant | 1.12.x (2026-Q1) | https://qdrant.tech/ | 向量数据库备选 | +| PGVector | 0.8.x (2026-Q1) | https://github.com/pgvector/pgvector | PostgreSQL向量扩展 | + +注：以上版本号为评审时（2026-04-28）的最新稳定版，随着项目开发应定期更新。 diff --git a/tech/INTERFACE.md b/tech/INTERFACE.md new file mode 100644 index 0000000..6085600 --- /dev/null +++ b/tech/INTERFACE.md @@ -0,0 +1,387 @@ +# AI-Ops 核心接口设计 + +> 版本：v1.0 | 状态：初稿 + +--- + +## 1. 
内部模块间接口 + +### 1.1 MetricService + +```go +type MetricService interface { + // 采集指标 + Collect(ctx context.Context, source string, metrics []MetricPoint) error + // 查询时序数据 + Query(ctx context.Context, req MetricQueryRequest) (*MetricQueryResult, error) + // 获取最新值 + GetLatest(ctx context.Context, source, metricName string) (*MetricPoint, error) + // 存储保留期检查 + PurgeExpired(ctx context.Context, before time.Time) (int64, error) +} + +type MetricPoint struct { + Source string + Name string + Value float64 + Tags map[string]string + Timestamp time.Time +} + +type MetricQueryRequest struct { + Source string + Name string + StartTime time.Time + EndTime time.Time + Interval time.Duration // 聚合间隔 + Tags map[string]string +} + +type MetricQueryResult struct { + Points []MetricPoint +} +``` + +### 1.2 AlertService + +```go +type AlertService interface { + // 规则 CRUD + CreateRule(ctx context.Context, rule AlertRule) (*AlertRule, error) + UpdateRule(ctx context.Context, rule AlertRule) (*AlertRule, error) + DeleteRule(ctx context.Context, ruleID string) error + GetRule(ctx context.Context, ruleID string) (*AlertRule, error) + ListRules(ctx context.Context, filter RuleFilter) ([]AlertRule, error) + + // 告警事件管理 + ListAlerts(ctx context.Context, filter AlertFilter) ([]AlertEvent, error) + Acknowledge(ctx context.Context, alertID, actorID string) error + Ignore(ctx context.Context, alertID, actorID string) error + Escalate(ctx context.Context, alertID, reason string) error + + // 实时评估 + Evaluate(ctx context.Context, ruleID string) (*AlertEvent, error) +} + +type AlertRule struct { + ID string + Name string + MetricSource string + MetricName string + ThresholdType string // > < = regex + ThresholdValue string + DurationMin int + Level string // P0 P1 P2 P3 + ChannelIDs []string + HealingAction *string + HealingConfig map[string]any + IsSandboxed bool + Enabled bool + Version int +} + +type AlertEvent struct { + ID string + RuleID string + Level string + ResourceType string + 
+	ResourceID      string
+	CurrentValue    string
+	ThresholdValue  string
+	Status          string // triggered notified healing resolved escalated acknowledged
+	IsAggregated    bool
+	AggregatedCount int
+	CreatedAt       time.Time
+	UpdatedAt       time.Time
+}
+```
+
+### 1.3 HealingService
+
+```go
+type HealingService interface {
+	// Execute runs a healing action against the target resource.
+	Execute(ctx context.Context, action HealingAction, target ResourceTarget) (*HealingResult, error)
+	// ListActions returns the available healing actions.
+	ListActions(ctx context.Context) []HealingActionMeta
+	// Rollback reverts a previously executed healing action.
+	Rollback(ctx context.Context, executionID string) error
+	// ListExecutions queries the execution history.
+	ListExecutions(ctx context.Context, filter ExecutionFilter) ([]HealingExecution, error)
+}
+
+type HealingAction struct {
+	Type   string // restart_instance switch_route throttle isolate_node invoke_script
+	Config map[string]any
+}
+
+type ResourceTarget struct {
+	Type string // service provider model
+	ID   string
+}
+
+type HealingResult struct {
+	ExecutionID string
+	Success     bool
+	BeforeState map[string]any
+	AfterState  map[string]any
+	Error       *string
+	ExecutedAt  time.Time
+}
+```
+
+### 1.4 AuditService
+
+```go
+type AuditService interface {
+	// Record writes an audit event.
+	Record(ctx context.Context, event AuditEvent) error
+	// Query searches the audit log.
+	Query(ctx context.Context, filter AuditFilter) ([]AuditEvent, error)
+	// Rollback reverts the change recorded by the given audit event.
+	Rollback(ctx context.Context, eventID string, actorID string) (*AuditEvent, error)
+	// CalculateImpact estimates the impact of applying proposedState to an object.
+	CalculateImpact(ctx context.Context, objectType, objectID string, proposedState map[string]any) (*ImpactReport, error)
+}
+
+type AuditEvent struct {
+	EventID     string
+	TenantID    string
+	ObjectType  string
+	ObjectID    string
+	Action      string // create update delete rollback
+	BeforeState map[string]any
+	AfterState  map[string]any
+	RequestID   string
+	ResultCode  string
+	SourceIP    string
+	ActorID     string
+	CreatedAt   time.Time
+}
+
+type ImpactReport struct {
+	RiskLevel           string  // low medium high
+	EstimatedRejectRate float64 // estimated reject rate
+	AffectedResources   []string
+	RequiresConfirm     bool
+}
+```
+
+### 1.5 
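CapacityService
+
+```go
+type CapacityService interface {
+	// GetDashboard returns the capacity view for a scope.
+	GetDashboard(ctx context.Context, scope CapacityScope) (*CapacityDashboard, error)
+	// PredictGrowth forecasts the growth rate of a metric over the horizon.
+	PredictGrowth(ctx context.Context, metric string, horizon time.Duration) (*GrowthPrediction, error)
+	// SetThreshold sets the capacity threshold of a metric.
+	SetThreshold(ctx context.Context, metric string, threshold float64) error
+}
+
+type CapacityDashboard struct {
+	Metrics     []CapacityMetric
+	Predictions []GrowthPrediction
+	LastUpdated time.Time
+}
+
+type CapacityMetric struct {
+	Name        string
+	Current     float64
+	Limit       float64
+	Unit        string
+	Utilization float64
+}
+
+type GrowthPrediction struct {
+	Metric      string
+	DailyGrowth float64
+	DaysToLimit *int // nil means the limit will not be reached
+}
+```
+
+### 1.6 IntegrationPlugin
+
+`IntegrationPlugin` is the core interface used when AI-Ops runs integrated with the 立交桥 (Bridge) host project (gateway/supply-api). Each AI-Ops module implements this interface, and the host process mounts the module's capabilities into itself through it.
+
+```go
+// IntegrationPlugin is the contract every AI-Ops module must implement in integrated mode.
+// Note: a module must register itself in the global registry via an explicit import + init,
+// and the host program must explicitly Enable it in configuration before it is activated.
+type IntegrationPlugin interface {
+	// Name returns the module's unique identifier, used to bind configuration and tag logs.
+	// Examples: "alert", "healing", "audit", "capacity"
+	Name() string
+
+	// Init runs once when the module is enabled.
+	// Responsibilities: connect to the database, initialize caches, start background workers, etc.
+	// If Init fails the module must not start; the host logs the error and keeps starting other modules.
+	Init(ctx context.Context, cfg Config) error
+
+	// RegisterRoutes registers the module's HTTP endpoints on the host's ServeMux.
+	// Paths must be prefixed with /internal/ai-ops/{module}/
+	// Examples: /internal/ai-ops/alert/rules, /internal/ai-ops/healing/actions
+	RegisterRoutes(mux *http.ServeMux) error
+
+	// HealthChecks returns the module's health check functions.
+	// The host aggregates all modules' checks into /actuator/health and /actuator/health/ready.
+	HealthChecks() []HealthCheckFunc
+
+	// Shutdown is called in LIFO order when the host exits.
+	// Responsibilities: close database connections, stop workers, release resources.
+	// Hard limit of 30 seconds; the module is force-terminated after that.
+	Shutdown(ctx context.Context) error
+}
+
+// HealthCheckFunc is the signature of a health check function.
+type HealthCheckFunc func(ctx context.Context) (name string, status string, detail string)
+
+// registry is the global module registry; registryMu makes access to it thread-safe.
+var registry = make(map[string]IntegrationPlugin)
+var registryMu sync.RWMutex
+
+// Register 在 init() 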
中调用，将模块注册到全局注册表
+func Register(p IntegrationPlugin) {
+	registryMu.Lock()
+	defer registryMu.Unlock()
+	if _, exists := registry[p.Name()]; exists {
+		panic("duplicate plugin registration: " + p.Name())
+	}
+	registry[p.Name()] = p
+}
+
+// GetRegisteredPlugins returns a copy of the list of registered modules.
+// Go map iteration order is unspecified, so callers must not rely on the order of the result.
+func GetRegisteredPlugins() []IntegrationPlugin {
+	registryMu.RLock()
+	defer registryMu.RUnlock()
+	result := make([]IntegrationPlugin, 0, len(registry))
+	for _, p := range registry {
+		result = append(result, p)
+	}
+	return result
+}
+```
+
+**Registration and usage example**:
+
+```go
+package alert
+
+import (
+	"context"
+	"net/http"
+
+	aiops "github.com/company/ai-ops"
+)
+
+func init() {
+	// Explicitly register with the global registry.
+	aiops.Register(&AlertPlugin{})
+}
+
+type AlertPlugin struct{ /* ... */ }
+
+// Method bodies are elided; each returns a zero value so the example compiles.
+func (p *AlertPlugin) Name() string { return "alert" }
+func (p *AlertPlugin) Init(ctx context.Context, cfg aiops.Config) error { /* ... */ return nil }
+func (p *AlertPlugin) RegisterRoutes(mux *http.ServeMux) error { /* ... */ return nil }
+func (p *AlertPlugin) HealthChecks() []aiops.HealthCheckFunc { /* ... */ return nil }
+func (p *AlertPlugin) Shutdown(ctx context.Context) error { /* ... */ return nil }
+```
+
+**Key constraints**:
+1. **Explicit Enable**: a module must be explicitly enabled in the host's configuration file and is off by default. Example: `ai_ops.alert.enabled: true`.
+2. **Uniform route prefix**: every registered route must be prefixed with `/internal/ai-ops/` to avoid clashing with host paths.
+3. **Uniform database prefix**: tables created by a plugin must use the `ai_ops_` prefix to avoid schema conflicts.
+4. **Health check injection**: the host must aggregate each plugin's HealthChecks into /actuator/health and /actuator/health/ready.
+5. **Ordered shutdown**: on exit the host must call each plugin's Shutdown in last-in, first-out (LIFO) order.
+
+---
+
+## 2. 
External system integration interfaces
+
+### 2.1 Integration with Bridge Gateway
+
+| Operation | Endpoint | Request | Response | Notes |
+|------|------|------|------|------|
+| Query service status | `GET /internal/gateway/health` | - | `{"status":"up","services":{}}` | Queries the health of each service during diagnosis |
+| Get routing policy | `GET /internal/gateway/routes` | - | `{"routes":[]}` | Reads the current routing configuration for impact analysis |
+| Modify routing policy | `POST /internal/gateway/routes` | `{"action":"switch_route","target":"","config":{}}` | `{"success":true}` | Called by healing actions; must be audited |
+| Get request volume statistics | `GET /internal/gateway/metrics` | `?metric=qps&duration=5m` | `{"value":1234.5}` | Collects metric data |
+
+> **Security constraint**: the `/internal/gateway/metrics` endpoint is reachable only from internal IP ranges (for example 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), or with a valid service-to-service API key. Direct access from the public internet should return 403 Forbidden.
+
+### 2.2 Integration with supply-api
+
+| Operation | Endpoint | Request | Response | Notes |
+|------|------|------|------|------|
+| Query supplier status | `GET /internal/supply/accounts/health` | - | `{"accounts":[]}` | Diagnoses supplier health |
+| Get audit log schema | `GET /internal/supply/audit/schema` | - | `{"schema":{}}` | Keeps the audit event format consistent |
+
+### 2.3 Integration with platform-token-runtime
+
+| Operation | Endpoint | Request | Response | Notes |
+|------|------|------|------|------|
+| Get token consumption | `GET /internal/runtime/token-usage` | `?window=1h` | `{"total":12345,"by_model":{}}` | Collects token consumption metrics |
+| Get capacity utilization | `GET /internal/runtime/capacity` | - | `{"utilization":0.75}` | Collects capacity metrics |
+
+---
+
+## 3. 
API 接口规范 + +### 3.1 REST API 基础 + +- **基础路径**: `/api/v1/ai-ops/` +- **内部路径** (集成模式): `/internal/ai-ops/` +- **内容类型**: `application/json` +- **错误响应格式**: + ```json + { + "error_code": "OPS_{CATEGORY}_{CODE}", + "message": "人类可读的错误信息", + "detail": {} // 可选，包含额外的调试信息 + } + ``` + +### 3.2 错误码 + +| 错误码 | HTTP 状态 | 说明 | +|---------|-----------|------| +| `OPS_GEN_4001` | 400 | 请求参数错误 | +| `OPS_GEN_4002` | 401 | 未授权 | +| `OPS_GEN_4003` | 403 | 权限不足 | +| `OPS_GEN_4004` | 404 | 资源不存在 | +| `OPS_GEN_4005` | 409 | 资源冲突（如名称已存在） | +| `OPS_GEN_4006` | 413 | 请求体过大（如日志查询时间范围过大） | +| `OPS_GEN_5001` | 500 | 内部服务错误 | +| `OPS_MET_4001` | 400 | 指标名称无效 | +| `OPS_MET_4002` | 400 | 时间范围不合法 | +| `OPS_ALT_4001` | 400 | 规则名称已存在 | +| `OPS_ALT_4002` | 400 | 规则参数验证失败 | +| `OPS_ALT_4003` | 409 | 规则被其他用户修改（版本冲突） | +| `OPS_HEAL_4001` | 400 | 自愈动作参数无效 | +| `OPS_HEAL_4002` | 409 | 自愈动作正在执行中 | +| `OPS_HEAL_4003` | 400 | 回滚目标执行不存在 | +| `OPS_AUD_4001` | 403 | 无权进行审计操作 | +| `OPS_AUD_4101` | 400 | 回滚目标资源不存在 | +| `OPS_AUD_4102` | 409 | 回滚目标已被后续修改覆盖 | +| `OPS_CAP_4001` | 400 | 容量指标不存在 | + +### 3.3 分页 + +- `列表接口` 支持分页参数：`?page=1&page_size=20` +- 默认 `page_size=20`，最大 `page_size=100` +- 响应体包含：`{"items":[],"total":123,"page":1,"page_size":20}` + +### 3.4 WebSocket 接口 + +**路径**: `/ws/v1/ai-ops/alerts` + +**鉴权机制**: +- 连接建立时必须在查询参数中携带有效 JWT Token：`?token=`。 +- 服务端在升级 WebSocket 连接前必须验证 token 有效性、过期时间和角色权限。 +- token 无效或已过期时，立即返回 401 Unauthorized 并关闭连接。 +- 订阅范围根据用户角色过滤，查看者只能接收 P1 及以下级别告警，管理员可接收所有级别。 + +**功能**: +- 客户端订阅后，实时推送新告警事件。 +- 支持按级别过滤：`?levels=P0,P1`。 +- 心跳间隔 30 秒。 diff --git a/tech/QA_REVIEW_REPORT.md b/tech/QA_REVIEW_REPORT.md new file mode 100644 index 0000000..5e6a73d --- /dev/null +++ b/tech/QA_REVIEW_REPORT.md @@ -0,0 +1,129 @@ +# QA 审核报告：AI-Ops 测试设计文档 + +> 审核日期：2026-05-11 +> 审核人：QA Agent +> 审核对象：TEST_DESIGN.md / CASES.md / STRATEGY.md +> 对照基准：PRD.md (AC-01 ~ AC-12, F-01 ~ F-08) + +--- + +## 总体评级：C + +**评级依据**：测试策略框架和分层模型设计较为完整，Mock 策略、环境矩阵、灰度 Phase 规划具备可执行基础。但存在 3 项 P0 严重缺陷：AC 负向用例大面积缺失、异常流程 F-05~F-08 在 
CASES.md 中完全遗漏、CI 集成零配置。上述问题将导致测试覆盖存在盲区，且无法形成自动化门禁闭环。 + +--- + +## 优点 + +1. **测试分层模型清晰**：TEST_DESIGN.md 1.1 明确划分 Unit → Integration → E2E 三层，STRATEGY.md 补充 Chaos Test，结构合理。 +2. **Mock 策略全面**：覆盖 Prometheus、 supply-api、token-runtime、通知渠道、PostgreSQL、Redis 等全部核心外部依赖，工具选型合理（sqlmock / miniredis / gock / httptest）。 +3. **环境矩阵设计完整**：Local Dev / CI / Sandbox / Staging / Production 五层环境各有明确的用途、数据特征和外部依赖策略。 +4. **灰度 Phase 规划可落地**：Phase 1~4 的验证内容与回归集范围明确，与 PRD 发布策略对应。 +5. **发布门禁检查表（8.1）覆盖关键风险点**：独立/集成双模式验证、沙盒验证、回滚演练、权限矩阵、端到端链路验证等 8 项全部列出。 +6. **回归集分级合理**：区分快速回归集（9 条，5-10 分钟）与完整回归集（43 条，30-60 分钟），适合不同触发条件。 + +--- + +## 发现问题（按严重度分类） + +### P0 — 阻塞级（必须修复，否则无法进入开发/提测） + +| 编号 | 问题描述 | 影响 | 依据 | +|------|---------|------|------| +| P0-01 | **AC 负向测试用例大面积缺失**。12 个 AC 中至少 8 个（AC-01/02/04/05/06/09/10/11）在 CASES.md 与 TEST_DESIGN.md 中均无任何负向/异常输入用例。仅 AC-03、AC-08 有明确的 Negative 用例，AC-12 有权限越界类负向用例。 | 无法验证系统在非法输入、边界越界、权限不足、数据异常等场景下的行为，存在生产缺陷逃逸风险。 | 审核标准 #1 | +| P0-02 | **CASES.md 遗漏异常流程 F-05~F-08**。PRD 明确定义 F-01~F-08 共 8 条异常流程，CASES.md 仅覆盖 TC-E1~E4（对应 F-01~F-04），F-05（审计满盘）、F-06（级联故障）、F-07（数据库全面中断）、F-08（看板计算超时）完全缺失。 | 核心容灾与降级场景无测试用例兜底，与 PRD 6. 节要求不符。 | 审核标准 #2 | +| P0-03 | **CI 集成零配置**。STRATEGY.md 6. 
仅文字描述"PR 提交时自动触发"，未提供任何 CI 配置文件（如 .github/workflows/ci.yml）、Pipeline 阶段定义、失败通知模板、覆盖率采集与阻断逻辑。 | 无法形成自动化质量门禁，所有覆盖率/通过率要求沦为纸面标准。 | 审核标准 #6 | +| P0-04 | **性能压测方法过于简略，无执行载体**。TEST_DESIGN.md 9.1 虽列出 k6 并发用户数，但未提供 k6 脚本、压测环境规格（CPU/内存/DB 实例）、数据量基准、P99 计算方式、持续时间。"单次告警触发计时"未说明计时起点/终点和采样次数。 | 性能基准无法复现和验证，灰度门禁中"性能基准测试通过"无法判定。 | 审核标准 #8 | + +### P1 — 高优先级（强烈建议修复，否则提测后返工风险高） + +| 编号 | 问题描述 | 影响 | 依据 | +|------|---------|------|------| +| P1-01 | **覆盖率门槛缺少验证机制**。文档多次声明 domain ≥70%、service/handler ≥80%，但未说明：使用 `go test -coverprofile` 还是第三方工具、CI 中如何解析并阻断未达标 PR、覆盖率报告存储位置、增量覆盖率是否校验。 | 覆盖率目标无法自动 enforce，开发者可能随时跌破门槛。 | 审核标准 #4 | +| P1-02 | **混沌测试（Chaos Test）无具体用例设计**。STRATEGY.md 提到 chaos-mesh / 自定义脚本和三类故障（单机故障、网络分区、主从切换），但 TEST_DESIGN.md 与 CASES.md 中均未设计任何 Chaos 用例（无 Given-When-Then、无验证点、无预期行为）。 | 混沌测试 layer 有名无实，无法验证系统韧性。 | 审核标准 #3 | +| P1-03 | **测试数据管理策略缺关键细节**。STRATEGY.md 提到 `test/fixtures/` 和"自洁"，但未给出：fixtures 目录结构规范、大数据量（如 10000 条审计日志）的生成脚本、敏感数据脱敏方法、不同测试并行时的数据隔离策略。 | 大数据量性能用例和 E2E 用例可能因数据准备不足而无法稳定执行。 | 审核标准 #7 | +| P1-04 | **灰度门禁缺少自动化判定脚本**。TEST_DESIGN.md 5.2 列出 6 项检查项，但均为人工勾选（`- [ ]`），未说明每项如何自动采集结果（如覆盖率报告解析、沙盒验证次数统计、安全扫描工具输出格式）。 | Phase 升级依赖人工审核，效率低且易遗漏。 | 审核标准 #5 | +| P1-05 | **安全扫描工具与阈值未指定**。灰度门禁和发布门禁均提到"安全扫描通过（无高危漏洞）"，但未指定扫描工具（Trivy / Snyk / Gosec）、漏洞等级定义、扫描时机（CI / 镜像构建 / 发布前）。 | 安全门禁无法执行。 | 审核标准 #5 | +| P1-06 | **E2E 测试缺少详细场景设计**。STRATEGY.md 提到"自定义 Go E2E 框架"和"前端流程测试"，但 TEST_DESIGN.md / CASES.md 中无任何 E2E 级别的 Given-When-Then 用例（如完整链路：模拟指标异常 → 告警触发 → 通知发送 → 自愈执行 → 事件记录）。 | E2E 覆盖率无法评估。 | 审核标准 #3 | + +### P2 — 一般优化（建议修复，提升可维护性） + +| 编号 | 问题描述 | 影响 | +|------|---------|------| +| P2-01 | **用例编号风格不统一**。TEST_DESIGN.md 使用 `TC-01-01`，CASES.md 使用 `TC-1.1`，同一项目内两种命名规范，易导致用例追溯混乱。 | +| P2-02 | **CASES.md TC-E2 与 PRD 描述不一致**。CASES.md 写"模拟 Webhook 8xx"，PRD F-2 写"Webhook 8xx/5xx"，遗漏 5xx 场景。 | +| P2-03 | **AC-06 自愈缺少负向/非法配置用例**。如：配置不存在的自愈动作类型、自愈脚本权限不足、沙盒模式未通过却尝试生产执行等。 | +| P2-04 | **AC-10 日志查询缺少负向用例**。如：超大时间范围查询、非法正则过滤、无权限访问其他服务日志等。 | +| P2-05 | **测试通过标准（TEST_DESIGN.md 1.2）中"告警噪声率 
≤1%"和"自愈误触发 0 次"缺少测量方法**。未说明沙盒测试的样本量、统计周期、噪声率计算公式。 | + +--- + +## 改进建议 + +### 立即行动（进入开发前必须完成） + +1. **补齐 AC 负向用例** + - AC-01：增加"未登录访问首页返回 401"、"非法时间范围参数返回 400"。 + - AC-02：增加"下钻不存在的 service 返回空结果/404"、"超大时间范围返回 413/截断"。 + - AC-04：增加"通知渠道全部失效时记录失败并触发内部告警"、"非法事件 ID 查询返回 404"。 + - AC-05：增加"聚合阈值设置为 0 或负数时的校验拒绝"。 + - AC-06：增加"沙盒未通过时禁止关联生产规则"、"自愈动作类型非法返回 400"。 + - AC-09：增加"容量主板数据源丢失时展示降级提示"。 + - AC-10：增加"导出超过 10000 条时返回 413 或分批"。 + - AC-11：增加"查询已清理数据返回空并提示保留策略"。 + +2. **在 CASES.md 中补全 F-05~F-08** + - TC-E5：模拟审计磁盘满，验证丢弃非关键字段/异步上报且业务不阻断。 + - TC-E6：模拟自愈切换导致新故障，验证自动回退 + P0 升级。 + - TC-E7：模拟时序库全面中断，验证控制台只读 + 告警引擎缓存运行。 + - TC-E8：模拟看板查询超时，验证显示上次成功结果 + 时间戳标注。 + +3. **提供 CI 配置文件** + - 创建 `.github/workflows/ci.yml`（或对应平台配置），至少包含： + - Go 版本声明（1.22+） + - `go test -race -coverprofile=coverage.out ./...` + - 覆盖率解析步骤（如使用 `gocov` 或自定义脚本检查 domain ≥70%、service ≥80%） + - 未达标时 PR 阻断（exit 1） + - 测试失败通知 TechLead / QA 的机制（如 Slack / 邮件 Webhook） + - 每日定时 E2E / 每周 Chaos 的 workflow 文件 + +4. **输出可执行的性能压测资产** + - 提供 `test/perf/` 目录，包含： + - `dashboard_k6.js`：50 并发首页加载压测脚本 + - `drilldown_k6.js`：20 并发下钻压测脚本 + - `alert_latency_test.go`：告警触发到通知的计时单测（含重试统计） + - `PERF_ENV.md`：压测环境规格、数据量基准、判定标准（P99 计算方式、持续 5min） + +### 短期优化（提测前完成） + +5. **建立覆盖率验证机制** + - 在 CI 中引入 `go tool cover -func=coverage.out` 解析，按模块（domain / service / handler）分别校验阈值。 + - 引入增量覆盖率检查（如 codecov / coveralls），要求新增代码覆盖率 ≥80%。 + +6. **补充混沌测试用例** + - 在 TEST_DESIGN.md 中新增"混沌测试"章节，至少设计 3 条可执行用例： + - Chaos-01：随机杀死一个服务 Pod，验证告警引擎本地缓存持续运行且控制台进入只读。 + - Chaos-02：模拟 Redis 网络分区 30s，验证告警抑制状态不丢失、恢复后不重复通知。 + - Chaos-03：模拟 PostgreSQL 主从切换，验证审计写入短暂失败后异步补写。 + +7. **完善测试数据管理规范** + - 创建 `test/fixtures/` 目录结构文档，规定 SQL / JSON / Go seed 三种数据注入方式。 + - 为大数据量性能测试提供数据生成脚本（如 `generate_audit_logs.go` 生成 10000 条审计记录）。 + - 明确并行测试隔离方案（testcontainers 独立数据库 / 事务回滚 / 唯一 schema）。 + +8. 
**统一用例编号规范** + - 建议统一为 `TC-{AC}-{序号}`（如 `TC-01-01`），并同步修改 CASES.md。 + +--- + +## 审核结论 + +**当前状态：REQUEST_CHANGES** + +本文档在测试策略框架层面具备较好的完整性，分层模型、Mock 策略、环境矩阵和发布门禁检查表已达到可评审水平。但由于 P0-01 ~ P0-04 四项阻塞级缺陷（负向用例大面积缺失、异常流程遗漏、CI 零配置、性能压测无载体），**当前测试设计不足以支撑进入开发或提测阶段**。 + +建议研发团队优先补齐上述"立即行动"项，完成后提交 QA 复评。 + +--- + +> 报告生成路径：`/home/long/project/ai-ops/tech/QA_REVIEW_REPORT.md` diff --git a/tech/SECURITY_AUDIT_REPORT.md b/tech/SECURITY_AUDIT_REPORT.md new file mode 100644 index 0000000..91c4fe5 --- /dev/null +++ b/tech/SECURITY_AUDIT_REPORT.md @@ -0,0 +1,111 @@ +## Security 审核报告 + +项目：AI-Ops 智能运维系统 +审核日期：2026-05-11 +审核范围：HLD.md（第 8 节、第 10.1 节）、INTERFACE.md、PRD.md、000001_init_schema.up.sql、TEST_DESIGN.md +审核人：Security Role + +--- + +### 总体评级：B + +安全设计具备基本框架，RBAC、审计日志、威胁建模等核心模块已有原则性设计，但存在多项与 fail-closed 策略冲突的设计缺陷、权限边界模糊、以及缺乏可落地的实现细则。在进入开发前必须修复 P0/P1 项，否则生产环境存在越权操作、审计失效和自愈引擎被滥用的风险。 + +--- + +### 优点 + +1. 威胁建模已覆盖运维核心场景（自愈误触发、告警洪泛、越权操作、审计失真、外部适配层滥用），并映射到具体控制措施与验证要求（HLD 10.1）。 +2. 数据库层通过 BEFORE UPDATE/DELETE 触发器实现审计表 append-only，配合 CHECK 约束限制 action/risk_level 枚举值，基础防篡改机制存在（migration SQL）。 +3. RBAC 三角色权限矩阵在控制台层面有明确区分（viewer/operator/admin），并在响应头设计 `X-Permitted-Actions` 返回（HLD 8.1）。 +4. 自愈引擎引入沙盘模式（dry-run）和级联故障自动回退设计，降低了自动化动作对生产环境的直接冲击（HLD 3.3）。 +5. 数据隔离在架构层面要求 tenant_id 过滤，并保留 Row Level Security（RLS）作为最后一道防线（HLD 8.3）。 +6. 
测试设计将安全项纳入 CI 门禁：权限越界、审计篡改、SQL 注入均有测试用例（TEST_DESIGN.md 第 10 节）。 + +--- + +### 发现问题（按严重度 P0/P1/P2 分类） + +#### P0 — 阻塞性风险（进入开发前必须修复） + +| 编号 | 问题 | 证据 | 风险 | +|------|------|------|------| +| P0-001 | **审计写入失败时未阻断业务，与 fail-closed 策略直接冲突** | PRD F-5 明确写道："审计日志存储满盘/写入失败 - 丢弃非关键字段或改为异步上报，不阻断业务操作"。HLD 10.2 和 TEST_DESIGN 8.1 虽声明"高风险操作审计失败即拒绝"，但 PRD 的故障路径与之矛盾。 | 审计链路在降级场景下失效，无法保证"先写审计再执行业务"，导致合规断裂和事后无法追责。 | +| P0-002 | **自愈引擎 invoke_script 缺少执行环境沙盘隔离** | HLD 3.3 定义 invoke_script 动作可"执行用户配置的程序化脚本"，但文档中仅存在 dry-run 沙盘（验证动作逻辑），无任何关于脚本运行时隔离（容器化、seccomp、资源限制、网络策略、文件系统隔离）的设计。 | 攻击者或误配置可导致任意代码执行（RCE），直接控制生产环境或窃取密钥。 | +| P0-003 | **高风险变更二次确认可被 API 直接绕过** | HLD 3.5 和 10.1 将二次确认描述为"弹出二次确认窗口"（UI 层面），但未在 INTERFACE.md 或 HLD 中设计 API 层的防绕过机制（如二次确认令牌、幂等键、管理员数字签名）。 | 攻击者通过直接调用 REST API 即可绕过前端确认，执行影响面 >50% 的高风险变更。 | + +#### P1 — 高风险（必须在上线前修复） + +| 编号 | 问题 | 证据 | 风险 | +|------|------|------|------| +| P1-001 | **审计防篡改机制无法抵御特权用户攻击** | migration SQL 中的触发器可被超级用户（postgres）或具有 TRIGGER/ALTER 权限的用户禁用或删除；缺少对 DDL 操作（DROP TRIGGER、ALTER TABLE）的审计；无哈希链或数字签名。 | 拥有数据库高权限的内部人员或入侵者可静默抹除审计痕迹。 | +| P1-002 | **外部管理接口攻击面过大，API Key 管理缺失** | HLD 7.1 指出 AI-Ops 通过 POST /internal/gateway/switch-route、throttle、restart 等接口直接控制 gateway。独立运行时使用"API Key 鉴权"，但未描述 Key 的存储、轮换、最小权限、生命周期管理。 | 若 AI-Ops 服务被攻破，攻击者可一键切流、限流或重启实例，造成生产事故。 | +| P1-003 | **RBAC 缺少自愈动作与资源级权限控制** | HLD 8.1 RBAC 矩阵未包含"执行自愈动作"权限；INTERFACE.md 3.2 的 POST /api/v1/ai-ops/healing/execute 未标注角色要求；PRD AC-12 未限制 operator 是否可触发人工自愈。同时缺少同角色用户只能操作自己创建的资源的水平越权防护。 | operator 可能执行超出职责范围的自愈动作；用户 A 可修改/回滚用户 B 创建的规则。 | +| P1-004 | **敏感数据在配置表和审计日志中无加密存储设计** | ai_ops_channels.config 存储 Webhook URL 和密钥；ai_ops_rules.healing_config 可能包含管理 API 凭据；HLD 8.2 仅声明 Sanitizer 脱敏，无字段级加密或密钥托管设计。 | 数据库备份泄露、SQL 注入成功或内部人员越权查询即可直接获取明文密钥。 | +| P1-005 | **SQL 注入与 PromQL 注入缺少架构级强制约束** | TEST_DESIGN.md 有 SQL 注入测试项，但 HLD/INTERFACE 中未强制要求所有 SQL 必须使用参数化查询。threshold_value 支持 regex 类型，若该值被拼接进 PromQL 或 SQL，存在注入风险。 | 攻击者通过构造恶意阈值规则，可能读取未授权数据或篡改时序查询。 | + +#### P2 — 中等风险（建议修复） + +| 编号 | 问题 | 证据 | 风险 | 
+|------|------|------|------| +| P2-001 | **威胁建模缺少 OWASP Top 10 系统性映射和 LLM/Gateway 特有风险** | HLD 10.1 的 5 个威胁场景覆盖运维故障较多，但未明确映射 A02、A05、A06、A07、A10。对于 LLM 网关场景，缺少对 Prompt Injection、Model DoS、SSRF 的专项分析。 | 安全测试覆盖不全，可能导致未知攻击面遗漏。 | +| P2-002 | **影响面计算单一，无法覆盖多种高风险场景** | HLD 3.5 仅通过"拒绝率 > 50%"判定高风险。未考虑告警阈值过度敏感导致的告警风暴、通知渠道误删导致的告警黑洞等。 | 真正的风险变更可能被误判为低风险，绕过二次确认。 | +| P2-003 | **/metrics 和 WebSocket 接口的鉴权与限流设计缺失** | /metrics 对外暴露；WebSocket /ws/v1/ai-ops/alerts 未描述 JWT 验证、连接数限制、订阅权限校验。 | 潜在信息泄露和 DoS 攻击面。 | +| P2-004 | **错误码重复定义且 HTTP 状态码语义不一致** | INTERFACE.md 3.3 中 OPS_AUD_4101 出现两次，OPS_AUD_4001 定义为"无权进行审计操作"却对应 403。 | 客户端错误处理混乱，增加集成方开发成本。 | +| P2-005 | **审计日志外键允许隐式修改审计记录** | migration SQL 中 parent_audit_id 使用 ON DELETE SET NULL。虽触发器阻止 DELETE on ai_ops_audits，但如果 TRUNCATE 或超级用户绕过，子记录的 parent_audit_id 会被设为 NULL。 | 审计关联完整性受损，影响回滚追溯。 | +| P2-006 | **回滚操作自身缺少版本并发控制** | HLD 3.5 提到回滚前检查目标资源是否被后续修改覆盖（返回 4102），但未明确回滚执行期间是否加乐观锁或分布式锁。 | 并发回滚与修改可能导致配置状态进一步损坏。 | + +--- + +### 改进建议 + +1. **统一 fail-closed 策略，消除文档冲突** + - 修改 PRD F-5：将"审计写入失败不阻断业务"改为"审计写入失败时，高风险操作必须拒绝执行；低风险操作可降级至本地队列缓存，但必须在 5 分钟内补写成功，否则触发 P1 内部告警"。 + - 在 HLD 5.3 配置回滚流程和告警规则变更流程中，明确画出"先写审计表（INSERT ai_ops_audits） -> 获得 audit_id -> 执行业务 SQL -> 更新审计结果"的时序图。 + +2. **为 invoke_script 增加运行时沙盘** + - 采用 gVisor、Firecracker 或至少 Docker + seccomp + 只读根文件系统运行用户脚本。 + - 限制：禁止访问环境变量 secrets、禁止出站网络（除白名单外）、CPU/内存硬限制、执行超时强制 kill。 + - 脚本内容在保存前需经过静态安全扫描（如禁止 os.exec、net.Dial 等敏感调用）。 + +3. **在 API 层实现不可绕过的二次确认机制** + - 对影响面 > 50% 的变更，后端生成一次性 confirmation_token（绑定用户、资源、变更哈希、TTL=5min），前端弹窗只是展示层。 + - 执行高风险 API 时必须在请求头中携带有效的 confirmation_token，否则返回需要二次确认。 + - 该机制同时适用于回滚操作。 + +4. **加固审计防篡改能力** + - 引入审计日志哈希链：每条新记录包含 previous_hash = SHA256(上一条记录的 id + created_at + 内容)，应用层计算并存储。 + - 对 ai_ops_audits 启用 pgaudit 扩展，记录所有 DDL 和超级用户操作。 + - 定期（每日）将审计日志摘要写入不可篡改的外部存储（如 WORM 存储或独立只读实例）。 + - 将 parent_audit_id 的 ON DELETE SET NULL 改为 ON DELETE RESTRICT。 + +5. 
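**Refine RBAC and add resource-level authorization**
+   - Extend the RBAC matrix with fine-grained permissions such as healing:execute, channel:manage, and capacity:threshold_write.
+   - Introduce resource ownership: users may only modify or delete the rules/channels they created; admin is exempt from this restriction.
+   - In standalone mode, add a permission bit field (bitmask) to ai_ops_roles to avoid hard-coded role logic.
+
+6. **Enforce parameterized queries and injection protection**
+   - Add a clause to the HLD technical constraints: all SQL must use pgx parameterized queries and string-concatenated SQL is forbidden. All PromQL must pass allowlist validation.
+   - When threshold_type = 'regex', the regex value is used only for threshold evaluation in the application layer and must not be used as a SQL LIKE or PromQL query condition.
+
+7. **Encrypt sensitive configuration at rest**
+   - Sensitive fields in ai_ops_channels.config and ai_ops_rules.healing_config are encrypted with AES-256-GCM before being written, with keys held by an external KMS/Vault.
+   - Before serialization, the Sanitizer recursively scans the audit log's before_state / after_state and replaces sensitive values with "***REDACTED***", so that secrets nested inside JSONB are also masked.
+
+8. **Shrink the external interface attack surface**
+   - Protect /internal/gateway/* endpoints with short-lived mTLS credentials, or at minimum HMAC-SHA256 signature verification; store API keys in encrypted memory rather than as plaintext environment variables.
+   - Have the WebSocket endpoint verify the JWT token and cap the number of connections per IP.
+   - Add an IP allowlist or basic authentication to the /metrics endpoint so it does not expose excessive internal metadata.
+
+---
+
+### Gate checklist (must pass before development starts)
+
+- [ ] PRD F-5 corrected: high-risk operations must be rejected when the audit write fails (fail-closed).
+- [ ] HLD adds the design for a second-confirmation mechanism that cannot be bypassed at the API layer.
+- [ ] HLD adds a runtime sandbox isolation scheme for invoke_script.
+- [ ] RBAC matrix adds fine-grained permissions for healing execution, notification channels, capacity thresholds, and so on.
+- [ ] Audit triggers are complemented by pgaudit DDL monitoring and a hash-chain design.
+- [ ] The mandatory parameterized-query constraint at the database layer is written into the HLD technical constraints.
+- [ ] The encrypted-storage scheme for sensitive configuration fields is confirmed.
+- [ ] Duplicate error code definitions are fixed and aligned with the frontend.
diff --git a/tech/TEST_DESIGN.md b/tech/TEST_DESIGN.md
new file mode 100644
index 0000000..0b823a9
--- /dev/null
+++ b/tech/TEST_DESIGN.md
@@ -0,0 +1,378 @@
+# AI-Ops Test Design
+
+> Version: v1.0
+> Date: 2026-04-27
+> Status: first draft
+> Coverage: AC-01 ~ AC-12, exception flows F-01 ~ F-08, edge flows G ~ I
+
+---
+
+## 1. 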
测试策略 + +### 1.1 测试分层模型 + +``` +┌─────────────────────────────────────────────────┐ +│ E2E Tests (黑盒) │ +│ 场景：用户操作链路 + 系统集成验证 │ +│ 工具：Go test + k6 / 自制 E2E runner │ +│ 覆盖率目标：每个主流程 ≥ 1 条 │ +└─────────────────────────────────────────────────┘ + ▲ +┌─────────────────────────────────────────────────┐ +│ Integration Tests (灰盒) │ +│ 场景：Service 间协作、数据库读写、外部 API Mock │ +│ 工具：Go test + testify + sqlmock + httptest │ +│ 覆盖率门槛：service ≥ 80%, handler ≥ 80% │ +└─────────────────────────────────────────────────┘ + ▲ +┌─────────────────────────────────────────────────┐ +│ Unit Tests (白盒) │ +│ 场景：单个函数/方法逻辑、边界条件、错误分支 │ +│ 工具：Go test + testify + gomock │ +│ 覆盖率门槛：domain ≥ 70% │ +└─────────────────────────────────────────────────┘ +``` + +### 1.2 测试通过标准 + +| 维度 | 标准 | +|------|------| +| 覆盖率 | domain ≥ 70%, service/handler ≥ 80% | +| 主流程 | AC-01 ~ AC-12 全部有至少 1 条通过测试 | +| 异常流程 | F-01 ~ F-08 全部有至少 1 条验证测试 | +| 边缘流程 | G、H、I 全部有至少 1 条验证测试 | +| 告警噪声率 | 沙盒测试中误报率 ≤ 1%，超过则 CI 失败 | +| 自愈误触发 | 沙盒测试中 0 次误触发，否则 CI 失败 | + +### 1.3 测试环境矩阵 + +| 环境 | 用途 | 数据特征 | 外部依赖 | +|------|------|---------|---------| +| **Local Dev** | 开发者快速验证 | Mock 数据 | Mock 所有外部服务 | +| **CI** | PR Merge 门禁 | Mock 数据 | Mock 所有外部服务 | +| **Sandbox** | 沙盒验证（自愈规则） | 生产数据脱敏副本 | Mock + 部分真实依赖 | +| **Staging** | 上线前全流程验证 | 生产数据脱敏副本 | 全真实依赖 | +| **Production** | 灰度上线 | 真实数据 | 全真实依赖 | + +--- + +## 2. 
Mock 策略 + +### 2.1 外部依赖 Mock + +| 依赖 | Mock 方案 | 工具 | +|------|---------|------| +| **Prometheus / 时序数据库** | 嵌入式 mock server，返回预置指标数据 | httptest + 自定义 mock | +| **gateway/internal/metrics** | Mock HTTP handler，返回 JSON 指标 | gock / httptest | +| **supply-api/ 供应商健康接口** | Mock 返回 200/401/429/500 | gock | +| **platform-token-runtime/ 运行时状态接口** | Mock 返回正常/异常状态 | gock | +| **通知渠道（Webhook/邮件/飞书）** | Mock server 接收并验证请求格式 | httptest | +| **PostgreSQL** | sqlmock 拦截 SQL，验证查询正确性 | github.com/DATA-DOG/go-sqlmock | +| **Redis** | miniredis 内存模拟 | github.com/alicebob/miniredis | + +### 2.2 Mock 分层 + +``` +Production 依赖: + gateway metrics API ──→ supply-api 供应商接口 ──→ token-runtime 状态接口 + │ │ │ + ▼ ▼ ▼ +Mock (CI/Local): Mock (CI/Local): Mock (CI/Local): +MetricsMockServer → SupplierMockServer → RuntimeMockServer +``` + +--- + +## 3. 测试用例矩阵（按 AC 编号） + +### AC-01 实时监控看板 + +| 用例 ID | 描述 | 类型 | 覆盖条件 | +|---------|------|------|---------| +| TC-01-01 | 首页加载时间 <2s | Performance | Given 用户登录 When 访问首页 Then 响应时间 ≤2s | +| TC-01-02 | 首页显示 6 个指标 | Happy Path | Given 系统运行 When 首页加载 Then 显示 QPS/延迟/P99/错误率/供应商数/告警数 | +| TC-01-03 | 指标卡片 15s 内刷新 | Functional | Given 指标更新 When 数据推送 Then 15s 内页面刷新 | +| TC-01-04 | 无数据时看板展示"无数据" | Edge | Given 指标源断开 When 首页加载 Then 不显示过期数据 | +| TC-01-NEG-01 | 未登录访问首页返回 401 | Negative | Given 未登录 When 访问首页 Then 返回 401 | +| TC-01-NEG-02 | 非法时间范围参数返回 400 | Negative | Given 非法时间范围参数 When 请求指标 Then 返回 400 | + +### AC-02 指标下钻 + +| 用例 ID | 描述 | 类型 | 覆盖条件 | +|---------|------|------|---------| +| TC-02-01 | 下钻显示 1 小时趋势图 | Happy Path | Given 点击指标卡片 When 下钻 Then 显示 60min 趋势 | +| TC-02-02 | 按 service/path/supplier 维度分割 | Functional | Given 趋势图 When 按 supplier 下钻 Then 正确分割 | +| TC-02-03 | 下钻查询 <3s | Performance | Given 大数据量 When 执行下钻 Then 响应 <3s | +| TC-02-04 | 无数据范围返回空图表 | Edge | Given 无数据 When 下钻 Then 显示空图表而非报错 | +| TC-02-NEG-01 | 下钻不存在的 service 返回空结果/404 | Negative | Given 不存在的 service When 下钻 Then 返回空结果或 404 | +| TC-02-NEG-02 | 超大时间范围返回 413/截断 | Negative | Given 超大时间范围 When 
下钻 Then 返回 413 或自动截断 | + +### AC-03 告警规则配置 + +| 用例 ID | 描述 | 类型 | 覆盖条件 | +|---------|------|------|---------| +| TC-03-01 | 创建告警规则 | Happy Path | Given 登录管理员 When 创建规则 Then 规则保存成功 | +| TC-03-02 | 规则字段完整性校验 | Negative | Given 缺少必填字段 When 创建规则 Then 返回 400 | +| TC-03-03 | 规则变更 30s 内生效 | Functional | Given 规则已创建 When 修改阈值 Then 30s 后新规则生效 | +| TC-03-04 | 支持 50 条规则并发运行 | Load | Given 50 条规则 When 同时触发 Then 全部正确评估 | +| TC-03-05 | 规则编辑/禁用/删除 | Functional | Given 规则存在 When 编辑/禁用/删除 Then 状态正确变更 | + +### AC-04 告警通知触达 + +| 用例 ID | 描述 | 类型 | 覆盖条件 | +|---------|------|------|---------| +| TC-04-01 | P0/P1 告警 30s 内通知 | Performance | Given P1 告警触发 When 通知发送 Then ≤30s 到达 | +| TC-04-02 | P2 告警 120s 内通知 | Performance | Given P2 告警触发 When 通知发送 Then ≤120s 到达 | +| TC-04-03 | 至少 2 种通知渠道 | Functional | Given 告警触发 When 发送 Then 飞书和邮件均收到 | +| TC-04-04 | 通知内容完整性 | Functional | Given 告警发送 Then 包含级别/规则名/时间/当前值/阈值/事件ID/链接 | +| TC-04-05 | Webhook 通知失败后自动切换 | Resilience | Given Webhook 发送失败 When 告警触发 Then 自动切换至邮件 | +| TC-04-NEG-01 | 通知渠道全部失效时记录失败并触发内部告警 | Negative | Given 所有通知渠道失效 When 告警触发 Then 记录失败并触发内部 P2 告警 | +| TC-04-NEG-02 | 非法事件 ID 查询返回 404 | Negative | Given 非法事件 ID When 查询事件 Then 返回 404 | + +### AC-05 告警聚合与抑制 + +| 用例 ID | 描述 | 类型 | 覆盖条件 | +|---------|------|------|---------| +| TC-05-01 | 1 分钟内 >20 条告警触发聚合 | Functional | Given 同一资源 1min 内触发 25 条 When 聚合 Then 生成 1 条集群告警 | +| TC-05-02 | 集群告警包含累计数量和规则列表 | Functional | Given 集群告警生成 Then 内容包含数量≥20 和规则列表 | +| TC-05-03 | 5 分钟抑制期内同一规则不重复通知 | Functional | Given 告警已发送 When 5min 内再次触发 Then 不重复通知 | +| TC-05-04 | 级别升级时抑制解除 | Functional | Given P2 告警抑制中 When 升级为 P1 Then 立即通知 | +| TC-05-NEG-01 | 聚合阈值设置为 0 或负数时的校验拒绝 | Negative | Given 阈值为 0 或负数 When 创建/编辑规则 Then 返回 400 并拒绝 | + +### AC-06 自动自愈 + +| 用例 ID | 描述 | 类型 | 覆盖条件 | +|---------|------|------|---------| +| TC-06-01 | 自愈动作 60s 内完成 | Performance | Given 自愈规则触发 When 执行切换路由 Then ≤60s 完成含重试 | +| TC-06-02 | 自愈成功记录事件 | Happy Path | Given 自愈执行成功 When 完成 Then 事件记录 success | +| TC-06-03 | 自愈失败升级 P0 人工告警 | 
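Functional | Given all healing retries have failed When retrying stops Then escalate with a P0 notification |
+| TC-06-04 | Notify only when no healing rule exists | Functional | Given an alert without healing configuration When it fires Then only a notification is sent |
+| TC-06-05 | Sandbox mode: healing has no effect | Resilience | Given sandbox mode When healing is triggered Then it is only recorded and not actually executed |
+| TC-06-06 | Re-evaluate 2min after healing | Functional | Given healing has executed When 2min have passed Then evaluate whether the alert condition has cleared |
+| TC-06-07 | Fallback on cascading healing failure | Functional | Given a healing switch causes a new fault When it is detected Then fall back and escalate |
+| TC-06-NEG-01 | Binding to production rules is forbidden until the sandbox passes | Negative | Given sandbox testing has not passed When binding to a production alert rule Then return 400 and reject |
+| TC-06-NEG-02 | Invalid healing action type returns 400 | Negative | Given an invalid healing action type When configuring a rule Then return 400 |
+
+### AC-07 Configuration audit log
+
+| Case ID | Description | Type | Coverage condition |
+|---------|------|------|---------|
+| TC-07-01 | Audit record created within 1s of a config change | Performance | Given a config change is made When it completes Then the audit record exists within ≤1s |
+| TC-07-02 | Audit field completeness | Functional | Given an audit record When queried Then it contains all 10 fields |
+| TC-07-03 | Audit log is tamper-proof | Security | Given an audit record When modification is attempted Then the database layer rejects or detects it |
+| TC-07-04 | Audit log retained for 90 days | Functional | Given audit data aged 91 days When queried Then the 91-day-old records no longer exist (expired data has been purged) |
+| TC-07-05 | Audit query <3s | Performance | Given 10000 audit records When queried by condition Then <3s |
+
+### AC-08 Configuration rollback
+
+| Case ID | Description | Type | Coverage condition |
+|---------|------|------|---------|
+| TC-08-01 | Normal rollback <60s | Performance | Given the audit record exists When rollback is executed Then it completes in ≤60s |
+| TC-08-02 | Show the list of affected child resources before rollback | Functional | Given a rollback operation When about to execute Then show the child resources that will be overwritten |
+| TC-08-03 | Rollback creates a new audit record | Functional | Given rollback has executed When it completes Then the new audit record references the original ID |
+| TC-08-04 | Return OPS_AUD_4101 when the target does not exist | Negative | Given the target has been deleted When rollback is executed Then return the error code and do not execute |
+| TC-08-05 | Rollback failure is never silent | Resilience | Given rollback execution fails When it completes Then return an error code and notify |
+
+### AC-09 Capacity dashboard
+
+| Case ID | Description | Type | Coverage condition |
+|---------|------|------|---------|
+| TC-09-01 | Show 7-day trend data | Functional | Given the capacity dashboard When it loads Then show 7-day Token/QPS/latency trends |
+| TC-09-02 | Load level labels (normal/warning/overloaded) | Functional | Given load data When displayed Then the level is labeled correctly |
+| TC-09-03 | Predicted time to reach the limit | Functional | Given growth rate data When computed Then show the predicted time (for reference only) |
+| TC-09-NEG-01 | Show a degradation notice when the capacity dashboard loses its data source | Negative | Given the time-series store is disconnected When opening the capacity dashboard Then show a degradation notice instead of an error |
+
+### AC-10 Log/metric query
+
+| Case ID | Description | Type | Coverage condition |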
+|---------|------|------|---------| +| TC-10-01 | 按多维度筛选日志 | Functional | Given 查询条件 When 执行 Then 正确过滤 | +| TC-10-02 | 日志查询 <3s | Performance | Given 10000 条日志 When 查询 Then <3s | +| TC-10-03 | CSV 导出 10000 条 | Load | Given 查询结果 When 导出 Then 正确生成 CSV | +| TC-10-04 | 分页查询第 2 页 | Functional | Given 分页请求 When 获取第 2 页 Then 返回正确偏移 | +| TC-10-NEG-01 | 导出超过 10000 条时返回 413 或分批 | Negative | Given 查询结果 >10000 条 When 导出 CSV Then 返回 413 或自动分批导出 | + +### AC-11 监控数据保存 + +| 用例 ID | 描述 | 类型 | 覆盖条件 | +|---------|------|------|---------| +| TC-11-01 | 原始数据保留 ≥7 天 | Functional | Given 8 天前数据 When 查询 Then 7 天内数据存在 | +| TC-11-02 | 分钟级聚合保留 ≥30 天 | Functional | Given 31 天前数据 When 查询 Then 31 天前不存在 | +| TC-11-03 | 小时级聚合保留 ≥90 天 | Functional | Given 91 天前数据 When 查询 Then 不存在 | +| TC-11-NEG-01 | 查询已清理数据返回空并提示保留策略 | Negative | Given 查询已清理时段 When 查询原始数据 Then 返回空并提示保留策略 | + +### AC-12 角色与权限 + +| 用例 ID | 描述 | 类型 | 覆盖条件 | +|---------|------|------|---------| +| TC-12-01 | 查看者只能读不可写 | Security | Given 查看者 When 尝试写操作 Then 返回 403 | +| TC-12-02 | 运维人员不可执行回滚 | Security | Given 运维人员 When 执行回滚 Then 返回 403 | +| TC-12-03 | 管理员可执行所有操作 | Functional | Given 管理员 When 执行任意操作 Then 成功 | + +--- + +## 4. 异常流程测试（F-01 ~ F-08） + +| 用例 ID | 异常场景 | 验证点 | 预期行为 | +|---------|---------|-------|---------| +| TF-01 | 自愈动作重试均失败 | P0 人工告警触发 | 10s 内重试 1 次，失败后立即升级 P0 电话/短信 | +| TF-02 | 通知渠道失效（Webhook 5xx） | 备用渠道切换 | 记录失败，使用邮件→飞书→短信三次切换 | +| TF-03 | 回滚目标已不存在 | OPS_AUD_4101 | 返回错误码，运营手动修复 | +| TF-04 | 指标采集器 5min 无数据 | 数据源丢失标识 | 控制台显示丢失标识，触发 P2 内部告警 | +| TF-05 | 审计日志存储满盘 | 降级不阻断业务 | 丢弃非关键字段或异步上报，业务操作继续 | +| TF-06 | 自愈形成级联故障 | 回退并升级 | 自动恢复上一步，升级人工告警，立即电话通知 | +| TF-07 | 监控数据库全面中断 | 只读/降级模式 | 控制台只读，告警引擎本地缓存继续运行 | +| TF-08 | 实时看板指标计算超时 | 显示上次结果 | 显示上次成功结果并标注时间戳 | + +--- + +## 5. 
Canary release verification plan
+
+### 5.1 Verification scope per Phase
+
+| Phase | Scope | Pass criteria | Regression set |
+|-------|---------|---------|--------|
+| **Phase 1** | Monitoring dashboard + log query | AC-01, AC-02, AC-10, AC-11 all pass | None (no prior functionality) |
+| **Phase 2** | Alert rules + notification channels | AC-03, AC-04, AC-05 all pass | Full Phase 1 |
+| **Phase 3** | Healing engine + audit rollback | AC-06, AC-07, AC-08 all pass + 10 sandbox runs without a false trigger | Full Phase 1+2 |
+| **Phase 4** | Capacity dashboard | AC-09 all pass | Full Phase 1+2+3 |
+
+### 5.2 Canary gate checks
+
+All of the following must pass before each Phase upgrade:
+- [ ] All AC test cases pass 100%
+- [ ] Unit test coverage meets the thresholds (domain ≥70%, service ≥80%)
+- [ ] Healing sandbox verification: ≥10 runs without a false trigger
+- [ ] Rollback drill succeeds (at least 3 resource types)
+- [ ] Performance baseline tests pass (response times meet the AC requirements)
+- [ ] Security scan passes (no high-severity vulnerabilities)
+
+---
+
+## 6. Regression test sets
+
+### 6.1 Quick regression set (every PR)
+
+```
+TC-01-01, TC-01-02, TC-03-01, TC-03-03, TC-04-01, TC-07-01, TC-07-02, TC-12-01, TC-12-03
+9 cases in total, about 5-10 minutes
+```
+
+### 6.2 Full regression set (every Phase upgrade)
+
+```
+TC-01-01 ~ TC-01-04, TC-01-NEG-01, TC-01-NEG-02
+TC-02-01 ~ TC-02-04, TC-02-NEG-01, TC-02-NEG-02
+TC-03-01 ~ TC-03-05
+TC-04-01 ~ TC-04-05, TC-04-NEG-01, TC-04-NEG-02
+TC-05-01 ~ TC-05-04, TC-05-NEG-01
+TC-06-01 ~ TC-06-07, TC-06-NEG-01, TC-06-NEG-02
+TC-07-01 ~ TC-07-05
+TC-08-01 ~ TC-08-05
+TC-09-01 ~ TC-09-03, TC-09-NEG-01
+TC-10-01 ~ TC-10-04, TC-10-NEG-01
+TC-11-01 ~ TC-11-03, TC-11-NEG-01
+TC-12-01 ~ TC-12-03
+TF-01 ~ TF-08
+72 cases in total, about 30-60 minutes
+```
+
+---
+
+## 7. 
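Tech stack and integration constraint verification
+
+### 7.1 Unified tech stack and dual run-mode verification
+
+| Case ID | Description | Type | Verification condition |
+|---------|------|------|---------|
+| TOPS-RUN-01 | Standalone mode startup | Happy Path | Given a standalone `config.yaml` and dedicated database/Redis/time-series store When `cmd/ai-ops/main.go` starts Then `/actuator/health/ready` returns 200 and `/api/v1/ai-ops/*` is reachable |
+| TOPS-RUN-02 | Integrated mode mounting | Integration | Given the gateway or supply-api host process loads `IntegrationPlugin` When it starts Then the `/internal/ai-ops/*` routes, background workers, and health checks are mounted successfully |
+| TOPS-RUN-03 | Separate configuration loading | Functional | Given standalone and integrated modes are started separately When configuration is read Then standalone mode uses only its own configuration and integrated mode correctly merges the host project's configuration |
+| TOPS-RUN-04 | Database prefix isolation | Structural | Given migrations have run When the schema is inspected Then only tables with the `ai_ops_` prefix were created |
+
+### 7.2 IntegrationPlugin and module mounting verification
+
+| Case ID | Description | Type | Verification condition |
+|---------|------|------|---------|
+| TOPS-PLG-01 | IntegrationPlugin registers routes and health checks | Integration | Given integrated mode When the plugin registers Then the monitoring, alert, log, audit, and health check routes are mounted successfully |
+| TOPS-PLG-02 | Module switch takes effect | Functional | Given `enabled_modules` turns a module off When starting Then its routes/background tasks are not registered and other modules are unaffected |
+| TOPS-PLG-03 | Shared resources in integrated mode | Integration | Given the host process injects a shared DB/Redis/logger/metrics client When the plugin starts Then it uses the shared resources and does not re-initialize conflicting dependencies |
+
+### 7.3 OpenAPI contract verification
+
+| Case ID | Description | Type | Verification condition |
+|---------|------|------|---------|
+| TOPS-OAS-01 | OpenAPI document is reachable | Functional | Given the service is running When requesting `/openapi.json` or `/docs` Then it returns 200 and includes the monitoring, alert, healing, audit, and log query endpoints |
+| TOPS-OAS-02 | Routes match OpenAPI | Contract | Given the exported OpenAPI document When compared against the HTTP routes Then requests/responses/error codes match the implementation and no public endpoint is missing |
+| TOPS-OAS-03 | Integration prefix is configurable | Contract | Given integrated mode configures the internal prefix When the document is exported Then it reflects the `/internal/ai-ops/` prefix or clearly separates the external and internal surfaces |
+
+### 7.4 NewAPI / Sub2API adapter layer verification
+
+| Case ID | Description | Type | Verification condition |
+|---------|------|------|---------|
+| TOPS-ADP-01 | `/metrics` collection adapter | Contract | Given NewAPI/Sub2API pulls metrics via Prometheus scrape When `/metrics` is called Then metric names, labels, and sampling frequency satisfy the contract |
+| TOPS-ADP-02 | Alert callback adapter | Integration | Given an external system configures a Webhook callback When an alert fires Then the callback payload is complete, the signature is correct, and failures can be retried |
+| TOPS-ADP-03 | Healing script calls an external management API | Integration | Given a healing action triggers a programmatic script When it calls NewAPI/Sub2API through the adapter layer Then authentication, error code mapping, and fallback logic match the design |
+
+---
+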
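TOPS-ADP-02 requires the alert callback to carry a correct signature, but the diff does not say how the signature is computed. The sketch below shows one common approach, assuming HMAC-SHA256 over the raw request body with a hex-encoded digest. The function names `signPayload` and `verifySignature`, the secret, and the payload are hypothetical and only illustrate what such a contract test could exercise.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// signPayload returns the hex-encoded HMAC-SHA256 of body under secret.
// Assumption: the real callback scheme (algorithm, encoding, header name)
// is not specified in the source; this is an illustrative choice.
func signPayload(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// verifySignature recomputes the MAC over body and compares it with the
// received signature in constant time. Malformed hex is rejected.
func verifySignature(secret, body []byte, sigHex string) bool {
	got, err := hex.DecodeString(sigHex)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil))
}

func main() {
	secret := []byte("demo-secret")                    // hypothetical shared secret
	body := []byte(`{"event_id":"demo","level":"P1"}`) // hypothetical payload
	sig := signPayload(secret, body)

	fmt.Println(verifySignature(secret, body, sig))                     // true
	fmt.Println(verifySignature(secret, []byte(`{"level":"P0"}`), sig)) // false
}
```

A contract test for TOPS-ADP-02 could send a signed payload to a mock receiver built on `httptest`, then assert that a tampered body or a malformed signature is rejected before any retry logic runs.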
+## 8. 发布门禁与阶段结论 + +### 8.1 发布门禁检查表 + +以下门禁项全部通过前，不得进入生产交付： + +- [ ] 独立运行与集成运行模式均完成启动验证，路由、worker、健康检查真实挂载 +- [ ] `BuildServer` / `BuildRuntime` 中条件能力已显式接入，而非仅存在定义 +- [ ] OpenAPI、`/metrics`、Webhook、管理 API 的鉴权与字段边界合同测试通过 +- [ ] 自愈动作均完成沙盒验证、快照记录与回滚演练 +- [ ] 审计日志保证先写审计再执行业务，高风险操作审计失败即拒绝 +- [ ] viewer / operator / admin 三类角色权限矩阵验证通过 +- [ ] 告警洪泛、自愈误触发、时序库中断、通知渠道失效四类高风险回归全部通过 +- [ ] 至少一条真实故障检测 → 告警 → 通知/回滚链路完成端到端验证 + +### 8.2 阶段门控结论 + +**当前结论：REQUEST_CHANGES（已转化为具体行动项，见 HLD 10.2 节）** + +**进入开发/实现前必须补齐：** +- [ ] 将 HLD 中的威胁建模点全部下沉为可执行测试与阻断项（每个威胁场景必须有对应 CI 阻断测试用例）。 +- [ ] 为"定义 → 装配 → 调用 → 入口"四层链路补充 QA 检查要求，重点覆盖自愈、告警、审计、权限。 +- [ ] 分别给出独立模式与集成模式的最小验证命令、预期输出与失败判定。 +- [ ] 高风险变更必须 fail-closed：影响面 > 50% 的变更在审计写入失败时必须拒绝执行。 + +**阻断条件（任一触发则不得进入开发）：** +- 高风险动作没有沙盒/回滚闭环。 +- 审计不能证明先写后执行业务。 +- 关键能力只存在接口声明，未真实接入运行主链路。 +- HLD 门控 8.1 中任意一项未通过。 + +--- + +## 9. 性能测试 + +### 9.1 性能基准 + +| 指标 | 目标值 | 压测方法 | +|------|-------|---------| +| 首页加载 | <2s (P99) | k6 并发 50 用户 | +| 告警触发到通知 | P0/P1 <30s, P2 <120s | 单次告警触发计时 | +| 下钻查询 | <3s (P99) | k6 并发 20 用户 | +| 审计查询 | <3s (P99) | 10000 条数据下查询 | +| 配置回滚 | <60s (P99) | 单次回滚计时 | +| 支持并发告警规则 | ≥50 条同时评估 | 并发注入 50 条告警数据 | + +--- + +## 10. 安全测试 + +| 测试项 | 方法 | 验证点 | +|-------|------|-------| +| 权限越界 | 使用低权限 Token 尝试高权限操作 | 返回 403 | +| 审计日志篡改 | 尝试 UPDATE/DELETE 审计表 | 操作被拒绝或被检测 | +| SQL 注入 | 输入 `' OR 1=1 --` 等 | 参数化查询无注入 | +| 告警信息泄露 | 跨用户查询告警 | 无数据泄露 | +| 高风险变更未二次确认 | 提交影响 90% 流量的变更 | 变更被标记待确认 | diff --git a/tech/TechLead_Review_Report.md b/tech/TechLead_Review_Report.md new file mode 100644 index 0000000..c11e931 --- /dev/null +++ b/tech/TechLead_Review_Report.md @@ -0,0 +1,111 @@ +## TechLead 审核报告 — AI-Ops 智能运维系统 + +审核日期：2026-05-11 +审核范围：HLD.md、INTERFACE.md、DEPLOYMENT.md、000001_init_schema.up.sql +审核人：TechLead + +--- + +### 总体评级：B + +架构方向正确，核心设计（审计防篡改、沙盒自愈、DualCache、独立/集成双模式）有成熟模式支撑。但三份文档之间存在多处接口命名和路径不一致，ER 图与 migration 存在表缺失，IntegrationPlugin 未定义接口，这些问题必须在进入开发前修复。 + +--- + +### 优点 + +1. 
技术选型有明确决策理由和备选方案（Prometheus vs VictoriaMetrics、DualCache、CustomBatchLogger），降低未来换型风险。 +2. 核心业务指标均有量化目标和验证方式（MTTR<10min、噪声率<5%、覆盖率>=60%），便于 QA 建立阻断测试。 +3. 审计设计采用 append-only + 数据库触发器防篡改，符合合规和故障定责要求。 +4. 安全设计覆盖 RBAC、敏感字段脱敏、Row Level Security 可选方案，数据隔离意识到位。 +5. 风险分析包含威胁建模和六条具体风险项，并给出了缓解策略。 +6. 支持独立运行与集成运行两种模式，且 schema 强制使用 `ai_ops_` 前缀，避免与主项目冲突。 +7. 借鉴 LiteLLM 的 CustomBatchLogger、DualCache、DigestEntry 等模式，降低实现不确定性。 + +--- + +### 发现问题（按严重度分类） + +#### P0 — 阻断开发 + +| 编号 | 问题描述 | 影响 | 位置 | +|------|---------|------|------| +| P0-1 | HLD 与 INTERFACE 外部集成接口定义严重不一致。gateway/：HLD 写 `POST /internal/gateway/throttle`、`/switch-route`、`/restart`；INTERFACE 写 `POST /internal/gateway/routes`。supply-api/：HLD 写 `/internal/suppliers/health`、`/audit/events`、`/usage/token-stats`；INTERFACE 写 `/internal/supply/accounts/health`、`/internal/supply/audit/schema`。token-runtime/：HLD 写 `/internal/tokens/status`；INTERFACE 写 `/internal/runtime/token-usage`。 | 开发团队无法确定真实调用契约，集成测试无法编写，联调必失败。 | HLD §7 / INTERFACE §2 | +| P0-2 | HLD ER 图（§4.1）中出现 `ai_ops_events`、`ai_ops_notifys`、`ai_ops_configs`、`ai_ops_snapshots` 四张表，但 HLD §4.2 表结构、INTERFACE、migration SQL 中均完全缺失。 | 数据模型不完整，核心流程（通知、快照、配置版本）无法落地。 | HLD §4.1 vs §4.2 / migration | +| P0-3 | 自愈动作类型命名不一致。HLD §3.3 定义 `switch_route`、`restart_instance`、`isolate_node`；INTERFACE §1.3 HealingAction.Type 注释写 `restart_service switch_provider throttle isolate_node`。 | 同一概念多个命名，导致存储序列化、API 校验、前端枚举全部混乱。 | HLD §3.3 / INTERFACE §1.3 | +| P0-4 | 集成运行模式的核心契约 `IntegrationPlugin` 未在任何文档中定义 Go interface。HLD 仅文字描述“通过 IntegrationPlugin 将检查逻辑注入到主程序的健康检查中”，但没有接口方法、生命周期、注册方式。 | 集成模式无法编码实现，也无法做 CI 阻断检查（HLD §10.2 要求 BuildServer/BuildRuntime 显式挂载约束落实为 CI 检查，但无接口无法执行）。 | HLD §1.3 / §3.2 / §7 / §10.2 | + +#### P1 — 必须修复 + +| 编号 | 问题描述 | 影响 | 位置 | +|------|---------|------|------| +| P1-5 | DEPLOYMENT §1.1 写“AI-Ops API Server x 2 (主备)”，但 §4.2 写“负载均衡自动移除，剩余节点继续服务”。主备模式下备机不处理请求，与负载均衡多活逻辑矛盾。 | 部署架构描述混乱，SRE 无法按文档实施。 | DEPLOYMENT §1.1 / §4.2 | +| P1-6 | DEPLOYMENT §3.2 启动顺序让 Worker 执行 
migration。若 Worker 为多副本，同时启动会导致并发 migration 冲突（锁竞争或重复执行）。 | 可能导致数据库状态损坏或启动失败。 | DEPLOYMENT §3.2 | +| P1-7 | HLD §8.1 提到“系统自带角色表 `ai_ops_roles`”，但 HLD §4.2 和 migration 中均无该表定义。 | RBAC 无法落地。 | HLD §8.1 vs migration | +| P1-8 | HLD §3.3 级联故障防护要求“记录当前状态快照（包含相关配置版本号）”，但无 `ai_ops_snapshots` 表结构。P0-2 已提，此处强调其业务必要性。 | 自愈回退和级联故障检测缺少数据支撑。 | HLD §3.3 | +| P1-9 | 告警聚合流程（HLD §5.2）定义了聚合触发条件（>20条/60s），但未定义聚合告警如何解除、子告警状态如何同步到父告警、聚合告警 resolved 后子告警是否自动 resolved。 | 告警状态机不完整，可能导致聚合告警永久挂起。 | HLD §5.2 | +| P1-10 | 性能目标“>= 50 条规则 / 15s”对于实时告警场景仅约 3.3 条规则/秒，未说明规则评估是否支持水平分片并行。若规则数增长到 200 条，单实例评估可能超时。 | 扩展性设计停留在文字，未给出分片策略和负载均衡方案。 | HLD §9.1 / §9.2 | + +#### P2 — 建议优化 + +| 编号 | 问题描述 | 影响 | 位置 | +|------|---------|------|------| +| P2-11 | INTERFACE §3.3 错误码列表中 `OPS_AUD_4001`（403 无权）与 `OPS_AUD_4101`（400 回滚目标不存在）排版紧邻，且 `OPS_AUD_4101` 重复出现两次，易混淆。 | 错误码使用方容易误用。 | INTERFACE §3.3 | +| P2-12 | migration 中 `ai_ops_metrics_p` 仅创建 DEFAULT 分区，未建立按天 RANGE 分区，也未实现“自动删除 > 7 天的分区”。pg_partman 仅为注释。 | 时序缓存表会随着数据增长性能急剧下降，且无法自动清理。 | migration §6 | +| P2-13 | WebSocket 接口（INTERFACE §3.4）缺少鉴权机制说明。告警数据为敏感生产信息，公开 WebSocket 无鉴权存在数据泄露风险。 | 安全合规风险。 | INTERFACE §3.4 | +| P2-14 | Graceful Shutdown（DEPLOYMENT §3.2）未说明 WebSocket 长连接的关闭策略。若仅关闭 HTTP server，WebSocket 客户端可能异常断连。 | 用户体验和监控准确性下降。 | DEPLOYMENT §3.2 | +| P2-15 | HLD §9.3 存储估算假设 Prometheus 每个样本 8 bytes，实际 Prometheus TSDB 包含时间戳、变长标签、chunk 元数据等，真实存储远高于此。按此估算生产磁盘规划可能不足。 | 生产环境磁盘空间不足风险。 | HLD §9.3 | + +--- + +### 改进建议 + +1. **统一三文档外部集成契约**：以 INTERFACE.md 为基准，组织一次 HLD + INTERFACE + DEPLOYMENT 的接口对齐评审，确定 gateway/supply-api/token-runtime 的正式路径、请求/响应字段、鉴权方式，形成一份 `INTEGRATION_CONTRACT.md` 作为唯一可信源。 + +2. **补齐或裁剪 ER 图**： + - 若 `ai_ops_events`、`ai_ops_notifys`、`ai_ops_configs`、`ai_ops_snapshots` 确实需要，在 HLD §4.2 和 migration 中补充表结构、字段、索引、外键。 + - 若不需要，从 ER 图中删除，并说明替代方案（例如 `ai_ops_events` 是否被 `ai_ops_alerts` + `ai_ops_healings` 覆盖，`ai_ops_notifys` 是否被通知渠道 + alert status 覆盖）。 + +3. 
**定义 IntegrationPlugin 接口**：在 INTERFACE.md 中增加 `IntegrationPlugin` Go interface，至少包含： + ```go + type IntegrationPlugin interface { + Name() string + Init(ctx context.Context, cfg Config) error + RegisterRoutes(mux *http.ServeMux) error + HealthChecks() []HealthCheckFunc + Shutdown(ctx context.Context) error + } + ``` + 并说明注册方式（import + 显式 Enable）。 + +4. **修正 API Server 部署描述**：将 DEPLOYMENT §1.1 的“主备”改为“多实例 Active-Active + 负载均衡”，与 §4.2 的故障处理逻辑一致。 + +5. **分离 migration 执行**：将数据库 migration 从 Worker 启动逻辑中移出，改为： + - 独立运行：使用 `cmd/migrate` 或 Docker init container 执行。 + - 集成运行：由主程序在启动前统一执行，或通过 Kubernetes Job 执行。 + +6. **补充缺失表结构**：至少补充 `ai_ops_roles`（RBAC 需要）和 `ai_ops_snapshots`（自愈回退需要）。 + +7. **建立 metrics 分区管理策略**：选择以下之一并在 migration 中体现： + - 引入 `pg_partman` 扩展并编写初始化脚本； + - 或在应用层编写定时任务每日创建新分区、删除旧分区。 + +8. **WebSocket 增加鉴权与优雅关闭**： + - 连接建立时校验 JWT Token； + - Shutdown 阶段先发送 close frame，等待客户端 ack 或超时（5s）后再关闭 TCP 连接。 + +9. **完善告警聚合状态机**：补充聚合告警的 resolved/escalated 规则，以及子告警状态与父告警的同步策略（例如父告警 resolved 时是否批量 resolved 子告警）。 + +10. **重新校准时序存储容量估算**：参考 Prometheus 官方容量规划公式（`bytes_per_sample ≈ 1-2 bytes 压缩后，但写放大和索引占主要空间`），给出更保守的磁盘规划建议。 + +--- + +### 审核结论 + +**当前状态：REQUEST_CHANGES** + +本设计在架构层面具备可行性，核心决策有理据。但文档间的接口不一致和模型缺失是足以导致开发返工的系统性问题。建议在进入编码前： + +1. 召开接口对齐会（1-2 小时），统一 HLD / INTERFACE / DEPLOYMENT 中的所有外部路径和命名； +2. 由架构负责人补充 IntegrationPlugin 接口定义和缺失的表结构； +3. 
Resubmit the fixes above for TechLead review; development may begin only after approval.
diff --git a/tech/migrations/000001_init_schema.down.sql b/tech/migrations/000001_init_schema.down.sql
new file mode 100644
index 0000000..20c81b2
--- /dev/null
+++ b/tech/migrations/000001_init_schema.down.sql
@@ -0,0 +1,10 @@
+DROP TABLE IF EXISTS ai_ops_snapshots CASCADE;
+DROP TABLE IF EXISTS ai_ops_configs CASCADE;
+DROP TABLE IF EXISTS ai_ops_metrics_p CASCADE;
+DROP TABLE IF EXISTS ai_ops_metrics CASCADE;
+DROP TABLE IF EXISTS ai_ops_healings CASCADE;
+DROP TABLE IF EXISTS ai_ops_alerts CASCADE;
+DROP TABLE IF EXISTS ai_ops_channels CASCADE;
+DROP TABLE IF EXISTS ai_ops_rules CASCADE;
+DROP TABLE IF EXISTS ai_ops_audits CASCADE;
+DROP FUNCTION IF EXISTS ai_ops_audit_readonly();
diff --git a/tech/migrations/000001_init_schema.up.sql b/tech/migrations/000001_init_schema.up.sql
new file mode 100644
index 0000000..2690232
--- /dev/null
+++ b/tech/migrations/000001_init_schema.up.sql
@@ -0,0 +1,180 @@
+-- AI-Ops initial schema
+-- Tables use the ai_ops_ prefix to avoid name clashes with the bridge project's tables
+
+-- 1. Alert rules
+CREATE TABLE IF NOT EXISTS ai_ops_rules (
+    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    name VARCHAR(128) NOT NULL,
+    metric_source VARCHAR(64) NOT NULL,
+    metric_name VARCHAR(128) NOT NULL,
+    threshold_type VARCHAR(16) NOT NULL CHECK (threshold_type IN ('>', '<', '=', 'regex')),
+    threshold_value TEXT NOT NULL,
+    duration_min INT NOT NULL DEFAULT 1 CHECK (duration_min >= 1),
+    level VARCHAR(8) NOT NULL CHECK (level IN ('P0', 'P1', 'P2', 'P3')),
+    channel_ids UUID[] NOT NULL DEFAULT '{}',
+    healing_action VARCHAR(32) DEFAULT NULL,
+    healing_config JSONB DEFAULT NULL,
+    is_sandboxed BOOLEAN NOT NULL DEFAULT FALSE,
+    enabled BOOLEAN NOT NULL DEFAULT TRUE,
+    created_by VARCHAR(64) NOT NULL,
+    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
+    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
+    version INT NOT NULL DEFAULT 1,
+    CONSTRAINT uq_rules_name UNIQUE (name)
+);
+
+CREATE INDEX IF NOT EXISTS idx_rules_enabled ON ai_ops_rules(enabled);
+
+-- 2. 
告警事件 +CREATE TABLE IF NOT EXISTS ai_ops_alerts ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + rule_id UUID NOT NULL REFERENCES ai_ops_rules(id) ON DELETE CASCADE, + level VARCHAR(8) NOT NULL, + resource_type VARCHAR(64) NOT NULL, + resource_id VARCHAR(128) NOT NULL, + current_value TEXT NOT NULL, + threshold_value TEXT NOT NULL, + status VARCHAR(16) NOT NULL DEFAULT 'triggered' + CHECK (status IN ('triggered', 'notified', 'healing', 'resolved', 'escalated', 'acknowledged')), + is_aggregated BOOLEAN NOT NULL DEFAULT FALSE, + aggregated_count INT DEFAULT 0, + parent_alert_id UUID NULL REFERENCES ai_ops_alerts(id) ON DELETE SET NULL, + started_at TIMESTAMPTZ NOT NULL, + resolved_at TIMESTAMPTZ NULL, + acknowledged_by VARCHAR(64) NULL, + acknowledged_at TIMESTAMPTZ NULL +); + +CREATE INDEX IF NOT EXISTS idx_alerts_status ON ai_ops_alerts(status); +CREATE INDEX IF NOT EXISTS idx_alerts_started_at ON ai_ops_alerts(started_at DESC); +CREATE INDEX IF NOT EXISTS idx_alerts_resource ON ai_ops_alerts(resource_type, resource_id); + +-- 3. 自愈执行记录 +CREATE TABLE IF NOT EXISTS ai_ops_healings ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + alert_id UUID NOT NULL REFERENCES ai_ops_alerts(id) ON DELETE CASCADE, + action_type VARCHAR(32) NOT NULL + CHECK (action_type IN ('switch_route', 'throttle', 'restart_instance', 'invoke_script', 'isolate_node')), + config JSONB NOT NULL, + status VARCHAR(16) NOT NULL DEFAULT 'pending' + CHECK (status IN ('pending', 'succeeded', 'failed', 'rolled_back')), + dry_run BOOLEAN NOT NULL DEFAULT FALSE, + result_detail JSONB NULL, + error_code VARCHAR(16) NULL, + started_at TIMESTAMPTZ NOT NULL, + completed_at TIMESTAMPTZ NULL +); + +CREATE INDEX IF NOT EXISTS idx_healings_alert ON ai_ops_healings(alert_id); + +-- 4. 
通知渠道 +CREATE TABLE IF NOT EXISTS ai_ops_channels ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + name VARCHAR(128) NOT NULL, + channel_type VARCHAR(32) NOT NULL + CHECK (channel_type IN ('webhook', 'email', 'feishu', 'wechat', 'sms')), + config JSONB NOT NULL, + priority INT NOT NULL DEFAULT 1, + enabled BOOLEAN NOT NULL DEFAULT TRUE, + created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() +); + +-- 5. 审计日志（append-only，禁止更新和删除） +CREATE TABLE IF NOT EXISTS ai_ops_audits ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + tenant_id VARCHAR(64) NOT NULL, + object_type VARCHAR(64) NOT NULL, + object_id VARCHAR(128) NOT NULL, + action VARCHAR(32) NOT NULL + CHECK (action IN ('create', 'update', 'delete', 'rollback')), + before_state JSONB NULL, + after_state JSONB NULL, + request_id VARCHAR(64) NOT NULL, + result_code VARCHAR(16) NOT NULL, + source_ip VARCHAR(45) NOT NULL, + actor_id VARCHAR(64) NOT NULL, + risk_level VARCHAR(8) NOT NULL DEFAULT 'normal' + CHECK (risk_level IN ('normal', 'high', 'critical')), + parent_audit_id UUID NULL REFERENCES ai_ops_audits(id) ON DELETE SET NULL, + created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() +); + +CREATE INDEX IF NOT EXISTS idx_audits_tenant_created ON ai_ops_audits(tenant_id, created_at DESC); +CREATE INDEX IF NOT EXISTS idx_audits_object ON ai_ops_audits(object_type, object_id); +CREATE INDEX IF NOT EXISTS idx_audits_actor ON ai_ops_audits(actor_id, created_at DESC); +CREATE INDEX IF NOT EXISTS idx_audits_request ON ai_ops_audits(request_id); + +-- 审计日志防篡改触发器（仅允许插入，禁止更新和删除） +CREATE OR REPLACE FUNCTION ai_ops_audit_readonly() +RETURNS TRIGGER AS $$ +BEGIN + RAISE EXCEPTION 'ai_ops_audits is append-only. 
Updates and deletes are not allowed.'; +END; +$$ LANGUAGE plpgsql; + +DROP TRIGGER IF EXISTS trg_ai_ops_audits_no_update ON ai_ops_audits; +CREATE TRIGGER trg_ai_ops_audits_no_update + BEFORE UPDATE ON ai_ops_audits + FOR EACH ROW + EXECUTE FUNCTION ai_ops_audit_readonly(); + +DROP TRIGGER IF EXISTS trg_ai_ops_audits_no_delete ON ai_ops_audits; +CREATE TRIGGER trg_ai_ops_audits_no_delete + BEFORE DELETE ON ai_ops_audits + FOR EACH ROW + EXECUTE FUNCTION ai_ops_audit_readonly(); + +-- 6. 时序指标缓存（降级方案，主存储推荐 Prometheus / VictoriaMetrics） +CREATE TABLE IF NOT EXISTS ai_ops_metrics ( + id BIGSERIAL PRIMARY KEY, + metric_name VARCHAR(128) NOT NULL, + labels JSONB NOT NULL DEFAULT '{}', + value DOUBLE PRECISION NOT NULL, + recorded_at TIMESTAMPTZ NOT NULL +); + +CREATE INDEX IF NOT EXISTS idx_metrics_name_time ON ai_ops_metrics(metric_name, recorded_at DESC); + +-- 分区表（按天分区，自动清理 > 7 天的分区） +CREATE TABLE IF NOT EXISTS ai_ops_metrics_p ( + id BIGSERIAL NOT NULL, + metric_name VARCHAR(128) NOT NULL, + labels JSONB NOT NULL DEFAULT '{}', + value DOUBLE PRECISION NOT NULL, + recorded_at TIMESTAMPTZ NOT NULL, + PRIMARY KEY (id, recorded_at) +) PARTITION BY RANGE (recorded_at); + +-- 创建当前日分区 +CREATE TABLE IF NOT EXISTS ai_ops_metrics_p_default PARTITION OF ai_ops_metrics_p + DEFAULT; + +-- 启用 pg_partman 扩展后可自动管理分区（建议生产环境使用） +-- SELECT partman.create_parent('public.ai_ops_metrics_p', 'recorded_at', 'native', 'daily'); + +-- 7. 配置版本管理（支持审计回滚） +CREATE TABLE IF NOT EXISTS ai_ops_configs ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + object_type VARCHAR(64) NOT NULL, + object_id VARCHAR(128) NOT NULL, + config_data JSONB NOT NULL, + version INT NOT NULL DEFAULT 1, + created_by VARCHAR(64) NOT NULL, + created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + CONSTRAINT uq_configs_object_version UNIQUE (object_type, object_id, version) +); + +CREATE INDEX IF NOT EXISTS idx_configs_object ON ai_ops_configs(object_type, object_id, created_at DESC); + +-- 8. 
状态快照（自愈回滚用） +CREATE TABLE IF NOT EXISTS ai_ops_snapshots ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + healing_id UUID NOT NULL REFERENCES ai_ops_healings(id) ON DELETE CASCADE, + snapshot_type VARCHAR(32) NOT NULL + CHECK (snapshot_type IN ('route', 'rate_limit', 'instance', 'script', 'node')), + before_state JSONB NOT NULL, + after_state JSONB NULL, + created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() +); + +CREATE INDEX IF NOT EXISTS idx_snapshots_healing ON ai_ops_snapshots(healing_id); diff --git a/tech/migrations/000002_add_notification_logs.up.sql b/tech/migrations/000002_add_notification_logs.up.sql new file mode 100644 index 0000000..3c1edd9 --- /dev/null +++ b/tech/migrations/000002_add_notification_logs.up.sql @@ -0,0 +1,17 @@ +-- 通知日志表 +CREATE TABLE IF NOT EXISTS ai_ops_notification_logs ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + event_id UUID NOT NULL REFERENCES ai_ops_alerts(id) ON DELETE CASCADE, + channel_id UUID NOT NULL REFERENCES ai_ops_channels(id) ON DELETE CASCADE, + channel_type VARCHAR(32) NOT NULL, + status VARCHAR(16) NOT NULL DEFAULT 'pending' + CHECK (status IN ('pending', 'sent', 'failed', 'retrying')), + retry_count INT NOT NULL DEFAULT 0, + error_message TEXT NULL, + sent_at TIMESTAMPTZ NULL, + created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() +); + +CREATE INDEX IF NOT EXISTS idx_notification_logs_event ON ai_ops_notification_logs(event_id); +CREATE INDEX IF NOT EXISTS idx_notification_logs_status ON ai_ops_notification_logs(status); +CREATE INDEX IF NOT EXISTS idx_notification_logs_created ON ai_ops_notification_logs(created_at DESC); diff --git a/tech/migrations/000002_create_request_logs.up.sql b/tech/migrations/000002_create_request_logs.up.sql new file mode 100644 index 0000000..86f46fb --- /dev/null +++ b/tech/migrations/000002_create_request_logs.up.sql @@ -0,0 +1,22 @@ +-- Phase 1: 补充请求日志表，支持日志查询功能 +CREATE TABLE IF NOT EXISTS ai_ops_request_logs ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + timestamp 
TIMESTAMPTZ NOT NULL DEFAULT NOW(), + service VARCHAR(64) NOT NULL, + path VARCHAR(256) NOT NULL, + method VARCHAR(8) NOT NULL, + status_code INT NOT NULL, + latency_ms DECIMAL(10,3) NOT NULL, + user_id VARCHAR(64), + supplier_id VARCHAR(64), + error_code VARCHAR(64), + created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() +); + +CREATE INDEX IF NOT EXISTS idx_request_logs_timestamp ON ai_ops_request_logs (timestamp DESC); +CREATE INDEX IF NOT EXISTS idx_request_logs_service ON ai_ops_request_logs (service); +CREATE INDEX IF NOT EXISTS idx_request_logs_path ON ai_ops_request_logs (path); +CREATE INDEX IF NOT EXISTS idx_request_logs_status_code ON ai_ops_request_logs (status_code); +CREATE INDEX IF NOT EXISTS idx_request_logs_user_id ON ai_ops_request_logs (user_id); +CREATE INDEX IF NOT EXISTS idx_request_logs_supplier_id ON ai_ops_request_logs (supplier_id); +CREATE INDEX IF NOT EXISTS idx_request_logs_time_service ON ai_ops_request_logs (timestamp DESC, service); diff --git a/test/CASES.md b/test/CASES.md new file mode 100644 index 0000000..62f30e9 --- /dev/null +++ b/test/CASES.md @@ -0,0 +1,113 @@ +# AI-Ops 测试用例 + +> 版本：v1.0 | 状态：初稿 + +--- + +## AC-1 实时监控看板 + +| 用例编号 | 名称 | 前置条件 | 测试步骤 | 预期结果 | 优先级 | +|---------|------|---------|---------|---------|--------| +| TC-1.1 | 首页加载时间 | 服务运行中，指标数据已采集 | 1. 访问运维主控台首页 2. 记录首屏加载时间 | 加载时间 < 2s | P0 | +| TC-1.2 | 六大指标显示 | 指标数据已采集 | 1. 访问首页 2. 检查指标卡片 | 必须显示 QPS、平均延迟、P99 延迟、5xx 错误率、活跃供应商数量、未关闭告警数量 | P0 | +| TC-1.3 | 指标刷新延迟 | 指标数据已更新 | 1. 触发新指标数据写入 2. 记录前端刷新时间 | 15s 内刷新显示 | P0 | + +## AC-2 指标下钻 + +| 用例编号 | 名称 | 前置条件 | 测试步骤 | 预期结果 | 优先级 | +|---------|------|---------|---------|---------|--------| +| TC-2.1 | 趋势图展示 | 存在 1 小时指标数据 | 1. 点击某指标卡片 2. 观察趋势图 | 展示过去 1 小时分钟级数据 | P0 | +| TC-2.2 | 下钻分割 | 存在多服务/路径/供应商数据 | 1. 选择下钻维度 2. 查看分割结果 | 支持 service、path、supplier 维度 | P1 | +| TC-2.3 | 下钻查询时间 | 大量数据存在 | 1. 执行下钻查询 2. 
记录响应时间 | 查询时间 < 3s | P0 | + +## AC-3 告警规则配置 + +| 用例编号 | 名称 | 前置条件 | 测试步骤 | 预期结果 | 优先级 | +|---------|------|---------|---------|---------|--------| +| TC-3.1 | 创建规则 | 登录运维人员 | 1. 填写规则名称、指标、阈值、持续时间、级别、通知渠道 2. 提交 | 规则创建成功，返回规则 ID | P0 | +| TC-3.2 | 缺少字段报错 | 登录运维人员 | 1. 提交空规则名称 2. 提交 | 返回 400 错误，提示缺少字段 | P1 | +| TC-3.3 | 规则生效时间 | 规则已创建 | 1. 创建规则 2. 30s 后触发相关指标超阈值 | 规则生效，触发告警 | P0 | +| TC-3.4 | 同时运行 50 条规则 | 已创建 50 条规则 | 1. 创建 50 条规则 2. 观察系统运行 | 50 条规则同时运行不崩溃 | P1 | + +## AC-4 告警通知触达 + +| 用例编号 | 名称 | 前置条件 | 测试步骤 | 预期结果 | 优先级 | +|---------|------|---------|---------|---------|--------| +| TC-4.1 | P0 告警触发时间 | P0 规则已配置 | 1. 模拟指标超阈值 2. 记录通知发送时间 | 通知发送时间 < 30s | P0 | +| TC-4.2 | P2 告警触发时间 | P2 规则已配置 | 1. 模拟指标超阈值 2. 记录通知发送时间 | 通知发送时间 < 120s | P0 | +| TC-4.3 | 通知渠道覆盖 | 规则已配置 | 1. 配置 Webhook、邮件、飞书通知 2. 触发告警 | 所有配置渠道均收到通知 | P0 | +| TC-4.4 | 通知模板完整性 | 规则已配置 | 1. 触发告警 2. 检查通知内容 | 包含级别、规则名称、触发时间、当前值、阈值、事件 ID、查看链接 | P1 | + +## AC-5 告警聚合与抑制 + +| 用例编号 | 名称 | 前置条件 | 测试步骤 | 预期结果 | 优先级 | +|---------|------|---------|---------|---------|--------| +| TC-5.1 | 集群告警触发 | 规则已配置 | 1. 1 分钟内模拟触发 >20 条同资源告警 | 生成 1 条集群告警，停止单条通知 | P0 | +| TC-5.2 | 抑制周期 | 规则已配置 | 1. 触发告警 2. 5 分钟内再次触发同规则同目标 | 仅发送 1 次通知（除非级别升级） | P0 | + +## AC-6 自动自愈 + +| 用例编号 | 名称 | 前置条件 | 测试步骤 | 预期结果 | 优先级 | +|---------|------|---------|---------|---------|--------| +| TC-6.1 | 自愈动作配置 | 规则已配置 | 1. 为规则配置自愈动作 2. 模拟触发 | 自愈动作在 60s 内执行完成 | P0 | +| TC-6.2 | 自愈执行结果记录 | 自愈已执行 | 1. 执行自愈动作 2. 检查告警事件 | 记录执行结果（成功/失败/拒绝） | P1 | +| TC-6.3 | 自愈失败升级 | 自愈动作配置 | 1. 模拟自愈失败 2. 观察 2 分钟 | 升级为人工告警 | P0 | + +## AC-7 配置审计日志 + +| 用例编号 | 名称 | 前置条件 | 测试步骤 | 预期结果 | 优先级 | +|---------|------|---------|---------|---------|--------| +| TC-7.1 | 审计日志生成 | 登录管理员 | 1. 修改配置 2. 1s 内查询审计日志 | 生成审计记录，包含所有必要字段 | P0 | +| TC-7.2 | 审计日志不可篡改 | 审计日志已生成 | 1. 尝试直接修改数据库审计记录 | 修改被拒绝或不影响查询结果 | P1 | +| TC-7.3 | 审计查询效率 | 存在大量审计记录 | 1. 查询审计日志 2. 
记录响应时间 | 响应时间 < 3s | P1 | + +## AC-8 配置回滚 + +| 用例编号 | 名称 | 前置条件 | 测试步骤 | 预期结果 | 优先级 | +|---------|------|---------|---------|---------|--------| +| TC-8.1 | 回滚成功 | 存在可回滚的审计记录 | 1. 选择审计记录 2. 执行回滚 3. 确认覆盖内容 | 回滚成功，生成新审计记录 | P0 | +| TC-8.2 | 回滚目标不存在 | 目标资源已删除 | 1. 尝试回滚已删除的资源 | 返回错误码 `OPS_AUD_4101` | P0 | +| TC-8.3 | 回滚二次确认 | 回滚将影响多个子资源 | 1. 执行回滚 2. 观察提示 | 显示将要覆盖的子资源列表 | P1 | + +## AC-9 容量主板 + +| 用例编号 | 名称 | 前置条件 | 测试步骤 | 预期结果 | 优先级 | +|---------|------|---------|---------|---------|--------| +| TC-9.1 | 趋势展示 | 存在 7 天数据 | 1. 访问容量主板 | 显示 7 天趋势 | P1 | +| TC-9.2 | 负载等级 | 指标数据已采集 | 1. 调整阈值 2. 观察等级变化 | 正确标注正常/警告/过载 | P1 | + +## AC-10 日志/指标查询 + +| 用例编号 | 名称 | 前置条件 | 测试步骤 | 预期结果 | 优先级 | +|---------|------|---------|---------|---------|--------| +| TC-10.1 | 日志筛选 | 存在日志数据 | 1. 按时间范围、服务、状态码筛选 | 返回符合条件的日志 | P0 | +| TC-10.2 | 日志分页 | 存在大量日志 | 1. 查询日志 2. 分页浏览 | 首页返回时间 < 3s，单页 100 条 | P1 | +| TC-10.3 | 日志导出 | 存在日志数据 | 1. 导出日志为 CSV | 成功导出，单次上限 10000 条 | P1 | + +## AC-11 监控数据保存 + +| 用例编号 | 名称 | 前置条件 | 测试步骤 | 预期结果 | 优先级 | +|---------|------|---------|---------|---------|--------| +| TC-11.1 | 原始数据保留 | 已采集指标 | 1. 等待 7 天 2. 查询 7 天前的原始数据 | 数据仍可查询 | P1 | +| TC-11.2 | 聚合数据保留 | 已采集指标 | 1. 等待 30 天 2. 查询分钟级数据 | 分钟级聚合数据可查，原始数据已清理 | P1 | + +## AC-12 角色与权限 + +| 用例编号 | 名称 | 前置条件 | 测试步骤 | 预期结果 | 优先级 | +|---------|------|---------|---------|---------|--------| +| TC-12.1 | 查看者权限 | 登录查看者 | 1. 尝试修改配置 | 操作被拒绝（返回 403） | P1 | +| TC-12.2 | 运维人员权限 | 登录运维人员 | 1. 确认告警 2. 尝试回滚 | 确认成功，回滚被拒绝 | P1 | +| TC-12.3 | 管理员权限 | 登录管理员 | 1. 执行回滚 | 回滚成功 | P0 | + +## 边缘场景 / 失败路径 + +| 用例编号 | 名称 | 前置条件 | 测试步骤 | 预期结果 | 优先级 | +|---------|------|---------|---------|---------|--------| +| TC-E1 | 自愈动作重试均失败 | 自愈动作已配置 | 1. 模拟自愈失败 2 次 | 升级为 P0 人工告警 | P0 | +| TC-E2 | 通知渠道失效 | 通知渠道已配置 | 1. 模拟 Webhook 8xx 2. 观察切换 | 切换至备用渠道 | P1 | +| TC-E3 | 回滚目标不存在 | 目标已删除 | 1. 尝试回滚 | 返回错误码 | P1 | +| TC-E4 | 数据源丢失 | 采集器运行中 | 1. 停止采集器 5 分钟 | 显示数据源丢失标识，触发 P2 告警 | P1 | +| TC-E5 | 审计日志存储满盘/写入失败 | 审计日志存储满盘或写入失败 | 1. 模拟存储满盘或写入失败 2. 
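Perform a configuration change | Drop non-critical fields or switch to asynchronous reporting without blocking the business operation; record a degradation event | P1 |
+| TC-E6 | Healing action causes a cascading failure | Healing action configured | 1. Trigger a healing action (e.g. a route switch) 2. Simulate a failure of the new node | Automatically restore the state from before the action, then escalate to a manual alert | P0 |
+| TC-E7 | Full time-series database outage | Monitoring system running | 1. Disconnect the time-series database | Console enters read-only/degraded mode; the alert engine keeps running from its local cache | P0 |
+| TC-E8 | Dashboard computation timeout | Dashboard has historical data | 1. Simulate a query engine timeout 2. Request dashboard metrics | Show the last successful result with its timestamp instead of waiting for the current request | P1 |
diff --git a/test/STRATEGY.md b/test/STRATEGY.md
new file mode 100644
index 0000000..2482483
--- /dev/null
+++ b/test/STRATEGY.md
@@ -0,0 +1,73 @@
+# AI-Ops Test Strategy
+
+> Version: v1.0 | Status: Draft
+
+---
+
+## 1. Test objectives
+
+| Objective | Metric | Verification |
+|------|------|---------|
+| Functional correctness | 100% pass rate across all ACs | At least 1 positive + 1 negative test case per AC |
+| Performance | Home page load <2s, queries <3s, alert trigger <30s | Load testing + peak testing |
+| Security | No privilege escalation, no missing audit logs | Penetration testing + audit traceability testing |
+| Disaster tolerance | A single-node failure does not affect service | Chaos engineering testing |
+
+## 2. Test levels
+
+```
+├── Unit tests
+│   ├── domain-layer logic tests
+│   ├── service-layer business flow tests
+│   └── handler-layer input validation tests
+│
+├── Integration tests
+│   ├── Database interaction tests
+│   ├── Redis cache interaction tests
+│   ├── Prometheus scrape tests
+│   └── External service mock tests
+│
+├── E2E (end-to-end) tests
+│   ├── API end-to-end tests
+│   ├── WebSocket real-time push tests
+│   └── Front-end flow tests
+│
+└── Chaos engineering tests
+    ├── Single-node failure
+    ├── Network partition
+    └── Database primary/replica failover
+```
+
+## 3. Test tools
+
+| Level | Tool | Notes |
+|------|------|------|
+| Unit tests | Go testing + testify + mockery | Coverage thresholds: domain ≥ 70%, service/handler ≥ 80% |
+| Database tests | testcontainers-go (PostgreSQL) | Each test run starts its own container |
+| Cache tests | miniredis | Lightweight Redis mock |
+| HTTP tests | httptest + net/http | Built into the standard library |
+| E2E tests | Custom Go E2E framework | Starts the full service + database + cache |
+| Chaos tests | chaos-mesh / custom scripts | chaos-mesh on K8s, custom scripts elsewhere |
+
+## 4. Test environments
+
+| Environment | Purpose | Data |
+|------|------|------|
+| Local development | Unit + quick integration tests | Generated test data |
+| CI | Automated unit + integration tests | Generated test data |
+| Test environment | E2E tests + performance baseline | Simulated production data (anonymized) |
+| Pre-production | Disaster recovery tests + rollback drills | Copy of production data (anonymized) |
+| Production | Canary monitoring + alert verification | Real production data |
+
+## 5. Test data management
+
+- Test data is managed as SQL scripts and JSON files under `test/fixtures/`.
+- Each test case is self-cleaning: it loads a fixed dataset before it starts and cleans up when it ends.
+- Database tests use programmatic transactions that are rolled back automatically when the test ends.
+
+## 6. 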
Automation and CI integration
+
+- Unit and integration tests are triggered automatically when a PR is submitted.
+- The full E2E suite runs on a daily schedule.
+- Chaos tests run on a weekly schedule (if a K8s environment is available).
+- TechLead and QA are notified automatically when tests fail.
diff --git a/test/perf/PERF_ENV.md b/test/perf/PERF_ENV.md
new file mode 100644
index 0000000..8576949
--- /dev/null
+++ b/test/perf/PERF_ENV.md
@@ -0,0 +1,34 @@
+# AI-Ops Performance Test Environment Specification
+
+## Load test targets
+
+| Scenario | Concurrent users | Target P95 | Target P99 | Failure rate limit |
+|------|----------|----------|----------|------------|
+| Home page load | 50 | < 2s | < 3s | < 1% |
+| Metric drill-down | 20 | < 3s | < 5s | < 1% |
+| Alert trigger to notification | - | < 30s | < 60s | < 0.1% |
+
+## Environment specification
+
+| Component | Spec | Notes |
+|------|------|------|
+| AI-Ops API Server | 2 vCPU / 4GB RAM | Matches the production target |
+| PostgreSQL | 2 vCPU / 4GB RAM / SSD | Seeded with 100,000 audit log rows as baseline data |
+| Redis | 1 vCPU / 2GB RAM | Used for the alert suppression cache |
+| Prometheus | Dedicated instance | Scrapes 10 metrics at a 15s interval |
+
+## Load test method
+
+1. **Home page load**: `k6 run test/perf/dashboard_k6.js`
+2. **Drill-down query**: `k6 run test/perf/drilldown_k6.js`
+3. **Alert latency**: measure trigger-to-notification latency with the Go test `alert_latency_test.go`
+
+## P99 calculation method
+
+Uses k6's default statistics: all request response times within the test period (10s sliding window) are sorted and the 99th percentile is taken.
+
+## Pass criteria
+
+- P95 within target and failure rate < 1% → pass
+- P95 over target but P99 within target → CONDITIONAL_APPROVED (performance optimization required)
+- P99 over target → blocked from entering the next phase
diff --git a/test/perf/alert_latency_test.go b/test/perf/alert_latency_test.go
new file mode 100644
index 0000000..65e2ac1
--- /dev/null
+++ b/test/perf/alert_latency_test.go
@@ -0,0 +1,29 @@
+package perf
+
+import (
+	"testing"
+	"time"
+
+	"github.com/stretchr/testify/assert"
+)
+
+// TestAlertLatency measures the latency from alert trigger to notification.
+// Target: < 30s (P95)
+func TestAlertLatency(t *testing.T) {
+	// Simulate the rule firing
+	triggeredAt := time.Now()
+
+	// TODO: replace with a real call to the alert service
+	// alert, err := alertService.Evaluate(ctx, ruleID)
+	// require.NoError(t, err)
+
+	// Simulate sending the notification
+	// err = notifyService.Send(ctx, alert)
+	// require.NoError(t, err)
+
+	// Compute the latency
+	latency := time.Since(triggeredAt)
+
+	t.Logf("Alert latency: %v", latency)
+	assert.Less(t, latency, 30*time.Second, "alert latency should be < 30s")
+}
diff --git a/test/perf/dashboard_k6.js b/test/perf/dashboard_k6.js
new file mode 100644
index 0000000..5110563
--- 
/dev/null
+++ b/test/perf/dashboard_k6.js
@@ -0,0 +1,32 @@
+import http from 'k6/http';
+import { check, sleep } from 'k6';
+
+// AI-Ops dashboard home page load test script
+// Goal: verify home page load < 2s with 50 concurrent users
+
+export const options = {
+  stages: [
+    { duration: '1m', target: 50 }, // ramp up to 50 concurrent users
+    { duration: '3m', target: 50 }, // hold steady for 3 minutes
+    { duration: '1m', target: 0 }, // ramp down
+  ],
+  thresholds: {
+    // P95 < 2s and P99 < 3s under one key; a repeated key would silently drop the first entry
+    http_req_duration: ['p(95)<2000', 'p(99)<3000'],
+    http_req_failed: ['rate<0.01'], // failure rate < 1%
+  },
+};
+
+export default function () {
+  const url = 'http://localhost:8080/api/v1/ai-ops/dashboard';
+  const res = http.get(url, {
+    headers: { 'Authorization': `Bearer ${__ENV.AI_OPS_TOKEN}` },
+  });
+
+  check(res, {
+    'status is 200': (r) => r.status === 200,
+    'response time < 2s': (r) => r.timings.duration < 2000,
+  });
+
+  sleep(1);
+}
diff --git a/test/perf/drilldown_k6.js b/test/perf/drilldown_k6.js
new file mode 100644
index 0000000..84deadc
--- /dev/null
+++ b/test/perf/drilldown_k6.js
@@ -0,0 +1,32 @@
+import http from 'k6/http';
+import { check, sleep } from 'k6';
+
+// AI-Ops drill-down load test script
+// Goal: verify drill-down load < 3s with 20 concurrent users
+
+export const options = {
+  stages: [
+    { duration: '30s', target: 20 },
+    { duration: '2m', target: 20 },
+    { duration: '30s', target: 0 },
+  ],
+  thresholds: {
+    http_req_duration: ['p(95)<3000'], // P95 < 3s
+    http_req_failed: ['rate<0.01'], // failure rate < 1%
+  },
+};
+
+export default function () {
+  const service = `service_${Math.floor(Math.random() * 10)}`;
+  const url = `http://localhost:8080/api/v1/ai-ops/metrics/drilldown?service=${service}&window=5m`;
+  const res = http.get(url, {
+    headers: { 'Authorization': `Bearer ${__ENV.AI_OPS_TOKEN}` },
+  });
+
+  check(res, {
+    'status is 200': (r) => r.status === 200,
+    'response time < 3s': (r) => r.timings.duration < 3000,
+  });
+
+  sleep(2);
+}