feat: close v3 governance evidence and slo metrics wiring
@@ -98,16 +98,49 @@
- `go test ./tests/integration/... -count=1` → PASS
- `bash ./scripts/test/test_tksea_portal_assets.sh` → PASS

### Live verification gaps

### Live verification close-out (2026-06-08)

remote43 is currently unreachable (SSH timeout / nginx timeout), so the following loops cannot be closed:
- **Root cause 1: the public `/v1/chat/completions` endpoint was not routed to CRM**
  - Evidence: on 2026-06-08, direct live probes with no credentials and with a bad key returned the host's error shapes `API_KEY_REQUIRED` / `INVALID_API_KEY` rather than the `unauthorized` / `key_paused` codes the CRM handler defines.
  - Fix: the remote43 nginx config now includes `location = /v1/chat/completions { proxy_pass http://127.0.0.1:18190/v1/chat/completions; }`; the repository was updated to match:
    - `deploy/tksea-portal/nginx.sub.tksea.top.conf.example`
    - `scripts/deploy/deploy_tksea_portal.sh`
- **Root cause 2: the CRM SQLite `hosts.auth_token` expired again**
  - Evidence: on 2026-06-08, `POST /portal-admin-api/api/keys` / `reset` returned `TOKEN_EXPIRED`; the error originated in `ensureSubjectHasAccess -> GET /api/v1/admin/users`.
  - Fix: on remote43, a fresh bearer token was obtained by logging in as the current host administrator and written back to the CRM SQLite `hosts.auth_token`.
- **The V3-1 three-stage governance live verification is now genuinely closed**
  - Artifact: `artifacts/v3-governance-live/20260608_102323/99-summary.json`
  - Re-check of the previously paused key:
    - `GET /portal-admin-api/api/keys/key_jxdopi6wykly` -> `admin_status=paused`
    - `POST https://sub.tksea.top/v1/chat/completions` with the same key -> `403 key_paused`
  - Full chain for a new key:
    - create -> `201`
    - chat-before -> `200`
    - pause -> `200`
    - get-paused -> `200` (`admin_status=paused`)
    - chat-paused -> `403 key_paused`
    - resume -> `200`
    - get-resumed -> `200` (`admin_status=active`)
    - chat-resumed -> `200`
    - delete -> `200`
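- Host-side key status via `PUT /api/v1/admin/api-keys/{id}` is still unusable (field writes do not take effect); governance still relies on clearing/restoring user-level `allowed_groups`, but this no longer blocks acceptance of the CRM gateway route.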
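
The new-key chain above can be treated as a table of expected statuses and checked mechanically. The following is a minimal sketch under stated assumptions: the names `step`, `expected`, and `firstMismatch` are hypothetical, and the observed run is passed in as a plain map rather than parsed from `99-summary.json`, whose schema is not shown here.

```go
package main

import "fmt"

// step pairs a lifecycle step name with the HTTP status expected for it.
type step struct {
	Name   string
	Status int
}

// expected lists the statuses reported by the 2026-06-08 live run for a new key.
var expected = []step{
	{"create", 201},
	{"chat-before", 200},
	{"pause", 200},
	{"get-paused", 200},
	{"chat-paused", 403},
	{"resume", 200},
	{"get-resumed", 200},
	{"chat-resumed", 200},
	{"delete", 200},
}

// firstMismatch returns the name of the first step whose observed status
// differs from the expected one, or "" when every expected step matched.
// A step missing from the observed run counts as a mismatch.
func firstMismatch(observed map[string]int) string {
	for _, want := range expected {
		got, ok := observed[want.Name]
		if !ok || got != want.Status {
			return want.Name
		}
	}
	return ""
}

func main() {
	run := map[string]int{
		"create": 201, "chat-before": 200, "pause": 200, "get-paused": 200,
		"chat-paused": 403, "resume": 200, "get-resumed": 200,
		"chat-resumed": 200, "delete": 200,
	}
	if name := firstMismatch(run); name != "" {
		fmt.Println("governance check failed at", name)
		return
	}
	fmt.Println("governance check passed")
}
```

A run in which the paused key still gets `200` on chat would be reported as failing at `chat-paused`, which is the step that decides whether pause is actually enforced.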

1. ~~Three-stage governance live verification (new subject → create key → chat 200 before pause → pause → chat fails → resume → chat 200)~~
   - **Ran end to end on 2026-06-06**: `artifacts/v3-governance-smoke/20260606_222410/99-summary.json`
   - create → 201, chat-before → 200, pause → 200, chat-paused → 200, resume → 200, chat-resumed → 200
   - **Known open issue**: chat still returns 200 after pause. The suspected root cause is that the host cache is not refreshed immediately after `allowed_groups` is cleared (host auth cache TTL / subscription refresh cycle). On the CRM side, `admin_status` correctly switched to `paused`.
   - → This is a host middleware latency issue, not a CRM code bug. The next iteration should probe the host-side cache time window, or explore validating `X-Portal-Subject` + `/v1/chat/completions` at the CRM gateway (blocking calls directly once a key is paused).
2. Host-side key status via `PUT /api/v1/admin/api-keys/{id}` is still unusable (field writes do not take effect). pause/resume currently relies on clearing/restoring user-level `allowed_groups`.
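
One way to probe the host-side cache window mentioned in item 1 is to keep sending chat requests with the paused key and record how long they continue to succeed. The sketch below is only an illustration: `probeFunc` and `measureWindow` are hypothetical names, and the probe is injected as a function so the loop can run without touching a real host.

```go
package main

import (
	"fmt"
	"time"
)

// probeFunc issues one chat request with the paused key and returns the
// HTTP status. It is injected so the loop can be exercised without a network.
type probeFunc func() int

// measureWindow calls probe every interval until it returns something other
// than 200 or maxAttempts is reached. It returns the number of probes that
// still got 200, the elapsed time, and whether the key was eventually blocked.
func measureWindow(probe probeFunc, interval time.Duration, maxAttempts int) (int, time.Duration, bool) {
	start := time.Now()
	for i := 0; i < maxAttempts; i++ {
		if probe() != 200 {
			return i, time.Since(start), true
		}
		time.Sleep(interval)
	}
	return maxAttempts, time.Since(start), false
}

func main() {
	// Fake host: the paused key keeps working for three probes, then is rejected.
	calls := 0
	fake := func() int {
		calls++
		if calls <= 3 {
			return 200
		}
		return 403
	}
	stale, elapsed, blocked := measureWindow(fake, 10*time.Millisecond, 20)
	fmt.Printf("stale probes=%d elapsed=%v blocked=%v\n", stale, elapsed, blocked)
}
```

Against a real host the probe would wrap an HTTP call, and the interval bounds the resolution of the measurement: a 5-second interval cannot distinguish a 6-second TTL from a 9-second one.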
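
### V3-2 SLO / observability minimal closed loop (2026-06-08, first batch)

- Goal: first wire the existing CRM gateway and the user-key self-service path into an observable source of truth, rather than stopping at "a /metrics endpoint exists, but the critical paths emit no logs or metrics".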
- Code wiring in this round:
  - `internal/metrics/metrics.go`: adds `user_key_operations_total` and `user_key_chat_requests_total`; HTTP metrics prefer `r.Pattern` to avoid high cardinality from dynamic paths
  - `internal/app/route_resolve_api.go`: resolve / failover now record route metrics
  - `internal/app/key_self_service_svc.go`: success metrics wired for create/reset/pause/resume/delete
  - `internal/app/http_api.go`: `/v1/chat/completions` now records outcome metrics for `unauthorized / invalid_api_key / key_paused / key_retired / quota_exhausted / bad_request / db_error / proxy_error / ok`
  - `internal/app/public_chat_metrics_test.go`: adds regression tests for quota_exhausted and the route pattern
- Gates for this round:
  - `go test ./internal/app ./internal/metrics -count=1` → PASS
  - `go test ./tests/integration/... -count=1` → PASS
  - `go vet ./...` → PASS
  - `go test -cover ./internal/...` → PASS (core packages `access/provision/pack` all ≥ 70%)
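- Current conclusion:
  - `Partially closed`: the first batch of SLO/observability wiring is complete and has passed the gates; the broader governance/SLO extensions (finer-grained failure paths, alerting/release gates) are still in progress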
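
The two label decisions in the wiring list, counting requests by route pattern instead of raw path and keeping chat outcomes to a fixed set, can be illustrated with a small stand-alone sketch. Everything here is an assumption for illustration: `counter` is an in-memory stand-in for the real metrics in `internal/metrics/metrics.go`, `keysPattern` is a hypothetical route, the `unknown` bucket is not something the source defines, and the pattern is captured at registration time rather than read from `r.Pattern`.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// counter is an in-memory stand-in for a labeled counter metric.
type counter struct {
	mu sync.Mutex
	n  map[string]int
}

func newCounter() *counter { return &counter{n: map[string]int{}} }

func (c *counter) inc(label string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n[label]++
}

func (c *counter) get(label string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n[label]
}

// keysPattern is a hypothetical subtree pattern covering every key ID.
const keysPattern = "/portal-admin-api/api/keys/"

// knownOutcomes is the outcome set listed for /v1/chat/completions.
var knownOutcomes = map[string]bool{
	"unauthorized": true, "invalid_api_key": true, "key_paused": true,
	"key_retired": true, "quota_exhausted": true, "bad_request": true,
	"db_error": true, "proxy_error": true, "ok": true,
}

// normalizeOutcome keeps the label set bounded. The "unknown" bucket is an
// assumption of this sketch.
func normalizeOutcome(code string) string {
	if knownOutcomes[code] {
		return code
	}
	return "unknown"
}

// newDemoMux registers one handler and counts requests by the registered
// pattern instead of r.URL.Path, so key IDs never become label values.
func newDemoMux(c *counter) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc(keysPattern, func(w http.ResponseWriter, r *http.Request) {
		c.inc(keysPattern)
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// serve runs one in-process request through mux and returns the status code.
func serve(mux *http.ServeMux, method, path string) int {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code
}

func main() {
	c := newCounter()
	mux := newDemoMux(c)
	serve(mux, "GET", keysPattern+"key_a")
	serve(mux, "GET", keysPattern+"key_b")
	fmt.Println(keysPattern, c.get(keysPattern))
	fmt.Println(normalizeOutcome("quota_exhausted"), normalizeOutcome("teapot"))
}
```

Requests for `key_a` and `key_b` both land on the same label, so the number of time series stays proportional to the number of routes rather than the number of keys.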

- The portal key management UI has been implemented, deployed, and verified on the live public endpoint:
  - Key code: