# Supply-Intelligence Production Observability and Inspection Checklist (2026-05-10)

Status: currently in effect
Repository: `/home/long/project/supply-intelligence`
Goal: ensure the critical paths have a minimum viable observability surface, and define the stop-loss and escalation paths for anomalies

---

## 1. Metrics Already Instrumented

The following metrics are registered through the Prometheus client in `internal/metrics/metrics.go` and can be scraped from the `/metrics` endpoint.

### 1.1 Probe layer

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `supply_intelligence_probe_evaluations_total` | Counter | platform, classification | Number of probe evaluations |
| `supply_intelligence_probe_latency_seconds` | Histogram | platform | Probe evaluation latency |

### 1.2 Discovery layer

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `supply_intelligence_discovery_scans_total` | Counter | platform, status | Number of scans |
| `supply_intelligence_discovery_new_models_total` | Counter | platform | Number of newly discovered models |

### 1.3 Admission layer

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `supply_intelligence_admission_tests_total` | Counter | platform, result | Number of admission tests |
| `supply_intelligence_admission_latency_seconds` | Histogram | platform | Admission test latency |

### 1.4 Gateway / Consumer layer

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `supply_intelligence_gateway_events_processed_total` | Counter | platform, event_type, result | Number of gateway events processed |
| `supply_intelligence_gateway_event_latency_seconds` | Histogram | platform | Gateway event processing latency |
| `supply_intelligence_gateway_event_retries_total` | Counter | platform, category | Number of retries |
| `supply_intelligence_gateway_pending_retry_events` | Gauge | consumer | Number of events pending retry |
| `supply_intelligence_gateway_failed_events` | Gauge | consumer | Number of events in a terminal failed state |

### 1.5 Routing State layer

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `supply_intelligence_accounts_by_status` | Gauge | platform, status | Number of accounts by status |
| `supply_intelligence_routing_enabled_accounts` | Gauge | platform | Number of accounts with routing enabled |
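
To confirm that the metrics listed in the tables above are actually exposed, a small check over the `/metrics` output can help. The sketch below is only an illustration: `missing_metrics` is a hypothetical helper, not part of the repository, and it assumes the endpoint emits the standard Prometheus text format with `# TYPE` lines. Labelled metrics may be absent until their first observation, so a name reported here may only mean that no event has occurred yet.

```bash
#!/usr/bin/env bash
# Metric names copied from the tables in section 1.
EXPECTED_METRICS=(
  supply_intelligence_probe_evaluations_total
  supply_intelligence_probe_latency_seconds
  supply_intelligence_discovery_scans_total
  supply_intelligence_discovery_new_models_total
  supply_intelligence_admission_tests_total
  supply_intelligence_admission_latency_seconds
  supply_intelligence_gateway_events_processed_total
  supply_intelligence_gateway_event_latency_seconds
  supply_intelligence_gateway_event_retries_total
  supply_intelligence_gateway_pending_retry_events
  supply_intelligence_gateway_failed_events
  supply_intelligence_accounts_by_status
  supply_intelligence_routing_enabled_accounts
)

# missing_metrics (hypothetical helper): reads Prometheus text exposition on
# stdin and prints every expected metric that has no "# TYPE <name> ..." line.
# Returns 1 if at least one metric is missing, 0 otherwise.
missing_metrics() {
  local body name rc=0
  body=$(cat)
  for name in "${EXPECTED_METRICS[@]}"; do
    if ! grep -q "^# TYPE ${name} " <<<"$body"; then
      echo "$name"
      rc=1
    fi
  done
  return "$rc"
}
```

A possible invocation is `curl -fsS "$METRICS_URL" | missing_metrics`, where `METRICS_URL` points at the service's metrics endpoint.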

---

## 2. Recommended Alert Rules (to be configured for the specific monitoring platform)

The following are recommended Prometheus alerting rule templates. They must be adapted to the specific Alertmanager or cloud monitoring deployment.

### 2.1 Critical (stop the loss immediately)

```yaml
# Gateway event failure rate spikes
- alert: SupplyIntelligenceGatewayFailureRateHigh
  expr: |
    (
      sum(rate(supply_intelligence_gateway_events_processed_total{result="failed"}[5m]))
      /
      sum(rate(supply_intelligence_gateway_events_processed_total[5m]))
    ) > 0.1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Gateway event failure rate above 10%"
    action: "Run scripts/gateway_closure_rollback.sh and notify the on-call engineer"

# Health check keeps failing
- alert: SupplyIntelligenceHealthCheckFailing
  expr: up{job="supply-intelligence"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Supply-Intelligence health check failing"
    action: "Check container/process status; restart if necessary"
```
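
Before the rules are deployed, the failure-rate expression above can be spot-checked by hand against a Prometheus server through its HTTP query API. The following is a minimal sketch under stated assumptions: `PROM_URL` (the base URL of a Prometheus server that scrapes this service) and the helper names `prom_first_value`, `exceeds_threshold`, and `query_failure_ratio` are all hypothetical and do not exist in the repository.

```bash
#!/usr/bin/env bash
# Same ratio as the SupplyIntelligenceGatewayFailureRateHigh rule, without the
# "> 0.1" comparison so that the raw value is returned.
FAILURE_RATIO_QUERY='sum(rate(supply_intelligence_gateway_events_processed_total{result="failed"}[5m])) / sum(rate(supply_intelligence_gateway_events_processed_total[5m]))'

# Reads a /api/v1/query JSON response on stdin and prints the value of the
# first sample, or an empty line when the result vector is empty.
prom_first_value() {
  python3 -c '
import json, sys
result = json.load(sys.stdin)["data"]["result"]
print(result[0]["value"][1] if result else "")
'
}

# Returns 0 when $1 is a number greater than threshold $2, 1 when it is not,
# and 2 when $1 is not a plain number (empty result, or "NaN" from 0/0).
exceeds_threshold() {
  awk -v v="$1" -v t="$2" 'BEGIN {
    if (v !~ /^[0-9.eE+-]+$/) exit 2
    exit !(v + 0 > t + 0)
  }'
}

# Queries Prometheus for the current failure ratio and prints it.
# PROM_URL is an assumption and must be provided by the caller.
query_failure_ratio() {
  : "${PROM_URL:?set PROM_URL to the Prometheus base URL}"
  curl -fsS -G "$PROM_URL/api/v1/query" \
    --data-urlencode "query=$FAILURE_RATIO_QUERY" | prom_first_value
}
```

One way to use it is `ratio=$(query_failure_ratio)` followed by `exceeds_threshold "$ratio" 0.1`. A return code of 2 usually means there was no gateway traffic in the window, which is worth distinguishing from a healthy ratio.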

### 2.2 Warning (needs attention)

```yaml
# Pending retry events piling up
- alert: SupplyIntelligencePendingRetryEventsHigh
  expr: supply_intelligence_gateway_pending_retry_events > 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Gateway pending retry events are piling up"
    action: "Check whether the consumer applier is misbehaving or the downstream gateway is unreachable"

# Publish transaction conflicts are frequent
- alert: SupplyIntelligencePublishConflictHigh
  expr: |
    increase(supply_intelligence_gateway_events_processed_total{result="duplicate"}[5m]) > 5
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Publish transaction conflicts are frequent"
    action: "Check for duplicate publish requests or faulty client retry logic"

# Admission test latency is high
- alert: SupplyIntelligenceAdmissionLatencyHigh
  expr: histogram_quantile(0.99, sum(rate(supply_intelligence_admission_latency_seconds_bucket[5m])) by (le, platform)) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Admission test P99 latency above 10s"
    action: "Check whether LLM API calls are behaving abnormally"
```
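
The rule templates in this section are bare lists of rules, whereas a Prometheus rule file expects them under `groups[].rules`. As a rough sketch, the fragments could be wrapped as shown below and then validated with `promtool check rules`. The helper `wrap_rules`, the group name, and the file names in the usage note are assumptions chosen for illustration.

```bash
#!/usr/bin/env bash
# wrap_rules (hypothetical helper): reads a bare list of alerting rules on
# stdin (lines starting at "- alert:") and prints a complete rule file with
# the rules nested under a single group named by $1.
wrap_rules() {
  local group_name=$1
  printf 'groups:\n  - name: %s\n    rules:\n' "$group_name"
  # Indent every non-empty line by six spaces so that the list items sit
  # under "rules:"; the uniform shift keeps "expr: |" block scalars valid.
  sed '/./s/^/      /'
}
```

For example, with the YAML from 2.1 and 2.2 saved without the code fences as `rules-fragment.yaml`, running `wrap_rules supply-intelligence < rules-fragment.yaml > rules.yaml` and then `promtool check rules rules.yaml` reports syntax and expression errors before anything is loaded into Prometheus.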

---

## 3. Inspection Checklist

### 3.1 Automated inspection script (recommended to run on a schedule)

```bash
#!/usr/bin/env bash
# Recommended to run from a cron job or CI inspection every 5 minutes
set -euo pipefail

BASE_URL="${BASE_URL:-http://127.0.0.1:8080}"
METRICS_URL="${METRICS_URL:-http://127.0.0.1:9090/metrics}"

echo "=== Supply-Intelligence inspection $(date -Iseconds) ==="

# 1. Health check
health=$(curl -fsS -o /dev/null -w "%{http_code}" "$BASE_URL/internal/supply-intelligence/healthz" || true)
if [ "$health" != "200" ]; then
  echo "[FAIL] healthz: $health"
  exit 1
fi
echo "[PASS] healthz: 200"

# 2. Runtime status (falls back to an empty object if the request fails)
status=$(curl -fsS "$BASE_URL/internal/supply-intelligence/gateway/runtime-status" || echo '{}')
# "or 0" maps a missing or null field to 0 so the integer tests below stay valid
pending=$(echo "$status" | python3 -c "import sys,json; print(int(json.load(sys.stdin).get('pending_retry_events') or 0))")
failed=$(echo "$status" | python3 -c "import sys,json; print(int(json.load(sys.stdin).get('failed_events') or 0))")
echo "[INFO] pending_retry=$pending failed=$failed"

# 3. Metrics endpoint is scrapeable
# Fetch first, then grep: piping curl straight into `grep -q` can fail under
# pipefail when grep exits on the first match and curl hits a closed pipe.
metrics=$(curl -fsS "$METRICS_URL" || true)
if grep -q "supply_intelligence_gateway_events_processed_total" <<<"$metrics"; then
  echo "[PASS] gateway metrics available"
else
  echo "[FAIL] gateway metrics missing"
  exit 1
fi

# 4. Key threshold checks
if [ "$pending" -gt 50 ]; then
  echo "[WARN] pending_retry_events=$pending > 50"
fi
if [ "$failed" -gt 10 ]; then
  echo "[WARN] failed_events=$failed > 10"
fi

echo "=== Inspection complete ==="
```

### 3.2 Manual inspection items (must check after release)

| Item | Verification method | Expected result | Frequency |
|------|---------------------|-----------------|-----------|
| candidate and package state consistency | Sample the `admission-state` API | candidate.published and package.active appear in pairs | Daily |
| event and snapshot consistency | Compare `last_event_id` with the latest applied event | They match | Daily |
| Unauthorized consumer filtering | Check whether consumers with no associated account have ack records | No records | Weekly |
| DB transaction logs | Check PostgreSQL for slow queries and deadlocks | No anomalies | Weekly |
| Retry queue progress | Observe whether pending retry events gradually decrease | Downward trend | Daily |
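
For the retry queue item above, the trend can be read from the output of the script in 3.1, provided that output is kept somewhere. The sketch below assumes it was appended to a log file and simply compares the first and last `[INFO] pending_retry=N` samples. `pending_trend` is a hypothetical helper, and a first-versus-last comparison is a deliberately crude stand-in for looking at a dashboard.

```bash
#!/usr/bin/env bash
# pending_trend (hypothetical helper): reads inspection-script output on stdin
# and classifies the pending_retry trend by comparing the first and the last
# "[INFO] pending_retry=N" sample it finds.
pending_trend() {
  awk '
    match($0, /pending_retry=[0-9]+/) {
      # "pending_retry=" is 14 characters long; the digits follow it.
      v = substr($0, RSTART + 14, RLENGTH - 14) + 0
      if (n == 0) first = v
      last = v
      n++
    }
    END {
      if (n < 2)             print "insufficient-data"
      else if (last < first) print "decreasing"
      else if (last > first) print "increasing"
      else                   print "flat"
    }'
}
```

With the script output collected in, say, `inspection.log` (an assumed file name), `pending_trend < inspection.log` prints one of `decreasing`, `increasing`, `flat`, or `insufficient-data`.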

---

## 4. Escalation Path

| Scenario | Escalation | Expected time |
|----------|------------|---------------|
| Alert fires | On-call engineer receives the notification | < 2 minutes |
| Warning level | Assess the impact and decide whether to pause the runtime | < 10 minutes |
| Critical level | Execute the rollback runbook immediately | < 5 minutes |
| Cause cannot be located | Notify TechLead + PM and start incident response | < 30 minutes |

---

## 5. Known Gaps

| Gap | Impact | Plan |
|-----|--------|------|
| Alert rules not deployed to a concrete platform | Currently templates only | Roll out with cloud monitoring / Alertmanager |
| Centralized log collection not configured | Troubleshooting depends on local logs | Integrate ELK/Loki |
| Automated inspection script not scheduled | Currently run manually | Add to CI / scheduled jobs |

---

Version: v1.0 | Created: 2026-05-10