tech/PRODUCTION_OBSERVABILITY_CHECKLIST_2026-05-10.md

# Supply-Intelligence Production Observability and Inspection Checklist (2026-05-10)

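Status: currently in effect  
Repository: `/home/long/project/supply-intelligence`  
Goal: ensure every critical path has a minimum viable observability surface, and define the stop-loss and escalation path to follow when anomalies occur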

---

## 1. Instrumented Metrics

The following metrics are registered through the Prometheus client in `internal/metrics/metrics.go` and can be scraped from the `/metrics` endpoint.

### 1.1 Probe Layer

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `supply_intelligence_probe_evaluations_total` | Counter | platform, classification | Number of probe evaluations |
| `supply_intelligence_probe_latency_seconds` | Histogram | platform | Probe evaluation latency |

### 1.2 Discovery Layer

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `supply_intelligence_discovery_scans_total` | Counter | platform, status | Number of scans |
| `supply_intelligence_discovery_new_models_total` | Counter | platform | Number of newly discovered models |

### 1.3 Admission Layer

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `supply_intelligence_admission_tests_total` | Counter | platform, result | Number of admission tests |
| `supply_intelligence_admission_latency_seconds` | Histogram | platform | Admission test latency |

### 1.4 Gateway / Consumer Layer

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `supply_intelligence_gateway_events_processed_total` | Counter | platform, event_type, result | Number of gateway events processed |
| `supply_intelligence_gateway_event_latency_seconds` | Histogram | platform | Gateway event processing latency |
| `supply_intelligence_gateway_event_retries_total` | Counter | platform, category | Number of retries |
| `supply_intelligence_gateway_pending_retry_events` | Gauge | consumer | Number of events waiting for retry |
| `supply_intelligence_gateway_failed_events` | Gauge | consumer | Number of events in a terminal failed state |

### 1.5 Routing State Layer

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `supply_intelligence_accounts_by_status` | Gauge | platform, status | Number of accounts by status |
| `supply_intelligence_routing_enabled_accounts` | Gauge | platform | Number of accounts with routing enabled |
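
To confirm that the metric families in the tables above are actually exposed, the scraped text can be checked mechanically. The following is a minimal sketch using only the Python standard library; `exposed_names` and `missing_families` are hypothetical helper names, and the parser is deliberately simplified (it only reads the metric name at the start of each sample line). Note that a labeled metric may not appear in the output until at least one label combination has been observed, so a "missing" family is a prompt to investigate rather than proof that registration failed.

```python
import re

# Sample lines in the Prometheus text format start with the metric name.
_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def exposed_names(text):
    """Return the set of sample names found in a /metrics payload."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = _NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_families(text, expected):
    """Return the expected metric families that have no sample in the payload.

    Histograms are exposed as <name>_bucket, <name>_sum and <name>_count,
    so any of those suffixes counts as the family being present.
    """
    names = exposed_names(text)
    missing = []
    for family in expected:
        variants = {family, family + "_bucket", family + "_sum", family + "_count"}
        if not variants & names:
            missing.append(family)
    return missing
```

Passing the body of `/metrics` and the metric names from section 1 returns the families that still need attention; an empty list means every family has at least one sample.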

---

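## 2. Recommended Alert Rules (to be adapted to the target monitoring platform)

The following are recommended Prometheus alert rule templates. They still need to be deployed against a concrete Alertmanager or cloud monitoring platform.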

### 2.1 Critical (stop the loss immediately)

```yaml
# Sudden spike in the gateway event failure rate
- alert: SupplyIntelligenceGatewayFailureRateHigh
  expr: |
    (
      sum(rate(supply_intelligence_gateway_events_processed_total{result="failed"}[5m]))
      /
      sum(rate(supply_intelligence_gateway_events_processed_total[5m]))
    ) > 0.1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Gateway event failure rate above 10%"
    action: "Run scripts/gateway_closure_rollback.sh and notify the on-call engineer"

# Health check failing continuously
- alert: SupplyIntelligenceHealthCheckFailing
  expr: up{job="supply-intelligence"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Supply-Intelligence health check failing"
    action: "Check the container/process status; restart if necessary"
```
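
The failure-rate expression divides the increase in failed events by the increase in all events over the same window. As an illustration of that arithmetic only, here is a minimal Python sketch operating on two counter snapshots. `failure_ratio` and `is_critical` are hypothetical names, and the sketch ignores the counter-reset handling and extrapolation that Prometheus `rate()` performs.

```python
def failure_ratio(prev, curr):
    """Ratio of failed events to all events between two counter snapshots.

    prev / curr map the `result` label value to the cumulative counter value.
    Returns None when no events were processed in the window.
    """
    total_delta = sum(curr.values()) - sum(prev.values())
    if total_delta <= 0:
        return None
    failed_delta = curr.get("failed", 0) - prev.get("failed", 0)
    return failed_delta / total_delta


def is_critical(prev, curr, threshold=0.1):
    """True when the failure ratio exceeds the threshold (10% by default)."""
    ratio = failure_ratio(prev, curr)
    return ratio is not None and ratio > threshold
```

With no traffic the sketch returns `None` and does not flag the window. The PromQL expression behaves the same way: 0/0 yields NaN, which does not satisfy `> 0.1`, so an idle gateway does not fire the alert.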

### 2.2 Warning (needs attention)

```yaml
# Backlog of pending retry events
- alert: SupplyIntelligencePendingRetryEventsHigh
  expr: supply_intelligence_gateway_pending_retry_events > 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Gateway pending retry events are backing up"
    action: "Check whether the consumer applier is failing or the downstream gateway is unreachable"

# Frequent publish transaction conflicts
- alert: SupplyIntelligencePublishConflictHigh
  expr: |
    increase(supply_intelligence_gateway_events_processed_total{result="duplicate"}[5m]) > 5
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Frequent publish transaction conflicts"
    action: "Check for duplicate publish requests or faulty client retry logic"

# High admission test latency
- alert: SupplyIntelligenceAdmissionLatencyHigh
  expr: histogram_quantile(0.99, sum(rate(supply_intelligence_admission_latency_seconds_bucket[5m])) by (le, platform)) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Admission test P99 latency above 10s"
    action: "Check whether LLM API calls are failing"
```
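
The latency rule relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation. The sketch below reproduces that estimate for classic histograms in plain Python, to make the behavior at the bucket edges visible. `bucket_quantile` is a hypothetical name and the bucket layouts in the usage note are illustrative, not the ones configured in `internal/metrics/metrics.go`.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The last bucket must have upper bound +Inf, as in Prometheus histograms.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    ordered = sorted(buckets)
    if not ordered or ordered[-1][0] != math.inf:
        raise ValueError("the last bucket must have upper bound +Inf")
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(ordered) if count >= rank)
    upper, count = ordered[index]
    if upper == math.inf:
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return ordered[-2][0] if len(ordered) > 1 else math.nan
    lower, below = ordered[index - 1] if index > 0 else (0.0, 0)
    if count == below:
        return upper
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)
```

For example, `bucket_quantile(0.5, [(1.0, 20), (5.0, 80), (10.0, 100), (math.inf, 100)])` interpolates inside the `(1, 5]` bucket. One consequence worth checking against the real bucket layout: when the rank falls in the `+Inf` bucket the estimate is capped at the largest finite bound, so if the largest finite bucket is 10 (as with the Prometheus client default buckets) the condition `> 10` can never be true.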

---

## 3. Inspection Checklist

### 3.1 Automated inspection script (scheduled execution recommended)

```bash
#!/usr/bin/env bash
# Suggested for a cronjob or CI inspection, run every 5 minutes
set -euo pipefail

BASE_URL="${BASE_URL:-http://127.0.0.1:8080}"
METRICS_URL="${METRICS_URL:-http://127.0.0.1:9090/metrics}"

echo "=== Supply-Intelligence inspection $(date -Iseconds) ==="

# 1. Health check
health=$(curl -fsS -o /dev/null -w "%{http_code}" "$BASE_URL/internal/supply-intelligence/healthz" || true)
if [ "$health" != "200" ]; then
  echo "[FAIL] healthz: $health"
  exit 1
fi
echo "[PASS] healthz: 200"

# 2. Runtime status
status=$(curl -fsS "$BASE_URL/internal/supply-intelligence/gateway/runtime-status" || echo '{}')
pending=$(echo "$status" | python3 -c "import sys,json; print(json.load(sys.stdin).get('pending_retry_events',0))")
failed=$(echo "$status" | python3 -c "import sys,json; print(json.load(sys.stdin).get('failed_events',0))")
echo "[INFO] pending_retry=$pending failed=$failed"

# 3. Metrics are scrapable
# Fetch first, then grep: with pipefail, `curl | grep -q` can report a false
# failure when grep exits on the first match and curl writes to a closed pipe.
metrics=$(curl -fsS "$METRICS_URL" || true)
if grep -q "supply_intelligence_gateway_events_processed_total" <<<"$metrics"; then
  echo "[PASS] gateway metrics available"
else
  echo "[FAIL] gateway metrics missing"
  exit 1
fi

# 4. Key threshold checks
if [ "$pending" -gt 50 ]; then
  echo "[WARN] pending_retry_events=$pending > 50"
fi
if [ "$failed" -gt 10 ]; then
  echo "[WARN] failed_events=$failed > 10"
fi

echo "=== Inspection complete ==="
```
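
The inline `python3 -c` extraction in the script assumes that the runtime-status payload is valid JSON and that both fields are integers; a `null` or string value would make the later `-gt` comparison fail. As a hedged alternative, the threshold logic can live in a small Python helper that validates the payload first. `evaluate_status` is a hypothetical name, the field names and thresholds are the ones used by the script, and the response shape is otherwise an assumption.

```python
import json

# Thresholds taken from step 4 of the inspection script.
THRESHOLDS = (("pending_retry_events", 50), ("failed_events", 10))


def evaluate_status(raw):
    """Return a list of findings for a runtime-status JSON payload.

    An empty list means both counters are within their thresholds.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["[FAIL] runtime-status is not valid JSON"]
    if not isinstance(data, dict):
        return ["[FAIL] runtime-status is not a JSON object"]

    findings = []
    for key, limit in THRESHOLDS:
        value = data.get(key, 0)  # missing fields default to 0, as in the script
        if isinstance(value, bool) or not isinstance(value, int):
            findings.append(f"[FAIL] {key} is not an integer: {value!r}")
        elif value > limit:
            findings.append(f"[WARN] {key}={value} > {limit}")
    return findings
```

Calling `evaluate_status('{"pending_retry_events": 60, "failed_events": 3}')` yields a single `[WARN]` finding for the pending counter, while malformed input produces a `[FAIL]` finding instead of a shell arithmetic error.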

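### 3.2 Manual inspection items (mandatory after go-live)

| Item | Verification method | Expected result | Frequency |
|------|---------------------|-----------------|-----------|
| Consistency between candidate and package state | Sample the `admission-state` API | candidate.published and package.active appear in pairs | Daily |
| Consistency between event and snapshot | Compare `last_event_id` with the latest applied event | They match | Daily |
| Filtering of unauthorized consumers | Check whether consumers with no associated account have ack records | No records | Weekly |
| DB transaction log | Check PostgreSQL for slow queries and deadlocks | No anomalies | Weekly |
| Retry queue progress | Observe whether pending retry events gradually decrease | Downward trend | Daily |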
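
The "downward trend" criterion for the retry queue can be made less subjective by fitting a line through a series of `pending_retry_events` readings. The sketch below is one possible interpretation, not an agreed definition: `trend_slope` and `is_draining` are hypothetical names, readings are assumed to be taken at equal intervals, and a queue that has already reached zero is treated as healthy.

```python
def trend_slope(samples):
    """Least-squares slope of equally spaced samples (units per interval)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_draining(samples):
    """True if the queue is empty at the last reading or trending downward."""
    return samples[-1] == 0 or trend_slope(samples) < 0
```

For instance, readings of `[40, 35, 30, 20, 10]` count as draining, while a flat non-zero series such as `[5, 5, 5]` does not and would merit a closer look at the consumer applier.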

---

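## 4. Escalation Path

| Scenario | Escalation | Expected time |
|----------|------------|---------------|
| Alert fires | On-call engineer receives the notification | < 2 minutes |
| Warning level | Assess the impact and decide whether the runtime needs to be paused | < 10 minutes |
| Critical level | Execute the rollback runbook immediately | < 5 minutes |
| Root cause cannot be located | Notify TechLead + PM and start incident response | < 30 minutes |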

---

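## 5. Known Gaps

| Gap | Impact | Plan |
|-----|--------|------|
| Alert rules are not deployed to a concrete platform | They currently exist only as templates | Roll out with cloud monitoring / Alertmanager |
| Centralized log collection is not configured | Troubleshooting depends on local logs | Integrate ELK/Loki |
| The automated inspection script is not scheduled | It is currently run manually | Add it to CI / scheduled jobs |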

---

Version: v1.0 | Created: 2026-05-10