Supply-Intelligence Production Observability and Inspection Checklist (2026-05-10)

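Status: currently in effect
Repository: /home/long/project/supply-intelligence
Goal: ensure every critical path has a minimum viable observability surface, and define the stop-loss and escalation paths for when anomalies occur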

1. Instrumented Metrics

The following metrics are registered through the Prometheus client in internal/metrics/metrics.go and can be scraped from the /metrics endpoint.

1.1 Probe layer

Metric	Type	Labels	Description
`supply_intelligence_probe_evaluations_total`	Counter	platform, classification	Number of probe evaluations
`supply_intelligence_probe_latency_seconds`	Histogram	platform	Probe evaluation latency

1.2 Discovery layer

Metric	Type	Labels	Description
`supply_intelligence_discovery_scans_total`	Counter	platform, status	Number of scans
`supply_intelligence_discovery_new_models_total`	Counter	platform	Number of newly discovered models

1.3 Admission layer

Metric	Type	Labels	Description
`supply_intelligence_admission_tests_total`	Counter	platform, result	Number of admission tests
`supply_intelligence_admission_latency_seconds`	Histogram	platform	Admission test latency

1.4 Gateway / Consumer layer

Metric	Type	Labels	Description
`supply_intelligence_gateway_events_processed_total`	Counter	platform, event_type, result	Number of gateway events processed
`supply_intelligence_gateway_event_latency_seconds`	Histogram	platform	Gateway event processing latency
`supply_intelligence_gateway_event_retries_total`	Counter	platform, category	Number of retries
`supply_intelligence_gateway_pending_retry_events`	Gauge	consumer	Number of events pending retry
`supply_intelligence_gateway_failed_events`	Gauge	consumer	Number of events in terminal failed state

1.5 Routing State layer

Metric	Type	Labels	Description
`supply_intelligence_accounts_by_status`	Gauge	platform, status	Number of accounts grouped by status
`supply_intelligence_routing_enabled_accounts`	Gauge	platform	Number of accounts with routing enabled
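
As a quick cross-check of the tables above, the following sketch reads a Prometheus text-format exposition on stdin and prints every listed metric that has no sample line. The function name `missing_metrics` and the variable `EXPECTED_METRICS` are illustrative and not part of the repository. A labelled metric may be absent simply because nothing has been observed for it yet, so a hit here is a prompt to look, not proof of a defect.

```shell
# Metric names copied from sections 1.1 to 1.5 of this checklist.
EXPECTED_METRICS="
supply_intelligence_probe_evaluations_total
supply_intelligence_probe_latency_seconds
supply_intelligence_discovery_scans_total
supply_intelligence_discovery_new_models_total
supply_intelligence_admission_tests_total
supply_intelligence_admission_latency_seconds
supply_intelligence_gateway_events_processed_total
supply_intelligence_gateway_event_latency_seconds
supply_intelligence_gateway_event_retries_total
supply_intelligence_gateway_pending_retry_events
supply_intelligence_gateway_failed_events
supply_intelligence_accounts_by_status
supply_intelligence_routing_enabled_accounts
"

# missing_metrics (hypothetical helper): reads exposition text on stdin and
# prints each expected metric that has no sample line, one per line.
# Returns 0 when nothing is missing, 1 otherwise.
missing_metrics() {
  local body name missing
  body=$(cat)
  missing=0
  for name in $EXPECTED_METRICS; do
    # Histograms expose <name>_bucket/_sum/_count, so accept those suffixes.
    # A sample line is the name followed by '{' (labels) or a space (no labels).
    if ! printf '%s\n' "$body" | grep -E "^${name}(_bucket|_sum|_count)?(\{| )" >/dev/null; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}
```

Typical use would be `curl -fsS "$METRICS_URL" | missing_metrics`, where `METRICS_URL` points at the service's /metrics endpoint.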

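2. Recommended Alert Rules (to be configured for the specific monitoring platform)

The following are recommended Prometheus alerting rule templates; they need to be deployed against the specific Alertmanager / cloud monitoring platform in use.

2.1 Critical (stop the loss immediately)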

# Gateway event failure rate spikes
- alert: SupplyIntelligenceGatewayFailureRateHigh
  expr: |
    (
      sum(rate(supply_intelligence_gateway_events_processed_total{result="failed"}[5m]))
      /
      sum(rate(supply_intelligence_gateway_events_processed_total[5m]))
    ) > 0.1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Gateway event failure rate exceeds 10%"
    action: "Run scripts/gateway_closure_rollback.sh and notify the on-call engineer"

# Health check failing continuously
# (`up == 0` means Prometheus could not scrape the target)
- alert: SupplyIntelligenceHealthCheckFailing
  expr: up{job="supply-intelligence"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Supply-Intelligence health check failing"
    action: "Check the container/process state and restart if necessary"

2.2 Warning (needs attention)

# Pending retry events backing up
- alert: SupplyIntelligencePendingRetryEventsHigh
  expr: supply_intelligence_gateway_pending_retry_events > 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Gateway pending retry events backing up"
    action: "Check whether the consumer applier is misbehaving or the downstream gateway is unreachable"

# Publish transaction conflicts occurring frequently
- alert: SupplyIntelligencePublishConflictHigh
  expr: |
    increase(supply_intelligence_gateway_events_processed_total{result="duplicate"}[5m]) > 5
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Publish transaction conflicts occurring frequently"
    action: "Check for duplicate publish requests or faulty client retry logic"

# Admission test latency high
- alert: SupplyIntelligenceAdmissionLatencyHigh
  expr: histogram_quantile(0.99, sum(rate(supply_intelligence_admission_latency_seconds_bucket[5m])) by (le, platform)) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Admission test P99 latency exceeds 10s"
    action: "Check whether LLM API calls are misbehaving"
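
Because these rules are still templates, it can help to evaluate an expression ad hoc before deploying it. The sketch below sends an instant query to the Prometheus HTTP API (`/api/v1/query`) and extracts the first value from the JSON response. `PROM_URL`, `prom_query`, and `first_value` are assumptions introduced for illustration; the Prometheus server address in particular must be adapted to the actual environment.

```shell
# Assumption: address of the Prometheus server (not the service itself).
PROM_URL="${PROM_URL:-http://127.0.0.1:9090}"

# prom_query EXPR (hypothetical helper): run an instant query and print the
# raw JSON response. -G turns the url-encoded data into a query string.
prom_query() {
  curl -fsS -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# first_value (hypothetical helper): read an instant-query response on stdin
# and print the sample value of the first series, or "none" when the result
# vector is empty (for example, when no events were processed in the window).
first_value() {
  python3 -c '
import json, sys
result = json.load(sys.stdin).get("data", {}).get("result", [])
print(result[0]["value"][1] if result else "none")
'
}
```

For example, `prom_query 'sum(rate(supply_intelligence_gateway_events_processed_total{result="failed"}[5m])) / sum(rate(supply_intelligence_gateway_events_processed_total[5m]))' | first_value` prints the current failure ratio that the Critical rule compares against 0.1.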

3. Inspection Checklist

3.1 Automated inspection script (recommended to run on a schedule)

#!/usr/bin/env bash
# Intended for a cronjob or CI inspection job, run every 5 minutes
set -euo pipefail

BASE_URL="${BASE_URL:-http://127.0.0.1:8080}"
METRICS_URL="${METRICS_URL:-http://127.0.0.1:9090/metrics}"

echo "=== Supply-Intelligence inspection $(date -Iseconds) ==="

# 1. Health check
health=$(curl -fsS -o /dev/null -w "%{http_code}" "$BASE_URL/internal/supply-intelligence/healthz" || true)
if [ "$health" != "200" ]; then
  echo "[FAIL] healthz: $health"
  exit 1
fi
echo "[PASS] healthz: 200"

# 2. Runtime status (falls back to an empty object if the endpoint is unreachable)
status=$(curl -fsS "$BASE_URL/internal/supply-intelligence/gateway/runtime-status" || echo '{}')
pending=$(echo "$status" | python3 -c "import sys,json; print(json.load(sys.stdin).get('pending_retry_events',0))")
failed=$(echo "$status" | python3 -c "import sys,json; print(json.load(sys.stdin).get('failed_events',0))")
echo "[INFO] pending_retry=$pending failed=$failed"

# 3. Metrics can be scraped
# Fetch first, then grep: under pipefail, `curl | grep -q` can report a false
# failure when grep exits on the first match and curl hits a write error.
metrics=$(curl -fsS "$METRICS_URL" || true)
if grep -q "supply_intelligence_gateway_events_processed_total" <<<"$metrics"; then
  echo "[PASS] gateway metrics available"
else
  echo "[FAIL] gateway metrics missing"
  exit 1
fi

# 4. Key threshold checks
if [ "$pending" -gt 50 ]; then
  echo "[WARN] pending_retry_events=$pending > 50"
fi
if [ "$failed" -gt 10 ]; then
  echo "[WARN] failed_events=$failed > 10"
fi

echo "=== Inspection complete ==="
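
The script above parses the runtime-status response with an inline python3 one-liner, which aborts the whole run under `set -e` if the endpoint returns something that is not a JSON object. A more defensive variant could look like the sketch below; `json_int_field` is a hypothetical helper name, and defaulting to 0 on bad input is an assumption that may not suit every deployment (a stricter script might prefer to fail instead).

```shell
# json_int_field KEY (hypothetical helper): read JSON on stdin and print the
# integer stored under KEY. Prints 0 when the key is missing, the value is
# not an integer, or the input is not a JSON object.
json_int_field() {
  python3 -c '
import json, sys
try:
    value = json.load(sys.stdin).get(sys.argv[1], 0)
    print(int(value))
except (ValueError, TypeError, AttributeError):
    # ValueError: invalid JSON or non-numeric string
    # TypeError: null value; AttributeError: top level is not an object
    print(0)
' "$1"
}
```

With this helper, the two extraction lines would become `pending=$(echo "$status" | json_int_field pending_retry_events)` and `failed=$(echo "$status" | json_int_field failed_events)`.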

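3.2 Manual inspection items (must be checked after go-live)

Item	Verification method	Expected result	Frequency
candidate and package state consistency	Sample the `admission-state` API	candidate.published and package.active appear in pairs	Daily
event and snapshot consistency	Compare `last_event_id` with the latest applied event	They match	Daily
Unauthorized consumer filtering	Check whether consumers with no associated account have ack records	No records	Weekly
DB transaction logs	Check PostgreSQL for slow queries / deadlocks	No anomalies	Weekly
Retry queue progression	Observe whether pending retry events gradually decrease	Downward trend	Daily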
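
The retry queue check in the table above ("downward trend") can be made mechanical once a few samples of `pending_retry_events` have been collected. The sketch below only tests that a sequence of integer samples never increases. `is_non_increasing` is a hypothetical helper, and how the samples are collected and stored is left open.

```shell
# is_non_increasing N1 N2 ... (hypothetical helper): succeed (exit 0) when
# every integer sample is less than or equal to the one before it.
is_non_increasing() {
  local prev cur
  prev=""
  for cur in "$@"; do
    if [ -n "$prev" ] && [ "$cur" -gt "$prev" ]; then
      return 1
    fi
    prev=$cur
  done
  return 0
}
```

For example, `is_non_increasing 12 9 9 4 || echo "[WARN] pending retry queue is growing"` stays silent, while the samples `12 15 4` would trigger the warning.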

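4. Escalation Path

Scenario	Escalation action	Expected time
Alert fires	On-call engineer receives the notification	< 2 minutes
Warning level	Assess the impact and decide whether the runtime needs to be paused	< 10 minutes
Critical level	Execute the rollback runbook immediately	< 5 minutes
Cause cannot be located	Notify the TechLead and PM and start incident response	< 30 minutes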

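5. Known Gaps

Gap	Impact	Plan
Alert rules not deployed to a specific platform	Currently templates only	Roll out with cloud monitoring / Alertmanager
Centralized log collection not configured	Troubleshooting relies on local logs	Integrate ELK/Loki
Automated inspection script not scheduled	Currently run manually	Add to CI / scheduled jobs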

Version: v1.0 | Created: 2026-05-10
