# TechLead Design: Gateway Closure / Retry / Canary Rollback / Inspection Gates (2026-05-08)

Status: currently in effect
Stage conclusion: ready for QA design review
Repository: `/home/long/project/supply-intelligence`
Upstream sources of truth:
- `/home/long/project/supply-intelligence/tech/CURRENT_SOURCE_OF_TRUTH_2026-05.md`
- `/home/long/project/supply-intelligence/tech/BASELINE_TECHLEAD_V2.md`
- `/home/long/project/supply-intelligence/tech/GATEWAY_CONSUMER_DECISION_2026-05.md`
- `/home/long/project/supply-intelligence/tech/PRODUCTION_LAUNCH_CLOSURE_BOARD_2026-05-08.md`
- `/home/long/project/supply-intelligence/prd/PM_GATEWAY_CLOSURE_PRD_2026-05-08.md`

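## 0. Current Conclusion

The repository already has the following concrete anchor points, which serve as the basis for this round of closure design:
- Package publish -> event write: `/home/long/project/supply-intelligence/internal/publish/service.go`
- Gateway pull / automatic consumption / ack:
  - `/home/long/project/supply-intelligence/internal/httpapi/server.go`
  - `/home/long/project/supply-intelligence/internal/gatewayconsumer/service.go`
  - `/home/long/project/supply-intelligence/internal/poller/gateway_package_poller.go`
- Exposure of admission-state / routing-state / healthz / metrics: `/home/long/project/supply-intelligence/internal/httpapi/server.go`
- Postgres persistence of package events and gateway snapshots:
  - `/home/long/project/supply-intelligence/internal/repository/postgres.go`
  - `/home/long/project/supply-intelligence/migrations/0003_gateway_snapshots.sql`
- E2E proof of publish -> consume -> ack -> admission-state: `/home/long/project/supply-intelligence/internal/httpapi/postgres_e2e_test.go`

Measured against the PM closure criteria, three kinds of engineering closure are still missing:
1. Gateway failure classification and the automatic retry boundary are not yet mapped onto the existing consumer/poller/repository structure
2. Rollout / rollback still lack a defined home for scripts, command entry points, and inspection documents
3. Metrics are exposed at `/metrics`, but the key gateway semantics are not yet instrumented in the call chain

The goal of this document is therefore not to explore a new architecture, but to turn the launch closure items into a file-level implementation design and task breakdown on top of the existing code structure.

Conclusion:
- The current design package is sufficient to enter QA design review
- The QA review should state explicitly that what is under review is "implementation to be carried out per this document", not "the current code is ready for launch"
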
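## 1. Design Boundaries

### 1.1 In Scope
- Implementation boundary of the gateway package event pull and ack contract
- Gateway consumption failure classification, automatic retry, the terminal failed state, and the manual handling entry point
- Technical support for the rollout / rollback runbook: endpoints, scripts, commands, check documents
- Mapping observability metrics, alerts, and inspection gates to specific files
- The real call chains that QA design review must verify
- File-level task breakdown for Engineers

### 1.2 Out of Scope
- No MQ/Kafka/Redis/Temporal is introduced
- No extension to an event ack loop for NewAPI / Sub2API
- No rebuild of a standalone console or external alerting platform
- No change to the main package publish model or to the basic event + ack pattern

### 1.3 Constraints
- The design must fit the code and directories that already exist in the repository
- Prefer reusing existing packages: `internal/gatewayconsumer`, `internal/poller`, `internal/httpapi`, `internal/repository`, `internal/metrics`
- No new infrastructure; only scripts, documents, a small number of repository fields/methods, tests, and instrumentation may be added inside this repository
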
## 2. Gateway Contract Implementation Boundary

### 2.1 Current Code Boundary

The contract is currently implemented as follows:

1. Publish side
   - `POST /internal/supply-intelligence/publish/package-event`
   - Implementation: `/home/long/project/supply-intelligence/internal/httpapi/server.go :: handlePublishPackageEvent`
   - Service: `/home/long/project/supply-intelligence/internal/publish/service.go :: PublishDraft`
   - Semantics: candidate `test_passed -> published`, package `draft -> active`, generates `PackageChangeEvent{gateway_sync_status=pending}`

2. Event query side
   - `GET /internal/supply-intelligence/gateway/package-changes?cursor=...`
   - Implementation: `/home/long/project/supply-intelligence/internal/httpapi/server.go :: handleListPackageChanges`
   - repo: `/home/long/project/supply-intelligence/internal/repository/interfaces.go :: ListPackageEventsAfter`
   - Postgres: `/home/long/project/supply-intelligence/internal/repository/postgres.go :: ListPackageEventsAfter`

3. Ack write-back side
   - `POST /internal/supply-intelligence/gateway/package-changes/{event_id}/ack`
   - Implementation: `/home/long/project/supply-intelligence/internal/httpapi/server.go :: handleAckPackageChange`
   - repo: `/home/long/project/supply-intelligence/internal/repository/interfaces.go :: AckPackageEvent`
   - Postgres: `/home/long/project/supply-intelligence/internal/repository/postgres.go :: AckPackageEvent`

4. Local default consumer
   - Consumer service: `/home/long/project/supply-intelligence/internal/gatewayconsumer/service.go :: ConsumeOnce`
   - poller: `/home/long/project/supply-intelligence/internal/poller/gateway_package_poller.go :: PollOnce`
   - runtime: `/home/long/project/supply-intelligence/internal/poller/runtime.go :: Start`
   - Wiring: `/home/long/project/supply-intelligence/internal/app/app.go`

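### 2.2 Contract Boundary Conclusions

Implementation must stay within the following boundaries:

supply-intelligence is responsible for:
1. Producing pending events
2. Providing the cursor-based pull endpoint
3. Accepting applied/failed acks
4. Persisting the sync status of each event and exposing it for queries
5. Providing the admission-state read view, making it explicit that `published != applied`

The in-repository gateway consumer is responsible for:
1. Pulling pending events
2. Executing the local apply
3. Producing an explicit result for every attempt
4. Retrying in a controlled way within the safely retryable range
5. Writing back applied or failed once a terminal state is reached

Explicit non-responsibilities:
- supply-intelligence does not synchronously call downstream management RPCs to decide whether a publish succeeded
- The gateway consumer does not modify upstream candidate/package status
- The ack does not re-run publish logic; it only writes back the consumption result
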
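### 2.3 State Semantics Constraints

The following state semantics must be unified and reflected in API responses, test assertions, and runbook wording:
- `candidate.status=published`: published upstream and available for consumption
- `package.status=active`: upstream has allowed downstream consumption
- `event.gateway_sync_status=pending`: final consumption confirmation has not been received yet
- `event.gateway_sync_status=applied`: the consumer has applied the change successfully
- `event.gateway_sync_status=failed`: the consumer has confirmed failure, and automatic retry stops

admission-state already exposes these semantics through `last_event.gateway_sync_status`; the code is at:
- `/home/long/project/supply-intelligence/internal/httpapi/server.go :: handleModelAdmissionState`
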
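The `published != applied` rule above can be sketched as a small predicate. This is a minimal illustration only: `AdmissionView` and `Effective` are hypothetical names that do not come from the repository, and the real admission-state payload may carry more fields.

```go
package main

// GatewaySyncStatus mirrors the three externally visible sync states.
type GatewaySyncStatus string

const (
	SyncPending GatewaySyncStatus = "pending"
	SyncApplied GatewaySyncStatus = "applied"
	SyncFailed  GatewaySyncStatus = "failed"
)

// AdmissionView is a hypothetical, trimmed-down view of admission-state.
type AdmissionView struct {
	CandidateStatus string
	PackageStatus   string
	LastEventSync   GatewaySyncStatus
}

// Effective reports whether the change has actually taken effect downstream.
// published and active alone are not enough: only an applied ack counts.
func (v AdmissionView) Effective() bool {
	return v.CandidateStatus == "published" &&
		v.PackageStatus == "active" &&
		v.LastEventSync == SyncApplied
}
```

A test or runbook check built on such a predicate would report a published, active package whose last event is still `pending` as not yet in effect.
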
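### 2.4 Current Design Gaps

Gaps between the current code and the PM criteria:
1. `AckPackageEvent` only has the final applied/failed write-back; there is no "retrying" structure
2. `gatewayconsumer.Service` currently acks applied/failed directly within a single consumption round, with no failure classification
3. `poller.Runtime` only pulls at a fixed interval, with no per-event backoff or retry limit
4. The event table currently stores only the ack result, with no retry count, last failure time, or failure category
5. `ListPackageEventsAfter` currently returns events that are already failed, but the consumer skips them because it only consumes pending ones, so there is no structure for "how a failed event re-enters automatic retry"
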
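## 3. Mapping the Failure Retry Policy onto the Existing Code Structure

### 3.1 Mapping PM Criteria to the Code Model

PM definitions:
- Auto-retryable: transient network errors, temporary 5xx, timeouts, and a briefly unavailable gateway where the operation is idempotent-safe
- Not auto-retryable: parameter/contract errors, idempotency conflicts, authentication errors, explicit business rejection
- Limit: at most 3 automatic retries per event, with backoff of 1m / 5m / 15m
- After the 3rd failure the event moves to final `failed`

Implementation principles after mapping onto the current code structure:
1. `gateway_sync_status` keeps only `pending|applied|failed`; no more complex external semantics are added
2. An event under automatic retry stays `pending`
3. Retry metadata lives in persisted repository fields, rather than treating `failed` as "will still be retried automatically"
4. An event is acked `failed` only when it is finally non-retryable or has reached the limit of 3
5. Any successful attempt is acked `applied` directly
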
### 3.2 Suggested New Data Fields

Based on the current table structure, it is suggested to add the following fields to the schema that holds package events, without introducing a new table:
- `retry_count int not null default 0`
- `last_retry_at timestamptz null`
- `next_retry_at timestamptz null`
- `last_failure_category varchar(32) null`
- `last_failure_detail text null`

File locations:
- New migration: `/home/long/project/supply-intelligence/migrations/0004_gateway_event_retry_state.sql`
- Postgres read/write: `/home/long/project/supply-intelligence/internal/repository/postgres.go`
- In-memory implementation kept in sync: `/home/long/project/supply-intelligence/internal/repository/memory.go`
- Domain model: `/home/long/project/supply-intelligence/internal/domain/types.go`

### 3.3 Failure Classification Model

It is suggested to add a consumption failure category enum in domain. It is used only for internal consumption and observability and is not exposed as a new launch state:
- `temporary_network`
- `temporary_timeout`
- `temporary_5xx`
- `temporary_unavailable`
- `contract_invalid`
- `auth_forbidden`
- `idempotency_conflict`
- `business_rejected`
- `unknown`

File location:
- `/home/long/project/supply-intelligence/internal/domain/types.go`

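A minimal sketch of such an enum in Go is shown below. The type and constant names are hypothetical; only the string values come from the list above. Treating `unknown` as non-retryable is an assumption of this sketch, since the PM criteria do not say which side it falls on.

```go
package main

// FailureCategory is the internal consumption failure classification.
type FailureCategory string

const (
	TemporaryNetwork     FailureCategory = "temporary_network"
	TemporaryTimeout     FailureCategory = "temporary_timeout"
	Temporary5xx         FailureCategory = "temporary_5xx"
	TemporaryUnavailable FailureCategory = "temporary_unavailable"
	ContractInvalid      FailureCategory = "contract_invalid"
	AuthForbidden        FailureCategory = "auth_forbidden"
	IdempotencyConflict  FailureCategory = "idempotency_conflict"
	BusinessRejected     FailureCategory = "business_rejected"
	Unknown              FailureCategory = "unknown"
)

// Retryable reports whether the category belongs to the auto-retryable set.
// Assumption: anything not explicitly temporary, including unknown, is terminal.
func (c FailureCategory) Retryable() bool {
	switch c {
	case TemporaryNetwork, TemporaryTimeout, Temporary5xx, TemporaryUnavailable:
		return true
	default:
		return false
	}
}
```
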
### 3.4 Consumer Layer Implementation Boundary

Existing file: `/home/long/project/supply-intelligence/internal/gatewayconsumer/service.go`

The current `applier` returns:
- `GatewayAckResult`
- `detail string`

To support failure classification and retry, it is suggested to switch to an internal result structure without changing the HTTP contract:
- `ackResult`: used only at the final write-back
- `retryable bool`
- `failureCategory string`
- `detail string`

Consumer processing rules:
1. Pull events
2. If the event is `pending` and `next_retry_at` is empty or due, attempt apply
3. Apply succeeds:
   - Update the snapshot
   - ack `applied`
4. Apply fails and is auto-retryable:
   - `retry_count + 1`
   - Write `last_failure_category/detail`
   - Compute `next_retry_at`
   - If the count < 3: stay `pending`, do not write a final ack
   - If the count == 3: ack `failed`
5. Apply fails and is not auto-retryable:
   - ack `failed` directly
   - Persist the failure category and detail

### 3.5 Methods to Add at the Repository Layer

Add the following methods to `internal/repository/interfaces.go`, so that retry logic is not pushed into the HTTP handlers:
- `ListRetryablePendingPackageEvents(ctx context.Context, consumer string, now time.Time, limit int) []domain.PackageChangeEvent`
- `MarkPackageEventRetry(ctx context.Context, eventID string, retryCount int, nextRetryAt time.Time, category, detail string) (domain.PackageChangeEvent, error)`
- `GetPackageEventByID(ctx context.Context, eventID string) (domain.PackageChangeEvent, bool)`

Corresponding Postgres implementation file:
- `/home/long/project/supply-intelligence/internal/repository/postgres.go`

Corresponding in-memory implementation file:
- `/home/long/project/supply-intelligence/internal/repository/memory.go`

Rationale:
- `ListPackageEventsAfter` currently has "event stream read" semantics and is not suited to also carry "due retry task queue" semantics
- Automatic retry should filter by pending + next_retry_at instead of relying on the cursor to rescan the full history

### 3.6 poller/runtime Layer Mapping

Existing files:
- `/home/long/project/supply-intelligence/internal/poller/gateway_package_poller.go`
- `/home/long/project/supply-intelligence/internal/poller/runtime.go`

Suggested mapping:
1. `GatewayPackagePoller.PollOnce` keeps its "single round" semantics
2. `gatewayconsumer.Service.ConsumeOnce` internally becomes:
   - First process new pending events pulled via the cursor
   - Then process retryable pending events that are due
3. `Runtime` stays a simple timer, with no complex scheduling at the runtime layer
4. Backoff calculation lives in `gatewayconsumer/service.go` or in a new `gatewayconsumer/retry_policy.go`

This fits the current structure without introducing a new scheduler/queue.

### 3.7 Retry State Machine

The internal state machine for event consumption is:
1. `pending` + first consumption succeeds -> `applied`
2. `pending` + retryable failure + count 1/2 -> stays `pending`, writes `next_retry_at`
3. `pending` + retryable failure + count 3 -> `failed`
4. `pending` + non-retryable failure -> `failed`
5. `failed` is no longer consumed automatically
6. `applied` is not consumed again

This is consistent with the PM criteria and does not break the existing three-state semantics of the external API.

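The six transitions can be written down as one pure function, which also makes them easy to cover with a table-driven test. This is an illustrative sketch: `transition` and `outcome` are hypothetical names, and leaving `retry_count` unchanged on a non-retryable failure is an assumption the document does not spell out.

```go
package main

// maxFailedAttempts is the failure count at which an event becomes final failed.
const maxFailedAttempts = 3

// outcome is the result of one consumption attempt on an event.
type outcome struct {
	Status     string // resulting gateway_sync_status
	RetryCount int    // resulting retry_count
	FinalAck   bool   // whether a final applied/failed ack is written this round
}

// transition applies one attempt result to an event in the given state.
func transition(status string, retryCount int, applied, retryable bool) outcome {
	if status != "pending" {
		// applied and failed are terminal: nothing is consumed again automatically.
		return outcome{Status: status, RetryCount: retryCount}
	}
	if applied {
		return outcome{Status: "applied", RetryCount: retryCount, FinalAck: true}
	}
	if !retryable {
		return outcome{Status: "failed", RetryCount: retryCount, FinalAck: true}
	}
	retryCount++
	if retryCount >= maxFailedAttempts {
		return outcome{Status: "failed", RetryCount: retryCount, FinalAck: true}
	}
	// Still within the limit: stay pending; the caller persists next_retry_at.
	return outcome{Status: "pending", RetryCount: retryCount}
}
```
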
## 4. Scripts, Endpoints, and Documents Needed by the Rollout / Rollback Runbook

### 4.1 Existing Reusable Endpoints

Real endpoints the runbook can already reuse:
- `/healthz`: `/home/long/project/supply-intelligence/internal/httpapi/server.go :: handleHealth`
- `/metrics`: `/home/long/project/supply-intelligence/internal/httpapi/server.go :: Routes`
- `POST /internal/supply-intelligence/publish/package-event`
- `GET /internal/supply-intelligence/gateway/package-changes`
- `POST /internal/supply-intelligence/gateway/package-changes/{event_id}/ack`
- `POST /internal/supply-intelligence/gateway/consume-once`
- `GET /internal/supply-intelligence/models/{platform}/{model}/admission-state`
- `GET /internal/supply-intelligence/accounts/{account_id}/routing-state`

### 4.2 Missing Runbook Support Artifacts

Per the PM requirements, the runbook cannot be text only; it must come with scripts and check entry points. Suggested additions:

1. Tabletop drill script
   - Path: `/home/long/project/supply-intelligence/scripts/gateway_closure_smoke.sh`
   - Purpose: run the publish -> package-changes -> consume-once/ack -> admission-state check
   - Used for pre-launch prerequisite 3, "complete at least one tabletop drill"

2. Inspection script
   - Path: `/home/long/project/supply-intelligence/scripts/gateway_closure_inspect.sh`
   - Purpose: read metrics, healthz, admission-state samples, and the failed event count, and report whether the continue / pause / rollback conditions are met

3. Rollback script or operation template
   - Path: `/home/long/project/supply-intelligence/scripts/gateway_closure_rollback.sh`
   - Purpose: it does not delete data directly; it is a semi-automated operation template that calls controlled entry points to "stop the poller / locate failed events / manually ack or republish a replacement package"

4. Runbook document
   - Path: `/home/long/project/supply-intelligence/tech/RUNBOOK_GATEWAY_ROLLOUT_ROLLBACK_2026-05-08.md`
   - Four sections are mandatory:
     - Pre-launch checks
     - Canary observation
     - Failure rollback
     - Post-rollback confirmation

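### 4.3 Operations / Control Endpoints to Add

The repository currently lacks an explicit gateway runtime switch and status endpoint, so the runbook cannot implement "pause the ramp-up / stop automatic consumption". Suggested minimal control entry points:

1. Runtime status query
   - Suggested path: `GET /internal/supply-intelligence/gateway/runtime-status`
   - Location: `/home/long/project/supply-intelligence/internal/httpapi/server.go`
   - Returns: whether the poller is started, the cursor, the last poll time, the last error, the number of events awaiting retry, and the number of final failed events

2. Runtime pause / resume
   - Suggested paths:
     - `POST /internal/supply-intelligence/gateway/runtime/pause`
     - `POST /internal/supply-intelligence/gateway/runtime/resume`
   - Locations:
     - `internal/httpapi/server.go`
     - `internal/app/app.go`
     - `internal/poller/runtime.go`
   - Purpose: support the runbook step "pause further ramp-up without rolling back immediately"

Notes:
- This does not introduce a new platform; it only adds a controllable switch to the existing poller/runtime
- Without this switch, the runbook can only stop the service at process level, which is too coarse
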
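### 4.4 Technical Definition of Rollback

Given the current state of this repository, rollback should not be defined as "deleting events" or "reverting to the pre-published state", but as one of the following controlled actions:
1. Pause the gateway consumer runtime to stop consuming new events
2. Generate a replacement publish event for the faulty package so that the new correct version overrides the old faulty one
3. After manual judgment, re-deliver or close final failed events
4. Use admission-state and the gateway snapshot to confirm that the impact has been contained

The technical support files the runbook needs are therefore:
- Script: `scripts/gateway_closure_rollback.sh`
- Document: `tech/RUNBOOK_GATEWAY_ROLLOUT_ROLLBACK_2026-05-08.md`
- Query endpoints: runtime-status, admission-state, package-changes
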
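## 5. Where Observability Metrics, Alerts, and Inspection Gates Land

### 5.1 Current State

`internal/metrics/metrics.go` currently declares:
- `GatewayEventsProcessedTotal`
- `GatewayEventLatencySeconds`
- `AccountsByStatus`
- `RoutingEnabledAccounts`

However, a search of the repository shows that these metrics are not yet wired into the key gateway/probe/admission call chains; at least, no code currently references them. The current state is therefore:
- The `/metrics` endpoint exists
- The metric declarations exist
- The key gateway closure metrics are not actually instrumented

This is exactly the gap the PM document describes as "having metrics exposed does not mean the production criteria are clear".
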
### 5.2 Metric Placement Design

1. Gateway event processing volume
   - Metric: `supply_intelligence_gateway_events_processed_total`
   - File: `/home/long/project/supply-intelligence/internal/gatewayconsumer/service.go`
   - Instrumentation points:
     - Each final `applied`
     - Each final `failed`
   - Labels: suggested to extend from the existing `{platform,event_type}` to `{platform,event_type,result}`

2. Gateway event processing latency
   - Metric: `supply_intelligence_gateway_event_latency_seconds`
   - File: `internal/gatewayconsumer/service.go`
   - Instrumentation point: from the start of apply to the end of the current attempt
   - Note: used to check the PM requirement "whether latency from new event to applied is stable", although the strict "event creation to applied" latency needs an additional observation

3. Gateway retry count / backlog
   Suggested additions:
   - `supply_intelligence_gateway_event_retries_total{platform,category}`
   - `supply_intelligence_gateway_pending_retry_events{consumer}`
   - `supply_intelligence_gateway_failed_events{consumer}`

   Files:
   - Declaration: `/home/long/project/supply-intelligence/internal/metrics/metrics.go`
   - Updates: `/home/long/project/supply-intelligence/internal/gatewayconsumer/service.go`
   - Query support: `/home/long/project/supply-intelligence/internal/repository/postgres.go`, `memory.go`

4. Routing state inventory
   - `AccountsByStatus`
   - `RoutingEnabledAccounts`
   - Update point: `/home/long/project/supply-intelligence/internal/probe/service.go`
   - Purpose: support the 24h inspection item "view account status / routing enabled counts per platform"

5. admission-state observability support
   New metrics are not necessarily required, but the API sampling check entry point must be kept:
   - `/internal/supply-intelligence/models/{platform}/{model}/admission-state`
   - File: `internal/httpapi/server.go`

### 5.3 Alert Gate Mapping

Without introducing new infrastructure, this round first delivers "alert rule definition document + scripted inspection + metric instrumentation points".

Suggested new document:
- `/home/long/project/supply-intelligence/tech/OBSERVABILITY_GATEWAY_CLOSURE_2026-05-08.md`

It should define at least the following gates:
1. Applied ratio over 15 minutes < 95% -> pause ramp-up
2. Pending retry events > 10 -> pause ramp-up
3. 3 consecutive final failed -> trigger rollback
4. metrics/healthz unreachable -> stop proceeding with the launch
5. Any occurrence of auth_forbidden / contract_invalid / idempotency_conflict -> escalate to TechLead + XL

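The five gates can be evaluated by a small pure function, sketched below in Go for consistency with the rest of the repository even though the inspection entry point is planned as a shell script. The `Snapshot` fields and the precedence between gates are assumptions: the document lists the gates but not their order, and this sketch maps "stop proceeding with the launch" to `pause` because the inspection output only has continue / pause / rollback.

```go
package main

// Snapshot is a hypothetical set of inputs collected by the inspection step.
type Snapshot struct {
	HealthOK               bool
	MetricsOK              bool
	Applied15m             int  // final applied events in the last 15 minutes
	Failed15m              int  // final failed events in the last 15 minutes
	PendingRetry           int  // pending events currently waiting for a retry
	ConsecutiveFinalFailed int  // length of the current run of final failed events
	EscalationCategorySeen bool // auth_forbidden / contract_invalid / idempotency_conflict observed
}

type Decision string

const (
	Continue Decision = "continue"
	Pause    Decision = "pause"
	Rollback Decision = "rollback"
)

// evaluate returns the gate decision and whether to escalate to TechLead + XL.
func evaluate(s Snapshot) (Decision, bool) {
	escalate := s.EscalationCategorySeen
	// Gate 4: without healthz/metrics the other numbers cannot be trusted.
	if !s.HealthOK || !s.MetricsOK {
		return Pause, escalate
	}
	// Gate 3: three consecutive final failures trigger rollback.
	if s.ConsecutiveFinalFailed >= 3 {
		return Rollback, escalate
	}
	// Gate 2: retry backlog above 10.
	if s.PendingRetry > 10 {
		return Pause, escalate
	}
	// Gate 1: applied ratio below 95%, compared in integers; no events means no signal.
	total := s.Applied15m + s.Failed15m
	if total > 0 && s.Applied15m*100 < 95*total {
		return Pause, escalate
	}
	return Continue, escalate
}
```
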
### 5.4 Minimum Inspection Script Output

`gateway_closure_inspect.sh` should output:
- Whether healthz returns 200
- Whether metrics can be scraped
- Number of pending events
- Number of due retry events
- Number of failed events
- Applied count / failed count over the last 15 minutes
- Applied ratio over the last 15 minutes
- Whether a continue / pause / rollback threshold is hit

To produce these outputs, the repository layer needs additional count queries; file locations:
- `/home/long/project/supply-intelligence/internal/repository/interfaces.go`
- `/home/long/project/supply-intelligence/internal/repository/postgres.go`
- `/home/long/project/supply-intelligence/internal/repository/memory.go`

## 6. Call Chains QA Must Check During Design Review

QA must not look at definitions only; each chain must be verified across four layers: definition -> wiring -> invocation -> entry point.

### Chain A: Event Enters the Pending State After Publish
- Definition: `internal/publish/service.go :: PublishDraft`
- Wiring: `internal/app/app.go :: buildApp`
- Invocation: `internal/httpapi/server.go :: handlePublishPackageEvent`
- Entry point: `POST /internal/supply-intelligence/publish/package-event`
- Must check: `gateway_sync_status=pending` must be visible in the response body or in a subsequent admission-state read

### Chain B: Gateway Automatic Consumption Succeeds
- Definition: `internal/gatewayconsumer/service.go :: ConsumeOnce`
- Wiring: `GatewayConsumerService`, `GatewayPoller`, and `GatewayRuntime` in `internal/app/app.go`
- Invocation: `internal/poller/gateway_package_poller.go :: PollOnce`
- Entry point:
  - `POST /internal/supply-intelligence/gateway/consume-once`
  - or the runtime timer started by `internal/poller/runtime.go :: Start`
- Must check: after success the event goes `pending -> applied` and the snapshot has been written

### Chain C: Gateway Automatic Retry
- Definition: a new `gatewayconsumer/retry_policy.go`, or retry logic inside `service.go`
- Wiring: `app.go` injects the consumer and the runtime
- Invocation: second-pass handling of retryable events inside `ConsumeOnce`
- Entry point: the timed runtime or an explicit `consume-once`
- Must check:
  - A retryable failure does not immediately write a final `failed`
  - `retry_count` and `next_retry_at` keep changing across attempts
  - The event moves to `failed` only after the 3rd failure

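### Chain D: Terminal State for Non-Retryable Failures
- Definition: failure classification in `gatewayconsumer/service.go`
- Wiring: same as above
- Invocation: apply returns contract/auth/conflict/business reject
- Entry point: `consume-once` or the poller runtime
- Must check: the event is `failed` on the first round, and the failure category/detail can be queried
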
### Chain E: admission-state Exposes the published/applied Difference
- Definition: `internal/httpapi/server.go :: handleModelAdmissionState`
- Wiring: server routes mounted
- Invocation: repo `GetLatestPackageEvent`
- Entry point: `GET /internal/supply-intelligence/models/{platform}/{model}/admission-state`
- Must check: `package active` must not be misreported as "in effect"

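### Chain F: Runbook Pre-Execution Checks
- Definition: `/healthz`, `/metrics`, `gateway_closure_smoke.sh`
- Wiring: `server.go :: Routes`
- Invocation: the script issues real calls against the HTTP entry points
- Entry point: `scripts/gateway_closure_smoke.sh`
- Must check: the script is not a stub; its commands and endpoint paths must really exist
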
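### Chain G: Pause / Resume Automatic Consumption
- Definition: new runtime pause/resume endpoints
- Wiring: `server.go` + `app.go` + `poller/runtime.go`
- Invocation: called by the runbook when pausing ramp-up
- Entry point: pause/resume HTTP endpoint or an equivalent CLI
- Must check: after pause, no new events are consumed, but existing status queries still work
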
## 7. Engineer Task Breakdown (Must Include Concrete File Paths)

The tasks below are broken down to fit the current code with the minimum necessary changes.

### 7.1 Domain / Schema
1. `/home/long/project/supply-intelligence/internal/domain/types.go`
   - Add the gateway failure category enum
   - Add retry metadata fields to `PackageChangeEvent`

2. `/home/long/project/supply-intelligence/migrations/0004_gateway_event_retry_state.sql`
   - Add `retry_count` / `last_retry_at` / `next_retry_at` / `last_failure_category` / `last_failure_detail` to the package events table
   - Add an index on `ack_status + next_retry_at` or an equivalent query index

### 7.2 Repository
3. `/home/long/project/supply-intelligence/internal/repository/interfaces.go`
   - Add interfaces for the retryable event query, event-by-id query, retry marking, and statistics queries

4. `/home/long/project/supply-intelligence/internal/repository/postgres.go`
   - Implement the new interfaces
   - Update the scan structure of `ListPackageEventsAfter` / `GetLatestPackageEvent` / `AckPackageEvent`
   - Add pending/retry/failed statistics queries

5. `/home/long/project/supply-intelligence/internal/repository/memory.go`
   - Implement the retry metadata and statistics interfaces in step with Postgres

6. `/home/long/project/supply-intelligence/internal/repository/memory_test.go`
   - Add behavior tests for the in-memory repository

7. `/home/long/project/supply-intelligence/internal/repository/postgres_publish_tx_test.go`
   - Add consistency tests for the event retry fields on the Postgres transaction path

### 7.3 Gateway Consumer / Poller
8. `/home/long/project/supply-intelligence/internal/gatewayconsumer/service.go`
   - Introduce failure classification
   - Add the automatic retry decision
   - Add the 1m/5m/15m backoff calculation
   - Write applied on success
   - On retryable failure, stay pending and update next_retry_at
   - On non-retryable failure or once the limit of 3 is reached, write failed
   - Add metrics instrumentation

9. `/home/long/project/supply-intelligence/internal/gatewayconsumer/retry_policy.go`
   - Extract the retry decision and backoff functions so that `service.go` does not grow too heavy

10. `/home/long/project/supply-intelligence/internal/gatewayconsumer/service_test.go`
    - Add the following tests:
      - retryable failure stays pending on attempt 1/2
      - retryable failure becomes failed on attempt 3
      - non-retryable failure becomes failed immediately
      - applied path updates snapshot and metrics

11. `/home/long/project/supply-intelligence/internal/poller/gateway_package_poller.go`
    - Keep changes minimal; if the latest poll result needs to be exposed, add a last run state

12. `/home/long/project/supply-intelligence/internal/poller/runtime.go`
    - Add pause/resume/status capability
    - Keep the existing Start/Stop behavior compatible

13. `/home/long/project/supply-intelligence/internal/poller/runtime_test.go`
    - Add pause/resume/status tests

### 7.4 HTTP API / App Wiring
14. `/home/long/project/supply-intelligence/internal/httpapi/server.go`
    - Add the runtime-status / pause / resume routes
    - If needed, add a statistics endpoint for inspection
    - Keep the existing `package-changes`, `ack`, and `consume-once` endpoints backward compatible

15. `/home/long/project/supply-intelligence/internal/httpapi/server_test.go`
    - Add tests for the runtime control endpoints

16. `/home/long/project/supply-intelligence/internal/httpapi/server_integration_test.go`
    - Add tests showing that automatic consumption stops after pause and resumes after resume

17. `/home/long/project/supply-intelligence/internal/app/app.go`
    - Expose the runtime status and control capability to the HTTP layer
    - If needed, add a method on Application for reading the gateway runtime status

### 7.5 Metrics / Observability
18. `/home/long/project/supply-intelligence/internal/metrics/metrics.go`
    - Add the gateway retry total / pending retry gauge / failed gauge
    - If necessary, extend the label dimensions of processed_total

19. `/home/long/project/supply-intelligence/internal/gatewayconsumer/service.go`
    - Actually write the metrics; declaring them without calling them is not acceptable

20. `/home/long/project/supply-intelligence/internal/probe/service.go`
    - Wire `AccountsByStatus` and `RoutingEnabledAccounts` into the status write-back path

21. `/home/long/project/supply-intelligence/internal/probe/service_test.go`
    - Add tests for probe metric updates

### 7.6 Runbook / Scripts / Docs
22. `/home/long/project/supply-intelligence/scripts/gateway_closure_smoke.sh`
    - Pre-launch drill script
    - Verifies publish -> package-changes -> consume-once/ack -> admission-state

23. `/home/long/project/supply-intelligence/scripts/gateway_closure_inspect.sh`
    - 24h / 72h inspection script
    - Outputs the continue / pause / rollback decision

24. `/home/long/project/supply-intelligence/scripts/gateway_closure_rollback.sh`
    - Rollback operation template script
    - Supports pausing the runtime, querying failed events, and printing manual recovery hints

25. `/home/long/project/supply-intelligence/tech/RUNBOOK_GATEWAY_ROLLOUT_ROLLBACK_2026-05-08.md`
    - Records the rollout / rollback steps and their owners

26. `/home/long/project/supply-intelligence/tech/OBSERVABILITY_GATEWAY_CLOSURE_2026-05-08.md`
    - Documents metrics, alerts, inspection, and the escalation path

### 7.7 E2E / QA Evidence
27. `/home/long/project/supply-intelligence/internal/httpapi/postgres_e2e_test.go`
    - Extend to cover:
      - pending -> applied
      - retryable failure -> pending -> applied
      - retryable failure x3 -> failed
      - non-retryable failure -> failed
      - runtime pause/resume

28. `/home/long/project/supply-intelligence/internal/poller/gateway_package_poller_test.go`
    - Add tests for the mixed cursor + retry path

29. `/home/long/project/supply-intelligence/internal/httpapi/admission_state_api_test.go`
    - Add tests for the published/pending/applied/failed semantics

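## 8. QA Review Conclusion Criteria

### 8.1 Design-Stage Conclusion Available Now
- Conclusion: ready for QA design review

Reasons:
1. The sources of truth and the PM closure requirements have been mapped onto real files in the current repository
2. The existing code anchor points of the gateway main chain really exist; this is not an empty design
3. This document has refined failure retry, runbook, observability, call chains, and Engineer tasks down to file level
4. It does not branch out into new infrastructure, in line with the current repository constraints
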
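### 8.2 Red Lines QA Must Enforce
QA should reject the work if any of the following appears in the later implementation or supplementary documentation:
1. `published`, `active`, and `applied` are still conflated
2. `failed` is still used to mean "will be retried automatically later", without pending + retry metadata
3. Metrics are only defined and not instrumented in the real call chain
4. The runbook is a document only, with no script or endpoint support
5. pause/resume is missing, so "pause ramp-up" can only be done by stopping the whole service
6. E2E still tests only the happy path and not the retryable / final failed paths
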
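## 9. Final Conclusion

Current conclusion: ready for QA design review.

Notes:
- This is a "design is reviewable" conclusion, not a "current code is ready for launch" conclusion
- No further PM criteria are needed before implementation starts
- Once implementation starts, retry, runbook, observability, and QA evidence must be completed strictly according to the file paths and call chains in this document
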
## 10. Absolute Path of This Document

`/home/long/project/supply-intelligence/tech/TECHLEAD_GATEWAY_CLOSURE_DESIGN_2026-05-08.md`