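# Plugin Closed-Loop Implementation Plan (2026-05-28)

Date: 2026-05-28

## Goals

Move the plugin's next stage of evolution from "conceptual design" to "executable implementation" by making the following explicit:

- what the current database is
- which storage structure we keep using going forward
- how each functional domain forms a closed loop
- where smart-routing logs must be persisted
- which runtime state belongs in Redis and which must be written to the plugin database

This document further refines the following documents:

- [PLUGIN_REQUIREMENTS_OVERVIEW_2026-05-28.md](/home/long/project/sub2api-cn-relay-manager/docs/PLUGIN_REQUIREMENTS_OVERVIEW_2026-05-28.md:1)
- [PLUGIN_ROUTE_STICKY_DESIGN.md](/home/long/project/sub2api-cn-relay-manager/docs/PLUGIN_ROUTE_STICKY_DESIGN.md:1)
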
## 1. Current Database Findings

### Which database the plugin uses today

The plugin's primary database today is **SQLite**.

Direct evidence:

- Configuration item:
  - [internal/config/config.go](/home/long/project/sub2api-cn-relay-manager/internal/config/config.go:12)
  - Environment variable `SUB2API_CRM_SQLITE_DSN`
- Default configuration:
  - [.env.example](/home/long/project/sub2api-cn-relay-manager/.env.example:2)
  - `SUB2API_CRM_SQLITE_DSN=file:/data/sub2api-cn-relay-manager.db?_foreign_keys=on&_busy_timeout=5000`
- Database open and migration entry point:
  - [internal/store/sqlite/db.go](/home/long/project/sub2api-cn-relay-manager/internal/store/sqlite/db.go:35)

### What SQLite already stores

SQLite currently holds:

- host registration info
- pack/provider metadata
- provider drafts
- import batches / import items
- managed resources
- probe results
- access closures
- reconcile runs

The related tables come from:

- [0001_init.sql](/home/long/project/sub2api-cn-relay-manager/internal/store/migrations/0001_init.sql:1)
- [0002_operational_runtime.sql](/home/long/project/sub2api-cn-relay-manager/internal/store/migrations/0002_operational_runtime.sql:1)
- later migrations `0003` through `0009*`

### Limits of the current database

SQLite is a reasonable choice for now because:

- the project is currently centered on a single-instance control plane
- the state database is still small to medium in size
- the existing repos, migrations, and integration tests are all built around SQLite

One implementation fact must be stated explicitly:

- [internal/store/sqlite/db.go](/home/long/project/sub2api-cn-relay-manager/internal/store/sqlite/db.go:42) already restricts the connection pool to a single-writer-friendly mode:
  - `SetMaxOpenConns(1)`
  - `SetMaxIdleConns(1)`

This means:

- SQLite is well suited to persisting the current control-plane state
- smart-routing logs must not be written to the database synchronously at unbounded high frequency, because that would amplify write-lock contention

The plan going forward should therefore be:

- **SQLite remains the plugin's primary state database**
- **Redis serves as the smart-routing runtime cache**
- **Smart-routing logs are written back to SQLite as structured events**

---

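## 2. Storage Strategy Going Forward

## General principles

### 1. SQLite: stores the "source-of-truth state"

What goes into SQLite must be:

- configuration truth
- auditable truth
- structured records that need to be queried, replayed, or managed from the admin backend

### 2. Redis: stores "short-lived runtime state"

What goes into Redis should be:

- sticky route
- route cooldown
- short-term failure counts
- hot route-selection cache

### 3. Smart-routing logs: must ultimately land in SQLite

The user has explicitly required that:

- **smart-routing logs are stored inside the plugin**

So the logs cannot live only in Redis, and they cannot only be printed to stdout.

The correct approach is:

- on the request hot path, first write a structured in-memory event to an async queue
- a background writer flushes events to SQLite in batches
- in critical failure scenarios, a condensed event may be written synchronously as a fallback

---
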
## 3. Target Architecture

The plugin should form three storage tiers internally:

```text
SQLite
  - configuration truth
  - business truth
  - routing logs

Redis
  - sticky route
  - cooldown
  - short-lived routing cache

In-memory
  - per-request context
  - async log buffer queue
```

## SQLite is responsible for

- `logical_groups`
- `logical_group_routes`
- `logical_group_models`
- `route_shadow_groups`
- `route_decision_logs`
- `route_failover_events`
- `route_sticky_audit`
- extension tables for `provider account inventory`

## Redis is responsible for

- the current route sticky bindings
- route failure counts
- route cooldown
- an optional user-model route cache

---

## 4. Closed-Loop Plan per Functional Domain

Each functional domain below is written as a closed loop rather than a list of feature points.

## 4.1 Closed loop: adding a model

### Target loop

After an operator completes one "add model" action on the admin page, the following should hold:

1. the provider manifest is created or updated
2. the model is included in a `logical_group`
3. at least one route is configured
4. the shadow group for that route is defined
5. provider accounts can be imported directly afterwards

### Existing capabilities

- provider drafts
- publish to pack repo
- conflict check for duplicate models

### Technical details still to be added

Adding a model should not only produce a provider file; it should also persist the following data:

#### New table: `logical_groups`

```sql
CREATE TABLE logical_groups (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  logical_group_id TEXT NOT NULL UNIQUE,
  display_name TEXT NOT NULL,
  status TEXT NOT NULL,
  description TEXT NOT NULL DEFAULT '',
  route_policy TEXT NOT NULL DEFAULT 'priority',
  sticky_mode TEXT NOT NULL DEFAULT 'conversation_preferred',
  conversation_ttl_seconds INTEGER NOT NULL DEFAULT 7200,
  user_model_ttl_seconds INTEGER NOT NULL DEFAULT 1800,
  failover_threshold INTEGER NOT NULL DEFAULT 2,
  cooldown_seconds INTEGER NOT NULL DEFAULT 600,
  created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
```

#### New table: `logical_group_models`

```sql
CREATE TABLE logical_group_models (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  logical_group_id TEXT NOT NULL,
  public_model TEXT NOT NULL,
  status TEXT NOT NULL DEFAULT 'active',
  created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
  UNIQUE (logical_group_id, public_model)
);
```

### Closed-loop APIs

Suggested additions:

- `POST /api/logical-groups`
- `POST /api/logical-groups/{group_id}/models`
- `GET /api/logical-groups`
- `GET /api/logical-groups/{group_id}`

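As a rough illustration of what a `POST /api/logical-groups` handler might do before inserting a row, the sketch below normalizes a request against the defaults declared in the `logical_groups` DDL. The `LogicalGroup` struct, the `NormalizeLogicalGroup` function, the required-field rule, and the `active` default for `status` are all assumptions made for this example; the source only fixes the column names and their SQL defaults.

```go
package main

import (
	"errors"
	"strings"
)

// LogicalGroup mirrors the configurable columns of the proposed logical_groups
// table. The struct itself is hypothetical.
type LogicalGroup struct {
	LogicalGroupID         string
	DisplayName            string
	Status                 string
	RoutePolicy            string
	StickyMode             string
	ConversationTTLSeconds int
	UserModelTTLSeconds    int
	FailoverThreshold      int
	CooldownSeconds        int
}

// NormalizeLogicalGroup trims identifiers, rejects rows that miss required
// fields, and fills unset fields with the defaults from the DDL.
func NormalizeLogicalGroup(g LogicalGroup) (LogicalGroup, error) {
	g.LogicalGroupID = strings.TrimSpace(g.LogicalGroupID)
	g.DisplayName = strings.TrimSpace(g.DisplayName)
	if g.LogicalGroupID == "" {
		return LogicalGroup{}, errors.New("logical_group_id is required")
	}
	if g.DisplayName == "" {
		return LogicalGroup{}, errors.New("display_name is required")
	}
	if g.Status == "" {
		// Assumption: the DDL has no default for status; "active" is borrowed
		// from logical_group_models.
		g.Status = "active"
	}
	if g.RoutePolicy == "" {
		g.RoutePolicy = "priority"
	}
	if g.StickyMode == "" {
		g.StickyMode = "conversation_preferred"
	}
	if g.ConversationTTLSeconds <= 0 {
		g.ConversationTTLSeconds = 7200
	}
	if g.UserModelTTLSeconds <= 0 {
		g.UserModelTTLSeconds = 1800
	}
	if g.FailoverThreshold <= 0 {
		g.FailoverThreshold = 2
	}
	if g.CooldownSeconds <= 0 {
		g.CooldownSeconds = 600
	}
	return g, nil
}
```

Applying the defaults in Go as well as in SQL keeps the API response consistent with what is eventually stored, even when the handler returns the object before re-reading it from the database.
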
### Acceptance criteria

- After a model is added, more than the pack file has changed
- The plugin database also shows:
  - which logical group the model belongs to
  - which routes support it

---

## 4.2 Closed loop: maintaining logical groups

### Target loop

When an operator maintains a logical group, they should be able to:

1. create a logical group
2. bind a set of public models
3. bind multiple routes
4. bind a shadow group to each route
5. see on the admin page the structure that actually backs the logical group

### New table: `logical_group_routes`

```sql
CREATE TABLE logical_group_routes (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  route_id TEXT NOT NULL UNIQUE,
  logical_group_id TEXT NOT NULL,
  name TEXT NOT NULL,
  status TEXT NOT NULL,
  priority INTEGER NOT NULL,
  weight INTEGER NOT NULL DEFAULT 100,
  shadow_group_id TEXT NOT NULL,
  shadow_host_id TEXT NOT NULL,
  upstream_base_url_hint TEXT NOT NULL DEFAULT '',
  cooldown_until TEXT NOT NULL DEFAULT '',
  created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
```

### New table: `logical_group_route_models`

```sql
CREATE TABLE logical_group_route_models (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  route_id TEXT NOT NULL,
  public_model TEXT NOT NULL,
  shadow_model TEXT NOT NULL DEFAULT '',
  status TEXT NOT NULL DEFAULT 'active',
  created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
  UNIQUE (route_id, public_model)
);
```

### Closed-loop APIs

- `POST /api/logical-groups/{group_id}/routes`
- `PUT /api/logical-groups/{group_id}/routes/{route_id}`
- `GET /api/logical-groups/{group_id}/routes`
- `POST /api/logical-groups/{group_id}/routes/{route_id}/models`

### Acceptance criteria

- Opening a logical group shows the complete picture:
  - public models
  - route list
  - the shadow group behind each route
  - the current status of each route

---

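## 4.3 Closed loop: smart routing

This is the core implementation that follows from the new requirements.

### Target loop

Once a user request enters the plugin, it should go through a complete loop:

1. identify the logical group
2. identify the public model
3. compute the sticky key
4. select a route
5. record the route decision log
6. forward to the corresponding shadow group
7. after the result arrives, update sticky / fail count / cooldown
8. record the final routing result log
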
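Step 4 of the loop above can be sketched as a pure function. This is only an illustration under stated assumptions: the `Route` and `Decision` types, the `SelectRoute` name, the `active` status value, and the rule that a lower `priority` number wins are all hypothetical; the source only says that the default `route_policy` is `priority` and that routes can be in cooldown.

```go
package main

import (
	"errors"
	"sort"
)

// Route carries the subset of logical_group_routes columns used for selection.
type Route struct {
	RouteID       string
	Priority      int
	Status        string
	ShadowGroupID string
}

// Decision is what the resolver would hand to the decision logger.
type Decision struct {
	Route        Route
	StickyHit    bool
	FallbackUsed bool
}

// ErrNoRoute is returned when every route is inactive or cooling down.
var ErrNoRoute = errors.New("no eligible route")

// SelectRoute prefers the sticky route when it is still eligible; otherwise it
// picks the eligible route with the lowest priority number (assumed to mean
// "most preferred").
func SelectRoute(routes []Route, stickyRouteID string, cooling map[string]bool) (Decision, error) {
	eligible := make([]Route, 0, len(routes))
	for _, r := range routes {
		if r.Status == "active" && !cooling[r.RouteID] {
			eligible = append(eligible, r)
		}
	}
	if len(eligible) == 0 {
		return Decision{}, ErrNoRoute
	}
	sort.SliceStable(eligible, func(i, j int) bool {
		return eligible[i].Priority < eligible[j].Priority
	})
	if stickyRouteID != "" {
		for _, r := range eligible {
			if r.RouteID == stickyRouteID {
				return Decision{Route: r, StickyHit: true}, nil
			}
		}
	}
	// A sticky binding existed but its route is unavailable: mark as fallback.
	return Decision{Route: eligible[0], FallbackUsed: stickyRouteID != ""}, nil
}
```

Keeping selection free of I/O makes it easy to unit test; the Redis lookups for the sticky binding and the cooldown set would happen before the call, and logging after it.
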
### Runtime state storage

#### Redis key: sticky

```text
lg:{logical_group_id}:m:{public_model}:conv:{conversation_id}
lg:{logical_group_id}:m:{public_model}:sess:{session_id}
lg:{logical_group_id}:m:{public_model}:user:{user_id}
```

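The three key shapes above could be generated by a small helper such as the one below. The `StickyKey` type, the `StickyKeys` function, and the lookup order (conversation first, then session, then user) are assumptions for illustration; the order is inferred from the `conversation_preferred` default of `sticky_mode` and is not stated by the source.

```go
package main

import "fmt"

// StickyKey pairs a Redis key with the kind of identifier it was built from,
// which maps naturally onto the sticky_key / sticky_key_type log columns.
type StickyKey struct {
	Key  string
	Type string // "conversation", "session", or "user"
}

// StickyKeys returns the candidate sticky keys for a request, most specific
// first. Identifiers that are empty are skipped.
func StickyKeys(groupID, model, conversationID, sessionID, userID string) []StickyKey {
	prefix := fmt.Sprintf("lg:%s:m:%s", groupID, model)
	var keys []StickyKey
	if conversationID != "" {
		keys = append(keys, StickyKey{Key: prefix + ":conv:" + conversationID, Type: "conversation"})
	}
	if sessionID != "" {
		keys = append(keys, StickyKey{Key: prefix + ":sess:" + sessionID, Type: "session"})
	}
	if userID != "" {
		keys = append(keys, StickyKey{Key: prefix + ":user:" + userID, Type: "user"})
	}
	return keys
}
```

A resolver would try the keys in order and stop at the first hit, so a conversation-level binding overrides a user-level one.
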
#### Redis key: route failure

```text
routefail:{route_id}
```

#### Redis key: route cooldown

```text
routecool:{route_id}
```

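How the failure counter and the cooldown key interact can be sketched with an in-memory stand-in for Redis. Everything here is hypothetical (the `RouteHealth` type, its methods, and the choice to reset the counter on success or when the threshold is reached); only the `failover_threshold` and `cooldown_seconds` settings come from the `logical_groups` table.

```go
package main

import "sync"

// RouteHealth is an in-memory stand-in for the routefail:{route_id} and
// routecool:{route_id} keys. Times are Unix seconds to match cooldown_seconds.
type RouteHealth struct {
	mu        sync.Mutex
	threshold int
	cooldown  int64
	failures  map[string]int
	coolUntil map[string]int64
}

// NewRouteHealth takes failover_threshold and cooldown_seconds.
func NewRouteHealth(threshold int, cooldownSeconds int64) *RouteHealth {
	return &RouteHealth{
		threshold: threshold,
		cooldown:  cooldownSeconds,
		failures:  make(map[string]int),
		coolUntil: make(map[string]int64),
	}
}

// RecordFailure counts one retryable failure and reports whether the route
// has just entered cooldown.
func (h *RouteHealth) RecordFailure(routeID string, nowUnix int64) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.failures[routeID]++
	if h.failures[routeID] < h.threshold {
		return false
	}
	delete(h.failures, routeID) // start counting afresh after the cooldown
	h.coolUntil[routeID] = nowUnix + h.cooldown
	return true
}

// RecordSuccess clears the failure counter (an assumed policy).
func (h *RouteHealth) RecordSuccess(routeID string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	delete(h.failures, routeID)
}

// Cooling reports whether the route is still inside its cooldown window.
func (h *RouteHealth) Cooling(routeID string, nowUnix int64) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	until, ok := h.coolUntil[routeID]
	if !ok {
		return false
	}
	if nowUnix >= until {
		delete(h.coolUntil, routeID)
		return false
	}
	return true
}
```

With Redis, the same behavior would typically rely on key expiry rather than explicit deletes, so that a crashed plugin instance cannot leave a route cooling forever.
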
### Routing logs must be written to SQLite

#### New table: `route_decision_logs`

```sql
CREATE TABLE route_decision_logs (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  request_id TEXT NOT NULL,
  logical_group_id TEXT NOT NULL,
  public_model TEXT NOT NULL,
  user_key TEXT NOT NULL DEFAULT '',
  conversation_key TEXT NOT NULL DEFAULT '',
  sticky_key TEXT NOT NULL DEFAULT '',
  sticky_key_type TEXT NOT NULL DEFAULT '',
  sticky_hit INTEGER NOT NULL DEFAULT 0,
  selected_route_id TEXT NOT NULL,
  selected_shadow_group_id TEXT NOT NULL,
  fallback_used INTEGER NOT NULL DEFAULT 0,
  error_class TEXT NOT NULL DEFAULT '',
  upstream_status INTEGER NOT NULL DEFAULT 0,
  latency_ms INTEGER NOT NULL DEFAULT 0,
  created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
```

#### New table: `route_failover_events`

```sql
CREATE TABLE route_failover_events (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  request_id TEXT NOT NULL,
  logical_group_id TEXT NOT NULL,
  public_model TEXT NOT NULL,
  from_route_id TEXT NOT NULL,
  to_route_id TEXT NOT NULL,
  reason TEXT NOT NULL,
  failure_count INTEGER NOT NULL DEFAULT 0,
  created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
```

#### New table: `route_sticky_audit`

```sql
CREATE TABLE route_sticky_audit (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  sticky_key TEXT NOT NULL,
  sticky_key_type TEXT NOT NULL,
  logical_group_id TEXT NOT NULL,
  public_model TEXT NOT NULL,
  route_id TEXT NOT NULL,
  action TEXT NOT NULL,
  expires_at TEXT NOT NULL DEFAULT '',
  created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
```

### Why the logs must go into SQLite

Because the following queries must be supported later:

- "why did this request go through codex2api instead of asxs"
- route hit distribution for a logical group over the last 24h
- whether fallback happens frequently
- sticky hit rate

All of these depend on structured plugin-side logs; host-side logs cannot replace them.

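### Log write strategy

SQLite allows only one writer at a time, so each request must not write a large amount of log data synchronously and directly.

Recommended approach:

1. the routing hot path first produces a `RouteDecisionEvent`
2. the event is written to an in-memory channel/buffer
3. a background writer flushes to SQLite every 100ms or every 100 events
4. if the writer's buffer is full:
   - as a last resort, synchronously write one condensed failure record
   - or at least emit a warning on stderr
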
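The four steps above can be sketched as a small batching writer. This is an illustration, not the plugin's implementation: the `BatchWriter` type, the trimmed-down `RouteDecisionEvent` fields, and the `flush` callback (which stands in for a batched SQLite `INSERT` inside one transaction) are assumptions.

```go
package main

import (
	"sync"
	"time"
)

// RouteDecisionEvent is reduced to two fields here; the real event would
// carry the route_decision_logs columns.
type RouteDecisionEvent struct {
	RequestID       string
	SelectedRouteID string
}

// BatchWriter buffers events in a channel and hands them to flush in batches.
type BatchWriter struct {
	events chan RouteDecisionEvent
	flush  func([]RouteDecisionEvent)
	done   chan struct{}
	once   sync.Once
}

// NewBatchWriter starts the background goroutine. A batch is flushed when it
// reaches batchSize or when interval elapses, whichever comes first.
func NewBatchWriter(capacity, batchSize int, interval time.Duration, flush func([]RouteDecisionEvent)) *BatchWriter {
	w := &BatchWriter{
		events: make(chan RouteDecisionEvent, capacity),
		flush:  flush,
		done:   make(chan struct{}),
	}
	go w.run(batchSize, interval)
	return w
}

// Enqueue never blocks the hot path. It returns false when the buffer is
// full so the caller can apply the fallback (condensed sync write or stderr).
func (w *BatchWriter) Enqueue(e RouteDecisionEvent) bool {
	select {
	case w.events <- e:
		return true
	default:
		return false
	}
}

func (w *BatchWriter) run(batchSize int, interval time.Duration) {
	defer close(w.done)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	batch := make([]RouteDecisionEvent, 0, batchSize)
	emit := func() {
		if len(batch) > 0 {
			w.flush(batch)
			batch = make([]RouteDecisionEvent, 0, batchSize)
		}
	}
	for {
		select {
		case e, ok := <-w.events:
			if !ok { // channel closed: flush the remainder and stop
				emit()
				return
			}
			batch = append(batch, e)
			if len(batch) >= batchSize {
				emit()
			}
		case <-ticker.C:
			emit()
		}
	}
}

// Close flushes pending events and waits for the goroutine to exit.
// Callers must stop calling Enqueue before calling Close.
func (w *BatchWriter) Close() {
	w.once.Do(func() { close(w.events) })
	<-w.done
}
```

With the numbers from the list above this would be constructed as `NewBatchWriter(capacity, 100, 100*time.Millisecond, flush)`; the channel capacity is a tuning knob the source does not specify.
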
### Suggested routing service interfaces

Suggested additions:

- `RouteResolver`
- `StickyStore`
- `RouteDecisionLogger`

```go
// RouteResolver picks a route for one incoming request.
type RouteResolver interface {
	Resolve(ctx context.Context, req RouteRequest) (RouteDecision, error)
}

// StickyStore persists short-lived sticky bindings (backed by Redis).
type StickyStore interface {
	Get(ctx context.Context, key string) (StickyBinding, bool, error)
	Set(ctx context.Context, key string, binding StickyBinding, ttl time.Duration) error
	Delete(ctx context.Context, key string) error
}

// RouteDecisionLogger records structured routing events (ultimately in SQLite).
type RouteDecisionLogger interface {
	Append(ctx context.Context, event RouteDecisionEvent) error
}
```

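To show that the `StickyStore` contract is small enough to fake in tests, here is a minimal in-memory implementation. It is a sketch under assumptions: the fields of `StickyBinding` and the `MemoryStickyStore` type are invented for this example, and the production store is expected to be Redis-backed, which this code does not attempt to model beyond TTL expiry.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// StickyBinding fields are hypothetical; the source does not define them.
type StickyBinding struct {
	RouteID       string
	ShadowGroupID string
}

// StickyStore repeats the interface proposed in this document.
type StickyStore interface {
	Get(ctx context.Context, key string) (StickyBinding, bool, error)
	Set(ctx context.Context, key string, binding StickyBinding, ttl time.Duration) error
	Delete(ctx context.Context, key string) error
}

type memoryEntry struct {
	binding   StickyBinding
	expiresAt time.Time
}

// MemoryStickyStore is a test double that expires entries lazily on read.
type MemoryStickyStore struct {
	mu      sync.Mutex
	entries map[string]memoryEntry
}

// Compile-time check that the fake satisfies the interface.
var _ StickyStore = (*MemoryStickyStore)(nil)

func NewMemoryStickyStore() *MemoryStickyStore {
	return &MemoryStickyStore{entries: make(map[string]memoryEntry)}
}

// Get returns the binding and true on a live hit, or false on a miss or
// an expired entry.
func (s *MemoryStickyStore) Get(_ context.Context, key string) (StickyBinding, bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.entries[key]
	if !ok {
		return StickyBinding{}, false, nil
	}
	if !time.Now().Before(e.expiresAt) {
		delete(s.entries, key)
		return StickyBinding{}, false, nil
	}
	return e.binding, true, nil
}

// Set stores the binding with an absolute expiry of now + ttl.
func (s *MemoryStickyStore) Set(_ context.Context, key string, binding StickyBinding, ttl time.Duration) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.entries[key] = memoryEntry{binding: binding, expiresAt: time.Now().Add(ttl)}
	return nil
}

// Delete removes the binding; deleting a missing key is not an error.
func (s *MemoryStickyStore) Delete(_ context.Context, key string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.entries, key)
	return nil
}
```

A fake like this lets `RouteResolver` tests run without a Redis instance, while the interface keeps the resolver unaware of which backend is in use.
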
### Acceptance criteria

- the same conversation preferentially hits the same route
- retryable failures trigger fallback
- every fallback is logged
- every sticky hit is logged
- all routing decisions can be queried from the plugin database

---

## 4.4 Closed loop: provider account import and enable/disable

### Target loop

After an administrator performs one operation on a provider account, the following should hold:

1. pre-check
2. import or reuse
3. bind to provider / route / shadow group
4. record the account inventory status
5. the account can later be enabled, disabled, or retired

### Current problem

What exists today is closer to an "import task system" than an "account asset system".

### Suggested new table: `provider_accounts`

```sql
CREATE TABLE provider_accounts (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  host_id TEXT NOT NULL,
  provider_id TEXT NOT NULL,
  route_id TEXT NOT NULL DEFAULT '',
  shadow_group_id TEXT NOT NULL DEFAULT '',
  host_account_id TEXT NOT NULL,
  key_fingerprint TEXT NOT NULL,
  account_name TEXT NOT NULL DEFAULT '',
  account_status TEXT NOT NULL,
  last_probe_status TEXT NOT NULL DEFAULT '',
  last_probe_at TEXT NOT NULL DEFAULT '',
  disabled_reason TEXT NOT NULL DEFAULT '',
  created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
  UNIQUE (host_id, host_account_id)
);
```

### Closed-loop APIs

- `GET /api/provider-accounts`
- `POST /api/provider-accounts/{id}/disable`
- `POST /api/provider-accounts/{id}/enable`
- `POST /api/provider-accounts/{id}/retire`

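The three mutating endpoints above imply a small state machine over `account_status`. The sketch below is one possible reading: the status values (`active`, `disabled`, `retired`), the `ApplyAction` function, and the rule that `retired` is terminal are assumptions, since the source does not enumerate the allowed values or transitions.

```go
package main

import "fmt"

// AccountStatus holds hypothetical values for provider_accounts.account_status.
type AccountStatus string

const (
	StatusActive   AccountStatus = "active"
	StatusDisabled AccountStatus = "disabled"
	StatusRetired  AccountStatus = "retired"
)

// ApplyAction maps an API action ("disable", "enable", "retire") onto a status
// transition and rejects transitions that make no sense, such as re-enabling
// a retired account.
func ApplyAction(current AccountStatus, action string) (AccountStatus, error) {
	switch {
	case action == "disable" && current == StatusActive:
		return StatusDisabled, nil
	case action == "enable" && current == StatusDisabled:
		return StatusActive, nil
	case action == "retire" && current != StatusRetired:
		return StatusRetired, nil
	}
	return current, fmt.Errorf("action %q is not allowed from status %q", action, current)
}
```

Centralizing the transition rules in one function gives a single place to write the audit record each time a status actually changes.
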
### Acceptance criteria

- the account inventory is visible
- it is visible which route / shadow group each account belongs to
- accounts can be enabled and disabled manually
- status changes leave a traceable log

---

## 4.5 Closed loop: end-user frontend

### Target loop

End users see logical groups, not the host's real groups.

Once a user enters the portal, the following should hold:

1. they see the list of logical groups
2. they see the models available in each group
3. the key they request or use maps to a logical group
4. the backend actually routes to some shadow group
5. they never need to know the host's real group names

### Current state

`/portal/` already exists, but it is still oriented around host groups.

### APIs to build

Suggested new aggregate APIs for the portal:

- `GET /api/portal/logical-groups`
- `GET /api/portal/logical-groups/{group_id}`
- `GET /api/portal/logical-groups/{group_id}/models`

### Acceptance criteria

- the user frontend no longer directly exposes shadow groups such as `gpt-shared__asxs`
- users only see `GPT Shared`
- the experience feels like one product, not an assembly of several hosts

---

## 5. Implementation Order

To make sure every step closes its loop, proceed in the order below.

## Phase 1: data model loop

Goal:

- first teach the plugin database what `logical_group / route / shadow_group` are

Scope:

- SQLite migrations
- sqlite repos
- basic CRUD APIs

Output:

- logical group configuration can be persisted
- route configuration can be persisted
- after each closed-loop feature is finished, it must be committed, pushed, deployed to `remote43`, and verified on the server before the execution board is updated

## Phase 2: minimal smart-routing loop

Goal:

- make requests actually flow through logical group -> route -> shadow group

Scope:

- Redis sticky
- route resolver
- route forwarding
- SQLite route logs

Output:

- a single logical group backed by `asxs + codex2api` works end to end

## Phase 3: account asset loop

Goal:

- upgrade the "import task" into "account asset management"

Scope:

- provider_accounts view
- enable/disable/retire
- display of route / shadow group ownership

Output:

- the account pool can be operated

## Phase 4: end-user product loop

Goal:

- switch the user frontend completely to the logical-group perspective

Scope:

- portal aggregate APIs
- portal frontend rework

Output:

- users only see logical groups

---

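## 6. Summary of Technical Decisions

### Current database

- **SQLite**

### Primary state database going forward

- **Keep using SQLite**

### Smart-routing runtime state

- **Redis**

### Smart-routing logs

- **Must ultimately land in SQLite**

### Why not switch to PostgreSQL from the start

Switching to PostgreSQL right now is not recommended because:

- the existing repos, migrations, and integration tests are all built around SQLite
- the main bottleneck today is not query capability but the product structure not yet being a closed loop
- getting the loops working matters more than changing the database first

One realistic caveat should be kept in mind:

- if route log volume rises significantly later
- or a multi-instance control plane appears
- or complex aggregate analysis is needed

then re-evaluate:

- SQLite -> PostgreSQL
- Redis stays unchanged

## One-sentence conclusion

The plugin's current database is **SQLite**; the recommendation is to keep **SQLite as the primary state database** and let **Redis carry the smart-routing runtime cache**.
At the same time, **smart-routing logs must be written back to the plugin's SQLite in structured form**; they cannot live only in Redis or stdout.
The actual closed-loop implementation order should be:

1. first add the `logical_group / route / shadow_group` data model
2. then build the minimal loop for smart routing placed in front of the host
3. then build provider account asset management
4. finally switch the end-user frontend to the logical-group perspective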