# LLM Intelligence Hub - Operations Runbook

> Version: v1.1
> Date: 2026-05-14
> Applies to: Phase 3 / Phase 5

Related documents:

- `docs/PRODUCTION_CHECKLIST.md`: pre-release gates, release steps, and rollback procedure
- `docs/CONFIGURATION.md`: environment variables and artifact path conventions
- `docs/API_REFERENCE.md`: health check and read-only endpoint reference

---

## Starting and Stopping Services

### Start all services
```bash
docker-compose up -d
```

### Stop services
```bash
docker-compose down
```

### View logs
```bash
docker-compose logs -f app
docker-compose logs -f db
```

---

## Daily Inspection

### Database health
```bash
# Count active (not soft-deleted) models
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM models WHERE deleted_at IS NULL"
# Five most recent collector runs
psql "$DATABASE_URL" -c "SELECT source, success, created_at FROM collector_stats ORDER BY created_at DESC LIMIT 5"
# Five most recently updated daily reports, with their run semantics
psql "$DATABASE_URL" -c "SELECT report_date, run_kind, trigger_source, is_official_daily, status FROM daily_report ORDER BY updated_at DESC LIMIT 5"
# Five most recent report runs
psql "$DATABASE_URL" -c "SELECT report_date, run_kind, trigger_source, is_official_daily, status FROM report_runs ORDER BY report_date DESC, created_at DESC LIMIT 5"
```

### Daily report check
```bash
ls -la reports/daily/daily_report_$(date +%Y-%m-%d).md
ls -la reports/daily/html/daily_report_$(date +%Y-%m-%d).html
ls -la reports/daily/$(date +%Y)/$(date +%m)/daily_report_$(date +%Y-%m-%d).md
```

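The three paths above can also be checked in one pass for any date. The following is a minimal sketch, not part of the project scripts: `check_daily_report` and the `REPORT_ROOT` override are hypothetical names, and the directory layout is assumed to match the `ls` commands above.

```bash
# Sketch: report which of the expected daily report artifacts exist for a date.
# check_daily_report and REPORT_ROOT are hypothetical, for illustration only.
check_daily_report() {
  local day="$1"                          # expected format: YYYY-MM-DD
  local root="${REPORT_ROOT:-reports/daily}"
  local year="${day%%-*}"                 # YYYY
  local rest="${day#*-}"                  # MM-DD
  local month="${rest%%-*}"               # MM
  local missing=0
  local path
  for path in \
    "$root/daily_report_${day}.md" \
    "$root/html/daily_report_${day}.html" \
    "$root/$year/$month/daily_report_${day}.md"
  do
    # -s is true only for files that exist and are non-empty
    if [ -s "$path" ]; then
      echo "OK      $path"
    else
      echo "MISSING $path"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_daily_report "$(date +%Y-%m-%d)"` prints one line per artifact and returns non-zero if any file is missing or empty.
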
### Disk space
```bash
df -h /var/lib/postgresql
df -h /tmp
```

---

## Troubleshooting

### Collector failures
1. Check the API key: `echo $OPENROUTER_API_KEY`
2. Check network connectivity: `curl https://openrouter.ai/api/v1/models`
3. Check the logs: `tail /tmp/llm_hub_daily_*.log`

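Echoing the key in step 1 writes the secret to the terminal and possibly to shell history or session logs. As an alternative, the sketch below only reports whether the variable is set and how long it is. `require_env` is a hypothetical helper name, and it assumes the variable is exported so that `printenv` can see it.

```bash
# Sketch: confirm an environment variable is set without printing its value.
# require_env is a hypothetical helper, not part of the project scripts.
require_env() {
  local name="$1"
  local value
  value="$(printenv "$name" || true)"    # empty if unset or not exported
  if [ -z "$value" ]; then
    echo "$name is NOT set" >&2
    return 1
  fi
  echo "$name is set (${#value} characters)"
}
```

For example, `require_env OPENROUTER_API_KEY` confirms the key is present without revealing it.
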
### Database connection failures
1. Check PostgreSQL status: `pg_isready`
2. Check the connection string: `echo $DATABASE_URL`
3. Check role permissions: `psql "$DATABASE_URL" -c "\du"`

### Daily report not generated
1. Check cron: `crontab -l | grep llm-intelligence`
2. Run manually: `bash scripts/run_daily.sh`
3. Check for a fallback (degraded) report: `ls reports/daily/*.md | tail -1`
4. For a historical backfill, use `REPORT_RUN_KIND=historical_rebuild` and `REPORT_TRIGGER_SOURCE=rebuild_script`, and do not read the result as an official scheduled output

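### Official daily reports vs. historical rebuilds
- Official scheduled output is generated by `scripts/run_daily.sh`, with `is_official_daily=true`
- Real reruns are handled by `scripts/run_real_pipeline.sh`, typically to manually verify real collection, real database writes, and report generation
- Historical rebuilds are run via `scripts/rebuild_historical_report.sh <date>`; the run semantics should stay `run_kind=historical_rebuild`
- The frontend endpoint `/api/v1/reports/latest` reads only official daily reports by default and does not treat a historical rebuild as the latest official output
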
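A historical rebuild invocation might be assembled as in the sketch below. Several points are assumptions rather than documented behavior: that `<date>` uses the `YYYY-MM-DD` format seen in the report filenames, and that the two environment variables are passed explicitly (the rebuild script may already set them itself). `is_iso_date` and `rebuild_command` are hypothetical helper names.

```bash
# Sketch: validate the date and print the rebuild command for review.
# is_iso_date and rebuild_command are hypothetical helpers.
is_iso_date() {
  # Shape check only (digits and dashes); does not verify calendar validity.
  case "$1" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

rebuild_command() {
  local day="$1"
  is_iso_date "$day" || { echo "invalid date: $day" >&2; return 1; }
  # Printed rather than executed, so the operator can inspect it first.
  echo "REPORT_RUN_KIND=historical_rebuild REPORT_TRIGGER_SOURCE=rebuild_script bash scripts/rebuild_historical_report.sh $day"
}
```

Printing the command instead of running it keeps the sketch side-effect free; the operator decides whether to execute the output.
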
### Frontend unreachable
1. Check Nginx: `docker-compose ps nginx`
2. Check the build output: `ls frontend/dist/`
3. Check the port: `netstat -tlnp | grep 80`

---

## Backup and Restore

### Manual backup
```bash
bash scripts/backup.sh
```

### Manual restore
```bash
gunzip < backup_file.sql.gz | psql "$DATABASE_URL"
```

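Because a truncated or corrupt archive can leave the database partially restored, it may be worth testing the file before piping it into `psql`. The sketch below uses `gzip -t` for that; `verify_backup` is a hypothetical helper name, not something `scripts/backup.sh` is known to provide.

```bash
# Sketch: check that a gzip backup exists, is non-empty, and decompresses cleanly.
# verify_backup is a hypothetical helper, not part of the project scripts.
verify_backup() {
  local file="$1"
  [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
  # gzip -t tests archive integrity without writing any output
  gzip -t "$file" 2>/dev/null || { echo "corrupt gzip: $file" >&2; return 1; }
  echo "ok: $file"
}
```

A restore could then be gated on the check: `verify_backup backup_file.sql.gz && gunzip < backup_file.sql.gz | psql "$DATABASE_URL"`.
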
### Scheduled backup (cron)
```bash
0 2 * * * cd /path/to/llm-intelligence && bash scripts/backup.sh >> /tmp/backup.log 2>&1
```

---

## Monitoring Metrics

| Metric | Alert threshold | Check command |
|------|----------|----------|
| Model count | < 300 | `SELECT COUNT(*) FROM models` |
| Collection success rate | < 95% | `SELECT success_rate FROM collector_stats` |
| Database connection | Failure | `pg_isready` |
| Disk usage | > 80% | `df -h` |

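The disk threshold in the table can be turned into a scriptable check. This is a minimal sketch under stated assumptions: `disk_usage_pct` and `check_disk` are hypothetical helper names, the 80% default comes from the table above, and `df -P` is used instead of `df -h` because its POSIX output format is easier to parse.

```bash
# Sketch: alert when a filesystem's usage exceeds a percentage limit.
# disk_usage_pct and check_disk are hypothetical helpers.
disk_usage_pct() {
  # With df -P, line 2 column 5 is the usage figure, e.g. "42%"
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

check_disk() {
  local path="$1"
  local limit="${2:-80}"
  local pct
  pct="$(disk_usage_pct "$path")"
  [ -n "$pct" ] || { echo "cannot read usage for $path" >&2; return 2; }
  if [ "$pct" -gt "$limit" ]; then
    echo "ALERT $path at ${pct}% (limit ${limit}%)"
    return 1
  fi
  echo "OK $path at ${pct}%"
}
```

For example, `check_disk /var/lib/postgresql` returns non-zero once usage exceeds 80%, which makes it usable from cron or a monitoring wrapper.
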
## Run Audit

Official daily reports and historical rebuilds now record run-semantics fields. Check these fields first when troubleshooting:

- `run_kind`: `scheduled` / `historical_rebuild` / `manual`
- `trigger_source`: `cron` / `rebuild_script` / `pipeline`
- `is_official_daily`: whether the run is that day's official scheduled output
- `summary_md`: real-run audit prefix followed by the report summary

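To inspect these fields for a single date, a query along the following lines may help. It is a sketch only: `audit_query` is a hypothetical helper, and the table and column names are taken from the `report_runs` inspection query earlier in this runbook rather than from a verified schema.

```bash
# Sketch: build a query listing every run recorded for one report date.
# audit_query is a hypothetical helper, not part of the project scripts.
audit_query() {
  local day="$1"
  # Accept only the YYYY-MM-DD shape so the value is safe to embed in SQL
  case "$day" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) echo "invalid date: $day" >&2; return 1 ;;
  esac
  printf "SELECT run_kind, trigger_source, is_official_daily, status FROM report_runs WHERE report_date = '%s' ORDER BY created_at DESC" "$day"
}

# Usage (assumes DATABASE_URL is set):
#   psql "$DATABASE_URL" -c "$(audit_query 2026-05-14)"
```
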
---

## Scaling Guide

### Vertical scaling
Increase the memory and CPU available to PostgreSQL.

### Horizontal scaling
Use read/write splitting or sharding (Phase 2+).

---

## Contact

- Maintainer: 宰相
- Project path: /home/long/project/llm-intelligence