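# AI-Ops Single-Node Runbook

> Scope: development machines and a single production server. The goal is a setup that starts reliably and repeatably and supports health checks, backup, rollback, and failure recovery. This is not a multi-node high-availability design.
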
## 0. Prerequisites

Choose one container runtime:

- Docker + docker compose
- Podman + podman-compose

The host also needs:

- go 1.22+
- curl
- python3
- gzip / zcat

## 1. One-Command Start

```bash
cd /home/long/project/ai-ops
scripts/aiops-single-node.sh start
```

The script automatically:

1. Generates `.runtime/single-node.env`, containing a random JWT secret and metrics auth.
2. Generates `.runtime/config.single.yaml`, using production mode.
3. Builds the static binary `.runtime/ai-ops`.
4. Starts PostgreSQL, Redis, and the AI-Ops App.
5. Waits for `/actuator/health/ready` to report healthy.
6. Runs smoke checks: health, login, alerts, rules, channels, dashboard, openapi.

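
Step 5 is essentially a polling loop. The sketch below shows one way such a wait could look; it is not taken from `scripts/aiops-single-node.sh`. The function names `wait_until` and `probe_ready` and the timeout and interval values are assumptions made for illustration.

```bash
# Poll a command until it succeeds or the timeout (in seconds) expires.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command...>
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  while true; do
    # Success on the first zero exit status from the probe command.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Give up once the deadline has passed.
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Hypothetical readiness probe against the default App address.
probe_ready() {
  curl -fsS "http://127.0.0.1:${AI_OPS_APP_PORT:-18080}/actuator/health/ready"
}
```

A call such as `wait_until 60 2 probe_ready` would then return non-zero if the App has not become ready within a minute, which lets a caller stop before running the smoke checks.
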

Default listen addresses and ports:

| Service | Default listen address | Notes |
|---------|------------------------|-------|
| App | 127.0.0.1:18080 | Local access only by default; do not expose it directly to the public internet on a production machine |
| PostgreSQL | 127.0.0.1:15432 | Local access only by default |
| Redis | 127.0.0.1:16379 | Local access only by default |

Override them with environment variables:

```bash
AI_OPS_APP_PORT=28080 AI_OPS_DB_PORT=25432 AI_OPS_REDIS_PORT=26379 scripts/aiops-single-node.sh start
```

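
Overrides of this kind are commonly implemented with shell default-value expansion. The following is a minimal sketch of that pattern using the three variable names and default ports listed above; the helper name `resolve_ports` is hypothetical, and the real script may resolve its ports differently.

```bash
# Resolve each port from the environment, falling back to the documented default.
resolve_ports() {
  local app_port="${AI_OPS_APP_PORT:-18080}"
  local db_port="${AI_OPS_DB_PORT:-15432}"
  local redis_port="${AI_OPS_REDIS_PORT:-16379}"
  echo "app=${app_port} db=${db_port} redis=${redis_port}"
}
```

With no variables set, `resolve_ports` prints the defaults; prefixing the call with `AI_OPS_APP_PORT=28080` changes only the App port.
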

## 2. Routine Checks

```bash
scripts/aiops-single-node.sh status
scripts/aiops-single-node.sh smoke
scripts/aiops-single-node.sh logs
```

Direct access:

```bash
curl -fsS http://127.0.0.1:18080/health
curl -fsS http://127.0.0.1:18080/actuator/health/ready
curl -fsS http://127.0.0.1:18080/ops/dashboard
```

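## 3. Alerting Capabilities and Limits

The single-node build currently supports:

- Alert rule CRUD
- Scheduled evaluation by the rule engine
- Escalation from P2 to P1 after an alert has persisted for 2 hours
- Aggregation of alerts for the same resource within 1 minute
- Webhook notification delivery
- Notification logs persisted to the database
- Fallback to a backup channel after a failed delivery
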
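
To make the escalation rule concrete, the sketch below expresses "P2 persisting for 2 hours becomes P1" as a timestamp comparison. It only illustrates the rule: the actual logic lives in the application's rule engine, the function name `should_escalate` is hypothetical, and treating exactly 2 hours as already eligible is an assumption.

```bash
# Escalation threshold: 2 hours, expressed in seconds.
ESCALATE_AFTER_SECONDS=$(( 2 * 60 * 60 ))

# Return 0 when an alert qualifies for escalation to P1.
# Arguments: <level> <first_fired_epoch> <now_epoch>
should_escalate() {
  local level="$1" first_fired="$2" now="$3"
  # Only P2 alerts escalate, and only after the threshold has elapsed.
  [ "$level" = "P2" ] && [ $(( now - first_fired )) -ge "$ESCALATE_AFTER_SECONDS" ]
}
```

For example, `should_escalate P2 "$first_fired" "$(date +%s)"` succeeds only once the alert has been firing for at least 7200 seconds.
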

The following channels are placeholders for now and must not be promised as official on-call channels:

- email
- Feishu
- Wechat

For a stable single-node deployment, start with webhook and connect it to an existing alert gateway, an enterprise chat-bot forwarder, or a self-hosted relay.

## 4. Backup

```bash
scripts/aiops-single-node.sh backup
```

Backup files are written to:

```text
backups/ai_ops_YYYYMMDD-HHMMSS.sql.gz
```

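
A file name in this format can be produced with `date`, and the dump itself is typically a `pg_dump` piped through `gzip`. The sketch below assumes a Docker runtime; the container name `aiops-postgres`, the user and database name `ai_ops`, and the `PG_*` variables are placeholders, not values taken from the script.

```bash
# Build a timestamped path matching backups/ai_ops_YYYYMMDD-HHMMSS.sql.gz.
backup_path() {
  echo "backups/ai_ops_$(date +%Y%m%d-%H%M%S).sql.gz"
}

# Hypothetical dump step: container, user, and database names are placeholders.
dump_database() {
  local out tmp
  out="$(backup_path)"
  tmp="${out}.partial"
  mkdir -p backups
  # pipefail makes a pg_dump failure fail the pipeline instead of being
  # hidden by gzip; the subshell keeps the option from leaking to the caller.
  (
    set -o pipefail
    docker exec "${PG_CONTAINER:-aiops-postgres}" \
      pg_dump -U "${PG_USER:-ai_ops}" "${PG_DB:-ai_ops}" | gzip > "$tmp"
  ) && mv "$tmp" "$out" && echo "$out"
}
```

Writing to a `.partial` file first means an interrupted or failed dump never leaves a file that looks like a finished backup.
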

On a production server, run a backup at least once a day, for example with crontab:

```cron
30 2 * * * cd /home/long/project/ai-ops && scripts/aiops-single-node.sh backup >> backups/backup.log 2>&1
```

## 5. Rollback / Database Restore

Restore from a specific backup:

```bash
scripts/aiops-single-node.sh restore backups/ai_ops_YYYYMMDD-HHMMSS.sql.gz
```

The script:

1. Stops the app container to prevent writes during the restore.
2. Clears the PostgreSQL `public` schema so that existing tables, functions, or triggers do not cause the restore to fail.
3. Imports the backup with psql.
4. Starts the app.
5. Waits for readiness.
6. Runs the smoke checks automatically.

Note: restore has side effects. Before running it, confirm that the backup file is the correct one and, if necessary, make a copy of the current backup first.

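
One inexpensive pre-restore check is to confirm that the file exists, is non-empty, and passes the gzip integrity test. The helper below is a sketch of that check under the name `verify_backup`, which is hypothetical; it does not validate the SQL inside the archive.

```bash
# Return 0 if the backup file is present, non-empty, and a valid gzip archive.
verify_backup() {
  local file="$1"
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 1
  fi
  # gzip -t checks the compressed stream without writing any output.
  if ! gzip -t "$file" 2>/dev/null; then
    echo "corrupt gzip: $file" >&2
    return 1
  fi
}
```

A guarded call could look like `verify_backup "$f" && scripts/aiops-single-node.sh restore "$f"`, so that a damaged file never reaches the step that clears the schema.
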

## 6. Failure Recovery

After a container exits unexpectedly or the server reboots:

```bash
scripts/aiops-single-node.sh recover
```

The script brings PostgreSQL, Redis, and the App back up on the existing volumes, then runs the ready check and the smoke checks.

If the app is unhealthy but DB/Redis are fine:

```bash
scripts/aiops-single-node.sh restart
scripts/aiops-single-node.sh smoke
```

## 7. Stopping the Service

```bash
scripts/aiops-single-node.sh stop
```

This command keeps the volumes and does not delete any data.

## 8. Security Configuration

The script creates `.runtime/single-node.env` under `umask 077`, so only the owner can read it. The file contains:

- `AI_OPS_JWT_SECRET`
- `AI_OPS_METRICS_AUTH`
- The database password

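
The effect of `umask 077` can be seen in a small sketch: a file created under that mask gets mode 600. The helper names `random_hex` and `write_env_file` are hypothetical, the secret lengths are illustrative choices, and the database password is left out because its variable name is not given here.

```bash
# Print N random bytes as 2*N lowercase hex characters.
random_hex() {
  head -c "$1" /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Create an env file that only the owner can read or write.
write_env_file() {
  local path="$1"
  (
    # The mask applies only inside this subshell; new files get mode 600
    # and new directories get mode 700.
    umask 077
    mkdir -p "$(dirname "$path")"
    {
      echo "AI_OPS_JWT_SECRET=$(random_hex 32)"
      echo "AI_OPS_METRICS_AUTH=$(random_hex 16)"
    } > "$path"
  )
}
```

Note that the mask only affects newly created files: redirecting into a file that already exists keeps that file's current permissions.
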

Do not commit `.runtime/` or `backups/`. The repository's `.gitignore` already excludes both directories.

In production mode the application enforces the following checks:

- JWT secret of at least 32 characters
- metrics auth of at least 16 characters
- DB host/user/password/dbname are required
- port/pool/retention must be valid

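
The two length rules can also be checked from the shell before starting the service, which surfaces a bad secret earlier than the application's own validation would. This is a sketch of such a pre-flight check, not the application's validator; `check_min_len` and `preflight` are hypothetical names.

```bash
# Fail if a value is shorter than the required minimum length.
# Arguments: <name> <value> <min_length>
check_min_len() {
  local name="$1" value="$2" min="$3"
  if [ "${#value}" -lt "$min" ]; then
    echo "$name must be at least $min characters" >&2
    return 1
  fi
}

# Check both secrets and report every failure, not just the first.
preflight() {
  local rc=0
  check_min_len AI_OPS_JWT_SECRET "${AI_OPS_JWT_SECRET:-}" 32 || rc=1
  check_min_len AI_OPS_METRICS_AUTH "${AI_OPS_METRICS_AUTH:-}" 16 || rc=1
  return "$rc"
}
```

Running `preflight` after loading `.runtime/single-node.env` into the environment returns non-zero if either secret is too short.
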

## 9. Single-Node Gate

Before going live, run at least:

```bash
go vet ./...
go test -race -buildvcs=false ./...
scripts/aiops-single-node.sh doctor
scripts/aiops-single-node.sh start
scripts/aiops-single-node.sh backup
scripts/aiops-single-node.sh recover
```

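
When these commands are run by hand it is easy to carry on past a failure. One option is a small wrapper that runs the steps in order and stops at the first one that fails. The sketch below assumes each step is passed as a single command string; the name `run_gate` is hypothetical.

```bash
# Run each step in order and stop at the first failure.
# Each argument is one complete command line.
run_gate() {
  local step
  for step in "$@"; do
    echo "==> $step"
    if ! bash -c "$step"; then
      echo "gate failed at: $step" >&2
      return 1
    fi
  done
  echo "gate passed"
}

# Example invocation with the gate commands from this section:
# run_gate \
#   "go vet ./..." \
#   "go test -race -buildvcs=false ./..." \
#   "scripts/aiops-single-node.sh doctor" \
#   "scripts/aiops-single-node.sh start" \
#   "scripts/aiops-single-node.sh backup" \
#   "scripts/aiops-single-node.sh recover"
```

Because the function returns non-zero on the first failing step, it can be used directly as a CI step or as a condition in a release script.
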

If a rollback drill window is available, also run:

```bash
scripts/aiops-single-node.sh restore backups/<latest>.sql.gz
```

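## 10. Still Not Multi-Node Production Grade

The single-node build does not provide:

- Multi-replica high availability
- PostgreSQL primary/standby failover
- Redis high availability
- Mutual exclusion of jobs across nodes
- Complete production notification implementations for Feishu/Wechat/email

It does, however, close the loop on stable operation, backup, rollback, and recovery for development machines and a single production server.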