DO-P1-2: Operations and Rollback Runbook

Status: Delivered | Owner: DevOps (filled in by 宰相 on DevOps's behalf) | Baseline: P0 complete, Gate B pre-production verification | Date: 2026-05-04


1. Pre-deployment Checklist (Pre-flight)

# 1. Confirm that all environment variables are set
echo "AI_CS_RUNTIME_ENV=$AI_CS_RUNTIME_ENV"
echo "AI_CS_POSTGRES_ENABLED=$AI_CS_POSTGRES_ENABLED"
echo "AI_CS_POSTGRES_DSN=${AI_CS_POSTGRES_DSN:+[SET]}"
echo "AI_CS_WEBHOOK_SECRET=${AI_CS_WEBHOOK_SECRET:+[SET]}"
echo "AI_CS_LOG_LEVEL=$AI_CS_LOG_LEVEL"

# 2. Confirm that PostgreSQL is reachable
PGPASSWORD=ai_cs_secret psql -h localhost -p 5434 -U ai_cs -d ai_customer_service -c "SELECT 1" || exit 1

# 3. Confirm that the migration has been applied
PGPASSWORD=ai_cs_secret psql -h localhost -p 5434 -U ai_cs -d ai_customer_service -c "SELECT table_name FROM information_schema.tables WHERE table_schema='public' ORDER BY table_name;" | grep -q cs_sessions || { echo "MIGRATION MISSING"; exit 1; }

# 4. Start the service (in the background)
nohup ./ai-customer-service > /var/log/ai-cs.log 2>&1 &
sleep 3

# 5. Verify the ready probe
curl -s http://localhost:8080/actuator/health/ready | grep -q '"status":"UP"' || { echo "READY FAILED"; cat /var/log/ai-cs.log; exit 1; }
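
Step 1 above only prints the variables; it does not stop the deployment when one is missing. A minimal sketch of a stricter check follows. `require_env` is a hypothetical helper, not part of the service, and which variables count as mandatory is an assumption.

```shell
# Hypothetical helper: fail when any named variable is unset or empty.
# Usage: require_env VAR_NAME [VAR_NAME...]
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup that works in POSIX sh as well as bash.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "MISSING: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

It could be called before step 2, for example `require_env AI_CS_RUNTIME_ENV AI_CS_POSTGRES_ENABLED AI_CS_POSTGRES_DSN AI_CS_WEBHOOK_SECRET || exit 1`. Every missing name is reported in one pass rather than stopping at the first.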

2. Troubleshooting Startup Failures

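| Symptom | Cause | Fix |
| --- | --- | --- |
| `memory fallback is not allowed` (ERROR) | Env=production but AI_CS_POSTGRES_ENABLED≠true | Set AI_CS_POSTGRES_ENABLED=true and restart |
| `AI_CS_POSTGRES_DSN is required` (ERROR) | Env=production but the DSN is not configured | Configure the full DSN: `postgres://user:pass@host:5434/db?sslmode=disable` |
| `listen tcp :8080: bind: address already in use` | Port 8080 is already taken | `pkill -f ai-customer-service`, or set AI_CS_ADDR=:8081 |
| `pq: connection refused` | PostgreSQL is unreachable | Check the PG host, port, and firewall; confirm that psql can connect |
| `pq: password authentication failed` | Wrong password | Check the password in AI_CS_POSTGRES_DSN |
| Startup succeeds but /actuator/health/ready returns postgres:DOWN | PG is connected but the health check fails | Check that PG is responding on the port specified in AI_CS_POSTGRES_DSN |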
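
The log-message rows of the table can be turned into a lookup for use during an incident. This is only a sketch: `startup_hint` is a hypothetical helper, and the patterns are the message fragments listed above.

```shell
# Hypothetical helper: map a startup log line to the remedy from the table.
# Usage: startup_hint "<log line>"
startup_hint() {
  case "$1" in
    *"memory fallback is not allowed"*) echo "set AI_CS_POSTGRES_ENABLED=true and restart" ;;
    *"AI_CS_POSTGRES_DSN is required"*) echo "configure the full AI_CS_POSTGRES_DSN" ;;
    *"address already in use"*)         echo "free port 8080 or change AI_CS_ADDR" ;;
    *"connection refused"*)             echo "check PostgreSQL host, port, and firewall" ;;
    *"password authentication failed"*) echo "check the password in AI_CS_POSTGRES_DSN" ;;
    *)                                  echo "no known match; read the full log" ;;
  esac
}
```

One possible use is `startup_hint "$(grep ERROR /var/log/ai-cs.log | tail -1)"`, which prints the remedy for the most recent error line.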

3. Troubleshooting Migration Failures

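| Symptom | Cause | Fix |
| --- | --- | --- |
| `pq: relation "cs_sessions" does not exist` | The migration has not been applied | Run it manually: `psql -f db/migration/0001_init.up.sql` |
| `pq: duplicate key value violates unique constraint` | The tables already exist and the migration was re-run | The migration is idempotent (CREATE TABLE IF NOT EXISTS); safe to ignore |
| `pq: permission denied` | The PG user lacks permission to create tables | Confirm that the ai_cs user is a superuser or owns the ai_customer_service database |

# Run the migration manually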
psql "postgres://ai_cs:ai_cs_secret@localhost:5434/ai_customer_service?sslmode=disable" -f db/migration/0001_init.up.sql
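
The pre-flight step only looks for `cs_sessions`. A broader check can compare the schema against every expected table. The sketch below assumes the five tables named in the TRUNCATE statement in section 6.3 are the full set the migration creates; `missing_tables` is a hypothetical helper.

```shell
# Assumed full table set, taken from the TRUNCATE statement in section 6.3.
EXPECTED_TABLES="cs_sessions cs_messages cs_tickets cs_audit_logs cs_message_dedup"

# Reads table names on stdin (one per line) and prints each expected table
# that is absent. Returns non-zero when at least one table is missing.
missing_tables() {
  present=$(cat)
  rc=0
  for t in $EXPECTED_TABLES; do
    # -x forces a whole-line match, so cs_messages does not match cs_message_dedup.
    if ! printf '%s\n' "$present" | grep -qx "[[:space:]]*$t[[:space:]]*"; then
      echo "$t"
      rc=1
    fi
  done
  return "$rc"
}
```

It can be fed from psql, for example `psql "$AI_CS_POSTGRES_DSN" -At -c "SELECT table_name FROM information_schema.tables WHERE table_schema='public'" | missing_tables`.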

4. Behavior When the Database Is Unavailable

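  • Env=production: at startup, config.go checks that AI_CS_POSTGRES_ENABLED=true. If the DSN is unreachable or authentication fails, the service refuses to start (it does not fall back to the memory store).
  • Env=test/development: AI_CS_POSTGRES_ENABLED=false may be set to use the memory store (for testing only).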
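
The rule above can be sketched as a small decision function. The real check lives in config.go; the shell function below, its name, and its return codes are illustrative assumptions. It also assumes that non-production environments accept AI_CS_POSTGRES_ENABLED=true without further validation, and it covers only the configuration check, not connectivity.

```shell
# Illustrative sketch of the store-selection rule; not the service's own code.
# Usage: select_store <runtime_env> <postgres_enabled> <dsn>
select_store() {
  runtime_env=$1
  enabled=$2
  dsn=$3
  if [ "$runtime_env" = "production" ]; then
    if [ "$enabled" != "true" ]; then
      echo "memory fallback is not allowed" >&2
      return 1
    fi
    if [ -z "$dsn" ]; then
      echo "AI_CS_POSTGRES_DSN is required" >&2
      return 1
    fi
    echo "postgres"
  elif [ "$enabled" = "true" ]; then
    echo "postgres"
  else
    # test/development only: in-memory store.
    echo "memory"
  fi
}
```

For example, `select_store production false ""` fails with the first error from section 2, while `select_store development false ""` prints `memory`.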

5. Troubleshooting Webhook Signature Authentication During Integration Testing

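| Symptom | Cause | Fix |
| --- | --- | --- |
| `CS_AUTH_4034 invalid webhook signature` | The HMAC secret does not match | Confirm that the upstream uses the same secret as AI_CS_WEBHOOK_SECRET |
| `CS_AUTH_4031 missing webhook signature` | The upstream did not send the X-CS-Signature header | Check the upstream webhook sending logic |
| `CS_AUTH_4033 stale webhook request` | The request timestamp is off by more than MaxSkew (default 300s) | Confirm that server clocks are synchronized (NTP), or adjust AI_CS_WEBHOOK_MAX_SKEW_SECONDS |
| Intermittent 403 | Clock drift exceeds 300s | Check the server time zone and NTP configuration |

# Verify the signature algorithm (local test)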
TS=$(date +%s)
BODY='{"test":"payload"}'
SIG=$(echo -n "${TS}.${BODY}" | openssl dgst -sha256 -hmac "test-secret-123" | awk '{print $2}')
curl -v -X POST http://localhost:8080/api/v1/customer-service/webhook \
  -H "Content-Type: application/json" \
  -H "X-CS-Timestamp: $TS" \
  -H "X-CS-Signature: $SIG" \
  -d "$BODY"
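
The signing and staleness rules can be separated into two helpers for debugging. This is a sketch under stated assumptions: the function names are hypothetical, and the symmetric skew window is a guess at how the server compares timestamps; only the `<timestamp>.<body>` HMAC-SHA256 format and the 300s default come from this runbook.

```shell
# Signs "<timestamp>.<body>" with HMAC-SHA256, as in the curl example above.
# $NF selects the hex digest whatever label openssl prints before it.
# Usage: sign_webhook <secret> <timestamp> <body>
sign_webhook() {
  secret=$1
  ts=$2
  body=$3
  printf '%s.%s' "$ts" "$body" | openssl dgst -sha256 -hmac "$secret" | awk '{print $NF}'
}

# Returns 0 (stale) when |now - ts| exceeds max_skew seconds (default 300).
# Usage: is_stale <timestamp> <now> [max_skew]
is_stale() {
  ts=$1
  now=$2
  max_skew=${3:-300}
  diff=$((now - ts))
  if [ "$diff" -lt 0 ]; then
    diff=$((0 - diff))
  fi
  [ "$diff" -gt "$max_skew" ]
}
```

Comparing `sign_webhook "$AI_CS_WEBHOOK_SECRET" "$TS" "$BODY"` with the value the upstream sent helps separate a secret mismatch (CS_AUTH_4034) from a clock problem (CS_AUTH_4033).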

6. Rollback Procedures

6.1 Version Rollback (from v1.1.0 to v1.0.0)

# 1. Record the current version
echo "Rolling back from $(./ai-customer-service --version) to v1.0.0"

# 2. Stop the current service
pkill -f "ai-customer-service"
sleep 2

# 3. Back up the current database (optional, but recommended before proceeding)
PGPASSWORD=ai_cs_secret pg_dump -h localhost -p 5434 -U ai_cs ai_customer_service > /tmp/ai_cs_backup_$(date +%Y%m%d_%H%M%S).sql

# 4. Fetch the old version's image / binary
# Docker: docker pull ai-customer-service:v1.0.0
# Binary: fetch the v1.0.0 binary from the backup location

# 5. Restart the service
nohup ./ai-customer-service-v1.0.0 > /var/log/ai-cs-v1.0.0.log 2>&1 &
sleep 3

# 6. Verify
curl -s http://localhost:8080/actuator/health/ready
curl -s http://localhost:8080/actuator/health
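
The fixed `sleep 3` before verification may be too short on a slow host. One alternative is to poll the ready endpoint until it reports UP. The helpers below are hypothetical and only sketch the idea; the endpoint and the `"status":"UP"` match are the ones used in the pre-flight check.

```shell
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}

# Succeeds only when the ready endpoint reports UP.
is_ready() {
  curl -s http://localhost:8080/actuator/health/ready | grep -q '"status":"UP"'
}
```

After restarting, `retry 10 1 is_ready || { echo "READY FAILED"; exit 1; }` waits up to roughly ten seconds and fails loudly if the service never becomes ready.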

6.2 Configuration Rollback

# If the new configuration is faulty, restore the previous environment variables
export AI_CS_POSTGRES_ENABLED=true
export AI_CS_POSTGRES_DSN="postgres://ai_cs:ai_cs_secret@localhost:5434/ai_customer_service?sslmode=disable"
export AI_CS_WEBHOOK_SECRET="previous-secret"
pkill -f "ai-customer-service"
sleep 2
nohup ./ai-customer-service > /var/log/ai-cs.log 2>&1 &

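6.3 Database Rollback (migrations do not support down-migration; handle manually)

-- Emergency: wipe all data and rebuild (development only)
TRUNCATE cs_audit_logs, cs_tickets, cs_messages, cs_sessions, cs_message_dedup CASCADE;
-- Then restart the service so the migration re-initializes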

7. Quick Health Diagnostics

#!/bin/bash
# 60-second quick diagnostic script

echo "=== AI-CS Health Diagnostic ==="
echo ""

echo "[1/5] Service process:"
ps aux | grep "ai-customer-service" | grep -v grep || echo "  NOT RUNNING ❌"

echo ""
echo "[2/5] HTTP endpoints:"
for endpoint in "/actuator/health/live" "/actuator/health/ready" "/actuator/health"; do
  status=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080$endpoint)
  echo "  $endpoint → HTTP $status $([ "$status" = "200" ] && echo '✅' || echo '❌')"
done

echo ""
echo "[3/5] PostgreSQL:"
PGPASSWORD=ai_cs_secret psql -h localhost -p 5434 -U ai_cs -d ai_customer_service -c "SELECT count(*) as tickets FROM cs_tickets;" 2>&1 | grep -v "^Password" | tail -1

echo ""
echo "[4/5] Recent errors in log:"
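# Capture first: a pipeline ending in `tail` always exits 0, so `|| echo` would never fire.
errors=$(tail -50 /var/log/ai-cs.log 2>/dev/null | grep "ERROR" | tail -5)
if [ -n "$errors" ]; then echo "$errors"; else echo "  No recent errors ✅"; fi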

echo ""
echo "[5/5] Webhook test:"
TS=$(date +%s)
BODY='{"channel":"widget","message_id":"diag-001","open_id":"diag-open","content":"health check","timestamp":"2026-05-04T00:00:00Z"}'
SIG=$(echo -n "${TS}.${BODY}" | openssl dgst -sha256 -hmac "test-secret-123" | awk '{print $2}')
curl -s -X POST http://localhost:8080/api/v1/customer-service/webhook \
  -H "Content-Type: application/json" \
  -H "X-CS-Timestamp: $TS" \
  -H "X-CS-Signature: $SIG" \
  -d "$BODY" | head -c 200

echo ""
echo "=== Diagnostic complete ==="