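# Consul Persistent Storage Fix

## 🚨 Diagnosis

**Root cause: your Consul cluster has no persistent storage configured!**

### Current problems:

1. **Data directory** - `/opt/nomad/data/consul` is only a temporary directory inside the container
2. **No volume mount** - all data is lost on restart
3. **Missing onecloud1 node** - the configuration does not match the actual running state

### Impact:

- ✅ **Consul service discovery works** - this data is held in memory
- ❌ **KV store data lost** - all configuration, tokens, and certificates are gone
- ❌ **ACL configuration lost** - permission settings are reset
- ❌ **Service configuration lost** - metadata of registered services is gone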

## 🔧 Fix

### Step 1: Configure persistent storage

**Run on every Consul node:**

```bash
# Run on each of the ch4, ash3c, and warden nodes
./scripts/setup-consul-persistent-storage.sh
```

**The script will:**

1. Create the `/opt/consul/data` directory
2. Set the correct ownership (nomad:nomad)
3. Add a host volume to the Nomad configuration
4. Restart the Nomad client
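
For readers who want to see what these four steps amount to, here is a minimal sketch. It is not the contents of the real `setup-consul-persistent-storage.sh`, which may differ. The volume name `consul-data`, the config directory `/etc/nomad.d`, the file name `consul-volume.hcl`, and the use of systemd are all assumptions. It also assumes Nomad loads every file in that directory and merges the `client` blocks.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the four setup steps; the real script may differ.

DATA_DIR="${DATA_DIR:-/opt/consul/data}"          # path taken from this document
VOLUME_NAME="${VOLUME_NAME:-consul-data}"         # assumed volume name
NOMAD_CONF_DIR="${NOMAD_CONF_DIR:-/etc/nomad.d}"  # assumed Nomad config directory

# Print a Nomad client stanza that exposes DATA_DIR as a host volume.
render_host_volume() {
  cat <<EOF
client {
  host_volume "$VOLUME_NAME" {
    path      = "$DATA_DIR"
    read_only = false
  }
}
EOF
}

# Perform the four steps; must run as root on the node.
setup_storage() {
  mkdir -p "$DATA_DIR" || return 1                        # 1. create the directory
  chown nomad:nomad "$DATA_DIR" || return 1               # 2. set ownership
  render_host_volume > "$NOMAD_CONF_DIR/consul-volume.hcl" || return 1  # 3. add the host volume
  systemctl restart nomad                                 # 4. restart the client (assumes systemd)
}
```

After sourcing the file, `render_host_volume` previews the stanza without touching the system, and `setup_storage` applies it.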

### Step 2: Deploy the persistent Consul job

**Stop the current job:**

```bash
nomad job stop consul-cluster-nomad
```

**Deploy the new configuration:**

```bash
nomad job run infrastructure/nomad/nomad-jobs/consul-cluster/consul-cluster-persistent.nomad
```
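
A slightly more cautious variant of the two commands above validates the new job file before stopping the old job, so that a syntax error does not leave the cluster without Consul. The wrapper function `redeploy_consul` is hypothetical. The job name and file path are the ones used in this document.

```bash
#!/usr/bin/env bash
# Sketch: validate first, then stop the old job and run the new one.

OLD_JOB="consul-cluster-nomad"
NEW_JOB_FILE="infrastructure/nomad/nomad-jobs/consul-cluster/consul-cluster-persistent.nomad"

redeploy_consul() {
  # Abort before touching the running job if the new file does not validate.
  nomad job validate "$NEW_JOB_FILE" || {
    echo "validation failed; old job left running" >&2
    return 1
  }
  nomad job stop "$OLD_JOB" || return 1
  nomad job run "$NEW_JOB_FILE"
}
```

Run `redeploy_consul` from the repository root so the relative job path resolves.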

### Step 3: Restore data

**If you have a backup:**

```bash
# Restore from a Consul KV backup
consul kv import @backup.json

# Or restore from a snapshot
consul snapshot restore backup.snap
```

**If you have no backup:**

- Reconfigure all KV data
- Set up the Cloudflare tokens again
- Re-register services

## 🎯 Advantages of the new configuration

### Persistent storage:

- **Host volume** - data is stored on the host at `/opt/consul/data`
- **Restart safe** - restarting the job does not lose data
- **Across allocations** - data persists from one allocation to the next
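
To make the host-volume idea concrete, the sketch below prints the kind of job-side fragment that claims a host volume and mounts it into the task. It is not taken from `consul-cluster-persistent.nomad`. The group and task names, the volume name `consul-data`, and the mount path `/consul/data` are assumptions, and it presumes a task driver that supports volume mounts.

```bash
#!/usr/bin/env bash
# Hypothetical job fragment; all names and the mount path are assumptions.

VOLUME_NAME="${VOLUME_NAME:-consul-data}"   # assumed host volume name
MOUNT_PATH="${MOUNT_PATH:-/consul/data}"    # assumed path used as Consul's data dir

# Print the group-level volume claim and the task-level mount.
render_job_volume_fragment() {
  cat <<EOF
group "consul" {
  # Claim the host volume that the Nomad client exposes on this node.
  volume "$VOLUME_NAME" {
    type      = "host"
    source    = "$VOLUME_NAME"
    read_only = false
  }

  task "consul" {
    # Mount the claimed volume where Consul writes its data.
    volume_mount {
      volume      = "$VOLUME_NAME"
      destination = "$MOUNT_PATH"
    }
  }
}
EOF
}
```

Because the data lives on the host path rather than in the allocation directory, a replacement allocation on the same node sees the same files.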
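
### Improved configuration:

- **Uniform bootstrap-expect=3** - every node knows the expected cluster size
- **Health checks** - service status is monitored automatically
- **Log level** - easier debugging
- **Service registration** - services register with Consul automatically

## 📋 Checklist

### Preparation:

- [ ] Back up the current KV data (if any remains)
- [ ] Record the current service registration state
- [ ] Prepare the data that will need to be reconfigured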
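
The first two preparation items could be scripted roughly as follows. This is a sketch that uses the standard `consul kv export`, `consul snapshot save`, and `consul catalog services` subcommands. The backup directory `./consul-backup` and the file name `services.txt` are hypothetical, while `backup.json` and `backup.snap` match the file names in the restore step.

```bash
#!/usr/bin/env bash
# Sketch: capture whatever state is still available before the redeploy.

BACKUP_DIR="${BACKUP_DIR:-./consul-backup}"   # hypothetical output directory

backup_consul() {
  mkdir -p "$BACKUP_DIR" || return 1
  # Export the whole KV tree as JSON (consumed later by `consul kv import`).
  consul kv export > "$BACKUP_DIR/backup.json" || return 1
  # Save a full snapshot (consumed later by `consul snapshot restore`).
  consul snapshot save "$BACKUP_DIR/backup.snap" || return 1
  # Record the names of currently registered services for reference.
  consul catalog services > "$BACKUP_DIR/services.txt"
}
```

If ACLs are enabled, the commands need a token with sufficient privileges in the environment.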

### Execution:

- [ ] Run the storage setup script on the ch4 node
- [ ] Run the storage setup script on the ash3c node
- [ ] Run the storage setup script on the warden node
- [ ] Stop the current Consul job
- [ ] Deploy the persistent Consul job
- [ ] Verify the cluster status

### Verification:

- [ ] Check the Consul cluster status
- [ ] Verify leader election
- [ ] Test the KV store
- [ ] Restore critical configuration data
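
The first three verification items can be sketched as one function. It relies on `consul members`, `consul operator raft list-peers`, and the `consul kv` subcommands, and assumes their default column layouts (status and type in columns 3 and 4 of `members`, state in column 4 of `list-peers`). The key name `persistence-check` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: check server count, leader election, and a KV round trip.

EXPECTED_SERVERS="${EXPECTED_SERVERS:-3}"   # matches bootstrap-expect=3

verify_consul() {
  local servers leaders
  # Count server agents that report "alive".
  servers="$(consul members | awk '$3 == "alive" && $4 == "server" {n++} END {print n+0}')"
  [ "$servers" -eq "$EXPECTED_SERVERS" ] || {
    echo "expected $EXPECTED_SERVERS alive servers, found $servers" >&2
    return 1
  }
  # Exactly one Raft peer should be the leader.
  leaders="$(consul operator raft list-peers | awk '$4 == "leader" {n++} END {print n+0}')"
  [ "$leaders" -eq 1 ] || { echo "expected 1 leader, found $leaders" >&2; return 1; }
  # Write, read back, and remove a throwaway key.
  consul kv put persistence-check ok > /dev/null || return 1
  [ "$(consul kv get persistence-check)" = "ok" ] || return 1
  consul kv delete persistence-check > /dev/null
}
```

To test persistence itself rather than just availability, write a key, restart the job, and confirm the key is still readable afterwards.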
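
## 🚨 Important

**This is a serious architectural flaw!**

- A production Consul cluster without persistent storage is unacceptable
- It is like building a bank vault on sand
- It must be fixed immediately, or data can be lost again at any time

**Benefits after the fix:**

- A truly highly available Consul cluster
- Guaranteed data persistence
- Meets production standards
- Safe to restart and maintain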