Clean repository: organized structure and GitOps setup

- Organized root directory structure
- Moved orphan files to proper locations
- Updated .gitignore to ignore temporary files
- Set up Gitea Runner for GitOps automation
- Fixed Tailscale access issues
- Added workflow for automated Nomad deployment

121
docs/7-days-creation-world.md
Normal file
@@ -0,0 +1,121 @@
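# CSOL Infrastructure Build-out: Creating the World in 7 Days

## Overview

This document describes the complete CSOL infrastructure build-out. It uses the metaphor of "creating the world in 7 days" to walk through the process step by step, from network connectivity to application deployment.

## Day 1: Tailscale - Network Connectivity Foundation

**Goal**: Connect all distributed sites over one network

**Core tasks**:
- Deploy Tailscale on every node to establish secure connectivity
- Make sure all nodes can reach each other over the Tailscale network
- Lay the network foundation for later distributed management

**Key outcomes**:
- All nodes have joined the Tailscale network
- Nodes communicate directly via their Tailscale IPs
- Ansible, Nomad, and the later tools have a network to build on

## Day 2: Ansible - Distributed Control

**Goal**: Flexible control over distributed nodes

**Core tasks**:
- Deploy Ansible as the configuration management tool
- Build an inventory file that tracks every node
- Write playbooks that provide "octopus-style" remote control

**Key outcomes**:
- All nodes can be managed in bulk through Ansible
- A standardized configuration management workflow
- Automated software deployment and configuration updates

## Day 3: Nomad - Service Awareness and Task Scheduling

**Goal**: Establish service awareness and task scheduling, with fault tolerance

**Core tasks**:
- Deploy a Nomad cluster for resource scheduling
- Configure server nodes and client nodes
- Set up service discovery and health checks

**Key outcomes**:
- The cluster can schedule tasks
- Automatic service discovery and failover
- Efficient resource use and load balancing
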
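## Day 4: Consul - Centralized Configuration Management

**Goal**: Solve centralized configuration management for containerized workloads

**Core tasks**:
- Deploy a Consul cluster for configuration management and service discovery
- Launch the Consul service through Nomad
- Set up the key-value store for dynamic configuration

**Key outcomes**:
- Centralized configuration with dynamic updates
- Service registration and discovery
- A foundation for the later Vault integration

## Day 5: Terraform - State Consistency

**Goal**: Solve infrastructure state consistency

**Core tasks**:
- Manage infrastructure resources with Terraform
- Establish infrastructure as code (IaC) as a practice
- Keep environment state consistent and reproducible

**Key outcomes**:
- Declarative infrastructure management
- Consistent, predictable state
- Environments that can be replicated and rebuilt quickly

## Day 6: Vault - Secure Secrets Management

**Goal**: Manage environment variables and sensitive information for large-scale automation

**Core tasks**:
- Deploy a Vault cluster to provide secrets management
- Integrate Vault with Nomad and Consul
- Set up dynamic secrets

**Key outcomes**:
- Centralized, secure management of sensitive information
- Dynamic secret generation and rotation
- A secure way for automated workflows to fetch configuration

## Day 7: Waypoint - Modern Application Deployment

**Goal**: Modernize how applications are deployed

**Core tasks**:
- Deploy Waypoint for application lifecycle management
- Establish a standardized deployment workflow
- Integrate CI/CD

**Key outcomes**:
- Standardized, automated application deployment
- A better developer experience
- End-to-end application lifecycle management
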
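## Build Principles

1. **Step by step**: Follow the 7-day order strictly; each stage builds on the completion of the previous one
2. **Explicit dependencies**: Every tool has clear dependencies, which keeps the architecture sound
3. **Complementary functions**: Each tool solves a specific problem, and together they form a complete infrastructure solution
4. **Scalability**: The overall design accounts for future growth

## Important Reminder

**Current problem**: The local node does not accept tasks, so Consul cannot be deployed, which leaves the configuration in a confused state

**Solution**:
1. Make the local node a Consul management node as well
2. Make sure the local node can accept and run tasks
3. Set up a sticky-note mechanism as a constant reminder of configuration state and dependencies

**Core logic**: Consul can only be deployed correctly once the local node accepts tasks, and the rest of the infrastructure build-out depends on that.
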
17
docs/API.md
Normal file
@@ -0,0 +1,17 @@
# API Documentation

## MCP Server API

### Qdrant MCP Server

- **Port**: 3000
- **Protocol**: HTTP/JSON-RPC
- **Function**: Vector search and document management

### Main Endpoints

- `/search` - search documents
- `/add` - add a document
- `/delete` - delete a document

For details, refer to the source code of each MCP server.

144
docs/CONSUL_ARCHITECTURE.md
Normal file
@@ -0,0 +1,144 @@
# Consul Cluster Architecture Design

## Current Architecture

### Consul Servers (3)
- **master** (100.117.106.136) - Korea, current leader
- **warden** (100.122.197.112) - Beijing, voter
- **ash3c** (100.116.80.94) - United States, voter

### Consul Clients (1+)
- **hcp1** (100.97.62.111) - Beijing, system-level client

## Architecture Advantages

### ✅ Strengths of the current design:
1. **High availability** - 3 servers tolerate 1 failure
2. **Geographic distribution** - spans three regions, giving strong disaster tolerance
3. **Performance** - every region has a local server
4. **Scalability** - clients can be added as needed

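The "3 servers tolerate 1 failure" figure follows from Raft's majority quorum. The arithmetic can be sketched in shell as below; `quorum_size` and `fault_tolerance` are illustrative helper names, not Consul commands.

```bash
#!/usr/bin/env bash
# Raft commits a write only when a majority (quorum) of servers agree.
# Both helpers are illustrative and not part of Consul.

# Smallest majority of N servers.
quorum_size() {
  local servers=$1
  echo $(( servers / 2 + 1 ))
}

# Servers that can fail while a quorum still exists.
fault_tolerance() {
  local servers=$1
  echo $(( servers - (servers / 2 + 1) ))
}

for n in 1 3 5; do
  echo "servers=$n quorum=$(quorum_size "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

Growing from 3 to 4 servers would raise the quorum to 3 without tolerating any additional failure, which is one reason odd server counts are the usual choice.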
### ✅ Why running hcp1 as a client is correct:
1. **Local service registration** - Traefik runs on hcp1, so a local client is the most efficient option
2. **Lower network latency** - avoids registering services across networks
3. **Better health checks** - a local client can check service state more accurately
4. **Fault isolation** - a failure of the hcp1 client does not affect cluster consensus

## Expansion Recommendations

### 🎯 Ideal client deployment:
```
Every node that runs business services should have a Consul client:

┌────────────────┬────────────────┬───────────────────┐
│ Node           │ Client         │ Service           │
├────────────────┼────────────────┼───────────────────┤
│ master         │ ✓ (built-in)   │ Consul            │
│ warden         │ ✓ (built-in)   │ Consul            │
│ ash3c          │ ✓ (built-in)   │ Consul            │
│ hcp1           │ ✓ (standalone) │ Traefik           │
│ other nodes... │ recommended    │ other services... │
└────────────────┴────────────────┴───────────────────┘
```

### 🔧 Standard client configuration:
```hcl
# Consul client configuration for hcp1 (/etc/consul.d/consul.hcl)
datacenter = "dc1"
data_dir   = "/opt/consul"
log_level  = "INFO"
node_name  = "hcp1"
bind_addr  = "100.97.62.111"

# Join all servers
retry_join = [
  "100.117.106.136", # master
  "100.122.197.112", # warden
  "100.116.80.94"    # ash3c
]

# Client mode
server = false
ui_config {
  enabled = false # clients do not need the UI
}

# Service discovery and health checks
ports {
  grpc = 8502
  http = 8500
}

connect {
  enabled = true
}
```

## Service Registration Strategy

### 🎯 Recommended options:
1. **Automatic registration by Nomad** (preferred)
   - Configured through Nomad's `consul` block
   - Handles the service lifecycle automatically
   - Integrated with the deployment workflow

2. **Registration through the local client** (current approach)
   - Goes through the local Consul client
   - Managed by hand, but more flexible
   - Suited to complex registration logic

3. **Registration through the Catalog API** (emergency option)
   - Talks to the Consul API directly
   - Bypasses synchronization problems
   - Used for failure recovery

### 🔄 Migrating to Nomad registration:
```hcl
# In the Nomad client configuration
consul {
  address             = "127.0.0.1:8500" # local Consul client
  server_service_name = "nomad"
  client_service_name = "nomad-client"
  auto_advertise      = true
  server_auto_join    = false
  client_auto_join    = true
}
```

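## Monitoring and Maintenance

### 📊 Key metrics:
- **Raft index synchronization** - confirms that all servers hold consistent data
- **Client connection state** - tracks each client's connection to the servers
- **Service registration latency** - time from registration to discoverability
- **Health check status** - tracks service health

### 🛠️ Maintenance scripts:
```bash
# Cluster health check
./scripts/consul-cluster-health.sh

# Service synchronization check
./scripts/verify-service-sync.sh

# Failure recovery
./scripts/consul-recovery.sh
```

## Failure Handling

### 🚨 Common problems:
1. **Server failure** - automatic failover, no intervention needed
2. **Client disconnect** - restart the client; it reconnects automatically
3. **Service synchronization problems** - force a sync through the Catalog API
4. **Network partition** - handled automatically by Raft, as long as a majority of servers can still reach each other

### 🔧 Recovery steps:
1. Check the cluster state
2. Verify network connectivity
3. Restart the affected components
4. Force services to re-register

---

**Conclusion**: The current architecture is sound, and running hcp1 as a client is the right choice. Keep the existing architecture and consider adding Consul clients to the other business nodes.
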
188
docs/CONSUL_ARCHITECTURE_OPTIMIZATION.md
Normal file
@@ -0,0 +1,188 @@
# Consul Architecture Optimization Plan

## Current Pain Points

### Network latency today:
- **Within Beijing**: ~0.6ms (same office)
- **Beijing ↔ Korea**: ~72ms
- **Beijing ↔ United States**: ~215ms

### Node distribution:
- **Beijing**: warden, hcp1, influxdb1, browser (4 nodes)
- **Korea**: master (1 node)
- **United States**: ash3c (1 node)

## Architecture Trade-offs

### 🏛️ Option 1: current geographically distributed architecture
```
Consul Servers: master (Korea) + warden (Beijing) + ash3c (United States)

Pros:
✅ Genuinely highly available - keeps working if any single region fails
✅ Disaster recovery - earthquakes, power cuts, and network outages are all covered
✅ Load spread globally

Cons:
❌ Write latency ~200ms (trans-Pacific consensus)
❌ High network cost
❌ Complex operations
```

### 🏢 Option 2: Beijing-centralized architecture
```
Consul Servers: warden + hcp1 + influxdb1 (all in Beijing)

Pros:
✅ Very low latency ~0.6ms
✅ Simple operations
✅ Low cost

Cons:
❌ Single point of failure - a Beijing network outage takes everything down
❌ No disaster recovery
❌ "Self-indulgent" - Korea and the United States are always the minority
```

### 🎯 Option 3: hybrid architecture (recommended)
```
Primary Cluster (Beijing): 3 servers - handles day-to-day traffic
Backup Cluster (global): 3 servers - disaster recovery

Or:
Local Consul (Beijing): fast local service discovery
Global Consul (distributed): cross-region service discovery
```

## 🚀 Recommended Implementation

### Phase 1: optimize the current architecture
```hcl
# 1. Tune Raft parameters for trans-oceanic latency
raft_protocol           = 3
raft_snapshot_threshold = 16384
raft_trailing_logs      = 10000

# 2. Enable local caching
cache {
  entry_fetch_max_burst = 42
  entry_fetch_rate      = 30
}

# 3. Network tuning
performance {
  raft_multiplier = 5 # higher value = more tolerance for slow links
}
```

### Phase 2: deploy local Consul clients
```bash
# Deploy a Consul client on every Beijing node.
# deploy_consul_client stands for your own deployment helper; it is not a
# Consul command.
nodes=(hcp1 influxdb1 browser)

# warden is listed first so Beijing nodes prefer the local server.
# Clients join over the Serf LAN port (8301 by default); 8300 is the
# server RPC port.
retry_join=(
  "warden.tailnet-68f9.ts.net:8301"
  "master.tailnet-68f9.ts.net:8301"
  "ash3c.tailnet-68f9.ts.net:8301"
)

for node in "${nodes[@]}"; do
  deploy_consul_client "$node" "${retry_join[@]}"
done
```

### Phase 3: smart routing
```hcl
# Location-aware routing:
#   Beijing nodes prefer warden
#   Korea nodes prefer master
#   United States nodes prefer ash3c

connect {
  enabled = true
}

# Local-first policy
node_meta {
  region = "beijing"
  zone   = "office-1"
}
```

## 🎯 Final Recommendation

### For your scenario:

**Keep the current 3-node geographic distribution, but optimize performance:**

1. **Accept the latency** - 200ms is acceptable for most applications
2. **Optimize local access** - deploy more Consul clients
3. **Cache intelligently** - cache hot data locally
4. **Separate reads from writes** - reads go local, writes go through Raft

### Concrete optimizations:

```bash
# 1. Deploy a Consul client on all 4 Beijing nodes
./scripts/deploy-consul-clients.sh beijing

# 2. Configure a local-first policy
#    (the block below is Consul agent configuration, not shell)
consul_config {
  datacenter = "dc1"
  node_meta = {
    region = "beijing"
  }

  # Local read optimization
  ui_config {
    enabled = true
  }

  # Cache configuration
  cache {
    entry_fetch_max_burst = 42
  }
}

# 3. Application-level optimizations
# - Use a local DNS cache
# - Batch operations to reduce Raft writes
# - Update non-critical data asynchronously
```

## 🔍 Monitoring Metrics

```bash
# Key metrics to monitor
consul_metrics=(
  "consul.raft.commitTime"          # Raft commit latency
  "consul.raft.leader.lastContact"  # time since the leader last contacted followers
  "consul.dns.stale_queries"        # stale DNS queries
  "consul.catalog.register_time"    # service registration time
)
```

## 💡 Conclusion

**Your analysis is entirely correct!**

- ✅ **Geographic distribution does carry a latency cost**
- ✅ **Centralizing in Beijing is indeed "self-indulgent"**
- ✅ **This is a fundamental trade-off of distributed systems**

**Best strategy: keep the current architecture and reduce the latency impact through optimization**

Because:
1. **200ms of latency is acceptable for most workloads**
2. **Genuine high availability matters more than latency**
3. **Caching and tuning can improve the experience considerably**

Your technical judgment is accurate. This is a trade-off with no perfect answer.

170
docs/CONSUL_SERVICE_REGISTRATION.md
Normal file
@@ -0,0 +1,170 @@
# Consul Service Registration Solution

## Background

The trans-Pacific Nomad + Consul cluster ran into the following problems:
1. **RFC1918 addresses** - Nomad's automatic registration uses private IPs, which are unreachable across networks
2. **Consul leader changes** - services were registered on a single node only, so they were lost when the leader changed
3. **Service flapping** - failing health checks caused services to register and deregister repeatedly

## Solution

### 1. Redundant registration on multiple nodes

**Core idea: register the service on every Consul node at once, so leader changes have no effect**

#### Consul cluster nodes:
- `master.tailnet-68f9.ts.net:8500` (Korea, usually the leader)
- `warden.tailnet-68f9.ts.net:8500` (Beijing, preferred node)
- `ash3c.tailnet-68f9.ts.net:8500` (United States, backup node)

#### Registration script: `scripts/register-traefik-to-all-consul.sh`

```bash
#!/bin/bash
# Register the Traefik service on all three Consul nodes

CONSUL_NODES=(
  "master.tailnet-68f9.ts.net:8500"
  "warden.tailnet-68f9.ts.net:8500"
  "ash3c.tailnet-68f9.ts.net:8500"
)

TRAEFIK_IP="100.97.62.111" # Tailscale IP, not RFC1918
ALLOC_ID=$(nomad job allocs traefik-consul-lb | head -2 | tail -1 | awk '{print $1}')

# Register on all nodes...
```

### 2. Use Tailscale addresses

**Key settings:**
- Service address: `100.97.62.111` (Tailscale IP)
- Avoid RFC1918 private addresses (`192.168.x.x`)
- Reachable across networks

### 3. Relaxed health checks

**Tuned for the trans-Pacific network:**
- Interval: `30s` (instead of the default 10s)
- Timeout: `15s` (instead of the default 5s)
- Avoids false alarms caused by network latency

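The three ideas above (register on every node, use the Tailscale address, relax the check timings) could be combined roughly as follows. This is a sketch under assumptions, not the contents of `register-traefik-to-all-consul.sh`: the service ID, port 80, the HTTP check URL, and the `DRY_RUN` switch are placeholders chosen for illustration.

```bash
#!/usr/bin/env bash
# Sketch: register one service on every Consul node with relaxed health checks.
# Assumptions (not from the original script): the check URL, the DRY_RUN
# switch, and the helper names build_payload / register_all.

CONSUL_NODES=(
  "master.tailnet-68f9.ts.net:8500"
  "warden.tailnet-68f9.ts.net:8500"
  "ash3c.tailnet-68f9.ts.net:8500"
)

# Build the JSON body for PUT /v1/agent/service/register.
build_payload() {
  local id=$1 name=$2 address=$3 port=$4
  cat <<EOF
{
  "ID": "${id}",
  "Name": "${name}",
  "Address": "${address}",
  "Port": ${port},
  "Check": {
    "HTTP": "http://${address}:${port}/",
    "Interval": "30s",
    "Timeout": "15s"
  }
}
EOF
}

# Send the payload to each node. With DRY_RUN=1 the request is only printed.
register_all() {
  local payload=$1 node
  for node in "${CONSUL_NODES[@]}"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "PUT http://${node}/v1/agent/service/register"
    else
      curl -sf -X PUT "http://${node}/v1/agent/service/register" -d "${payload}" \
        || echo "registration failed on ${node}" >&2
    fi
  done
}
```

A dry run such as `DRY_RUN=1 register_all "$(build_payload traefik-consul-lb-manual consul-lb 100.97.62.111 80)"` prints one request line per node without contacting the cluster.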
## Making It Persistent

### Option A: Nomad job integration (recommended)

Add a lifecycle hook to the Traefik job:

```hcl
task "consul-registrar" {
  driver = "exec"

  lifecycle {
    hook    = "poststart"
    sidecar = false
  }

  config {
    command = "/local/register-services.sh"
  }
}
```

### Option B: scheduled task

```bash
# Add to crontab
*/5 * * * * /root/mgmt/scripts/register-traefik-to-all-consul.sh
```

### Option C: monitoring with Consul Template

Use consul-template to watch Traefik's state and register it automatically.

## Deployment Steps

1. **Deploy the simplified Traefik job**:
   ```bash
   nomad job run components/traefik/jobs/traefik.nomad
   ```

2. **Run the multi-node registration**:
   ```bash
   ./scripts/register-traefik-to-all-consul.sh
   ```

3. **Verify the registration**:
   ```bash
   # Check every node
   for node in master warden ash3c; do
     echo "=== $node ==="
     curl -s "http://$node.tailnet-68f9.ts.net:8500/v1/catalog/services" | jq 'keys[]' | grep -E "(consul-lb|traefik)"
   done
   ```

## Troubleshooting

### Problem: service missing on the Beijing warden node

**Possible causes:**
1. Consul cluster synchronization delay
2. Network partition or connectivity problem
3. Failing health check

**Diagnostic commands:**
```bash
# Check the Consul cluster state
curl -s http://warden.tailnet-68f9.ts.net:8500/v1/status/peers

# Check locally registered services
curl -s http://warden.tailnet-68f9.ts.net:8500/v1/agent/services

# Check health checks
curl -s http://warden.tailnet-68f9.ts.net:8500/v1/agent/checks
```

**Fix:**
```bash
# Force re-registration on warden
curl -X PUT http://warden.tailnet-68f9.ts.net:8500/v1/agent/service/register -d '{
  "ID": "traefik-consul-lb-manual",
  "Name": "consul-lb",
  "Address": "100.97.62.111",
  "Port": 80,
  "Tags": ["consul", "loadbalancer", "traefik", "manual"]
}'
```

## Monitoring and Maintenance

### Health check monitoring
```bash
# Check service health on all nodes
./scripts/check-consul-health.sh
```

### Periodic verification
```bash
# Daily verification script
./scripts/daily-consul-verification.sh
```

## Best Practices

1. **Geographic optimization** - prefer the geographically closest Consul node
2. **Redundant registration** - always register on all nodes to avoid a single point of failure
3. **Use Tailscale** - avoid RFC1918 addresses so services stay reachable across networks
4. **Relaxed checks** - use relaxed health check parameters on trans-oceanic links
5. **Documentation** - record every configuration change

## Access

- **Consul UI**: `https://hcp1.tailnet-68f9.ts.net/`
- **Traefik Dashboard**: `https://hcp1.tailnet-68f9.ts.net:8080/`

---

**Created**: 2025-10-02
**Last updated**: 2025-10-02
**Maintainer**: Infrastructure Team

23
docs/DEPLOYMENT.md
Normal file
@@ -0,0 +1,23 @@
# Deployment Documentation

## Quick Start

1. Set up the environment
   ```bash
   make setup
   ```

2. Initialize the services
   ```bash
   ./scripts/setup/init/init-vault-dev.sh
   ./scripts/deployment/consul/deploy-consul-cluster-kv.sh
   ```

3. Start the MCP server
   ```bash
   ./scripts/mcp/tools/start-mcp-server.sh
   ```

## Detailed Deployment Steps

Refer to each component's deployment scripts and configuration files.

162
docs/README-Backup.md
Normal file
@@ -0,0 +1,162 @@
# Nomad Jobs Backup Management

This document explains how to manage and restore backups of the Nomad job configuration.

## 📁 Backup Storage Locations

### Local backups
- **Path**: `/root/mgmt/backups/nomad-jobs-YYYYMMDD-HHMMSS/`
- **Archive**: `/root/mgmt/nomad-jobs-backup-YYYYMMDD.tar.gz`

### Consul KV backups
- **Data**: `backup/nomad-jobs/YYYYMMDD/data`
- **Metadata**: `backup/nomad-jobs/YYYYMMDD/metadata`
- **Index**: `backup/nomad-jobs/index`

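All of these locations are derived from the backup timestamp. A small sketch of that mapping is shown below; `backup_paths` is an illustrative helper, not an existing script in the repository.

```bash
#!/usr/bin/env bash
# Sketch: derive every backup name from a single timestamp so that the local
# directory, the archive, and the Consul KV keys always agree.
# backup_paths is an illustrative helper name.

backup_paths() {
  local ts=$1            # e.g. 20251004-074411
  local day=${ts%%-*}    # date part, e.g. 20251004
  echo "dir=/root/mgmt/backups/nomad-jobs-${ts}"
  echo "archive=/root/mgmt/nomad-jobs-backup-${day}.tar.gz"
  echo "kv_data=backup/nomad-jobs/${day}/data"
  echo "kv_metadata=backup/nomad-jobs/${day}/metadata"
}

# Print the paths a backup taken right now would use.
backup_paths "$(date +%Y%m%d-%H%M%S)"
```

Computing the timestamp once and passing it around avoids the case where two separate `date` calls straddle a second boundary and point at different directories.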
## 📋 Current Backups

### 2025-10-04 backup
- **Backup time**: 2025-10-04 07:44:11
- **Backup type**: Full Nomad jobs configuration
- **File count**: 25 `.nomad` files
- **Original size**: 208KB
- **Compressed size**: 13KB
- **Consul KV path**: `backup/nomad-jobs/20251004/data`

#### Service status
- ✅ **Traefik** (`traefik-cloudflare-v1`) - SSL certificates OK
- ✅ **Vault** (`vault-cluster`) - three-node high-availability cluster
- ✅ **Waypoint** (`waypoint-server`) - Web UI reachable

#### Domain and certificates
- **Domain**: `*.git4ta.me`
- **Certificates**: Let's Encrypt (Cloudflare DNS Challenge)
- **Status**: All certificates valid

## 🔧 Backup Management Commands

### List backups
```bash
# Show the backup index stored in Consul KV
consul kv get backup/nomad-jobs/index

# Show the metadata of a specific backup
consul kv get backup/nomad-jobs/20251004/metadata
```

### Restore a backup
```bash
# Fetch the backup from Consul KV
consul kv get backup/nomad-jobs/20251004/data > nomad-jobs-backup-20251004.tar.gz

# Extract the backup
tar -xzf nomad-jobs-backup-20251004.tar.gz

# Inspect the backup contents
ls -la backups/nomad-jobs-20251004-074411/
```

### Create a new backup
```bash
# Take the timestamp once so every step targets the same directory
TS=$(date +%Y%m%d-%H%M%S)
DAY=${TS%%-*}
BACKUP_DIR="backups/nomad-jobs-${TS}"

# Create the local backup directory
mkdir -p "$BACKUP_DIR"

# Back up the current configuration
cp -r components "$BACKUP_DIR/"
cp -r nomad-jobs "$BACKUP_DIR/"
cp waypoint-server.nomad "$BACKUP_DIR/"

# Compress the backup
tar -czf "nomad-jobs-backup-${DAY}.tar.gz" "$BACKUP_DIR/"

# Store it in Consul KV
consul kv put "backup/nomad-jobs/${DAY}/data" @"nomad-jobs-backup-${DAY}.tar.gz"
```

## 📊 Backup Strategy

### Backup frequency
- **Automatic backups**: weekly is recommended
- **Before important changes**: before deploying a new service or making a major configuration change
- **Emergencies**: back up the current state immediately when a service misbehaves

### Backup contents
- All `.nomad` files
- Configuration file templates
- Service dependencies
- Network and storage configuration

### Backup verification
```bash
# Verify backup integrity
tar -tzf nomad-jobs-backup-20251004.tar.gz | wc -l

# Check for the key files
tar -tzf nomad-jobs-backup-20251004.tar.gz | grep -E "(traefik|vault|waypoint)"
```

## 🚨 Recovery Procedure

### Emergency recovery
1. **Stop all services**
   ```bash
   nomad job stop traefik-cloudflare-v1
   nomad job stop vault-cluster
   nomad job stop waypoint-server
   ```

2. **Restore the backup**
   ```bash
   consul kv get backup/nomad-jobs/20251004/data > restore.tar.gz
   tar -xzf restore.tar.gz
   ```

3. **Redeploy**
   ```bash
   nomad job run backups/nomad-jobs-20251004-074411/components/traefik/jobs/traefik-cloudflare.nomad
   nomad job run backups/nomad-jobs-20251004-074411/nomad-jobs/vault-cluster.nomad
   nomad job run backups/nomad-jobs-20251004-074411/waypoint-server.nomad
   ```

### Partial recovery
```bash
# Restore a single service only
cp backups/nomad-jobs-20251004-074411/components/traefik/jobs/traefik-cloudflare.nomad components/traefik/jobs/
nomad job run components/traefik/jobs/traefik-cloudflare.nomad
```

## 📝 Backup Log

| Date | Backup type | Service status | Size | Consul KV path |
|------|-------------|----------------|------|----------------|
| 2025-10-04 | Full backup | All running | 13KB | `backup/nomad-jobs/20251004/data` |

## ⚠️ Notes

1. **Certificate backup**: SSL certificates are stored inside the container and are lost on restart
2. **Consul KV**: important configuration lives in Consul KV and needs its own backup
3. **Network configuration**: the Tailscale network configuration must be recorded separately
4. **Credential security**: the Vault and Waypoint credentials are stored in Consul KV

## 🔍 Troubleshooting

### Corrupted backup
```bash
# Check the integrity of the backup archive
tar -tzf nomad-jobs-backup-20251004.tar.gz > /dev/null && echo "backup intact" || echo "backup corrupted"
```

### Consul KV access problems
```bash
# Check the Consul connection
consul members

# Check the KV store
consul kv get backup/nomad-jobs/index
```

---

**Last updated**: 2025-10-04 07:45:00
**Backup status**: ✅ Current backup is complete and usable
**Service status**: ✅ All services running normally

166
docs/README-Traefik.md
Normal file
@@ -0,0 +1,166 @@
# Traefik Configuration Management Guide

## 🎯 Best Practice: Separate Configuration from the Application

### ⚠️ Important: avoid unprofessional practices

**❌ Wrong approach (looks amateurish):**
- Editing the Nomad job file to add a new domain
- Redeploying the whole Traefik service
- Embedding configuration in the application definition

**✅ Right approach (clean and professional):**

## Separated Configuration File Architecture

### 1. Configuration file locations

- **Dynamic configuration**: `/root/mgmt/components/traefik/config/dynamic.yml`
- **Application configuration**: `/root/mgmt/components/traefik/jobs/traefik-cloudflare-git4ta-live.nomad`

### 2. Key features

- ✅ **Hot reload**: Traefik is configured with the `file` provider and `watch: true`
- ✅ **Takes effect automatically**: changes to the YAML file apply without a restart
- ✅ **Separated configuration**: configuration and application are fully decoupled, in line with best practice

### 3. Workflow for adding a new domain

```bash
# Only the configuration file needs to be edited
vim /root/mgmt/components/traefik/config/dynamic.yml

# Add the new service definition (YAML content of dynamic.yml)
services:
  new-service-cluster:
    loadBalancer:
      servers:
        - url: "https://new-service.tailnet-68f9.ts.net:8080"
      healthCheck:
        path: "/health"
        interval: "30s"
        timeout: "15s"

# Add the new router definition
routers:
  new-service-ui:
    rule: "Host(`new-service.git-4ta.live`)"
    service: new-service-cluster
    entryPoints:
      - websecure
    tls:
      certResolver: cloudflare

# Takes effect as soon as the file is saved, no restart needed!
```

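Since every new domain needs the same pair of entries, the stanza can be generated rather than typed by hand. The sketch below only prints YAML; `render_route` is an illustrative helper, and it assumes the output is pasted under the existing `http:` section of `dynamic.yml`.

```bash
#!/usr/bin/env bash
# Sketch: print the service and router entries for a new backend.
# Assumptions: the entries belong under the existing `http:` section of
# dynamic.yml, and the /health path matches the backend.
# render_route is an illustrative helper name.

render_route() {
  local name=$1 backend_url=$2 host=$3
  cat <<EOF
  services:
    ${name}-cluster:
      loadBalancer:
        servers:
          - url: "${backend_url}"
        healthCheck:
          path: "/health"
          interval: "30s"
          timeout: "15s"
  routers:
    ${name}-ui:
      rule: "Host(\`${host}\`)"
      service: ${name}-cluster
      entryPoints:
        - websecure
      tls:
        certResolver: cloudflare
EOF
}

render_route new-service "https://new-service.tailnet-68f9.ts.net:8080" "new-service.git-4ta.live"
```

The backticks around the host name are escaped inside the heredoc so the shell does not treat them as command substitution.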
### 4. Architecture advantages

- 🚀 **Zero downtime**: configuration changes need no service restart
- 🔧 **Flexible management**: configuration and application are managed independently
- 📝 **Version control**: the configuration file can be versioned on its own
- 🎯 **Professional standard**: in line with modern DevOps best practice

## Current Service Configuration

### Configured services

1. **Consul cluster**
   - Domain: `consul.git-4ta.live`
   - Backend: load-balanced across multiple nodes
   - Health check: `/v1/status/leader`

2. **Nomad cluster**
   - Domain: `nomad.git-4ta.live`
   - Backend: load-balanced across multiple nodes
   - Health check: `/v1/status/leader`

3. **Waypoint service**
   - Domain: `waypoint.git-4ta.live`
   - Backend: `hcp1.tailnet-68f9.ts.net:9701`
   - Protocol: HTTPS (certificate verification skipped)

4. **Vault service**
   - Domain: `vault.git-4ta.live`
   - Backend: `warden.tailnet-68f9.ts.net:8200`
   - Health check: `/ui/`

5. **Authentik service**
   - Domain: `authentik.git-4ta.live`
   - Backend: `authentik.tailnet-68f9.ts.net:9443`
   - Protocol: HTTPS (certificate verification skipped)
   - Health check: `/flows/-/default/authentication/`

6. **Traefik Dashboard**
   - Domain: `traefik.git-4ta.live`
   - Service: built-in dashboard

### SSL certificate management

- **Certificate resolver**: Cloudflare DNS Challenge
- **Automatic renewal**: Let's Encrypt certificates are managed automatically
- **Storage location**: `/opt/traefik/certs/acme.json`
- **Forced HTTPS**: all HTTP requests are redirected to HTTPS

## Troubleshooting

### Check service status

```bash
# Check the Traefik API
curl -s http://hcp1.tailnet-68f9.ts.net:8080/api/overview

# Check the router configuration
curl -s http://hcp1.tailnet-68f9.ts.net:8080/api/http/routers

# Check the service configuration
curl -s http://hcp1.tailnet-68f9.ts.net:8080/api/http/services
```

### Check certificate status

```bash
# Check the SSL certificate
openssl s_client -connect consul.git-4ta.live:443 -servername consul.git-4ta.live < /dev/null 2>/dev/null | openssl x509 -noout -subject -issuer

# Check the certificate file
ssh root@hcp1 "cat /opt/traefik/certs/acme.json | jq '.cloudflare.Certificates'"
```

### View logs

```bash
# View the Traefik logs (-job picks an allocation of the named job)
nomad alloc logs -tail -job traefik-cloudflare-v1

# Filter for errors
nomad alloc logs -tail -job traefik-cloudflare-v1 | grep -i "error\|warn\|fail"
```

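## Best Practices

1. **Configuration management**
   - Always manage routing through the `dynamic.yml` file
   - Avoid editing the Nomad job file
   - Keep the configuration file under version control

2. **Service discovery**
   - Prefer Tailscale network addresses
   - Configure appropriate health checks
   - Use HTTPS (skipping verification for self-signed certificates)

3. **SSL certificates**
   - Rely on the Cloudflare DNS Challenge
   - Monitor automatic renewal
   - Check certificate status regularly

4. **Monitoring and logging**
   - Enable Traefik API monitoring
   - Configure access logs
   - Check service health regularly

## Remember

**Separating configuration from the application is a core principle of modern infrastructure management!**

This architecture improves flexibility and maintainability, and it reflects a professional level of DevOps practice.
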
120
docs/README-Vault.md
Normal file
@@ -0,0 +1,120 @@
# Vault Configuration

## Overview
Vault has been migrated to run under Nomad. It runs on three nodes (ch4, ash3c, warden) as a high-availability deployment.

## Access

### Vault service addresses
- **Primary node (Active)**: `http://100.117.106.136:8200` (ch4 node)
- **Standby node (Standby)**: `http://100.116.80.94:8200` (ash3c node)
- **Standby node (Standby)**: `http://100.122.197.112:8200` (warden node)
- **Web UI**: `http://100.117.106.136:8200/ui`

### Credentials
- **Unseal Key**: `/iHuxLbHWmx5xlJhqaTUMniiRc71eO1UAwNJj/lDWow=`
- **Root Token**: `hvs.dHtno0cCpWtFYMCvJZTgGmfn`

## Usage

### Environment variables
```bash
export VAULT_ADDR=http://100.117.106.136:8200
export VAULT_TOKEN=hvs.dHtno0cCpWtFYMCvJZTgGmfn
```

### Basic commands
```bash
# Check Vault status
vault status

# If Vault is sealed, unseal it with the unseal key
vault operator unseal /iHuxLbHWmx5xlJhqaTUMniiRc71eO1UAwNJj/lDWow=

# Log in to the Vault CLI with the token exported above
vault login "$VAULT_TOKEN"
```

## Storage Locations

### Consul KV
- **Unseal Key**: `vault/unseal-key`
- **Root Token**: `vault/root-token`
- **Configuration**: `vault/config/dev`

### Local backup
- **Backup directory**: `/root/vault-backup/`
- **Initialization script**: `/root/mgmt/scripts/vault-init.sh`

## Deployment

### Nomad job
- **Job name**: `vault-cluster-nomad`
- **Job file**: `/root/mgmt/nomad-jobs/vault-cluster.nomad`
- **Nodes**: ch4, ash3c, warden
- **Parallel deployment**: runs on all 3 nodes at once

### Configuration highlights
- **Storage backend**: Consul
- **High availability**: enabled
- **Seal type**: Shamir
- **Key shares**: 1
- **Threshold**: 1

## Troubleshooting

### If Vault is sealed
```bash
# 1. Check the status
vault status

# 2. Unseal every node with the unseal key
# ch4 node
export VAULT_ADDR=http://100.117.106.136:8200
vault operator unseal /iHuxLbHWmx5xlJhqaTUMniiRc71eO1UAwNJj/lDWow=

# ash3c node
export VAULT_ADDR=http://100.116.80.94:8200
vault operator unseal /iHuxLbHWmx5xlJhqaTUMniiRc71eO1UAwNJj/lDWow=

# warden node
export VAULT_ADDR=http://100.122.197.112:8200
vault operator unseal /iHuxLbHWmx5xlJhqaTUMniiRc71eO1UAwNJj/lDWow=

# 3. Verify the unseal status
vault status
```

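The three per-node steps follow one pattern, so they can be written as a loop. The sketch below is one possible shape, assuming the key is supplied through a `VAULT_UNSEAL_KEY` environment variable; that variable, the `DRY_RUN` switch, and `unseal_all` are illustrative names, not part of Vault.

```bash
#!/usr/bin/env bash
# Sketch: unseal every Vault node in a loop, reading the key from the
# environment instead of repeating it in each command.
# Assumptions: VAULT_UNSEAL_KEY is exported beforehand; DRY_RUN and
# unseal_all are illustrative names.

VAULT_NODES=(
  "http://100.117.106.136:8200"   # ch4
  "http://100.116.80.94:8200"     # ash3c
  "http://100.122.197.112:8200"   # warden
)

unseal_all() {
  local addr
  for addr in "${VAULT_NODES[@]}"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "would unseal ${addr}"
    else
      # VAULT_ADDR is set for this one command only.
      VAULT_ADDR="${addr}" vault operator unseal "${VAULT_UNSEAL_KEY:?set VAULT_UNSEAL_KEY first}"
    fi
  done
}
```

Running `DRY_RUN=1 unseal_all` lists the nodes that would be contacted without calling `vault`.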
### If the credentials are forgotten
```bash
# Fetch them from Consul KV
consul kv get vault/unseal-key
consul kv get vault/root-token
```

### Restart the Vault service
```bash
# Restart the Nomad job
nomad job restart vault-cluster-nomad

# Or restart a specific allocation
nomad alloc restart <allocation-id>
```

## Security Notes

⚠️ **Important**:
- Keep the Unseal Key and Root Token safe
- Do not use the Root Token for day-to-day operations in production
- Create users and policies with appropriate permissions
- Rotate keys and tokens regularly

## Change History

- **2025-10-04**: Migrated Vault to Nomad management
- **2025-10-04**: Re-initialized Vault and obtained new credentials
- **2025-10-04**: Optimized the deployment strategy to run three nodes in parallel

---
*Last updated: 2025-10-04*
*Maintainer: ben*

157
docs/README-Waypoint.md
Normal file
@@ -0,0 +1,157 @@
# Waypoint Configuration and Usage Guide

## Service Information

- **Server address**: `hcp1.tailnet-68f9.ts.net:9702` (gRPC)
- **HTTP API**: `hcp1.tailnet-68f9.ts.net:9701` (HTTPS)
- **Web UI**: `https://waypoint.git4ta.me/auth/token`

## Authentication

### Auth token
```
3K4wQUdH1dfES7e2KRygoJ745wgjDCG6X7LmLCAseEs3a5jrK185Yk4ZzYQUDvwEacPTfaF5hbUW1E3JNA7fvMthHWrkAFyRZoocmjCqj72YfJRzXW7KsurdSoMoKpEVJyiWRxPAg3VugzUx
```

### Where the token is stored
- **Consul KV**: `waypoint/auth-token`
- **Retrieval command**: `consul kv get waypoint/auth-token`

## Access Methods

### 1. Web UI
```
https://waypoint.git4ta.me/auth/token
```
Log in with the auth token above.

### 2. CLI
```bash
# Create a context
waypoint context create \
  -server-addr=hcp1.tailnet-68f9.ts.net:9702 \
  -server-tls-skip-verify \
  -set-default waypoint-server

# Verify the connection
waypoint server info
```

### 3. Using the auth token
```bash
# Set the environment variable
export WAYPOINT_TOKEN="3K4wQUdH1dfES7e2KRygoJ745wgjDCG6X7LmLCAseEs3a5jrK185Yk4ZzYQUDvwEacPTfaF5hbUW1E3JNA7fvMthHWrkAFyRZoocmjCqj72YfJRzXW7KsurdSoMoKpEVJyiWRxPAg3VugzUx"

# Or pass it with the -server-auth-token flag
waypoint server info -server-auth-token="$WAYPOINT_TOKEN"
```

## Service Configuration

### Nomad job
- **File**: `/root/mgmt/waypoint-server.nomad`
- **Node**: `hcp1.tailnet-68f9.ts.net`
- **Database**: `/opt/waypoint/waypoint.db`
- **gRPC port**: 9702
- **HTTP port**: 9701

### Traefik routing
- **Domain**: `waypoint.git4ta.me`
- **Backend**: `https://hcp1.tailnet-68f9.ts.net:9701`
- **TLS**: certificate verification skipped (`insecureSkipVerify: true`)

## Common Commands

### Server management
```bash
# Check server status
waypoint server info

# Get the server cookie
waypoint server cookie

# Create a snapshot backup
waypoint server snapshot
```

### Project management
```bash
# List all projects
waypoint list projects

# Initialize a new project
waypoint init

# Deploy the application
waypoint up

# Show deployment status
waypoint list deployments
```

### Application management
```bash
# List applications
waypoint list apps

# View application logs
waypoint logs -app=<app-name>

# Run a command in the application
waypoint exec -app=<app-name> <command>
```

## Troubleshooting

### 1. Connection problems
```bash
# Check whether the server is running
nomad job status waypoint-server

# Check whether the ports are listening
netstat -tlnp | grep 970
```

### 2. Authentication problems
```bash
# Re-bootstrap the server (this deletes the database and generates a new token)
nomad job stop waypoint-server
ssh hcp1.tailnet-68f9.ts.net "rm -f /opt/waypoint/waypoint.db"
nomad job run /root/mgmt/waypoint-server.nomad
waypoint server bootstrap -server-addr=hcp1.tailnet-68f9.ts.net:9702 -server-tls-skip-verify
```

### 3. Web UI access problems
- Make sure the path is correct: `/auth/token`
- Check the Traefik routing configuration
- Verify that the SSL certificate is valid

## Integrations

### Nomad integration
```bash
# Configure Nomad as the runtime platform
waypoint config source-set -type=nomad nomad-platform \
  addr=http://localhost:4646
```

### Vault integration
```bash
# Configure the Vault integration
waypoint config source-set -type=vault vault-secrets \
  addr=http://localhost:8200 \
  token=<vault-token>
```

## Security Notes

1. **Token protection**: the auth token grants full access; keep it safe
2. **Network access**: the server listens on all interfaces, so make sure the firewall is configured correctly
3. **TLS verification**: the current setup skips TLS verification; enable it in production
4. **Backups**: back up the `/opt/waypoint/waypoint.db` database file regularly

## Changelog

- **2025-10-04**: Initial deployment and configuration
- **2025-10-04**: Obtained the auth token and stored it in Consul KV
- **2025-10-04**: Configured Traefik routing and Web UI access

197
docs/README_CONSUL_KV_IMPLEMENTATION.md
Normal file
@@ -0,0 +1,197 @@
# Implementing the Variable Naming Convention for the Consul Cluster

## Overview

This project includes a set of improvements that make the Consul cluster follow the variable naming convention `config/{environment}/{provider}/{region_or_service}/{key}` throughout. They make the cluster configuration more flexible and maintainable, and they keep environments isolated from each other.

## Improvements

### 1. Variable naming convention

We defined a complete naming convention for Consul cluster variables, covering these categories:

- **Basic cluster configuration**: `config/dev/consul/cluster/...`
- **Node configuration**: `config/dev/consul/nodes/...`
- **Network configuration**: `config/dev/consul/network/...`
- **Port configuration**: `config/dev/consul/ports/...`
- **UI configuration**: `config/dev/consul/ui/...`
- **Service discovery configuration**: `config/dev/consul/service_discovery/...`
- **Performance tuning configuration**: `config/dev/consul/performance/...`
- **Logging configuration**: `config/dev/consul/logging/...`
- **Security configuration**: `config/dev/consul/security/...`
- **Connect configuration**: `config/dev/consul/connect/...`
- **Autopilot configuration**: `config/dev/consul/autopilot/...`
- **Snapshot configuration**: `config/dev/consul/snapshot/...`
- **Backup configuration**: `config/dev/consul/backup/...`

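### 2. Automation scripts

We created the following scripts to simplify deploying and managing the Consul cluster:

#### setup_consul_cluster_variables.sh
- Stores the Consul cluster configuration in Consul KV
- Follows the `config/{environment}/{provider}/{region_or_service}/{key}` format
- Includes a Consul connectivity check and configuration validation

#### generate_consul_config.sh
- Uses consul-template to generate the final Consul configuration file from the KV store
- Includes a Consul connectivity check and verifies that consul-template is available
- Supports a custom Consul address, environment, and configuration directory

#### deploy_consul_cluster_kv.sh
- All-in-one deployment script that runs the complete deployment flow
- Sets configuration parameters and checks the Consul and Nomad connections
- Sets the variables, generates the configuration, stops the existing cluster, and deploys the new one
- Verifies the result in several steps (job status, leader election, node count, key variables)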
|
||||
|
||||
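### 3. Configuration template

We created the Consul configuration template `consul.hcl.tmpl`, which uses consul-template syntax to read settings from the KV store at render time:

- Base settings (data_dir, raft_dir)
- UI settings (enabled or not)
- Datacenter settings
- Server settings (server mode, bootstrap_expect)
- Network settings (client_addr, bind_addr, advertise_addr)
- Port settings
- Cluster join settings (retry_join node IPs)
- Service discovery settings
- Performance tuning settings
- Logging settings
- Security settings (encryption key)
- Connect settings
- Autopilot settings (dead server cleanup and so on)
- Snapshot settings (interval, retention count)
- Backup settings (interval, retention count)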
|
||||
|
||||
### 4. Nomad job files

We created Nomad job files that follow the naming convention:

#### consul-cluster-dynamic.nomad
- Uses template blocks to generate the configuration file dynamically
- Contains 3 task groups (consul-master, consul-ash3c, consul-warden)
- Each group deploys 1 Consul server instance to its corresponding node
- Sets static ports, resource allocations, and cluster join parameters

#### consul-cluster-kv.nomad
- Follows the `config/{environment}/{provider}/{region_or_service}/{key}` format throughout
- Uses template blocks to read configuration from Consul KV
- Contains 3 task groups, each rendering its configuration through consul-template
|
||||
|
||||
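### 5. Documentation updates

We updated the Consul variables and storage guide and added:

- A chapter on Consul cluster configuration variables, with 40 concrete KV path examples across 11 categories
- A chapter on deploying a Consul cluster that follows the naming convention, covering:
  - The deployment flow
  - How to use the deployment scripts
  - A configuration template example
  - A Nomad job example
  - How to verify the deployment
  - How to update configuration dynamically
  - How environment isolation is implemented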
|
||||
|
||||
## Usage

### 1. Set the Consul variables

```bash
# Store the Consul cluster variables in Consul KV
./deployment/scripts/setup_consul_cluster_variables.sh
```

### 2. Generate the configuration file

```bash
# Generate the Consul configuration file
./deployment/scripts/generate_consul_config.sh
```

### 3. Deploy the cluster

```bash
# Deploy a Consul cluster that follows the naming convention
./deployment/scripts/deploy_consul_cluster_kv.sh
```

### 4. Verify the deployment

```bash
# List the Consul cluster configuration keys
# (the URL is quoted so the shell does not try to expand "?")
curl -s "http://localhost:8500/v1/kv/config/dev/consul/?keys" | jq '.'

# Check the cluster leader
curl -s http://localhost:8500/v1/status/leader

# Check the Raft peers
curl -s http://localhost:8500/v1/status/peers

# Validate the syntax of the generated configuration file
consul validate /root/mgmt/components/consul/configs/consul.hcl
```

### 5. Update configuration dynamically

```bash
# Update the log level
curl -X PUT http://localhost:8500/v1/kv/config/dev/consul/cluster/log_level -d "DEBUG"

# Update the snapshot interval
curl -X PUT http://localhost:8500/v1/kv/config/dev/consul/snapshot/interval -d "12h"

# Regenerate the configuration file
./deployment/scripts/generate_consul_config.sh

# Reload the Consul configuration
consul reload
```
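
When reading a key back to confirm an update, note that the Consul KV HTTP API returns the stored value base64-encoded in the `Value` field of its JSON response, unless `?raw` is appended to the URL. The sketch below shows the decoding step on a literal sample; the helper name `decode_kv_value` is an assumption, and `base64 -d` is the GNU coreutils spelling.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decode a base64 "Value" field taken from a
# Consul KV API response.
decode_kv_value() {
  printf '%s' "$1" | base64 -d
}

# "REVCVUc=" is the base64 encoding of the string DEBUG
decode_kv_value "REVCVUc="
echo
```

For a quick manual check, requesting the key with `?raw` (for example `config/dev/consul/cluster/log_level?raw`) returns the plain value and avoids the decoding step.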
|
||||
|
||||
## Environment isolation

Using an environment variable together with separate configuration paths keeps environments isolated from each other:

```bash
# Development environment
ENVIRONMENT=dev ./deployment/scripts/setup_consul_cluster_variables.sh

# Production environment
ENVIRONMENT=prod ./deployment/scripts/setup_consul_cluster_variables.sh
```

Each environment's configuration is then stored under its own path:
- Development: `config/dev/consul/...`
- Production: `config/prod/consul/...`
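
The mapping from `ENVIRONMENT` to a KV prefix can be sketched as follows. This is an assumption about how a script might derive the prefix, not the actual content of `setup_consul_cluster_variables.sh`; the function name `consul_kv_prefix` and the `dev` default are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive the KV prefix from the ENVIRONMENT variable.
# Falling back to "dev" when ENVIRONMENT is unset is an assumed default.
consul_kv_prefix() {
  local env="${ENVIRONMENT:-dev}"
  printf 'config/%s/consul\n' "$env"
}

# Show the prefix that would be used for production
ENVIRONMENT=prod consul_kv_prefix
```

Keeping the environment in exactly one path segment means a whole environment can be listed or removed by prefix without touching the others.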
|
||||
|
||||
## File structure

```
/root/mgmt/
├── components/consul/
│   ├── configs/
│   │   ├── consul.hcl                      # Original configuration file
│   │   └── consul.hcl.tmpl                 # Consul configuration template
│   └── jobs/
│       ├── consul-cluster-simple.nomad     # Original Nomad job
│       ├── consul-cluster-dynamic.nomad    # Nomad job with dynamic configuration
│       └── consul-cluster-kv.nomad         # Nomad job configured from the KV store
├── deployment/scripts/
│   ├── setup_consul_cluster_variables.sh   # Sets the Consul variables
│   ├── generate_consul_config.sh           # Generates the configuration file
│   └── deploy_consul_cluster_kv.sh         # Deploys the Consul cluster
└── docs/setup/
    └── consul_variables_and_storage_guide.md  # Updated guide
```
|
||||
|
||||
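## Summary

With these improvements the Consul cluster follows the naming convention throughout, which achieves the following:

1. **Standardization**: every Consul configuration variable follows one naming convention
2. **Flexibility**: settings can be changed without redeploying the whole cluster
3. **Maintainability**: the configuration layout is clear and easy to understand
4. **Environment isolation**: each environment has its own configuration
5. **Automation**: deployment and management are fully scripted

Together these make Consul configuration management more efficient and reliable, and give the rest of the infrastructure a stable base.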
|
||||
248
docs/SCRIPTS.md
Normal file
@@ -0,0 +1,248 @@
|
||||
# Script Documentation

This document is generated automatically and describes every script in the project.

## Script list

### scripts/ci-cd/build/generate-docs.sh

**Description**: Documentation generation script
Automatically generates the project documentation

**Usage**: see the comments inside the script

### scripts/ci-cd/quality/lint.sh

**Description**: Code quality check script
Checks script syntax, code style, and so on

### scripts/ci-cd/quality/security-scan.sh

**Description**: Security scan script
Scans the code for security problems and sensitive information

### scripts/deployment/consul/consul-variables-example.sh

**Description**: Example script for Consul variables and storage configuration
Shows how to configure Consul's variables and storage features

### scripts/deployment/consul/deploy-consul-cluster-kv.sh

**Description**: Consul cluster deployment script that follows the variable naming convention
Deploys a Consul cluster that follows the `config/{environment}/{provider}/{region_or_service}/{key}` format throughout

### scripts/deployment/vault/deploy-vault.sh

**Description**: Script for deploying the Vault cluster
Checks for Vault and installs it

### scripts/deployment/vault/vault-dev-example.sh

**Description**: Usage example for the Vault development environment
Sets environment variables

### scripts/deployment/vault/vault-dev-quickstart.sh

**Description**: Quick start guide for the Vault development environment
1. Sets environment variables
||||
|
||||
### scripts/mcp/configs/sync-all-configs.sh

**Description**: Script that links all MCP configuration files
Links the MCP configuration of every IDE and AI assistant to the configuration file on the NFS share
Checks whether the NFS configuration file exists

### scripts/mcp/tools/start-mcp-server.sh

**Description**: Sets environment variables
Starts the MCP server

### scripts/setup/config/generate-consul-config.sh

**Description**: Consul configuration generation script
Uses consul-template to generate the final Consul configuration file from the KV store

### scripts/setup/config/setup-consul-cluster-variables.sh

**Description**: Consul variable setup script that follows the naming convention
Stores the Consul cluster configuration in Consul KV using the `config/{environment}/{provider}/{region_or_service}/{key}` format

### scripts/setup/config/setup-consul-variables-and-storage.sh

**Description**: Consul variables and storage configuration script
Used to extend the Consul cluster's capabilities

### scripts/setup/environment/setup-environment.sh

**Description**: Environment setup script
Sets up the components and dependencies the development environment needs

### scripts/setup/init/init-vault-cluster.sh

**Description**: Vault cluster initialization and unseal script

### scripts/setup/init/init-vault-dev-api.sh

**Description**: Initializes the Vault development environment through the API (no local vault command required)

### scripts/setup/init/init-vault-dev.sh

**Description**: Vault development environment initialization script
|
||||
|
||||
|
||||
### scripts/testing/infrastructure/test-nomad-config.sh

**Description**: Tests the Nomad configuration files

### scripts/testing/infrastructure/test-traefik-deployment.sh

**Description**: Traefik deployment test script
Tests the deployment and behavior of Traefik in the Nomad cluster

**Usage**: see the comments inside the script

### scripts/testing/integration/verify-vault-consul-integration.sh

**Description**: Verifies the state of the Vault and Consul integration

### scripts/testing/mcp/test_direct_search.sh

**Description**: Creates a simple Python script to test the search_documents method

### scripts/testing/mcp/test_local_mcp_servers.sh

**Description**: Tests the MCP servers in the current environment
Checks whether the current environment has an MCP configuration

### scripts/testing/mcp/test_mcp_interface.sh

**Description**: Tests calling the MCP servers through the real MCP interface

### scripts/testing/mcp/test_mcp_search_final.sh

**Description**: Adds a document first

### scripts/testing/mcp/test_mcp_servers.sh

**Description**: MCP server test script

### scripts/testing/mcp/test_qdrant_ollama_tools.sh

**Description**: Tests the search_documents tool

### scripts/testing/mcp/test_qdrant_ollama_tools_fixed.sh

**Description**: Tests the search_documents tool (without the filter parameter)

### scripts/testing/mcp/test_search_documents.sh

**Description**: Adds a document first

### scripts/testing/run_all_tests.sh

**Description**: MCP server test runner
Runs all MCP server test scripts automatically

### scripts/testing/test-runner.sh

**Description**: Quick test execution script for the project
Runs all MCP server tests from the project root
|
||||
|
||||
|
||||
### scripts/utilities/backup/backup-all.sh

**Description**: Full backup script
Backs up all important configuration and data

### scripts/utilities/backup/backup-consul.sh

**Description**: Consul backup script
Creates Consul snapshot backups and manages the backup files

### scripts/utilities/helpers/fix-alpine-cgroups-systemd.sh

**Description**: Alternative script to fix cgroup configuration using systemd approach
Check if running as root

### scripts/utilities/helpers/fix-alpine-cgroups.sh

**Description**: Script to fix cgroup configuration for container runtime in Alpine Linux
Check if running as root

### scripts/utilities/helpers/manage-vault-consul.sh

**Description**: Management script for the Vault and Consul integration

**Usage**: see the comments inside the script

### scripts/utilities/helpers/nomad-leader-discovery.sh

**Description**: Nomad cluster leader discovery and access script
Discovers the current Nomad cluster leader automatically and runs the given command against it
Default server list (adjust to your environment)

**Usage**: see the comments inside the script

### scripts/utilities/helpers/show-vault-dev-keys.sh

**Description**: Shows the Vault key information for the development environment
Checks whether the key file exists

### scripts/utilities/maintenance/cleanup-global-config.sh

**Description**: Nomad global configuration cleanup script
Removes the `.global` suffix from configuration files
|
||||
|
||||
|
||||
192
docs/authentik-traefik-setup.md
Normal file
@@ -0,0 +1,192 @@
|
||||
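# Authentik Traefik Proxy Configuration Guide

## Overview

Traefik has been configured as a proxy for Authentik, providing automatic SSL certificate management and access by domain name.

## Configuration details

### Authentik service information
- **Container IP**: 192.168.31.144
- **HTTP port**: 9000 (optional)
- **HTTPS port**: 9443 (primary)
- **Container status**: running normally
- **SSH authentication**: key-based authentication is configured, no password needed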
|
||||
|
||||
### Traefik代理配置
|
||||
|
||||
#### 服务配置
|
||||
```yaml
|
||||
authentik-cluster:
|
||||
loadBalancer:
|
||||
servers:
|
||||
- url: "https://192.168.31.144:9443" # Authentik容器HTTPS端口
|
||||
serversTransport: authentik-insecure
|
||||
healthCheck:
|
||||
path: "/flows/-/default/authentication/"
|
||||
interval: "30s"
|
||||
timeout: "15s"
|
||||
```
|
||||
|
||||
#### 路由配置
|
||||
```yaml
|
||||
authentik-ui:
|
||||
rule: "Host(`authentik.git-4ta.live`)"
|
||||
service: authentik-cluster
|
||||
entryPoints:
|
||||
- websecure
|
||||
tls:
|
||||
certResolver: cloudflare
|
||||
```
|
||||
|
||||
## DNS requirements

A DNS record for the following domain must be added in Cloudflare:

### A record
```
authentik.git-4ta.live    A    <Tailscale IP of hcp1>
```

### Getting the Tailscale IP of hcp1
```bash
# Option 1: use the tailscale command
tailscale ip -4 hcp1

# Option 2: use ping
ping hcp1.tailnet-68f9.ts.net
```
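
Before pasting the address into the DNS record, it can be worth confirming that it really is a Tailscale address. Tailscale assigns IPv4 addresses from the 100.64.0.0/10 range, so a plain range check is enough. The sketch below is illustrative only; the function name `is_tailscale_ipv4` is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the argument is a well-formed IPv4
# address inside 100.64.0.0/10 (100.64.0.0 through 100.127.255.255).
is_tailscale_ipv4() {
  local o1 o2 o3 o4 extra octet
  IFS=. read -r o1 o2 o3 o4 extra <<< "$1"
  # More than four dot-separated fields is not an IPv4 address
  [ -z "$extra" ] || return 1
  for octet in "$o1" "$o2" "$o3" "$o4"; do
    case "$octet" in
      ""|*[!0-9]*) return 1 ;;
    esac
    [ "$octet" -le 255 ] || return 1
  done
  [ "$o1" -eq 100 ] && [ "$o2" -ge 64 ] && [ "$o2" -le 127 ]
}

if is_tailscale_ipv4 "100.117.106.136"; then
  echo "address is in the Tailscale range"
fi
```

A LAN address such as the container IP 192.168.31.144 fails this check, which helps catch the mistake of publishing the wrong address.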
|
||||
|
||||
## 部署步骤
|
||||
|
||||
### 1. 更新Traefik配置
|
||||
```bash
|
||||
# 重新部署Traefik job
|
||||
nomad job run components/traefik/jobs/traefik-cloudflare-git4ta-live.nomad
|
||||
```
|
||||
|
||||
### 2. 配置DNS记录
|
||||
在Cloudflare Dashboard中添加A记录:
|
||||
- **Name**: authentik
|
||||
- **Type**: A
|
||||
- **Content**: <hcp1的Tailscale IP>
|
||||
- **TTL**: Auto
|
||||
|
||||
### 3. 验证SSL证书
|
||||
```bash
|
||||
# 检查证书是否自动生成
|
||||
curl -I https://authentik.git-4ta.live
|
||||
|
||||
# 预期返回200状态码和有效的SSL证书
|
||||
```
|
||||
|
||||
### 4. 测试访问
|
||||
```bash
|
||||
# 访问Authentik Web UI
|
||||
open https://authentik.git-4ta.live
|
||||
|
||||
# 或使用curl测试
|
||||
curl -k https://authentik.git-4ta.live
|
||||
```
|
||||
|
||||
## Health checks

### Authentik health check endpoint
- **Path**: `/flows/-/default/authentication/` (the path set in the Traefik service configuration)
- **Interval**: 30 seconds
- **Timeout**: 15 seconds

### Check service status
```bash
# Check the Traefik router (names in the API carry a provider suffix such as "@file")
curl -s http://hcp1.tailnet-68f9.ts.net:8080/api/http/routers | jq '.[] | select(.name | startswith("authentik-ui"))'

# Check the service health
curl -s http://hcp1.tailnet-68f9.ts.net:8080/api/http/services | jq '.[] | select(.name | startswith("authentik-cluster"))'
```
|
||||
|
||||
## 故障排除
|
||||
|
||||
### 常见问题
|
||||
|
||||
1. **DNS解析问题**
|
||||
```bash
|
||||
# 检查DNS解析
|
||||
nslookup authentik.git-4ta.live
|
||||
|
||||
# 检查Cloudflare DNS
|
||||
dig @1.1.1.1 authentik.git-4ta.live
|
||||
```
|
||||
|
||||
2. **SSL证书问题**
|
||||
```bash
|
||||
# 检查证书状态
|
||||
openssl s_client -connect authentik.git-4ta.live:443 -servername authentik.git-4ta.live
|
||||
|
||||
# 检查Traefik证书存储
|
||||
ls -la /opt/traefik/certs/
|
||||
```
|
||||
|
||||
3. **Service connectivity problems**
```bash
# Check that Authentik is listening inside the container
# (key-based SSH is configured, so no password belongs on the command line)
ssh root@pve "pct exec 113 -- netstat -tlnp | grep -E ':(9000|9443)'"

# Follow the Traefik logs by job name
nomad alloc logs -f -job traefik-cloudflare-v1
```
|
||||
|
||||
### 调试命令
|
||||
|
||||
```bash
|
||||
# 检查Traefik配置
|
||||
curl -s http://hcp1.tailnet-68f9.ts.net:8080/api/rawdata | jq '.routers[] | select(.name=="authentik-ui")'
|
||||
|
||||
# 检查服务发现
|
||||
curl -s http://hcp1.tailnet-68f9.ts.net:8080/api/rawdata | jq '.services[] | select(.name=="authentik-cluster")'
|
||||
|
||||
# 检查中间件
|
||||
curl -s http://hcp1.tailnet-68f9.ts.net:8080/api/rawdata | jq '.middlewares'
|
||||
```
|
||||
|
||||
## 下一步
|
||||
|
||||
配置完成后,可以:
|
||||
|
||||
1. **配置OAuth2 Provider**
|
||||
- 在Authentik中创建OAuth2应用
|
||||
- 配置回调URL
|
||||
- 设置客户端凭据
|
||||
|
||||
2. **集成HCP服务**
|
||||
- 为Nomad UI配置OAuth2认证
|
||||
- 为Consul UI配置OAuth2认证
|
||||
- 为Vault配置OIDC认证
|
||||
|
||||
3. **用户管理**
|
||||
- 创建用户组和权限
|
||||
- 配置多因素认证
|
||||
- 设置访问策略
|
||||
|
||||
## 安全注意事项
|
||||
|
||||
1. **网络安全**
|
||||
- Authentik容器使用内网IP (192.168.31.144)
|
||||
- 通过Traefik代理访问,不直接暴露
|
||||
|
||||
2. **SSL/TLS**
|
||||
- 使用Cloudflare自动SSL证书
|
||||
- 强制HTTPS重定向
|
||||
- 支持现代TLS协议
|
||||
|
||||
3. **访问控制**
|
||||
- 建议配置IP白名单
|
||||
- 启用多因素认证
|
||||
- 定期轮换密钥
|
||||
|
||||
---
|
||||
|
||||
**配置完成时间**: $(date)
|
||||
**配置文件**: `/root/mgmt/components/traefik/jobs/traefik-cloudflare-git4ta-live.nomad`
|
||||
**域名**: `authentik.git-4ta.live`
|
||||
**状态**: 待部署和测试
|
||||
124
docs/consul-cluster-troubleshooting.md
Normal file
@@ -0,0 +1,124 @@
|
||||
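# Consul Cluster Troubleshooting Guide

## Diagnosis

### Problems found
1. **DNS resolution failures**: services cannot discover each other by service name
2. **Network connectivity problems**: the network configuration on the `ash3c` node is abnormal
3. **Cross-node communication failures**: `no route to host` errors
4. **Cluster cannot form**: persistent "No cluster leader" errors

### Root cause
- Network configuration problems
- A firewall or network policy may be blocking the Consul cluster communication ports

## Solution

### Current deployment (Nomad + Podman)
The cluster has been migrated from Docker Swarm to Nomad + Podman. The Consul cluster is deployed from the `consul-cluster-nomad.nomad` file.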
|
||||
|
||||
## Network diagnosis steps

### 1. Check node status
```bash
nomad node status
```

### 2. Check network connectivity
```bash
# From the master node, test connectivity to ash3c
ping <ash3c-ip>
telnet <ash3c-ip> 8301
```

### 3. Check firewall settings
```bash
# Make sure the following ports are open
# 8300: Consul server RPC (TCP)
# 8301: Consul Serf LAN (TCP and UDP)
# 8302: Consul Serf WAN (TCP and UDP)
# 8500: Consul HTTP API (TCP)
# 8600: Consul DNS (TCP and UDP)
```
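
The port list above can be probed in one pass. The following is a minimal sketch that assumes bash (it relies on the `/dev/tcp` pseudo-device) and the coreutils `timeout` command; the function names are hypothetical. It only tests TCP, so the UDP side of the Serf and DNS ports still needs a separate check.

```bash
#!/usr/bin/env bash
# TCP ports a Consul server is expected to expose
CONSUL_TCP_PORTS=(8300 8301 8302 8500 8600)

# Succeeds if a TCP connection to host ($1) and port ($2) opens within 2 seconds
check_tcp_port() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Prints one status line per port for the given host
check_consul_ports() {
  local host="$1" port
  for port in "${CONSUL_TCP_PORTS[@]}"; do
    if check_tcp_port "$host" "$port"; then
      echo "$host:$port open"
    else
      echo "$host:$port unreachable"
    fi
  done
}
```

Calling `check_consul_ports <ash3c-ip>` from the master node shows at a glance which ports a firewall is blocking.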
|
||||
|
||||
### 4. Check the Podman networks
```bash
podman network ls
podman network inspect <network-name>
```

## Recommended repair process

### Immediate fix (single node)
1. Deploy a single-node Consul to restore service
2. Verify that basic functionality works

### Long-term fix (cluster)
1. Repair the network configuration on the `ash3c` node
2. Make sure the nodes can reach each other
3. Configure the firewall rules
4. Redeploy the cluster configuration

## Verification steps

### Single-node verification
```bash
# Check the job status
nomad job status | grep consul

# Check the logs (take the allocation ID from `nomad job status <job-name>`)
nomad alloc logs <alloc-id>

# Query the HTTP API for the current leader
curl http://localhost:8500/v1/status/leader
```

### Cluster verification
```bash
# Check the cluster members
nomad alloc exec <alloc-id> consul members

# Check the Raft peers and the leader
nomad alloc exec <alloc-id> consul operator raft list-peers
```
|
||||
|
||||
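## FAQ

### Q: Why is service discovery not working?
A: In the earlier Docker Swarm architecture, the overlay network could have DNS resolution problems under some configurations. The current Nomad + Podman architecture resolves them.

### Q: How do I choose a network setup?
A:
- Development and test environments: a single node or an overlay network
- Production: macvlan or host networking is recommended for better performance and reliability

### Q: Will data be lost after the cluster recovers?
A: Not if persistent volumes are used. Backing up important data before the repair is still recommended.

## Monitoring and maintenance

### Health checks
```bash
# Check the cluster state regularly
consul members
consul operator raft list-peers
```

### Log monitoring
```bash
# Watch for key errors (add -stderr to read the stderr stream)
nomad alloc logs <alloc-id> | grep -E "(ERROR|WARN)"
```

### Performance monitoring
- Monitor the response time of the Consul HTTP API
- Check the cluster replication lag
- Monitor the number of network connections

## Contacting support

If the problem persists, provide the following information:
1. Nomad and Podman versions and the Nomad job configuration
2. A network topology diagram
3. Complete service logs
4. Results of the network tests between nodes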
|
||||
169
docs/consul-provider-integration.md
Normal file
@@ -0,0 +1,169 @@
|
||||
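# Terraform Consul Provider Integration Guide

This guide explains how to use the Terraform Consul provider to read the Oracle Cloud configuration directly from Consul, without manually saving the private key to a temporary file.

## Integration overview

The Terraform Consul provider has been integrated into the existing Terraform configuration and provides the following:

1. Reads the Oracle Cloud configuration (tenancy_ocid, user_ocid, fingerprint, and private_key) directly from Consul
2. Passes the private key read from Consul straight to the OCI provider, so no temporary key file is needed
3. Initializes the OCI provider with the configuration read from Consul
4. Supports configuration for multiple regions (Korea and the United States)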
|
||||
|
||||
## 配置结构
|
||||
|
||||
### 1. Consul中的配置存储
|
||||
|
||||
Oracle Cloud配置存储在Consul的以下路径中:
|
||||
|
||||
- 韩国区域:`config/dev/oracle/kr/`
|
||||
- `tenancy_ocid`
|
||||
- `user_ocid`
|
||||
- `fingerprint`
|
||||
- `private_key`
|
||||
|
||||
- 美国区域:`config/dev/oracle/us/`
|
||||
- `tenancy_ocid`
|
||||
- `user_ocid`
|
||||
- `fingerprint`
|
||||
- `private_key`
|
||||
|
||||
### 2. Terraform配置
|
||||
|
||||
#### Provider配置
|
||||
|
||||
```hcl
|
||||
# Consul Provider配置
|
||||
provider "consul" {
|
||||
address = "localhost:8500"
|
||||
scheme = "http"
|
||||
datacenter = "dc1"
|
||||
}
|
||||
```
|
||||
|
||||
#### 数据源配置
|
||||
|
||||
```hcl
|
||||
# 从Consul获取Oracle Cloud配置
|
||||
data "consul_keys" "oracle_config" {
|
||||
key {
|
||||
name = "tenancy_ocid"
|
||||
path = "config/dev/oracle/kr/tenancy_ocid"
|
||||
}
|
||||
key {
|
||||
name = "user_ocid"
|
||||
path = "config/dev/oracle/kr/user_ocid"
|
||||
}
|
||||
key {
|
||||
name = "fingerprint"
|
||||
path = "config/dev/oracle/kr/fingerprint"
|
||||
}
|
||||
key {
|
||||
name = "private_key"
|
||||
path = "config/dev/oracle/kr/private_key"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
#### OCI provider configuration

```hcl
# OCI provider using the configuration read from Consul
provider "oci" {
  tenancy_ocid = data.consul_keys.oracle_config.var.tenancy_ocid
  user_ocid    = data.consul_keys.oracle_config.var.user_ocid
  fingerprint  = data.consul_keys.oracle_config.var.fingerprint
  # Pass the key content directly instead of reading it from a file on disk
  private_key  = data.consul_keys.oracle_config.var.private_key
  region       = "ap-chuncheon-1"
}
```
|
||||
|
||||
## 使用方法
|
||||
|
||||
### 1. 确保Consul正在运行
|
||||
|
||||
```bash
|
||||
# 检查Consul是否运行
|
||||
pgrep consul
|
||||
```
|
||||
|
||||
### 2. 确保Oracle Cloud配置已存储在Consul中
|
||||
|
||||
```bash
|
||||
# 检查韩国区域配置
|
||||
consul kv get config/dev/oracle/kr/tenancy_ocid
|
||||
consul kv get config/dev/oracle/kr/user_ocid
|
||||
consul kv get config/dev/oracle/kr/fingerprint
|
||||
consul kv get config/dev/oracle/kr/private_key
|
||||
|
||||
# 检查美国区域配置
|
||||
consul kv get config/dev/oracle/us/tenancy_ocid
|
||||
consul kv get config/dev/oracle/us/user_ocid
|
||||
consul kv get config/dev/oracle/us/fingerprint
|
||||
consul kv get config/dev/oracle/us/private_key
|
||||
```
|
||||
|
||||
### 3. 初始化Terraform
|
||||
|
||||
```bash
|
||||
cd /root/mgmt/tofu/environments/dev
|
||||
terraform init -upgrade
|
||||
```
|
||||
|
||||
### 4. 运行测试脚本
|
||||
|
||||
```bash
|
||||
# 从项目根目录运行
|
||||
/root/mgmt/test_consul_provider.sh
|
||||
```
|
||||
|
||||
### 5. 使用Consul配置运行Terraform
|
||||
|
||||
```bash
|
||||
cd /root/mgmt/tofu/environments/dev
|
||||
terraform plan -var-file=consul.tfvars
|
||||
terraform apply -var-file=consul.tfvars
|
||||
```
|
||||
|
||||
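## Advantages

Reading the configuration directly from Consul through the Consul provider has the following advantages:

1. **Better security**: the private key no longer sits in a temporary file on disk; it is read from Consul and used in memory. Values read through data sources are still recorded in the Terraform state, so the state backend must be protected as well
2. **Simpler configuration**: no temporary file has to be created by hand; Terraform handles the key content directly
3. **Declarative style**: fits Terraform's declarative configuration model
4. **Easier maintenance**: the configuration is stored centrally in Consul, where it is easy to manage and update
5. **Multi-environment support**: configurations for several environments (dev, staging, production) are easy to support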
|
||||
|
||||
## 故障排除
|
||||
|
||||
### 1. Consul连接问题
|
||||
|
||||
如果无法连接到Consul,请检查:
|
||||
|
||||
- Consul服务是否正在运行
|
||||
- Consul地址和端口是否正确(默认为localhost:8500)
|
||||
- 网络连接是否正常
|
||||
|
||||
### 2. 配置获取问题
|
||||
|
||||
如果无法从Consul获取配置,请检查:
|
||||
|
||||
- 配置是否已正确存储在Consul中
|
||||
- 路径是否正确
|
||||
- 权限是否足够
|
||||
|
||||
### 3. Terraform初始化问题
|
||||
|
||||
如果Terraform初始化失败,请检查:
|
||||
|
||||
- Terraform版本是否符合要求(>=1.6)
|
||||
- 网络连接是否正常
|
||||
- Provider源是否可访问
|
||||
|
||||
## 版本信息
|
||||
|
||||
- Terraform: >=1.6
|
||||
- Consul Provider: ~2.22.0
|
||||
- OCI Provider: ~5.0
|
||||
219
docs/consul-traefik-config-examples.md
Normal file
@@ -0,0 +1,219 @@
|
||||
# 通过 Traefik 连接 Consul 的配置示例
|
||||
|
||||
## 🎯 目标实现
|
||||
让其他节点通过 `consul.git4ta.me` 和 `nomad.git4ta.me` 访问服务,而不是直接连接 IP。
|
||||
|
||||
## ✅ 当前状态验证
|
||||
|
||||
### Consul 智能检测
|
||||
```bash
|
||||
# Leader 检测
|
||||
curl -s https://consul.git4ta.me/v1/status/leader
|
||||
# 返回: "100.117.106.136:8300" (ch4 是 leader)
|
||||
|
||||
# 当前路由节点
|
||||
curl -s https://consul.git4ta.me/v1/agent/self | jq -r '.Config.NodeName'
|
||||
# 返回: "ash3c" (Traefik 路由到 ash3c)
|
||||
```
|
||||
|
||||
### Nomad 智能检测
|
||||
```bash
|
||||
# Leader 检测
|
||||
curl -s https://nomad.git4ta.me/v1/status/leader
|
||||
# 返回: "100.90.159.68:4647" (ch2 是 leader)
|
||||
```
|
||||
|
||||
## 🔧 节点配置示例
|
||||
|
||||
### 1. Consul 客户端配置
|
||||
|
||||
#### 当前配置 (直接连接)
|
||||
```hcl
|
||||
# /etc/consul.d/consul.hcl
|
||||
datacenter = "dc1"
|
||||
node_name = "client-node"
|
||||
|
||||
retry_join = [
|
||||
"warden.tailnet-68f9.ts.net:8301",
|
||||
"ch4.tailnet-68f9.ts.net:8301",
|
||||
"ash3c.tailnet-68f9.ts.net:8301"
|
||||
]
|
||||
```
|
||||
|
||||
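#### New configuration (via Traefik)
```hcl
# /etc/consul.d/consul.hcl
datacenter = "dc1"
node_name  = "client-node"

# Cluster join uses Serf LAN gossip (TCP and UDP 8301), which an HTTP(S)
# router cannot proxy, so retry_join keeps pointing at the servers directly.
retry_join = [
  "warden.tailnet-68f9.ts.net:8301",
  "ch4.tailnet-68f9.ts.net:8301",
  "ash3c.tailnet-68f9.ts.net:8301"
]

# Tools that only need the HTTP API can go through Traefik instead, e.g.:
#   export CONSUL_HTTP_ADDR=https://consul.git4ta.me
# The addresses and ports blocks set the local agent's own listeners,
# so they are not the place to put the Traefik domain.
```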
|
||||
|
||||
### 2. Nomad 客户端配置
|
||||
|
||||
#### 当前配置 (直接连接)
|
||||
```hcl
|
||||
# /etc/nomad.d/nomad.hcl
|
||||
consul {
|
||||
address = "http://warden.tailnet-68f9.ts.net:8500"
|
||||
}
|
||||
```
|
||||
|
||||
#### New configuration (via Traefik)
```hcl
# /etc/nomad.d/nomad.hcl
consul {
  # Traefik serves Consul on its HTTPS entry point (443), not on 8500
  address = "consul.git4ta.me:443"
  ssl     = true
}
```
|
||||
|
||||
### 3. Vault 配置
|
||||
|
||||
#### 当前配置 (直接连接)
|
||||
```hcl
|
||||
# Consul KV: vault/config
|
||||
storage "consul" {
|
||||
address = "ch4.tailnet-68f9.ts.net:8500"
|
||||
path = "vault/"
|
||||
}
|
||||
|
||||
service_registration "consul" {
|
||||
address = "ch4.tailnet-68f9.ts.net:8500"
|
||||
service = "vault"
|
||||
}
|
||||
```
|
||||
|
||||
#### New configuration (via Traefik)
```hcl
# Consul KV: vault/config
# Traefik serves Consul on its HTTPS entry point (443), not on 8500
storage "consul" {
  address = "consul.git4ta.me:443"
  scheme  = "https"
  path    = "vault/"
}

service_registration "consul" {
  address = "consul.git4ta.me:443"
  scheme  = "https"
  service = "vault"
}
```
|
||||
|
||||
## 🚀 实施步骤
|
||||
|
||||
### 步骤 1: 验证 Traefik 路由
|
||||
```bash
|
||||
# 测试 Consul 路由
|
||||
curl -I https://consul.git4ta.me/v1/status/leader
|
||||
|
||||
# 测试 Nomad 路由
|
||||
curl -I https://nomad.git4ta.me/v1/status/leader
|
||||
```
|
||||
|
||||
### Step 2: Update the node configuration
```bash
# Run on the target node
# Back up the existing configuration
cp /etc/consul.d/consul.hcl /etc/consul.d/consul.hcl.backup
cp /etc/nomad.d/nomad.hcl /etc/nomad.d/nomad.hcl.backup

# Consul: leave retry_join unchanged. Gossip on port 8301 is not HTTP
# and cannot be routed through Traefik.

# Nomad: point the consul block at the Traefik HTTPS entry point (443)
sed -i -E 's#(http://)?(warden|ch4|ash3c)\.tailnet-68f9\.ts\.net:8500#consul.git4ta.me:443#g' /etc/nomad.d/nomad.hcl
# Then add `ssl = true` to the consul block, because Traefik terminates TLS on 443
```
|
||||
|
||||
### 步骤 3: 重启服务
|
||||
```bash
|
||||
# 重启 Consul
|
||||
systemctl restart consul
|
||||
|
||||
# 重启 Nomad
|
||||
systemctl restart nomad
|
||||
|
||||
# 重启 Vault (如果适用)
|
||||
systemctl restart vault
|
||||
```
|
||||
|
||||
### 步骤 4: 验证连接
|
||||
```bash
|
||||
# 检查 Consul 连接
|
||||
consul members
|
||||
|
||||
# 检查 Nomad 连接
|
||||
nomad node status
|
||||
|
||||
# 检查 Vault 连接
|
||||
vault status
|
||||
```
|
||||
|
||||
## 📊 性能对比
|
||||
|
||||
### 延迟测试
|
||||
```bash
|
||||
# 直接连接
|
||||
time curl -s http://ch4.tailnet-68f9.ts.net:8500/v1/status/leader
|
||||
|
||||
# 通过 Traefik
|
||||
time curl -s https://consul.git4ta.me/v1/status/leader
|
||||
```
|
||||
|
||||
### 可靠性测试
|
||||
```bash
|
||||
# 测试故障转移
|
||||
# 1. 停止 ch4 Consul
|
||||
# 2. 检查 Traefik 是否自动路由到其他节点
|
||||
curl -s https://consul.git4ta.me/v1/status/leader
|
||||
```
|
||||
|
||||
## 🎯 优势总结
|
||||
|
||||
### 1. 统一入口
|
||||
- **之前**: 每个节点需要知道所有 Consul/Nomad 节点 IP
|
||||
- **现在**: 只需要知道 `consul.git4ta.me` 和 `nomad.git4ta.me`
|
||||
|
||||
### 2. 自动故障转移
|
||||
- **之前**: 节点需要手动配置多个 IP
|
||||
- **现在**: Traefik 自动路由到健康的节点
|
||||
|
||||
### 3. 简化配置
|
||||
- **之前**: 硬编码 IP 地址,难以维护
|
||||
- **现在**: 使用域名,易于管理和更新
|
||||
|
||||
### 4. 负载均衡
|
||||
- **之前**: 所有请求都到同一个节点
|
||||
- **现在**: Traefik 可以分散请求到多个节点
|
||||
|
||||
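## ⚠️ Caveats

### 1. Port mapping
- **Traefik (external)**: 443 (HTTPS) / 80 (HTTP)
- **Services (internal)**: 8500 (Consul), 4646 (Nomad)
- **Required**: Traefik must forward requests from its entry points to these backend ports, so clients connect to 443 rather than 8500 or 4646

### 2. SSL certificates
- **HTTPS**: requires a valid certificate
- **HTTP**: involves no certificate, so traffic is unencrypted

### 3. Single point of failure
- **Risk**: Traefik becomes a single point of failure
- **Mitigation**: run Traefik itself in a highly available setup

---

**Conclusion**: Routing HTTP API access to Consul and Nomad through Traefik is feasible and improves maintainability, provided Traefik itself is kept highly available.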
|
||||
191
docs/consul-traefik-integration.md
Normal file
@@ -0,0 +1,191 @@
|
||||
# Consul 通过 Traefik 连接的配置方案
|
||||
|
||||
## 🎯 目标
|
||||
让所有节点通过 `consul.git4ta.me` 访问 Consul,而不是直接连接 IP 地址。
|
||||
|
||||
## ✅ 可行性验证
|
||||
|
||||
### 测试结果
|
||||
```bash
|
||||
# 通过 Traefik 访问 Consul API
|
||||
curl -s https://consul.git4ta.me/v1/status/leader
|
||||
# 返回: "100.117.106.136:8300" (ch4 是 leader)
|
||||
|
||||
curl -s https://consul.git4ta.me/v1/agent/self | jq -r '.Config.NodeName'
|
||||
# 返回: "warden" (当前路由到的节点)
|
||||
```
|
||||
|
||||
### 优势
|
||||
1. **统一入口**: 所有服务都通过域名访问
|
||||
2. **自动故障转移**: Traefik 自动路由到健康的 Consul 节点
|
||||
3. **简化配置**: 不需要硬编码 IP 地址
|
||||
4. **负载均衡**: 可以分散请求到多个 Consul 节点
|
||||
|
||||
## 🔧 配置方案
|
||||
|
||||
### 方案 1: 修改现有节点配置
|
||||
|
||||
#### Consul client configuration
```hcl
# /etc/consul.d/consul.hcl
datacenter = "dc1"
node_name  = "node-name"

# Cluster join uses Serf LAN gossip (TCP and UDP 8301), which Traefik's
# HTTP(S) router cannot proxy, so retry_join must list the servers directly.
retry_join = [
  "warden.tailnet-68f9.ts.net:8301",
  "ch4.tailnet-68f9.ts.net:8301",
  "ash3c.tailnet-68f9.ts.net:8301"
]

# HTTP API consumers can use the Traefik domain instead, e.g.:
#   export CONSUL_HTTP_ADDR=https://consul.git4ta.me
```

#### Nomad configuration
```hcl
# /etc/nomad.d/nomad.hcl
consul {
  # Traefik serves Consul on its HTTPS entry point (443), not on 8500
  address = "consul.git4ta.me:443"
  ssl     = true
}
```

#### Vault configuration
```hcl
# In Consul KV vault/config
storage "consul" {
  address = "consul.git4ta.me:443"
  scheme  = "https"
  path    = "vault/"
}

service_registration "consul" {
  address      = "consul.git4ta.me:443"
  scheme       = "https"
  service      = "vault"
  service_tags = "vault-server"
}
```
|
||||
|
||||
### 方案 2: 创建新的服务发现配置
|
||||
|
||||
#### 在 Traefik 中添加 Consul 服务发现
|
||||
```yaml
|
||||
# 在 dynamic.yml 中添加
|
||||
services:
|
||||
consul-api:
|
||||
loadBalancer:
|
||||
servers:
|
||||
- url: "http://ch4.tailnet-68f9.ts.net:8500" # Leader
|
||||
- url: "http://warden.tailnet-68f9.ts.net:8500" # Follower
|
||||
- url: "http://ash3c.tailnet-68f9.ts.net:8500" # Follower
|
||||
healthCheck:
|
||||
path: "/v1/status/leader"
|
||||
interval: "30s"
|
||||
timeout: "15s"
|
||||
|
||||
routers:
|
||||
consul-api:
|
||||
rule: "Host(`consul.git4ta.me`)"
|
||||
service: consul-api
|
||||
entryPoints:
|
||||
- websecure
|
||||
tls:
|
||||
certResolver: cloudflare
|
||||
```
|
||||
|
||||
## 🚨 注意事项
|
||||
|
||||
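### 1. Port mapping
- **Traefik external ports**: 443 (HTTPS) / 80 (HTTP)
- **Consul internal port**: 8500
- **Required**: Traefik must forward requests from its entry points to the backend port, so clients connect to 443 rather than 8500

### 2. SSL certificates
- **HTTPS**: requires a valid SSL certificate
- **HTTP**: involves no certificate; a self-signed certificate only works over HTTPS when the client skips verification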
|
||||
|
||||
### 3. 健康检查
|
||||
- **路径**: `/v1/status/leader`
|
||||
- **间隔**: 30秒
|
||||
- **超时**: 15秒
|
||||
|
||||
### 4. 故障转移
|
||||
- **自动切换**: Traefik 会自动路由到健康的节点
|
||||
- **Leader 选举**: Consul 会自动选举新的 leader
|
||||
|
||||
## 🔄 实施步骤
|
||||
|
||||
### 步骤 1: 验证 Traefik 配置
|
||||
```bash
|
||||
# 检查当前 Traefik 是否已配置 Consul 路由
|
||||
curl -I https://consul.git4ta.me/v1/status/leader
|
||||
```
|
||||
|
||||
### 步骤 2: 更新节点配置
|
||||
```bash
|
||||
# 备份现有配置
|
||||
cp /etc/consul.d/consul.hcl /etc/consul.d/consul.hcl.backup
|
||||
|
||||
# 修改配置使用域名
|
||||
sed -i 's/warden\.tailnet-68f9\.ts\.net:8500/consul.git4ta.me:8500/g' /etc/consul.d/consul.hcl
|
||||
```
|
||||
|
||||
### 步骤 3: 重启服务
|
||||
```bash
|
||||
# 重启 Consul
|
||||
systemctl restart consul
|
||||
|
||||
# 重启 Nomad
|
||||
systemctl restart nomad
|
||||
|
||||
# 重启 Vault
|
||||
systemctl restart vault
|
||||
```
|
||||
|
||||
### 步骤 4: 验证连接
|
||||
```bash
|
||||
# 检查 Consul 连接
|
||||
consul members
|
||||
|
||||
# 检查 Nomad 连接
|
||||
nomad node status
|
||||
|
||||
# 检查 Vault 连接
|
||||
vault status
|
||||
```
|
||||
|
||||
## 📊 Performance Impact

### Latency
- **Direct connection**: ~1-2ms
- **Through Traefik**: ~5-10ms (adds 3-8ms)

### Throughput
- **Traefik limit**: depends on the Traefik configuration
- **Recommendation**: monitor Traefik performance metrics

### Reliability
- **Gain**: automatic failover
- **Risk**: Traefik becomes a single point of failure

## 🎯 Recommended Approach

**Option 1 is recommended**, because it is:

1. **Simple and direct**: only configuration files change
2. **Backward compatible**: existing functionality is unaffected
3. **Easy to maintain**: a single managed entry point

**Implementation priority**:

1. ✅ **Traefik configuration** - done
2. 🔄 **Consul clients** - needs changes
3. 🔄 **Nomad configuration** - needs changes
4. 🔄 **Vault configuration** - needs changes

---

**Conclusion**: Fully feasible. Accessing Consul through Traefik as a single entry point is a sound architectural improvement.
|
||||
169
docs/disk-management.md
Normal file
@@ -0,0 +1,169 @@
|
||||
# Disk Management Tools Guide

## 🔧 Tool Overview

Three main disk management tools are provided for dealing with low disk space:

### 1. Disk analysis (`disk-analysis-ncdu.yml`)
Uses `ncdu` to analyze disk usage in depth and generate detailed reports.

### 2. Disk cleanup (`disk-cleanup.yml`)
Automatically cleans up system junk files, logs, caches and similar data.

### 3. Disk monitoring script (`disk-monitor.sh`)
Checks disk usage on all nodes with a single command.
|
||||
|
||||
## 🚀 Quick Start

### Monitor disk usage on all nodes
```bash
# Use the default threshold of 85%
./scripts/utilities/disk-monitor.sh

# Use a custom threshold of 90%
./scripts/utilities/disk-monitor.sh 90
```
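
To illustrate what a threshold check of this kind involves, here is a minimal sketch. It is not the actual `disk-monitor.sh`; the function name `over_threshold` and the "at or above" comparison are assumptions. It reads `df -P` output on standard input and prints every mount point whose usage reaches the threshold.

```bash
# Hypothetical sketch: list mount points at or above a usage threshold.
# Reads `df -P` output on stdin; $1 is the threshold in percent (default 85).
over_threshold() {
  _limit=${1:-85}
  awk -v limit="$_limit" 'NR > 1 {
    use = $5                 # "Capacity" column, e.g. 90%
    sub(/%/, "", use)        # drop the percent sign
    if (use + 0 >= limit + 0) print $6 " " use "%"
  }'
}
```

On a node this would be run as `df -P | over_threshold 90`. Mount points containing spaces are not handled by this sketch.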
|
||||
|
||||
### Analyze disk usage on specific nodes
```bash
# Analyze all nodes
ansible-playbook -i configuration/inventories/production/nomad-cluster.ini \
  configuration/playbooks/disk-analysis-ncdu.yml

# Analyze a single node
ansible-playbook -i configuration/inventories/production/nomad-cluster.ini \
  configuration/playbooks/disk-analysis-ncdu.yml --limit semaphore
```

### Clean up disk space
```bash
# Clean all nodes (safe mode)
ansible-playbook -i configuration/inventories/production/nomad-cluster.ini \
  configuration/playbooks/disk-cleanup.yml

# Clean a single node
ansible-playbook -i configuration/inventories/production/nomad-cluster.ini \
  configuration/playbooks/disk-cleanup.yml --limit ash3c

# Include container cleanup (use with caution)
ansible-playbook -i configuration/inventories/production/nomad-cluster.ini \
  configuration/playbooks/disk-cleanup.yml -e cleanup_containers=true
```

## 📊 Reading the Analysis Reports

### ncdu file locations
After the analysis finishes, the ncdu scan files are saved in `/tmp/disk-analysis/` on each node:

- `ncdu-root-<hostname>.json` - scan of the root directory
- `ncdu-var-<hostname>.json` - scan of /var
- `ncdu-opt-<hostname>.json` - scan of /opt

### Viewing ncdu reports
```bash
# Open the interactive report on the target node
ncdu -f /tmp/disk-analysis/ncdu-root-semaphore.json

# View the text report
cat /tmp/disk-analysis/disk-report-semaphore.txt

# View the cleanup suggestions
cat /tmp/disk-analysis/cleanup-suggestions-semaphore.txt
```
|
||||
|
||||
## 🧹 Cleanup Options

### Default cleanup items
- ✅ **System logs**: removes log files older than 7 days
- ✅ **Package cache**: clears the APT/YUM cache
- ✅ **Temporary files**: removes temporary files older than 7 days
- ✅ **Core dumps**: deletes core dump files

### Optional cleanup items
- ⚠️ **Container cleanup**: must be enabled manually (`cleanup_containers=true`)
  - Stops all containers
  - Removes unused containers, images and volumes

### Custom cleanup parameters
```bash
ansible-playbook configuration/playbooks/disk-cleanup.yml \
  -e cleanup_logs=false \
  -e cleanup_cache=true \
  -e cleanup_temp=true \
  -e cleanup_containers=false
```

## 🚨 Emergency Handling

### Disk usage > 95%
```bash
# 1. Find the largest files right away (sorted by size, top 5 per node)
ansible all -i configuration/inventories/production/nomad-cluster.ini \
  -m shell -a "find / -xdev -type f -size +1G -exec du -h {} + 2>/dev/null | sort -rh | head -5"

# 2. Emergency cleanup
ansible-playbook configuration/playbooks/disk-cleanup.yml \
  -e cleanup_containers=true

# 3. Truncate a large file by hand (replace the path with the real file)
ansible all -m shell -a "truncate -s 0 /var/log/large.log"
```

### Common locations of large files
- `/var/log/` - system logs
- `/tmp/` - temporary files
- `/var/cache/` - package manager cache
- `/opt/nomad/data/` - Nomad data
- `~/.local/share/containers/` - Podman data
|
||||
|
||||
## 📈 Routine Maintenance

### Daily monitoring
```bash
# Add to crontab
0 9 * * * /root/mgmt/scripts/utilities/disk-monitor.sh 85
```

### Weekly cleanup
```bash
# Clean up automatically every Sunday at 02:00
0 2 * * 0 cd /root/mgmt && ansible-playbook configuration/playbooks/disk-cleanup.yml
```

### Monthly deep analysis
```bash
# Generate a detailed report on the 1st of every month
0 3 1 * * cd /root/mgmt && ansible-playbook configuration/playbooks/disk-analysis-ncdu.yml
```

## 🔍 Troubleshooting

### ncdu fails to install
```bash
# Install manually
ansible all -m package -a "name=ncdu state=present" --become
```

### Scan times out
```bash
# Increase the timeout
ansible-playbook disk-analysis-ncdu.yml -e ansible_timeout=600
```

### Permission problems
```bash
# Make sure privilege escalation is used
ansible-playbook disk-analysis-ncdu.yml --become
```

## 💡 Best Practices

1. **Regular monitoring**: check disk usage every day
2. **Preventive cleanup**: clean up proactively once usage exceeds 80%
3. **Log rotation**: configure a suitable log rotation policy
4. **Container management**: periodically remove unused container images
5. **Alerting**: set alert thresholds for disk usage

---

💡 **Tip**: `./scripts/utilities/disk-monitor.sh` gives a quick status check of all nodes.
|
||||
146
docs/nomad-nfs-setup.md
Normal file
@@ -0,0 +1,146 @@
|
||||
# Nomad Cluster NFS Configuration Guide

## Overview

This document describes how to configure NFS storage for the Nomad cluster, covering different container types and geographic locations.

## Container Types

### 1. Local LXC containers
- **Location**: local network
- **Example nodes**: influxdb, warden, hcp1, hcp2
- **Characteristics**: use the already-mapped NFS directory directly
- **NFS options**: `rw,sync,vers=4.2`

### 2. Overseas PVE containers
- **Location**: overseas cloud servers
- **Example nodes**: ash1d, ash2e, ash3c, ch2, ch3
- **Characteristics**: need network tuning options
- **NFS options**: `rw,sync,vers=3,timeo=600,retrans=2`
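
The mapping from node to NFS options described above can be expressed as a small lookup. This is only a sketch: the function name `nfs_opts_for_node` is hypothetical, and the node lists are the example nodes named above rather than a complete inventory.

```bash
# Hypothetical helper: print the NFS mount options for a given node name.
nfs_opts_for_node() {
  case "$1" in
    # Local LXC containers: low latency, NFSv4.2
    influxdb|warden|hcp1|hcp2)
      echo "rw,sync,vers=4.2" ;;
    # Overseas PVE containers: NFSv3 with a longer timeout and fewer retransmits
    ash1d|ash2e|ash3c|ch2|ch3)
      echo "rw,sync,vers=3,timeo=600,retrans=2" ;;
    *)
      echo "unknown node: $1" >&2
      return 1 ;;
  esac
}
```

For example, `nfs_opts_for_node ash3c` prints the overseas option set, and an unlisted node name returns a non-zero status.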
|
||||
|
||||
## NFS Configuration Details

### NFS server
- **Server**: snail
- **Export path**: `/fs/1000/nfs/Fnsync`
- **Mount point**: `/mnt/fnsync`

### Current mount status
```bash
# Check the current mount
df -h | grep fnsync
# Output: snail:/fs/1000/nfs/Fnsync 8.2T 2.2T 6.0T 27% /mnt/fnsync
```

## Deployment Steps

### 1. Automated deployment
```bash
chmod +x scripts/deploy-nfs-for-nomad.sh
./scripts/deploy-nfs-for-nomad.sh
```

### 2. Manual step-by-step deployment
```bash
# Step 1: configure the NFS mounts
ansible-playbook -i configuration/inventories/production/inventory.ini \
  playbooks/setup-nfs-by-container-type.yml

# Step 2: configure the Nomad clients
ansible-playbook -i configuration/inventories/production/nomad-cluster.ini \
  playbooks/setup-nomad-nfs-client.yml
```
|
||||
|
||||
## Nomad Job Configuration

### Example Nomad job using the NFS volume

```hcl
job "nfs-example" {
  # volume and task blocks must be placed inside a group
  group "app" {
    volume "nfs-shared" {
      type      = "host"
      source    = "nfs-shared"
      read_only = false
    }

    task "app" {
      # driver and config omitted for brevity
      volume_mount {
        volume      = "nfs-shared"
        destination = "/shared"
        read_only   = false
      }
    }
  }
}
```

### Constraints for the different container types

```hcl
# Constraint for local LXC containers
constraint {
  attribute = "${attr.unique.hostname}"
  operator  = "regexp"
  value     = "(influxdb|warden|hcp1|hcp2)"
}

# Constraint for overseas PVE containers
constraint {
  attribute = "${attr.unique.hostname}"
  operator  = "regexp"
  value     = "(ash1d|ash2e|ash3c|ch2|ch3)"
}
```
|
||||
|
||||
## Verification and Monitoring

### Verification commands
```bash
# Check the NFS mount
ansible all -i configuration/inventories/production/inventory.ini \
  -m shell -a "df -h /mnt/fnsync"

# Check Nomad status
nomad node status

# Check the NFS job status
nomad job status nfs-multi-type-example
```

### Metrics to monitor
- NFS mount status
- Network latency (overseas nodes)
- Storage usage
- Nomad task status

## Troubleshooting

### Common problems

1. **NFS mount fails**
   - Check network connectivity: `ping snail`
   - Verify the NFS service: `showmount -e snail`
   - Check firewall settings

2. **Overseas nodes connect slowly**
   - Use the NFSv3 protocol
   - Increase the timeout options
   - Consider a caching solution

3. **Nomad volume cannot be mounted**
   - Check the Nomad client configuration
   - Verify directory permissions
   - Check the Nomad service status

## Best Practices

1. **Data backup**: back up important data on NFS regularly
2. **Alerting**: monitor the NFS mount status
3. **Capacity planning**: monitor storage usage
4. **Network tuning**: configure suitable network options for overseas nodes

## Related Files

- `playbooks/setup-nfs-by-container-type.yml` - NFS mount configuration
- `playbooks/setup-nomad-nfs-client.yml` - Nomad client configuration
- `jobs/nomad-nfs-multi-type.nomad` - example Nomad job
- `scripts/deploy-nfs-for-nomad.sh` - deployment script
|
||||
196
docs/setup/consul-terraform-integration.md
Normal file
@@ -0,0 +1,196 @@
|
||||
# Consul + Terraform Integration Guide

This guide describes how to use Consul to manage sensitive Terraform configuration securely, in particular Oracle Cloud credentials.

## Overview

Consul serves as the secret store, so sensitive values are not exposed directly in Terraform configuration files.

## Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Terraform     │───▶│     Consul      │───▶│  Oracle Cloud   │
│                 │    │  (secret store) │    │                 │
│ consul provider │    │                 │    │                 │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```

## Prerequisites

1. A running Consul cluster
2. Access to the Consul API (default: http://localhost:8500)
3. curl and Terraform installed
|
||||
|
||||
## Quick Start

### 1. Start the Consul cluster

The cluster has been migrated from Docker Swarm to Nomad + Podman, so deploy the Consul cluster with Nomad:

```bash
nomad run /root/mgmt/consul-cluster-nomad.nomad
```

### 2. Store the Oracle Cloud configuration

```bash
# Store the configuration with the secrets manager script
./scripts/utilities/consul-secrets-manager.sh set-oracle
```

The script prompts for:
- Tenancy OCID
- User OCID
- API key fingerprint
- Private key file path
- Compartment OCID

### 3. Configure Terraform

```bash
# Set up the Terraform Consul provider
./scripts/utilities/terraform-consul-provider.sh setup
```

### 4. Verify the configuration

```bash
# Show the configuration stored in Consul
./scripts/utilities/consul-secrets-manager.sh get-oracle
```

### 5. Run Terraform

```bash
cd infrastructure/environments/dev

# Initialize Terraform
terraform init

# Plan the deployment
terraform plan

# Apply the configuration
terraform apply
```
|
||||
|
||||
## Details

### Consul key layout

```
config/
└── dev/
    └── oracle/
        ├── tenancy_ocid
        ├── user_ocid
        ├── fingerprint
        ├── private_key
        └── compartment_ocid
```

### Script functions

#### consul-secrets-manager.sh

- `set-oracle`: store the Oracle Cloud configuration in Consul
- `get-oracle`: read the configuration from Consul
- `delete-oracle`: delete the configuration from Consul
- `generate-vars`: generate a temporary Terraform variables file
- `cleanup`: remove temporary files

#### terraform-consul-provider.sh

- `setup`: create the Terraform Consul provider configuration file

### Security features

1. **Isolation of sensitive data**: private keys and other secrets are stored only in Consul
2. **Temporary files**: the private key file is created only temporarily while Terraform runs
3. **Permission control**: the temporary private key file is set to mode 600
4. **Automatic cleanup**: a cleanup command removes the temporary files
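
The temporary-file handling listed above can be sketched as follows. This is not the code of `consul-secrets-manager.sh`; the function name `write_temp_key` and its argument are assumptions, and the real script reads the key material from Consul rather than from an argument.

```bash
# Hypothetical sketch: write key material to a private temporary file.
# $1: private key content (assumed to have been read from Consul already).
write_temp_key() {
  # Create the file with a restrictive umask so it is never world-readable.
  _key_file=$(umask 077 && mktemp) || return 1
  printf '%s\n' "$1" > "$_key_file"
  # Enforce mode 600, matching the permission described above.
  chmod 600 "$_key_file"
  echo "$_key_file"
}
```

A caller would typically pair it with cleanup, for example `key=$(write_temp_key "$material"); trap 'rm -f "$key"' EXIT`, so the file disappears when the Terraform wrapper exits.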
|
||||
|
||||
## Environment Variables

```bash
# Consul address
export CONSUL_ADDR="http://localhost:8500"

# Consul ACL token (if ACLs are enabled)
export CONSUL_TOKEN="your-token"

# Environment name
export ENVIRONMENT="dev"
```

## Troubleshooting

### Consul connection problems

```bash
# Check Consul status
curl http://localhost:8500/v1/status/leader

# Check the Consul job (the cluster now runs on Nomad + Podman, not Docker)
nomad job status | grep consul
```

### Configuration validation

```bash
# Verify the configuration stored in Consul (quoted so the shell does not glob the ?)
curl "http://localhost:8500/v1/kv/config/dev/oracle?recurse"

# Check the Terraform configuration
terraform validate
```

### Cleanup and reset

```bash
# Remove temporary files
./scripts/utilities/consul-secrets-manager.sh cleanup

# Delete the configuration from Consul
./scripts/utilities/consul-secrets-manager.sh delete-oracle
```

## Best Practices

1. **Rotate keys regularly**: update the Oracle Cloud API key periodically
2. **Back up configuration**: back up Consul data regularly
3. **Monitor access**: monitor access logs for Consul keys
4. **Environment isolation**: use a different Consul path for each environment

## Extending to Other Cloud Providers

Consul integration can be added for other cloud providers in the same way:

```bash
# Huawei Cloud configuration paths
config/dev/huawei/access_key
config/dev/huawei/secret_key

# AWS configuration paths
config/dev/aws/access_key
config/dev/aws/secret_key

# Google Cloud configuration path
config/dev/gcp/service_account_key
```

## Related Files

- `infrastructure/environments/dev/terraform.tfvars` - Terraform variables
- `scripts/utilities/consul-secrets-manager.sh` - Consul secrets manager script
- `scripts/utilities/terraform-consul-provider.sh` - Terraform Consul provider setup script
- `swarm/configs/traefik-consul-setup.yml` - Consul cluster configuration

## Support

If something goes wrong, check that:
1. The Consul cluster is running normally
2. Network connectivity is working
3. Permissions are set correctly
4. Environment variables are set correctly
|
||||
690
docs/setup/consul_variables_and_storage_guide.md
Normal file
@@ -0,0 +1,690 @@
|
||||
# Consul Variables and Storage Configuration Guide

This document describes how to configure Consul variables and storage to make the cluster more capable and more reliable.

## Overview

Consul provides two key capabilities used here:
1. **Variables**: store configuration, feature flags, application parameters and so on
2. **Storage**: persist data, snapshots and backups

## Variables Configuration

### Naming convention

Configuration in the Consul KV store follows one naming convention:

```
config/{environment}/{provider}/{region_or_service}/{key}
```

The parts are:
- **config**: fixed prefix marking the entry as configuration
- **environment**: environment name, such as `dev`, `staging`, `prod`
- **provider**: cloud provider, such as `oracle`, `digitalocean`, `aws`, `gcp`
- **region_or_service**: region or service name, such as `kr`, `us`, `sgp`
- **key**: the configuration key itself, such as `token`, `tenancy_ocid`, `user_ocid`
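
As an illustration of the convention, a path can be assembled from its four parts as in the sketch below. `kv_path` is a hypothetical helper, not one of the repository scripts; it insists on all four segments, which is stricter than some of the shorter example paths in this guide.

```bash
# Hypothetical helper: build a KV path following
# config/{environment}/{provider}/{region_or_service}/{key}
kv_path() {
  if [ "$#" -ne 4 ]; then
    echo "usage: kv_path <environment> <provider> <region_or_service> <key>" >&2
    return 1
  fi
  # Reject empty segments and segments containing a slash.
  for _part in "$@"; do
    case "$_part" in
      ''|*/*)
        echo "invalid path segment: '$_part'" >&2
        return 1 ;;
    esac
  done
  echo "config/$1/$2/$3/$4"
}
```

It could then feed the CLI, for example `consul kv put "$(kv_path dev oracle kr tenancy_ocid)" "$value"`.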
|
||||
|
||||
### Consul cluster configuration variables

The Consul cluster's own configuration should follow the same naming convention. Some key examples:

#### Basic cluster settings
```
config/dev/consul/cluster/data_dir
config/dev/consul/cluster/raft_dir
config/dev/consul/cluster/datacenter
config/dev/consul/cluster/bootstrap_expect
config/dev/consul/cluster/log_level
config/dev/consul/cluster/encrypt_key
```

#### Nodes
```
config/dev/consul/nodes/master/ip
config/dev/consul/nodes/ash3c/ip
config/dev/consul/nodes/warden/ip
```

#### Network
```
config/dev/consul/network/client_addr
config/dev/consul/network/bind_interface
config/dev/consul/network/advertise_interface
```

#### Ports
```
config/dev/consul/ports/dns
config/dev/consul/ports/http
config/dev/consul/ports/https
config/dev/consul/ports/grpc
config/dev/consul/ports/grpc_tls
config/dev/consul/ports/serf_lan
config/dev/consul/ports/serf_wan
config/dev/consul/ports/server
```

#### Service discovery
```
config/dev/consul/service/enable_script_checks
config/dev/consul/service/enable_local_script_checks
config/dev/consul/service/enable_service_script
```

#### Performance
```
config/dev/consul/performance/raft_multiplier
```

#### Logging
```
config/dev/consul/log/enable_syslog
config/dev/consul/log/log_file
```

#### Connections
```
config/dev/consul/connection/reconnect_timeout
config/dev/consul/connection/reconnect_timeout_wan
config/dev/consul/connection/session_ttl_min
```

#### Autopilot
```
config/dev/consul/autopilot/cleanup_dead_servers
config/dev/consul/autopilot/last_contact_threshold
config/dev/consul/autopilot/max_trailing_logs
config/dev/consul/autopilot/server_stabilization_time
config/dev/consul/autopilot/disable_upgrade_migration
```

#### Snapshots
```
config/dev/consul/snapshot/enabled
config/dev/consul/snapshot/interval
config/dev/consul/snapshot/retain
config/dev/consul/snapshot/name
```

#### Backups
```
config/dev/consul/backup/enabled
config/dev/consul/backup/interval
config/dev/consul/backup/retain
config/dev/consul/backup/name
```

### Example configuration

#### Application
```
config/dev/app/name
config/dev/app/version
config/dev/app/environment
```

#### Database
```
config/dev/database/host
config/dev/database/port
config/dev/database/name
```

#### Cache
```
config/dev/cache/host
config/dev/cache/port
```

#### Feature flags
```
config/dev/features/new_ui
config/dev/features/advanced_analytics
```
|
||||
|
||||
### Adding variables

#### With curl
```bash
# Add a single variable
curl -X PUT http://localhost:8500/v1/kv/config/dev/app/name -d "my-application"

# Add several variables
curl -X PUT http://localhost:8500/v1/kv/config/dev/database/host -d "db.example.com"
curl -X PUT http://localhost:8500/v1/kv/config/dev/database/port -d "5432"
```

#### With the consul CLI
```bash
# Add a single variable
consul kv put config/dev/app/name my-application

# Add several variables
consul kv put config/dev/database/host db.example.com
consul kv put config/dev/database/port 5432
```

#### With the automation script
An automation script is provided to populate the Consul variables:

```bash
# Run the setup script
./deployment/scripts/setup_consul_variables_and_storage.sh
```

### Using variables

#### In Terraform
```hcl
data "consul_keys" "app_config" {
  key {
    name = "app_name"
    path = "config/dev/app/name"
  }
  key {
    name = "db_host"
    path = "config/dev/database/host"
  }
}

resource "some_resource" "example" {
  name = data.consul_keys.app_config.var.app_name
  host = data.consul_keys.app_config.var.db_host
}
```

#### In applications
Most Consul client libraries offer a way to read the KV store. For example, in Go:

```go
import (
	"log"

	"github.com/hashicorp/consul/api"
)

// Create the Consul client
client, err := api.NewClient(api.DefaultConfig())
if err != nil {
	log.Fatal(err)
}

// Read from the KV store; Get returns a nil pair when the key does not exist
pair, _, err := client.KV().Get("config/dev/app/name", nil)
if err != nil || pair == nil {
	log.Fatalf("config/dev/app/name unavailable: %v", err)
}
appName := string(pair.Value)
```
|
||||
|
||||
## Deploying a Consul Cluster That Follows the Naming Convention

A complete deployment procedure is provided so that the Consul cluster follows the naming convention throughout.

### Deployment flow

1. **Set the Consul variables**: a script stores all Consul cluster settings in the Consul KV store
2. **Generate the configuration file**: a Consul template renders the configuration file from the KV store
3. **Deploy the cluster**: Nomad deploys the Consul cluster with the rendered configuration

### Deployment scripts

The following scripts simplify the deployment:

#### setup_consul_cluster_variables.sh
Stores the Consul cluster configuration in the Consul KV store, following the `config/{environment}/{provider}/{region_or_service}/{key}` format.

```bash
# Set the Consul cluster variables
./deployment/scripts/setup_consul_cluster_variables.sh
```

#### generate_consul_config.sh
Renders the final Consul configuration file from the KV store using a Consul template.

```bash
# Generate the Consul configuration file
./deployment/scripts/generate_consul_config.sh
```

#### deploy_consul_cluster_kv.sh
A wrapper script that runs the complete deployment flow.

```bash
# Deploy the Consul cluster that follows the naming convention
./deployment/scripts/deploy_consul_cluster_kv.sh
```

### Configuration template

The Consul configuration template `consul.hcl.tmpl` uses Consul template syntax to read its values from the KV store:

```hcl
# Basic settings
data_dir = "{{ keyOrDefault `config/dev/consul/cluster/data_dir` `/opt/consul/data` }}"
raft_dir = "{{ keyOrDefault `config/dev/consul/cluster/raft_dir` `/opt/consul/raft` }}"

# Enable the UI
ui_config {
  enabled = {{ keyOrDefault `config/dev/consul/ui/enabled` `true` }}
}

# Server settings
server = true
bootstrap_expect = {{ keyOrDefault `config/dev/consul/cluster/bootstrap_expect` `3` }}

# Network settings
client_addr = "{{ keyOrDefault `config/dev/consul/nodes/master/ip` `100.117.106.136` }}"
bind_addr = "{{ keyOrDefault `config/dev/consul/nodes/master/ip` `100.117.106.136` }}"
advertise_addr = "{{ keyOrDefault `config/dev/consul/nodes/master/ip` `100.117.106.136` }}"

# Cluster join - peer IPs are read from the KV store
retry_join = [
  "{{ keyOrDefault `config/dev/consul/nodes/ash3c/ip` `100.116.80.94` }}",
  "{{ keyOrDefault `config/dev/consul/nodes/warden/ip` `100.122.197.112` }}"
]
```
|
||||
|
||||
### Nomad job configuration

The Nomad job file `consul-cluster-kv.nomad` follows the naming convention and uses a template to read its configuration from the KV store:

```hcl
task "consul" {
  driver = "exec"

  # Render the configuration from Consul KV with a template
  template {
    data = <<EOF
# Consul configuration file - rendered from the KV store
# Follows the config/{environment}/{provider}/{region_or_service}/{key} format

# Basic settings
data_dir = "{{ keyOrDefault `config/dev/consul/cluster/data_dir` `/opt/consul/data` }}"
raft_dir = "{{ keyOrDefault `config/dev/consul/cluster/raft_dir` `/opt/consul/raft` }}"

# Server settings
server = true
bootstrap_expect = {{ keyOrDefault `config/dev/consul/cluster/bootstrap_expect` `3` }}

# Network settings
client_addr = "{{ keyOrDefault `config/dev/consul/nodes/master/ip` `100.117.106.136` }}"
bind_addr = "{{ keyOrDefault `config/dev/consul/nodes/master/ip` `100.117.106.136` }}"
advertise_addr = "{{ keyOrDefault `config/dev/consul/nodes/master/ip` `100.117.106.136` }}"

# Cluster join - peer IPs are read from the KV store
retry_join = [
  "{{ keyOrDefault `config/dev/consul/nodes/ash3c/ip` `100.116.80.94` }}",
  "{{ keyOrDefault `config/dev/consul/nodes/warden/ip` `100.122.197.112` }}"
]
EOF
    destination = "local/consul.hcl"
  }

  config {
    command = "consul"
    args = [
      "agent",
      "-config-dir=local"
    ]
  }
}
```
|
||||
|
||||
### Verifying the deployment

After deployment, the following checks confirm that the Consul cluster follows the naming convention:

1. **Check the configuration in Consul KV**:
```bash
# List the Consul cluster configuration keys (URL quoted so the shell does not glob the ?)
curl -s "http://localhost:8500/v1/kv/config/dev/consul/?keys" | jq '.'
```

2. **Verify the Consul cluster state**:
```bash
# Check the cluster leader
curl -s http://localhost:8500/v1/status/leader

# Check the cluster peers
curl -s http://localhost:8500/v1/status/peers
```

3. **Validate the configuration file**:
```bash
# Validate the syntax of the generated configuration file
consul validate /root/mgmt/components/consul/configs/consul.hcl
```

### Updating configuration dynamically

With this deployment style the Consul cluster configuration can be updated without redeploying the whole cluster:

1. **Update the configuration in Consul KV**:
```bash
# Update the log level
curl -X PUT http://localhost:8500/v1/kv/config/dev/consul/cluster/log_level -d "DEBUG"

# Update the snapshot interval
curl -X PUT http://localhost:8500/v1/kv/config/dev/consul/snapshot/interval -d "12h"
```

2. **Regenerate the configuration file**:
```bash
# Regenerate the configuration file
./deployment/scripts/generate_consul_config.sh
```

3. **Reload the Consul configuration**:
```bash
# Reload the Consul configuration
consul reload
```

### Environment isolation

Environment variables and separate configuration paths make it easy to isolate environments:

```bash
# Development
ENVIRONMENT=dev ./deployment/scripts/setup_consul_cluster_variables.sh

# Production
ENVIRONMENT=prod ./deployment/scripts/setup_consul_cluster_variables.sh
```

Each environment's configuration is then stored under its own path:
- Development: `config/dev/consul/...`
- Production: `config/prod/consul/...`
|
||||
|
||||
## Storage Configuration

### Persistent storage

Consul needs persistent storage for its Raft log and snapshot data. The Nomad job configuration already specifies the data directory:

```hcl
config {
  command = "consul"
  args = [
    "agent",
    "-server",
    "-bootstrap-expect=3",
    "-data-dir=/opt/nomad/data/consul", # data directory
    # other arguments...
  ]
}
```

### Snapshot configuration

A snapshot is a point-in-time backup of the Consul cluster state, used for disaster recovery.

#### Enabling snapshots
Add the following to the Consul configuration file:

```hcl
# NOTE: verify this stanza against the agent configuration reference for your
# Consul version; scheduled snapshots are normally handled by the separate
# `consul snapshot agent` (an Enterprise feature), not by the agent config.
snapshot {
  enabled = true
  interval = "24h" # take a snapshot every 24 hours
  retain = 30 # keep 30 snapshots
  name = "consul-snapshot-{{.Timestamp}}"
}
```

#### Taking a snapshot manually
```bash
# Save a snapshot
consul snapshot save backup-$(date +%Y%m%d).snap

# Restore a snapshot
consul snapshot restore backup-20231201.snap
```

### Backup configuration

Backing up Consul data regularly is an important safeguard.

#### Configuring automatic backups
```hcl
# NOTE: check that your Consul version accepts this stanza; if it does not,
# the backup script below covers backups without it.
backup {
  enabled = true
  interval = "6h" # back up every 6 hours
  retain = 7 # keep 7 backups
  name = "consul-backup-{{.Timestamp}}"
}
```

#### Backup script
```bash
#!/bin/bash
# backup_consul.sh
set -euo pipefail

DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/consul"
CONSUL_ADDR="http://localhost:8500"

# Create the backup directory
mkdir -p "$BACKUP_DIR"

# Take a snapshot (-f makes curl fail on HTTP errors instead of saving the error body)
curl -sf "${CONSUL_ADDR}/v1/snapshot" -o "${BACKUP_DIR}/consul-snapshot-${DATE}.snap"

# Keep the last 7 days of backups
find "$BACKUP_DIR" -name "consul-snapshot-*.snap" -mtime +7 -delete

echo "Backup complete: ${BACKUP_DIR}/consul-snapshot-${DATE}.snap"
```
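
The backup script prunes by age (`-mtime +7`). If retention by count is preferred, in the spirit of the `retain` settings, the selection could look like the sketch below. `prune_candidates` is a hypothetical helper; it assumes the file names embed a sortable timestamp, as the `consul-snapshot-YYYYmmdd_HHMMSS.snap` names produced by the script do.

```bash
# Hypothetical sketch: read snapshot file names on stdin (one per line) and
# print the ones that fall outside the newest N. $1 is N (default 7).
prune_candidates() {
  _keep=${1:-7}
  # Reverse lexical order puts the newest timestamp first;
  # everything after line N is a candidate for deletion.
  sort -r | awk -v keep="$_keep" 'NR > keep'
}
```

For instance, `ls /backups/consul | prune_candidates 7` lists the files that could be removed while keeping the seven most recent snapshots.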
|
||||
|
||||
### Autopilot configuration

Autopilot is Consul's automated operations feature; it handles failed servers and automatic recovery.

```hcl
autopilot {
  cleanup_dead_servers = true # remove dead servers automatically
  last_contact_threshold = "200ms" # last-contact threshold
  max_trailing_logs = 250 # maximum number of trailing log entries
  server_stabilization_time = "10s" # server stabilization time
  redundancy_zone_tag = "" # redundancy zone tag
  disable_upgrade_migration = false # disable upgrade migration
  upgrade_version_tag = "" # upgrade version tag
}
```
|
||||
|
||||
## Complete Configuration Example

### Consul configuration file (consul.hcl)
```hcl
# NOTE: several keys below (raft_dir, enable_service_script, snapshot, backup)
# are not among the documented open-source agent options, and Consul documents
# a minimum of 8h for reconnect_timeout; run `consul validate` before use.

# Basic settings
data_dir = "/opt/consul/data"
raft_dir = "/opt/consul/raft"

# Enable the UI
ui_config {
  enabled = true
}

# Datacenter
datacenter = "dc1"

# Server settings
server = true
bootstrap_expect = 3

# Network settings
client_addr = "0.0.0.0"
bind_addr = "{{ GetInterfaceIP `eth0` }}"
advertise_addr = "{{ GetInterfaceIP `eth0` }}"

# Ports
ports {
  dns = 8600
  http = 8500
  https = -1
  grpc = 8502
  grpc_tls = 8503
  serf_lan = 8301
  serf_wan = 8302
  server = 8300
}

# Cluster join
retry_join = ["100.117.106.136", "100.116.80.94", "100.122.197.112"]

# Service discovery
enable_service_script = true
enable_script_checks = true
enable_local_script_checks = true

# Performance tuning
performance {
  raft_multiplier = 1
}

# Logging
log_level = "INFO"
enable_syslog = false
log_file = "/var/log/consul/consul.log"

# Security
encrypt = "YourEncryptionKeyHere"

# Connections
reconnect_timeout = "30s"
reconnect_timeout_wan = "30s"
session_ttl_min = "10s"

# Autopilot
autopilot {
  cleanup_dead_servers = true
  last_contact_threshold = "200ms"
  max_trailing_logs = 250
  server_stabilization_time = "10s"
  redundancy_zone_tag = ""
  disable_upgrade_migration = false
  upgrade_version_tag = ""
}

# Snapshots
snapshot {
  enabled = true
  interval = "24h"
  retain = 30
  name = "consul-snapshot-{{.Timestamp}}"
}

# Backups
backup {
  enabled = true
  interval = "6h"
  retain = 7
  name = "consul-backup-{{.Timestamp}}"
}
```
|
||||
|
||||
## Deployment Steps

### 1. Prepare the configuration file
```bash
# Create the configuration directory
mkdir -p /root/mgmt/components/consul/configs

# Create the configuration file.
# The quoted 'EOF' stops the shell from expanding the backticks in the config.
cat > /root/mgmt/components/consul/configs/consul.hcl << 'EOF'
# paste the complete configuration example from above
EOF
```

### 2. Run the setup script
```bash
# Run the automation script
./deployment/scripts/setup_consul_variables_and_storage.sh
```

### 3. Restart the Consul service
```bash
# Stop the Consul job
nomad job stop consul-cluster-simple

# Start the Consul job again
nomad job run /root/mgmt/components/consul/jobs/consul-cluster-simple.nomad
```

### 4. Verify the configuration
```bash
# Check Consul status
curl http://localhost:8500/v1/status/leader

# Check the variables (URLs quoted so the shell does not glob the ?)
curl -s "http://localhost:8500/v1/kv/config/dev/?recurse" | jq

# Check the storage configuration
curl -s "http://localhost:8500/v1/kv/storage/?recurse" | jq
```
|
||||
|
||||
## Best Practices

1. **Regular backups**: schedule regular backups of Consul data and test the restore procedure
2. **Monitor storage**: watch the usage of the Consul data directory to avoid running out of disk space
3. **Security**: protect the Consul cluster with ACLs and TLS
4. **Version control**: keep Consul configuration files under version control
5. **Environment isolation**: use separate configuration paths for each environment (dev/staging/prod)
6. **Documentation**: record the purpose and valid range of every configuration item

## Troubleshooting

### Common problems

#### 1. Variables cannot be read
- Check that the Consul service is running
- Verify that the variable path is correct
- Confirm that the ACL permissions are sufficient

#### 2. Not enough storage space
- Check the size of the data directory
- Adjust the snapshot and backup retention policy
- Remove old snapshots and backups

#### 3. Snapshot fails
- Check disk space
- Verify file permissions
- Check the Consul logs for the detailed error

### Debugging commands
```bash
# List Consul members
consul members

# Show Raft state
consul operator raft list-peers

# Show the key-value store
consul kv get --recurse config/dev/

# Inspect a snapshot
consul snapshot inspect backup.snap
```
|
||||
|
||||
## Extensions

### Vault integration

Consul can be integrated with Vault for stronger secrets management:

```bash
# Enable Vault's Consul secrets engine (it issues Consul ACL tokens;
# it is not an encryption backend for Consul)
vault secrets enable consul

# Gossip encryption keys are generated by Consul itself
# (there is no `consul encrypt` subcommand)
consul keygen
```

### Nomad integration

Consul can be integrated with Nomad for service discovery and configuration management:

```hcl
# Consul integration in the Nomad configuration
consul {
  address = "localhost:8500"
  token = "your-consul-token"
  ssl = false
}
```

## Summary

Configuring Consul's variables and storage makes the cluster noticeably more capable and more reliable. Variables give flexible configuration management, while storage keeps the data safe and durable. Combined with the automation scripts and the best practices above, this yields a robust, maintainable Consul cluster.
|
||||
86
docs/setup/oci-credentials-setup.md
Normal file
@@ -0,0 +1,86 @@
|
||||
# Oracle Cloud Credentials Setup Guide

## Credential File Locations

### 1. OpenTofu configuration file
**Location**: `infrastructure/environments/dev/terraform.tfvars`

This is the main configuration file; fill in your OCI credentials:

```hcl
# Oracle Cloud configuration
oci_config = {
  tenancy_ocid = "ocid1.tenancy.oc1..aaaaaaaa_your_tenancy_id"
  user_ocid = "ocid1.user.oc1..aaaaaaaa_your_user_id"
  fingerprint = "aa:bb:cc:dd:ee:ff:gg:hh:ii:jj:kk:ll:mm:nn:oo:pp"
  private_key_path = "~/.oci/oci_api_key.pem"
  region = "ap-seoul-1"
  compartment_ocid = "ocid1.compartment.oc1..aaaaaaaa_your_compartment_id"
}
```
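
The fingerprint shown in the example is only a placeholder: a real OCI API key fingerprint consists of 16 colon-separated hexadecimal byte pairs. A quick format check could look like the sketch below; `valid_fingerprint` is a hypothetical helper, not part of the repository scripts.

```bash
# Hypothetical helper: succeed if $1 looks like an API key fingerprint,
# i.e. 16 hexadecimal byte pairs separated by colons.
valid_fingerprint() {
  printf '%s\n' "$1" | grep -Eq '^([0-9a-fA-F]{2}:){15}[0-9a-fA-F]{2}$'
}
```

For example, `valid_fingerprint "$fp" || echo "fingerprint format looks wrong"` would flag the placeholder above, since `gg`, `hh` and the following pairs are not hexadecimal.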
|
||||
|
||||
### 2. OCI private key file
**Location**: `~/.oci/oci_api_key.pem`

This is your API private key file. Its content looks like this:

```
-----BEGIN PRIVATE KEY-----
MIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwggSkAgEAAoIBAQC...
(your private key content)
-----END PRIVATE KEY-----
```

### 3. OCI configuration file (optional)
**Location**: `~/.oci/config`

This is the OCI CLI configuration file and can serve as a fallback:

```ini
[DEFAULT]
user=ocid1.user.oc1..aaaaaaaa_your_user_id
fingerprint=aa:bb:cc:dd:ee:ff:gg:hh:ii:jj:kk:ll:mm:nn:oo:pp
tenancy=ocid1.tenancy.oc1..aaaaaaaa_your_tenancy_id
region=ap-seoul-1
key_file=~/.oci/oci_api_key.pem
```

## Setup Steps

### Step 1: Create the .oci directory
```bash
mkdir -p ~/.oci
chmod 700 ~/.oci
```

### Step 2: Save the private key file
```bash
# Save your private key content to the file
nano ~/.oci/oci_api_key.pem

# Set the correct permissions
chmod 400 ~/.oci/oci_api_key.pem
```
|
||||
|
||||
### 步骤 3: 编辑 terraform.tfvars
|
||||
```bash
|
||||
# 编辑配置文件
|
||||
nano infrastructure/environments/dev/terraform.tfvars
|
||||
```
|
||||
|
||||
## 安全注意事项
|
||||
|
||||
1. **私钥文件权限**: 确保私钥文件权限为 400 (只有所有者可读)
|
||||
2. **不要提交到 Git**: `.gitignore` 已经配置忽略 `*.tfvars` 文件
|
||||
3. **备份凭据**: 建议安全备份你的私钥和配置信息
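
The permission requirements above (700 on `~/.oci`, 400 on the key) can be checked with a small helper. This is a minimal sketch, assuming GNU coreutils `stat` on Linux; the function name `check_mode` is hypothetical and not part of the repository's scripts.

```bash
# Report whether a path has exactly the expected octal mode.
# Assumes GNU `stat -c`; on macOS/BSD the equivalent is `stat -f '%Lp'`.
check_mode() {
  local path="$1" expected="$2" actual
  actual=$(stat -c '%a' "$path") || return 2
  if [ "$actual" = "$expected" ]; then
    echo "OK: $path is mode $actual"
  else
    echo "WARN: $path is mode $actual, expected $expected"
    return 1
  fi
}

# Example usage with the paths from this guide:
#   check_mode ~/.oci 700
#   check_mode ~/.oci/oci_api_key.pem 400
```

A non-zero return value makes the helper usable as a guard before running OpenTofu.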

## Verifying the configuration

Once configuration is complete, run the following commands to verify it:

```bash
# Check the configuration
./scripts/setup/setup-opentofu.sh check

# Initialize OpenTofu
./scripts/setup/setup-opentofu.sh init
```
153
docs/setup/oracle-cloud-setup.md
Normal file
@@ -0,0 +1,153 @@
# Oracle Cloud Setup Guide

## Overview

This guide helps you configure Oracle Cloud Infrastructure (OCI) for use with OpenTofu.

## Prerequisites

1. An Oracle Cloud account (the free tier is sufficient)
2. OpenTofu installed
3. OCI CLI installed (optional, but recommended)

## Step 1: Create an Oracle Cloud account

1. Go to [Oracle Cloud](https://cloud.oracle.com/)
2. Click "Start for free" to create a free account
3. Complete the registration process

## Step 2: Collect the required OCIDs

### Tenancy OCID

1. Sign in to the Oracle Cloud Console
2. Click the user icon in the upper-right corner
3. Select "Tenancy: <your-tenancy-name>"
4. Copy the OCID value

### User OCID

1. Stay in the Oracle Cloud Console
2. Click the user icon in the upper-right corner
3. Select "User Settings"
4. Copy the OCID value

### Compartment OCID

1. In the navigation menu, select "Identity & Security" > "Compartments"
2. Select the compartment you want to use (usually the root compartment)
3. Copy the OCID value

## Step 3: Create an API key

### Generate a key pair

```bash
# Create the .oci directory
mkdir -p ~/.oci

# Generate the private key
openssl genrsa -out ~/.oci/oci_api_key.pem 2048

# Generate the public key
openssl rsa -pubout -in ~/.oci/oci_api_key.pem -out ~/.oci/oci_api_key_public.pem

# Set permissions
chmod 400 ~/.oci/oci_api_key.pem
chmod 400 ~/.oci/oci_api_key_public.pem
```

### Add the public key to Oracle Cloud

1. In the Oracle Cloud Console, open "User Settings"
2. Select "API Keys" in the left-hand menu
3. Click "Add API Key"
4. Select "Paste Public Key"
5. Copy the contents of `~/.oci/oci_api_key_public.pem` and paste them
6. Click "Add"
7. Copy the fingerprint that is displayed
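
The fingerprint shown by the Console can be cross-checked locally, since it is the MD5 digest of the DER-encoded public key. The following is a minimal sketch, assuming `openssl` is installed and the private key is an unencrypted PEM file; the function name `oci_key_fingerprint` is hypothetical.

```bash
# Print the colon-separated MD5 fingerprint of the public half of an RSA key.
oci_key_fingerprint() {
  local key_path="$1"
  # Export the public key as DER, hash it, and keep only the digest field
  # (the prefix printed before the digest differs between OpenSSL versions).
  openssl rsa -pubout -outform DER -in "$key_path" 2>/dev/null \
    | openssl md5 -c \
    | awk '{print $NF}'
}

# Example usage:
#   oci_key_fingerprint ~/.oci/oci_api_key.pem
```

If the output differs from the value in the Console, the uploaded public key does not belong to this private key.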

## Step 4: Configure terraform.tfvars

Edit the `infrastructure/environments/dev/terraform.tfvars` file:

```hcl
# Oracle Cloud configuration
oci_config = {
  tenancy_ocid     = "ocid1.tenancy.oc1..aaaaaaaa_your_actual_tenancy_id"
  user_ocid        = "ocid1.user.oc1..aaaaaaaa_your_actual_user_id"
  fingerprint      = "aa:bb:cc:dd:ee:ff:gg:hh:ii:jj:kk:ll:mm:nn:oo:pp"
  private_key_path = "~/.oci/oci_api_key.pem"
  region           = "ap-seoul-1" # or the region you chose
  compartment_ocid = "ocid1.compartment.oc1..aaaaaaaa_your_compartment_id"
}
```
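
A common mistake at this step is pasting one OCID into the wrong field. A quick prefix check can catch that before OpenTofu runs. This is only a sketch: the helper name `ocid_has_type` is hypothetical, and it inspects the `ocid1.<type>.` prefix only, not whether the OCID exists in your tenancy.

```bash
# Succeed if the OCID starts with "ocid1.<type>.", fail otherwise.
ocid_has_type() {
  local ocid="$1" type="$2"
  case "$ocid" in
    "ocid1.${type}."*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage with placeholder values:
#   ocid_has_type "$TENANCY_OCID" tenancy         || echo "tenancy_ocid looks wrong"
#   ocid_has_type "$USER_OCID" user               || echo "user_ocid looks wrong"
#   ocid_has_type "$COMPARTMENT_OCID" compartment || echo "compartment_ocid looks wrong"
```

Note that when the root compartment is used, `compartment_ocid` is the tenancy OCID, so the last check would need `tenancy` instead.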

## Step 5: Verify the configuration

```bash
# Check the configuration
./scripts/setup/setup-opentofu.sh check

# Initialize OpenTofu
./scripts/setup/setup-opentofu.sh init

# Generate a plan
./scripts/setup/setup-opentofu.sh plan
```

## Available regions

Commonly used Oracle Cloud regions:

- `ap-seoul-1` - Seoul, South Korea
- `ap-tokyo-1` - Tokyo, Japan
- `us-ashburn-1` - Ashburn, Virginia, USA
- `us-phoenix-1` - Phoenix, Arizona, USA
- `eu-frankfurt-1` - Frankfurt, Germany

## Free tier resources

The Oracle Cloud free tier includes:

- 2 AMD compute instances (VM.Standard.E2.1.Micro)
- 4 Arm compute instances (VM.Standard.A1.Flex)
- 200 GB of block storage
- 10 GB of object storage
- A load balancer
- Databases and more

## Troubleshooting

### Common errors

1. **401 Unauthorized**: check the API key configuration
2. **404 Not Found**: check that the OCIDs are correct
3. **Permission errors**: make sure the user has sufficient permissions

### Verifying connectivity

```bash
# Install the OCI CLI (optional)
pip install oci-cli

# Configure the OCI CLI
oci setup config

# Test the connection
oci iam compartment list
```

## Security best practices

1. Rotate API keys regularly
2. Apply the principle of least privilege
3. Do not hard-code credentials in code
4. Use compartments to isolate resources
5. Enable audit logging

## References

- [Oracle Cloud Infrastructure documentation](https://docs.oracle.com/en-us/iaas/)
- [OCI Terraform Provider](https://registry.terraform.io/providers/oracle/oci/latest/docs)
- [Oracle Cloud Free Tier](https://www.oracle.com/cloud/free/)
305
docs/vault-consul-best-practices.md
Normal file
@@ -0,0 +1,305 @@
# Vault and Consul Integration Best Practices

## 1. Architecture

### 1.1 High-availability architecture

- **Vault cluster**: 3 nodes (1 active leader + 2 standby followers)
- **Consul cluster**: 3 nodes (1 leader + 2 followers)
- **Network**: Tailscale secure network
- **Storage**: Consul as the Vault storage backend

### 1.2 Node layout

```
Vault nodes:
- ch4.tailnet-68f9.ts.net:8200 (Leader)
- ash3c.tailnet-68f9.ts.net:8200 (Follower)
- warden.tailnet-68f9.ts.net:8200 (Follower)

Consul nodes:
- ch4.tailnet-68f9.ts.net:8500 (Leader)
- ash3c.tailnet-68f9.ts.net:8500 (Follower)
- warden.tailnet-68f9.ts.net:8500 (Follower)
```

## 2. Vault configuration best practices

### 2.1 Storage backend

```hcl
storage "consul" {
  address = "127.0.0.1:8500"
  path    = "vault/"

  # High-availability settings
  datacenter   = "dc1"
  service      = "vault"
  service_tags = "vault-server"

  # Session settings
  session_ttl    = "15s"
  lock_wait_time = "15s"

  # Consistency settings
  consistency_mode = "strong"

  # Concurrency and service registration settings
  max_parallel         = 128
  disable_registration = false
}
```

### 2.2 Listener configuration

```hcl
listener "tcp" {
  address = "0.0.0.0:8200"

  # Cluster (server-to-server) traffic is bound on the same listener
  cluster_address = "0.0.0.0:8201"

  # Enable TLS in production
  tls_cert_file   = "/opt/vault/tls/vault.crt"
  tls_key_file    = "/opt/vault/tls/vault.key"
  tls_min_version = "tls12"
}
```

### 2.3 Cluster configuration

```hcl
# API address, reachable over the Tailscale network
# ({{ ansible_host }} is filled in by the Ansible template)
api_addr = "https://{{ ansible_host }}:8200"

# Cluster address, reachable over the Tailscale network
cluster_addr = "https://{{ ansible_host }}:8201"

# Cluster name
cluster_name = "vault-cluster"

# Keep mlock enabled (disable_mlock = false), as production should
disable_mlock = false

# Logging
log_level  = "INFO"
log_format = "json"
```

## 3. Consul configuration best practices

### 3.1 Service registration

```hcl
services {
  name = "vault"
  tags = ["vault-server", "secrets"]
  port = 8200

  check {
    name     = "vault-health"
    # The Vault listener in section 2.2 serves TLS, so the check uses https
    http     = "https://127.0.0.1:8200/v1/sys/health"
    interval = "10s"
    timeout  = "3s"
  }
}
```
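
When reading the results of this health check, it helps to know that `/v1/sys/health` answers with different HTTP status codes depending on the node's state, and that a Consul HTTP check treats 2xx as passing, 429 as warning, and anything else as critical. With Vault's default codes, standby nodes therefore show up as warning unless `?standbyok=true` is appended to the URL. The helper below is a sketch that maps Vault's documented default codes to a state; the function name `vault_health_state` is hypothetical.

```bash
# Map the HTTP status of GET /v1/sys/health to a node state
# (Vault's default status codes).
vault_health_state() {
  case "$1" in
    200) echo "active" ;;
    429) echo "standby" ;;
    472) echo "dr-secondary" ;;
    473) echo "performance-standby" ;;
    501) echo "uninitialized" ;;
    503) echo "sealed" ;;
    *)   echo "unknown" ;;
  esac
}

# Example usage (address as in the check above):
#   code=$(curl -s -o /dev/null -w '%{http_code}' https://127.0.0.1:8200/v1/sys/health)
#   vault_health_state "$code"
```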

### 3.2 ACL configuration

```hcl
acl {
  enabled                  = true
  default_policy           = "deny"
  enable_token_persistence = true

  # Token carrying the permissions for the Vault service
  tokens {
    default = "{{ vault_consul_token }}"
  }
}
```

## 4. Security best practices

### 4.1 TLS

- All communication between Vault nodes uses TLS
- Communication between Consul nodes uses TLS
- Client-to-Vault communication uses TLS

### 4.2 Authentication

Auth methods are enabled at runtime through the CLI or API, not in the Vault server configuration file:

```bash
# Enable several auth methods

# AppRole
vault auth enable approle

# LDAP
vault auth enable ldap
vault write auth/ldap/config \
    url="ldap://authentik.tailnet-68f9.ts.net:389" \
    userdn="ou=users,dc=authentik,dc=local" \
    groupdn="ou=groups,dc=authentik,dc=local"

# OIDC (a client ID and secret from the identity provider are also needed)
vault auth enable oidc
vault write auth/oidc/config \
    oidc_discovery_url="https://authentik1.git-4ta.live/application/o/vault/"
```

## 5. Monitoring and auditing

### 5.1 Audit logs

Audit devices are enabled at runtime through the CLI or API, not in the Vault server configuration file:

```bash
# File audit device
vault audit enable file file_path=/opt/vault/logs/audit.log format=json

# Syslog audit device
vault audit enable syslog facility="AUTH" tag="vault"
```

### 5.2 Telemetry

```hcl
telemetry {
  prometheus_retention_time = "30s"
  disable_hostname          = false

  # Prefix for metric names
  metrics_prefix = "vault"
}
```

## 6. Backup and recovery

### 6.1 Automated backup script

```bash
#!/bin/bash
# /opt/vault/scripts/backup.sh

# Vault stores its data in Consul in this architecture, so backing up Vault
# means taking a Consul snapshot. (`vault operator raft snapshot save` only
# works with Vault's integrated Raft storage.)
# With ACLs enabled, CONSUL_HTTP_TOKEN must be exported as well.
export CONSUL_HTTP_ADDR="http://127.0.0.1:8500"

# Create a snapshot
consul snapshot save "/opt/vault/backups/vault-$(date +%Y%m%d-%H%M%S).snapshot"

# Remove old backups (keep 7 days)
find /opt/vault/backups -name "vault-*.snapshot" -mtime +7 -delete
```

### 6.2 Consul snapshot

```bash
#!/bin/bash
# /opt/consul/scripts/backup.sh

# The consul CLI reads the agent address from CONSUL_HTTP_ADDR
export CONSUL_HTTP_ADDR="http://127.0.0.1:8500"

# Create a Consul snapshot
consul snapshot save "/opt/consul/backups/consul-$(date +%Y%m%d-%H%M%S).snapshot"
```
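
The Consul script above has no retention step, unlike the 7-day cleanup in section 6.1. A retention helper could look like the sketch below. It assumes a `find` that supports `-mtime` and `-delete` (GNU and BSD both do); the function name `prune_snapshots` is hypothetical.

```bash
# Delete *.snapshot files in a directory that are older than N days.
# Other files in the directory are left untouched.
prune_snapshots() {
  local dir="$1" days="$2"
  find "$dir" -maxdepth 1 -type f -name '*.snapshot' -mtime +"$days" -delete
}

# Example usage, mirroring the 7-day retention of section 6.1:
#   prune_snapshots /opt/consul/backups 7
```

`-mtime +7` matches files whose age, counted in whole days, is greater than 7, so a snapshot is removed once it is at least 8 days old.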
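
## 7. Failover and disaster recovery

### 7.1 Automatic failover

- With the Consul storage backend, Vault elects a new active node through a lock held in Consul (Raft is used only by Vault's integrated storage)
- Consul uses the Raft protocol to elect a new leader automatically
- Clients reconnect to the new leader automatically

### 7.2 Disaster recovery procedure

1. Stop all Vault nodes
2. Restore the data from Consul
3. Start the Vault cluster
4. Verify service status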
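
The four recovery steps above can be sketched as a script. Everything here is an assumption layered on the document: the systemd unit name `vault` (taken from the `journalctl -u vault` command used later), restoring via `consul snapshot restore`, and the helper names `run` and `recover_vault`. By default it only prints the commands.

```bash
# Print a command when DRY_RUN=1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Walk through the recovery steps for one node.
recover_vault() {
  local snapshot="$1"
  run systemctl stop vault                 # 1. stop Vault (repeat on every node)
  run consul snapshot restore "$snapshot"  # 2. restore Vault's data in Consul
  run systemctl start vault                # 3. start Vault (repeat on every node)
  run vault status                         # 4. verify; unseal first if sealed
}

# Example usage (dry run):
#   recover_vault /opt/consul/backups/consul-<timestamp>.snapshot
```

Keeping the dry run as the default makes it harder to trigger a restore by accident; set `DRY_RUN=0` only after reviewing the printed plan.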

## 8. Performance tuning

### 8.1 Cache configuration

```hcl
# Read cache of the physical storage layer (Vault server configuration)
disable_cache = false
cache_size    = "131072" # number of entries; Vault's default
```

### 8.2 Connection pool configuration

```hcl
storage "consul" {
  # Maximum number of concurrent requests to Consul
  max_parallel = 128
}
```

## 9. Deployment checklist

### 9.1 Pre-deployment checks

- [ ] Consul cluster is healthy
- [ ] Network connectivity tested
- [ ] TLS certificates configured
- [ ] Firewall rules configured
- [ ] Storage space checked

### 9.2 Post-deployment verification

- [ ] Vault cluster status checked
- [ ] Service registration verified
- [ ] Authentication tested
- [ ] Backups tested
- [ ] Monitoring metrics verified

## 10. Common problems and solutions

### 10.1 Common problems

1. **Vault cannot connect to Consul**
   - Check network connectivity
   - Verify the Consul service status
   - Check ACL permissions

2. **Cluster split**
   - Check for network partitions
   - Verify Raft log consistency
   - Run the disaster recovery procedure

3. **Performance problems**
   - Adjust the connection pool size
   - Enable caching
   - Optimize the network configuration

### 10.2 Troubleshooting commands

```bash
# Check Vault status
vault status

# Check Consul members
consul members

# Check service registration
consul catalog services

# Follow the Vault logs
journalctl -u vault -f

# Follow the Consul logs
journalctl -u consul -f
```
183
docs/vault-consul-integration.md
Normal file
@@ -0,0 +1,183 @@
# Vault and Consul Integration Configuration Guide

## 1. Overview

This document describes the Vault and Consul integration in detail, including the architecture, configuration parameters, and management operations.

## 2. Integration architecture

### 2.1 Architecture diagram

```
+------------------+
|   Vault Client   |
+------------------+
         |
+------------------+
|   Vault Server   |
| (3-node cluster) |
+------------------+
         |
+------------------+
|  Consul Backend  |
| (3-node cluster) |
+------------------+
```

### 2.2 Node layout

- **Vault nodes**:
  - master: 100.117.106.136
  - ash3c: 100.116.80.94
  - warden: 100.122.197.112

- **Consul nodes**:
  - master: 100.117.106.136
  - ash3c: 100.116.80.94
  - warden: 100.122.197.112

## 3. Configuration details

### 3.1 Vault configuration file

The configuration file on each Vault node is located at `/opt/nomad/data/vault/config/vault.hcl`:

```hcl
storage "consul" {
  address = "<local Consul address>:8500"
  path    = "vault/"
}

listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_disable = 1
}

api_addr     = "http://<node IP>:8200"
cluster_addr = "http://<node IP>:8201"

ui            = true
disable_mlock = true
```

### 3.2 Consul configuration

As Vault's storage backend, Consul holds all of Vault's persistent data, including:

- Key material
- Policies
- Cluster state

Audit logs are not part of this data. They are written by Vault's audit devices (file, syslog, or socket), not to the storage backend.

## 4. Verifying the integration

### 4.1 Verification commands

```bash
# Check Vault status
vault status

# Check Consul members
consul members

# Inspect the Vault data in Consul (the URL is quoted so the shell does not treat "?" as a glob)
curl -s "http://<consul_addr>:8500/v1/kv/vault/?recurse" | jq .
```

### 4.2 Verification script

```bash
# Run the full verification
/root/mgmt/deployment/scripts/verify_vault_consul_integration.sh
```

## 5. Management operations

### 5.1 Day-to-day management

```bash
# Show status
/root/mgmt/deployment/scripts/manage_vault_consul.sh status

# Health check
/root/mgmt/deployment/scripts/manage_vault_consul.sh health

# Verify the integration
/root/mgmt/deployment/scripts/manage_vault_consul.sh verify
```

### 5.2 Monitoring operations

```bash
# Live monitoring
/root/mgmt/deployment/scripts/manage_vault_consul.sh monitor

# Data backup
/root/mgmt/deployment/scripts/manage_vault_consul.sh backup
```

## 6. Troubleshooting

### 6.1 Common problems

#### 6.1.1 Vault cannot connect to Consul

**Problem**: Vault fails to start and the log shows that it cannot connect to Consul

**Solution**:

1. Check that the Consul service is running: `consul members`
2. Check network connectivity: `curl http://<consul_addr>:8500/v1/status/leader`
3. Verify that the Consul address in the Vault configuration is correct
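
The `/v1/status/leader` call in step 2 returns a JSON string with the leader's address, or an empty string when no leader has been elected. A scripted version of that check could look like this sketch; the function name `consul_has_leader` is hypothetical.

```bash
# Succeed if the body returned by /v1/status/leader names a leader.
consul_has_leader() {
  local body="$1"
  # Strip whitespace and the surrounding JSON quotes.
  body=$(printf '%s' "$body" | tr -d '" \n\r')
  [ -n "$body" ]
}

# Example usage (<consul_addr> as in the commands above):
#   if consul_has_leader "$(curl -s "http://<consul_addr>:8500/v1/status/leader")"; then
#     echo "leader elected"
#   else
#     echo "no leader, or Consul unreachable"
#   fi
```

An empty result covers both "Consul is unreachable" and "the cluster has lost quorum"; `consul members` from step 1 helps tell the two apart.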

#### 6.1.2 Vault data loss

**Problem**: Vault cannot read previously stored data

**Solution**:

1. Check the data in Consul: `curl "http://<consul_addr>:8500/v1/kv/vault/?keys"`
2. Verify the Consul cluster state: `consul members`
3. If necessary, restore the data from a backup

### 6.2 Viewing logs

```bash
# View the Vault logs
nomad alloc logs -address=http://100.116.158.95:4646 <vault_allocation_id>

# View the Consul logs
nomad alloc logs -address=http://100.116.158.95:4646 <consul_allocation_id>
```

## 7. Security considerations

### 7.1 Data encryption

- Vault encrypts its data before writing it to Consul, so the data in Consul is encrypted by default
- Network traffic is encrypted with TLS (production)

### 7.2 Access control

- Vault uses tokens for access control
- Consul uses ACL policies for access control

### 7.3 Backup strategy

- Back up the Vault data in Consul regularly
- Store backup files encrypted
- Follow the 3-2-1 backup rule

## 8. Performance tuning

### 8.1 Consul tuning

- Tune the performance parameters of Consul's storage backend
- Monitor the health of the Consul cluster
- Clean up expired sessions regularly

### 8.2 Vault tuning

- Adjust Vault's cache settings
- Monitor Vault's performance metrics
- Optimize the use of secrets engines

## 9. Upgrades and maintenance

### 9.1 Version upgrades

1. Upgrade the Consul cluster first
2. Then upgrade the Vault cluster
3. Verify the integration

### 9.2 Rolling updates

Use Nomad to perform rolling updates so that the service is not interrupted:

```bash
nomad job run -address=http://100.116.158.95:4646 /path/to/updated/job.nomad
```

## 10. Related documents

- [Vault documentation](https://www.vaultproject.io/docs)
- [Consul documentation](https://www.consul.io/docs)
- [Nomad documentation](https://www.nomadproject.io/docs)
- Vault development environment guide
- Vault security policy document
112
docs/vault-dev-environment.md
Normal file
@@ -0,0 +1,112 @@
# Vault Development Environment Guide

## 1. Overview

This document describes how to use Vault in a development environment, including initialization, key management, and basic operations.

## 2. Characteristics of the development environment

- Uses a single unseal key (to simplify operations)
- All keys are stored in a local development directory
- Suited to quick testing and development

**Note**: This configuration is for development only. In production, follow the security policy document.

## 3. Initializing Vault

### 3.1 Run the initialization script

```bash
/root/mgmt/deployment/scripts/init_vault_dev.sh
```

The script will:

1. Initialize the Vault cluster
2. Generate one unseal key and the root token
3. Unseal all nodes automatically
4. Save the environment variable configuration

### 3.2 View key information

```bash
/root/mgmt/deployment/scripts/show_vault_dev_keys.sh
```

## 4. Using Vault

### 4.1 Set environment variables

```bash
source /root/mgmt/security/secrets/vault/dev/vault_env.sh
```

### 4.2 Basic operations

```bash
# Check status
vault status

# Write a secret
vault kv put secret/myapp/config username="devuser" password="devpassword"

# Read a secret
vault kv get secret/myapp/config
```

### 4.3 Run the full example

```bash
/root/mgmt/deployment/scripts/vault_dev_example.sh
```

## 5. Directory layout

```
/root/mgmt/security/secrets/vault/dev/
├── init_keys.json    # Initialization keys (unseal key and root token)
└── vault_env.sh      # Environment variable configuration
```

## 6. Important reminders

### 6.1 Development environment limitations

- Only one unseal key is used (production should use 5 key shares with a threshold of 3)
- Keys are stored on the local filesystem (production should store them in separate places)
- Suited to testing by a single developer

### 6.2 Migrating to production

When moving from the development environment to production:

1. Re-initialize the Vault cluster
2. Use 5 unseal key shares with a threshold of 3
3. Distribute the keys to different administrators
4. Follow the security policy document

## 7. Troubleshooting

### 7.1 Vault is not initialized

Run the initialization script:

```bash
/root/mgmt/deployment/scripts/init_vault_dev.sh
```

### 7.2 Vault is initialized but sealed

Unseal it with the unseal key:

```bash
export VAULT_ADDR='http://<node IP>:8200'
vault operator unseal <unseal key>
```
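
To decide between sections 7.1 and 7.2 from a script, the exit code of `vault status` is enough: the CLI exits with 0 when Vault is unsealed, 2 when it is sealed, and 1 on error. The helper below is a sketch that names those codes; `vault_status_meaning` is a hypothetical function name.

```bash
# Translate the exit code of `vault status` into a word.
vault_status_meaning() {
  case "$1" in
    0) echo "unsealed" ;;
    2) echo "sealed" ;;
    *) echo "error" ;;
  esac
}

# Example usage:
#   vault status >/dev/null 2>&1
#   vault_status_meaning "$?"
```

An `error` result usually means the node is unreachable or `VAULT_ADDR` is wrong, which is the case covered by section 7.3.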

### 7.3 Cannot connect to Vault

Check the Vault service status:

```bash
curl -v http://<node IP>:8200/v1/sys/health
```

## 8. Cleaning up the environment

To start over, delete the key file and re-initialize:

```bash
rm -f /root/mgmt/security/secrets/vault/dev/init_keys.json
/root/mgmt/deployment/scripts/init_vault_dev.sh
```

## 9. Related documents

- [Vault security policy](vault-security-policy.md) - production security guide
- [Vault documentation](https://www.vaultproject.io/docs)
- [Vault API documentation](https://www.vaultproject.io/api)
139
docs/vault-security-policy.md
Normal file
@@ -0,0 +1,139 @@
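# Vault Security Policy and Key Management Guide

## 1. Overview

This document defines the security policy for managing Vault keys, to keep the infrastructure secure and reliable.

## 2. Key types

### 2.1 Initialization keys

- **Unseal keys**: used to unseal a Vault instance
- **Root token**: the initial token, which holds every permission in Vault

### 2.2 Operational keys

- **User tokens**: access tokens issued to users and services
- **Policy tokens**: restricted tokens bound to specific policies

## 3. Secure storage policy

### 3.1 Storing unseal keys

**Prohibited**:

- Storing all keys in the same place
- Storing keys in plain text in code or configuration files
- Sending keys over insecure communication channels

**Recommended**:

1. **Physical distribution**:
   - Give each of the 5 unseal keys to a different trusted administrator
   - Each administrator knows only their own key
   - Any 3 keys are enough to unseal Vault (Shamir's Secret Sharing)

2. **Encrypted storage**:
   - Encrypt key files with GPG or another encryption tool
   - Keep the encrypted files in a secure location
   - Have different administrators hold the encryption keys

3. **Hardware security module**:
   - Enterprise environments should store keys in an HSM
   - This provides hardware-level protection

### 3.2 Storing the root token

- Use the root token immediately to create administrative tokens with least privilege
- Revoke the root token as soon as those tokens exist
- Distribute the new administrative tokens according to separation of duties

## 4. Key lifecycle management

### 4.1 Creation

- Keys are generated at initialization
- Distribute and store them immediately according to the security policy
- Record when each key was created and who is responsible for it

### 4.2 Use

- Use unseal keys only when necessary
- Rotate user and service tokens regularly
- Monitor key usage

### 4.3 Update

- Periodically generate new unseal keys with `vault operator rekey`; re-initializing Vault starts from an empty Vault and must be done with great caution
- Update the key distribution when administrators change
- Regenerate keys immediately after a security incident

### 4.4 Destruction

- Securely delete key copies that are no longer needed
- Use secure deletion tools so that the data cannot be recovered
- Record when each key was destroyed and who was responsible

## 5. Incident response

### 5.1 Key compromise

1. Generate new unseal keys immediately (`vault operator rekey`)
2. Re-initialize the Vault cluster if rekeying is not sufficient; this starts from an empty Vault
3. Update the configuration of every service that depends on Vault
4. Investigate the cause of the leak and fix the vulnerability

### 5.2 Administrator unavailable

1. Make sure enough key holders are available (at least 3)
2. Maintain a list of backup key holders
3. Verify the availability of key holders regularly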
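
The availability requirement above follows directly from the Shamir parameters: with `shares` key holders and a threshold of `threshold`, unsealing stays possible as long as no more than `shares - threshold` holders are unavailable. A minimal sketch of that arithmetic, with the hypothetical helper name `max_unavailable_holders`:

```bash
# Print how many key holders may be unavailable before unsealing becomes impossible.
max_unavailable_holders() {
  local shares="$1" threshold="$2"
  if [ "$threshold" -lt 1 ] || [ "$threshold" -gt "$shares" ]; then
    echo "invalid: threshold must be between 1 and the number of shares" >&2
    return 1
  fi
  echo $((shares - threshold))
}

# Example usage with the 5-share, 3-threshold scheme of this policy:
#   max_unavailable_holders 5 3
```

For the 5/3 scheme the result is 2, which is why the list of backup key holders matters once two administrators are already out of reach.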

## 6. Audit and compliance

### 6.1 Audit requirements

- Record every key-related operation
- Review regularly how the key management policy is being followed
- Produce key usage reports

### 6.2 Compliance

- Follow the organization's security policies
- Meet industry standards (such as SOC 2 and ISO 27001)
- Carry out regular security assessments

## 7. Implementation steps

### 7.1 Initialize Vault

```bash
# Initialize Vault with the provided script
/root/mgmt/deployment/scripts/init_vault_cluster.sh
```

### 7.2 Distribute keys securely

1. Copy the generated key file to a secure location
2. Encrypt the key file and distribute it to different administrators
3. Verify that every administrator can unseal Vault correctly

### 7.3 Create administrative tokens

```bash
# Create an administrative token with the root token
export VAULT_ADDR='http://<node IP>:8200'
export VAULT_TOKEN=<root_token>
vault token create -policy=admin -period=24h
```

### 7.4 Revoke the root token

```bash
# Revoke the root token to improve security
vault token revoke <root_token>
```

## 8. Best practices

### 8.1 Access control

- Apply the principle of least privilege
- Use policies to restrict token permissions
- Review and update policies regularly

### 8.2 Monitoring and alerting

- Monitor Vault unseal and seal events
- Set up alerts for unusual key usage
- Produce security reports regularly

### 8.3 Backup and recovery

- Back up Vault data regularly
- Test the recovery procedure
- Keep backup data secure

## 9. Related documents

- [Vault security model](https://www.vaultproject.io/docs/internals/security)
- [HashiCorp security](https://www.hashicorp.com/security)
- The organization's internal security policies
268
docs/vault/ansible_vault_integration.md
Normal file
@@ -0,0 +1,268 @@
# Ansible and HashiCorp Vault Integration Guide

This document describes how to integrate Ansible with HashiCorp Vault so that sensitive information is managed and used securely.

## 1. Install the required Python package

First, install the Python package that Ansible's Vault integration depends on:

```bash
pip install hvac
```

## 2. Configure Ansible to use Vault

### 2.1 Create the Vault connection settings

Create a Vault connection settings file named `vault_config.yml`:

```yaml
vault_addr: http://localhost:8200
vault_role_id: "your-approle-role-id"
vault_secret_id: "your-approle-secret-id"
```

### 2.2 Create a Vault role for lookups

Create an AppRole in Vault dedicated to Ansible:

```bash
# Enable AppRole authentication
vault auth enable approle

# Create the policy
cat > ansible-policy.hcl <<EOF
path "kv/data/ansible/*" {
  capabilities = ["read"]
}
EOF

vault policy write ansible ansible-policy.hcl

# Create the AppRole
vault write auth/approle/role/ansible \
    token_policies="ansible" \
    token_ttl=1h \
    token_max_ttl=4h

# Read the Role ID
vault read auth/approle/role/ansible/role-id

# Generate a Secret ID
vault write -f auth/approle/role/ansible/secret-id
```

## 3. Using Vault in Ansible

### 3.1 Using the lookup plugin

Use the `hashi_vault` lookup plugin in an Ansible playbook:

```yaml
---
- name: Example using HashiCorp Vault
  hosts: all
  vars:
    vault_addr: "http://localhost:8200"
    role_id: "{{ lookup('file', '/path/to/role_id') }}"
    secret_id: "{{ lookup('file', '/path/to/secret_id') }}"

    # Fetch the database password from Vault
    db_password: "{{ lookup('hashi_vault', 'secret=kv/data/ansible/db:password auth_method=approle role_id=' + role_id + ' secret_id=' + secret_id + ' url=' + vault_addr) }}"

  tasks:
    - name: Configure the database connection
      template:
        src: db_config.j2
        dest: /etc/app/db_config.ini
```

### 3.2 Using environment variables

The Vault authentication settings can also be supplied through environment variables. Lookups run on the control node, so these variables have to be present in the environment of the `ansible-playbook` process; a play-level `environment:` block only affects tasks executed on the managed hosts:

```yaml
---
- name: Vault example using environment variables
  hosts: all
  environment:
    VAULT_ADDR: "http://localhost:8200"
    VAULT_ROLE_ID: "{{ lookup('file', '/path/to/role_id') }}"
    VAULT_SECRET_ID: "{{ lookup('file', '/path/to/secret_id') }}"

  tasks:
    - name: Fetch a secret from Vault
      set_fact:
        api_key: "{{ lookup('hashi_vault', 'secret=kv/data/ansible/api:key') }}"
```

## 4. Creating a Vault secrets role

Create a custom Ansible role for managing secrets stored in Vault:

### 4.1 Role structure

```
roles/
└── vault_secrets/
    ├── defaults/
    │   └── main.yml
    ├── tasks/
    │   └── main.yml
    └── vars/
        └── main.yml
```

### 4.2 Main task file

`roles/vault_secrets/tasks/main.yml`:

```yaml
---
- name: Ensure the Vault token is valid
  block:
    - name: Obtain a Vault token
      set_fact:
        vault_token: "{{ lookup('hashi_vault', 'auth_method=approle role_id=' + vault_role_id + ' secret_id=' + vault_secret_id + ' url=' + vault_addr) }}"
      no_log: true
  rescue:
    - name: Vault authentication failed
      fail:
        msg: "Could not obtain a valid token from Vault"

- name: Read secrets from Vault
  set_fact:
    secrets: "{{ lookup('hashi_vault', 'secret=' + vault_path + ' token=' + vault_token + ' url=' + vault_addr) }}"
  no_log: true

- name: Set a variable for each secret
  set_fact:
    "{{ item.key }}": "{{ item.value }}"
  with_dict: "{{ secrets.data.data }}"
  no_log: true
```

## 5. Migrating existing Ansible Vault content to HashiCorp Vault

### 5.1 Create the migration script

Create a script that migrates Ansible Vault content to HashiCorp Vault automatically:

```bash
#!/bin/bash
# migrate_to_hashicorp_vault.sh
set -euo pipefail

# Arguments
ANSIBLE_VAULT_FILE=${1:-}
VAULT_PATH=${2:-}
export VAULT_ADDR=${VAULT_ADDR:-"http://localhost:8200"}

# Validate the arguments
if [ -z "$ANSIBLE_VAULT_FILE" ] || [ -z "$VAULT_PATH" ]; then
  echo "Usage: $0 <ansible_vault_file> <vault_path>"
  echo "Example: $0 group_vars/all/vault.yml kv/ansible/group_vars/all"
  exit 1
fi

# Check the Vault login state
if ! vault token lookup >/dev/null 2>&1; then
  echo "Log in to Vault first: vault login <token>"
  exit 1
fi

# Decrypt the Ansible Vault file into a temp file that is removed on exit
echo "Decrypting the Ansible Vault file..."
TEMP_FILE=$(mktemp)
trap 'rm -f "$TEMP_FILE"' EXIT
ansible-vault decrypt --output="$TEMP_FILE" "$ANSIBLE_VAULT_FILE"

# Convert the YAML to JSON and store each top-level key in HashiCorp Vault
echo "Migrating secrets to HashiCorp Vault..."
python3 - "$TEMP_FILE" "$VAULT_PATH" <<'PY'
import json, subprocess, sys
import yaml  # requires PyYAML

temp_file, vault_path = sys.argv[1], sys.argv[2]
with open(temp_file) as f:
    data = yaml.safe_load(f)
for key, value in data.items():
    cmd = ["vault", "kv", "put", f"{vault_path}/{key}", "value=" + json.dumps(value)]
    subprocess.run(cmd, check=True)
PY

echo "Migration complete. Data is stored under the Vault path: $VAULT_PATH/"
```

### 5.2 Run the migration

```bash
# Make the script executable
chmod +x migrate_to_hashicorp_vault.sh

# Run the migration
./migrate_to_hashicorp_vault.sh group_vars/all/vault.yml kv/ansible/group_vars/all
```

## 6. Updating the Ansible configuration

### 6.1 Modify ansible.cfg

Update `ansible.cfg` with the Vault-related settings:

```ini
[defaults]
vault_identity_list = dev@~/.ansible/vault_dev.txt, prod@~/.ansible/vault_prod.txt

[hashi_vault_collection]
url = http://localhost:8200
auth_method = approle
role_id = /path/to/role_id
secret_id = /path/to/secret_id
```

### 6.2 Update existing playbooks

Replace the Ansible Vault references in existing playbooks with HashiCorp Vault references:

```yaml
# Old approach
- name: Use an Ansible Vault variable
  debug:
    msg: "Database password: {{ vault_db_password }}"

# New approach
- name: Use a HashiCorp Vault variable
  debug:
    msg: "Database password: {{ lookup('hashi_vault', 'secret=kv/data/ansible/db:password') }}"
```

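## 7. Best practices

1. **Avoid hard-coded credentials**: keep Vault authentication details in environment variables or external files
2. **Limit token permissions**: give the Vault tokens created for Ansible only the minimum permissions they need
3. **Set sensible TTLs**: give Vault tokens a reasonable lifetime and avoid long-lived tokens
4. **Use no_log**: set `no_log: true` on tasks that handle sensitive data so that it does not leak into logs
5. **Rotate credentials regularly**: rotate the AppRole Secret ID on a regular schedule
6. **Integrate with CI/CD**: authenticate to Vault from the CI/CD pipeline instead of managing tokens by hand

## 8. Troubleshooting

### 8.1 Common problems

1. **Authentication failure**:
   - Check that the Role ID and Secret ID are correct
   - Verify that the AppRole has the right policy attached

2. **Wrong path**:
   - The KV v2 engine requires `data` in the path, for example `kv/data/path` rather than `kv/path`

3. **Permission problems**:
   - Make sure the AppRole has enough permissions to access the requested secret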
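
The path problem in item 2 comes from the difference between the logical path used by `vault kv put` and `vault kv get` (for example `kv/ansible/db`) and the API path used in policies and in the lookup's `secret=` argument (`kv/data/ansible/db`). The conversion can be sketched as follows; the function name `kv2_api_path` is hypothetical.

```bash
# Build the KV v2 API path from a mount point and a logical secret path.
kv2_api_path() {
  local mount="${1%/}" path="${2#/}"
  echo "${mount}/data/${path}"
}

# Example usage:
#   kv2_api_path kv ansible/db
```

The `vault kv` commands insert the `data/` segment themselves, which is why the migration script in section 5 uses `kv/ansible/...` while the policy and the lookups use `kv/data/ansible/...`.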

### 8.2 Debugging tips

```yaml
- name: Debug a Vault lookup
  debug:
    msg: "{{ lookup('hashi_vault', 'secret=kv/data/ansible/db:password auth_method=approle role_id=' + role_id + ' secret_id=' + secret_id + ' url=' + vault_addr) }}"
  vars:
    ansible_hashi_vault_debug: true
```
117
docs/vault/vault_deployment_guide.md
Normal file
@@ -0,0 +1,117 @@
# Deploying Vault with Nomad

This document gives detailed steps for deploying HashiCorp Vault with Nomad's exec driver, in the same way Consul is deployed.

## Deployment architecture

- **Driver**: Nomad's `exec` driver
- **Node layout**: deployed on three nodes (kr-master, us-ash3c, bj-warden)
- **Storage backend**: the local Consul
- **Network settings**: API port 8200, cluster communication port 8201

## Automated deployment

An automation script is provided to simplify deployment. The script:

1. Installs Vault on all nodes with Ansible
2. Deploys the Vault service through Nomad
3. Initializes and unseals Vault (if needed)

### Using the automated deployment script

```bash
# Make sure the script is executable
chmod +x scripts/deploy_vault.sh

# Run the deployment script
./scripts/deploy_vault.sh
```

When the script finishes, Vault is initialized and unsealed on the primary node. You need to unseal the other nodes manually.

## Manual deployment steps

To deploy manually, follow these steps:

### 1. Install Vault

Install Vault on all nodes with Ansible:

```bash
ansible-playbook -i configuration/inventories/production/vault.ini configuration/playbooks/install/install_vault.yml
```

### 2. Deploy the Vault service

Deploy the Vault service with Nomad:

```bash
nomad job run jobs/vault-cluster-exec.nomad
```

### 3. Initialize Vault

Initialize Vault on one node:

```bash
export VAULT_ADDR='http://127.0.0.1:8200'
vault operator init -key-shares=5 -key-threshold=3
```

Store the generated unseal keys and root token securely.

### 4. Unseal Vault

Unseal Vault on every node:

```bash
export VAULT_ADDR='http://127.0.0.1:8200'
vault operator unseal <unseal key 1>
vault operator unseal <unseal key 2>
vault operator unseal <unseal key 3>
```
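
Because every node has to be unsealed separately, the step above can be wrapped in a loop. This is a sketch under assumptions: the node URLs passed in are placeholders, the function name `unseal_nodes` is hypothetical, and `DRY_RUN=1` (the default) only prints the commands. Calling `vault operator unseal` without an argument makes the CLI prompt for the key share, which keeps the shares out of the shell history.

```bash
# Unseal each node by prompting for a key share THRESHOLD times per node.
unseal_nodes() {
  local threshold="$1"
  shift
  local addr i
  for addr in "$@"; do
    i=1
    while [ "$i" -le "$threshold" ]; do
      if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ VAULT_ADDR=$addr vault operator unseal"
      else
        VAULT_ADDR="$addr" vault operator unseal   # prompts for the key share
      fi
      i=$((i + 1))
    done
  done
}

# Example usage (hypothetical node URLs, threshold of 3):
#   DRY_RUN=0 unseal_nodes 3 http://<node1>:8200 http://<node2>:8200 http://<node3>:8200
```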

## Verifying the deployment

Check the Vault status:

```bash
export VAULT_ADDR='http://127.0.0.1:8200'
vault status
```

## Configuration files

### Nomad job file

`jobs/vault-cluster-exec.nomad` defines the Nomad job for the Vault service and uses the exec driver to deploy Vault on three nodes.

### Ansible playbook

`configuration/playbooks/install/install_vault.yml` installs the Vault package on the target nodes and creates the required directory structure.

## Troubleshooting

### Vault does not start

- Check the Nomad job status: `nomad job status vault-cluster-exec`
- Check the Nomad allocation logs: `nomad alloc logs <allocation_id>`
- Make sure Consul is running: `consul members`

### Vault cannot be unsealed

- Make sure the correct unseal keys are used
- Check the Vault status: `vault status`
- Check the Vault data in Consul: `consul kv get -recurse vault/`

## Next steps

After Vault is deployed successfully, you may want to:

1. Configure access policies
2. Enable secrets engines
3. Integrate with Nomad
4. Configure audit logging
5. Set up an auto-unseal mechanism (production)

See `docs/vault/vault_setup_guide.md` for more information.
169
docs/vault/vault_implementation_proposal.md
Normal file
@@ -0,0 +1,169 @@
# HashiCorp Vault Implementation Proposal

## 1. Current state of the project

### 1.1 Existing infrastructure

- **Multi-cloud environment**: Oracle Cloud, Huawei Cloud, Google Cloud, AWS, DigitalOcean
- **Infrastructure management**: OpenTofu (Terraform)
- **Configuration management**: Ansible
- **Container orchestration**: Nomad + Podman
- **Service discovery**: Consul (deployed on the warden, ash3c, and master nodes)
- **CI/CD**: Gitea Actions

### 1.2 Current state of secrets management

- Ansible Vault is used for part of the sensitive information
- Some secrets are stored in plain text in the repository (for example `security/secrets/key.md`)
- There is no unified mechanism for managing and rotating secrets
- There is no centralized access control or auditing

### 1.3 Security risks

- Plain-text secrets create potential security vulnerabilities
- Without rotation, long-lived credentials are more likely to leak
- Scattered secrets management increases both maintenance effort and risk
- Without auditing, it is hard to trace who accessed sensitive information and when

## 2. The HashiCorp Vault solution

### 2.1 About Vault

HashiCorp Vault is a secrets management and data protection tool designed for modern cloud environments. Its core features are:

- Secure storage of secrets and sensitive data
- Dynamic generation of short-lived credentials
- Encryption as a service
- Detailed audit logs
- Fine-grained access control

### 2.2 How Vault addresses the current problems

- **Centralized secrets management**: all secrets and sensitive information are stored and managed in one place
- **Dynamic secrets**: short-lived credentials are generated for databases, cloud services, and similar systems, reducing the risk of long-lived credentials leaking
- **Automatic rotation**: secrets are rotated automatically on a schedule, improving security
- **Access control**: role-based access control ensures that only authorized users can reach specific secrets
- **Audit logs**: every secret access is recorded in detail for security audits
- **Integration with the existing infrastructure**: Vault integrates with Nomad and Consul

## 3. Deployment plan

### 3.1 Deployment architecture

Vault should be deployed on the existing Consul cluster nodes (warden, ash3c, master) to form a highly available Vault cluster:

```
+-------------------+   +-------------------+   +-------------------+
|      warden       |   |       ash3c       |   |      master       |
|                   |   |                   |   |                   |
|  +-------------+  |   |  +-------------+  |   |  +-------------+  |
|  |   Consul    |  |   |  |   Consul    |  |   |  |   Consul    |  |
|  +-------------+  |   |  +-------------+  |   |  +-------------+  |
|                   |   |                   |   |                   |
|  +-------------+  |   |  +-------------+  |   |  +-------------+  |
|  |    Vault    |  |   |  |    Vault    |  |   |  |    Vault    |  |
|  +-------------+  |   |  +-------------+  |   |  +-------------+  |
+-------------------+   +-------------------+   +-------------------+
```

### 3.2 存储后端
|
||||
使用现有的Consul集群作为Vault的存储后端,利用Consul的高可用性和一致性特性:
|
||||
- Vault数据加密存储在Consul中
|
||||
- 利用Consul的分布式特性确保数据的高可用性
|
||||
- Vault服务器本身无状态,便于扩展和维护
|
||||
|
||||
### 3.3 资源需求
|
||||
每个节点上的Vault服务建议配置:
|
||||
- CPU: 2-4核
|
||||
- 内存: 4-8GB
|
||||
- 存储: 20GB (用于日志和临时数据)
|
||||
|
||||
### 3.4 网络配置
|
||||
- Vault API端口: 8200
|
||||
- Vault集群通信端口: 8201
|
||||
- 配置TLS加密所有通信
|
||||
- 设置适当的防火墙规则,限制对Vault API的访问
|
||||
|
||||
## 4. 实施计划
|
||||
|
||||
### 4.1 准备阶段
|
||||
1. **环境准备**
|
||||
- 在目标节点上安装必要的依赖
|
||||
- 生成TLS证书用于Vault通信加密
|
||||
- 配置防火墙规则
|
||||
|
||||
2. **配置文件准备**
|
||||
- 创建Vault配置文件
|
||||
- 配置Consul存储后端
|
||||
- 设置TLS和加密参数
|
||||
|
||||
### 4.2 部署阶段
|
||||
1. **初始部署**
|
||||
- 在三个节点上安装Vault
|
||||
- 配置为使用Consul作为存储后端
|
||||
- 初始化Vault并生成解封密钥
|
||||
|
||||
2. **高可用性配置**
|
||||
- 配置Vault集群
|
||||
- 设置自动解封机制
|
||||
- 配置负载均衡
|
||||
|
||||
### 4.3 集成阶段
|
||||
1. **与现有系统集成**
|
||||
- 配置Nomad使用Vault获取密钥
|
||||
- 更新Ansible脚本,使用Vault API获取敏感信息
|
||||
- 集成到CI/CD流程中
|
||||
|
||||
2. **密钥迁移**
|
||||
- 将现有密钥迁移到Vault
|
||||
- 设置密钥轮换策略
|
||||
- 移除代码库中的明文密钥
|
||||
|
||||
### 4.4 验证和测试
|
||||
1. **功能测试**
|
||||
- 验证Vault的基本功能
|
||||
- 测试密钥访问和管理
|
||||
- 验证高可用性和故障转移
|
||||
|
||||
2. **安全测试**
|
||||
- 进行渗透测试
|
||||
- 验证访问控制策略
|
||||
- 测试审计日志功能
|
||||
|
||||
## 5. 运维和管理
|
||||
|
||||
### 5.1 日常运维
|
||||
- 定期备份Vault数据
|
||||
- 监控Vault服务状态
|
||||
- 审查审计日志
|
||||
|
||||
### 5.2 灾难恢复
|
||||
- 制定详细的灾难恢复计划
|
||||
- 定期进行恢复演练
|
||||
- 确保解封密钥的安全存储
|
||||
|
||||
### 5.3 安全最佳实践
|
||||
- 实施最小权限原则
|
||||
- 定期轮换根密钥
|
||||
- 使用多因素认证
|
||||
- 定期审查访问策略
|
||||
|
||||
## 6. 实施时间表
|
||||
|
||||
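| Phase | Task | Estimated time |
|------|------|----------|
| Preparation | Environment preparation | 1 day |
| Preparation | Configuration file preparation | 1 day |
| Deployment | Initial deployment | 1 day |
| Deployment | High-availability configuration | 1 day |
| Integration | Integration with existing systems | 3 days |
| Integration | Secret migration | 2 days |
| Testing | Functional and security testing | 2 days |
| Documentation | Writing operations documentation | 1 day |
| **Total** | | **12 days** |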
|
||||
|
||||
## 7. 结论和建议
|
||||
|
||||
基于对当前基础设施和安全需求的分析,我们强烈建议在现有的Consul集群节点上部署HashiCorp Vault,以提升项目的安全性和密钥管理能力。
|
||||
|
||||
主要优势包括:
|
||||
- 消除明文密钥存储的安全风险
|
||||
- 提供集中式的密钥管理和访问控制
|
||||
- 支持动态密钥生成和自动轮换
|
||||
- 与现有的HashiCorp生态系统(Nomad、Consul)无缝集成
|
||||
- 提供详细的审计日志,满足合规要求
|
||||
|
||||
通过在现有节点上部署Vault,我们可以充分利用现有资源,同时显著提升项目的安全性,为多云环境提供统一的密钥管理解决方案。
|
||||
252
docs/vault/vault_setup_guide.md
Normal file
@@ -0,0 +1,252 @@
|
||||
# Vault 部署和配置指南
|
||||
|
||||
本文档提供了在现有Consul集群节点上部署和配置HashiCorp Vault的详细步骤。
|
||||
|
||||
## 1. 前置准备
|
||||
|
||||
### 1.1 创建数据目录
|
||||
|
||||
在每个节点上创建Vault数据目录:
|
||||
|
||||
```bash
|
||||
sudo mkdir -p /opt/vault/data
|
||||
sudo chown -R nomad:nomad /opt/vault
|
||||
```
|
||||
|
||||
### 1.2 生成TLS证书(生产环境必须)
|
||||
|
||||
```bash
|
||||
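# Generate a private CA certificate (the vault CLI has no certificate-generation subcommand, so OpenSSL is used; the CN values are placeholders)
openssl req -x509 -newkey rsa:4096 -nodes -days 3650 -subj "/CN=vault-ca" -keyout ca.key -out ca.cert

# Generate a server key and CSR, then sign the CSR with the CA (add subjectAltName entries for the real node names)
openssl req -newkey rsa:4096 -nodes -subj "/CN=vault.service.consul" -keyout server.key -out server.csr
openssl x509 -req -in server.csr -CA ca.cert -CAkey ca.key -CAcreateserial -days 365 -out server.cert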
|
||||
```
|
||||
|
||||
## 2. 部署Vault集群
|
||||
|
||||
### 2.1 使用Nomad部署
|
||||
|
||||
将`vault-cluster.nomad`文件提交到Nomad:
|
||||
|
||||
```bash
|
||||
nomad job run vault-cluster.nomad
|
||||
```
|
||||
|
||||
### 2.2 验证部署状态
|
||||
|
||||
```bash
|
||||
# 检查Nomad任务状态
|
||||
nomad job status vault-cluster
|
||||
|
||||
# 检查Vault服务状态
|
||||
curl http://localhost:8200/v1/sys/health
|
||||
```
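
The `/v1/sys/health` endpoint reports the node state through its HTTP status code, so a readiness check can branch on that code instead of parsing the response body. The following is a minimal sketch, assuming Vault's default status codes; the helper name `vault_health_state` is hypothetical and not part of the Vault CLI.

```shell
# Hypothetical helper (not part of the Vault CLI): map the HTTP status code
# returned by /v1/sys/health to a short state name, using Vault's default codes.
vault_health_state() {
  case "$1" in
    200) echo "active" ;;         # initialized, unsealed, active
    429) echo "standby" ;;        # unsealed, standby
    501) echo "uninitialized" ;;  # not yet initialized
    503) echo "sealed" ;;         # initialized but sealed
    *)   echo "unknown" ;;
  esac
}

# Usage (requires a reachable Vault node):
#   code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8200/v1/sys/health)
#   vault_health_state "$code"
```

Right after the Nomad job starts, a freshly deployed node would be expected to report `uninitialized`, and `sealed` once it has been initialized but not yet unsealed.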
|
||||
|
||||
## 3. 初始化和解封Vault
|
||||
|
||||
### 3.1 初始化Vault
|
||||
|
||||
在任一节点上执行:
|
||||
|
||||
```bash
|
||||
# 初始化Vault,生成解封密钥和根令牌
|
||||
vault operator init -key-shares=5 -key-threshold=3
|
||||
```
|
||||
|
||||
**重要提示:** 安全保存生成的解封密钥和根令牌!
|
||||
|
||||
### 3.2 解封Vault
|
||||
|
||||
在每个节点上执行解封操作(需要至少3个解封密钥):
|
||||
|
||||
```bash
|
||||
# 解封Vault
|
||||
vault operator unseal <解封密钥1>
|
||||
vault operator unseal <解封密钥2>
|
||||
vault operator unseal <解封密钥3>
|
||||
```
|
||||
|
||||
## 4. 配置Vault
|
||||
|
||||
### 4.1 登录Vault
|
||||
|
||||
```bash
|
||||
# 设置Vault地址
|
||||
export VAULT_ADDR='http://127.0.0.1:8200'
|
||||
|
||||
# 使用根令牌登录
|
||||
vault login <根令牌>
|
||||
```
|
||||
|
||||
### 4.2 启用密钥引擎
|
||||
|
||||
```bash
|
||||
# 启用KV v2密钥引擎
|
||||
vault secrets enable -version=2 kv
|
||||
|
||||
# 启用AWS密钥引擎(如需要)
|
||||
vault secrets enable aws
|
||||
|
||||
# 启用数据库密钥引擎(如需要)
|
||||
vault secrets enable database
|
||||
```
|
||||
|
||||
### 4.3 配置访问策略
|
||||
|
||||
```bash
|
||||
# 创建策略文件
|
||||
cat > nomad-server-policy.hcl <<EOF
|
||||
path "kv/data/nomad/*" {
|
||||
capabilities = ["read"]
|
||||
}
|
||||
EOF
|
||||
|
||||
# 创建策略
|
||||
vault policy write nomad-server nomad-server-policy.hcl
|
||||
|
||||
# 创建令牌
|
||||
vault token create -policy=nomad-server
|
||||
```
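
Because the `kv` mount is a KV version 2 engine, policies address secrets through the API path, which inserts `data/` after the mount name, while `vault kv put` and `vault kv get` take the shorter CLI path. The sketch below illustrates that mapping; `kv_v2_policy_path` is a hypothetical helper name, and it assumes the mount name is a single path segment such as `kv`.

```shell
# Hypothetical helper: convert a KV v2 CLI path (mount/secret/path) into the
# API path used in policy rules (mount/data/secret/path).
# Assumes the mount name contains no slash.
kv_v2_policy_path() {
  mount=${1%%/*}   # text before the first slash, e.g. "kv"
  rest=${1#*/}     # text after the first slash, e.g. "nomad/db"
  printf '%s/data/%s\n' "$mount" "$rest"
}

kv_v2_policy_path "kv/nomad/db"   # prints kv/data/nomad/db
```

A secret written with `vault kv put kv/nomad/db ...` is therefore matched by the `kv/data/nomad/*` rule in the policy file.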
|
||||
|
||||
## 5. 与Nomad集成
|
||||
|
||||
### 5.1 配置Nomad使用Vault
|
||||
|
||||
编辑Nomad配置文件(`/etc/nomad.d/nomad.hcl`),添加Vault配置:
|
||||
|
||||
```hcl
|
||||
vault {
|
||||
enabled = true
|
||||
address = "http://127.0.0.1:8200"
|
||||
token = "<Nomad服务器的Vault令牌>"
|
||||
}
|
||||
```
|
||||
|
||||
### 5.2 重启Nomad服务
|
||||
|
||||
```bash
|
||||
sudo systemctl restart nomad
|
||||
```
|
||||
|
||||
## 6. 迁移现有密钥到Vault
|
||||
|
||||
### 6.1 存储API密钥
|
||||
|
||||
```bash
|
||||
# Store the OCI API key (read from the file with @ so the secret does not appear in the process arguments)
vault kv put kv/oci/api-key key=@/root/mgmt/security/secrets/key.md
|
||||
|
||||
# 存储其他云服务商密钥
|
||||
vault kv put kv/aws/credentials aws_access_key_id="<访问密钥ID>" aws_secret_access_key="<秘密访问密钥>"
|
||||
```
|
||||
|
||||
### 6.2 配置密钥轮换策略
|
||||
|
||||
```bash
|
||||
# 为数据库凭据配置自动轮换
|
||||
vault write database/config/mysql \
|
||||
plugin_name=mysql-database-plugin \
|
||||
connection_url="{{username}}:{{password}}@tcp(database.example.com:3306)/" \
|
||||
allowed_roles="app-role" \
|
||||
username="root" \
|
||||
password="<数据库根密码>"
|
||||
|
||||
# 配置角色
|
||||
vault write database/roles/app-role \
|
||||
db_name=mysql \
|
||||
creation_statements="CREATE USER '{{name}}'@'%' IDENTIFIED BY '{{password}}';GRANT SELECT ON *.* TO '{{name}}'@'%';" \
|
||||
default_ttl="1h" \
|
||||
max_ttl="24h"
|
||||
```
|
||||
|
||||
## 7. 安全最佳实践
|
||||
|
||||
### 7.1 启用审计日志
|
||||
|
||||
```bash
|
||||
# 启用文件审计设备
|
||||
vault audit enable file file_path=/var/log/vault/audit.log
|
||||
```
|
||||
|
||||
### 7.2 配置自动解封(生产环境)
|
||||
|
||||
对于生产环境,建议配置自动解封机制,可以使用云KMS服务:
|
||||
|
||||
```hcl
|
||||
# AWS KMS自动解封配置示例
|
||||
seal "awskms" {
|
||||
region = "us-west-2"
|
||||
kms_key_id = "<AWS KMS密钥ID>"
|
||||
}
|
||||
```
|
||||
|
||||
### 7.3 定期轮换根密钥
|
||||
|
||||
```bash
|
||||
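# Rotate the encryption key that protects data in the storage backend
# (this does not change the unseal keys; use `vault operator rekey` for those)
vault operator rotate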
|
||||
```
|
||||
|
||||
## 8. 故障排除
|
||||
|
||||
### 8.1 检查Vault状态
|
||||
|
||||
```bash
|
||||
# 检查Vault状态
|
||||
vault status
|
||||
|
||||
# 检查密封状态
|
||||
vault status -format=json | jq '.sealed'
|
||||
```
|
||||
|
||||
### 8.2 检查Consul存储
|
||||
|
||||
```bash
|
||||
# 检查Consul中的Vault数据
|
||||
consul kv get -recurse vault/
|
||||
```
|
||||
|
||||
### 8.3 常见问题解决
|
||||
|
||||
- **Vault启动失败**:检查配置文件语法和权限
|
||||
- **解封失败**:确保使用正确的解封密钥
|
||||
- **API不可访问**:检查防火墙规则和监听地址配置
|
||||
|
||||
## 9. 备份和恢复
|
||||
|
||||
### 9.1 备份Vault数据
|
||||
|
||||
```bash
|
||||
# 备份Consul中的Vault数据
|
||||
consul snapshot save vault-backup.snap
|
||||
```
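
If snapshots are taken on a schedule, old files need to be pruned. The following is a sketch of one possible retention helper, not part of Consul or Vault: the function name `prune_snapshots`, the timestamped file naming, and the `/backups` directory in the usage comment are all assumptions made for illustration.

```shell
# Hypothetical helper: keep only the newest $2 snapshots in directory $1.
# Assumes files are named vault-backup-YYYYmmdd-HHMMSS.snap, so that
# sorting by name is the same as sorting by time.
prune_snapshots() {
  dir=$1
  keep=$2
  total=$(find "$dir" -maxdepth 1 -name 'vault-backup-*.snap' | wc -l)
  excess=$((total - keep))
  if [ "$excess" -le 0 ]; then
    return 0
  fi
  # Oldest names sort first; delete exactly the surplus.
  find "$dir" -maxdepth 1 -name 'vault-backup-*.snap' | sort | head -n "$excess" |
    while IFS= read -r f; do
      rm -f -- "$f"
    done
}

# Usage (requires a running Consul agent):
#   consul snapshot save "/backups/vault-backup-$(date +%Y%m%d-%H%M%S).snap"
#   prune_snapshots /backups 7
```

Snapshots contain the encrypted Vault data but not the unseal keys, so the keys still need to be stored separately for a restore to be usable.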
|
||||
|
||||
### 9.2 恢复Vault数据
|
||||
|
||||
```bash
|
||||
# 恢复Consul快照
|
||||
consul snapshot restore vault-backup.snap
|
||||
```
|
||||
|
||||
## 10. 日常维护
|
||||
|
||||
### 10.1 监控Vault状态
|
||||
|
||||
```bash
|
||||
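# Read metrics in Prometheus format (requires a `telemetry` stanza with
# `prometheus_retention_time` set in the Vault server configuration)
curl -H "X-Vault-Token: $VAULT_TOKEN" "$VAULT_ADDR/v1/sys/metrics?format=prometheus"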
|
||||
```
|
||||
|
||||
### 10.2 查看审计日志
|
||||
|
||||
```bash
|
||||
# Pretty-print the audit log (one JSON object per line)
jq . /var/log/vault/audit.log
|
||||
```
|
||||
|
||||
### 10.3 定期更新Vault版本
|
||||
|
||||
```bash
|
||||
# 更新Vault版本(通过更新Nomad作业)
|
||||
nomad job run -detach vault-cluster.nomad
```
245
docs/waypoint/waypoint_implementation_proposal.md
Normal file
@@ -0,0 +1,245 @@
|
||||
# HashiCorp Waypoint 实施方案论证
|
||||
|
||||
## 1. 项目现状分析
|
||||
|
||||
### 1.1 现有部署流程
|
||||
- **基础设施管理**: OpenTofu (Terraform)
|
||||
- **配置管理**: Ansible
|
||||
- **容器编排**: Nomad + Podman
|
||||
- **CI/CD**: Gitea Actions
|
||||
- **多云环境**: Oracle Cloud, 华为云, Google Cloud, AWS, DigitalOcean
|
||||
|
||||
### 1.2 当前部署流程挑战
|
||||
- 跨多个云平台的部署流程不一致
|
||||
- 不同环境(开发、测试、生产)的配置差异管理复杂
|
||||
- 应用生命周期管理分散在多个工具中
|
||||
- 缺乏统一的应用部署和发布界面
|
||||
- 开发团队需要了解多种工具和平台特性
|
||||
|
||||
### 1.3 现有GitOps工作流
|
||||
项目已实施GitOps工作流,包括:
|
||||
- 声明式配置存储在Git中
|
||||
- 通过CI/CD流水线自动应用变更
|
||||
- 状态收敛和监控
|
||||
|
||||
## 2. HashiCorp Waypoint 解决方案
|
||||
|
||||
### 2.1 Waypoint 简介
|
||||
HashiCorp Waypoint是一个应用部署工具,提供一致的工作流来构建、部署和发布应用,无论底层平台如何。主要特性包括:
|
||||
|
||||
- 统一的工作流接口
|
||||
- 多平台支持
|
||||
- 应用版本管理
|
||||
- 自动化发布控制
|
||||
- 可扩展的插件系统
|
||||
|
||||
### 2.2 Waypoint 如何补充现有工具链
|
||||
|
||||
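| Existing tool | Primary responsibility | What Waypoint adds |
|---------|---------|--------------|
| OpenTofu | Infrastructure management | Does not replace it; integrates with it and uses the infrastructure it has already created |
| Ansible | Configuration management | Ansible can be invoked as part of a build or deploy step |
| Nomad | Container orchestration | Integrates directly, simplifying deployment and management of Nomad jobs |
| Gitea Actions | CI/CD pipelines | Waypoint can be called from a pipeline, or can trigger a pipeline |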
|
||||
|
||||
### 2.3 Waypoint 与现有工具的协同工作
|
||||
```
|
||||
+----------------+     +----------------+     +----------------+
|    OpenTofu    |     |    Waypoint    |     |     Nomad      |
|                |---->|                |---->|                |
|(infrastructure)|     |(app deployment)|     |(orchestration) |
+----------------+     +----------------+     +----------------+
                               |
                               v
                       +----------------+
                       |    Ansible     |
                       |                |
                       | (config mgmt)  |
                       +----------------+
|
||||
```
|
||||
|
||||
## 3. Waypoint 实施价值分析
|
||||
|
||||
### 3.1 潜在优势
|
||||
|
||||
#### 3.1.1 开发体验提升
|
||||
- **简化接口**: 开发人员通过统一接口部署应用,无需了解底层平台细节
|
||||
- **本地开发一致性**: 开发环境与生产环境使用相同的部署流程
|
||||
- **快速反馈**: 部署结果和日志集中可见
|
||||
|
||||
#### 3.1.2 运维效率提升
|
||||
- **标准化部署流程**: 跨团队和项目的一致部署方法
|
||||
- **减少平台特定脚本**: 减少为不同平台维护的自定义脚本
|
||||
- **集中式部署管理**: 通过UI或CLI集中管理所有应用部署
|
||||
|
||||
#### 3.1.3 多云策略支持
|
||||
- **平台无关的部署**: 相同的Waypoint配置可用于不同云平台
|
||||
- **简化云迁移**: 更容易在不同云提供商之间迁移应用
|
||||
- **混合云支持**: 统一管理跨多个云平台的部署
|
||||
|
||||
#### 3.1.4 与现有HashiCorp生态系统集成
|
||||
- **Nomad集成**: 原生支持Nomad作为部署平台
|
||||
- **Consul集成**: 服务发现和配置管理
|
||||
- **Vault集成**: 安全获取部署所需的密钥和证书
|
||||
|
||||
### 3.2 潜在挑战
|
||||
|
||||
#### 3.2.1 实施成本
|
||||
- **学习曲线**: 团队需要学习新工具
|
||||
- **迁移工作**: 现有部署流程需要适配到Waypoint
|
||||
- **维护开销**: 额外的基础设施组件需要维护
|
||||
|
||||
#### 3.2.2 与现有流程的重叠
|
||||
- **与Gitea Actions重叠**: 部分功能与现有CI/CD流程重叠
|
||||
- **工具链复杂性**: 添加新工具可能增加整体复杂性
|
||||
|
||||
#### 3.2.3 成熟度考量
|
||||
- **相对较新的项目**: 与其他HashiCorp产品相比,Waypoint相对较新
|
||||
- **社区规模**: 社区和生态系统仍在发展中
|
||||
- **插件生态**: 某些特定平台的插件可能不够成熟
|
||||
|
||||
## 4. 实施方案
|
||||
|
||||
### 4.1 部署架构
|
||||
建议将Waypoint服务器部署在与Nomad和Consul相同的环境中:
|
||||
|
||||
```
|
||||
+-------------------+     +-------------------+     +-------------------+
|      warden       |     |       ash3c       |     |      master       |
|                   |     |                   |     |                   |
|  +-------------+  |     |  +-------------+  |     |  +-------------+  |
|  |   Consul    |  |     |  |   Consul    |  |     |  |   Consul    |  |
|  +-------------+  |     |  +-------------+  |     |  +-------------+  |
|                   |     |                   |     |                   |
|  +-------------+  |     |  +-------------+  |     |  +-------------+  |
|  |    Nomad    |  |     |  |    Nomad    |  |     |  |    Nomad    |  |
|  +-------------+  |     |  +-------------+  |     |  +-------------+  |
|                   |     |                   |     |                   |
|  +-------------+  |     |  +-------------+  |     |  +-------------+  |
|  |    Vault    |  |     |  |    Vault    |  |     |  |    Vault    |  |
|  +-------------+  |     |  +-------------+  |     |  +-------------+  |
|                   |     |                   |     |                   |
|  +-------------+  |     |                   |     |                   |
|  |  Waypoint   |  |     |                   |     |                   |
|  +-------------+  |     |                   |     |                   |
+-------------------+     +-------------------+     +-------------------+
|
||||
```
|
||||
|
||||
### 4.2 资源需求
|
||||
Waypoint服务器建议配置:
|
||||
- CPU: 2核
|
||||
- 内存: 2GB
|
||||
- 存储: 10GB
|
||||
|
||||
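### 4.3 Network configuration
- Waypoint gRPC port (CLI and runner connections): 9701 (Waypoint default)
- Waypoint HTTP port (web UI and HTTP API): 9702 (Waypoint default)
- Encrypt all communication with TLS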
|
||||
|
||||
## 5. 实施计划
|
||||
|
||||
### 5.1 试点阶段
|
||||
1. **环境准备**
|
||||
- 在单个节点上部署Waypoint服务器
|
||||
- 配置与Nomad、Consul和Vault的集成
|
||||
|
||||
2. **选择试点项目**
|
||||
- 选择一个非关键应用作为试点
|
||||
- 创建Waypoint配置文件
|
||||
- 实施构建、部署和发布流程
|
||||
|
||||
3. **评估结果**
|
||||
- 收集开发和运维反馈
|
||||
- 评估部署效率提升
|
||||
- 识别潜在问题和改进点
|
||||
|
||||
### 5.2 扩展阶段
|
||||
1. **扩展到更多应用**
|
||||
- 逐步将更多应用迁移到Waypoint
|
||||
- 创建标准化的Waypoint模板
|
||||
- 建立最佳实践文档
|
||||
|
||||
2. **团队培训**
|
||||
- 为开发和运维团队提供Waypoint培训
|
||||
- 创建内部知识库和示例
|
||||
|
||||
3. **与CI/CD集成**
|
||||
- 将Waypoint集成到现有Gitea Actions流水线
|
||||
- 实现自动触发部署
|
||||
|
||||
### 5.3 完全集成阶段
|
||||
1. **扩展到所有环境**
|
||||
- 在开发、测试和生产环境中统一使用Waypoint
|
||||
- 实现环境特定配置管理
|
||||
|
||||
2. **高级功能实施**
|
||||
- 配置自动回滚策略
|
||||
- 实现蓝绿部署和金丝雀发布
|
||||
- 集成监控和告警
|
||||
|
||||
3. **持续优化**
|
||||
- 定期评估和优化部署流程
|
||||
- 跟踪Waypoint更新和新功能
|
||||
|
||||
## 6. 实施时间表
|
||||
|
||||
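| Phase | Task | Estimated time |
|------|------|----------|
| Preparation | Environment preparation and Waypoint server deployment | 2 days |
| Pilot | Pilot project implementation | 5 days |
| Pilot | Evaluation and adjustment | 3 days |
| Expansion | Rollout to more applications | 10 days |
| Expansion | Team training | 2 days |
| Expansion | CI/CD integration | 3 days |
| Integration | Rollout to all environments | 5 days |
| Integration | Advanced feature implementation | 5 days |
| **Total** | | **35 days** |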
|
||||
|
||||
## 7. 成本效益分析
|
||||
|
||||
### 7.1 实施成本
|
||||
- **基础设施成本**: 低(利用现有节点)
|
||||
- **许可成本**: 无(开源版本)
|
||||
- **人力成本**: 中(学习和迁移工作)
|
||||
- **维护成本**: 低(与现有HashiCorp产品集成)
|
||||
|
||||
### 7.2 预期收益
|
||||
- **开发效率提升**: 预计减少20-30%的部署相关工作
|
||||
- **部署一致性**: 减少50%的环境特定问题
|
||||
- **上线时间缩短**: 预计缩短15-25%的应用上线时间
|
||||
- **运维负担减轻**: 减少跨平台部署脚本维护
|
||||
|
||||
### 7.3 投资回报周期
|
||||
- 预计在实施后3-6个月内开始看到明显收益
|
||||
- 完全投资回报预计在9-12个月内实现
|
||||
|
||||
## 8. 结论和建议
|
||||
|
||||
### 8.1 是否实施Waypoint的决策因素
|
||||
|
||||
#### 支持实施的因素
|
||||
- 项目已经使用HashiCorp生态系统(Nomad、Consul)
|
||||
- 多云环境需要统一的部署流程
|
||||
- 需要简化开发人员的部署体验
|
||||
- 应用部署流程需要标准化
|
||||
|
||||
#### 不支持实施的因素
|
||||
- 现有CI/CD流程已经满足需求
|
||||
- 团队资源有限,难以支持额外工具的学习和维护
|
||||
- 应用部署需求相对简单,不需要高级发布策略
|
||||
|
||||
### 8.2 建议实施路径
|
||||
|
||||
基于对项目现状的分析,我们建议采取**渐进式实施**策略:
|
||||
|
||||
1. **先实施Vault**: 优先解决安全问题,实施Vault进行密钥管理
|
||||
2. **小规模试点Waypoint**: 在非关键应用上试点Waypoint,评估实际价值
|
||||
3. **基于试点结果决定**: 根据试点结果决定是否扩大Waypoint的使用范围
|
||||
|
||||
### 8.3 最终建议
|
||||
|
||||
虽然Waypoint提供了统一的应用部署体验和多云支持,但考虑到项目已有相对成熟的GitOps工作流和CI/CD流程,Waypoint的实施优先级应低于Vault。
|
||||
|
||||
建议先完成Vault的实施,解决当前的安全问题,然后在资源允许的情况下,通过小规模试点评估Waypoint的实际价值。这种渐进式方法可以降低风险,同时确保资源投入到最有价值的改进上。
|
||||
|
||||
如果试点结果显示Waypoint能显著提升开发效率和部署一致性,再考虑更广泛的实施。
|
||||
712
docs/waypoint/waypoint_integration_examples.md
Normal file
@@ -0,0 +1,712 @@
|
||||
# Waypoint 集成示例
|
||||
|
||||
本文档提供了将Waypoint与现有基础设施和工具集成的具体示例。
|
||||
|
||||
## 1. 与Nomad集成
|
||||
|
||||
### 1.1 基本Nomad部署配置
|
||||
|
||||
```hcl
|
||||
app "api-service" {
|
||||
build {
|
||||
use "docker" {
|
||||
dockerfile = "Dockerfile"
|
||||
disable_entrypoint = true
|
||||
}
|
||||
}
|
||||
|
||||
deploy {
|
||||
use "nomad" {
|
||||
// Nomad集群地址
|
||||
address = "http://nomad-server:4646"
|
||||
|
||||
// 部署配置
|
||||
datacenter = "dc1"
|
||||
namespace = "default"
|
||||
|
||||
// 资源配置
|
||||
resources {
|
||||
cpu = 500
|
||||
memory = 256
|
||||
}
|
||||
|
||||
// 服务配置
|
||||
service_provider = "consul" {
|
||||
service_name = "api-service"
|
||||
tags = ["api", "v1"]
|
||||
|
||||
check {
|
||||
type = "http"
|
||||
path = "/health"
|
||||
interval = "10s"
|
||||
timeout = "2s"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 1.2 高级Nomad配置
|
||||
|
||||
```hcl
|
||||
app "web-app" {
|
||||
deploy {
|
||||
use "nomad" {
|
||||
// 基本配置...
|
||||
|
||||
// 存储卷配置
|
||||
volume_mount {
|
||||
volume = "app-data"
|
||||
destination = "/data"
|
||||
read_only = false
|
||||
}
|
||||
|
||||
// 网络配置
|
||||
network {
|
||||
mode = "bridge"
|
||||
port "http" {
|
||||
static = 8080
|
||||
to = 80
|
||||
}
|
||||
}
|
||||
|
||||
// 环境变量
|
||||
env {
|
||||
NODE_ENV = "production"
|
||||
}
|
||||
|
||||
// 健康检查
|
||||
health_check {
|
||||
timeout = "5m"
|
||||
check {
|
||||
name = "http-check"
|
||||
route = "/health"
|
||||
method = "GET"
|
||||
code = 200
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## 2. 与Vault集成
|
||||
|
||||
### 2.1 从Vault获取静态密钥
|
||||
|
||||
```hcl
|
||||
app "database-service" {
|
||||
deploy {
|
||||
use "nomad" {
|
||||
// 基本配置...
|
||||
|
||||
env {
|
||||
// 从Vault获取数据库凭据
|
||||
DB_USERNAME = dynamic("vault", {
|
||||
path = "kv/data/database/creds"
|
||||
key = "username"
|
||||
})
|
||||
|
||||
DB_PASSWORD = dynamic("vault", {
|
||||
path = "kv/data/database/creds"
|
||||
key = "password"
|
||||
})
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 2.2 使用Vault动态密钥
|
||||
|
||||
```hcl
|
||||
app "api-service" {
|
||||
deploy {
|
||||
use "nomad" {
|
||||
// 基本配置...
|
||||
|
||||
template {
|
||||
destination = "secrets/db-creds.txt"
|
||||
data = <<EOF
|
||||
{{- with secret "database/creds/api-role" -}}
|
||||
DB_USERNAME={{ .Data.username }}
|
||||
DB_PASSWORD={{ .Data.password }}
|
||||
{{- end -}}
|
||||
EOF
|
||||
}
|
||||
|
||||
env_from_file = ["secrets/db-creds.txt"]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## 3. 与Consul集成
|
||||
|
||||
### 3.1 服务发现配置
|
||||
|
||||
```hcl
|
||||
app "frontend" {
|
||||
deploy {
|
||||
use "nomad" {
|
||||
// 基本配置...
|
||||
|
||||
service_provider = "consul" {
|
||||
service_name = "frontend"
|
||||
|
||||
meta {
|
||||
version = "v1.2.3"
|
||||
team = "frontend"
|
||||
}
|
||||
|
||||
tags = ["web", "frontend"]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 3.2 使用Consul KV存储配置
|
||||
|
||||
```hcl
|
||||
app "config-service" {
|
||||
deploy {
|
||||
use "nomad" {
|
||||
// 基本配置...
|
||||
|
||||
template {
|
||||
destination = "config/app-config.json"
|
||||
data = <<EOF
|
||||
{
|
||||
"settings": {{ key "config/app-settings" | toJSON }},
|
||||
"features": {{ key "config/features" | toJSON }}
|
||||
}
|
||||
EOF
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## 4. 与Gitea Actions集成
|
||||
|
||||
### 4.1 基本CI/CD流水线
|
||||
|
||||
```yaml
|
||||
name: Build and Deploy

on:
  push:
    branches: [ main ]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Install Waypoint
        run: |
          curl -fsSL https://releases.hashicorp.com/waypoint/0.11.0/waypoint_0.11.0_linux_amd64.zip -o waypoint.zip
          unzip waypoint.zip
          sudo mv waypoint /usr/local/bin/

      - name: Configure Waypoint
        run: |
          waypoint context create \
            -server-addr=${{ secrets.WAYPOINT_SERVER_ADDR }} \
            -server-auth-token=${{ secrets.WAYPOINT_AUTH_TOKEN }} \
            -set-default ci-context

      - name: Build and Deploy
        run: waypoint up
|
||||
```
|
||||
|
||||
### 4.2 多环境部署流水线
|
||||
|
||||
```yaml
|
||||
name: Multi-Environment Deploy

on:
  push:
    branches: [ main, staging, production ]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Install Waypoint
        run: |
          curl -fsSL https://releases.hashicorp.com/waypoint/0.11.0/waypoint_0.11.0_linux_amd64.zip -o waypoint.zip
          unzip waypoint.zip
          sudo mv waypoint /usr/local/bin/

      - name: Configure Waypoint
        run: |
          waypoint context create \
            -server-addr=${{ secrets.WAYPOINT_SERVER_ADDR }} \
            -server-auth-token=${{ secrets.WAYPOINT_AUTH_TOKEN }} \
            -set-default ci-context

      - name: Determine Environment
        id: env
        run: |
          # Write the step output through the output file instead of the deprecated ::set-output command
          if [[ "${{ github.ref }}" == 'refs/heads/main' ]]; then
            echo "environment=development" >> "$GITHUB_OUTPUT"
          elif [[ "${{ github.ref }}" == 'refs/heads/staging' ]]; then
            echo "environment=staging" >> "$GITHUB_OUTPUT"
          elif [[ "${{ github.ref }}" == 'refs/heads/production' ]]; then
            echo "environment=production" >> "$GITHUB_OUTPUT"
          fi

      - name: Build and Deploy
        run: |
          waypoint up -workspace=${{ steps.env.outputs.environment }}
|
||||
```
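
The `Determine Environment` step has no fallback branch, so a ref outside the three listed branches would leave the workspace name empty. The sketch below shows the same mapping as a standalone shell function that fails explicitly in that case; the function name `workspace_for_ref` is hypothetical and not part of Waypoint or Gitea Actions.

```shell
# Hypothetical helper mirroring the "Determine Environment" step:
# map a Git ref to the Waypoint workspace name passed to `waypoint up -workspace=...`.
workspace_for_ref() {
  case "$1" in
    refs/heads/main)       echo "development" ;;
    refs/heads/staging)    echo "staging" ;;
    refs/heads/production) echo "production" ;;
    *) return 1 ;;  # unknown ref: fail rather than deploy with an empty workspace
  esac
}

# Usage inside a workflow step (sketch):
#   ws=$(workspace_for_ref "$GITHUB_REF") || exit 1
#   waypoint up -workspace="$ws"
```

Keeping the mapping in one function also makes it possible to test the branch rules locally without running the pipeline.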
|
||||
|
||||
## 5. 多云部署示例
|
||||
|
||||
### 5.1 AWS ECS部署
|
||||
|
||||
```hcl
|
||||
app "microservice" {
|
||||
build {
|
||||
use "docker" {}
|
||||
}
|
||||
|
||||
deploy {
|
||||
use "aws-ecs" {
|
||||
region = "us-west-2"
|
||||
cluster = "production"
|
||||
|
||||
service {
|
||||
name = "microservice"
|
||||
desired_count = 3
|
||||
|
||||
load_balancer {
|
||||
target_group_arn = "arn:aws:elasticloadbalancing:us-west-2:..."
|
||||
container_name = "microservice"
|
||||
container_port = 8080
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 5.2 Google Cloud Run部署
|
||||
|
||||
```hcl
|
||||
app "api" {
|
||||
build {
|
||||
use "docker" {}
|
||||
}
|
||||
|
||||
deploy {
|
||||
use "google-cloud-run" {
|
||||
project = "my-gcp-project"
|
||||
location = "us-central1"
|
||||
|
||||
port = 8080
|
||||
|
||||
capacity {
|
||||
memory = 512
|
||||
cpu_count = 1
|
||||
max_requests_per_container = 10
|
||||
request_timeout = 300
|
||||
}
|
||||
|
||||
auto_scaling {
|
||||
max_instances = 10
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 5.3 多云部署策略
|
||||
|
||||
```hcl
|
||||
// 使用变量决定部署目标
|
||||
variable "deploy_target" {
|
||||
type = string
|
||||
default = "nomad"
|
||||
}
|
||||
|
||||
app "multi-cloud-app" {
|
||||
build {
|
||||
use "docker" {}
|
||||
}
|
||||
|
||||
deploy {
|
||||
// 根据变量选择部署平台
|
||||
use dynamic {
|
||||
value = var.deploy_target
|
||||
|
||||
// Nomad部署配置
|
||||
nomad {
|
||||
datacenter = "dc1"
|
||||
// 其他Nomad配置...
|
||||
}
|
||||
|
||||
// AWS ECS部署配置
|
||||
aws-ecs {
|
||||
region = "us-west-2"
|
||||
cluster = "production"
|
||||
// 其他ECS配置...
|
||||
}
|
||||
|
||||
// Google Cloud Run部署配置
|
||||
google-cloud-run {
|
||||
project = "my-gcp-project"
|
||||
location = "us-central1"
|
||||
// 其他Cloud Run配置...
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## 6. 高级发布策略
|
||||
|
||||
### 6.1 蓝绿部署
|
||||
|
||||
```hcl
|
||||
app "web-app" {
|
||||
build {
|
||||
use "docker" {}
|
||||
}
|
||||
|
||||
deploy {
|
||||
use "nomad" {
|
||||
// 基本部署配置...
|
||||
}
|
||||
}
|
||||
|
||||
release {
|
||||
use "nomad-bluegreen" {
|
||||
service = "web-app"
|
||||
datacenter = "dc1"
|
||||
namespace = "default"
|
||||
|
||||
// 流量转移配置
|
||||
traffic_step = 25 // 每次转移25%的流量
|
||||
confirm_step = true // 每步需要确认
|
||||
|
||||
// 健康检查
|
||||
health_check {
|
||||
timeout = "2m"
|
||||
check {
|
||||
route = "/health"
|
||||
method = "GET"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 6.2 金丝雀发布
|
||||
|
||||
```hcl
|
||||
app "api-service" {
|
||||
build {
|
||||
use "docker" {}
|
||||
}
|
||||
|
||||
deploy {
|
||||
use "nomad" {
|
||||
// 基本部署配置...
|
||||
}
|
||||
}
|
||||
|
||||
release {
|
||||
use "nomad-canary" {
|
||||
service = "api-service"
|
||||
datacenter = "dc1"
|
||||
|
||||
// 金丝雀配置
|
||||
canary {
|
||||
percentage = 10 // 先发布到10%的实例
|
||||
duration = "15m" // 观察15分钟
|
||||
}
|
||||
|
||||
// 自动回滚配置
|
||||
auto_rollback = true
|
||||
|
||||
// 指标监控
|
||||
metrics {
|
||||
provider = "prometheus"
|
||||
address = "http://prometheus:9090"
|
||||
query = "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) > 0.01"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## 7. 自定义插件示例
|
||||
|
||||
### 7.1 自定义构建器插件
|
||||
|
||||
```go
|
||||
// custom_builder.go
|
||||
package main
|
||||
|
||||
import (
	"context"
	"fmt"
	"os/exec"

	sdk "github.com/hashicorp/waypoint-plugin-sdk"
	"github.com/hashicorp/waypoint-plugin-sdk/terminal"
)
|
||||
|
||||
// CustomBuilder 实现自定义构建逻辑
|
||||
type CustomBuilder struct {
|
||||
config BuildConfig
|
||||
}
|
||||
|
||||
type BuildConfig struct {
|
||||
Command string `hcl:"command"`
|
||||
}
|
||||
|
||||
// ConfigSet 设置配置
|
||||
func (b *CustomBuilder) ConfigSet(config interface{}) error {
|
||||
c, ok := config.(*BuildConfig)
|
||||
if !ok {
|
||||
return fmt.Errorf("invalid configuration")
|
||||
}
|
||||
b.config = *c
|
||||
return nil
|
||||
}
|
||||
|
||||
// BuildFunc 执行构建
|
||||
func (b *CustomBuilder) BuildFunc() interface{} {
|
||||
return b.build
|
||||
}
|
||||
|
||||
// Binary is assumed to be this plugin's generated artifact type (definition not shown).
func (b *CustomBuilder) build(ctx context.Context, ui terminal.UI) (*Binary, error) {
	// Obtain writers that stream the command's output through the Waypoint UI
	stdout, stderr, err := ui.OutputWriters()
	if err != nil {
		return nil, err
	}

	// Run the user-supplied build command
	cmd := exec.CommandContext(ctx, "sh", "-c", b.config.Command)
	cmd.Stdout = stdout
	cmd.Stderr = stderr

	if err := cmd.Run(); err != nil {
		return nil, err
	}

	return &Binary{
		Source: "custom",
	}, nil
}
|
||||
|
||||
// 注册插件
|
||||
func main() {
|
||||
sdk.Main(sdk.WithComponents(&CustomBuilder{}))
|
||||
}
|
||||
```
|
||||
|
||||
### 7.2 使用自定义插件
|
||||
|
||||
```hcl
|
||||
app "custom-app" {
|
||||
build {
|
||||
use "custom" {
|
||||
command = "make build"
|
||||
}
|
||||
}
|
||||
|
||||
deploy {
|
||||
use "nomad" {
|
||||
// 部署配置...
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## 8. 监控和可观测性集成
|
||||
|
||||
### 8.1 Prometheus集成
|
||||
|
||||
```hcl
|
||||
app "monitored-app" {
|
||||
deploy {
|
||||
use "nomad" {
|
||||
// 基本配置...
|
||||
|
||||
// Prometheus注解
|
||||
service_provider = "consul" {
|
||||
service_name = "monitored-app"
|
||||
|
||||
meta {
|
||||
"prometheus.io/scrape" = "true"
|
||||
"prometheus.io/path" = "/metrics"
|
||||
"prometheus.io/port" = "8080"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 8.2 与ELK堆栈集成
|
||||
|
||||
```hcl
|
||||
app "logging-app" {
|
||||
deploy {
|
||||
use "nomad" {
|
||||
// 基本配置...
|
||||
|
||||
// 日志配置
|
||||
logging {
|
||||
type = "fluentd"
|
||||
config {
|
||||
fluentd_address = "fluentd.service.consul:24224"
|
||||
tag = "app.${nomad.namespace}.${app.name}"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## 9. 本地开发工作流
|
||||
|
||||
### 9.1 本地开发配置
|
||||
|
||||
```hcl
|
||||
app "dev-app" {
|
||||
build {
|
||||
use "docker" {}
|
||||
}
|
||||
|
||||
deploy {
|
||||
use "docker" {
|
||||
service_port = 3000
|
||||
|
||||
// 开发环境特定配置
|
||||
env {
|
||||
NODE_ENV = "development"
|
||||
DEBUG = "true"
|
||||
}
|
||||
|
||||
// 挂载源代码目录
|
||||
binds {
|
||||
source = abspath("./src")
|
||||
destination = "/app/src"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 9.2 本地与远程环境切换
|
||||
|
||||
```hcl
|
||||
variable "environment" {
|
||||
type = string
|
||||
default = "local"
|
||||
}
|
||||
|
||||
app "fullstack-app" {
|
||||
build {
|
||||
use "docker" {}
|
||||
}
|
||||
|
||||
deploy {
|
||||
// 根据环境变量选择部署方式
|
||||
use dynamic {
|
||||
value = var.environment
|
||||
|
||||
// 本地开发
|
||||
local {
|
||||
use "docker" {
|
||||
// 本地Docker配置...
|
||||
}
|
||||
}
|
||||
|
||||
// 开发环境
|
||||
dev {
|
||||
use "nomad" {
|
||||
// 开发环境Nomad配置...
|
||||
}
|
||||
}
|
||||
|
||||
// 生产环境
|
||||
prod {
|
||||
use "nomad" {
|
||||
// 生产环境Nomad配置...
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## 10. 多应用协调
|
||||
|
||||
### 10.1 依赖管理
|
||||
|
||||
```hcl
|
||||
project = "microservices"
|
||||
|
||||
app "database" {
|
||||
// 数据库服务配置...
|
||||
}
|
||||
|
||||
app "backend" {
|
||||
// 后端API配置...
|
||||
|
||||
// 声明依赖关系
|
||||
depends_on = ["database"]
|
||||
}
|
||||
|
||||
app "frontend" {
|
||||
// 前端配置...
|
||||
|
||||
// 声明依赖关系
|
||||
depends_on = ["backend"]
|
||||
}
|
||||
```
|
||||
|
||||
### 10.2 共享配置
|
||||
|
||||
```hcl
|
||||
// 定义共享变量
|
||||
variable "version" {
|
||||
type = string
|
||||
default = "1.0.0"
|
||||
}
|
||||
|
||||
variable "environment" {
|
||||
type = string
|
||||
default = "development"
|
||||
}
|
||||
|
||||
// 共享函数
|
||||
function "service_name" {
|
||||
params = [name]
|
||||
result = "${var.environment}-${name}"
|
||||
}
|
||||
|
||||
// 应用配置
|
||||
app "api" {
|
||||
build {
|
||||
use "docker" {
|
||||
tag = "${var.version}"
|
||||
}
|
||||
}
|
||||
|
||||
deploy {
|
||||
use "nomad" {
|
||||
service_provider = "consul" {
|
||||
service_name = service_name("api")
|
||||
}
|
||||
|
||||
env {
|
||||
APP_VERSION = var.version
|
||||
ENVIRONMENT = var.environment
|
||||
}
|
||||
}
|
||||
}
|
||||
}
```
331
docs/waypoint/waypoint_setup_guide.md
Normal file
@@ -0,0 +1,331 @@
|
||||
# Waypoint 部署和配置指南
|
||||
|
||||
本文档提供了在现有基础设施上部署和配置HashiCorp Waypoint的详细步骤。
|
||||
|
||||
## 1. 前置准备
|
||||
|
||||
### 1.1 创建数据目录
|
||||
|
||||
在Waypoint服务器节点上创建数据目录:
|
||||
|
||||
```bash
|
||||
sudo mkdir -p /opt/waypoint/data
|
||||
sudo chown -R nomad:nomad /opt/waypoint
|
||||
```
|
||||
|
||||
### 1.2 安装Waypoint CLI
|
||||
|
||||
在开发机器和CI/CD服务器上安装Waypoint CLI:
|
||||
|
||||
```bash
|
||||
curl -fsSL https://releases.hashicorp.com/waypoint/0.11.0/waypoint_0.11.0_linux_amd64.zip -o waypoint.zip
|
||||
unzip waypoint.zip
|
||||
sudo mv waypoint /usr/local/bin/
|
||||
```
|
||||
|
||||
## 2. 部署Waypoint服务器
|
||||
|
||||
### 2.1 使用Nomad部署
|
||||
|
||||
将`waypoint-server.nomad`文件提交到Nomad:
|
||||
|
||||
```bash
|
||||
nomad job run waypoint-server.nomad
|
||||
```
|
||||
|
||||
### 2.2 验证部署状态
|
||||
|
||||
```bash
|
||||
# 检查Nomad任务状态
|
||||
nomad job status waypoint-server
|
||||
|
||||
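# Check that the Waypoint UI is reachable (HTTP listener, default port 9702;
# -k accepts the server's self-signed certificate)
curl -kI https://warden:9702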
|
||||
```
|
||||
|
||||
## 3. 初始化Waypoint
|
||||
|
||||
### 3.1 连接到Waypoint服务器
|
||||
|
||||
```bash
|
||||
# 连接CLI到服务器
|
||||
waypoint context create \
|
||||
  -server-addr=warden:9701 \
|
||||
-server-tls-skip-verify \
|
||||
-set-default my-waypoint-server
|
||||
```
|
||||
|
||||
### 3.2 验证连接
|
||||
|
||||
```bash
|
||||
waypoint context verify
|
||||
waypoint server info
|
||||
```
|
||||
|
||||
## 4. 配置Waypoint
|
||||
|
||||
### 4.1 配置Nomad作为运行时平台
|
||||
|
||||
```bash
|
||||
# 确认Nomad连接
|
||||
waypoint config source-set -type=nomad nomad-platform \
|
||||
addr=http://localhost:4646
|
||||
```
|
||||
|
||||
### 4.2 配置与Vault的集成
|
||||
|
||||
```bash
|
||||
# Configure the Vault config sourcer (plugin options are passed with -config)
waypoint config source-set -type=vault \
  -config=addr=http://localhost:8200 \
  -config=token=<vault-token>
|
||||
```
|
||||
|
||||
## 5. 创建第一个Waypoint项目
|
||||
|
||||
### 5.1 创建项目配置文件
|
||||
|
||||
在应用代码目录中创建`waypoint.hcl`文件:
|
||||
|
||||
```hcl
|
||||
project = "example-app"
|
||||
|
||||
app "web" {
|
||||
build {
|
||||
use "docker" {
|
||||
dockerfile = "Dockerfile"
|
||||
}
|
||||
}
|
||||
|
||||
deploy {
|
||||
use "nomad" {
|
||||
datacenter = "dc1"
|
||||
namespace = "default"
|
||||
|
||||
service_provider = "consul" {
|
||||
service_name = "web"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 5.2 初始化和部署项目
|
||||
|
||||
```bash
|
||||
# 初始化项目
|
||||
cd /path/to/app
|
||||
waypoint init
|
||||
|
||||
# 部署应用
|
||||
waypoint up
|
||||
```
|
||||
|
||||
## 6. 与现有工具集成
|
||||
|
||||
### 6.1 与Gitea Actions集成
|
||||
|
||||
创建一个Gitea Actions工作流文件`.gitea/workflows/waypoint.yml`:
|
||||
|
||||
```yaml
|
||||
name: Waypoint Deploy

on:
  push:
    branches: [ main ]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Install Waypoint
        run: |
          curl -fsSL https://releases.hashicorp.com/waypoint/0.11.0/waypoint_0.11.0_linux_amd64.zip -o waypoint.zip
          unzip waypoint.zip
          sudo mv waypoint /usr/local/bin/

      - name: Configure Waypoint
        run: |
          waypoint context create \
            -server-addr=${{ secrets.WAYPOINT_SERVER_ADDR }} \
            -server-auth-token=${{ secrets.WAYPOINT_AUTH_TOKEN }} \
            -set-default ci-context

      - name: Deploy Application
        run: waypoint up -app=web
|
||||
```
|
||||
|
||||
### 6.2 Integration with Vault

Use Vault in `waypoint.hcl` to supply sensitive configuration:

```hcl
app "web" {
  # Dynamic values are declared in the app-level config stanza
  config {
    env = {
      DB_PASSWORD = dynamic("vault", {
        path = "kv/data/app/db"
        key  = "password"
      })
    }
  }

  deploy {
    use "nomad" {
      # Other settings...
    }
  }
}
```

## 7. Advanced configuration

### 7.1 Configure blue-green deployment

```hcl
app "web" {
  deploy {
    use "nomad" {
      # Basic settings...
    }
  }

  release {
    use "nomad-bluegreen" {
      service      = "web"
      datacenter   = "dc1"
      namespace    = "default"
      traffic_step = 25
      confirm_step = true
    }
  }
}
```

### 7.2 Configure canary releases

```hcl
app "web" {
  deploy {
    use "nomad" {
      # Basic settings...
    }
  }

  release {
    use "nomad-canary" {
      service    = "web"
      datacenter = "dc1"
      namespace  = "default"

      canary {
        percentage = 10
        duration   = "5m"
      }
    }
  }
}
```

### 7.3 Configure automatic rollback

```hcl
app "web" {
  deploy {
    use "nomad" {
      # Basic settings...

      health_check {
        timeout = "5m"
        check {
          name   = "http-check"
          route  = "/health"
          method = "GET"
          code   = 200
        }
      }
    }
  }
}
```

## 8. Monitoring and logs

### 8.1 Check deployment status

```bash
# List all projects
waypoint project list

# List the deployments of a specific application
waypoint deployment list -app=web

# Show the details of a deployment
waypoint deployment inspect <deployment-id>
```

### 8.2 View application logs

```bash
# View application logs
waypoint logs -app=web
```

## 9. Backup and restore

### 9.1 Back up Waypoint data

```bash
# Back up the data directory
tar -czf waypoint-backup.tar.gz /opt/waypoint/data
```

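A single fixed file name means each backup overwrites the previous one. The sketch below is one possible variation that writes a timestamped archive and confirms it is readable before reporting success. The function name `backup_dir` and the destination directory are hypothetical; paths are stored relative to `/`, so the archive can be restored with `tar -xzf <archive> -C /` as in the restore step of this guide.

```bash
#!/usr/bin/env bash
# backup_dir SRC DEST_DIR (hypothetical helper)
# Archives the absolute directory SRC into DEST_DIR/waypoint-backup-<timestamp>.tar.gz
# and prints the archive path on success.
backup_dir() {
  local src="$1" dest_dir="$2" stamp archive
  case "$src" in
    /*) ;;
    *) echo "source must be an absolute path: $src" >&2; return 1 ;;
  esac
  if [ ! -d "$src" ]; then
    echo "source directory not found: $src" >&2
    return 1
  fi
  mkdir -p "$dest_dir" || return 1
  stamp="$(date +%Y%m%d-%H%M%S)"
  archive="$dest_dir/waypoint-backup-$stamp.tar.gz"
  # Store paths relative to / so that extraction with -C / puts files back in place
  tar -czf "$archive" -C / "${src#/}" || return 1
  # Listing the archive checks that it can be read back
  tar -tzf "$archive" >/dev/null || return 1
  echo "$archive"
}

# Example (not executed here):
#   backup_dir /opt/waypoint/data /var/backups/waypoint
```
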
### 9.2 Restore Waypoint data

```bash
# Stop the Waypoint service
nomad job stop waypoint-server

# Restore the data
rm -rf /opt/waypoint/data/*
tar -xzf waypoint-backup.tar.gz -C /

# Restart the service
nomad job run waypoint-server.nomad
```

## 10. Troubleshooting

### 10.1 Common issues

1. **Connection problems**:
   - Check that the Waypoint server is running normally
   - Verify network connectivity and firewall rules

2. **Deployment failures**:
   - Check the state of the Nomad cluster
   - Review the detailed deployment logs: `waypoint logs -app=<app> -deploy=<deployment-id>`

3. **Permission problems**:
   - Make sure Waypoint has sufficient permissions to access Nomad and Vault

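The connectivity checks in the list above can be scripted. The following is a minimal sketch using bash's built-in `/dev/tcp` redirection and the coreutils `timeout` command; the function names are hypothetical. Ports 4646 (Nomad) and 8200 (Vault) come from the addresses used earlier in this guide, while 9701 for the Waypoint server is an assumption that should be adjusted to the actual deployment.

```bash
#!/usr/bin/env bash
# check_port HOST PORT (hypothetical helper)
# Succeeds if a TCP connection to HOST:PORT can be opened within 3 seconds.
check_port() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# report HOST PORT LABEL (hypothetical helper)
# Prints a one-line reachability result and returns non-zero when unreachable.
report() {
  if check_port "$1" "$2"; then
    echo "$3 reachable at $1:$2"
  else
    echo "$3 NOT reachable at $1:$2"
    return 1
  fi
}

# Example (not executed here); 9701 is an assumed Waypoint server port:
#   report localhost 9701 "Waypoint server"
#   report localhost 4646 "Nomad"
#   report localhost 8200 "Vault"
```

A successful TCP connection only shows that something is listening; it does not prove that authentication or permissions are correct.
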
### 10.2 Debugging commands

```bash
# Check the Waypoint server status
waypoint server info

# Verify the Nomad connection
waypoint config source-get nomad-platform

# Enable debug logging
WAYPOINT_LOG_LEVEL=debug waypoint up
```

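## 11. Best practices

1. **Modular configuration**: Extract shared configuration into reusable Waypoint plugins
2. **Environment variables**: Use environment variables to distinguish configuration between environments
3. **Version control**: Keep the `waypoint.hcl` file under version control
4. **Automated testing**: Add automated test steps before deployment
5. **Monitoring integration**: Integrate deployment status with the monitoring system
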
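As one illustration of the environment-variable practice above, a small wrapper can map an environment variable onto a Waypoint workspace. This is a sketch under assumptions: the variable name `DEPLOY_ENV`, the allowed values, and the function name `deploy_cmd` are all hypothetical. The function only prints the command it would run, so the result can be reviewed or passed to `eval` in a CI step.

```bash
#!/usr/bin/env bash
# deploy_cmd (hypothetical helper)
# Prints the waypoint command for the environment named in DEPLOY_ENV (default: dev).
# Rejects values outside the allowed set so a typo cannot target an unintended workspace.
deploy_cmd() {
  local env="${DEPLOY_ENV:-dev}"
  case "$env" in
    dev|staging|prod) ;;
    *) echo "unknown DEPLOY_ENV: $env" >&2; return 1 ;;
  esac
  echo "waypoint up -app=web -workspace=$env"
}

# Example (not executed here):
#   DEPLOY_ENV=staging deploy_cmd
```
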