Clean up repository: remove backup files and reorganize infrastructure components
144
docs/CONSUL_ARCHITECTURE.md
Normal file
@@ -0,0 +1,144 @@
# Consul Cluster Architecture Design

## Current Architecture

### Consul Servers (3)
- **master** (100.117.106.136) - South Korea, current leader
- **warden** (100.122.197.112) - Beijing, voter
- **ash3c** (100.116.80.94) - United States, voter

### Consul Clients (1+)
- **hcp1** (100.97.62.111) - Beijing, system-level client

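## Architecture Advantages

### ✅ Strengths of the current design:
1. **High availability** - 3 servers tolerate the failure of 1
2. **Geographic distribution** - spread across three regions for strong disaster tolerance
3. **Performance** - each region has a local server
4. **Scalability** - clients can be added as needed
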
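
The fault-tolerance figure in point 1 follows from Raft's majority rule: a cluster of N voting servers needs floor(N/2) + 1 of them to agree, and it survives the loss of the rest. A minimal shell sketch of that arithmetic; the function names `quorum_size` and `fault_tolerance` are illustrative and are not Consul commands:

```bash
# Illustrative helpers (not part of Consul): Raft majority arithmetic.

# Number of voting servers that must agree in a cluster of $1 servers.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Number of servers that can fail while a quorum still exists.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 3 5; do
  echo "servers=$n quorum=$(quorum_size "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

For the three servers above this gives a quorum of 2 and a tolerance of 1 failure. A fourth voter would raise the quorum to 3 without improving tolerance, which is why odd server counts are the usual choice.
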
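### ✅ Why running hcp1 as a client is correct:
1. **Services register locally** - Traefik runs on hcp1, so a local client is the most efficient path
2. **Lower network latency** - service registration does not cross networks
3. **Better health checks** - a local client can check service state more accurately
4. **Fault isolation** - a failure of the hcp1 client does not affect cluster consensus
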
## Scaling Recommendations

### 🎯 Ideal client deployment:
```
Every node that runs a business service should have a Consul client:

┌────────────────┬──────────────┬───────────────────┐
│ Node           │ Client       │ Service           │
├────────────────┼──────────────┼───────────────────┤
│ master         │ ✓ (built-in) │ Consul            │
│ warden         │ ✓ (built-in) │ Consul            │
│ ash3c          │ ✓ (built-in) │ Consul            │
│ hcp1           │ ✓ (separate) │ Traefik           │
│ other nodes... │ recommended  │ other services... │
└────────────────┴──────────────┴───────────────────┘
```

### 🔧 Client configuration standard:
```hcl
# Consul client configuration for hcp1 (/etc/consul.d/consul.hcl)
datacenter = "dc1"
data_dir   = "/opt/consul"
log_level  = "INFO"
node_name  = "hcp1"
bind_addr  = "100.97.62.111"

# Join through any of the servers
retry_join = [
  "100.117.106.136", # master
  "100.122.197.112", # warden
  "100.116.80.94"    # ash3c
]

# Client mode
server = false
ui_config {
  enabled = false # a client does not need the UI
}

# Service discovery and health checks
ports {
  grpc = 8502
  http = 8500
}

connect {
  enabled = true
}
```

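
Once a client is running with a configuration like the one above, one way to confirm that it has joined is to count the servers it reports as alive. The sketch below assumes the usual column layout of `consul members` output (node, address, status, type, ...); the helper name `count_alive_servers` is hypothetical.

```bash
# Hypothetical helper: count members whose Status is "alive" and Type is "server".
# Reads `consul members` style output on stdin and skips the header row.
# Assumes Status is column 3 and Type is column 4.
count_alive_servers() {
  awk 'NR > 1 && $3 == "alive" && $4 == "server" { n++ } END { print n + 0 }'
}

# Usage on hcp1 (requires a running local agent):
#   consul members | count_alive_servers
# With master, warden and ash3c all healthy, the expected count is 3.
```

Running `consul validate /etc/consul.d/` before restarting the agent is a cheap additional check that the file parses.
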
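## Service Registration Strategy

### 🎯 Recommended options:
1. **Nomad automatic registration** (preferred)
   - Configured through Nomad's `consul` block
   - Handles the service lifecycle automatically
   - Integrates with the deployment workflow

2. **Local client registration** (current approach)
   - Goes through the local Consul client
   - Managed by hand, but more flexible
   - Suits complex registration logic

3. **Catalog API registration** (emergency fallback)
   - Calls the Consul API directly
   - Bypasses sync problems
   - Used for failure recovery
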
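
As a rough illustration of option 3, the sketch below builds a request body for Consul's `/v1/catalog/register` endpoint. The helper name `catalog_payload` is hypothetical, and the service name `traefik` and port `80` are example values rather than settings taken from this cluster.

```bash
# Hypothetical helper: build a /v1/catalog/register payload.
# Arguments: node name, node address, service name, service port.
# The service ID is set equal to the service name for simplicity.
catalog_payload() {
  printf '{"Node":"%s","Address":"%s","Service":{"ID":"%s","Service":"%s","Address":"%s","Port":%s}}\n' \
    "$1" "$2" "$3" "$3" "$2" "$4"
}

# Print the payload for Traefik on hcp1 (example values).
catalog_payload hcp1 100.97.62.111 traefik 80

# To send it through the local agent (left commented out because it modifies the catalog):
#   catalog_payload hcp1 100.97.62.111 traefik 80 |
#     curl -s -X PUT --data @- http://127.0.0.1:8500/v1/catalog/register
```

Entries written this way are not owned by any agent, so an agent running under the same node name may remove them during its next sync. That fits the "emergency fallback" role described above better than day-to-day use.
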
### 🔄 Migrating to Nomad registration:
```hcl
# In the Nomad client configuration
consul {
  address             = "127.0.0.1:8500" # local Consul client
  server_service_name = "nomad"
  client_service_name = "nomad-client"
  auto_advertise      = true
  server_auto_join    = false
  client_auto_join    = true
}
```

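## Monitoring and Maintenance

### 📊 Key metrics:
- **Raft index sync** - confirm that all servers hold consistent data
- **Client connection state** - monitor each client's connection to the servers
- **Service registration latency** - track the time from registration to discoverability
- **Health check status** - monitor service health
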
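
For the first metric, a simple check is to read each server's last Raft log index and look at the spread. The sketch below only does the comparison; collecting the indexes (for example from the `raft` section of `consul info` on each server) is left out, and the helper name and sample numbers are made up for illustration.

```bash
# Hypothetical helper: given one Raft index per server, print max - min.
raft_index_spread() {
  printf '%s\n' "$@" | sort -n |
    awk 'NR == 1 { min = $1 } { max = $1 } END { print max - min }'
}

# Example with made-up indexes for master, warden and ash3c.
raft_index_spread 48213 48213 48190
```

A spread that stays small and returns to zero when writes stop suggests the followers are keeping up. A gap that keeps growing on one server points at replication trouble on that link.
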
### 🛠️ Maintenance scripts:
```bash
# Cluster health check
./scripts/consul-cluster-health.sh

# Service sync verification
./scripts/verify-service-sync.sh

# Failure recovery
./scripts/consul-recovery.sh
```

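## Failure Handling

### 🚨 Common problems:
1. **Server failure** - automatic failover with no intervention, as long as 2 of the 3 servers remain available
2. **Client disconnect** - restart the client; it rejoins automatically
3. **Service sync problems** - force a sync through the Catalog API
4. **Network partition** - handled by Raft: the side that keeps a majority continues to serve

### 🔧 Recovery steps:
1. Check the cluster state
2. Verify network connectivity
3. Restart the affected components
4. Force services to re-register
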
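
The four steps can be sketched as a dry-run helper that prints one candidate command per step instead of executing it. This is only an outline under several assumptions: the nodes reach each other over Tailscale, Consul runs as a systemd unit named `consul`, and the service definition file path is a placeholder. The helper name `recovery_plan` is hypothetical.

```bash
# Print (do not execute) candidate commands for each recovery step.
# $1 = node to check, $2 = service definition file (both example values).
recovery_plan() (
  node=$1
  service_file=$2
  cat <<EOF
consul members
consul operator raft list-peers
tailscale ping $node
sudo systemctl restart consul
consul services register $service_file
EOF
)

recovery_plan hcp1 /etc/consul.d/traefik.json
```

The first two commands cover step 1 (membership and Raft peers), and the remaining three map to steps 2 to 4 in order. Printing the plan first makes it easy to review before anything is restarted.
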
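---

**Conclusion**: The current architecture is sound, and running hcp1 as a client is the right choice. Keep the existing architecture and consider adding a Consul client to the other business nodes.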
188
docs/CONSUL_ARCHITECTURE_OPTIMIZATION.md
Normal file
@@ -0,0 +1,188 @@
# Consul Architecture Optimization Plan

## Analysis of Current Pain Points

### Current network latency:
- **Within Beijing**: ~0.6 ms (same office)
- **Beijing ↔ South Korea**: ~72 ms
- **Beijing ↔ United States**: ~215 ms

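
These figures bound how fast a write can commit. A Raft leader acknowledges a write once a majority of servers, itself included, has stored it, so with three servers it waits only for the faster of its two followers. The sketch below works through that arithmetic. It assumes the figures above are round-trip times and that processing time is negligible; the helper name `commit_latency_ms` is illustrative.

```bash
# Hypothetical helper: estimate the Raft commit latency seen by the leader.
# Arguments: round-trip time in ms from the leader to each follower.
# With N followers the cluster has N+1 servers, and the leader needs
# acknowledgements from (N+1)/2 followers (integer division) for a majority.
commit_latency_ms() (
  needed=$(( ($# + 1) / 2 ))
  printf '%s\n' "$@" | sort -n | sed -n "${needed}p"
)

# Leader in Beijing (warden): followers are master (~72 ms) and ash3c (~215 ms).
commit_latency_ms 72 215
```

With warden as leader the estimate is 72 ms: the Korean follower completes the majority, and the trans-Pacific link is off the commit path. The ~215 ms link comes into play when the leader sits in the United States, or when a request from Beijing first has to be forwarded there. The South Korea ↔ United States latency is not listed above, so those cases cannot be computed from these numbers alone.
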
### Node distribution:
- **Beijing**: warden, hcp1, influxdb1, browser (4 nodes)
- **South Korea**: master (1 node)
- **United States**: ash3c (1 node)

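## Architecture Trade-offs

### 🏛️ Option 1: Current geographically distributed architecture
```
Consul Servers: master (South Korea) + warden (Beijing) + ash3c (United States)

Pros:
✅ True high availability - the cluster keeps working if any single region fails
✅ Disaster recovery - earthquakes, power cuts, and network outages are covered by replicas elsewhere
✅ Load spread globally

Cons:
❌ Write latency of ~200 ms (cross-Pacific consensus)
❌ High network cost
❌ Complex operations
```
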
### 🏢 Option 2: Beijing-centralized architecture
```
Consul Servers: warden + hcp1 + influxdb1 (all in Beijing)

Pros:
✅ Very low latency of ~0.6 ms
✅ Simple operations
✅ Low cost

Cons:
❌ Single point of failure - a Beijing network outage takes everything down
❌ No disaster recovery
❌ "Echo chamber" - South Korea and the United States are always in the minority
```

### 🎯 Option 3: Hybrid architecture (recommended)
```
Primary Cluster (Beijing): 3 servers - handles day-to-day traffic
Backup Cluster (global): 3 servers - disaster recovery

Or:
Local Consul (Beijing): fast local service discovery
Global Consul (distributed): cross-region service discovery
```

## 🚀 Recommended Implementation

### Phase 1: Tune the current architecture
```hcl
# Consul agent configuration

# 1. Raft settings for trans-oceanic latency
raft_protocol           = 3
raft_snapshot_threshold = 16384
raft_trailing_logs      = 10000

# 2. Local cache rate limits
cache {
  entry_fetch_max_burst = 42
  entry_fetch_rate      = 30
}

# 3. Network tolerance
performance {
  raft_multiplier = 5 # looser Raft timing; 5 is also Consul's default, 10 is the maximum
}
```

### Phase 2: Deploy local Consul clients
```bash
# Deploy a Consul client on every Beijing node.
# deploy_consul_client stands in for the real deployment step and is not defined here.
nodes=(hcp1 influxdb1 browser)

# Servers to join, with the local one (warden) listed first.
# retry_join uses the Serf LAN port (8301 by default), so no port is given;
# 8300 is the server RPC port and is not a join address.
retry_join=(
  "warden.tailnet-68f9.ts.net"
  "master.tailnet-68f9.ts.net"
  "ash3c.tailnet-68f9.ts.net"
)

for node in "${nodes[@]}"; do
  deploy_consul_client "$node" "${retry_join[@]}"
done
```

### Phase 3: Smart routing
```hcl
# Location-aware routing (Consul agent configuration)
# Beijing nodes prefer warden
# South Korea nodes prefer master
# United States nodes prefer ash3c

connect {
  enabled = true
}

# Local-first policy: label each node with its location
node_meta {
  region = "beijing"
  zone   = "office-1"
}
```

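## 🎯 Final Recommendation

### For your scenario:

**Keep the current 3-node geographic distribution, but tune performance:**

1. **Accept the latency** - 200 ms is acceptable for most applications
2. **Optimize local access** - deploy more Consul clients
3. **Cache intelligently** - cache hot data locally
4. **Split reads and writes** - serve reads locally, send writes through Raft
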
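
Point 4 maps onto the consistency modes of Consul's HTTP API: a read sent with the `stale` query parameter may be answered by any server without a round trip to the leader, while writes always go through Raft. A minimal sketch, in which the helper name `service_read_url` is illustrative and `consul-lb` is used only as an example service name:

```bash
# Hypothetical helper: build a catalog read URL against the local agent.
# $1 = service name, $2 = consistency mode: "default", "stale" or "consistent".
service_read_url() (
  base="http://127.0.0.1:8500/v1/catalog/service/$1"
  case $2 in
    stale|consistent) echo "$base?$2" ;;
    *)                echo "$base" ;;
  esac
)

service_read_url consul-lb stale

# Usage against a running agent:
#   curl -s "$(service_read_url consul-lb stale)"
```

Stale reads trade freshness for latency. The `X-Consul-LastContact` and `X-Consul-KnownLeader` response headers indicate how far behind the answering server may be, which helps decide whether a stale answer is good enough for a given caller.
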
### Specific optimizations:

```bash
# 1. Deploy a Consul client on all 4 Beijing nodes
./scripts/deploy-consul-clients.sh beijing

# 2. Configure the local-first policy (Consul agent HCL; the file name is an example)
cat > /etc/consul.d/local-first.hcl <<'EOF'
datacenter = "dc1"
node_meta = {
  region = "beijing"
}

# Local access: enable the UI on this agent
ui_config {
  enabled = true
}

# Cache settings
cache {
  entry_fetch_max_burst = 42
}
EOF

# 3. Application-level optimizations
# - Use a local DNS cache
# - Batch operations to reduce Raft writes
# - Update non-critical data asynchronously
```

## 🔍 Monitoring Metrics

```bash
# Key metrics to monitor
consul_metrics=(
  "consul.raft.commitTime"         # Raft commit latency
  "consul.raft.leader.lastContact" # time since the leader last contacted its followers
  "consul.dns.stale_queries"       # stale DNS queries
  "consul.catalog.register"        # catalog registration time
)
```

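## 💡 Conclusion

**Your analysis is correct.**

- ✅ **Geographic distribution does carry a latency cost**
- ✅ **Centralizing in Beijing would indeed be an echo chamber**
- ✅ **This is a fundamental trade-off in distributed systems**

**Best strategy: keep the current architecture and reduce the latency impact through tuning**

Reasons:
1. **200 ms of latency is acceptable for most workloads**
2. **True high availability matters more than latency**
3. **Caching and tuning can improve the experience substantially**

Your technical judgment is accurate. This is a trade-off with no perfect answer.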
170
docs/CONSUL_SERVICE_REGISTRATION.md
Normal file
@@ -0,0 +1,170 @@
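# Consul Service Registration Solution

## Background

The trans-Pacific Nomad + Consul cluster ran into the following problems:
1. **RFC 1918 addresses** - Nomad's automatic registration uses private IPs that are unreachable across networks
2. **Consul leader rotation** - services were registered on a single node and disappeared when the leader changed
3. **Service flapping** - failed health checks caused services to register and deregister repeatedly
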
## Solution

### 1. Redundant registration on multiple nodes

**Core idea: register the service on every Consul node at once so that leader rotation has no effect**

#### Consul cluster nodes:
- `master.tailnet-68f9.ts.net:8500` (South Korea, usually the leader)
- `warden.tailnet-68f9.ts.net:8500` (Beijing, preferred node)
- `ash3c.tailnet-68f9.ts.net:8500` (United States, backup node)

#### Registration script: `scripts/register-traefik-to-all-consul.sh`

```bash
#!/bin/bash
# Register the Traefik service on all three Consul nodes

CONSUL_NODES=(
  "master.tailnet-68f9.ts.net:8500"
  "warden.tailnet-68f9.ts.net:8500"
  "ash3c.tailnet-68f9.ts.net:8500"
)

TRAEFIK_IP="100.97.62.111"  # Tailscale IP, not an RFC 1918 address
ALLOC_ID=$(nomad job allocs traefik-consul-lb | head -2 | tail -1 | awk '{print $1}')

# Register on every node...
```

### 2. Use Tailscale addresses

**Key settings:**
- Service address: `100.97.62.111` (Tailscale IP)
- Avoid RFC 1918 private addresses (`192.168.x.x`)
- Reachable across networks

### 3. Relaxed health checks

**Tuned for the trans-Pacific network:**
- Interval: `30s` (instead of the usual 10s)
- Timeout: `15s` (instead of the usual 5s)
- Avoids false alarms caused by network latency

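
The three measures above can be combined in one registration call per node. The following is a sketch, not the contents of `register-traefik-to-all-consul.sh`: the helper names, the service ID `traefik-consul-lb`, and the choice of a TCP check are assumptions made for illustration, while the address, port, interval, and timeout come from this document.

```bash
# Hypothetical helper: build an agent service definition with a relaxed TCP check.
# $1 = service ID, $2 = service name, $3 = address, $4 = port.
service_definition() {
  printf '{"ID":"%s","Name":"%s","Address":"%s","Port":%s,' "$1" "$2" "$3" "$4"
  printf '"Check":{"TCP":"%s:%s","Interval":"30s","Timeout":"15s"}}\n' "$3" "$4"
}

# Register the same definition on every Consul node.
# Defined but not called here, because it writes to the live cluster.
register_everywhere() {
  for node in master warden ash3c; do
    service_definition traefik-consul-lb consul-lb 100.97.62.111 80 |
      curl -sf -X PUT --data @- \
        "http://$node.tailnet-68f9.ts.net:8500/v1/agent/service/register" ||
      echo "registration on $node failed" >&2
  done
}

# Print the definition for review.
service_definition traefik-consul-lb consul-lb 100.97.62.111 80
```

Each agent that accepts the registration runs the check itself, so the check from ash3c crosses the Pacific on every interval. That is the case the relaxed 30s/15s timing is meant to absorb.
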
## Making It Persistent

### Option A: Nomad job integration (recommended)

Add a lifecycle hook to the Traefik job:

```hcl
task "consul-registrar" {
  driver = "exec"

  lifecycle {
    hook    = "poststart"
    sidecar = false
  }

  config {
    command = "/local/register-services.sh"
  }
}
```

### Option B: Scheduled job

```bash
# Add to crontab
*/5 * * * * /root/mgmt/scripts/register-traefik-to-all-consul.sh
```

### Option C: Consul Template monitoring

Use consul-template to watch Traefik's state and register it automatically.

## Deployment Steps

1. **Deploy the simplified Traefik job**:
   ```bash
   nomad job run components/traefik/jobs/traefik.nomad
   ```

2. **Run the multi-node registration**:
   ```bash
   ./scripts/register-traefik-to-all-consul.sh
   ```

3. **Verify the registration**:
   ```bash
   # Check every node
   for node in master warden ash3c; do
     echo "=== $node ==="
     curl -s http://$node.tailnet-68f9.ts.net:8500/v1/catalog/services | jq 'keys[]' | grep -E "(consul-lb|traefik)"
   done
   ```

## Troubleshooting

### Problem: services missing on the Beijing node warden

**Possible causes:**
1. Consul cluster sync delay
2. Network partition or connectivity problems
3. Failing health checks

**Diagnostic commands:**
```bash
# Check the Consul cluster state
curl -s http://warden.tailnet-68f9.ts.net:8500/v1/status/peers

# Check the services registered on this agent
curl -s http://warden.tailnet-68f9.ts.net:8500/v1/agent/services

# Check the health checks
curl -s http://warden.tailnet-68f9.ts.net:8500/v1/agent/checks
```

**Fix:**
```bash
# Force re-registration on warden
curl -X PUT http://warden.tailnet-68f9.ts.net:8500/v1/agent/service/register -d '{
  "ID": "traefik-consul-lb-manual",
  "Name": "consul-lb",
  "Address": "100.97.62.111",
  "Port": 80,
  "Tags": ["consul", "loadbalancer", "traefik", "manual"]
}'
```

## Monitoring and Maintenance

### Health check monitoring
```bash
# Check service health on all nodes
./scripts/check-consul-health.sh
```

### Periodic verification
```bash
# Daily verification script
./scripts/daily-consul-verification.sh
```

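## Best Practices

1. **Geographic optimization** - prefer the geographically closest Consul node
2. **Redundant registration** - always register on every node to avoid a single point of failure
3. **Use Tailscale** - avoid RFC 1918 addresses so services stay reachable across networks
4. **Relaxed checks** - use lenient health check settings on trans-oceanic links
5. **Documentation** - record every configuration change
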
## Access

- **Consul UI**: `https://hcp1.tailnet-68f9.ts.net/`
- **Traefik Dashboard**: `https://hcp1.tailnet-68f9.ts.net:8080/`

---

**Created**: 2025-10-02
**Last updated**: 2025-10-02
**Maintainer**: Infrastructure Team
@@ -1,99 +0,0 @@
job "waypoint-server" {
  datacenters = ["dc1"]
  type        = "service"

  group "waypoint" {
    count = 1

    constraint {
      attribute = "${node.unique.name}"
      operator  = "="
      value     = "warden"
    }

    network {
      port "ui" {
        static = 9701
      }

      port "api" {
        static = 9702
      }

      port "grpc" {
        static = 9703
      }
    }

    task "server" {
      driver = "podman"

      config {
        image = "hashicorp/waypoint:latest"
        ports = ["ui", "api", "grpc"]

        args = [
          "server",
          "run",
          "-accept-tos",
          "-vvv",
          "-platform=nomad",
          "-nomad-host=${attr.nomad.advertise.address}",
          "-nomad-consul-service=true",
          "-nomad-consul-service-hostname=${attr.unique.hostname}",
          "-nomad-consul-datacenter=dc1",
          "-listen-grpc=0.0.0.0:9703",
          "-listen-http=0.0.0.0:9702",
          "-url-api=http://${attr.unique.hostname}:9702",
          "-url-ui=http://${attr.unique.hostname}:9701"
        ]
      }

      env {
        WAYPOINT_SERVER_DISABLE_MEMORY_DB = "true"
      }

      resources {
        cpu    = 500
        memory = 1024
      }

      service {
        name = "waypoint-ui"
        port = "ui"

        check {
          name     = "waypoint-ui-alive"
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }

      service {
        name = "waypoint-api"
        port = "api"

        check {
          name     = "waypoint-api-alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }

      volume_mount {
        volume      = "waypoint-data"
        destination = "/data"
        read_only   = false
      }
    }

    volume "waypoint-data" {
      type      = "host"
      read_only = false
      source    = "waypoint-data"
    }
  }
}