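# Consul Service Registration Solution

## Background

A Nomad + Consul cluster that spans the Pacific ran into the following problems:
1. **RFC1918 addresses** - Nomad's automatic registration uses private IPs, which are unreachable from other networks
2. **Consul leader rotation** - Services are registered on a single node only, so they disappear when the leader changes
3. **Service flapping** - Failing health checks cause services to register and deregister repeatedly
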
## Solution

### 1. Redundant Multi-Node Registration

**Core idea: register the service with every Consul node at the same time, so that leader rotation does not affect it**

#### Consul cluster nodes:
- `master.tailnet-68f9.ts.net:8500` (South Korea, usually the leader)
- `warden.tailnet-68f9.ts.net:8500` (Beijing, preferred node)
- `ash3c.tailnet-68f9.ts.net:8500` (United States, backup node)

#### Registration script: `scripts/register-traefik-to-all-consul.sh`

```bash
#!/bin/bash
# Register the Traefik service with all three Consul nodes

CONSUL_NODES=(
    "master.tailnet-68f9.ts.net:8500"
    "warden.tailnet-68f9.ts.net:8500"
    "ash3c.tailnet-68f9.ts.net:8500"
)

TRAEFIK_IP="100.97.62.111"  # Tailscale IP, not RFC1918
# Take the ID from the first data row of the allocation list (row 1 is the header)
ALLOC_ID=$(nomad job allocs traefik-consul-lb | awk 'NR==2 {print $1}')

# Register with all nodes...
```

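The script above elides the registration step itself. The following is a minimal sketch of how that step might look, assuming the Consul agent endpoint `/v1/agent/service/register` is reachable on each node without an ACL token. The function names `build_payload` and `register_all` and the service ID `traefik-consul-lb` are hypothetical; the service name `consul-lb` and port `80` are borrowed from the manual registration example in the troubleshooting section.

```bash
#!/bin/bash
# Sketch only: one possible shape for the "register with all nodes" step.
# build_payload, register_all, and the service ID are hypothetical.

# Print the JSON body for /v1/agent/service/register.
build_payload() {
    local id="$1" name="$2" addr="$3" port="$4"
    printf '{"ID":"%s","Name":"%s","Address":"%s","Port":%d}' \
        "$id" "$name" "$addr" "$port"
}

# PUT the payload to every node passed after the first argument.
# Keeps going after a failure and returns non-zero if any node failed.
register_all() {
    local payload="$1" node failed=0
    shift
    for node in "$@"; do
        if curl -sf --max-time 15 -X PUT \
                "http://${node}/v1/agent/service/register" \
                -d "$payload" >/dev/null; then
            echo "registered on ${node}"
        else
            echo "FAILED on ${node}" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Example call (commented out so that sourcing this file has no side effects):
# register_all "$(build_payload traefik-consul-lb consul-lb "$TRAEFIK_IP" 80)" \
#     "${CONSUL_NODES[@]}"
```

Continuing past a failed node, rather than exiting on the first error, means one unreachable region does not block registration with the other two.
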
### 2. Use Tailscale Addresses

**Key configuration:**
- Service address: `100.97.62.111` (Tailscale IP)
- Avoid RFC1918 private addresses (`192.168.x.x`)
- Reachable from other networks

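To keep a private address from slipping into a registration, the address can be classified before it is used. The sketch below is an illustration rather than part of the original scripts: `classify_ipv4` is a hypothetical helper, and it relies on Tailscale assigning IPv4 addresses from the CGNAT range `100.64.0.0/10`.

```bash
#!/bin/bash
# Sketch only: classify_ipv4 is a hypothetical helper.
# Prints "rfc1918", "tailscale" (100.64.0.0/10), or "other" for a dotted-quad IPv4 address.
classify_ipv4() {
    case "$1" in
        # 10.0.0.0/8 and 192.168.0.0/16
        10.*|192.168.*) echo "rfc1918" ;;
        # 172.16.0.0/12 covers 172.16.x.x through 172.31.x.x
        172.1[6-9].*|172.2[0-9].*|172.3[01].*) echo "rfc1918" ;;
        # 100.64.0.0/10 covers 100.64.x.x through 100.127.x.x
        100.6[4-9].*|100.[7-9][0-9].*|100.1[01][0-9].*|100.12[0-7].*) echo "tailscale" ;;
        *) echo "other" ;;
    esac
}

# Example guard before registering:
# [ "$(classify_ipv4 "$TRAEFIK_IP")" = "tailscale" ] || { echo "unexpected address" >&2; exit 1; }
```
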
### 3. Relaxed Health Checks

**Tuning for the trans-Pacific link:**
- Interval: `30s` (instead of the default 10s)
- Timeout: `15s` (instead of the default 5s)
- Avoids false positives caused by network latency

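As an illustration of where these values go, the sketch below builds a registration body with a `Check` block that carries the relaxed interval and timeout. The document does not say which check type is used, so the TCP check here is an assumption, and `build_check` and `build_service_with_check` are hypothetical names.

```bash
#!/bin/bash
# Sketch only: the TCP check type and both function names are assumptions.

# Print a Consul check definition; interval and timeout default to the relaxed values.
build_check() {
    local target="$1" interval="${2:-30s}" timeout="${3:-15s}"
    printf '{"TCP":"%s","Interval":"%s","Timeout":"%s"}' \
        "$target" "$interval" "$timeout"
}

# Print a service registration body that embeds the check for addr:port.
build_service_with_check() {
    local id="$1" name="$2" addr="$3" port="$4"
    printf '{"ID":"%s","Name":"%s","Address":"%s","Port":%d,"Check":%s}' \
        "$id" "$name" "$addr" "$port" "$(build_check "${addr}:${port}")"
}

# Example: build_service_with_check traefik-consul-lb consul-lb 100.97.62.111 80
```
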
## Persistence Options

### Option A: Nomad Job Integration (Recommended)

Add a lifecycle hook to the Traefik job:

```hcl
task "consul-registrar" {
  driver = "exec"

  lifecycle {
    hook    = "poststart"
    sidecar = false
  }

  config {
    command = "/local/register-services.sh"
  }
}
```

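A poststart task is launched once the main task has started, which does not guarantee that Traefik is already accepting connections or that every Consul node is reachable at that moment. One possible way to handle this inside the registration script is a retry loop. The sketch below assumes that approach; `retry` is a hypothetical helper, not something the original script is known to contain.

```bash
#!/bin/bash
# Sketch only: retry is a hypothetical helper.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
# Runs the command until it succeeds or the attempt limit is reached.
retry() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while true; do
        "$@" && return 0
        if [ "$i" -ge "$attempts" ]; then
            echo "giving up after ${attempts} attempts: $*" >&2
            return 1
        fi
        i=$((i + 1))
        sleep "$delay"
    done
}

# Example: try the registration up to 5 times, 10 seconds apart.
# retry 5 10 ./scripts/register-traefik-to-all-consul.sh
```
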
### Option B: Scheduled Job

```bash
# Add to crontab
*/5 * * * * /root/mgmt/scripts/register-traefik-to-all-consul.sh
```

### Option C: Consul Template Monitoring

Use consul-template to watch Traefik's state and register the service automatically.

## Deployment Steps

1. **Deploy the simplified Traefik**:
   ```bash
   nomad job run components/traefik/jobs/traefik.nomad
   ```

2. **Run the multi-node registration**:
   ```bash
   ./scripts/register-traefik-to-all-consul.sh
   ```

3. **Verify the registration**:
   ```bash
   # Check every node
   for node in master warden ash3c; do
       echo "=== $node ==="
       curl -s "http://$node.tailnet-68f9.ts.net:8500/v1/catalog/services" | jq 'keys[]' | grep -E "(consul-lb|traefik)"
   done
   ```

## Troubleshooting

### Problem: Service missing on the Beijing warden node

**Possible causes:**
1. Consul cluster synchronization delay
2. Network partition or connectivity problems
3. Failing health checks

**Diagnostic commands:**
```bash
# Check the Consul cluster state (Raft peers)
curl -s http://warden.tailnet-68f9.ts.net:8500/v1/status/peers

# Check the services registered with the local agent
curl -s http://warden.tailnet-68f9.ts.net:8500/v1/agent/services

# Check the health checks
curl -s http://warden.tailnet-68f9.ts.net:8500/v1/agent/checks
```

**Fix:**
```bash
# Force a re-registration with warden
curl -X PUT http://warden.tailnet-68f9.ts.net:8500/v1/agent/service/register -d '{
  "ID": "traefik-consul-lb-manual",
  "Name": "consul-lb",
  "Address": "100.97.62.111",
  "Port": 80,
  "Tags": ["consul", "loadbalancer", "traefik", "manual"]
}'
```

## Monitoring and Maintenance

### Health Check Monitoring
```bash
# Check the service health status on every node
./scripts/check-consul-health.sh
```

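`scripts/check-consul-health.sh` is not reproduced in this document. As a rough illustration of what such a check could do, the sketch below queries `/v1/health/checks/<service>` on each node and counts check states. `count_status` and `report_health` are hypothetical names, and the `grep`-based counting is a simplification; a JSON parser such as `jq` would be more robust.

```bash
#!/bin/bash
# Sketch only: not the contents of check-consul-health.sh; both function names are hypothetical.

# Read JSON from stdin and print how many checks have the given Status value.
count_status() {
    grep -oE "\"Status\": *\"$1\"" | wc -l | tr -d ' '
}

# Usage: report_health <service> <node:port> [<node:port>...]
# Prints one line per node with the number of passing and critical checks.
report_health() {
    local service="$1" node body
    shift
    for node in "$@"; do
        body=$(curl -s --max-time 15 "http://${node}/v1/health/checks/${service}") || body=""
        echo "${node}: passing=$(printf '%s' "$body" | count_status passing)" \
             "critical=$(printf '%s' "$body" | count_status critical)"
    done
}

# Example:
# report_health consul-lb master.tailnet-68f9.ts.net:8500 warden.tailnet-68f9.ts.net:8500
```

A node that reports zero passing checks while the others report one or more is a candidate for the forced re-registration described in the troubleshooting section.
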
### Periodic Verification
```bash
# Daily verification script
./scripts/daily-consul-verification.sh
```

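## Best Practices

1. **Geographic optimization** - Prefer the geographically closest Consul node
2. **Redundant registration** - Always register with every node to avoid a single point of failure
3. **Use Tailscale** - Avoid RFC1918 addresses so that services stay reachable across networks
4. **Relaxed checks** - Use relaxed health check parameters on trans-oceanic links
5. **Documentation** - Document every configuration change
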
## Access

- **Consul UI**: `https://hcp1.tailnet-68f9.ts.net/`
- **Traefik Dashboard**: `https://hcp1.tailnet-68f9.ts.net:8080/`

---

**Created**: 2025-10-02
**Last updated**: 2025-10-02
**Maintainer**: Infrastructure Team