Clean up repository: remove backup files and reorganize infrastructure components

2025-10-02 17:04:51 +00:00
parent e5aa00d6f9
commit 1c994f9f60
133 changed files with 1835 additions and 11296 deletions
--- a/docs/CONSUL_ARCHITECTURE.md
+++ b/docs/CONSUL_ARCHITECTURE.md
@@ -0,0 +1,144 @@
+# Consul Cluster Architecture Design
+
+## Current Architecture
+
+### Consul Servers (3)
+- **master** (100.117.106.136) - Korea, current Leader
+- **warden** (100.122.197.112) - Beijing, Voter
+- **ash3c** (100.116.80.94) - United States, Voter
+
+### Consul Clients (1+)
+- **hcp1** (100.97.62.111) - Beijing, system-level Client
+
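+## Architecture Advantages
+
+### ✅ Strengths of the current design:
+1. **High availability** - 3 Servers tolerate the failure of 1 (quorum is 2 of 3)
+2. **Geographic distribution** - spans three regions, giving strong disaster tolerance
+3. **Performance** - each region has a local Server
+4. **Scalability** - Clients can be added as needed
+
+### ✅ Why running hcp1 as a Client is correct:
+1. **Services register locally** - Traefik runs on hcp1, so a local Client is the most efficient path
+2. **Lower network latency** - avoids registering services across networks
+3. **Better health checks** - a local Client can check service status more accurately
+4. **Fault isolation** - a failure of the hcp1 Client does not affect cluster consensus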
+
+## Scaling Recommendations
+
+### 🎯 Ideal Client deployment:
+```
+Every node that runs business services should have a Consul Client:
+
+┌───────────────────┬───────────────────┬───────────────────┐
+│ Node              │ Client            │ Service           │
+├───────────────────┼───────────────────┼───────────────────┤
+│ master            │ ✓ (built-in)      │ Consul            │
+│ warden            │ ✓ (built-in)      │ Consul            │
+│ ash3c             │ ✓ (built-in)      │ Consul            │
+│ hcp1              │ ✓ (standalone)    │ Traefik           │
+│ Other nodes...    │ Recommended       │ Other services... │
+└───────────────────┴───────────────────┴───────────────────┘
+```
+
+### 🔧 Standard Client configuration:
+```hcl
+# Consul Client configuration for hcp1 (/etc/consul.d/consul.hcl)
+datacenter = "dc1"
+data_dir = "/opt/consul"
+log_level = "INFO"
+node_name = "hcp1"
+bind_addr = "100.97.62.111"
+
+# Join the cluster through any of the Servers
+retry_join = [
+  "100.117.106.136",  # master
+  "100.122.197.112",  # warden
+  "100.116.80.94"     # ash3c
+]
+
+# Client mode
+server = false
+ui_config {
+  enabled = false  # a Client does not need the UI
+}
+
+# Service discovery and health checks
+ports {
+  grpc = 8502
+  http = 8500
+}
+
+connect {
+  enabled = true
+}
+```
+
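+## Service Registration Strategy
+
+### 🎯 Recommended options:
+1. **Nomad automatic registration** (preferred)
+   - Configured through Nomad's `consul` block
+   - Handles the service lifecycle automatically
+   - Integrated with the deployment workflow
+
+2. **Local Client registration** (current approach)
+   - Registers through the local Consul Client
+   - Managed manually, but more flexible
+   - Suited to complex registration logic
+
+3. **Catalog API registration** (emergency fallback)
+   - Calls the Consul API directly
+   - Bypasses synchronization problems
+   - Used for failure recovery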
+
+### 🔄 Migrating to Nomad registration:
+```hcl
+# In the Nomad Client configuration
+consul {
+  address = "127.0.0.1:8500"  # local Consul Client
+  server_service_name = "nomad"
+  client_service_name = "nomad-client"
+  auto_advertise = true
+  server_auto_join = false
+  client_auto_join = true
+}
+```
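+
+The `consul` block above only tells the Nomad agent how to reach Consul. What gets registered is declared per job in a `service` block. A minimal sketch of such a job fragment follows. The group and port names are illustrative assumptions, the task definition is omitted, and the job name, service name, tags, and check timings reuse values that appear in the service registration document of this commit:
+
+```hcl
+# Sketch only: group and port names are assumptions, not taken from the real job file.
+job "traefik-consul-lb" {
+  group "traefik" {
+    network {
+      port "http" {
+        static = 80
+      }
+    }
+
+    # Nomad registers this service in Consul when the allocation starts
+    # and deregisters it when the allocation stops.
+    service {
+      name     = "consul-lb"
+      port     = "http"
+      provider = "consul"
+      tags     = ["consul", "loadbalancer", "traefik"]
+
+      check {
+        type     = "tcp"
+        interval = "30s"
+        timeout  = "15s"
+      }
+    }
+
+    # task "traefik" { ... } omitted from this sketch
+  }
+}
+```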
+
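+## Monitoring and Maintenance
+
+### 📊 Key metrics:
+- **Raft index synchronization** - confirm that all Servers hold consistent data
+- **Client connection status** - monitor Client-to-Server connectivity
+- **Service registration latency** - track the time from registration to discoverability
+- **Health check status** - monitor service health
+
+### 🛠️ Maintenance scripts:
+```bash
+# Cluster health check
+./scripts/consul-cluster-health.sh
+
+# Verify service synchronization
+./scripts/verify-service-sync.sh
+
+# Failure recovery
+./scripts/consul-recovery.sh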
+```
+
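+## Failure Handling
+
+### 🚨 Common problems:
+1. **Server failure** - a new Leader is elected automatically while a majority (2 of 3) remains; no intervention needed
+2. **Client disconnect** - restart the Client; it rejoins automatically
+3. **Service synchronization problems** - force a sync through the Catalog API
+4. **Network partition** - handled by Raft: the side that keeps a majority of Servers continues to accept writes
+
+### 🔧 Recovery steps:
+1. Check the cluster status
+2. Verify network connectivity
+3. Restart the affected components
+4. Force services to re-register
+
+---
+
+**Conclusion**: The current architecture is sound, and running hcp1 as a Client is the right choice. Keep the existing architecture and consider adding Consul Clients to the other business nodes.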
--- a/docs/CONSUL_ARCHITECTURE_OPTIMIZATION.md
+++ b/docs/CONSUL_ARCHITECTURE_OPTIMIZATION.md
@@ -0,0 +1,188 @@
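+# Consul Architecture Optimization Plan
+
+## Current Pain Points
+
+### Current network latency:
+- **Within Beijing**: ~0.6ms (same office)
+- **Beijing ↔ Korea**: ~72ms
+- **Beijing ↔ United States**: ~215ms
+
+### Node distribution:
+- **Beijing**: warden, hcp1, influxdb1, browser (4 nodes)
+- **Korea**: master (1 node)
+- **United States**: ash3c (1 node)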
+
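+## Architecture Trade-offs
+
+### 🏛️ Option 1: current geographically distributed architecture
+```
+Consul Servers: master (Korea) + warden (Beijing) + ash3c (United States)
+
+Pros:
+✅ Genuinely highly available - the cluster keeps working if any single region fails
+✅ Disaster recovery - a backup exists for earthquakes, power outages, and network outages
+✅ Load spread globally
+
+Cons:
+❌ Write latency ~200ms (trans-Pacific consensus)
+❌ High network cost
+❌ Complex operations
+```
+
+### 🏢 Option 2: Beijing-centralized architecture
+```
+Consul Servers: warden + hcp1 + influxdb1 (all in Beijing)
+
+Pros:
+✅ Very low latency ~0.6ms
+✅ Simple operations
+✅ Low cost
+
+Cons:
+❌ Single point of failure - a Beijing network outage takes everything down
+❌ No disaster recovery
+❌ "Talking to itself" - Korea and the United States are permanently in the minority
+```
+
+### 🎯 Option 3: hybrid architecture (recommended)
+```
+Primary Cluster (Beijing): 3 Servers - handles day-to-day traffic
+Backup Cluster (global): 3 Servers - disaster recovery
+
+Or:
+Local Consul (Beijing): fast local service discovery
+Global Consul (distributed): cross-region service discovery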
+```
+
+## 🚀 Recommended Implementation Plan
+
+### Phase 1: optimize the current architecture
+```hcl
+# Consul agent configuration (top-level keys)
+
+# 1. Tune Raft parameters for trans-oceanic latency
+raft_protocol           = 3
+raft_snapshot_threshold = 16384
+raft_trailing_logs      = 10000
+
+# 2. Enable local caching
+cache {
+  entry_fetch_max_burst = 42
+  entry_fetch_rate      = 30
+}
+
+# 3. Tune networking
+performance {
+  raft_multiplier = 5  # more timing tolerance; 5 is also Consul's default (maximum 10)
+}
+```
+
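+### Phase 2: deploy local Consul Clients
+```python
+# Deploy a Consul Client on every Beijing node.
+# deploy_consul_client is not defined in this document; it stands in for the real deployment tooling.
+nodes = ["hcp1", "influxdb1", "browser"]
+
+for node in nodes:
+    deploy_consul_client(node, {
+        "servers": ["warden:8300"],  # prefer the local Server
+        "retry_join": [  # retry_join uses the Serf LAN port (8301), not the RPC port (8300)
+            "warden.tailnet-68f9.ts.net:8301",
+            "master.tailnet-68f9.ts.net:8301",
+            "ash3c.tailnet-68f9.ts.net:8301",
+        ],
+    })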
+```
+
+### Phase 3: smart routing
+```hcl
+# Location-aware routing:
+#   Beijing nodes prefer warden
+#   Korea nodes prefer master
+#   United States nodes prefer ash3c
+
+connect {
+  enabled = true
+}
+
+# Local-first policy: tag each node with its location
+node_meta {
+  region = "beijing"
+  zone   = "office-1"
+}
+```
+
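+## 🎯 Final Recommendation
+
+### For your scenario:
+
+**Keep the current 3-node geographic distribution, but optimize performance:**
+
+1. **Accept the latency** - 200ms is acceptable for most applications
+2. **Optimize local access** - deploy more Consul Clients
+3. **Smart caching** - cache hot data locally
+4. **Split reads from writes** - serve reads locally (for example with stale consistency) and send writes through Raft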
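+
+For DNS-based discovery, points 3 and 4 map onto Consul's `dns_config` settings. The fragment below is a minimal sketch of an agent configuration; the durations are illustrative assumptions, not values tuned for or taken from this cluster:
+
+```hcl
+# Sketch only: all durations below are illustrative assumptions.
+dns_config {
+  # Allow any Server, not only the Leader, to answer DNS lookups.
+  allow_stale = true
+  max_stale   = "10s"
+
+  # Let DNS clients cache answers for a short time.
+  node_ttl = "10s"
+  service_ttl {
+    "*" = "5s"
+  }
+}
+```
+
+Stale reads trade a bounded amount of freshness for lower latency, so the `max_stale` bound should reflect how out-of-date an answer the consuming services can tolerate.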
+
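+### Specific optimizations:
+
+```bash
+# 1. Deploy a Consul Client on all 4 Beijing nodes
+./scripts/deploy-consul-clients.sh beijing
+
+# 2. Configure the local-first policy in the Consul agent configuration (HCL):
+#      datacenter = "dc1"
+#      node_meta = {
+#        region = "beijing"
+#      }
+#
+#      # Local read optimization
+#      ui_config {
+#        enabled = true
+#      }
+#
+#      # Cache configuration
+#      cache {
+#        entry_fetch_max_burst = 42
+#      }
+
+# 3. Application-level optimizations
+# - Use a local DNS cache
+# - Batch operations to reduce Raft writes
+# - Update non-critical data asynchronously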
+```
+
+## 🔍 Monitoring Metrics
+
+```bash
+# Key metrics to monitor
+consul_metrics=(
+  "consul.raft.commitTime"          # Raft commit latency
+  "consul.raft.leader.lastContact"  # time since the Leader last reached its followers
+  "consul.dns.stale_queries"        # stale DNS queries
+  "consul.catalog.register"         # service registration time
+)
+```
+
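+## 💡 Conclusion
+
+**Your analysis is entirely correct!**
+
+- ✅ **Geographic distribution does carry a latency cost**
+- ✅ **Centralizing in Beijing really would be "talking to itself"**
+- ✅ **This is a fundamental trade-off in distributed systems**
+
+**Best strategy: keep the current architecture and reduce the latency impact through optimization**
+
+Because:
+1. **200ms of latency is acceptable for most workloads**
+2. **Real high availability matters more than latency**
+3. **Caching and tuning can substantially improve the experience**
+
+Your technical judgment is accurate! This really is a trade-off with no perfect answer.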
--- a/docs/CONSUL_SERVICE_REGISTRATION.md
+++ b/docs/CONSUL_SERVICE_REGISTRATION.md
@@ -0,0 +1,170 @@
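+# Consul Service Registration Solution
+
+## Background
+
+The trans-Pacific Nomad + Consul cluster ran into the following problems:
+1. **RFC1918 addresses** - Nomad's automatic registration uses private IPs, which are unreachable across networks
+2. **Consul Leader changes** - services were registered on a single node only and were lost when the Leader changed
+3. **Service flapping** - failing health checks caused services to register and deregister repeatedly
+
+## Solution
+
+### 1. Redundant registration on multiple nodes
+
+**Core idea: register the service with every Consul node at the same time, so that a Leader change has no effect**
+
+#### Consul cluster nodes:
+- `master.tailnet-68f9.ts.net:8500` (Korea, usually the Leader)
+- `warden.tailnet-68f9.ts.net:8500` (Beijing, preferred node)
+- `ash3c.tailnet-68f9.ts.net:8500` (United States, backup node)
+
+#### Registration script: `scripts/register-traefik-to-all-consul.sh`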
+
+```bash
+#!/bin/bash
+# Register the Traefik service with all three Consul nodes
+
+CONSUL_NODES=(
+  "master.tailnet-68f9.ts.net:8500"
+  "warden.tailnet-68f9.ts.net:8500"
+  "ash3c.tailnet-68f9.ts.net:8500"
+)
+
+TRAEFIK_IP="100.97.62.111"  # Tailscale IP, not RFC1918
+ALLOC_ID=$(nomad job allocs traefik-consul-lb | head -2 | tail -1 | awk '{print $1}')
+
+# Register with every node...
+```
+
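+### 2. Use Tailscale addresses
+
+**Key configuration:**
+- Service address: `100.97.62.111` (Tailscale IP)
+- Avoid RFC1918 private addresses (`192.168.x.x`)
+- Reachable across networks
+
+### 3. Relaxed health checks
+
+**Tuning for the trans-Pacific network:**
+- Interval: `30s` (rather than a tighter 10s)
+- Timeout: `15s` (rather than a tighter 5s)
+- Avoids false alarms caused by network latency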
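+
+As a sketch, the Tailscale address and the relaxed timings above could be written as a Consul service definition loaded by the local agent. The service ID, the check ID and name, and the choice of a TCP check are assumptions made for illustration; the service name, address, port, and tags reuse the values from the manual registration command in the troubleshooting section:
+
+```hcl
+# Sketch only: IDs and the TCP check type are illustrative assumptions.
+service {
+  id      = "traefik-consul-lb"
+  name    = "consul-lb"
+  address = "100.97.62.111"  # Tailscale IP rather than an RFC1918 address
+  port    = 80
+  tags    = ["consul", "loadbalancer", "traefik"]
+
+  check {
+    id       = "traefik-consul-lb-tcp"
+    name     = "Traefik TCP reachability"
+    tcp      = "100.97.62.111:80"
+    interval = "30s"
+    timeout  = "15s"
+  }
+}
+```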
+
+## Persistence Options
+
+### Option A: Nomad job integration (recommended)
+
+Add a lifecycle hook to the Traefik job:
+
+```hcl
+task "consul-registrar" {
+  driver = "exec"
+  
+  lifecycle {
+    hook = "poststart"
+    sidecar = false
+  }
+
+  config {
+    command = "/local/register-services.sh"
+  }
+}
+```
+
+### Option B: scheduled task
+
+```bash
+# Add to crontab
+*/5 * * * * /root/mgmt/scripts/register-traefik-to-all-consul.sh
+```
+
+### Option C: consul-template monitoring
+
+Use consul-template to watch Traefik's state and register it automatically.
+
+## Deployment Steps
+
+1. **Deploy the simplified Traefik**:
+   ```bash
+   nomad job run components/traefik/jobs/traefik.nomad
+   ```
+
+2. **Register with all nodes**:
+   ```bash
+   ./scripts/register-traefik-to-all-consul.sh
+   ```
+
+3. **Verify registration**:
+   ```bash
+   # Check every node
+   for node in master warden ash3c; do
+     echo "=== $node ==="
+     curl -s http://$node.tailnet-68f9.ts.net:8500/v1/catalog/services | jq 'keys[]' | grep -E "(consul-lb|traefik)"
+   done
+   ```
+
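+## Troubleshooting
+
+### Problem: services missing on the Beijing warden node
+
+**Possible causes:**
+1. Consul cluster synchronization delay
+2. Network partition or connectivity problems
+3. Failing health checks
+
+**Diagnostic commands:**
+```bash
+# Check Consul cluster status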
+curl -s http://warden.tailnet-68f9.ts.net:8500/v1/status/peers
+
+# Check locally registered services
+curl -s http://warden.tailnet-68f9.ts.net:8500/v1/agent/services
+
+# Check health checks
+curl -s http://warden.tailnet-68f9.ts.net:8500/v1/agent/checks
+```
+
+**Fix:**
+```bash
+# Force re-registration on warden
+curl -X PUT http://warden.tailnet-68f9.ts.net:8500/v1/agent/service/register -d '{
+  "ID": "traefik-consul-lb-manual",
+  "Name": "consul-lb",
+  "Address": "100.97.62.111",
+  "Port": 80,
+  "Tags": ["consul", "loadbalancer", "traefik", "manual"]
+}'
+```
+
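+## Monitoring and Maintenance
+
+### Health check monitoring
+```bash
+# Check service health on all nodes
+./scripts/check-consul-health.sh
+```
+
+### Periodic verification
+```bash
+# Daily verification script
+./scripts/daily-consul-verification.sh
+```
+
+## Best Practices
+
+1. **Geographic optimization** - prefer the geographically closest Consul node
+2. **Redundant registration** - always register with every node to avoid a single point of failure
+3. **Use Tailscale** - avoid RFC1918 addresses so services stay reachable across networks
+4. **Relaxed checks** - use relaxed health check parameters on trans-oceanic links
+5. **Documentation** - document every configuration change
+
+## Access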
+
+- **Consul UI**: `https://hcp1.tailnet-68f9.ts.net/`
+- **Traefik Dashboard**: `https://hcp1.tailnet-68f9.ts.net:8080/`
+
+---
+
+**Created**: 2025-10-02  
+**Last updated**: 2025-10-02  
+**Maintainer**: Infrastructure Team
--- a/docs/waypoint/waypoint-server.nomad
+++ b/docs/waypoint/waypoint-server.nomad
@@ -1,99 +0,0 @@
-job "waypoint-server" {
-  datacenters = ["dc1"]
-  type = "service"
-
-  group "waypoint" {
-    count = 1
-
-    constraint {
-      attribute = "${node.unique.name}"
-      operator  = "="
-      value     = "warden"
-    }
-
-    network {
-      port "ui" {
-        static = 9701
-      }
-      
-      port "api" {
-        static = 9702
-      }
-      
-      port "grpc" {
-        static = 9703
-      }
-    }
-
-    task "server" {
-      driver = "podman"
-
-      config {
-        image = "hashicorp/waypoint:latest"
-        ports = ["ui", "api", "grpc"]
-        
-        args = [
-          "server",
-          "run",
-          "-accept-tos",
-          "-vvv",
-          "-platform=nomad",
-          "-nomad-host=${attr.nomad.advertise.address}",
-          "-nomad-consul-service=true",
-          "-nomad-consul-service-hostname=${attr.unique.hostname}",
-          "-nomad-consul-datacenter=dc1",
-          "-listen-grpc=0.0.0.0:9703",
-          "-listen-http=0.0.0.0:9702",
-          "-url-api=http://${attr.unique.hostname}:9702",
-          "-url-ui=http://${attr.unique.hostname}:9701"
-        ]
-      }
-
-      env {
-        WAYPOINT_SERVER_DISABLE_MEMORY_DB = "true"
-      }
-
-      resources {
-        cpu    = 500
-        memory = 1024
-      }
-
-      service {
-        name = "waypoint-ui"
-        port = "ui"
-        
-        check {
-          name     = "waypoint-ui-alive"
-          type     = "http"
-          path     = "/"
-          interval = "10s"
-          timeout  = "2s"
-        }
-      }
-      
-      service {
-        name = "waypoint-api"
-        port = "api"
-        
-        check {
-          name     = "waypoint-api-alive"
-          type     = "tcp"
-          interval = "10s"
-          timeout  = "2s"
-        }
-      }
-      
-      volume_mount {
-        volume      = "waypoint-data"
-        destination = "/data"
-        read_only   = false
-      }
-    }
-    
-    volume "waypoint-data" {
-      type      = "host"
-      read_only = false
-      source    = "waypoint-data"
-    }
-  }
-}