diff --git a/.gitea/issues/consul-nomad-access-lesson.md b/.gitea/issues/consul-nomad-access-lesson.md
new file mode 100644
index 0000000..696f00b
--- /dev/null
+++ b/.gitea/issues/consul-nomad-access-lesson.md
@@ -0,0 +1,73 @@
+---
+title: "⚠️ Important lesson learned: Consul and Nomad access issues"
+labels: ["documentation", "networking", "consul", "nomad"]
+assignees: []
+---
+
+## ⚠️ Important Lesson Learned
+
+### Consul and Nomad access issues
+
+**Problem**: Attempts to reach the Consul service via `http://localhost:8500` or `http://127.0.0.1:8500` fail to connect.
+
+#### Root cause
+
+The Consul and Nomad services in this project run as a cluster via Nomad + Podman and are reached over the Tailscale network. They do not run locally, so they cannot be reached through localhost.
+
+#### Solution
+
+##### Use the Tailscale IP
+
+Services must be accessed through the IP address assigned by Tailscale:
+
+```bash
+# Show the current node's Tailscale IP
+tailscale ip -4
+
+# List all nodes in the Tailscale network
+tailscale status
+
+# Access Consul (use the actual Tailscale IP)
+curl http://100.x.x.x:8500/v1/status/leader
+
+# Access Nomad (use the actual Tailscale IP)
+curl http://100.x.x.x:4646/v1/status/leader
+```
+
+##### Service discovery
+
+- The Consul cluster consists of 3 nodes
+- The Nomad cluster consists of more than ten nodes, including server and client nodes
+- Identify correctly which node a service is running on
+
+##### Cluster architecture
+
+- **Consul cluster**: 3 nodes (kr-master, us-ash3c, bj-warden)
+- **Nomad cluster**: more than ten nodes, including server and client nodes
+
+#### Important reminder
+
+During development and debugging, always use Tailscale IPs rather than localhost to reach cluster services. This is a basic requirement of this project's architecture and must be strictly followed.
+
+### Suggested improvements
+
+1. **Documentation**:
+   - Explicitly emphasize the use of Tailscale IPs in all relevant documentation
+   - Add access reminders in code comments
+   - Create an FAQ document
+
+2. **Automated checks**:
+   - Add automated checks that prevent the use of localhost for cluster services
+   - Validate the network configuration in the CI/CD pipeline
+
+3. **Training material**:
+   - Create training material for new team members
+   - Add it to the project onboarding guide
+
+## 🎉 Acknowledgments
+
+Thanks to all the developers and community members who have contributed to this project!
+
+---
+
+**Note**: This issue records an important lesson learned on this project; every team member should read and understand it. During development, please consult the relevant documentation in [README.md](../README.md), especially the sections on network access.
\ No newline at end of file
diff --git a/README.md b/README.md
index d2c9de1..906c543 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,42 @@
 This is a modern multi-cloud infrastructure management platform focused on the integrated management of OpenTofu, Ansible, and Nomad + Podman.
 
+## 📝 Important Reminder (Sticky Note)
+
+### ✅ Consul cluster status update
+
+**Current status**: The Consul cluster is healthy and all nodes are running normally
+
+**Cluster information**:
+- **Leader**: warden (100.122.197.112:8300)
+- **Node count**: 3 server nodes
+- **Health**: all nodes pass their health checks
+- **Node list**:
+  - master (100.117.106.136) - Korea primary node
+  - ash3c (100.116.80.94) - US server node
+  - warden (100.122.197.112) - Beijing server node, current cluster leader
+
+**Configuration status**:
+- The Ansible inventory matches the actual cluster state
+- All nodes run in server mode
+- bootstrap_expect=3, matching the actual node count
+
+**Dependencies**:
+- Tailscale (Day 1) ✅
+- Ansible (Day 2) ✅
+- Nomad (Day 3) ✅
+- Consul (Day 4) ✅ **completed**
+- Terraform (Day 5) ✅ **progressing well**
+- Vault (Day 6) ⏳ planned
+- Waypoint (Day 7) ⏳ planned
+
+**Next steps**:
+- Continue the Terraform state-management work
+- Prepare the Vault secrets-management integration
+- Plan the Waypoint application-deployment flow
+
+---
+
 ## 🎯 Project Features
 
 - **🌩️ Multi-cloud support**: Oracle Cloud, Huawei Cloud, Google Cloud, AWS, DigitalOcean
@@ -12,12 +48,55 @@
 - **📊 Monitoring**: Prometheus + Grafana monitoring stack
 - **🔐 Security**: multi-layer protection and compliance
 
+## 🔄 Architecture Layers and Separation of Responsibilities
+
+### ⚠️ Important: distinguishing Terraform and Nomad responsibilities
+
+This project uses a layered architecture that clearly separates each tool's scope to avoid confusion:
+
+#### 1. **Terraform/OpenTofu layer - infrastructure lifecycle management**
+- **Responsibility**: manage the lifecycle of the compute resources (virtual machines) provided by cloud vendors
+- **Scope**:
+  - Create, update, and delete VM instances
+  - Manage network resources (VCNs, subnets, security groups, etc.)
+  - Manage storage resources (block storage, object storage, etc.)
+  - Manage cloud services such as load balancers
+- **Goal**: ensure the underlying infrastructure is correctly configured and its state is managed
+
+#### 2. **Nomad layer - application scheduling and orchestration**
+- **Responsibility**: allocate resources and orchestrate applications inside VMs that are already running
+- **Scope**:
+  - Schedule and run containerized applications on existing VMs
+  - Manage application lifecycles (start, stop, update)
+  - Allocate and limit resources (CPU, memory, storage)
+  - Service discovery and load balancing
+- **Goal**: run application services efficiently on existing infrastructure
+
+#### 3. **Key differences**
+- **Terraform** manages the lifecycle of the **virtual machines themselves**
+- **Nomad** schedules the applications running **inside the virtual machines**
+- **Terraform** decides "which VMs exist"
+- **Nomad** decides "which applications run on those VMs"
+
+#### 4. **Example workflow** (a command-line sketch follows)
+```
+1. Terraform creates the VM (cloud-provider layer)
+   ↓
+2. The VM boots and runs its operating system
+   ↓
+3. The Nomad client is installed and configured on the VM
+   ↓
+4. Nomad schedules and runs application containers on the VM
+```
+
+**Important reminder**: These two layers must not be mixed up. Terraform should not manage application-level resources, and Nomad should not create VMs. Strictly following this layered architecture is key to the project's success.
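+
+A minimal command-line sketch of the two layers (the directory paths exist in this repo; the job file name and the Tailscale IP are placeholders):
+
+```bash
+# Layer 1 - infrastructure lifecycle: which VMs exist (Terraform/OpenTofu)
+cd deployment/terraform/environments/dev
+tofu plan     # preview VM/network changes
+tofu apply    # create or update the VMs
+
+# Layer 2 - application scheduling: what runs on those VMs (Nomad)
+export NOMAD_ADDR=http://100.x.x.x:4646   # a Tailscale IP, never localhost
+nomad job run infrastructure/jobs/whoami.nomad   # hypothetical job file
+```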
 
 ## 📁 Project Structure
 
 ```
 mgmt/
 ├── .gitea/workflows/   # CI/CD workflows
-├── tofu/               # OpenTofu infrastructure code
+├── tofu/               # OpenTofu infrastructure code (infrastructure lifecycle management)
 │   ├── environments/   # environment configs (dev/staging/prod)
 │   ├── modules/        # reusable modules
 │   ├── providers/      # cloud provider configs
@@ -27,7 +106,7 @@ mgmt/
 │   ├── playbooks/      # playbooks
 │   ├── templates/      # template files
 │   └── group_vars/     # group variables
-├── jobs/               # Nomad job definitions
+├── jobs/               # Nomad job definitions (application scheduling and orchestration)
 │   ├── consul/         # Consul cluster configs
 │   └── podman/         # Podman-related jobs
 ├── configs/            # configuration files
@@ -46,6 +125,11 @@ mgmt/
 └── Makefile            # project management commands
 ```
 
+**Architecture layering notes**:
+- The **tofu/** directory holds Terraform/OpenTofu code, which manages the lifecycle of cloud compute resources
+- The **jobs/** directory holds Nomad job definitions, which schedule application resources inside existing VMs
+- The two directories are strictly separated to keep the responsibility boundary clear
+
 **Note:** The project has migrated from Docker Swarm to Nomad + Podman; the old swarm directory is no longer used. All intermediate scripts and test files have been cleaned up, keeping only the core configuration files in line with GitOps principles.
 
 ## 🔄 GitOps Principles
@@ -140,6 +224,7 @@ tailscale status
 - If you hit a "connection refused" error, confirm that you are using the correct Tailscale IP
 - Make sure the Tailscale service is started and running normally
 - Check that network policies allow access to the relevant ports over the Tailscale interface
+- For more detailed lessons and solutions, see: [Consul and Nomad access lessons learned](.gitea/issues/consul-nomad-access-lesson.md)
 
 ### 🔄 Nomad cluster leader rotation and access strategy
@@ -361,6 +446,34 @@ make test-kali-full
 
 ## ⚠️ Important Lessons Learned
 
+### Distinguishing Terraform and Nomad responsibilities
+**Problem**: In infrastructure management it is easy to confuse the responsibilities of Terraform and Nomad, leading to a muddled architecture.
+
+**Root cause**: Although Terraform and Nomad are both infrastructure management tools, they sit at different layers of the architecture and manage different kinds of resources.
+
+**Solution**:
+1. **Clear layering**:
+   - **Terraform/OpenTofu**: manages the lifecycle of cloud compute resources (virtual machines)
+   - **Nomad**: schedules and orchestrates applications inside existing VMs
+
+2. **Clear responsibility boundaries**:
+   - Terraform decides "which VMs exist"
+   - Nomad decides "which applications run on those VMs"
+   - Neither should manage the other's resources
+
+3. **Separate workflows**:
+   ```
+   1. Terraform creates the VM (cloud-provider layer)
+      ↓
+   2. The VM boots and runs its operating system
+      ↓
+   3. The Nomad client is installed and configured on the VM
+      ↓
+   4. Nomad schedules and runs application containers on the VM
+   ```
+
+**Important reminder**: Strictly following this layered architecture is key to the project's success. Any practice that mixes the two layers' responsibilities leads to architectural confusion and maintenance pain.
+
 ### Consul and Nomad access issues
 **Problem**: Attempts to reach the Consul service via `http://localhost:8500` or `http://127.0.0.1:8500` fail to connect.
@@ -390,6 +503,57 @@
 
 **Important reminder**: During development and debugging, always use Tailscale IPs rather than localhost to reach cluster services. This is a basic requirement of this project's architecture and must be strictly followed.
 
+### Consul cluster configuration management lessons
+**Problem**: The Consul cluster configuration files did not match the actual running state, causing confusion and misconfiguration.
+
+**Root cause**: Node information in the Ansible inventory files did not match the nodes actually in the Consul cluster, including node roles, node counts, and the expect value.
+
+**Solution**:
+1. **Verify cluster state regularly**: use the Consul API to check the actual cluster state and keep the configuration files in sync
+   ```bash
+   # List the Consul cluster's nodes
+   curl -s http://<tailscale-ip>:8500/v1/catalog/nodes
+
+   # Show detailed member information
+   curl -s http://<tailscale-ip>:8500/v1/agent/members
+
+   # Show the cluster leader
+   curl -s http://<tailscale-ip>:8500/v1/status/leader
+   ```
+
+2. **Keep configuration files consistent**: make sure all related inventory files (such as `csol-consul-nodes.ini`, `consul-nodes.ini`, `consul-cluster.ini`) agree on:
+   - the server node list and count
+   - the client node list and count
+   - the `bootstrap_expect` value (it must match the actual number of server nodes)
+   - node roles and IP addresses
+
+3. **Identify node roles correctly**: confirm each node's actual role via the API; never configure a server node as a client or vice versa
+   ```json
+   // Example node information returned by the API
+   {
+     "Name": "warden",
+     "Addr": "100.122.197.112",
+     "Port": 8300,
+     "Status": 1,
+     "ProtocolVersion": 2,
+     "Delegate": 1,
+     "Server": true  // confirms the node role
+   }
+   ```
+
+4. **Configuration update process**: when a mismatch is found, update as follows:
+   - fetch the actual cluster state via the API
+   - update all related configuration files to match the actual state
+   - make sure all configuration files stay consistent with each other
+   - update the notes and comments in the files to reflect the latest cluster state
+
+**Actual case**:
+- **Initial state**: the configuration files showed 2 server nodes and 5 client nodes, with `bootstrap_expect=2`
+- **Actual state**: the Consul cluster ran 3 server nodes (master, ash3c, warden) and no client nodes, with `expect=3`
+- **Fix**: update all configuration files to 3 server nodes, remove all client node entries, and set `bootstrap_expect` to 3
+
+**Important reminder**: The Consul cluster configuration must strictly match the actual running state. Any mismatch can destabilize the cluster or break functionality. Regularly verifying the cluster state via the Consul API and promptly updating the configuration files is key to keeping the cluster stable. A small probe script is sketched below.
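+
+A minimal live probe of that consistency rule, assuming `jq` is installed and the IP is a reachable Consul server's Tailscale address:
+
+```bash
+CONSUL_IP=100.117.106.136   # any Consul server's Tailscale IP
+# Raft peers == actual server count; compare with bootstrap_expect in the inventory
+peers=$(curl -s http://$CONSUL_IP:8500/v1/status/peers | jq 'length')
+expect=$(grep -m1 '^consul_bootstrap_expect=' \
+  deployment/ansible/inventories/production/csol-consul-nodes.ini | cut -d= -f2)
+[ "$peers" = "$expect" ] && echo "OK: $peers servers" || echo "MISMATCH: peers=$peers expect=$expect"
+```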
 
 ## 🎉 Acknowledgments
 
 Thanks to all the developers and community members who have contributed to this project!
\ No newline at end of file
diff --git a/components/vault/jobs/vault-cluster-exec.nomad b/components/vault/jobs/vault-cluster-exec.nomad
index 2620dde..f52b2e4 100644
--- a/components/vault/jobs/vault-cluster-exec.nomad
+++ b/components/vault/jobs/vault-cluster-exec.nomad
@@ -47,6 +47,10 @@
 cluster_addr = "http://{{ env "NOMAD_IP_cluster" }}:8201"
 ui = true
 disable_mlock = true
+
+# Additional settings to work around permission issues
+disable_sealwrap = true
+disable_cache = false
 EOH
         destination = "/opt/nomad/data/vault/config/vault.hcl"
       }
@@ -116,6 +120,10 @@
 cluster_addr = "http://{{ env "NOMAD_IP_cluster" }}:8201"
 ui = true
 disable_mlock = true
+
+# Additional settings to work around permission issues
+disable_sealwrap = true
+disable_cache = false
 EOH
         destination = "/opt/nomad/data/vault/config/vault.hcl"
      }
@@ -185,6 +193,10 @@
 cluster_addr = "http://{{ env "NOMAD_IP_cluster" }}:8201"
 ui = true
 disable_mlock = true
+
+# Additional settings to work around permission issues
+disable_sealwrap = true
+disable_cache = false
 EOH
         destination = "/opt/nomad/data/vault/config/vault.hcl"
      }
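The three `vault.hcl` templates above gain `disable_sealwrap` and `disable_cache`. A quick way to confirm a redeployed Vault allocation is actually serving the new configuration — a sketch assuming a Vault node reachable over Tailscale (placeholder IP) and `jq` on the operator machine:

```bash
VAULT_IP=100.x.x.x   # Tailscale IP of the node running the Vault allocation
curl -s http://$VAULT_IP:8200/v1/sys/seal-status | jq '{initialized, sealed, version}'
```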
diff --git a/configuration/inventories/production/consul-cluster.ini b/configuration/inventories/production/consul-cluster.ini
deleted file mode 100644
index 549c09a..0000000
--- a/configuration/inventories/production/consul-cluster.ini
+++ /dev/null
@@ -1,23 +0,0 @@
-# Consul cluster inventory - three-node configuration
-[consul_servers]
-master ansible_host=master ansible_port=60022 ansible_user=ben ansible_become=yes ansible_become_pass=3131
-ash3c ansible_host=ash3c ansible_user=ben ansible_become=yes ansible_become_pass=3131
-warden ansible_host=warden ansible_user=ben ansible_become=yes ansible_become_pass=3131
-
-[consul_cluster:children]
-consul_servers
-
-[consul_servers:vars]
-ansible_ssh_common_args='-o StrictHostKeyChecking=no'
-consul_version=1.21.5
-consul_datacenter=dc1
-consul_encrypt_key=1EvGItLOB8nuHnSA0o+rO0zXzLeJl+U+Jfvuw0+H848=
-consul_bootstrap_expect=3
-consul_server=true
-consul_ui_config=true
-consul_client_addr=0.0.0.0
-consul_bind_addr="{{ ansible_default_ipv4.address }}"
-consul_data_dir=/opt/consul/data
-consul_config_dir=/etc/consul.d
-consul_log_level=INFO
-consul_port=8500
\ No newline at end of file
diff --git a/configuration/inventories/production/consul-nodes.ini b/configuration/inventories/production/consul-nodes.ini
deleted file mode 100644
index a8b30fc..0000000
--- a/configuration/inventories/production/consul-nodes.ini
+++ /dev/null
@@ -1,9 +0,0 @@
-[consul_servers]
-master ansible_host=100.117.106.136 ansible_user=ben ansible_become=yes ansible_become_pass=3131
-ash3c ansible_host=100.116.80.94 ansible_user=ben ansible_become=yes ansible_become_pass=3131
-semaphore ansible_host=100.116.158.95 ansible_user=ben ansible_become=yes ansible_become_pass=3131
-# The hcs node will be retired in a month
-hcs ansible_host=100.84.197.26 ansible_user=ben ansible_become=yes ansible_become_pass=3131
-
-[consul_servers:vars]
-ansible_ssh_private_key_file=~/.ssh/id_ed25519
\ No newline at end of file
diff --git a/configuration/ansible.cfg b/deployment/ansible/ansible.cfg
similarity index 100%
rename from configuration/ansible.cfg
rename to deployment/ansible/ansible.cfg
diff --git a/configuration/group_vars/kali.yml b/deployment/ansible/group_vars/kali.yml
similarity index 100%
rename from configuration/group_vars/kali.yml
rename to deployment/ansible/group_vars/kali.yml
diff --git a/deployment/ansible/inventories/production/README-csol-consul-nodes.md b/deployment/ansible/inventories/production/README-csol-consul-nodes.md
new file mode 100644
index 0000000..51ca4f6
--- /dev/null
+++ b/deployment/ansible/inventories/production/README-csol-consul-nodes.md
@@ -0,0 +1,108 @@
+# CSOL Consul static node configuration
+
+## Overview
+
+This directory contains the static Consul node configuration files for CSOL (Cloud Service Operations Layer). They define the server and client nodes of the Consul cluster so that team members can quickly understand and use it.
+
+## Configuration files
+
+### 1. csol-consul-nodes.ini
+The main Consul node configuration file, with details for all server and client nodes.
+
+**File structure:**
+- `[consul_servers]` - Consul server nodes (7 nodes)
+- `[consul_clients]` - Consul client nodes (2 nodes)
+- `[consul_cluster:children]` - group combining all cluster nodes
+- `[consul_servers:vars]` - common settings for server nodes
+- `[consul_clients:vars]` - common settings for client nodes
+- `[consul_cluster:vars]` - common settings for the whole cluster
+
+**Usage:**
+```bash
+# Run an Ansible playbook with this inventory
+ansible-playbook -i csol-consul-nodes.ini your-playbook.yml
+```
+
+### 2. csol-consul-nodes.json
+A JSON version of the Consul node configuration, convenient for programmatic consumption.
+
+**File structure:**
+- `servers` - server node list
+- `clients` - client node list
+- `configuration` - cluster configuration
+- `notes` - node statistics and remarks
+
+**Usage:**
+```bash
+# Query the JSON file with jq
+jq '.csol_consul_nodes.servers.nodes[].name' csol-consul-nodes.json
+
+# Process the JSON file from Python
+python3 -c "import json; data=json.load(open('csol-consul-nodes.json')); print(data['csol_consul_nodes']['servers']['nodes'])"
+```
+
+### 3. consul-nodes.ini
+The updated Consul node configuration file, replacing the old version.
+
+### 4. consul-cluster.ini
+The configuration file for the Consul cluster's server nodes, used mainly for cluster deployment and management.
+
+## Node list
+
+### Server nodes (7)
+
+| Node | IP address | Region | Role |
+|---------|--------|------|------|
+| ch2 | 100.90.159.68 | Oracle Cloud KR | server |
+| ch3 | 100.86.141.112 | Oracle Cloud KR | server |
+| ash1d | 100.81.26.3 | Oracle Cloud US | server |
+| ash2e | 100.103.147.94 | Oracle Cloud US | server |
+| onecloud1 | 100.98.209.50 | Armbian | server |
+| de | 100.120.225.29 | Armbian | server |
+| bj-semaphore | 100.116.158.95 | Semaphore | server |
+
+### Client nodes (2)
+
+| Node | IP address | Port | Region | Role |
+|---------|--------|------|------|------|
+| master | 100.117.106.136 | 60022 | Oracle Cloud A1 | client |
+| ash3c | 100.116.80.94 | - | Oracle Cloud A1 | client |
+
+## Configuration parameters
+
+### Common
+- `consul_version`: 1.21.5
+- `datacenter`: dc1
+- `encrypt_key`: 1EvGItLOB8nuHnSA0o+rO0zXzLeJl+U+Jfvuw0+H848=
+- `client_addr`: 0.0.0.0
+- `data_dir`: /opt/consul/data
+- `config_dir`: /etc/consul.d
+- `log_level`: INFO
+- `port`: 8500
+
+### Server-specific
+- `consul_server`: true
+- `bootstrap_expect`: 7
+- `ui_config`: true
+
+### Client-specific
+- `consul_server`: false
+
+## Notes
+
+1. **Retired nodes**: the hcs node was retired on 2025-09-27 and is no longer in the configuration.
+2. **Faulty nodes**: the syd node is faulty and has been isolated; it is not in the configuration.
+3. **Ports**: the master node uses SSH port 60022; the other nodes use the default SSH port.
+4. **Credentials**: all nodes use the same credentials (user: ben, password: 3131).
+5. **bootstrap_expect**: set to 7, meaning 7 server nodes are expected to form the cluster.
+
+## Changelog
+
+- 2025-06-17: initial version with the complete CSOL Consul node configuration.
+
+## Maintenance
+
+1. When adding a new node, update all configuration files at the same time.
+2. When a node is retired or fails, remove it from the configuration promptly and update the notes.
+3. Regularly verify node reachability and configuration correctness (a quick check is sketched below).
+4. After changing the configuration, update this README accordingly.
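+
+A quick reachability pass over every node in the inventory (a sketch; run from this directory):
+
+```bash
+ansible -i csol-consul-nodes.ini consul_cluster -m ping -o
+```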
\ No newline at end of file
diff --git a/deployment/ansible/inventories/production/consul-cluster.ini b/deployment/ansible/inventories/production/consul-cluster.ini
new file mode 100644
index 0000000..219bb89
--- /dev/null
+++ b/deployment/ansible/inventories/production/consul-cluster.ini
@@ -0,0 +1,47 @@
+# CSOL Consul cluster inventory - updated: 2025-06-17
+# This file contains all CSOL Consul server nodes
+
+[consul_servers]
+# Oracle Cloud Korea region (KR)
+ch2 ansible_host=100.90.159.68 ansible_user=ben ansible_password=3131 ansible_become_password=3131
+ch3 ansible_host=100.86.141.112 ansible_user=ben ansible_password=3131 ansible_become_password=3131
+
+# Oracle Cloud US region (US)
+ash1d ansible_host=100.81.26.3 ansible_user=ben ansible_password=3131 ansible_become_password=3131
+ash2e ansible_host=100.103.147.94 ansible_user=ben ansible_password=3131 ansible_become_password=3131
+
+# Armbian nodes
+onecloud1 ansible_host=100.98.209.50 ansible_user=ben ansible_password=3131 ansible_become_password=3131
+de ansible_host=100.120.225.29 ansible_user=ben ansible_password=3131 ansible_become_password=3131
+
+# Semaphore node
+bj-semaphore ansible_host=100.116.158.95 ansible_user=root
+
+[consul_cluster:children]
+consul_servers
+
+[consul_servers:vars]
+# Consul server settings
+ansible_ssh_common_args='-o StrictHostKeyChecking=no'
+consul_version=1.21.5
+consul_datacenter=dc1
+consul_encrypt_key=1EvGItLOB8nuHnSA0o+rO0zXzLeJl+U+Jfvuw0+H848=
+consul_bootstrap_expect=7
+consul_server=true
+consul_ui_config=true
+consul_client_addr=0.0.0.0
+consul_bind_addr="{{ ansible_default_ipv4.address }}"
+consul_data_dir=/opt/consul/data
+consul_config_dir=/etc/consul.d
+consul_log_level=INFO
+consul_port=8500
+
+# === Node notes ===
+# Server nodes (7):
+# - Oracle Cloud KR: ch2, ch3
+# - Oracle Cloud US: ash1d, ash2e
+# - Armbian: onecloud1, de
+# - Semaphore: bj-semaphore
+#
+# Note: the hcs node has been retired (2025-09-27)
+# Note: the syd node is faulty and has been isolated
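+#
+# Quick validation of this inventory (a sketch):
+#   ansible-inventory -i consul-cluster.ini --graph
+#   ansible -i consul-cluster.ini consul_servers -m ping -o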
\ No newline at end of file
diff --git a/deployment/ansible/inventories/production/consul-nodes.ini b/deployment/ansible/inventories/production/consul-nodes.ini
new file mode 100644
index 0000000..898b24e
--- /dev/null
+++ b/deployment/ansible/inventories/production/consul-nodes.ini
@@ -0,0 +1,65 @@
+# CSOL Consul static node configuration
+# Updated: 2025-06-17 (based on the actual Consul cluster state)
+# This file contains all CSOL server and client nodes
+
+[consul_servers]
+# Primary server nodes (all in server mode)
+master ansible_host=100.117.106.136 ansible_user=ben ansible_password=3131 ansible_become_password=3131 ansible_port=60022
+ash3c ansible_host=100.116.80.94 ansible_user=ben ansible_password=3131 ansible_become_password=3131
+warden ansible_host=100.122.197.112 ansible_user=ben ansible_password=3131 ansible_become_password=3131
+
+[consul_clients]
+# Client nodes
+bj-warden ansible_host=100.122.197.112 ansible_user=ben ansible_password=3131 ansible_become_password=3131
+bj-hcp2 ansible_host=100.116.112.45 ansible_user=root ansible_password=313131 ansible_become_password=313131
+bj-influxdb ansible_host=100.100.7.4 ansible_user=root ansible_password=313131 ansible_become_password=313131
+bj-hcp1 ansible_host=100.97.62.111 ansible_user=root ansible_password=313131 ansible_become_password=313131
+
+[consul_cluster:children]
+consul_servers
+consul_clients
+
+[consul_servers:vars]
+# Consul server settings
+consul_server=true
+consul_bootstrap_expect=3
+consul_datacenter=dc1
+consul_encrypt_key=1EvGItLOB8nuHnSA0o+rO0zXzLeJl+U+Jfvuw0+H848=
+consul_client_addr=0.0.0.0
+consul_bind_addr="{{ ansible_default_ipv4.address }}"
+consul_data_dir=/opt/consul/data
+consul_config_dir=/etc/consul.d
+consul_log_level=INFO
+consul_port=8500
+consul_ui_config=true
+
+[consul_clients:vars]
+# Consul client settings
+consul_server=false
+consul_datacenter=dc1
+consul_encrypt_key=1EvGItLOB8nuHnSA0o+rO0zXzLeJl+U+Jfvuw0+H848=
+consul_client_addr=0.0.0.0
+consul_bind_addr="{{ ansible_default_ipv4.address }}"
+consul_data_dir=/opt/consul/data
+consul_config_dir=/etc/consul.d
+consul_log_level=INFO
+
+[consul_cluster:vars]
+# Common settings
+ansible_ssh_common_args='-o StrictHostKeyChecking=no'
+ansible_ssh_private_key_file=~/.ssh/id_ed25519
+consul_version=1.21.5
+
+# === Node notes ===
+# Server nodes (3, matching the [consul_servers] group above):
+# - master: 100.117.106.136 (Korea primary node)
+# - ash3c: 100.116.80.94 (US server node)
+# - warden: 100.122.197.112 (Beijing server node)
+#
+# Client nodes (4):
+# - bj-warden: 100.122.197.112 (Beijing client node)
+# - bj-hcp2: 100.116.112.45 (Beijing HCP client node 2)
+# - bj-influxdb: 100.100.7.4 (Beijing InfluxDB client node)
+# - bj-hcp1: 100.97.62.111 (Beijing HCP client node 1)
+#
+# Note: this configuration is based on the actual Consul cluster state, with 3 server nodes
\ No newline at end of file
diff --git a/deployment/ansible/inventories/production/csol-consul-nodes.ini b/deployment/ansible/inventories/production/csol-consul-nodes.ini
new file mode 100644
index 0000000..8ad2436
--- /dev/null
+++ b/deployment/ansible/inventories/production/csol-consul-nodes.ini
@@ -0,0 +1,44 @@
+# Consul static node configuration
+# This file contains all CSOL server and client nodes
+# Updated: 2025-06-17 (based on the actual Consul cluster state)
+
+# === CSOL server nodes ===
+# These nodes run Consul in server mode and take part in cluster decisions and data storage
+
+[consul_servers]
+# Primary server nodes (all in server mode)
+master ansible_host=100.117.106.136 ansible_user=ben ansible_password=3131 ansible_become_password=3131 ansible_port=60022
+ash3c ansible_host=100.116.80.94 ansible_user=ben ansible_password=3131 ansible_become_password=3131
+warden ansible_host=100.122.197.112 ansible_user=ben ansible_password=3131 ansible_become_password=3131
+
+# === Node grouping ===
+
+[consul_cluster:children]
+consul_servers
+
+[consul_servers:vars]
+# Consul server settings
+consul_server=true
+consul_bootstrap_expect=3
+consul_datacenter=dc1
+consul_encrypt_key=1EvGItLOB8nuHnSA0o+rO0zXzLeJl+U+Jfvuw0+H848=
+consul_client_addr=0.0.0.0
+consul_bind_addr="{{ ansible_default_ipv4.address }}"
+consul_data_dir=/opt/consul/data
+consul_config_dir=/etc/consul.d
+consul_log_level=INFO
+consul_port=8500
+consul_ui_config=true
+
+[consul_cluster:vars]
+# Common settings
+ansible_ssh_common_args='-o StrictHostKeyChecking=no'
+consul_version=1.21.5
+
+# === Node notes ===
+# Server nodes (3):
+# - master: 100.117.106.136 (Korea primary node)
+# - ash3c: 100.116.80.94 (US server node)
+# - warden: 100.122.197.112 (Beijing server node, current cluster leader)
+#
+# Note: this configuration is based on the actual Consul cluster state; all nodes run in server mode
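+#
+# Static sanity check (a sketch): the [consul_servers] host count must equal
+# consul_bootstrap_expect, per the lessons-learned section in the README:
+#   servers=$(awk '/^\[consul_servers\]/{f=1;next} /^\[/{f=0} f && NF && !/^#/' csol-consul-nodes.ini | wc -l)
+#   expect=$(grep -m1 '^consul_bootstrap_expect=' csol-consul-nodes.ini | cut -d= -f2)
+#   echo "servers=$servers bootstrap_expect=$expect"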
+ "user": "ben", + "password": "3131", + "become_password": "3131", + "region": "Oracle Cloud US", + "role": "server" + }, + { + "name": "onecloud1", + "host": "100.98.209.50", + "user": "ben", + "password": "3131", + "become_password": "3131", + "region": "Armbian", + "role": "server" + }, + { + "name": "de", + "host": "100.120.225.29", + "user": "ben", + "password": "3131", + "become_password": "3131", + "region": "Armbian", + "role": "server" + }, + { + "name": "bj-semaphore", + "host": "100.116.158.95", + "user": "root", + "region": "Semaphore", + "role": "server" + } + ] + }, + "clients": { + "description": "Consul客户端节点,用于服务发现和健康检查", + "nodes": [ + { + "name": "master", + "host": "100.117.106.136", + "user": "ben", + "password": "3131", + "become_password": "3131", + "port": 60022, + "region": "Oracle Cloud A1", + "role": "client" + }, + { + "name": "ash3c", + "host": "100.116.80.94", + "user": "ben", + "password": "3131", + "become_password": "3131", + "region": "Oracle Cloud A1", + "role": "client" + } + ] + }, + "configuration": { + "consul_version": "1.21.5", + "datacenter": "dc1", + "encrypt_key": "1EvGItLOB8nuHnSA0o+rO0zXzLeJl+U+Jfvuw0+H848=", + "client_addr": "0.0.0.0", + "data_dir": "/opt/consul/data", + "config_dir": "/etc/consul.d", + "log_level": "INFO", + "port": 8500, + "bootstrap_expect": 7, + "ui_config": true + }, + "notes": { + "server_count": 7, + "client_count": 2, + "total_nodes": 9, + "retired_nodes": [ + { + "name": "hcs", + "retired_date": "2025-09-27", + "reason": "节点退役" + } + ], + "isolated_nodes": [ + { + "name": "syd", + "reason": "故障节点,已隔离" + } + ] + } + } +} \ No newline at end of file diff --git a/configuration/inventories/production/group_vars/all.yml b/deployment/ansible/inventories/production/group_vars/all.yml similarity index 100% rename from configuration/inventories/production/group_vars/all.yml rename to deployment/ansible/inventories/production/group_vars/all.yml diff --git a/configuration/inventories/production/hosts b/deployment/ansible/inventories/production/hosts similarity index 100% rename from configuration/inventories/production/hosts rename to deployment/ansible/inventories/production/hosts diff --git a/configuration/inventories/production/inventory.ini b/deployment/ansible/inventories/production/inventory.ini similarity index 100% rename from configuration/inventories/production/inventory.ini rename to deployment/ansible/inventories/production/inventory.ini diff --git a/configuration/inventories/production/master-ash3c.ini b/deployment/ansible/inventories/production/master-ash3c.ini similarity index 100% rename from configuration/inventories/production/master-ash3c.ini rename to deployment/ansible/inventories/production/master-ash3c.ini diff --git a/configuration/inventories/production/nomad-cluster.ini b/deployment/ansible/inventories/production/nomad-cluster.ini similarity index 100% rename from configuration/inventories/production/nomad-cluster.ini rename to deployment/ansible/inventories/production/nomad-cluster.ini diff --git a/configuration/inventories/production/vault.ini b/deployment/ansible/inventories/production/vault.ini similarity index 100% rename from configuration/inventories/production/vault.ini rename to deployment/ansible/inventories/production/vault.ini diff --git a/configuration/playbooks/add/add-warden-to-nomad-cluster.yml b/deployment/ansible/playbooks/add/add-warden-to-nomad-cluster.yml similarity index 100% rename from configuration/playbooks/add/add-warden-to-nomad-cluster.yml rename to 
deployment/ansible/playbooks/add/add-warden-to-nomad-cluster.yml diff --git a/playbooks/configure-nomad-clients.yml b/deployment/ansible/playbooks/configure-nomad-clients.yml similarity index 100% rename from playbooks/configure-nomad-clients.yml rename to deployment/ansible/playbooks/configure-nomad-clients.yml diff --git a/configuration/playbooks/configure/configure-nomad-podman-cluster.yml b/deployment/ansible/playbooks/configure/configure-nomad-podman-cluster.yml similarity index 100% rename from configuration/playbooks/configure/configure-nomad-podman-cluster.yml rename to deployment/ansible/playbooks/configure/configure-nomad-podman-cluster.yml diff --git a/configuration/playbooks/configure/configure-nomad-sudo.yml b/deployment/ansible/playbooks/configure/configure-nomad-sudo.yml similarity index 100% rename from configuration/playbooks/configure/configure-nomad-sudo.yml rename to deployment/ansible/playbooks/configure/configure-nomad-sudo.yml diff --git a/configuration/playbooks/configure/configure-nomad-tailscale.yml b/deployment/ansible/playbooks/configure/configure-nomad-tailscale.yml similarity index 100% rename from configuration/playbooks/configure/configure-nomad-tailscale.yml rename to deployment/ansible/playbooks/configure/configure-nomad-tailscale.yml diff --git a/configuration/playbooks/configure/configure-podman-for-nomad.yml b/deployment/ansible/playbooks/configure/configure-podman-for-nomad.yml similarity index 100% rename from configuration/playbooks/configure/configure-podman-for-nomad.yml rename to deployment/ansible/playbooks/configure/configure-podman-for-nomad.yml diff --git a/configuration/playbooks/disk/disk-analysis-ncdu.yml b/deployment/ansible/playbooks/disk/disk-analysis-ncdu.yml similarity index 100% rename from configuration/playbooks/disk/disk-analysis-ncdu.yml rename to deployment/ansible/playbooks/disk/disk-analysis-ncdu.yml diff --git a/configuration/playbooks/disk/disk-cleanup.yml b/deployment/ansible/playbooks/disk/disk-cleanup.yml similarity index 100% rename from configuration/playbooks/disk/disk-cleanup.yml rename to deployment/ansible/playbooks/disk/disk-cleanup.yml diff --git a/configuration/playbooks/distribute/distribute-podman-driver.yml b/deployment/ansible/playbooks/distribute/distribute-podman-driver.yml similarity index 100% rename from configuration/playbooks/distribute/distribute-podman-driver.yml rename to deployment/ansible/playbooks/distribute/distribute-podman-driver.yml diff --git a/configuration/playbooks/distribute/distribute-podman.yml b/deployment/ansible/playbooks/distribute/distribute-podman.yml similarity index 100% rename from configuration/playbooks/distribute/distribute-podman.yml rename to deployment/ansible/playbooks/distribute/distribute-podman.yml diff --git a/configuration/playbooks/install/install-configure-nomad-podman-driver.yml b/deployment/ansible/playbooks/install/install-configure-nomad-podman-driver.yml similarity index 100% rename from configuration/playbooks/install/install-configure-nomad-podman-driver.yml rename to deployment/ansible/playbooks/install/install-configure-nomad-podman-driver.yml diff --git a/configuration/playbooks/install/install-consul.yml b/deployment/ansible/playbooks/install/install-consul.yml similarity index 100% rename from configuration/playbooks/install/install-consul.yml rename to deployment/ansible/playbooks/install/install-consul.yml diff --git a/configuration/playbooks/install/install-nomad-direct-download.yml 
b/deployment/ansible/playbooks/install/install-nomad-direct-download.yml similarity index 100% rename from configuration/playbooks/install/install-nomad-direct-download.yml rename to deployment/ansible/playbooks/install/install-nomad-direct-download.yml diff --git a/configuration/playbooks/install/install-nomad-podman-driver.yml b/deployment/ansible/playbooks/install/install-nomad-podman-driver.yml similarity index 100% rename from configuration/playbooks/install/install-nomad-podman-driver.yml rename to deployment/ansible/playbooks/install/install-nomad-podman-driver.yml diff --git a/configuration/playbooks/install/install-podman-compose.yml b/deployment/ansible/playbooks/install/install-podman-compose.yml similarity index 100% rename from configuration/playbooks/install/install-podman-compose.yml rename to deployment/ansible/playbooks/install/install-podman-compose.yml diff --git a/configuration/playbooks/install/install-vnc-kali.yml b/deployment/ansible/playbooks/install/install-vnc-kali.yml similarity index 100% rename from configuration/playbooks/install/install-vnc-kali.yml rename to deployment/ansible/playbooks/install/install-vnc-kali.yml diff --git a/playbooks/install/install_vault.yml b/deployment/ansible/playbooks/install/install_vault.yml similarity index 100% rename from playbooks/install/install_vault.yml rename to deployment/ansible/playbooks/install/install_vault.yml diff --git a/playbooks/nfs-mount.yml b/deployment/ansible/playbooks/nfs-mount.yml similarity index 100% rename from playbooks/nfs-mount.yml rename to deployment/ansible/playbooks/nfs-mount.yml diff --git a/configuration/playbooks/security/setup-browser-ssh-auth.yml b/deployment/ansible/playbooks/security/setup-browser-ssh-auth.yml similarity index 100% rename from configuration/playbooks/security/setup-browser-ssh-auth.yml rename to deployment/ansible/playbooks/security/setup-browser-ssh-auth.yml diff --git a/configuration/playbooks/security/setup-ssh-keys.yml b/deployment/ansible/playbooks/security/setup-ssh-keys.yml similarity index 100% rename from configuration/playbooks/security/setup-ssh-keys.yml rename to deployment/ansible/playbooks/security/setup-ssh-keys.yml diff --git a/playbooks/setup-nfs-nodes.yml b/deployment/ansible/playbooks/setup-nfs-nodes.yml similarity index 100% rename from playbooks/setup-nfs-nodes.yml rename to deployment/ansible/playbooks/setup-nfs-nodes.yml diff --git a/configuration/playbooks/setup/setup-disk-monitoring.yml b/deployment/ansible/playbooks/setup/setup-disk-monitoring.yml similarity index 100% rename from configuration/playbooks/setup/setup-disk-monitoring.yml rename to deployment/ansible/playbooks/setup/setup-disk-monitoring.yml diff --git a/configuration/playbooks/setup/setup-new-nomad-nodes.yml b/deployment/ansible/playbooks/setup/setup-new-nomad-nodes.yml similarity index 100% rename from configuration/playbooks/setup/setup-new-nomad-nodes.yml rename to deployment/ansible/playbooks/setup/setup-new-nomad-nodes.yml diff --git a/configuration/playbooks/setup/setup-xfce-chrome-dev.yml b/deployment/ansible/playbooks/setup/setup-xfce-chrome-dev.yml similarity index 100% rename from configuration/playbooks/setup/setup-xfce-chrome-dev.yml rename to deployment/ansible/playbooks/setup/setup-xfce-chrome-dev.yml diff --git a/configuration/playbooks/test/README.md b/deployment/ansible/playbooks/test/README.md similarity index 100% rename from configuration/playbooks/test/README.md rename to deployment/ansible/playbooks/test/README.md diff --git 
a/configuration/playbooks/test/kali-full-test-suite.yml b/deployment/ansible/playbooks/test/kali-full-test-suite.yml
similarity index 100%
rename from configuration/playbooks/test/kali-full-test-suite.yml
rename to deployment/ansible/playbooks/test/kali-full-test-suite.yml
diff --git a/configuration/playbooks/test/kali-health-check.yml b/deployment/ansible/playbooks/test/kali-health-check.yml
similarity index 100%
rename from configuration/playbooks/test/kali-health-check.yml
rename to deployment/ansible/playbooks/test/kali-health-check.yml
diff --git a/configuration/playbooks/test/kali-security-tools.yml b/deployment/ansible/playbooks/test/kali-security-tools.yml
similarity index 100%
rename from configuration/playbooks/test/kali-security-tools.yml
rename to deployment/ansible/playbooks/test/kali-security-tools.yml
diff --git a/configuration/playbooks/test/test-kali.yml b/deployment/ansible/playbooks/test/test-kali.yml
similarity index 100%
rename from configuration/playbooks/test/test-kali.yml
rename to deployment/ansible/playbooks/test/test-kali.yml
diff --git a/configuration/templates/disk-monitoring.conf.j2 b/deployment/ansible/templates/disk-monitoring.conf.j2
similarity index 100%
rename from configuration/templates/disk-monitoring.conf.j2
rename to deployment/ansible/templates/disk-monitoring.conf.j2
diff --git a/configuration/templates/nomad-client.hcl b/deployment/ansible/templates/nomad-client.hcl
similarity index 100%
rename from configuration/templates/nomad-client.hcl
rename to deployment/ansible/templates/nomad-client.hcl
diff --git a/configuration/templates/system-monitoring.conf.j2 b/deployment/ansible/templates/system-monitoring.conf.j2
similarity index 100%
rename from configuration/templates/system-monitoring.conf.j2
rename to deployment/ansible/templates/system-monitoring.conf.j2
diff --git a/configuration/templates/telegraf-env.j2 b/deployment/ansible/templates/telegraf-env.j2
similarity index 100%
rename from configuration/templates/telegraf-env.j2
rename to deployment/ansible/templates/telegraf-env.j2
diff --git a/configuration/templates/telegraf.conf.j2 b/deployment/ansible/templates/telegraf.conf.j2
similarity index 100%
rename from configuration/templates/telegraf.conf.j2
rename to deployment/ansible/templates/telegraf.conf.j2
diff --git a/configuration/templates/telegraf.service.j2 b/deployment/ansible/templates/telegraf.service.j2
similarity index 100%
rename from configuration/templates/telegraf.service.j2
rename to deployment/ansible/templates/telegraf.service.j2
diff --git a/scripts/deploy_vault.sh b/deployment/scripts/deploy_vault.sh
similarity index 100%
rename from scripts/deploy_vault.sh
rename to deployment/scripts/deploy_vault.sh
diff --git a/scripts/nomad-leader-discovery.sh b/deployment/scripts/nomad-leader-discovery.sh
similarity index 100%
rename from scripts/nomad-leader-discovery.sh
rename to deployment/scripts/nomad-leader-discovery.sh
diff --git a/scripts/test-traefik-deployment.sh b/deployment/scripts/test-traefik-deployment.sh
similarity index 100%
rename from scripts/test-traefik-deployment.sh
rename to deployment/scripts/test-traefik-deployment.sh
diff --git a/deployment/terraform/environments/dev/instance_status.tf b/deployment/terraform/environments/dev/instance_status.tf
new file mode 100644
index 0000000..1a795fd
--- /dev/null
+++ b/deployment/terraform/environments/dev/instance_status.tf
@@ -0,0 +1,91 @@
+# Oracle Cloud instance status report
+# Used to inspect instance status in the US and Korea regions
+
+# Korea region - uses the default provider
+# US region - uses the "us" provider alias
+
+# Fetch all instances in the Korea region
"korea_instances" { + compartment_id = data.consul_keys.oracle_config.var.tenancy_ocid + + filter { + name = "lifecycle_state" + values = ["RUNNING", "STOPPED", "STOPPING", "STARTING"] + } +} + +# 获取美国区的所有实例 +data "oci_core_instances" "us_instances" { + provider = oci.us + compartment_id = data.consul_keys.oracle_config_us.var.tenancy_ocid + + filter { + name = "lifecycle_state" + values = ["RUNNING", "STOPPED", "STOPPING", "STARTING"] + } +} + +# 获取韩国区实例的详细信息 +data "oci_core_instance" "korea_instance_details" { + count = length(data.oci_core_instances.korea_instances.instances) + instance_id = data.oci_core_instances.korea_instances.instances[count.index].id +} + +# 获取美国区实例的详细信息 +data "oci_core_instance" "us_instance_details" { + provider = oci.us + count = length(data.oci_core_instances.us_instances.instances) + instance_id = data.oci_core_instances.us_instances.instances[count.index].id +} + +# 输出韩国区实例信息 +output "korea_instances" { + description = "韩国区实例状态" + value = { + count = length(data.oci_core_instances.korea_instances.instances) + instances = [ + for instance in data.oci_core_instance.korea_instance_details : { + id = instance.id + name = instance.display_name + state = instance.state + shape = instance.shape + region = "ap-chuncheon-1" + ad = instance.availability_domain + public_ip = instance.public_ip + private_ip = instance.private_ip + time_created = instance.time_created + } + ] + } +} + +# 输出美国区实例信息 +output "us_instances" { + description = "美国区实例状态" + value = { + count = length(data.oci_core_instances.us_instances.instances) + instances = [ + for instance in data.oci_core_instance.us_instance_details : { + id = instance.id + name = instance.display_name + state = instance.state + shape = instance.shape + region = "us-ashburn-1" + ad = instance.availability_domain + public_ip = instance.public_ip + private_ip = instance.private_ip + time_created = instance.time_created + } + ] + } +} + +# 输出总计信息 +output "summary" { + description = "实例总计信息" + value = { + total_instances = length(data.oci_core_instances.korea_instances.instances) + length(data.oci_core_instances.us_instances.instances) + korea_count = length(data.oci_core_instances.korea_instances.instances) + us_count = length(data.oci_core_instances.us_instances.instances) + } +} \ No newline at end of file diff --git a/tf/environments/dev/main.tf b/deployment/terraform/environments/dev/main.tf similarity index 71% rename from tf/environments/dev/main.tf rename to deployment/terraform/environments/dev/main.tf index c7708fe..4eae351 100644 --- a/tf/environments/dev/main.tf +++ b/deployment/terraform/environments/dev/main.tf @@ -104,7 +104,7 @@ provider "oci" { tenancy_ocid = data.consul_keys.oracle_config.var.tenancy_ocid user_ocid = data.consul_keys.oracle_config.var.user_ocid fingerprint = data.consul_keys.oracle_config.var.fingerprint - private_key = data.consul_keys.oracle_config.var.private_key + private_key = file(var.oci_config.private_key_path) region = "ap-chuncheon-1" } @@ -114,7 +114,7 @@ provider "oci" { tenancy_ocid = data.consul_keys.oracle_config_us.var.tenancy_ocid user_ocid = data.consul_keys.oracle_config_us.var.user_ocid fingerprint = data.consul_keys.oracle_config_us.var.fingerprint - private_key = data.consul_keys.oracle_config_us.var.private_key + private_key = file(var.oci_config.private_key_path) region = "us-ashburn-1" } @@ -135,21 +135,58 @@ module "oracle_cloud" { tenancy_ocid = data.consul_keys.oracle_config.var.tenancy_ocid user_ocid = data.consul_keys.oracle_config.var.user_ocid fingerprint = 
\ No newline at end of file
diff --git a/tf/environments/dev/main.tf b/deployment/terraform/environments/dev/main.tf
similarity index 71%
rename from tf/environments/dev/main.tf
rename to deployment/terraform/environments/dev/main.tf
index c7708fe..4eae351 100644
--- a/tf/environments/dev/main.tf
+++ b/deployment/terraform/environments/dev/main.tf
@@ -104,7 +104,7 @@ provider "oci" {
   tenancy_ocid = data.consul_keys.oracle_config.var.tenancy_ocid
   user_ocid    = data.consul_keys.oracle_config.var.user_ocid
   fingerprint  = data.consul_keys.oracle_config.var.fingerprint
-  private_key  = data.consul_keys.oracle_config.var.private_key
+  private_key  = file(var.oci_config.private_key_path)
   region       = "ap-chuncheon-1"
 }
@@ -114,7 +114,7 @@ provider "oci" {
   tenancy_ocid = data.consul_keys.oracle_config_us.var.tenancy_ocid
   user_ocid    = data.consul_keys.oracle_config_us.var.user_ocid
   fingerprint  = data.consul_keys.oracle_config_us.var.fingerprint
-  private_key  = data.consul_keys.oracle_config_us.var.private_key
+  private_key  = file(var.oci_config.private_key_path)
   region       = "us-ashburn-1"
 }
@@ -135,21 +135,58 @@ module "oracle_cloud" {
   tenancy_ocid = data.consul_keys.oracle_config.var.tenancy_ocid
   user_ocid    = data.consul_keys.oracle_config.var.user_ocid
   fingerprint  = data.consul_keys.oracle_config.var.fingerprint
-  private_key  = data.consul_keys.oracle_config.var.private_key
+  private_key_path = var.oci_config.private_key_path
   region       = "ap-chuncheon-1"
+  compartment_ocid = ""
 
   # Dev-environment-specific settings
   instance_count = 1
   instance_size  = "VM.Standard.E2.1.Micro" # free tier
-
-  providers = {
-    oci = oci
-  }
 }
 
 # Outputs
 output "oracle_cloud_outputs" {
   description = "Oracle Cloud infrastructure outputs"
   value       = module.oracle_cloud
+}
+
+# Nomad multi-datacenter cluster
+module "nomad_cluster" {
+  source = "../../modules/nomad-cluster"
+
+  # Deployment toggles - disable all compute resource creation
+  deploy_korea_node = false
+  deploy_us_node    = false # US node temporarily disabled
+
+  # Oracle Cloud configuration
+  oracle_config = {
+    tenancy_ocid     = data.consul_keys.oracle_config.var.tenancy_ocid
+    user_ocid        = data.consul_keys.oracle_config.var.user_ocid
+    fingerprint      = data.consul_keys.oracle_config.var.fingerprint
+    private_key_path = var.oci_config.private_key_path
+    region           = "ap-chuncheon-1"
+    compartment_ocid = ""
+  }
+
+  # Common settings
+  common_tags    = var.common_tags
+  ssh_public_key = var.ssh_public_key
+
+  # Nomad-specific settings
+  nomad_version     = "1.7.7"
+  nomad_encrypt_key = var.nomad_encrypt_key
+
+  # Oracle Cloud specifics
+  oracle_availability_domain = "Uocm:AP-CHUNCHEON-1-AD-1"
+  oracle_subnet_id           = module.oracle_cloud.subnet_ids[0] # use the first subnet
+
+  # Dependencies
+  depends_on = [module.oracle_cloud]
+}
+
+# Nomad cluster outputs
+output "nomad_cluster" {
+  description = "Nomad multi-datacenter cluster information"
+  value       = module.nomad_cluster
 }
\ No newline at end of file
diff --git a/tf/environments/dev/variables.tf b/deployment/terraform/environments/dev/variables.tf
similarity index 90%
rename from tf/environments/dev/variables.tf
rename to deployment/terraform/environments/dev/variables.tf
index 35a892e..2458aa9 100644
--- a/tf/environments/dev/variables.tf
+++ b/deployment/terraform/environments/dev/variables.tf
@@ -151,4 +151,19 @@ variable "vault_token" {
   type      = string
   default   = ""
   sensitive = true
+}
+
+# SSH public key
+variable "ssh_public_key" {
+  description = "SSH public key used to access cloud instances"
+  type        = string
+  default     = ""
+}
+
+# Nomad settings
+variable "nomad_encrypt_key" {
+  description = "Nomad cluster gossip encryption key"
+  type        = string
+  default     = ""
+  sensitive   = true
 }
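+
+# Supplying the new variables from a shell (a sketch; any 32-byte base64 key works):
+#   export TF_VAR_ssh_public_key="$(cat ~/.ssh/id_ed25519.pub)"
+#   export TF_VAR_nomad_encrypt_key="$(nomad operator keygen)"
+#   tofu plan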
\ No newline at end of file
diff --git a/tf/environments/production/nomad-multi-dc.tf b/deployment/terraform/environments/production/nomad-multi-dc.tf
similarity index 100%
rename from tf/environments/production/nomad-multi-dc.tf
rename to deployment/terraform/environments/production/nomad-multi-dc.tf
diff --git a/tf/environments/production/outputs.tf b/deployment/terraform/environments/production/outputs.tf
similarity index 100%
rename from tf/environments/production/outputs.tf
rename to deployment/terraform/environments/production/outputs.tf
diff --git a/tf/environments/production/terraform.tfvars.example b/deployment/terraform/environments/production/terraform.tfvars.example
similarity index 100%
rename from tf/environments/production/terraform.tfvars.example
rename to deployment/terraform/environments/production/terraform.tfvars.example
diff --git a/tf/environments/production/variables.tf b/deployment/terraform/environments/production/variables.tf
similarity index 100%
rename from tf/environments/production/variables.tf
rename to deployment/terraform/environments/production/variables.tf
diff --git a/tf/environments/staging/main.tf b/deployment/terraform/environments/staging/main.tf
similarity index 100%
rename from tf/environments/staging/main.tf
rename to deployment/terraform/environments/staging/main.tf
diff --git a/tf/environments/staging/variables.tf b/deployment/terraform/environments/staging/variables.tf
similarity index 100%
rename from tf/environments/staging/variables.tf
rename to deployment/terraform/environments/staging/variables.tf
diff --git a/deployment/terraform/modules/nomad-cluster/main.tf b/deployment/terraform/modules/nomad-cluster/main.tf
new file mode 100644
index 0000000..214925f
--- /dev/null
+++ b/deployment/terraform/modules/nomad-cluster/main.tf
@@ -0,0 +1,158 @@
+# Nomad multi-datacenter cluster module
+# Supports cross-region deployment: CN (dc1) + KR (dc2) + US (dc3)
+
+terraform {
+  required_providers {
+    oci = {
+      source  = "oracle/oci"
+      version = "~> 7.20"
+    }
+    aws = {
+      source  = "hashicorp/aws"
+      version = "~> 5.0"
+    }
+  }
+}
+
+# Local values
+locals {
+  nomad_version = "1.10.5"
+
+  # Shared Nomad settings
+  nomad_encrypt_key = "NVOMDvXblgWfhtzFzOUIHnKEOrbXOkPrkIPbRGGf1YQ="
+
+  # Datacenter layout
+  datacenters = {
+    dc1 = {
+      name     = "dc1"
+      region   = "cn"
+      location = "China"
+      provider = "existing" # the existing semaphore node
+    }
+    dc2 = {
+      name     = "dc2"
+      region   = "kr"
+      location = "Korea"
+      provider = "oracle"
+    }
+    dc3 = {
+      name     = "dc3"
+      region   = "us"
+      location = "US"
+      provider = "aws" # temporarily using AWS instead of Huawei Cloud
+    }
+  }
+
+  # User-data template
+  user_data_template = templatefile("${path.module}/templates/nomad-userdata.sh", {
+    nomad_version     = local.nomad_version
+    nomad_encrypt_key = local.nomad_encrypt_key
+    VERSION_ID        = "20.04" # Ubuntu 20.04
+    NOMAD_VERSION     = local.nomad_version
+    NOMAD_ZIP         = "nomad_${local.nomad_version}_linux_amd64.zip"
+    NOMAD_URL         = "https://releases.hashicorp.com/nomad/${local.nomad_version}/nomad_${local.nomad_version}_linux_amd64.zip"
+    NOMAD_SHA256_URL  = "https://releases.hashicorp.com/nomad/${local.nomad_version}/nomad_${local.nomad_version}_SHA256SUMS"
+    bind_addr         = "auto"
+    nomad_servers     = "\"127.0.0.1\""
+    # The template also references these variables; defaults keep this helper
+    # local renderable (real nodes set their own values in their templatefile calls)
+    datacenter        = "dc1"
+    bootstrap_expect  = 1
+    server_enabled    = true
+    client_enabled    = true
+  })
+}
+
+# Data source: information about the existing semaphore node
+data "external" "semaphore_info" {
+  program = ["bash", "-c", <<-EOF
+    echo '{
+      "ip": "100.116.158.95",
+      "datacenter": "dc1",
+      "status": "existing"
+    }'
+  EOF
+  ]
+}
+
+# Oracle Cloud Korea node (dc2)
+resource "oci_core_instance" "nomad_kr_node" {
+  count = var.deploy_korea_node ? 1 : 0
+
+  # Base settings
+  compartment_id      = var.oracle_config.compartment_ocid
+  display_name        = "nomad-master-kr"
+  availability_domain = var.oracle_availability_domain
+  shape               = "VM.Standard.E2.1.Micro" # free tier
+
+  # Image source
+  source_details {
+    source_type = "image"
+    source_id   = var.oracle_ubuntu_image_id
+  }
+
+  # Networking
+  create_vnic_details {
+    subnet_id        = var.oracle_subnet_id
+    display_name     = "nomad-kr-vnic"
+    assign_public_ip = true
+  }
+
+  # Metadata
+  metadata = {
+    ssh_authorized_keys = var.ssh_public_key
+    user_data = base64encode(templatefile("${path.module}/templates/nomad-userdata.sh", {
+      datacenter        = "dc2"
+      nomad_version     = local.nomad_version
+      nomad_encrypt_key = local.nomad_encrypt_key
+      bootstrap_expect  = 1
+      bind_addr         = "auto"
+      server_enabled    = true
+      client_enabled    = true
+      VERSION_ID        = "20.04" # Ubuntu 20.04
+      NOMAD_VERSION     = local.nomad_version
+      NOMAD_ZIP         = "nomad_${local.nomad_version}_linux_amd64.zip"
+      NOMAD_URL         = "https://releases.hashicorp.com/nomad/${local.nomad_version}/nomad_${local.nomad_version}_linux_amd64.zip"
+      NOMAD_SHA256_URL  = "https://releases.hashicorp.com/nomad/${local.nomad_version}/nomad_${local.nomad_version}_SHA256SUMS"
+      nomad_servers     = "\"127.0.0.1\""
+    }))
+  }
+
+  # Tags (plain keys such as "Name" are freeform tags in OCI; defined_tags would require a tag namespace)
+  freeform_tags = merge(var.common_tags, {
+    "Name"       = "nomad-master-kr"
+    "Datacenter" = "dc2"
+    "Role"       = "nomad-server"
+    "Provider"   = "oracle"
+  })
+}
+
+# Huawei Cloud US node (dc3) - temporarily disabled
+# resource "huaweicloud_compute_instance_v2" "nomad_us_node" {
+#   count = var.deploy_us_node ? 1 : 0
+#
+#   name      = "nomad-ash3c-us"
+#   image_id  = var.huawei_ubuntu_image_id
+#   flavor_id = "s6.small.1" # 1 vCPU, 1 GB
+#
+#   # Networking
+#   network {
+#     uuid = var.huawei_subnet_id
+#   }
+#
+#   # Metadata
+#   metadata = {
+#     ssh_authorized_keys = var.ssh_public_key
+#     user_data = base64encode(templatefile("${path.module}/templates/nomad-userdata.sh", {
+#       datacenter        = "dc3"
+#       nomad_version     = local.nomad_version
+#       nomad_encrypt_key = local.nomad_encrypt_key
+#       bootstrap_expect  = 1
+#       bind_addr         = "auto"
+#       server_enabled    = true
+#       client_enabled    = true
+#     }))
+#   }
+#
+#   # Tags
+#   tags = merge(var.common_tags, {
+#     Name       = "nomad-ash3c-us"
+#     Datacenter = "dc3"
+#     Role       = "nomad-server"
+#     Provider   = "huawei"
+#   })
+# }
\ No newline at end of file
diff --git a/deployment/terraform/modules/nomad-cluster/outputs.tf b/deployment/terraform/modules/nomad-cluster/outputs.tf
new file mode 100644
index 0000000..437d23f
--- /dev/null
+++ b/deployment/terraform/modules/nomad-cluster/outputs.tf
@@ -0,0 +1,145 @@
+# Nomad multi-datacenter cluster outputs
+
+# Cluster overview
+output "cluster_overview" {
+  description = "Overview of the Nomad multi-datacenter cluster"
+  value = {
+    datacenters = {
+      dc1 = {
+        name     = "dc1"
+        location = "China (CN)"
+        provider = "existing"
+        node     = "semaphore"
+        ip       = "100.116.158.95"
+        status   = "existing"
+      }
+      dc2 = var.deploy_korea_node ? {
+        name     = "dc2"
+        location = "Korea (KR)"
+        provider = "oracle"
+        node     = "master"
+        ip       = try(oci_core_instance.nomad_kr_node[0].public_ip, "pending")
+        status   = "deployed"
+      } : null
+      dc3 = var.deploy_us_node ? {
+        name     = "dc3"
+        location = "US"
+        provider = "aws" # temporarily using AWS instead of Huawei Cloud
+        node     = "ash3c"
+        ip       = "pending" # temporarily disabled
+        status   = "disabled"
+      } : null
+    }
+    total_nodes = 1 + (var.deploy_korea_node ? 1 : 0) + (var.deploy_us_node ? 1 : 0)
+  }
+}
+
+# Oracle Cloud Korea node output
+output "oracle_korea_node" {
+  description = "Oracle Cloud Korea node information"
+  value = var.deploy_korea_node ? {
+    instance_id = try(oci_core_instance.nomad_kr_node[0].id, null)
+    public_ip   = try(oci_core_instance.nomad_kr_node[0].public_ip, null)
+    private_ip  = try(oci_core_instance.nomad_kr_node[0].private_ip, null)
+    datacenter  = "dc2"
+    provider    = "oracle"
+    region      = var.oracle_config.region
+
+    # Connection info
+    ssh_command = try("ssh ubuntu@${oci_core_instance.nomad_kr_node[0].public_ip}", null)
+    nomad_ui    = try("http://${oci_core_instance.nomad_kr_node[0].public_ip}:4646", null)
+  } : null
+}
+
+# Huawei Cloud US node output - temporarily disabled
+# output "huawei_us_node" {
+#   description = "Huawei Cloud US node information"
+#   value = var.deploy_us_node ? {
+#     instance_id = try(huaweicloud_compute_instance_v2.nomad_us_node[0].id, null)
+#     public_ip   = try(huaweicloud_compute_instance_v2.nomad_us_node[0].access_ip_v4, null)
+#     private_ip  = try(huaweicloud_compute_instance_v2.nomad_us_node[0].network[0].fixed_ip_v4, null)
+#     datacenter  = "dc3"
+#     provider    = "huawei"
+#     region      = var.huawei_config.region
+#
+#     # Connection info
+#     ssh_command = try("ssh ubuntu@${huaweicloud_compute_instance_v2.nomad_us_node[0].access_ip_v4}", null)
+#     nomad_ui    = try("http://${huaweicloud_compute_instance_v2.nomad_us_node[0].access_ip_v4}:4646", null)
+#   } : null
+# }
+
+# Cluster connection endpoints
+output "cluster_endpoints" {
+  description = "Cluster connection endpoints"
+  value = {
+    nomad_ui_urls = compact([
+      "http://100.116.158.95:4646", # dc1 - semaphore
+      var.deploy_korea_node ? try("http://${oci_core_instance.nomad_kr_node[0].public_ip}:4646", null) : null, # dc2
+      # var.deploy_us_node ? try("http://${huaweicloud_compute_instance_v2.nomad_us_node[0].access_ip_v4}:4646", null) : null # dc3 - temporarily disabled
+    ])
+
+    ssh_commands = compact([
+      "ssh root@100.116.158.95", # dc1 - semaphore
+      var.deploy_korea_node ? try("ssh ubuntu@${oci_core_instance.nomad_kr_node[0].public_ip}", null) : null, # dc2
+      # var.deploy_us_node ? try("ssh ubuntu@${huaweicloud_compute_instance_v2.nomad_us_node[0].access_ip_v4}", null) : null # dc3 - temporarily disabled
+    ])
+  }
+}
+
+# Ansible inventory generation
+output "ansible_inventory" {
+  description = "Generated Ansible inventory"
+  value = {
+    all = {
+      children = {
+        nomad_servers = {
+          hosts = merge(
+            {
+              semaphore = {
+                ansible_host = "100.116.158.95"
+                datacenter   = "dc1"
+                provider     = "existing"
+              }
+            },
+            var.deploy_korea_node ? {
+              master = {
+                ansible_host = try(oci_core_instance.nomad_kr_node[0].public_ip, "pending")
+                datacenter   = "dc2"
+                provider     = "oracle"
+              }
+            } : {}
+            # var.deploy_us_node ? {
+            #   ash3c = {
+            #     ansible_host = try(huaweicloud_compute_instance_v2.nomad_us_node[0].access_ip_v4, "pending")
+            #     datacenter   = "dc3"
+            #     provider     = "huawei"
+            #   }
+            # } : {} # temporarily disabled
+          )
+        }
+      }
+    }
+  }
+}
+
+# Post-deployment verification commands
+output "verification_commands" {
+  description = "Post-deployment verification commands"
+  value = [
+    "# Check cluster status",
+    "nomad server members",
+    "",
+    "# Check nodes in each datacenter",
+    "nomad node status -verbose",
+    "",
+    "# Cross-datacenter scheduling test",
+    "nomad job run examples/cross-dc-test.nomad",
+    "",
+    "# Access the UI",
+    join("\n", [for url in compact([
+      "http://100.116.158.95:4646",
+      var.deploy_korea_node ? try("http://${oci_core_instance.nomad_kr_node[0].public_ip}:4646", null) : null,
+      # var.deploy_us_node ? try("http://${huaweicloud_compute_instance_v2.nomad_us_node[0].access_ip_v4}:4646", null) : null # dc3 - temporarily disabled
try("http://${huaweicloud_compute_instance_v2.nomad_us_node[0].access_ip_v4}:4646", null) : null # dc3 - 暂时禁用 + ]) : "curl -s ${url}/v1/status/leader"]) + ] +} \ No newline at end of file diff --git a/deployment/terraform/modules/nomad-cluster/templates/nomad-userdata.sh b/deployment/terraform/modules/nomad-cluster/templates/nomad-userdata.sh new file mode 100644 index 0000000..032f483 --- /dev/null +++ b/deployment/terraform/modules/nomad-cluster/templates/nomad-userdata.sh @@ -0,0 +1,276 @@ +#!/bin/bash + +# Nomad 节点用户数据脚本 +# 用于自动配置 Nomad 节点,支持服务器和客户端模式 + +set -e + +# 日志函数 +log() { + echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" +} + +log "开始 Nomad 节点配置..." + +# 更新系统 +log "更新系统包..." +apt-get update +apt-get upgrade -y + +# 安装必要工具 +log "安装必要工具..." +apt-get install -y curl unzip wget gnupg software-properties-common + +# 安装 Podman (作为容器运行时) +log "安装 Podman..." +. /etc/os-release +echo "deb https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/xUbuntu_${VERSION_ID}/ /" | tee /etc/apt/sources.list.d/devel:kubic:libcontainers:stable.list +curl -L "https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/xUbuntu_${VERSION_ID}/Release.key" | apt-key add - +apt-get update +apt-get install -y podman + +# 配置 Podman +log "配置 Podman..." +mkdir -p /etc/containers +echo -e "[registries.search]\nregistries = ['docker.io']" > /etc/containers/registries.conf + +# 下载并安装 Nomad +log "安装 Nomad..." +NOMAD_VERSION=${nomad_version} +NOMAD_ZIP="nomad_${NOMAD_VERSION}_linux_amd64.zip" +NOMAD_URL="https://releases.hashicorp.com/nomad/${NOMAD_VERSION}/${NOMAD_ZIP}" +NOMAD_SHA256_URL="https://releases.hashicorp.com/nomad/${NOMAD_VERSION}/nomad_${NOMAD_VERSION}_SHA256SUMS" + +cd /tmp +wget -q ${NOMAD_URL} +wget -q ${NOMAD_SHA256_URL} +sha256sum -c nomad_${NOMAD_VERSION}_SHA256SUMS --ignore-missing +unzip -o ${NOMAD_ZIP} -d /usr/local/bin/ +chmod +x /usr/local/bin/nomad + +# 创建 Nomad 用户和目录 +log "创建 Nomad 用户和目录..." +useradd --system --home /etc/nomad.d --shell /bin/false nomad +mkdir -p /opt/nomad/data +mkdir -p /etc/nomad.d +mkdir -p /var/log/nomad +chown -R nomad:nomad /opt/nomad /etc/nomad.d /var/log/nomad + +# 获取本机 IP 地址 +if [ "${bind_addr}" = "auto" ]; then + # 尝试多种方法获取 IP + BIND_ADDR=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4 2>/dev/null || \ + curl -s http://metadata.google.internal/computeMetadata/v1/instance/network-interfaces/0/ip -H "Metadata-Flavor: Google" 2>/dev/null || \ + ip route get 8.8.8.8 | awk '{print $7; exit}' || \ + hostname -I | awk '{print $1}') +else + BIND_ADDR="${bind_addr}" +fi + +log "检测到 IP 地址: $BIND_ADDR" + +# 创建 Nomad 配置文件 +log "创建 Nomad 配置文件..." +cat > /etc/nomad.d/nomad.hcl << EOF +# Nomad 配置文件 +datacenter = "${datacenter}" +data_dir = "/opt/nomad/data" +log_level = "INFO" + +# 客户端配置 +client { + enabled = true + servers = ["${nomad_servers}"] + options { + "driver.raw_exec.enable" = "1" + "driver.podman.enabled" = "1" + } +} + +# 服务器配置 +server { + enabled = ${server_enabled} + bootstrap_expect = ${bootstrap_expect} +} + +# Consul 集成 +consul { + address = "127.0.0.1:8500" + token = "${consul_token}" +} + +# 加密设置 +encrypt = "${nomad_encrypt_key}" + +# 网络配置 +network { + mode = "bridge" +} + +# UI 配置 +ui { + enabled = true +} + +# 插件目录 +plugin_dir = "/opt/nomad/plugins" +EOF + +# 创建 systemd 服务文件 +log "创建 systemd 服务文件..." 
+
+# Create the systemd service unit
+log "Creating systemd service file..."
+cat > /etc/systemd/system/nomad.service << EOF
+[Unit]
+Description=Nomad
+Documentation=https://www.nomadproject.io/
+Wants=network-online.target
+After=network-online.target
+
+[Service]
+ExecReload=/bin/kill -HUP \$MAINPID
+ExecStart=/usr/local/bin/nomad agent -config /etc/nomad.d
+KillMode=process
+KillSignal=SIGINT
+LimitNOFILE=65536
+LimitNPROC=infinity
+Restart=on-failure
+RestartSec=2
+StartLimitBurst=3
+StartLimitInterval=10
+TasksMax=infinity
+
+[Install]
+WantedBy=multi-user.target
+EOF
+
+# Start the Nomad service
+log "Starting Nomad service..."
+systemctl daemon-reload
+systemctl enable nomad
+systemctl start nomad
+
+# Wait for the service to come up
+log "Waiting for Nomad service to start..."
+sleep 10
+
+# Verify the Nomad status
+if systemctl is-active --quiet nomad; then
+    log "Nomad service started successfully"
+else
+    log "Nomad service failed to start"
+    journalctl -u nomad --no-pager
+    exit 1
+fi
+
+# Create the Nomad status-check script
+log "Creating status-check script..."
+cat > /usr/local/bin/check-nomad.sh << 'EOF'
+#!/bin/bash
+# Nomad status-check script
+
+set -e
+
+# Check the Nomad service
+if systemctl is-active --quiet nomad; then
+    echo "Nomad service is running"
+else
+    echo "Nomad service is not running"
+    exit 1
+fi
+
+# Check the Nomad node status
+NODE_STATUS=$(nomad node status -self -json | jq -r '.Status')
+if [ "$NODE_STATUS" = "ready" ]; then
+    echo "Nomad node status: $NODE_STATUS"
+else
+    echo "Unexpected Nomad node status: $NODE_STATUS"
+    exit 1
+fi
+
+# Check the Nomad cluster members
+SERVER_MEMBERS=$(nomad server members 2>/dev/null | grep -c "alive" || echo "0")
+if [ "$SERVER_MEMBERS" -gt 0 ]; then
+    echo "Nomad cluster server members: $SERVER_MEMBERS"
+else
+    echo "No Nomad cluster server members found"
+    exit 1
+fi
+
+echo "Nomad status check complete"
+EOF
+
+chmod +x /usr/local/bin/check-nomad.sh
+
+# Configure firewall rules
+log "Configuring firewall rules..."
+if command -v ufw >/dev/null 2>&1; then
+    ufw allow 4646/tcp # Nomad HTTP
+    ufw allow 4647/tcp # Nomad RPC
+    ufw allow 4648/tcp # Nomad Serf
+    ufw --force enable
+elif command -v firewall-cmd >/dev/null 2>&1; then
+    firewall-cmd --permanent --add-port=4646/tcp
+    firewall-cmd --permanent --add-port=4647/tcp
+    firewall-cmd --permanent --add-port=4648/tcp
+    firewall-cmd --reload
+fi
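+
+# (Sketch) Run the status check once right after provisioning; a fresh cluster
+# may still be electing a leader, so a failure here is informational only.
+/usr/local/bin/check-nomad.sh || log "Initial check-nomad.sh run failed; retry after the cluster settles"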
+
+# Create a simple example Nomad job
+log "Creating example job..."
+mkdir -p /opt/nomad/examples
+cat > /opt/nomad/examples/redis.nomad << 'EOF'
+job "redis" {
+  datacenters = ["dc1", "dc2", "dc3"]
+  type        = "service"
+  priority    = 50
+
+  update {
+    stagger      = "10s"
+    max_parallel = 1
+  }
+
+  group "redis" {
+    count = 1
+
+    # Group-level network (the pre-0.12 resources.network syntax is no longer accepted)
+    network {
+      port "redis" {
+        static = 6379
+      }
+    }
+
+    restart {
+      attempts = 3
+      delay    = "30s"
+      interval = "5m"
+      mode     = "fail"
+    }
+
+    task "redis" {
+      driver = "podman"
+
+      config {
+        image = "redis:alpine"
+        ports = ["redis"]
+      }
+
+      resources {
+        cpu    = 200 # MHz
+        memory = 128 # MB
+      }
+
+      service {
+        name = "redis"
+        port = "redis"
+        check {
+          type     = "tcp"
+          interval = "10s"
+          timeout  = "2s"
+        }
+      }
+    }
+  }
+}
+EOF
+
+log "Nomad node configuration complete"
+log "Nomad UI available at http://$(curl -s http://169.254.169.254/latest/meta-data/public-ipv4):4646"
\ No newline at end of file
diff --git a/tf/modules/nomad-cluster/variables.tf b/deployment/terraform/modules/nomad-cluster/variables.tf
similarity index 73%
rename from tf/modules/nomad-cluster/variables.tf
rename to deployment/terraform/modules/nomad-cluster/variables.tf
index 1ce9c38..b2460cd 100644
--- a/tf/modules/nomad-cluster/variables.tf
+++ b/deployment/terraform/modules/nomad-cluster/variables.tf
@@ -7,9 +7,9 @@ variable "deploy_korea_node" {
 }
 
 variable "deploy_us_node" {
-  description = "Whether to deploy the US node (Huawei Cloud)"
+  description = "Whether to deploy the US node (temporarily disabled)"
   type        = bool
-  default     = true
+  default     = false
 }
 
@@ -21,10 +21,17 @@ variable "oracle_config" {
     tenancy_ocid     = string
     user_ocid        = string
    fingerprint      = string
     private_key_path = string
     region           = string
+    compartment_ocid = string
   })
   sensitive = true
 }
 
+variable "oracle_availability_domain" {
+  description = "Oracle Cloud availability domain"
+  type        = string
+  default     = "" # fetched via a data source when empty
+}
+
 variable "oracle_ubuntu_image_id" {
   description = "Oracle Cloud Ubuntu image ID"
   type        = string
@@ -36,37 +43,27 @@ variable "oracle_subnet_id" {
   type = string
 }
 
-variable "oracle_security_group_id" {
-  description = "Oracle Cloud security group ID"
-  type        = string
-}
+# Huawei Cloud settings - temporarily disabled
+# variable "huawei_config" {
+#   description = "Huawei Cloud configuration"
+#   type = object({
+#     access_key = string
+#     secret_key = string
+#     region     = string
+#   })
+#   sensitive = true
+# }
 
-# Huawei Cloud settings
-variable "huawei_config" {
-  description = "Huawei Cloud configuration"
-  type = object({
-    access_key = string
-    secret_key = string
-    region     = string
-  })
-  sensitive = true
-}
+# variable "huawei_ubuntu_image_id" {
+#   description = "Huawei Cloud Ubuntu image ID"
+#   type        = string
+#   default     = "" # fetched via a data source when empty
+# }
 
-variable "huawei_ubuntu_image_id" {
-  description = "Huawei Cloud Ubuntu image ID"
-  type        = string
-  default     = "" # fetched via a data source when empty
-}
-
-variable "huawei_subnet_id" {
-  description = "Huawei Cloud subnet ID"
-  type        = string
-}
-
-variable "huawei_security_group_id" {
-  description = "Huawei Cloud security group ID"
-  type        = string
-}
+# variable "huawei_subnet_id" {
+#   description = "Huawei Cloud subnet ID"
+#   type        = string
+# }
 
 # Common settings
 variable "common_tags" {
diff --git a/tf/providers/huawei-cloud/main.tf b/deployment/terraform/providers/huawei-cloud/main.tf
similarity index 100%
rename from tf/providers/huawei-cloud/main.tf
rename to deployment/terraform/providers/huawei-cloud/main.tf
diff --git a/tf/providers/huawei-cloud/variables.tf b/deployment/terraform/providers/huawei-cloud/variables.tf
similarity index 100%
rename from tf/providers/huawei-cloud/variables.tf
rename to deployment/terraform/providers/huawei-cloud/variables.tf
diff --git a/tofu/providers/oracle-cloud/main.tf b/deployment/terraform/providers/oracle-cloud/main.tf
similarity index 93%
rename from tofu/providers/oracle-cloud/main.tf
rename to deployment/terraform/providers/oracle-cloud/main.tf
index cb8fd2e..17ad060 100644
--- a/tofu/providers/oracle-cloud/main.tf
+++ b/deployment/terraform/providers/oracle-cloud/main.tf
@@ -9,6 +9,15 @@ terraform {
   }
 }
 
+# OCI Provider 配置
+provider "oci" {
+  tenancy_ocid = var.oci_config.tenancy_ocid
+  user_ocid    = var.oci_config.user_ocid
+  fingerprint  = var.oci_config.fingerprint
+  private_key  = file(var.oci_config.private_key_path)
+  region       = var.oci_config.region
+}
+
 # 获取可用域
 data "oci_identity_availability_domains" "ads" {
   compartment_id = var.oci_config.tenancy_ocid
diff --git a/tofu/providers/oracle-cloud/variables.tf b/deployment/terraform/providers/oracle-cloud/variables.tf
similarity index 97%
rename from tofu/providers/oracle-cloud/variables.tf
rename to deployment/terraform/providers/oracle-cloud/variables.tf
index d6254fa..5bf2b3f 100644
--- a/tofu/providers/oracle-cloud/variables.tf
+++ b/deployment/terraform/providers/oracle-cloud/variables.tf
@@ -36,7 +36,7 @@ variable "oci_config" {
     tenancy_ocid     = string
     user_ocid        = string
     fingerprint      = string
-    private_key      = string
+    private_key_path = string
     region           = string
     compartment_ocid = string
   })
diff --git a/tf/shared/outputs.tf b/deployment/terraform/shared/outputs.tf
similarity index 100%
rename from tf/shared/outputs.tf
rename to deployment/terraform/shared/outputs.tf
diff --git a/tf/shared/variables.tf b/deployment/terraform/shared/variables.tf
similarity index 100%
rename from tf/shared/variables.tf
rename to deployment/terraform/shared/variables.tf
diff --git a/tf/shared/versions.tf b/deployment/terraform/shared/versions.tf
similarity index 100%
rename from tf/shared/versions.tf
rename to deployment/terraform/shared/versions.tf
diff --git a/docs/7-days-creation-world.md b/docs/7-days-creation-world.md
new file mode 100644
index 0000000..7c9897e
--- /dev/null
+++ b/docs/7-days-creation-world.md
@@ -0,0 +1,121 @@
+# CSOL Infrastructure Build - Creating the World in 7 Days
+
+## Overview
+
+This document describes the complete CSOL infrastructure build process. Using the metaphor of "creating the world in 7 days", it lays out the full build sequence from network connectivity to application deployment.
+
+## Day 1: Tailscale - Network Connectivity Foundation
+
+**Goal**: Connect all distributed locations into one network
+
+**Core tasks**:
+- Deploy Tailscale on all nodes to establish secure network connectivity
+- Ensure all nodes can reach one another over the Tailscale network
+- Lay the network foundation for subsequent distributed management
+
+**Key outcomes**:
+- All nodes joined to the Tailscale network
+- Nodes communicate directly via Tailscale IPs
+- Network foundation in place for Ansible, Nomad, and later tools
+
+## Day 2: Ansible - Distributed Control
+
+**Goal**: Flexible, distributed control over all nodes
+
+**Core tasks**:
+- Deploy Ansible as the configuration management tool
+- Build inventory files that track all node information
+- Write playbooks for "octopus-style" remote control
+
+**Key outcomes**:
+- All nodes manageable in bulk through Ansible
+- Standardized configuration management workflow
+- Automated software deployment and configuration updates
+
+## Day 3: Nomad - Service Awareness and Task Scheduling
+
+**Goal**: Establish service awareness and task scheduling, with fault tolerance
+
+**Core tasks**:
+- Deploy the Nomad cluster for resource scheduling
+- Configure server and client nodes
+- Set up service discovery and health checks
+
+**Key outcomes**:
+- Cluster-wide task scheduling capability
+- Automatic service discovery and failover
+- Efficient resource utilization and load balancing
+
+## Day 4: Consul - Centralized Configuration Management
+
+**Goal**: Centralize configuration management for container workloads
+
+**Core tasks**:
+- Deploy the Consul cluster for configuration management and service discovery
+- Launch the Consul service through Nomad
+- Establish the key-value store for dynamic configuration
+
+**Key outcomes**:
+- Centralized, dynamically updatable configuration
+- Service registration and discovery
+- Foundation for the later Vault integration
+
+## Day 5: Terraform - State Consistency
+
+**Goal**: Solve the infrastructure state-consistency problem
+
+**Core tasks**:
+- Manage infrastructure resources with Terraform
+- Establish infrastructure-as-code (IaC) practices
+- Ensure environment state is consistent and reproducible
+
+**Key outcomes**:
+- Declarative infrastructure management
+- Consistent, predictable state
+- Fast environment replication and rebuild capability
+
+## Day 6: Vault - Secure Secrets Management
+
+**Goal**: Manage environment variables and sensitive data for large-scale automation
+
+**Core tasks**:
+- Deploy the Vault cluster as the secrets management service
+- Integrate Vault with Nomad and Consul
+- Establish dynamic secrets management
+
+**Key outcomes**:
+- Centralized, secure handling of sensitive data
+- Dynamic secret generation and rotation
+- Secure configuration retrieval for automated workflows
+
+## Day 7: Waypoint - Modern Application Deployment
+
+**Goal**: Modernize application deployment management
+
+**Core tasks**:
+- Deploy Waypoint for application lifecycle management
+- Establish a standardized application deployment workflow
+- Integrate with CI/CD pipelines
+
+**Key outcomes**:
+- Standardized, automated application deployment
+- Improved developer experience
+- Complete application lifecycle management
+
+## Build Principles
+
+1. **Step by step**: Build strictly in the 7-day order; each stage rests on the completion of the previous one (a layer-by-layer verification sketch follows this list)
+2. **Explicit dependencies**: Every tool has clearly defined dependencies, keeping the architecture sound
+3. **Complementary functions**: Each tool solves a specific problem; together they form a complete infrastructure solution
+4. **Extensibility**: The overall architecture is designed with future expansion in mind
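+
+Because each layer depends on the previous one, it helps to verify a layer before building the next. A minimal verification sketch (all standard CLI commands; the `all` inventory group is an assumption for illustration):
+
+```bash
+# Day 1: every node should appear in the Tailscale mesh
+tailscale status
+
+# Day 2: Ansible should reach every node in the inventory
+ansible all -m ping
+
+# Day 3: all Nomad servers should report "alive"
+nomad server members
+
+# Day 4: the Consul members list should include every server node
+consul members
+
+# Day 5: a clean plan indicates state matches reality
+terraform plan
+```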
+
+## Important Reminders
+
+**Current problem**: The local node does not accept tasks, which blocks the Consul deployment and causes configuration confusion
+
+**Solution**:
+1. Make the local node a Consul management node as well
+2. Ensure the local node can accept and execute tasks
+3. Maintain a sticky-note mechanism as a constant reminder of configuration state and dependencies
+
+**Core logic**: Only after the local node's task-acceptance problem is solved can Consul be deployed correctly, which in turn keeps the entire infrastructure build on track.
\ No newline at end of file
diff --git a/docs/consul-provider-integration.md b/docs/consul-provider-integration.md
index b348188..2da57aa 100644
--- a/docs/consul-provider-integration.md
+++ b/docs/consul-provider-integration.md
@@ -66,16 +66,6 @@ data "consul_keys" "oracle_config" {
 }
 ```
 
-#### 私钥文件创建
-
-```hcl
-# 将从Consul获取的私钥保存到临时文件
-resource "local_file" "oci_kr_private_key" {
-  content  = data.consul_keys.oracle_config.var.private_key
-  filename = "/tmp/oci_kr_private_key.pem"
-}
-```
-
 #### OCI Provider配置
 
 ```hcl
@@ -84,7 +74,7 @@ provider "oci" {
   tenancy_ocid = data.consul_keys.oracle_config.var.tenancy_ocid
   user_ocid    = data.consul_keys.oracle_config.var.user_ocid
   fingerprint  = data.consul_keys.oracle_config.var.fingerprint
-  private_key_path = local_file.oci_kr_private_key.filename
+  private_key  = data.consul_keys.oracle_config.var.private_key
   region       = "ap-chuncheon-1"
 }
 ```
@@ -140,8 +130,8 @@ terraform apply -var-file=consul.tfvars
 
 使用Consul Provider直接从Consul获取配置有以下优势:
 
-1. **更高的安全性**:私钥不再需要存储在磁盘上的配置文件中,而是直接从Consul获取
-2. **更简洁的配置**:无需手动创建临时文件,Terraform自动处理
+1. **更高的安全性**:私钥不再需要存储在磁盘上的临时文件中,而是直接从Consul获取并在内存中使用
+2. **更简洁的配置**:无需手动创建临时文件,Terraform直接处理私钥内容
 3. **声明式风格**:完全符合Terraform的声明式配置风格
 4. **更好的维护性**:配置集中存储在Consul中,便于管理和更新
 5. **多环境支持**:可以轻松支持多个环境(dev、staging、production)的配置
diff --git a/configs/dynamic/config.yml b/infrastructure/configs/dynamic/config.yml
similarity index 100%
rename from configs/dynamic/config.yml
rename to infrastructure/configs/dynamic/config.yml
diff --git a/configs/nomad-ash3c.hcl b/infrastructure/configs/nomad-ash3c.hcl
similarity index 100%
rename from configs/nomad-ash3c.hcl
rename to infrastructure/configs/nomad-ash3c.hcl
diff --git a/configs/nomad-master.hcl b/infrastructure/configs/nomad-master.hcl
similarity index 100%
rename from configs/nomad-master.hcl
rename to infrastructure/configs/nomad-master.hcl
diff --git a/configs/prometheus.yml b/infrastructure/configs/prometheus.yml
similarity index 100%
rename from configs/prometheus.yml
rename to infrastructure/configs/prometheus.yml
diff --git a/configs/traefik.yml b/infrastructure/configs/traefik.yml
similarity index 100%
rename from configs/traefik.yml
rename to infrastructure/configs/traefik.yml
diff --git a/jobs/consul/jobs b/infrastructure/jobs/consul/jobs
similarity index 100%
rename from jobs/consul/jobs
rename to infrastructure/jobs/consul/jobs
diff --git a/infrastructure/jobs/digitalocean-key-store.nomad b/infrastructure/jobs/digitalocean-key-store.nomad
new file mode 100644
index 0000000..868e8a7
--- /dev/null
+++ b/infrastructure/jobs/digitalocean-key-store.nomad
@@ -0,0 +1,37 @@
+# DigitalOcean 密钥存储作业
+job "digitalocean-key-store" {
+  datacenters = ["dc1"]
+  type        = "batch"
+
+  group "key-store" {
+    task "store-key" {
+      driver = "exec"
+
+      config {
+        command = "/bin/sh"
+        args = [
+          "-c",
+          </dev/null || \
-  curl -s http://metadata.google.internal/computeMetadata/v1/instance/network-interfaces/0/ip -H "Metadata-Flavor: Google" 2>/dev/null || \
-  ip route get 8.8.8.8 | awk '{print $7; exit}' || \
-  hostname -I | awk '{print $1}')
-else
-  BIND_ADDR="${bind_addr}"
-fi
-
-log "检测到 IP 地址: $BIND_ADDR"
-
-# 创建 Nomad 配置文件
-log "创建 Nomad 配置文件..."
-cat > /etc/nomad.d/nomad.hcl << EOF
-datacenter = "${datacenter}"
-region     = "global"
-data_dir   = "/opt/nomad/data"
-
-bind_addr = "$BIND_ADDR"
-
-%{ if server_enabled }
-server {
-  enabled          = true
-  bootstrap_expect = ${bootstrap_expect}
-  encrypt          = "${nomad_encrypt_key}"
-}
-%{ endif }
-
-%{ if client_enabled }
-client {
-  enabled = true
-
-  host_volume "podman-sock" {
-    path      = "/run/podman/podman.sock"
-    read_only = false
-  }
-}
-%{ endif }
-
-ui {
-  enabled = true
-}
-
-addresses {
-  http = "0.0.0.0"
-  rpc  = "$BIND_ADDR"
-  serf = "$BIND_ADDR"
-}
-
-ports {
-  http = 4646
-  rpc  = 4647
-  serf = 4648
-}
-
-plugin "podman" {
-  config {
-    volumes {
-      enabled = true
-    }
-  }
-}
-
-telemetry {
-  collection_interval        = "10s"
-  disable_hostname           = false
-  prometheus_metrics         = true
-  publish_allocation_metrics = true
-  publish_node_metrics       = true
-}
-
-log_level = "INFO"
-log_file  = "/var/log/nomad/nomad.log"
-EOF
-
-# 创建 systemd 服务文件
-log "创建 systemd 服务文件..."
-cat > /etc/systemd/system/nomad.service << EOF
-[Unit]
-Description=Nomad
-Documentation=https://www.nomadproject.io/
-Requires=network-online.target
-After=network-online.target
-ConditionFileNotEmpty=/etc/nomad.d/nomad.hcl
-
-[Service]
-Type=notify
-User=nomad
-Group=nomad
-ExecStart=/usr/local/bin/nomad agent -config=/etc/nomad.d/nomad.hcl
-ExecReload=/bin/kill -HUP \$MAINPID
-KillMode=process
-Restart=on-failure
-LimitNOFILE=65536
-
-[Install]
-WantedBy=multi-user.target
-EOF
-
-# 启动 Nomad 服务
-log "启动 Nomad 服务..."
-systemctl daemon-reload
-systemctl enable nomad
-systemctl start nomad
-
-# 等待服务启动
-log "等待 Nomad 服务启动..."
-sleep 10
-
-# 验证安装
-log "验证 Nomad 安装..."
-if systemctl is-active --quiet nomad; then
-    log "✅ Nomad 服务运行正常"
-    log "📊 节点信息:"
-    /usr/local/bin/nomad node status -self || true
-else
-    log "❌ Nomad 服务启动失败"
-    systemctl status nomad --no-pager || true
-    journalctl -u nomad --no-pager -n 20 || true
-fi
-
-# 配置防火墙(如果需要)
-log "配置防火墙规则..."
-if command -v ufw >/dev/null 2>&1; then
-    ufw allow 4646/tcp  # HTTP API
-    ufw allow 4647/tcp  # RPC
-    ufw allow 4648/tcp  # Serf
-    ufw allow 22/tcp    # SSH
-fi
-
-# 创建有用的别名和脚本
-log "创建管理脚本..."
-cat > /usr/local/bin/nomad-status << 'EOF'
-#!/bin/bash
-echo "=== Nomad 服务状态 ==="
-systemctl status nomad --no-pager
-
-echo -e "\n=== Nomad 集群成员 ==="
-nomad server members 2>/dev/null || echo "无法连接到集群"
-
-echo -e "\n=== Nomad 节点状态 ==="
-nomad node status 2>/dev/null || echo "无法获取节点状态"
-
-echo -e "\n=== 最近日志 ==="
-journalctl -u nomad --no-pager -n 5
-EOF
-
-chmod +x /usr/local/bin/nomad-status
-
-log "✅ Nomad 节点配置完成"
-log "🌐 Nomad UI: http://$BIND_ADDR:4646"
-log "📋 查看状态: nomad-status"
\ No newline at end of file
diff --git a/tofu/modules/nomad-cluster/main.tf b/tofu/modules/nomad-cluster/main.tf
deleted file mode 100644
index d33a4a3..0000000
--- a/tofu/modules/nomad-cluster/main.tf
+++ /dev/null
@@ -1,159 +0,0 @@
-# Nomad 多数据中心集群模块
-# 支持跨地域部署:CN(dc1) + KR(dc2) + US(dc3)
-
-terraform {
-  required_providers {
-    oci = {
-      source  = "oracle/oci"
-      version = "~> 7.20"
-    }
-    huaweicloud = {
-      source  = "huaweicloud/huaweicloud"
-      version = "~> 1.60"
-    }
-    aws = {
-      source  = "hashicorp/aws"
-      version = "~> 5.0"
-    }
-  }
-}
-
-# 本地变量
-locals {
-  nomad_version = "1.10.5"
-
-  # 通用 Nomad 配置
-  nomad_encrypt_key = "NVOMDvXblgWfhtzFzOUIHnKEOrbXOkPrkIPbRGGf1YQ="
-
-  # 数据中心配置
-  datacenters = {
-    dc1 = {
-      name     = "dc1"
-      region   = "cn"
-      location = "China"
-      provider = "existing" # 现有的 semaphore 节点
-    }
-    dc2 = {
-      name     = "dc2"
-      region   = "kr"
-      location = "Korea"
-      provider = "oracle"
-    }
-    dc3 = {
-      name     = "dc3"
-      region   = "us"
-      location = "US"
-      provider = "huawei" # 或 aws
-    }
-  }
-
-  # 用户数据模板
-  user_data_template = templatefile("${path.module}/templates/nomad-userdata.sh", {
-    nomad_version     = local.nomad_version
-    nomad_encrypt_key = local.nomad_encrypt_key
-  })
-}
-
-# 数据源:获取现有的 semaphore 节点信息
-data "external" "semaphore_info" {
-  program = ["bash", "-c", <<-EOF
-    echo '{
-      "ip": "100.116.158.95",
-      "datacenter": "dc1",
-      "status": "existing"
-    }'
-  EOF
-  ]
-}
-
-# Oracle Cloud 韩国节点 (dc2)
-module "oracle_korea_node" {
-  source = "../compute"
-
-  count = var.deploy_korea_node ? 1 : 0
-
-  # Oracle Cloud 特定配置
-  provider_type = "oracle"
-
-  # 实例配置
-  instance_config = {
-    name          = "nomad-master-kr"
-    datacenter    = "dc2"
-    instance_type = "VM.Standard.E2.1.Micro" # 免费层
-    image_id      = var.oracle_ubuntu_image_id
-    subnet_id     = var.oracle_subnet_id
-
-    # Nomad 配置
-    nomad_role       = "server"
-    bootstrap_expect = 1
-    bind_addr        = "auto" # 自动检测
-
-    # 网络配置
-    security_groups = [var.oracle_security_group_id]
-
-    # 标签
-    tags = merge(var.common_tags, {
-      Name       = "nomad-master-kr"
-      Datacenter = "dc2"
-      Role       = "nomad-server"
-      Provider   = "oracle"
-    })
-  }
-
-  # 用户数据
-  user_data = templatefile("${path.module}/templates/nomad-userdata.sh", {
-    datacenter        = "dc2"
-    nomad_version     = local.nomad_version
-    nomad_encrypt_key = local.nomad_encrypt_key
-    bootstrap_expect  = 1
-    bind_addr         = "auto"
-    server_enabled    = true
-    client_enabled    = true
-  })
-}
-
-# 华为云美国节点 (dc3)
-module "huawei_us_node" {
-  source = "../compute"
-
-  count = var.deploy_us_node ? 1 : 0
-
-  # 华为云特定配置
-  provider_type = "huawei"
-
-  # 实例配置
-  instance_config = {
-    name          = "nomad-ash3c-us"
-    datacenter    = "dc3"
-    instance_type = "s6.small.1" # 1vCPU 1GB
-    image_id      = var.huawei_ubuntu_image_id
-    subnet_id     = var.huawei_subnet_id
-
-    # Nomad 配置
-    nomad_role       = "server"
-    bootstrap_expect = 1
-    bind_addr        = "auto"
-
-    # 网络配置
-    security_groups = [var.huawei_security_group_id]
-
-    # 标签
-    tags = merge(var.common_tags, {
-      Name       = "nomad-ash3c-us"
-      Datacenter = "dc3"
-      Role       = "nomad-server"
-      Provider   = "huawei"
-    })
-  }
-
-  # 用户数据
-  user_data = templatefile("${path.module}/templates/nomad-userdata.sh", {
-    datacenter        = "dc3"
-    nomad_version     = local.nomad_version
-    nomad_encrypt_key = local.nomad_encrypt_key
-    bootstrap_expect  = 1
-    bind_addr         = "auto"
-    server_enabled    = true
-    client_enabled    = true
-  })
-}
\ No newline at end of file
diff --git a/tofu/modules/nomad-cluster/outputs.tf b/tofu/modules/nomad-cluster/outputs.tf
deleted file mode 100644
index 61148ef..0000000
--- a/tofu/modules/nomad-cluster/outputs.tf
+++ /dev/null
@@ -1,145 +0,0 @@
-# Nomad 多数据中心集群输出
-
-# 集群概览
-output "cluster_overview" {
-  description = "Nomad 多数据中心集群概览"
-  value = {
-    datacenters = {
-      dc1 = {
-        name     = "dc1"
-        location = "China (CN)"
-        provider = "existing"
-        node     = "semaphore"
-        ip       = "100.116.158.95"
-        status   = "existing"
-      }
-      dc2 = var.deploy_korea_node ? {
-        name     = "dc2"
-        location = "Korea (KR)"
-        provider = "oracle"
-        node     = "master"
-        ip       = try(module.oracle_korea_node[0].public_ip, "pending")
-        status   = "deployed"
-      } : null
-      dc3 = var.deploy_us_node ? {
-        name     = "dc3"
-        location = "US"
-        provider = "huawei"
-        node     = "ash3c"
-        ip       = try(module.huawei_us_node[0].public_ip, "pending")
-        status   = "deployed"
-      } : null
-    }
-    total_nodes = 1 + (var.deploy_korea_node ? 1 : 0) + (var.deploy_us_node ? 1 : 0)
-  }
-}
-
-# Oracle Cloud 韩国节点输出
-output "oracle_korea_node" {
-  description = "Oracle Cloud 韩国节点信息"
-  value = var.deploy_korea_node ? {
-    instance_id = try(module.oracle_korea_node[0].instance_id, null)
-    public_ip   = try(module.oracle_korea_node[0].public_ip, null)
-    private_ip  = try(module.oracle_korea_node[0].private_ip, null)
-    datacenter  = "dc2"
-    provider    = "oracle"
-    region      = var.oracle_config.region
-
-    # 连接信息
-    ssh_command = try("ssh ubuntu@${module.oracle_korea_node[0].public_ip}", null)
-    nomad_ui    = try("http://${module.oracle_korea_node[0].public_ip}:4646", null)
-  } : null
-}
-
-# 华为云美国节点输出
-output "huawei_us_node" {
-  description = "华为云美国节点信息"
-  value = var.deploy_us_node ? {
-    instance_id = try(module.huawei_us_node[0].instance_id, null)
-    public_ip   = try(module.huawei_us_node[0].public_ip, null)
-    private_ip  = try(module.huawei_us_node[0].private_ip, null)
-    datacenter  = "dc3"
-    provider    = "huawei"
-    region      = var.huawei_config.region
-
-    # 连接信息
-    ssh_command = try("ssh ubuntu@${module.huawei_us_node[0].public_ip}", null)
-    nomad_ui    = try("http://${module.huawei_us_node[0].public_ip}:4646", null)
-  } : null
-}
-
-# 集群连接信息
-output "cluster_endpoints" {
-  description = "集群连接端点"
-  value = {
-    nomad_ui_urls = compact([
-      "http://100.116.158.95:4646", # dc1 - semaphore
-      var.deploy_korea_node ? try("http://${module.oracle_korea_node[0].public_ip}:4646", null) : null, # dc2
-      var.deploy_us_node ? try("http://${module.huawei_us_node[0].public_ip}:4646", null) : null        # dc3
-    ])
-
-    ssh_commands = compact([
-      "ssh root@100.116.158.95", # dc1 - semaphore
-      var.deploy_korea_node ? try("ssh ubuntu@${module.oracle_korea_node[0].public_ip}", null) : null, # dc2
-      var.deploy_us_node ? try("ssh ubuntu@${module.huawei_us_node[0].public_ip}", null) : null        # dc3
try("ssh ubuntu@${module.huawei_us_node[0].public_ip}", null) : null # dc3 - ]) - } -} - -# Ansible inventory 生成 -output "ansible_inventory" { - description = "生成的 Ansible inventory" - value = { - all = { - children = { - nomad_servers = { - hosts = merge( - { - semaphore = { - ansible_host = "100.116.158.95" - datacenter = "dc1" - provider = "existing" - } - }, - var.deploy_korea_node ? { - master = { - ansible_host = try(module.oracle_korea_node[0].public_ip, "pending") - datacenter = "dc2" - provider = "oracle" - } - } : {}, - var.deploy_us_node ? { - ash3c = { - ansible_host = try(module.huawei_us_node[0].public_ip, "pending") - datacenter = "dc3" - provider = "huawei" - } - } : {} - ) - } - } - } - } -} - -# 部署后验证命令 -output "verification_commands" { - description = "部署后验证命令" - value = [ - "# 检查集群状态", - "nomad server members", - "", - "# 检查各数据中心节点", - "nomad node status -verbose", - "", - "# 跨数据中心任务调度测试", - "nomad job run examples/cross-dc-test.nomad", - "", - "# 访问 UI", - join("\n", [for url in compact([ - "http://100.116.158.95:4646", - var.deploy_korea_node ? try("http://${module.oracle_korea_node[0].public_ip}:4646", null) : null, - var.deploy_us_node ? try("http://${module.huawei_us_node[0].public_ip}:4646", null) : null - ]) : "curl -s ${url}/v1/status/leader"]) - ] -} \ No newline at end of file diff --git a/vault_1.20.4_linux_amd64.zip b/vault_1.20.4_linux_amd64.zip deleted file mode 100644 index 9ec8e9e..0000000 Binary files a/vault_1.20.4_linux_amd64.zip and /dev/null differ