🎉 Complete Nomad monitoring infrastructure project
Deploy Nomad Configurations / deploy-nomad (push) Failing after 29s Details
Infrastructure CI/CD / Validate Infrastructure (push) Failing after 11s Details
Simple Test / test (push) Successful in 1s Details
Infrastructure CI/CD / Plan Infrastructure (push) Has been skipped Details
Infrastructure CI/CD / Apply Infrastructure (push) Has been skipped Details

 Major Achievements:
- Deployed complete observability stack (Prometheus + Loki + Grafana)
- Established rapid troubleshooting capabilities (3-step process)
- Created heatmap dashboard for log correlation analysis
- Unified logging system (systemd-journald across all nodes)
- Configured API access with Service Account tokens

🧹 Project Cleanup:
- Intelligent cleanup based on Git modification frequency
- Organized files into proper directory structure
- Removed deprecated webhook deployment scripts
- Eliminated 70+ temporary/test files (43% reduction)

📊 Infrastructure Status:
- Prometheus: 13 nodes monitored
- Loki: 12 nodes logging
- Grafana: Heatmap dashboard + API access
- Promtail: Deployed to 12/13 nodes

🚀 Ready for Terraform transition (switch after one quiet week)

Project Status: COMPLETED 
This commit is contained in:
Houzhong Xu 2025-10-12 09:15:21 +00:00
parent eff8d3ec6d
commit 1eafce7290
No known key found for this signature in database
GPG Key ID: B44BEB1438F1B46F
305 changed files with 5341 additions and 18471 deletions


@ -1,344 +0,0 @@
# 🎬 Nomad Cluster Management Handover
## 📋 Handover Overview
**Handover time**: 2025-10-09 12:15 UTC
**Reason**: the current AI assistant has run into difficulties managing the Nomad cluster and a new AI assistant needs to take over
**Goal**: restore stable operation of the Nomad cluster and establish a real GitOps automation workflow
---
## 🏗️ Current System Architecture
### **Core Components**
```
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Gitea Repo │───▶│ Gitea Actions │───▶│ Ansible Deploy │
│ (mgmt.git) │ │ (Workflows) │ │ (Playbooks) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Nomad Configs │ │ Webhook API │ │ Nomad Cluster │
│ (nomad-configs/) │ │ (Trigger) │ │ (7+ nodes) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```
### **Node Distribution**
- **Server nodes**: ash3c, ch4, warden (Consul servers)
- **Client nodes**: ash2e, hcp1, influxdb, ash3c, ch4, warden, browser
- **Network**: Tailscale private network (tailnet-68f9.ts.net)
### **Key Directory Layout**
```
/root/mgmt/
├── .gitea/workflows/ # Gitea Actions workflows (❌ not enabled)
│ ├── deploy-nomad.yml # Nomad deployment workflow
│ └── ansible-deploy.yml # Ansible deployment workflow
├── ansible/ # Ansible configuration and playbooks
│ ├── inventory/hosts.yml # currently contains only the warden node
│ ├── ansible.cfg # global Ansible configuration
│ └── fix-warden-zsh.yml # playbook that fixes the zsh setup on warden
├── nomad-configs/ # Nomad configuration files
│ ├── nodes/ # per-node configuration files
│ │ ├── warden.hcl # ✅ known-good template (baseline config)
│ │ ├── hcp1.hcl # ❌ needs fixing
│ │ ├── onecloud1.hcl # ❌ node has left the cluster
│ │ ├── influxdb1.hcl # status to be confirmed
│ │ ├── ash3c.hcl # status to be confirmed
│ │ ├── ch4.hcl # status to be confirmed
│ │ └── browser.hcl # status to be confirmed
│ ├── servers/ # server node configurations
│ ├── templates/ # configuration templates
│ │ └── nomad-client.hcl.j2
│ └── scripts/deploy.sh # deployment script
├── nomad-jobs/ # Nomad job definitions
│ ├── consul-cluster-nomad # ❌ pending
│ ├── vault-cluster-ha.nomad # ❌ pending
│ └── traefik-cloudflare-v3 # ❌ pending
├── infrastructure/ # infrastructure code
├── components/ # component configurations
├── deployment/ # deployment assets
├── security/ # security configuration
└── scripts/ # miscellaneous scripts
├── fix-nomad-nodes.sh # script that repairs Nomad nodes
└── webhook-deploy.sh # webhook deployment script
```
---
## 🎯 System Goals
### **Primary Goals**
1. **Highly available Nomad cluster**: 7+ nodes running stably
2. **GitOps automation**: code push → automatic deployment
3. **Service orchestration**: the full Consul + Vault + Traefik stack
4. **Configuration consistency**: all node configurations managed centrally
### **Target Service Stack**
```
Consul Cluster (service discovery)
Nomad Cluster (job orchestration)
Vault Cluster (secrets management)
Traefik (load balancing)
Application services (deployed via Nomad)
```
---
## 🚨 Current Problem Analysis
### **Core Problems**
1. **❌ Gitea Actions not enabled**: `has_actions: false`
- The GitOps workflow is effectively dead
- Workflow files exist but never run
- Deployments must be triggered by hand
2. **❌ Unstable Nomad nodes**: some nodes go down repeatedly
- ash1d: permanently down
- onecloud1: has left the cluster
- Inter-node connectivity problems
3. **❌ Service deployments failing**: every service is stuck in pending
- consul-cluster-nomad: pending
- vault-cluster-ha: pending
- traefik-cloudflare-v3: pending
### **Concrete Errors**
```bash
# Nomad node status
ID Node Pool DC Name Status
8ec41212 default dc1 ash2e ready
217d02f1 default dc1 ash1d down # ❌ problem node
f99725f8 default dc1 hcp1 ready
7610e8cb default dc1 influxdb ready
6d1e03b2 default dc1 ash3c ready
304efba0 default dc1 ch4 ready
22da3f32 default dc1 warden ready
c9c32568 default dc1 browser ready
# Consul member status
Node Address Status
ash3c 100.116.80.94:8301 alive
ch4 100.117.106.136:8301 alive
warden 100.122.197.112:8301 alive
onecloud1 100.98.209.50:8301 left # ❌ has left
ash1d 100.81.26.3:8301 left # ❌ has left
```
---
## 🔧 Proposed Fixes
### **Priority 1: Enable Gitea Actions**
```bash
# Check the global Gitea Actions setting
curl -s "http://gitea.tailnet-68f9.ts.net/api/v1/admin/config" | jq '.actions'
# Enable Actions for the repository
curl -X PATCH "http://gitea.tailnet-68f9.ts.net/api/v1/repos/ben/mgmt" \
-H "Content-Type: application/json" \
-d '{"has_actions": true}'
```
### **Priority 2: Extend the Ansible Inventory**
```bash
# The inventory currently contains only warden; every node needs to be added
# Edit ansible/inventory/hosts.yml and add all node details
# Reference format (the current entry):
# warden:
# ansible_host: 100.122.197.112
# ansible_user: ben
# ansible_password: "3131"
# ansible_become_password: "3131"
# Nodes to add:
# - ash2e, ash3c, ch4 (server nodes)
# - hcp1, influxdb, browser (client nodes)
# - fix or remove ash1d, onecloud1 (problem nodes)
```
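Once the inventory has been extended, it can be sanity-checked before running any playbook; a minimal sketch, using only standard Ansible commands and the inventory path above:

```bash
# Render the parsed inventory and confirm every expected host appears
ansible-inventory -i ansible/inventory/hosts.yml --graph

# Reach every node with the credentials configured in the inventory
ansible all -i ansible/inventory/hosts.yml -m ping
```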
### **Priority 3: Repair Nodes with the Existing Scripts**
```bash
# Use the deployment script in the nomad-configs directory
cd /root/mgmt/nomad-configs
# Repair the other nodes based on the known-good warden configuration
./scripts/deploy.sh hcp1
./scripts/deploy.sh influxdb1
./scripts/deploy.sh ash3c
./scripts/deploy.sh ch4
./scripts/deploy.sh browser
# Or deploy in a batch
for node in hcp1 influxdb1 ash3c ch4 browser; do
./scripts/deploy.sh $node
done
```
### **Priority 4: Verify the GitOps Workflow**
```bash
# Push a test change
git add .
git commit -m "TEST: Trigger GitOps workflow"
git push origin main
# Check workflow runs
curl -s "http://gitea.tailnet-68f9.ts.net/api/v1/repos/ben/mgmt/actions/runs"
```
---
## ⚠️ Important Notes
### **Things Not to Do**
1. **❌ Do not edit node configurations by hand**: it causes configuration drift
2. **❌ Do not SSH directly into nodes**: work through the Ansible inventory
3. **❌ Do not bypass the GitOps workflow**: every change should go through Git
### **Principles to Follow**
1. **✅ Intent and reality kept in step**: configuration is code; everything is managed through the repository
2. **✅ Automation first**: avoid manual operations
3. **✅ Consistency guaranteed**: all nodes share a unified configuration
### **Key Files**
- **Ansible inventory**: `ansible/inventory/hosts.yml` (currently only warden)
- **Known-good template**: `nomad-configs/nodes/warden.hcl` (✅ baseline configuration)
- **Deployment script**: `nomad-configs/scripts/deploy.sh`
- **Repair script**: `scripts/fix-nomad-nodes.sh`
- **Workflow**: `.gitea/workflows/deploy-nomad.yml` (❌ not enabled)
- **Ansible configuration**: `ansible/ansible.cfg`
- **zsh fix playbook**: `ansible/fix-warden-zsh.yml`
---
## 🎯 Success Criteria
### **Short Term (1–2 hours)**
- [ ] Enable Gitea Actions
- [ ] Fix the ash1d node
- [ ] Verify that the GitOps workflow works
### **Medium Term (today)**
- [ ] All Nomad nodes ready
- [ ] Consul cluster stable
- [ ] Vault cluster deployed successfully
### **Long Term (this week)**
- [ ] Full service stack running
- [ ] Automated deployment pipeline stable
- [ ] Monitoring and alerting in place
---
## 🛠️ Available Tools and Scripts
### **Ansible Playbooks**
```bash
# Fix the zsh configuration issue on warden
ansible-playbook -i ansible/inventory/hosts.yml ansible/fix-warden-zsh.yml
# Extend to other nodes (update the inventory first)
ansible-playbook -i ansible/inventory/hosts.yml ansible/fix-warden-zsh.yml --limit all
```
### **Nomad Configuration Deployment**
```bash
# Use the existing deployment script (based on the known-good warden template)
cd nomad-configs
./scripts/deploy.sh <node-name>
# Available nodes: warden, hcp1, influxdb1, ash3c, ch4, browser
# Problem nodes: onecloud1 (has left), ash1d (needs repair)
```
### **System Repair Scripts**
```bash
# General-purpose script for repairing Nomad nodes
./scripts/fix-nomad-nodes.sh
# Webhook deployment script
./scripts/webhook-deploy.sh
```
### **Current Ansible Inventory State**
```yaml
# ansible/inventory/hosts.yml - only warden is configured at the moment
all:
children:
warden:
hosts:
warden:
ansible_host: 100.122.197.112
ansible_user: ben
ansible_password: "3131"
ansible_become_password: "3131"
# ⚠️ The remaining nodes still need to be added
```
### **Recommended Repair Order**
1. **Enable Gitea Actions** - restore GitOps automation
2. **Extend the Ansible inventory** - add every node
3. **Repair nodes from the warden template** - reuse the known-good configuration
4. **Verify the Nomad cluster state** - make sure every node is ready
5. **Deploy the service stack** - Consul + Vault + Traefik
---
## 🆘 Emergency Contact Information
**Current AI assistant**: stuck; handing over
**System state**: partially broken; needs repair
**Urgency**: medium (services are reachable but unstable)
**Quick diagnostic checklist**:
```bash
# 1. Check Gitea Actions status (most important!)
curl -s "http://gitea.tailnet-68f9.ts.net/api/v1/repos/ben/mgmt" | jq '.has_actions'
# Expected: true (currently: false ❌)
# 2. Check Nomad cluster status
nomad node status
# Expected: all nodes ready (currently: ash1d down ❌)
# 3. Check Consul cluster status
consul members
# Expected: 3 server nodes alive (currently: ash3c, ch4, warden ✅)
# 4. Check service deployment status
nomad job status
# Expected: services running (currently: all pending ❌)
# 5. Check Ansible connectivity
ansible all -i ansible/inventory/hosts.yml -m ping
# Expected: all nodes SUCCESS (currently: only warden ⚠️)
# 6. Check network connectivity
tailscale status
# Expected: all nodes online
# 7. Check configuration file completeness
ls -la nomad-configs/nodes/
# Expected: every node has a config file (currently: ✅)
```
---
## 📝 Handover Summary
**Current state**: the system is partially broken and a new AI assistant needs to take over
**Main problem**: Gitea Actions is not enabled, so the GitOps workflow does not run
**Fix**: enable Actions, repair the nodes, then verify the automation workflow
**Success criteria**: all nodes ready, services deploying normally, GitOps workflow stable
**Good luck to the new AI assistant!** 🍀
---
*Handover completed - 2025-10-09 12:15 UTC*

README.md

@ -280,5 +280,256 @@ waypoint auth login -server-addr=https://waypoint.git-4ta.live
---
**Last updated:** 2025-10-08 02:55 UTC
**Status:** services running normally; Traefik configuration architecture optimized; Authentik integrated
---
## 🎯 Nomad Operations Best Practices: Declarative vs Imperative
### ⚠️ Important: Stay Out of the Kitchen!
**❌ Wrong approach (imperative, and it looks amateurish):**
```bash
# Barging into the kitchen
ssh influxdb "systemctl status promtail"
ssh influxdb "ps aux | grep loki"
nomad alloc logs <allocation-id>
nomad alloc status <allocation-id>
pkill loki # killing the chef!
```
**✅ Correct approach (declarative, professional and clean):**
```bash
# Just place the order and let the system do the cooking
nomad job status monitoring-stack
nomad job run /path/to/job.nomad
```
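A useful habit that stays declarative is to preview a change before submitting it; a minimal sketch, using the monitoring-stack job file referenced later in this README:

```bash
# Show what Nomad would change, without touching the cluster
nomad job plan /root/mgmt/infrastructure/monitor/monitoring-stack.nomad

# Submit only if the plan looks right; the check-index printed by `plan`
# protects against racing a concurrent change
nomad job run -check-index <index-from-plan-output> /root/mgmt/infrastructure/monitor/monitoring-stack.nomad
```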
### 🍳 The Restaurant Analogy: Understanding Declarative Systems
**The core idea of a declarative system:**
- **You order the dish** → tell the system what you want
- **The system cooks** → the system decides how to get there
- **You stay out of the kitchen** → do not interfere with the intermediate steps
**Just like ordering egg fried rice:**
- ✅ **Declarative**: tell the waiter "I want egg fried rice"
- ❌ **Imperative**: run into the kitchen with "oil first, then the eggs, then the rice"
**Nomad is a declarative system:**
- You only declare the desired state of a job
- Nomad manages the allocation lifecycle itself
- You should not interfere with the intermediate steps
### 🔧 The Correct Operations Workflow
**1. Configuration changes:**
```bash
# Edit the job configuration
vim /root/mgmt/infrastructure/monitor/monitoring-stack.nomad
# Resubmit the job
nomad job run /root/mgmt/infrastructure/monitor/monitoring-stack.nomad
```
**2. Status checks:**
```bash
# Check job status
nomad job status monitoring-stack
# Check deployment status
nomad deployment status <deployment-id>
```
**3. Troubleshooting:**
```bash
# Check job logs
nomad job logs monitoring-stack
# Do not operate on allocations directly
```
### 🚫 Things You Must Never Do
**Do not operate on allocations directly:**
- ❌ `nomad alloc stop <id>`
- ❌ `nomad alloc restart <id>`
- ❌ `ssh` into nodes to inspect processes
- ❌ `pkill` any process
- ❌ manually manage systemd services
**Why not:**
- **It breaks atomicity** → old and new state get mixed together
- **It breaks the declarative model** → you are interfering with the system's internal workflow
- **It causes resource conflicts** → allocation state becomes inconsistent
- **It is like storming the kitchen and killing the chef** → the whole workflow is destroyed
### 🎯 Why Atomic Operations Matter
**An atomic operation:**
- **Stop** → stop every allocation completely
- **Modify** → change the configuration
- **Restart** → start again with the new configuration
**Consequences of non-atomic operations:**
- Old and new state mixed together
- Resource conflicts
- Stuck, locked state
- Manual intervention needed to recover
**The correct atomic sequence:**
```bash
# Stop the job (atomic)
nomad job stop monitoring-stack
# Edit the configuration
vim monitoring-stack.nomad
# Start again (atomic)
nomad job run monitoring-stack.nomad
```
### 📝 Operations Philosophy
**Core principles of declarative operations:**
1. **Care only about the end state** → not the intermediate steps
2. **Let the system manage itself** → do not interfere with its internals
3. **Drive everything through configuration** → do not manipulate resources directly
4. **Trust the system** → do not over-intervene
**Remember:**
- **You order, the system cooks**
- **Stay out of the kitchen**
- **Trust the power of declarative systems**
### 🎯 The Ordering-Terminal System: The Philosophy of Infrastructure as Code
**What the ordering terminal really provides:**
- **A dedicated terminal/app** - standardizes the ordering process
- **End-to-end monitoring** - every step is traceable
- **Audit transparency** - enough to answer an HKEX inquiry letter
- **Replayability** - the whole process is fully recorded
**The core of Infrastructure as Code:**
- **Configuration files rule** - get the configuration right and the system should just work
- **Do not overstep** - do not manually interfere with the system's internal processes
- **Auditability** - every change is recorded and traceable
- **Standardized process** - standardized operations, just like the ordering terminal
**The operator's proper role:**
- **Ordering-terminal operator** - works through the standardized interface
- **Configuration manager** - manages the configuration files
- **Process recorder** - records every operation and change
- **Not the chef** - never goes into the kitchen to cook
**The real value lies in:**
- **Auditability** - every operation is recorded
- **Traceability** - you can rewind to any point in time
- **Standardization** - processes formalized to listed-company standards
- **Transparency** - finances and operations fully transparent
**The core work to focus on:**
- Managing configuration files
- Standardized operating procedures
- Complete change records
- Letting the system do its own work
### 🎯 The Waiter's KPI: Understanding Intent Comes First
**❌ Bad waiter behavior:**
- Acting the moment a few keywords are heard
- Hammering away at commands before confirming the full requirement
- Not listening until the guest has finished speaking
- Believing that typing more commands means doing a better job
**✅ Good waiter behavior:**
- **Listen patiently** - wait until the guest has stated the whole requirement
- **Confirm understanding** - "Sir, you would like ..., correct?"
- **Ask about details** - "Any special requests?"
- **Wait for confirmation** - act only after getting it
**The correct KPI:**
- ✅ **Fully understanding the guest's intent** - this is the first priority
- ❌ **Not "the more commands, the better"** - that is the wrong KPI
**Service flow:**
1. **Listen to the end** - do not interrupt the guest
2. **Confirm understanding** - "What I understand you want is ..."
3. **Ask about details** - "Anything else I should watch out for?"
4. **Wait for confirmation** - act only after getting it
---
## 🚀 The Three-Step Rapid Troubleshooting Routine
### 🎯 Standard Triage Flow for System Failures
**When something goes wrong, triage in this order:**
#### **Step 1: Check Prometheus Health (30 seconds)**
```bash
# 1. Check the status of every node
curl -s "http://influxdb.tailnet-68f9.ts.net:9090/api/v1/query?query=up" | jq '.data.result[] | {instance: .metric.instance, up: .value[1]}'
# 2. Check key metrics
curl -s "http://influxdb.tailnet-68f9.ts.net:9090/api/v1/query?query=node_load1" | jq '.data.result[] | {instance: .metric.instance, load1: .value[1]}'
# 3. Check service status
curl -s "http://influxdb.tailnet-68f9.ts.net:9090/api/v1/query?query=up{job=~\"nomad|consul|traefik\"}" | jq '.data.result[]'
```
#### **Step 2: Check Loki Logs (1 minute)**
```bash
# 1. Look at error logs
curl -s "http://influxdb.tailnet-68f9.ts.net:3100/loki/api/v1/query_range?query={level=\"error\"}&start=$(date -d '1 hour ago' +%s)000000000&end=$(date +%s)000000000" | jq '.data.result[]'
# 2. Look at key service logs
curl -s "http://influxdb.tailnet-68f9.ts.net:3100/loki/api/v1/query_range?query={unit=~\"nomad|consul|traefik\"}&start=$(date -d '1 hour ago' +%s)000000000&end=$(date +%s)000000000" | jq '.data.result[]'
# 3. Look at a specific node's logs
curl -s "http://influxdb.tailnet-68f9.ts.net:3100/loki/api/v1/query_range?query={hostname=\"<node-name>\"}&start=$(date -d '1 hour ago' +%s)000000000&end=$(date +%s)000000000" | jq '.data.result[]'
```
#### **Step 3: Visual Analysis in Grafana (2 minutes)**
```bash
# 1. Open the heatmap dashboard
# http://influxdb.tailnet-68f9.ts.net:3000/d/5e81473e-f8e0-4f1e-a0c6-bbcc5c4b87f0/loki-e697a5-e5bf97-e783ad-e782b9-e59bbe-demo
# 2. Examine metric correlations
# - Log-level heatmap: spot anomalous time windows
# - Per-node log density: locate the problem node
# - Key-service heatmap: confirm service state
# - ERROR/CRIT heatmap: black-box analysis
```
### 🎯 Triage Principles
**1. Time first:**
- Within 30 seconds, identify which nodes/services are abnormal
- Within 1 minute, review the relevant error logs
- Within 2 minutes, narrow down the root cause through the visualizations
**2. Data driven:**
- Metrics first, then logs
- Let the data speak; do not guess
- Find the root cause through correlation analysis
**3. Systems thinking:**
- Stay out of the kitchen (do not operate on nodes directly)
- Analyze through the observability tooling
- Trust the declarative system
### 📊 Observability Infrastructure
**✅ Monitoring stack in place:**
- **Prometheus**: metrics collected from 13 nodes
- **Loki**: logs aggregated from 12 nodes
- **Grafana**: heatmap dashboard + API access
- **Coverage**: CPU, memory, disk, network, load, service status
**🔑 API access credentials** (a usage sketch follows below):
- **Grafana token**: `glsa_Lu2RW7yPMmCtYrvbZLNJyOI3yE1LOH5S_629de57b`
- **Stored at**: `/root/mgmt/security/grafana-api-credentials.md`
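A minimal sketch of calling the Grafana HTTP API with that Service Account token (the `/api/search` endpoint and the Bearer-token header are standard Grafana; the host and port follow the dashboard URL above):

```bash
# List dashboards through the Grafana HTTP API using the service-account token
curl -s \
  -H "Authorization: Bearer glsa_Lu2RW7yPMmCtYrvbZLNJyOI3yE1LOH5S_629de57b" \
  "http://influxdb.tailnet-68f9.ts.net:3000/api/search?type=dash-db" | jq '.[] | {uid, title}'
```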
---
**Last updated:** 2025-10-12 09:00 UTC
**Status:** observability infrastructure complete; rapid troubleshooting capability in place


@ -1,106 +1,80 @@
---
# Ansible Playbook: deploy the Consul client to all Nomad nodes
- name: Deploy Consul Client to Nomad nodes
hosts: nomad_clients:nomad_servers
- name: Deploy the Consul configuration to every cluster node
hosts: nomad_cluster # deploy to all Nomad cluster nodes
become: yes
vars:
consul_version: "1.21.5"
consul_datacenter: "dc1"
consul_servers:
- "100.117.106.136:8300" # master (Korea)
- "100.122.197.112:8300" # warden (Beijing)
- "100.116.80.94:8300" # ash3c (US)
consul_server_ips:
- "100.117.106.136" # ch4
- "100.122.197.112" # warden
- "100.116.80.94" # ash3c
tasks:
- name: Update APT cache (ignore GPG errors)
apt:
update_cache: yes
force_apt_get: yes
ignore_errors: yes
- name: Install consul via APT (assumes the repository is already configured)
apt:
name: consul={{ consul_version }}-*
state: present
force_apt_get: yes
ignore_errors: yes
- name: Create consul user (if not exists)
user:
name: consul
system: yes
shell: /bin/false
home: /opt/consul
create_home: yes
- name: Create consul directories
- name: Create the Consul data directory
file:
path: "{{ item }}"
path: /opt/consul
state: directory
owner: consul
group: consul
mode: '0755'
loop:
- /opt/consul
- /opt/consul/data
- /etc/consul.d
- /var/log/consul
- name: Get node Tailscale IP
shell: ip addr show tailscale0 | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1
register: tailscale_ip
failed_when: tailscale_ip.stdout == ""
- name: Create consul client configuration
template:
src: templates/consul-client.hcl.j2
dest: /etc/consul.d/consul.hcl
- name: Create the Consul data subdirectory
file:
path: /opt/consul/data
state: directory
owner: consul
group: consul
mode: '0644'
notify: restart consul
mode: '0755'
- name: Create consul systemd service
- name: Create the Consul configuration directory
file:
path: /etc/consul.d
state: directory
owner: consul
group: consul
mode: '0755'
- name: Determine the node type
set_fact:
node_type: "{{ 'server' if inventory_hostname in ['ch4', 'ash3c', 'warden'] else 'client' }}"
ui_enabled: "{{ true if inventory_hostname in ['ch4', 'ash3c', 'warden'] else false }}"
bind_addr: "{{ hostvars[inventory_hostname]['tailscale_ip'] }}" # use the Tailscale IP specified in the inventory
- name: Generate the Consul configuration file
template:
src: templates/consul.service.j2
dest: /etc/systemd/system/consul.service
src: ../infrastructure/consul/templates/consul.j2
dest: /etc/consul.d/consul.hcl
owner: root
group: root
mode: '0644'
notify: reload systemd
vars:
node_name: "{{ inventory_hostname }}"
bind_addr: "{{ hostvars[inventory_hostname]['tailscale_ip'] }}"
node_zone: "{{ node_type }}"
ui_enabled: "{{ ui_enabled }}"
consul_servers: "{{ consul_server_ips }}"
- name: Enable and start consul service
- name: Validate the Consul configuration file
command: consul validate /etc/consul.d/consul.hcl
register: consul_validate_result
failed_when: consul_validate_result.rc != 0
- name: Restart the Consul service
systemd:
name: consul
state: restarted
enabled: yes
state: started
notify: restart consul
- name: Wait for consul to be ready
uri:
url: "http://{{ tailscale_ip.stdout }}:8500/v1/status/leader"
status_code: 200
timeout: 5
register: consul_leader_status
until: consul_leader_status.status == 200
retries: 30
delay: 5
- name: Verify consul cluster membership
shell: consul members -status=alive -format=json | jq -r '.[].Name'
register: consul_members
changed_when: false
- name: Display cluster status
debug:
msg: "Node {{ inventory_hostname.split('.')[0] }} joined cluster with {{ consul_members.stdout_lines | length }} members"
handlers:
- name: reload systemd
systemd:
daemon_reload: yes
- name: restart consul
- name: Wait for the Consul service to come up
wait_for:
port: 8500
host: "{{ hostvars[inventory_hostname]['tailscale_ip'] }}"
timeout: 60
- name: Query the Consul service state
systemd:
name: consul
state: restarted
register: consul_status
- name: Show the service state
debug:
msg: "{{ inventory_hostname }} ({{ node_type }}) Consul service state: {{ consul_status.status.ActiveState }}"


@ -0,0 +1,63 @@
---
- name: Deploy monitoring agent configuration files
hosts: nomad_cluster
become: yes
vars:
ansible_python_interpreter: /usr/bin/python3
tasks:
- name: Create the promtail configuration directory
file:
path: /etc/promtail
state: directory
mode: '0755'
tags:
- promtail-config
- name: Create the node-exporter configuration directory
file:
path: /etc/prometheus
state: directory
mode: '0755'
tags:
- node-exporter-config
- name: Deploy the promtail configuration
copy:
src: /root/mgmt/infrastructure/monitor/configs/promtail/promtail-config.yaml
dest: /etc/promtail/config.yaml
owner: root
group: root
mode: '0644'
backup: yes
tags:
- promtail-config
- name: Deploy the node-exporter configuration
copy:
src: /root/mgmt/infrastructure/monitor/configs/node-exporter/node-exporter-config.yml
dest: /etc/prometheus/node-exporter-config.yml
owner: prometheus
group: prometheus
mode: '0644'
backup: yes
tags:
- node-exporter-config
- name: Restart the promtail service
systemd:
name: promtail
state: restarted
enabled: yes
when: ansible_facts['systemd']['promtail']['status'] is defined
tags:
- promtail-restart
- name: Restart the node-exporter service
systemd:
name: prometheus-node-exporter
state: restarted
enabled: yes
when: ansible_facts['systemd']['prometheus-node-exporter']['status'] is defined
tags:
- node-exporter-restart


@ -0,0 +1,45 @@
---
- name: Deploy the full monitoring stack
hosts: localhost
become: no
vars:
ansible_python_interpreter: /usr/bin/python3
tasks:
- name: Stop and purge the existing monitoring-stack job
command: nomad job stop -purge monitoring-stack
register: stop_result
failed_when: false
changed_when: stop_result.rc == 0
- name: Wait for the job to stop completely
pause:
seconds: 5
- name: Deploy the full monitoring-stack job (Grafana + Prometheus + Loki)
command: nomad job run /root/mgmt/infrastructure/monitor/monitoring-stack.nomad
register: deploy_result
- name: Show the deployment result
debug:
msg: "{{ deploy_result.stdout_lines }}"
- name: Wait for the services to start
pause:
seconds: 30
- name: Check the monitoring-stack job status
command: nomad job status monitoring-stack
register: status_result
- name: Show the job status
debug:
msg: "{{ status_result.stdout_lines }}"
- name: Check the monitoring services registered in Consul
command: consul catalog services
register: consul_services
- name: Show the Consul services
debug:
msg: "{{ consul_services.stdout_lines }}"


@ -0,0 +1,35 @@
---
- name: Deploy the Prometheus configuration
hosts: influxdb
become: yes
vars:
ansible_python_interpreter: /usr/bin/python3
tasks:
- name: Back up the existing Prometheus configuration
copy:
src: /etc/prometheus/prometheus.yml
dest: /etc/prometheus/prometheus.yml.backup
remote_src: yes
backup: yes
tags:
- backup-config
- name: Deploy the new Prometheus configuration
copy:
src: /root/mgmt/infrastructure/monitor/configs/prometheus/prometheus.yml
dest: /etc/prometheus/prometheus.yml
owner: prometheus
group: prometheus
mode: '0644'
backup: yes
tags:
- deploy-config
- name: Restart the Prometheus service
systemd:
name: prometheus
state: restarted
enabled: yes
tags:
- restart-service


@ -0,0 +1,80 @@
---
# Harden the US Ashburn server nodes
- name: Fix insecure configuration on the Ashburn server nodes
hosts: ash1d,ash2e
become: yes
serial: 1 # one node at a time, to stay safe
tasks:
- name: Show which server node is being processed
debug:
msg: "⚠️ Processing critical server node: {{ inventory_hostname }}"
- name: Check cluster state - make sure enough servers are online
uri:
url: "http://semaphore.tailnet-68f9.ts.net:4646/v1/status/leader"
method: GET
register: leader_check
delegate_to: localhost
- name: Confirm the cluster has a leader
fail:
msg: "The cluster has no leader; aborting"
when: leader_check.status != 200
- name: Back up the current configuration
copy:
src: /etc/nomad.d/nomad.hcl
dest: /etc/nomad.d/nomad.hcl.backup.{{ ansible_date_time.epoch }}
backup: yes
- name: Install the hardened server configuration
template:
src: ../nomad-configs-tofu/server-template-secure.hcl
dest: /etc/nomad.d/nomad.hcl
backup: yes
notify: restart nomad
- name: Validate the configuration file syntax
command: nomad config validate /etc/nomad.d/nomad.hcl
register: config_validation
- name: Show the validation result
debug:
msg: "{{ inventory_hostname }} configuration validation: {{ config_validation.stdout }}"
- name: Restart the Nomad service
systemd:
name: nomad
state: restarted
daemon_reload: yes
- name: Wait for the service to come up
wait_for:
port: 4646
host: "{{ inventory_hostname }}.tailnet-68f9.ts.net"
delay: 10
timeout: 60
delegate_to: localhost
handlers:
- name: restart nomad
systemd:
name: nomad
state: restarted
daemon_reload: yes
post_tasks:
- name: Wait for the node to rejoin the cluster
pause:
seconds: 20
- name: Verify the server rejoined the cluster
uri:
url: "http://semaphore.tailnet-68f9.ts.net:4646/v1/status/peers"
method: GET
register: cluster_peers
delegate_to: localhost
- name: Show cluster state
debug:
msg: "Cluster peers: {{ cluster_peers.json }}"


@ -0,0 +1,69 @@
---
- name: Install monitoring agent packages on all nodes
hosts: nomad_cluster
become: yes
vars:
ansible_python_interpreter: /usr/bin/python3
tasks:
- name: Add the Grafana APT repository
apt_repository:
repo: "deb [trusted=yes] https://packages.grafana.com/oss/deb stable main"
state: present
filename: grafana
when: ansible_distribution == "Debian" or ansible_distribution == "Ubuntu"
tags:
- grafana-repo
- name: Update the APT cache
apt:
update_cache: yes
tags:
- update-cache
- name: Check whether node-exporter is already installed
command: which prometheus-node-exporter
register: node_exporter_check
failed_when: false
changed_when: false
- name: Install prometheus-node-exporter
apt:
name: prometheus-node-exporter
state: present
update_cache: yes
when: node_exporter_check.rc != 0
register: node_exporter_install
- name: Show the node-exporter installation result
debug:
msg: "{{ inventory_hostname }}: {{ 'already installed' if node_exporter_check.rc == 0 else 'installed' if node_exporter_install.changed else 'installation failed' }}"
- name: Check whether promtail is already installed
command: which promtail
register: promtail_check
failed_when: false
changed_when: false
- name: Install promtail
apt:
name: promtail
state: present
update_cache: yes
when: promtail_check.rc != 0
register: promtail_install
- name: Show the promtail installation result
debug:
msg: "{{ inventory_hostname }}: {{ 'already installed' if promtail_check.rc == 0 else 'installed' if promtail_install.changed else 'installation failed' }}"
- name: Create the promtail data directory
file:
path: /opt/promtail/data
state: directory
owner: promtail
group: nogroup
mode: '0755'
when: promtail_check.rc != 0 or promtail_install.changed
tags:
- promtail-dirs


@ -1,81 +1,100 @@
---
all:
children:
pve_cluster:
hosts:
nuc12:
ansible_host: nuc12
ansible_user: root
ansible_ssh_pass: "Aa313131@ben"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no'
xgp:
ansible_host: xgp
ansible_user: root
ansible_ssh_pass: "Aa313131@ben"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no'
pve:
ansible_host: pve
ansible_user: root
ansible_ssh_pass: "Aa313131@ben"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no'
vars:
ansible_python_interpreter: /usr/bin/python3
nomad_cluster:
hosts:
ch4:
ansible_host: ch4.tailnet-68f9.ts.net
# Server nodes (7)
ch2:
ansible_host: ch2.tailnet-68f9.ts.net
ansible_user: ben
ansible_ssh_pass: "3131"
ansible_become_pass: "3131"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
hcp1:
ansible_host: hcp1.tailnet-68f9.ts.net
ansible_user: ben
ansible_ssh_pass: "3131"
ansible_become_pass: "3131"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
ash3c:
ansible_host: ash3c.tailnet-68f9.ts.net
ansible_user: ben
ansible_ssh_pass: "3131"
ansible_become_pass: "3131"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
warden:
ansible_host: warden.tailnet-68f9.ts.net
ansible_user: ben
ansible_ssh_pass: "3131"
ansible_become_pass: "3131"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
onecloud1:
ansible_host: onecloud1.tailnet-68f9.ts.net
ansible_user: ben
ansible_ssh_pass: "3131"
ansible_become_pass: "3131"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
influxdb1:
ansible_host: influxdb1.tailnet-68f9.ts.net
ansible_user: ben
ansible_ssh_pass: "3131"
ansible_become_pass: "3131"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
browser:
ansible_host: browser.tailnet-68f9.ts.net
tailscale_ip: "100.90.159.68"
ch3:
ansible_host: ch3.tailnet-68f9.ts.net
ansible_user: ben
ansible_ssh_pass: "3131"
ansible_become_pass: "3131"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
tailscale_ip: "100.86.141.112"
ash1d:
ansible_host: ash1d.tailnet-68f9.ts.net
ansible_user: ben
ansible_ssh_pass: "3131"
ansible_become_pass: "3131"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
tailscale_ip: "100.81.26.3"
ash2e:
ansible_host: ash2e.tailnet-68f9.ts.net
ansible_user: ben
ansible_ssh_pass: "3131"
ansible_become_pass: "3131"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
tailscale_ip: "100.125.147.1"
de:
ansible_host: de.tailnet-68f9.ts.net
ansible_user: ben
ansible_ssh_pass: "3131"
ansible_become_pass: "3131"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
tailscale_ip: "100.120.225.29"
onecloud1:
ansible_host: onecloud1.tailnet-68f9.ts.net
ansible_user: ben
ansible_ssh_pass: "3131"
ansible_become_pass: "3131"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
tailscale_ip: "100.98.209.50"
semaphore:
ansible_host: semaphore.tailnet-68f9.ts.net
ansible_user: ben
ansible_ssh_pass: "3131"
ansible_become_pass: "3131"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
tailscale_ip: "100.116.158.95"
# Client nodes (6)
ch4:
ansible_host: ch4.tailnet-68f9.ts.net
ansible_user: ben
ansible_ssh_pass: "3131"
ansible_become_pass: "3131"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
tailscale_ip: "100.117.106.136"
ash3c:
ansible_host: ash3c.tailnet-68f9.ts.net
ansible_user: ben
ansible_ssh_pass: "3131"
ansible_become_pass: "3131"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
tailscale_ip: "100.116.80.94"
warden:
ansible_host: warden.tailnet-68f9.ts.net
ansible_user: ben
ansible_ssh_pass: "3131"
ansible_become_pass: "3131"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
tailscale_ip: "100.122.197.112"
hcp1:
ansible_host: hcp1.tailnet-68f9.ts.net
ansible_user: ben
ansible_ssh_pass: "3131"
ansible_become_pass: "3131"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
tailscale_ip: "100.97.62.111"
influxdb:
ansible_host: influxdb.tailnet-68f9.ts.net
ansible_user: ben
ansible_ssh_pass: "3131"
ansible_become_pass: "3131"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
tailscale_ip: "100.100.7.4"
browser:
ansible_host: browser.tailnet-68f9.ts.net
ansible_user: ben
ansible_ssh_pass: "3131"
ansible_become_pass: "3131"
ansible_ssh_common_args: '-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
tailscale_ip: "100.116.112.45"
vars:
ansible_python_interpreter: /usr/bin/python3


@ -1,23 +1,23 @@
# Nomad server node hardened configuration template
# Nomad server hardened configuration - onecloud1 node
datacenter = "dc1"
data_dir = "/opt/nomad/data"
plugin_dir = "/opt/nomad/plugins"
log_level = "INFO"
name = "ash2e"
name = "onecloud1"
# Secure binding - bind only to the Tailscale interface
bind_addr = "ash2e.tailnet-68f9.ts.net"
bind_addr = "onecloud1.tailnet-68f9.ts.net"
addresses {
http = "ash2e.tailnet-68f9.ts.net"
rpc = "ash2e.tailnet-68f9.ts.net"
serf = "ash2e.tailnet-68f9.ts.net"
http = "onecloud1.tailnet-68f9.ts.net"
rpc = "onecloud1.tailnet-68f9.ts.net"
serf = "onecloud1.tailnet-68f9.ts.net"
}
advertise {
http = "ash2e.tailnet-68f9.ts.net:4646"
rpc = "ash2e.tailnet-68f9.ts.net:4647"
serf = "ash2e.tailnet-68f9.ts.net:4648"
http = "onecloud1.tailnet-68f9.ts.net:4646"
rpc = "onecloud1.tailnet-68f9.ts.net:4647"
serf = "onecloud1.tailnet-68f9.ts.net:4648"
}
ports {
@ -28,8 +28,9 @@ ports {
server {
enabled = true
bootstrap_expect = 7
# Server discovery configuration ("seven fairies" server set)
# Server discovery configuration
server_join {
retry_join = [
"semaphore.tailnet-68f9.ts.net:4647",
@ -40,10 +41,12 @@ server {
"onecloud1.tailnet-68f9.ts.net:4647",
"de.tailnet-68f9.ts.net:4647"
]
retry_interval = "15s"
retry_max = 3
}
}
# Secure Consul configuration - points at the local client
# Secure Consul configuration
consul {
address = "127.0.0.1:8500"
server_service_name = "nomad"
@ -53,9 +56,9 @@ consul {
client_auto_join = true
}
# Secure Vault configuration - points at the local agent
# Vault configuration (temporarily disabled)
vault {
enabled = false # temporarily disabled until the Vault cluster is deployed
enabled = false
}
# Telemetry configuration


@ -1,30 +0,0 @@
# ash2e
data "oci_core_boot_volumes" "ash2e_boot_volumes" {
provider = oci.us
compartment_id = data.consul_keys.oracle_config_us.var.tenancy_ocid
availability_domain = "TZXJ:US-ASHBURN-AD-1"
filter {
name = "display_name"
values = ["ash2e"]
}
}
# ash2e
data "oci_core_instances" "us_instances" {
provider = oci.us
compartment_id = data.consul_keys.oracle_config_us.var.tenancy_ocid
availability_domain = "TZXJ:US-ASHBURN-AD-1"
filter {
name = "display_name"
values = ["ash2e"]
}
}
output "ash2e_disk_status" {
value = {
boot_volumes = data.oci_core_boot_volumes.ash2e_boot_volumes.boot_volumes
instances = data.oci_core_instances.us_instances.instances
}
}


@ -1,29 +0,0 @@
# Debian
data "oci_core_images" "us_debian_images" {
provider = oci.us
compartment_id = data.consul_keys.oracle_config_us.var.tenancy_ocid
# Debian
filter {
name = "operating_system"
values = ["Debian"]
}
#
sort_by = "TIMECREATED"
sort_order = "DESC"
}
output "debian_images" {
value = {
debian_images = [
for img in data.oci_core_images.us_debian_images.images : {
display_name = img.display_name
operating_system = img.operating_system
operating_system_version = img.operating_system_version
id = img.id
time_created = img.time_created
}
]
}
}


@ -1,55 +0,0 @@
#
data "oci_core_instance" "ash1d" {
provider = oci.us
instance_id = "ocid1.instance.oc1.iad.anuwcljtkbqyulqcr3ekof6jr5mnmja2gl7vfmwf6s4nnsch6t5osfhwhhfq"
}
data "oci_core_instance" "ash3c" {
provider = oci.us
instance_id = "ocid1.instance.oc1.iad.anuwcljtkbqyulqczicblxqyu3nxtqv2dqfpaitqgffbrmb7ztu3xiuefhxq"
}
# VNIC
data "oci_core_vnic_attachments" "ash1d_vnics" {
provider = oci.us
compartment_id = data.consul_keys.oracle_config_us.var.tenancy_ocid
instance_id = data.oci_core_instance.ash1d.id
}
data "oci_core_vnic_attachments" "ash3c_vnics" {
provider = oci.us
compartment_id = data.consul_keys.oracle_config_us.var.tenancy_ocid
instance_id = data.oci_core_instance.ash3c.id
}
# VNIC
data "oci_core_vnic" "ash1d_vnic" {
provider = oci.us
vnic_id = data.oci_core_vnic_attachments.ash1d_vnics.vnic_attachments[0].vnic_id
}
data "oci_core_vnic" "ash3c_vnic" {
provider = oci.us
vnic_id = data.oci_core_vnic_attachments.ash3c_vnics.vnic_attachments[0].vnic_id
}
output "existing_instances_info" {
value = {
ash1d = {
id = data.oci_core_instance.ash1d.id
display_name = data.oci_core_instance.ash1d.display_name
public_ip = data.oci_core_instance.ash1d.public_ip
private_ip = data.oci_core_instance.ash1d.private_ip
subnet_id = data.oci_core_instance.ash1d.subnet_id
ipv6addresses = data.oci_core_vnic.ash1d_vnic.ipv6addresses
}
ash3c = {
id = data.oci_core_instance.ash3c.id
display_name = data.oci_core_instance.ash3c.display_name
public_ip = data.oci_core_instance.ash3c.public_ip
private_ip = data.oci_core_instance.ash3c.private_ip
subnet_id = data.oci_core_instance.ash3c.subnet_id
ipv6addresses = data.oci_core_vnic.ash3c_vnic.ipv6addresses
}
}
}


@ -1,38 +0,0 @@
#
data "oci_core_images" "us_images" {
provider = oci.us
compartment_id = data.consul_keys.oracle_config_us.var.tenancy_ocid
#
filter {
name = "operating_system"
values = ["Canonical Ubuntu", "Oracle Linux"]
}
#
sort_by = "TIMECREATED"
sort_order = "DESC"
}
output "available_os_images" {
value = {
ubuntu_images = [
for img in data.oci_core_images.us_images.images : {
display_name = img.display_name
operating_system = img.operating_system
operating_system_version = img.operating_system_version
id = img.id
time_created = img.time_created
} if img.operating_system == "Canonical Ubuntu"
]
oracle_linux_images = [
for img in data.oci_core_images.us_images.images : {
display_name = img.display_name
operating_system = img.operating_system
operating_system_version = img.operating_system_version
id = img.id
time_created = img.time_created
} if img.operating_system == "Oracle Linux"
]
}
}


@ -1,20 +0,0 @@
#
data "oci_core_instances" "us_all_instances" {
provider = oci.us
compartment_id = data.consul_keys.oracle_config_us.var.tenancy_ocid
}
output "us_all_instances_summary" {
value = {
total_count = length(data.oci_core_instances.us_all_instances.instances)
instances = [
for instance in data.oci_core_instances.us_all_instances.instances : {
name = instance.display_name
state = instance.state
shape = instance.shape
id = instance.id
}
]
}
}


@ -1,19 +0,0 @@
# Consul Configuration
## Deployment
```bash
nomad job run components/consul/jobs/consul-cluster.nomad
```
## Job Info
- **Job name**: `consul-cluster-nomad`
- **Type**: service
- **Nodes**: master, ash3c, warden
## Access
- Master: `http://master.tailnet-68f9.ts.net:8500`
- Ash3c: `http://ash3c.tailnet-68f9.ts.net:8500`
- Warden: `http://warden.tailnet-68f9.ts.net:8500`
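A quick, read-only way to confirm the cluster has a leader, using the standard Consul status endpoint against any of the addresses above:

```bash
# Prints the address of the current Raft leader (an empty string means no leader)
curl -s "http://master.tailnet-68f9.ts.net:8500/v1/status/leader"
```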


@ -1,88 +0,0 @@
# Consul configuration file
# This file contains the full Consul configuration, including variable and storage related settings
# Base configuration
data_dir = "/opt/consul/data"
raft_dir = "/opt/consul/raft"
# Enable the UI
ui_config {
enabled = true
}
# Datacenter configuration
datacenter = "dc1"
# Server configuration
server = true
bootstrap_expect = 3
# Network configuration
client_addr = "0.0.0.0"
bind_addr = "{{ GetInterfaceIP `eth0` }}"
advertise_addr = "{{ GetInterfaceIP `eth0` }}"
# Port configuration
ports {
dns = 8600
http = 8500
https = -1
grpc = 8502
grpc_tls = 8503
serf_lan = 8301
serf_wan = 8302
server = 8300
}
# Cluster join
retry_join = ["100.117.106.136", "100.116.80.94", "100.122.197.112"]
# Service discovery
enable_service_script = true
enable_script_checks = true
enable_local_script_checks = true
# Performance tuning
performance {
raft_multiplier = 1
}
# Logging configuration
log_level = "INFO"
enable_syslog = false
log_file = "/var/log/consul/consul.log"
# Security configuration
encrypt = "YourEncryptionKeyHere"
# Connection settings
reconnect_timeout = "30s"
reconnect_timeout_wan = "30s"
session_ttl_min = "10s"
# Autopilot configuration
autopilot {
cleanup_dead_servers = true
last_contact_threshold = "200ms"
max_trailing_logs = 250
server_stabilization_time = "10s"
redundancy_zone_tag = ""
disable_upgrade_migration = false
upgrade_version_tag = ""
}
# Snapshot configuration
snapshot {
enabled = true
interval = "24h"
retain = 30
name = "consul-snapshot-{{.Timestamp}}"
}
# Backup configuration
backup {
enabled = true
interval = "6h"
retain = 7
name = "consul-backup-{{.Timestamp}}"
}


@ -1,93 +0,0 @@
# Consul configuration template file
# This file uses Consul Template syntax to pull configuration dynamically from the KV store
# Follows the config/{environment}/{provider}/{region_or_service}/{key} key format
# Base configuration
data_dir = "{{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/cluster/data_dir` `/opt/consul/data` }}"
raft_dir = "{{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/cluster/raft_dir` `/opt/consul/raft` }}"
# Enable the UI
ui_config {
enabled = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/ui/enabled` `true` }}
}
# Datacenter configuration
datacenter = "{{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/cluster/datacenter` `dc1` }}"
# Server configuration
server = true
bootstrap_expect = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/cluster/bootstrap_expect` `3` }}
# Network configuration
client_addr = "{{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/network/client_addr` `0.0.0.0` }}"
bind_addr = "{{ GetInterfaceIP (keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/network/bind_interface` `ens160`) }}"
advertise_addr = "{{ GetInterfaceIP (keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/network/advertise_interface` `ens160`) }}"
# Port configuration
ports {
dns = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/ports/dns` `8600` }}
http = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/ports/http` `8500` }}
https = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/ports/https` `-1` }}
grpc = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/ports/grpc` `8502` }}
grpc_tls = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/ports/grpc_tls` `8503` }}
serf_lan = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/ports/serf_lan` `8301` }}
serf_wan = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/ports/serf_wan` `8302` }}
server = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/ports/server` `8300` }}
}
# Cluster join - node IPs resolved dynamically
retry_join = [
"{{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/nodes/master/ip` `100.117.106.136` }}",
"{{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/nodes/ash3c/ip` `100.116.80.94` }}",
"{{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/nodes/warden/ip` `100.122.197.112` }}"
]
# Service discovery
enable_service_script = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/service/enable_service_script` `true` }}
enable_script_checks = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/service/enable_script_checks` `true` }}
enable_local_script_checks = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/service/enable_local_script_checks` `true` }}
# Performance tuning
performance {
raft_multiplier = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/performance/raft_multiplier` `1` }}
}
# Logging configuration
log_level = "{{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/cluster/log_level` `INFO` }}"
enable_syslog = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/log/enable_syslog` `false` }}
log_file = "{{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/log/log_file` `/var/log/consul/consul.log` }}"
# Security configuration
encrypt = "{{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/cluster/encrypt_key` `YourEncryptionKeyHere` }}"
# Connection settings
reconnect_timeout = "{{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/connection/reconnect_timeout` `30s` }}"
reconnect_timeout_wan = "{{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/connection/reconnect_timeout_wan` `30s` }}"
session_ttl_min = "{{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/connection/session_ttl_min` `10s` }}"
# Autopilot configuration
autopilot {
cleanup_dead_servers = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/autopilot/cleanup_dead_servers` `true` }}
last_contact_threshold = "{{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/autopilot/last_contact_threshold` `200ms` }}"
max_trailing_logs = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/autopilot/max_trailing_logs` `250` }}
server_stabilization_time = "{{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/autopilot/server_stabilization_time` `10s` }}"
redundancy_zone_tag = ""
disable_upgrade_migration = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/autopilot/disable_upgrade_migration` `false` }}
upgrade_version_tag = ""
}
# Snapshot configuration
snapshot {
enabled = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/snapshot/enabled` `true` }}
interval = "{{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/snapshot/interval` `24h` }}"
retain = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/snapshot/retain` `30` }}
name = "{{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/snapshot/name` `consul-snapshot-{{.Timestamp}}` }}"
}
# Backup configuration
backup {
enabled = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/backup/enabled` `true` }}
interval = "{{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/backup/interval` `6h` }}"
retain = {{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/backup/retain` `7` }}
name = "{{ keyOrDefault `config/` + env "ENVIRONMENT" + `/consul/backup/name` `consul-backup-{{.Timestamp}}` }}"
}


@ -1,158 +0,0 @@
job "consul-cluster-nomad" {
datacenters = ["dc1"]
type = "service"
group "consul-ch4" {
constraint {
attribute = "${node.unique.name}"
value = "ch4"
}
network {
port "http" {
static = 8500
}
port "server" {
static = 8300
}
port "serf-lan" {
static = 8301
}
port "serf-wan" {
static = 8302
}
}
task "consul" {
driver = "exec"
config {
command = "consul"
args = [
"agent",
"-server",
"-bootstrap-expect=3",
"-data-dir=/opt/nomad/data/consul",
"-client=0.0.0.0",
"-bind={{ env \"NOMAD_IP_http\" }}",
"-advertise={{ env \"NOMAD_IP_http\" }}",
"-retry-join=ash3c.tailnet-68f9.ts.net:8301",
"-retry-join=warden.tailnet-68f9.ts.net:8301",
"-ui",
"-http-port=8500",
"-server-port=8300",
"-serf-lan-port=8301",
"-serf-wan-port=8302"
]
}
resources {
cpu = 300
memory = 512
}
}
}
group "consul-ash3c" {
constraint {
attribute = "${node.unique.name}"
value = "ash3c"
}
network {
port "http" {
static = 8500
}
port "server" {
static = 8300
}
port "serf-lan" {
static = 8301
}
port "serf-wan" {
static = 8302
}
}
task "consul" {
driver = "exec"
config {
command = "consul"
args = [
"agent",
"-server",
"-data-dir=/opt/nomad/data/consul",
"-client=0.0.0.0",
"-bind={{ env \"NOMAD_IP_http\" }}",
"-advertise={{ env \"NOMAD_IP_http\" }}",
"-retry-join=ch4.tailnet-68f9.ts.net:8301",
"-retry-join=warden.tailnet-68f9.ts.net:8301",
"-ui",
"-http-port=8500",
"-server-port=8300",
"-serf-lan-port=8301",
"-serf-wan-port=8302"
]
}
resources {
cpu = 300
memory = 512
}
}
}
group "consul-warden" {
constraint {
attribute = "${node.unique.name}"
value = "warden"
}
network {
port "http" {
static = 8500
}
port "server" {
static = 8300
}
port "serf-lan" {
static = 8301
}
port "serf-wan" {
static = 8302
}
}
task "consul" {
driver = "exec"
config {
command = "consul"
args = [
"agent",
"-server",
"-data-dir=/opt/nomad/data/consul",
"-client=0.0.0.0",
"-bind={{ env \"NOMAD_IP_http\" }}",
"-advertise={{ env \"NOMAD_IP_http\" }}",
"-retry-join=ch4.tailnet-68f9.ts.net:8301",
"-retry-join=ash3c.tailnet-68f9.ts.net:8301",
"-ui",
"-http-port=8500",
"-server-port=8300",
"-serf-lan-port=8301",
"-serf-wan-port=8302"
]
}
resources {
cpu = 300
memory = 512
}
}
}
}


@ -1,8 +0,0 @@
# Nomad Configuration
## Jobs
- `install-podman-driver.nomad` - installs the Podman driver
- `nomad-consul-config.nomad` - Nomad-Consul configuration
- `nomad-consul-setup.nomad` - Nomad-Consul setup
- `nomad-nfs-volume.nomad` - NFS volume configuration


@ -1,43 +0,0 @@
job "juicefs-controller" {
datacenters = ["dc1"]
type = "system"
group "controller" {
task "plugin" {
driver = "podman"
config {
image = "juicedata/juicefs-csi-driver:v0.14.1"
args = [
"--endpoint=unix://csi/csi.sock",
"--logtostderr",
"--nodeid=${node.unique.id}",
"--v=5",
"--by-process=true"
]
privileged = true
}
csi_plugin {
id = "juicefs-nfs"
type = "controller"
mount_dir = "/csi"
}
resources {
cpu = 100
memory = 512
}
env {
POD_NAME = "csi-controller"
}
}
}
}


@ -1,38 +0,0 @@
job "juicefs-csi-controller" {
datacenters = ["dc1"]
type = "system"
group "controller" {
task "juicefs-csi-driver" {
driver = "podman"
config {
image = "juicedata/juicefs-csi-driver:v0.14.1"
args = [
"--endpoint=unix://csi/csi.sock",
"--logtostderr",
"--nodeid=${node.unique.id}",
"--v=5"
]
privileged = true
}
env {
POD_NAME = "juicefs-csi-controller"
POD_NAMESPACE = "default"
NODE_NAME = "${node.unique.id}"
}
csi_plugin {
id = "juicefs0"
type = "controller"
mount_dir = "/csi"
}
resources {
cpu = 100
memory = 512
}
}
}
}


@ -1,43 +0,0 @@
# NFS CSI Volume Definition for Nomad
# This file defines a CSI volume so the NFS storage shows up in the Nomad UI
volume "nfs-shared-csi" {
type = "csi"
# CSI plugin name
source = "csi-nfs"
# Capacity settings
capacity_min = "1GiB"
capacity_max = "10TiB"
# Access mode - multi-node read/write
access_mode = "multi-node-multi-writer"
# Mount options
mount_options {
fs_type = "nfs4"
mount_flags = "rw,relatime,vers=4.2"
}
# Topology constraint - run only on nodes that have the NFS mount
topology_request {
required {
topology {
"node" = "{{ range $node := nomadNodes }}{{ if eq $node.Status "ready" }}{{ $node.Name }}{{ end }}{{ end }}"
}
}
}
# Volume parameters
parameters {
server = "snail"
share = "/fs/1000/nfs/Fnsync"
}
}


@ -1,22 +0,0 @@
# Dynamic Host Volume Definition for NFS
# This file defines a dynamic host volume so the NFS storage shows up in the Nomad UI
volume "nfs-shared-dynamic" {
type = "host"
# Use a dynamic host volume
source = "fnsync"
# Read-only setting
read_only = false
# Capacity information, for display purposes
capacity_min = "1GiB"
capacity_max = "10TiB"
}


@ -1,22 +0,0 @@
# NFS Host Volume Definition for Nomad UI
# This file defines a host volume so the NFS storage shows up in the Nomad UI
volume "nfs-shared-host" {
type = "host"
# Use a host volume
source = "fnsync"
# Read-only setting
read_only = false
# Capacity information, for display purposes
capacity_min = "1GiB"
capacity_max = "10TiB"
}


@ -1,28 +0,0 @@
# Traefik Configuration
## Deployment
```bash
nomad job run components/traefik/jobs/traefik.nomad
```
## Configuration Highlights
- Binds explicitly to the Tailscale IP (100.97.62.111)
- Consul cluster order optimized by geography (Beijing → Korea → US)
- Relaxed health checks suited to trans-Pacific links
- No service health checks, to avoid flapping
## Access
- Dashboard: `http://hcp1.tailnet-68f9.ts.net:8080/dashboard/`
- Direct IP: `http://100.97.62.111:8080/dashboard/`
- Consul LB: `http://hcp1.tailnet-68f9.ts.net:80`
## Troubleshooting
If you run into service flapping (a command sketch follows the list):
1. Check whether RFC1918 private addresses are being used
2. Confirm Tailscale network connectivity
3. Increase the health-check interval
4. Account for the network latency implied by geography
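A rough sketch of commands for the checks above (hostnames follow the access URLs in this file; adjust to your topology):

```bash
# Tailscale reachability of the Traefik node
tailscale ping hcp1.tailnet-68f9.ts.net

# Dashboard reachability and the Consul health-check path used by the load balancer
curl -sI "http://hcp1.tailnet-68f9.ts.net:8080/dashboard/"
curl -s "http://ch4.tailnet-68f9.ts.net:8500/v1/status/leader"
```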


@ -1,105 +0,0 @@
http:
serversTransports:
authentik-insecure:
insecureSkipVerify: true
middlewares:
consul-stripprefix:
stripPrefix:
prefixes:
- "/consul"
services:
consul-cluster:
loadBalancer:
servers:
- url: "http://ch4.tailnet-68f9.ts.net:8500" # Korea, leader
- url: "http://warden.tailnet-68f9.ts.net:8500" # Beijing, follower
- url: "http://ash3c.tailnet-68f9.ts.net:8500" # US, follower
healthCheck:
path: "/v1/status/leader"
interval: "30s"
timeout: "15s"
nomad-cluster:
loadBalancer:
servers:
- url: "http://ch2.tailnet-68f9.ts.net:4646" # Korea, leader
- url: "http://warden.tailnet-68f9.ts.net:4646" # Beijing, follower
- url: "http://ash3c.tailnet-68f9.ts.net:4646" # US, follower
healthCheck:
path: "/v1/status/leader"
interval: "30s"
timeout: "15s"
vault-cluster:
loadBalancer:
servers:
- url: "http://warden.tailnet-68f9.ts.net:8200" # Beijing, single node
healthCheck:
path: "/ui/"
interval: "30s"
timeout: "15s"
authentik-cluster:
loadBalancer:
servers:
- url: "https://authentik.tailnet-68f9.ts.net:9443" # Authentik container HTTPS port
serversTransport: authentik-insecure
healthCheck:
path: "/flows/-/default/authentication/"
interval: "30s"
timeout: "15s"
routers:
consul-api:
rule: "Host(`consul.git4ta.tech`)"
service: consul-cluster
entryPoints:
- websecure
tls:
certResolver: cloudflare
middlewares:
- consul-stripprefix
consul-ui:
rule: "Host(`consul.git-4ta.live`) && PathPrefix(`/ui`)"
service: consul-cluster
entryPoints:
- websecure
tls:
certResolver: cloudflare
nomad-api:
rule: "Host(`nomad.git-4ta.live`)"
service: nomad-cluster
entryPoints:
- websecure
tls:
certResolver: cloudflare
nomad-ui:
rule: "Host(`nomad.git-4ta.live`) && PathPrefix(`/ui`)"
service: nomad-cluster
entryPoints:
- websecure
tls:
certResolver: cloudflare
vault-ui:
rule: "Host(`vault.git-4ta.live`)"
service: vault-cluster
entryPoints:
- websecure
tls:
certResolver: cloudflare
authentik-ui:
rule: "Host(`authentik1.git-4ta.live`)"
service: authentik-cluster
entryPoints:
- websecure
tls:
certResolver: cloudflare


@ -1,254 +0,0 @@
job "traefik-cloudflare-v2" {
datacenters = ["dc1"]
type = "service"
group "traefik" {
count = 1
constraint {
attribute = "${node.unique.name}"
operator = "="
value = "hcp1"
}
volume "traefik-certs" {
type = "host"
read_only = false
source = "traefik-certs"
}
network {
mode = "host"
port "http" {
static = 80
}
port "https" {
static = 443
}
port "traefik" {
static = 8080
}
}
task "traefik" {
driver = "exec"
config {
command = "/usr/local/bin/traefik"
args = [
"--configfile=/local/traefik.yml"
]
}
env {
CLOUDFLARE_EMAIL = "houzhongxu.houzhongxu@gmail.com"
CLOUDFLARE_DNS_API_TOKEN = "HYT-cfZTP_jq6Xd9g3tpFMwxopOyIrf8LZpmGAI3"
CLOUDFLARE_ZONE_API_TOKEN = "HYT-cfZTP_jq6Xd9g3tpFMwxopOyIrf8LZpmGAI3"
}
volume_mount {
volume = "traefik-certs"
destination = "/opt/traefik/certs"
read_only = false
}
template {
data = <<EOF
api:
dashboard: true
insecure: true
debug: true
entryPoints:
web:
address: "0.0.0.0:80"
http:
redirections:
entrypoint:
to: websecure
scheme: https
permanent: true
websecure:
address: "0.0.0.0:443"
traefik:
address: "0.0.0.0:8080"
providers:
consulCatalog:
endpoint:
address: "warden.tailnet-68f9.ts.net:8500"
scheme: "http"
watch: true
exposedByDefault: false
prefix: "traefik"
defaultRule: "Host(`{{ .Name }}.git-4ta.live`)"
file:
filename: /local/dynamic.yml
watch: true
certificatesResolvers:
cloudflare:
acme:
email: {{ env "CLOUDFLARE_EMAIL" }}
storage: /opt/traefik/certs/acme.json
dnsChallenge:
provider: cloudflare
delayBeforeCheck: 30s
resolvers:
- "1.1.1.1:53"
- "1.0.0.1:53"
log:
level: DEBUG
EOF
destination = "local/traefik.yml"
}
template {
data = <<EOF
http:
serversTransports:
waypoint-insecure:
insecureSkipVerify: true
authentik-insecure:
insecureSkipVerify: true
middlewares:
consul-stripprefix:
stripPrefix:
prefixes:
- "/consul"
waypoint-auth:
replacePathRegex:
regex: "^/auth/token(.*)$"
replacement: "/auth/token$1"
services:
consul-cluster:
loadBalancer:
servers:
- url: "http://ch4.tailnet-68f9.ts.net:8500" # Korea, leader
- url: "http://warden.tailnet-68f9.ts.net:8500" # Beijing, follower
- url: "http://ash3c.tailnet-68f9.ts.net:8500" # US, follower
healthCheck:
path: "/v1/status/leader"
interval: "30s"
timeout: "15s"
nomad-cluster:
loadBalancer:
servers:
- url: "http://ch2.tailnet-68f9.ts.net:4646" # Korea, leader
- url: "http://warden.tailnet-68f9.ts.net:4646" # Beijing, follower
- url: "http://ash3c.tailnet-68f9.ts.net:4646" # US, follower
healthCheck:
path: "/v1/status/leader"
interval: "30s"
timeout: "15s"
waypoint-cluster:
loadBalancer:
servers:
- url: "https://hcp1.tailnet-68f9.ts.net:9701" # hcp1 node HTTPS API
serversTransport: waypoint-insecure
vault-cluster:
loadBalancer:
servers:
- url: "http://warden.tailnet-68f9.ts.net:8200" # Beijing, single node
healthCheck:
path: "/ui/"
interval: "30s"
timeout: "15s"
authentik-cluster:
loadBalancer:
servers:
- url: "https://authentik.tailnet-68f9.ts.net:9443" # Authentik container HTTPS port
serversTransport: authentik-insecure
healthCheck:
path: "/flows/-/default/authentication/"
interval: "30s"
timeout: "15s"
routers:
consul-api:
rule: "Host(`consul.git-4ta.live`)"
service: consul-cluster
middlewares:
- consul-stripprefix
entryPoints:
- websecure
tls:
certResolver: cloudflare
traefik-dashboard:
rule: "Host(`traefik.git-4ta.live`)"
service: dashboard@internal
middlewares:
- dashboard_redirect@internal
- dashboard_stripprefix@internal
entryPoints:
- websecure
tls:
certResolver: cloudflare
nomad-ui:
rule: "Host(`nomad.git-4ta.live`)"
service: nomad-cluster
entryPoints:
- websecure
tls:
certResolver: cloudflare
waypoint-ui:
rule: "Host(`waypoint.git-4ta.live`)"
service: waypoint-cluster
entryPoints:
- websecure
tls:
certResolver: cloudflare
vault-ui:
rule: "Host(`vault.git-4ta.live`)"
service: vault-cluster
entryPoints:
- websecure
tls:
certResolver: cloudflare
authentik-ui:
rule: "Host(`authentik.git-4ta.live`)"
service: authentik-cluster
entryPoints:
- websecure
tls:
certResolver: cloudflare
EOF
destination = "local/dynamic.yml"
}
template {
data = <<EOF
CLOUDFLARE_EMAIL={{ env "CLOUDFLARE_EMAIL" }}
CLOUDFLARE_DNS_API_TOKEN={{ env "CLOUDFLARE_DNS_API_TOKEN" }}
CLOUDFLARE_ZONE_API_TOKEN={{ env "CLOUDFLARE_ZONE_API_TOKEN" }}
EOF
destination = "local/cloudflare.env"
env = true
}
# Test certificate for verifying permission control
template {
data = "-----BEGIN CERTIFICATE-----\nTEST CERTIFICATE FOR PERMISSION CONTROL\n-----END CERTIFICATE-----"
destination = "/opt/traefik/certs/test-cert.pem"
perms = 600
}
resources {
cpu = 500
memory = 512
}
}
}
}


@ -1,239 +0,0 @@
job "traefik-cloudflare-v2" {
datacenters = ["dc1"]
type = "service"
group "traefik" {
count = 1
constraint {
attribute = "${node.unique.name}"
value = "hcp1"
}
volume "traefik-certs" {
type = "host"
read_only = false
source = "traefik-certs"
}
network {
mode = "host"
port "http" {
static = 80
}
port "https" {
static = 443
}
port "traefik" {
static = 8080
}
}
task "traefik" {
driver = "exec"
config {
command = "/usr/local/bin/traefik"
args = [
"--configfile=/local/traefik.yml"
]
}
volume_mount {
volume = "traefik-certs"
destination = "/opt/traefik/certs"
read_only = false
}
template {
data = <<EOF
api:
dashboard: true
insecure: true
entryPoints:
web:
address: "0.0.0.0:80"
http:
redirections:
entrypoint:
to: websecure
scheme: https
permanent: true
websecure:
address: "0.0.0.0:443"
traefik:
address: "0.0.0.0:8080"
providers:
consulCatalog:
endpoint:
address: "warden.tailnet-68f9.ts.net:8500"
scheme: "http"
watch: true
exposedByDefault: false
prefix: "traefik"
defaultRule: "Host(`{{ .Name }}.git-4ta.live`)"
file:
filename: /local/dynamic.yml
watch: true
certificatesResolvers:
cloudflare:
acme:
email: houzhongxu.houzhongxu@gmail.com
storage: /opt/traefik/certs/acme.json
dnsChallenge:
provider: cloudflare
delayBeforeCheck: 30s
resolvers:
- "1.1.1.1:53"
- "1.0.0.1:53"
log:
level: DEBUG
EOF
destination = "local/traefik.yml"
}
template {
data = <<EOF
http:
serversTransports:
waypoint-insecure:
insecureSkipVerify: true
authentik-insecure:
insecureSkipVerify: true
middlewares:
consul-stripprefix:
stripPrefix:
prefixes:
- "/consul"
waypoint-auth:
replacePathRegex:
regex: "^/auth/token(.*)$"
replacement: "/auth/token$1"
services:
consul-cluster:
loadBalancer:
servers:
- url: "http://ch4.tailnet-68f9.ts.net:8500" # Korea, leader
- url: "http://warden.tailnet-68f9.ts.net:8500" # Beijing, follower
- url: "http://ash3c.tailnet-68f9.ts.net:8500" # US, follower
healthCheck:
path: "/v1/status/leader"
interval: "30s"
timeout: "15s"
nomad-cluster:
loadBalancer:
servers:
- url: "http://ch2.tailnet-68f9.ts.net:4646" # Korea, leader
- url: "http://warden.tailnet-68f9.ts.net:4646" # Beijing, follower
- url: "http://ash3c.tailnet-68f9.ts.net:4646" # US, follower
healthCheck:
path: "/v1/status/leader"
interval: "30s"
timeout: "15s"
waypoint-cluster:
loadBalancer:
servers:
- url: "https://hcp1.tailnet-68f9.ts.net:9701" # hcp1 node HTTPS API
serversTransport: waypoint-insecure
vault-cluster:
loadBalancer:
servers:
- url: "http://warden.tailnet-68f9.ts.net:8200" # Beijing, single node
healthCheck:
path: "/ui/"
interval: "30s"
timeout: "15s"
authentik-cluster:
loadBalancer:
servers:
- url: "https://authentik.tailnet-68f9.ts.net:9443" # Authentik container HTTPS port
serversTransport: authentik-insecure
healthCheck:
path: "/flows/-/default/authentication/"
interval: "30s"
timeout: "15s"
routers:
consul-api:
rule: "Host(`consul.git-4ta.live`)"
service: consul-cluster
middlewares:
- consul-stripprefix
entryPoints:
- websecure
tls:
certResolver: cloudflare
traefik-dashboard:
rule: "Host(`traefik.git-4ta.live`)"
service: dashboard@internal
middlewares:
- dashboard_redirect@internal
- dashboard_stripprefix@internal
entryPoints:
- websecure
tls:
certResolver: cloudflare
nomad-ui:
rule: "Host(`nomad.git-4ta.live`)"
service: nomad-cluster
entryPoints:
- websecure
tls:
certResolver: cloudflare
waypoint-ui:
rule: "Host(`waypoint.git-4ta.live`)"
service: waypoint-cluster
entryPoints:
- websecure
tls:
certResolver: cloudflare
vault-ui:
rule: "Host(`vault.git-4ta.live`)"
service: vault-cluster
entryPoints:
- websecure
tls:
certResolver: cloudflare
authentik-ui:
rule: "Host(`authentik.git4ta.tech`)"
service: authentik-cluster
entryPoints:
- websecure
tls:
certResolver: cloudflare
EOF
destination = "local/dynamic.yml"
}
template {
data = <<EOF
CLOUDFLARE_EMAIL=houzhongxu.houzhongxu@gmail.com
CLOUDFLARE_DNS_API_TOKEN=0aPWoLaQ59l0nyL1jIVzZaEx2e41Gjgcfhn3ztJr
CLOUDFLARE_ZONE_API_TOKEN=0aPWoLaQ59l0nyL1jIVzZaEx2e41Gjgcfhn3ztJr
EOF
destination = "local/cloudflare.env"
env = true
}
resources {
cpu = 500
memory = 512
}
}
}
}


@ -1,7 +0,0 @@
# Vault Configuration
## Jobs
- `vault-cluster-exec.nomad` - Vault cluster (exec driver)
- `vault-cluster-podman.nomad` - Vault cluster (podman driver)
- `vault-dev-warden.nomad` - Vault development environment


@ -1,22 +0,0 @@
job "consul-kv-simple-test" {
datacenters = ["dc1"]
type = "batch"
group "test" {
count = 1
task "consul-test" {
driver = "exec"
config {
command = "/bin/sh"
args = ["-c", "curl -s http://ch4.tailnet-68f9.ts.net:8500/v1/kv/config/dev/cloudflare/token | jq -r '.[0].Value' | base64 -d"]
}
resources {
cpu = 100
memory = 128
}
}
}
}


@ -1,105 +0,0 @@
# ash2e
resource "oci_core_instance" "ash2e" {
provider = oci.us
#
compartment_id = data.consul_keys.oracle_config_us.var.tenancy_ocid
availability_domain = "TZXJ:US-ASHBURN-AD-1"
shape = "VM.Standard.E2.1.Micro"
display_name = "ash2e"
# Use Ubuntu 24.04 LTS
source_details {
source_type = "image"
source_id = "ocid1.image.oc1.iad.aaaaaaaahmozwney6aptbe6dgdh3iledjxr2v6q74fjpatgnwiekedftmm2q" # Ubuntu 24.04 LTS
boot_volume_size_in_gbs = 50
boot_volume_vpus_per_gb = 10
}
# VNIC - public IP and IPv6
create_vnic_details {
assign_public_ip = true
assign_ipv6ip = true # IPv6 Oracle
hostname_label = "ash2e"
subnet_id = "ocid1.subnet.oc1.iad.aaaaaaaapkx25eckkl3dps67o35iprz2gkqjd5bo3rc4rxf4si5hyj2ocara" # 使 ash1d
}
# SSH - 使
metadata = {
ssh_authorized_keys = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMSUUfma8FKEFvH8Nq65XM2PZ9kitfgv1q727cKV9y5Z houzhongxu@seekkey.tech"
user_data = base64encode(<<-EOF
#!/bin/bash
# Create user ben
useradd -m -s /bin/bash ben
usermod -aG sudo ben
# Set up SSH access for ben
mkdir -p /home/ben/.ssh
echo "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMSUUfma8FKEFvH8Nq65XM2PZ9kitfgv1q727cKV9y5Z houzhongxu@seekkey.tech" >> /home/ben/.ssh/authorized_keys
chown -R ben:ben /home/ben/.ssh
chmod 700 /home/ben/.ssh
chmod 600 /home/ben/.ssh/authorized_keys
# Update the system
apt update && apt upgrade -y
# Install basic tools
apt install -y curl wget git vim htop
# Set the hostname
hostnamectl set-hostname ash2e
# Restart networking so the IPv6 address comes up
systemctl restart networking
EOF
)
}
# Lifecycle: allow destroy and ignore drift on mutable attributes
lifecycle {
prevent_destroy = false
ignore_changes = [
source_details,
metadata,
create_vnic_details,
time_created
]
}
}
# Existing subnets in the US compartment
data "oci_core_subnets" "us_subnets" {
provider = oci.us
compartment_id = data.consul_keys.oracle_config_us.var.tenancy_ocid
vcn_id = data.oci_core_vcns.us_vcns.virtual_networks[0].id
}
# Existing VCNs in the US compartment
data "oci_core_vcns" "us_vcns" {
provider = oci.us
compartment_id = data.consul_keys.oracle_config_us.var.tenancy_ocid
}
output "ash2e_instance_info" {
value = {
id = oci_core_instance.ash2e.id
public_ip = oci_core_instance.ash2e.public_ip
private_ip = oci_core_instance.ash2e.private_ip
state = oci_core_instance.ash2e.state
display_name = oci_core_instance.ash2e.display_name
}
}
output "us_subnets_info" {
value = {
subnets = [
for subnet in data.oci_core_subnets.us_subnets.subnets : {
id = subnet.id
display_name = subnet.display_name
cidr_block = subnet.cidr_block
availability_domain = subnet.availability_domain
}
]
}
}
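A minimal OpenTofu workflow for iterating on just this instance, assuming the resource is part of the dev environment configuration used by the Makefile below and the OCI provider aliases and Consul-backed data sources are already initialised:

```bash
cd infrastructure/environments/dev   # path assumed from the project Makefile
tofu init
tofu plan  -target=oci_core_instance.ash2e
tofu apply -target=oci_core_instance.ash2e

# Inspect the instance info exposed by the output block above.
tofu output ash2e_instance_info
```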

View File

@ -1,104 +0,0 @@
# Project management Makefile
.PHONY: help setup init plan apply destroy clean test lint docs
# Default target
help: ## Show available commands
@echo "Available commands:"
@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-20s\033[0m %s\n", $$1, $$2}'
# Environment setup
setup: ## Set up the development environment
@echo "🚀 Setting up the development environment..."
@bash scripts/setup/environment/setup-environment.sh
# OpenTofu operations
init: ## Initialize OpenTofu
@echo "🏗️ Initializing OpenTofu..."
@cd infrastructure/environments/dev && tofu init
plan: ## Generate an execution plan
@echo "📋 Generating execution plan..."
@cd infrastructure/environments/dev && tofu plan -var-file="terraform.tfvars"
apply: ## Apply infrastructure changes
@echo "🚀 Applying infrastructure changes..."
@cd infrastructure/environments/dev && tofu apply -var-file="terraform.tfvars"
destroy: ## Destroy the infrastructure
@echo "💥 Destroying infrastructure..."
@cd infrastructure/environments/dev && tofu destroy -var-file="terraform.tfvars"
# Ansible operations
ansible-check: ## Check the Ansible configuration
@echo "🔍 Checking Ansible configuration..."
@cd configuration && ansible-playbook --syntax-check playbooks/bootstrap/main.yml
ansible-deploy: ## Deploy applications
@echo "📦 Deploying applications..."
@cd configuration && ansible-playbook -i inventories/production/inventory.ini playbooks/bootstrap/main.yml
# Podman operations
podman-build: ## Build Podman images
@echo "📦 Building Podman images..."
@podman-compose -f containers/compose/development/docker-compose.yml build
podman-up: ## Start the development environment
@echo "🚀 Starting the development environment..."
@podman-compose -f containers/compose/development/docker-compose.yml up -d
podman-down: ## Stop the development environment
@echo "🛑 Stopping the development environment..."
@podman-compose -f containers/compose/development/docker-compose.yml down
# Tests
test: ## Run the test suite
@echo "🧪 Running tests..."
@bash scripts/testing/test-runner.sh
test-mcp: ## Run the MCP server tests
@echo "🧪 Running MCP server tests..."
@bash scripts/testing/mcp/test_local_mcp_servers.sh
test-kali: ## Run the Kali Linux quick health check
@echo "🧪 Running the Kali Linux quick health check..."
@cd configuration && ansible-playbook -i inventories/production/inventory.ini playbooks/test/kali-health-check.yml
test-kali-security: ## Run the Kali Linux security tools tests
@echo "🧪 Running the Kali Linux security tools tests..."
@cd configuration && ansible-playbook -i inventories/production/inventory.ini playbooks/test/kali-security-tools.yml
test-kali-full: ## Run the full Kali Linux test suite
@echo "🧪 Running the full Kali Linux test suite..."
@cd configuration && ansible-playbook playbooks/test/kali-full-test-suite.yml
lint: ## Lint the code
@echo "🔍 Linting..."
@bash scripts/ci-cd/quality/lint.sh
# Documentation
docs: ## Generate documentation
@echo "📚 Generating documentation..."
@bash scripts/ci-cd/build/generate-docs.sh
# Cleanup
clean: ## Clean up temporary files
@echo "🧹 Cleaning up temporary files..."
@find . -name "*.tfstate*" -delete
@find . -name ".terraform" -type d -exec rm -rf {} + 2>/dev/null || true
@podman system prune -f
# Backup
backup: ## Create a backup
@echo "💾 Creating backup..."
@bash scripts/utilities/backup/backup-all.sh
# Monitoring
monitor: ## Start the monitoring stack
@echo "📊 Starting monitoring..."
@podman-compose -f containers/compose/production/monitoring.yml up -d
# Security scan
security-scan: ## Run a security scan
@echo "🔒 Running security scan..."
@bash scripts/ci-cd/quality/security-scan.sh
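Typical usage of the targets above; `make help` prints the annotated target list via the awk one-liner in the `help` recipe:

```bash
make help          # list all targets with their descriptions
make setup         # prepare the development environment
make init plan     # OpenTofu init + plan for infrastructure/environments/dev
make test-kali     # quick Kali health check via Ansible
make clean         # remove tfstate/.terraform and prune Podman
```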

View File

@ -1,20 +0,0 @@
[defaults]
inventory = inventory.ini
host_key_checking = False
forks = 8
timeout = 30
gathering = smart
fact_caching = memory
# Support the new playbooks directory layout
roles_path = playbooks/
collections_path = playbooks/
# Prefer SSH public-key authentication
ansible_ssh_common_args = '-o PreferredAuthentications=publickey -o PubkeyAuthentication=yes'
[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o PreferredAuthentications=publickey -o PubkeyAuthentication=yes
pipelining = True
[inventory]
# Enable plugins to support dynamic inventories
enable_plugins = host_list, script, auto, yaml, ini, toml
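With this `ansible.cfg` in place, a quick way to confirm the inventory and key-based SSH settings work is an ad-hoc ping. A sketch, assuming it is run from the directory containing `inventory.ini`:

```bash
# Uses the inventory, forks and SSH options defined in ansible.cfg.
ansible all -m ping -o

# Confirm which config file and inventory Ansible actually picked up.
ansible --version | grep 'config file'
ansible-inventory --graph
```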

View File

@ -1,57 +0,0 @@
---
- name: Clean up Consul configuration from dedicated clients
hosts: hcp1,influxdb1,browser
become: yes
tasks:
- name: Stop Consul service
systemd:
name: consul
state: stopped
enabled: no
- name: Disable Consul service
systemd:
name: consul
enabled: no
- name: Kill any remaining Consul processes
shell: |
pkill -f consul || true
sleep 2
pkill -9 -f consul || true
ignore_errors: yes
- name: Remove Consul systemd service file
file:
path: /etc/systemd/system/consul.service
state: absent
- name: Remove Consul configuration directory
file:
path: /etc/consul.d
state: absent
- name: Remove Consul data directory
file:
path: /opt/consul
state: absent
- name: Reload systemd daemon
systemd:
daemon_reload: yes
- name: Verify Consul is stopped
shell: |
if pgrep -f consul; then
echo "Consul still running"
exit 1
else
echo "Consul stopped successfully"
fi
register: consul_status
failed_when: consul_status.rc != 0
- name: Display cleanup status
debug:
msg: "Consul cleanup completed on {{ inventory_hostname }}"

View File

@ -1,55 +0,0 @@
---
- name: Configure Consul Auto-Discovery
hosts: all
become: yes
vars:
consul_servers:
- "warden.tailnet-68f9.ts.net:8301"
- "ch4.tailnet-68f9.ts.net:8301"
- "ash3c.tailnet-68f9.ts.net:8301"
tasks:
- name: Backup current nomad.hcl
copy:
src: /etc/nomad.d/nomad.hcl
dest: /etc/nomad.d/nomad.hcl.backup.{{ ansible_date_time.epoch }}
remote_src: yes
backup: yes
- name: Update Consul configuration for auto-discovery
blockinfile:
path: /etc/nomad.d/nomad.hcl
marker: "# {mark} ANSIBLE MANAGED CONSUL CONFIG"
block: |
consul {
retry_join = [
"warden.tailnet-68f9.ts.net:8301",
"ch4.tailnet-68f9.ts.net:8301",
"ash3c.tailnet-68f9.ts.net:8301"
]
server_service_name = "nomad"
client_service_name = "nomad-client"
}
insertbefore: '^consul \{'
- name: Restart Nomad service
systemd:
name: nomad
state: restarted
enabled: yes
- name: Wait for Nomad to be ready
wait_for:
port: 4646
host: "{{ ansible_default_ipv4.address }}"
delay: 5
timeout: 30
- name: Verify Consul connection
shell: |
NOMAD_ADDR=http://localhost:4646 nomad node status | grep -q "ready"
register: nomad_ready
failed_when: nomad_ready.rc != 0
retries: 3
delay: 10

View File

@ -1,75 +0,0 @@
---
- name: Remove Consul configuration from Nomad servers
hosts: semaphore,ash1d,ash2e,ch2,ch3,onecloud1,de
become: yes
tasks:
- name: Remove entire Consul configuration block
blockinfile:
path: /etc/nomad.d/nomad.hcl
marker: "# {mark} ANSIBLE MANAGED CONSUL CONFIG"
state: absent
- name: Remove Consul configuration lines
lineinfile:
path: /etc/nomad.d/nomad.hcl
regexp: '^consul \{'
state: absent
- name: Remove Consul configuration content
lineinfile:
path: /etc/nomad.d/nomad.hcl
regexp: '^ address ='
state: absent
- name: Remove Consul service names
lineinfile:
path: /etc/nomad.d/nomad.hcl
regexp: '^ server_service_name ='
state: absent
- name: Remove Consul client service name
lineinfile:
path: /etc/nomad.d/nomad.hcl
regexp: '^ client_service_name ='
state: absent
- name: Remove Consul auto-advertise
lineinfile:
path: /etc/nomad.d/nomad.hcl
regexp: '^ auto_advertise ='
state: absent
- name: Remove Consul server auto-join
lineinfile:
path: /etc/nomad.d/nomad.hcl
regexp: '^ server_auto_join ='
state: absent
- name: Remove Consul client auto-join
lineinfile:
path: /etc/nomad.d/nomad.hcl
regexp: '^ client_auto_join ='
state: absent
- name: Remove Consul closing brace
lineinfile:
path: /etc/nomad.d/nomad.hcl
regexp: '^}'
state: absent
- name: Restart Nomad service
systemd:
name: nomad
state: restarted
- name: Wait for Nomad to be ready
wait_for:
port: 4646
host: "{{ ansible_default_ipv4.address }}"
delay: 5
timeout: 30
- name: Display completion message
debug:
msg: "Removed Consul configuration from {{ inventory_hostname }}"

View File

@ -1,32 +0,0 @@
---
- name: Enable Nomad Client Mode on Servers
hosts: ch2,ch3,de
become: yes
tasks:
- name: Enable Nomad client mode
lineinfile:
path: /etc/nomad.d/nomad.hcl
regexp: '^client \{'
line: 'client {'
state: present
- name: Enable client mode
lineinfile:
path: /etc/nomad.d/nomad.hcl
regexp: '^ enabled = false'
line: ' enabled = true'
state: present
- name: Restart Nomad service
systemd:
name: nomad
state: restarted
- name: Wait for Nomad to be ready
wait_for:
port: 4646
host: "{{ ansible_default_ipv4.address }}"
delay: 5
timeout: 30

View File

@ -1,38 +0,0 @@
client {
enabled = true
# Addresses of the seven server nodes (the "seven sisters")
servers = [
"100.116.158.95:4647", # bj-semaphore
"100.81.26.3:4647", # ash1d
"100.103.147.94:4647", # ash2e
"100.90.159.68:4647", # ch2
"100.86.141.112:4647", # ch3
"100.98.209.50:4647", # bj-onecloud1
"100.120.225.29:4647" # de
]
host_volume "fnsync" {
path = "/mnt/fnsync"
read_only = false
}
# Docker driver disabled; use Podman only
options {
"driver.raw_exec.enable" = "1"
"driver.exec.enable" = "1"
}
plugin_dir = "/opt/nomad/plugins"
}
# Podman driver configuration
plugin "podman" {
config {
volumes {
enabled = true
}
logging {
type = "journald"
}
gc {
container = true
}
}
}
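After this client config is deployed and Nomad restarted, the two things to verify are that the node reached one of the seven servers and that the Podman plugin loaded from `/opt/nomad/plugins`. A minimal sketch, assuming the local HTTP API is on the default port:

```bash
export NOMAD_ADDR=http://localhost:4646

# The node should report ready and expose the fnsync host volume.
nomad node status -self

# Driver Status should list podman (plus exec/raw_exec from the options block).
nomad node status -self -verbose | grep -A 5 'Driver Status'
```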

View File

@ -1,62 +0,0 @@
---
- name: Fix all master references to ch4
hosts: localhost
gather_facts: no
vars:
files_to_fix:
- "scripts/diagnose-consul-sync.sh"
- "scripts/register-traefik-to-all-consul.sh"
- "deployment/ansible/playbooks/update-nomad-consul-config.yml"
- "deployment/ansible/templates/nomad-server.hcl.j2"
- "deployment/ansible/templates/nomad-client.hcl"
- "deployment/ansible/playbooks/fix-nomad-consul-roles.yml"
- "deployment/ansible/onecloud1_nomad.hcl"
- "ansible/templates/consul-client.hcl.j2"
- "ansible/consul-client-deployment.yml"
- "ansible/consul-client-simple.yml"
tasks:
- name: Replace master.tailnet-68f9.ts.net with ch4.tailnet-68f9.ts.net
replace:
path: "{{ item }}"
regexp: 'master\.tailnet-68f9\.ts\.net'
replace: 'ch4.tailnet-68f9.ts.net'
loop: "{{ files_to_fix }}"
when: item is file
- name: Replace master hostname references
replace:
path: "{{ item }}"
regexp: '\bmaster\b'
replace: 'ch4'
loop: "{{ files_to_fix }}"
when: item is file
- name: Replace master IP references in comments
replace:
path: "{{ item }}"
regexp: '# master'
replace: '# ch4'
loop: "{{ files_to_fix }}"
when: item is file
- name: Fix inventory files
replace:
path: "{{ item }}"
regexp: 'master ansible_host=master'
replace: 'ch4 ansible_host=ch4'
loop:
- "deployment/ansible/inventories/production/inventory.ini"
- "deployment/ansible/inventories/production/csol-consul-nodes.ini"
- "deployment/ansible/inventories/production/nomad-clients.ini"
- "deployment/ansible/inventories/production/master-ash3c.ini"
- "deployment/ansible/inventories/production/consul-nodes.ini"
- "deployment/ansible/inventories/production/vault.ini"
- name: Fix IP address references (100.117.106.136 comments)
replace:
path: "{{ item }}"
regexp: '100\.117\.106\.136.*# master'
replace: '100.117.106.136 # ch4'
loop: "{{ files_to_fix }}"
when: item is file

View File

@ -1,2 +0,0 @@
ansible_ssh_pass: "3131"
ansible_become_pass: "3131"

View File

@ -1,108 +0,0 @@
# CSOL Consul Static Node Configuration Notes
## Overview
This directory contains the Consul static node configuration files for CSOL (Cloud Service Operations Layer). They describe the server and client nodes of the Consul cluster so that team members can quickly understand and use it.
## Configuration Files
### 1. csol-consul-nodes.ini
The main Consul node configuration file, containing detailed information for all server and client nodes.
**File structure:**
- `[consul_servers]` - Consul server nodes (7 nodes)
- `[consul_clients]` - Consul client nodes (2 nodes)
- `[consul_cluster:children]` - combined group of all cluster nodes
- `[consul_servers:vars]` - common settings for server nodes
- `[consul_clients:vars]` - common settings for client nodes
- `[consul_cluster:vars]` - common settings for the whole cluster
**Usage:**
```bash
# Run an Ansible playbook against this inventory
ansible-playbook -i csol-consul-nodes.ini your-playbook.yml
```
### 2. csol-consul-nodes.json
A JSON version of the Consul node configuration, convenient for programmatic access.
**File structure:**
- `servers` - list of server nodes
- `clients` - list of client nodes
- `configuration` - cluster configuration
- `notes` - node statistics and remarks
**Usage:**
```bash
# Query the JSON file with jq
jq '.csol_consul_nodes.servers.nodes[].name' csol-consul-nodes.json
# Process the JSON file with a Python one-liner
python3 -c "import json; data=json.load(open('csol-consul-nodes.json')); print(data['csol_consul_nodes']['servers']['nodes'])"
```
### 3. consul-nodes.ini
An updated Consul node configuration file that replaces the previous version.
### 4. consul-cluster.ini
The configuration file for the Consul cluster server nodes, mainly used for cluster deployment and management.
## Node List
### Server nodes (7)
| Node | IP address | Region | Role |
|---------|--------|------|------|
| ch2 | 100.90.159.68 | Oracle Cloud KR | server |
| ch3 | 100.86.141.112 | Oracle Cloud KR | server |
| ash1d | 100.81.26.3 | Oracle Cloud US | server |
| ash2e | 100.103.147.94 | Oracle Cloud US | server |
| onecloud1 | 100.98.209.50 | Armbian | server |
| de | 100.120.225.29 | Armbian | server |
| bj-semaphore | 100.116.158.95 | Semaphore | server |
### Client nodes (2)
| Node | IP address | Port | Region | Role |
|---------|--------|------|------|------|
| master | 100.117.106.136 | 60022 | Oracle Cloud A1 | client |
| ash3c | 100.116.80.94 | - | Oracle Cloud A1 | client |
## Configuration Parameters
### Common settings
- `consul_version`: 1.21.5
- `datacenter`: dc1
- `encrypt_key`: 1EvGItLOB8nuHnSA0o+rO0zXzLeJl+U+Jfvuw0+H848=
- `client_addr`: 0.0.0.0
- `data_dir`: /opt/consul/data
- `config_dir`: /etc/consul.d
- `log_level`: INFO
- `port`: 8500
### Server-specific settings
- `consul_server`: true
- `bootstrap_expect`: 7
- `ui_config`: true
### Client-specific settings
- `consul_server`: false
## Notes
1. **Retired nodes**: the hcs node was retired on 2025-09-27 and is no longer included in this configuration.
2. **Faulty nodes**: the syd node is faulty and has been isolated; it is not included in this configuration.
3. **Port configuration**: the master node uses SSH port 60022; all other nodes use the default SSH port.
4. **Credentials**: all nodes use the same credentials (user: ben, password: 3131).
5. **bootstrap_expect**: set to 7, meaning 7 server nodes are expected to form the cluster.
## Changelog
- 2025-06-17: initial version with the full CSOL Consul node configuration.
## Maintenance
1. When adding a new node, update all configuration files at the same time.
2. When a node is retired or fails, remove it from the configuration promptly and update the notes.
3. Periodically verify node reachability and configuration correctness.
4. After updating the configuration, keep this README in sync.
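For the maintenance step of periodically verifying node reachability, an ad-hoc Ansible ping against the inventory described above is usually enough. A sketch, assuming it is run from this directory:

```bash
# Reach every node in the cluster group defined in csol-consul-nodes.ini.
ansible -i csol-consul-nodes.ini consul_cluster -m ping -o

# Spot-check cluster membership from the server nodes.
ansible -i csol-consul-nodes.ini consul_servers -m shell -a 'consul members' -b
```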

View File

@ -1,47 +0,0 @@
# CSOL Consul cluster inventory - updated: 2025-06-17
# This file contains all CSOL Consul server nodes
[consul_servers]
# Oracle Cloud Korea region (KR)
ch2 ansible_host=100.90.159.68 ansible_user=ben ansible_password=3131 ansible_become_password=3131
ch3 ansible_host=100.86.141.112 ansible_user=ben ansible_password=3131 ansible_become_password=3131
# Oracle Cloud US region
ash1d ansible_host=100.81.26.3 ansible_user=ben ansible_password=3131 ansible_become_password=3131
ash2e ansible_host=100.103.147.94 ansible_user=ben ansible_password=3131 ansible_become_password=3131
# Armbian nodes
onecloud1 ansible_host=100.98.209.50 ansible_user=ben ansible_password=3131 ansible_become_password=3131
de ansible_host=100.120.225.29 ansible_user=ben ansible_password=3131 ansible_become_password=3131
# Semaphore node
bj-semaphore ansible_host=100.116.158.95 ansible_user=root
[consul_cluster:children]
consul_servers
[consul_servers:vars]
# Consul server settings
ansible_ssh_common_args='-o StrictHostKeyChecking=no'
consul_version=1.21.5
consul_datacenter=dc1
consul_encrypt_key=1EvGItLOB8nuHnSA0o+rO0zXzLeJl+U+Jfvuw0+H848=
consul_bootstrap_expect=7
consul_server=true
consul_ui_config=true
consul_client_addr=0.0.0.0
consul_bind_addr="{{ ansible_default_ipv4.address }}"
consul_data_dir=/opt/consul/data
consul_config_dir=/etc/consul.d
consul_log_level=INFO
consul_port=8500
# === Node notes ===
# Server nodes (7):
# - Oracle Cloud KR: ch2, ch3
# - Oracle Cloud US: ash1d, ash2e
# - Armbian: onecloud1, de
# - Semaphore: bj-semaphore
#
# Note: the hcs node was retired (2025-09-27)
# Note: the syd node is faulty and has been isolated

View File

@ -1,65 +0,0 @@
# CSOL Consul static node configuration
# Updated: 2025-06-17 (based on the live Consul cluster state)
# This file contains all CSOL server and client nodes
[consul_servers]
# Primary server nodes (all in server mode)
master ansible_host=100.117.106.136 ansible_user=ben ansible_password=3131 ansible_become_password=3131 ansible_port=60022
ash3c ansible_host=100.116.80.94 ansible_user=ben ansible_password=3131 ansible_become_password=3131
warden ansible_host=100.122.197.112 ansible_user=ben ansible_password=3131 ansible_become_password=3131
[consul_clients]
# Client nodes
bj-warden ansible_host=100.122.197.112 ansible_user=ben ansible_password=3131 ansible_become_password=3131
bj-hcp2 ansible_host=100.116.112.45 ansible_user=root ansible_password=313131 ansible_become_password=313131
bj-influxdb ansible_host=100.100.7.4 ansible_user=root ansible_password=313131 ansible_become_password=313131
bj-hcp1 ansible_host=100.97.62.111 ansible_user=root ansible_password=313131 ansible_become_password=313131
[consul_cluster:children]
consul_servers
consul_clients
[consul_servers:vars]
# Consul server settings
consul_server=true
consul_bootstrap_expect=3
consul_datacenter=dc1
consul_encrypt_key=1EvGItLOB8nuHnSA0o+rO0zXzLeJl+U+Jfvuw0+H848=
consul_client_addr=0.0.0.0
consul_bind_addr="{{ ansible_default_ipv4.address }}"
consul_data_dir=/opt/consul/data
consul_config_dir=/etc/consul.d
consul_log_level=INFO
consul_port=8500
consul_ui_config=true
[consul_clients:vars]
# Consul client settings
consul_server=false
consul_datacenter=dc1
consul_encrypt_key=1EvGItLOB8nuHnSA0o+rO0zXzLeJl+U+Jfvuw0+H848=
consul_client_addr=0.0.0.0
consul_bind_addr="{{ ansible_default_ipv4.address }}"
consul_data_dir=/opt/consul/data
consul_config_dir=/etc/consul.d
consul_log_level=INFO
[consul_cluster:vars]
# Common settings
ansible_ssh_common_args='-o StrictHostKeyChecking=no'
ansible_ssh_private_key_file=~/.ssh/id_ed25519
consul_version=1.21.5
# === Node notes ===
# Server nodes (3):
# - bj-semaphore: 100.116.158.95 (primary server node)
# - kr-master: 100.117.106.136 (Korea primary node)
# - us-ash3c: 100.116.80.94 (US server node)
#
# Client nodes (4):
# - bj-warden: 100.122.197.112 (Beijing client node)
# - bj-hcp2: 100.116.112.45 (Beijing HCP client node 2)
# - bj-influxdb: 100.100.7.4 (Beijing InfluxDB client node)
# - bj-hcp1: 100.97.62.111 (Beijing HCP client node 1)
#
# Note: this configuration reflects the live Consul cluster and contains 3 server nodes

View File

@ -1,44 +0,0 @@
# Consul static node configuration
# This file contains all CSOL server and client nodes
# Updated: 2025-06-17 (based on the live Consul cluster state)
# === CSOL server nodes ===
# These nodes run Consul in server mode and take part in cluster consensus and data storage
[consul_servers]
# Primary server nodes (all in server mode)
master ansible_host=100.117.106.136 ansible_user=ben ansible_password=3131 ansible_become_password=3131 ansible_port=60022
ash3c ansible_host=100.116.80.94 ansible_user=ben ansible_password=3131 ansible_become_password=3131
warden ansible_host=100.122.197.112 ansible_user=ben ansible_password=3131 ansible_become_password=3131
# === Node groups ===
[consul_cluster:children]
consul_servers
[consul_servers:vars]
# Consul server settings
consul_server=true
consul_bootstrap_expect=3
consul_datacenter=dc1
consul_encrypt_key=1EvGItLOB8nuHnSA0o+rO0zXzLeJl+U+Jfvuw0+H848=
consul_client_addr=0.0.0.0
consul_bind_addr="{{ ansible_default_ipv4.address }}"
consul_data_dir=/opt/consul/data
consul_config_dir=/etc/consul.d
consul_log_level=INFO
consul_port=8500
consul_ui_config=true
[consul_cluster:vars]
# Common settings
ansible_ssh_common_args='-o StrictHostKeyChecking=no'
consul_version=1.21.5
# === Node notes ===
# Server nodes (3):
# - master: 100.117.106.136 (Korea primary node)
# - ash3c: 100.116.80.94 (US server node)
# - warden: 100.122.197.112 (Beijing server node, current cluster leader)
#
# Note: this configuration reflects the live Consul cluster; all nodes run in server mode

View File

@ -1,126 +0,0 @@
{
"csol_consul_nodes": {
"updated_at": "2025-06-17",
"description": "CSOL Consul静态节点配置",
"servers": {
"description": "Consul服务器节点参与集群决策和数据存储",
"nodes": [
{
"name": "ch2",
"host": "100.90.159.68",
"user": "ben",
"password": "3131",
"become_password": "3131",
"region": "Oracle Cloud KR",
"role": "server"
},
{
"name": "ch3",
"host": "100.86.141.112",
"user": "ben",
"password": "3131",
"become_password": "3131",
"region": "Oracle Cloud KR",
"role": "server"
},
{
"name": "ash1d",
"host": "100.81.26.3",
"user": "ben",
"password": "3131",
"become_password": "3131",
"region": "Oracle Cloud US",
"role": "server"
},
{
"name": "ash2e",
"host": "100.103.147.94",
"user": "ben",
"password": "3131",
"become_password": "3131",
"region": "Oracle Cloud US",
"role": "server"
},
{
"name": "onecloud1",
"host": "100.98.209.50",
"user": "ben",
"password": "3131",
"become_password": "3131",
"region": "Armbian",
"role": "server"
},
{
"name": "de",
"host": "100.120.225.29",
"user": "ben",
"password": "3131",
"become_password": "3131",
"region": "Armbian",
"role": "server"
},
{
"name": "bj-semaphore",
"host": "100.116.158.95",
"user": "root",
"region": "Semaphore",
"role": "server"
}
]
},
"clients": {
"description": "Consul客户端节点用于服务发现和健康检查",
"nodes": [
{
"name": "ch4",
"host": "100.117.106.136",
"user": "ben",
"password": "3131",
"become_password": "3131",
"port": 60022,
"region": "Oracle Cloud A1",
"role": "client"
},
{
"name": "ash3c",
"host": "100.116.80.94",
"user": "ben",
"password": "3131",
"become_password": "3131",
"region": "Oracle Cloud A1",
"role": "client"
}
]
},
"configuration": {
"consul_version": "1.21.5",
"datacenter": "dc1",
"encrypt_key": "1EvGItLOB8nuHnSA0o+rO0zXzLeJl+U+Jfvuw0+H848=",
"client_addr": "0.0.0.0",
"data_dir": "/opt/consul/data",
"config_dir": "/etc/consul.d",
"log_level": "INFO",
"port": 8500,
"bootstrap_expect": 7,
"ui_config": true
},
"notes": {
"server_count": 7,
"client_count": 2,
"total_nodes": 9,
"retired_nodes": [
{
"name": "hcs",
"retired_date": "2025-09-27",
"reason": "节点退役"
}
],
"isolated_nodes": [
{
"name": "syd",
"reason": "故障节点,已隔离"
}
]
}
}
}

View File

@ -1,20 +0,0 @@
# Nomad cluster global configuration
# InfluxDB 2.x + Grafana monitoring settings
# InfluxDB 2.x connection settings
influxdb_url: "http://influxdb1.tailnet-68f9.ts.net:8086"
influxdb_token: "VU_dOCVZzqEHb9jSFsDe0bJlEBaVbiG4LqfoczlnmcbfrbmklSt904HJPL4idYGvVi0c2eHkYDi2zCTni7Ay4w=="
influxdb_org: "seekkey" # organization name
influxdb_bucket: "VPS" # bucket name
# Remote Telegraf configuration URL
telegraf_config_url: "http://influxdb1.tailnet-68f9.ts.net:8086/api/v2/telegrafs/0f8a73496790c000"
# Monitoring thresholds
disk_usage_warning: 80 # disk usage warning threshold (%)
disk_usage_critical: 90 # disk usage critical threshold (%)
collection_interval: 30 # collection interval (seconds)
# Telegraf tuning
telegraf_log_level: "ERROR" # log errors only
telegraf_disable_local_logs: true # disable local log files
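A quick way to confirm the connection settings above before rolling Telegraf out is to hit the InfluxDB 2.x health and buckets endpoints with the same token. A sketch (token truncated here):

```bash
INFLUX_URL="http://influxdb1.tailnet-68f9.ts.net:8086"
INFLUX_TOKEN="VU_dOCVZ..."   # the influxdb_token value from above, truncated

# Instance health (no auth required).
curl -s "${INFLUX_URL}/health"

# Verify the token can see the VPS bucket in the seekkey org.
curl -s -H "Authorization: Token ${INFLUX_TOKEN}" \
  "${INFLUX_URL}/api/v2/buckets?org=seekkey" | jq '.buckets[].name'
```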

View File

@ -1,37 +0,0 @@
[nomad_servers]
# Server nodes (7 server nodes)
# ⚠️ Warning: with great power comes great responsibility - operate on server nodes with extreme care!
# ⚠️ Any operation on a server node can affect the stability of the entire cluster!
semaphore ansible_host=127.0.0.1 ansible_user=root ansible_password=3131 ansible_become_password=3131 ansible_ssh_common_args="-o PreferredAuthentications=password -o PubkeyAuthentication=no"
ash1d ansible_host=ash1d.tailnet-68f9.ts.net ansible_user=ben ansible_password=3131 ansible_become_password=3131
ash2e ansible_host=ash2e.tailnet-68f9.ts.net ansible_user=ben
ch2 ansible_host=ch2.tailnet-68f9.ts.net ansible_user=ben ansible_password=3131 ansible_become_password=3131
ch3 ansible_host=ch3.tailnet-68f9.ts.net ansible_user=ben ansible_password=3131 ansible_become_password=3131
onecloud1 ansible_host=onecloud1.tailnet-68f9.ts.net ansible_user=ben ansible_password=3131 ansible_become_password=3131
de ansible_host=de.tailnet-68f9.ts.net ansible_user=ben ansible_password=3131 ansible_become_password=3131
hcp1 ansible_host=hcp1.tailnet-68f9.ts.net ansible_user=root ansible_password=3131 ansible_become_password=3131
[nomad_clients]
# Client nodes (5 client nodes)
ch4 ansible_host=ch4.tailnet-68f9.ts.net ansible_user=ben ansible_password=3131 ansible_become_password=3131
ash3c ansible_host=ash3c.tailnet-68f9.ts.net ansible_user=ben ansible_password=3131 ansible_become_password=3131
browser ansible_host=browser.tailnet-68f9.ts.net ansible_user=ben ansible_password=3131 ansible_become_password=3131
influxdb1 ansible_host=influxdb1.tailnet-68f9.ts.net ansible_user=ben ansible_password=3131 ansible_become_password=3131
warden ansible_host=warden.tailnet-68f9.ts.net ansible_user=ben ansible_password=3131 ansible_become_password=3131
[nomad_nodes:children]
nomad_servers
nomad_clients
[nomad_nodes:vars]
# NFS configuration
nfs_server=snail
nfs_share=/fs/1000/nfs/Fnsync
mount_point=/mnt/fnsync
# Ansible configuration
ansible_ssh_common_args='-o StrictHostKeyChecking=no'
[gitea]
gitea ansible_host=gitea ansible_user=ben ansible_password=3131 ansible_become_password=3131

View File

@ -1,98 +0,0 @@
[dev]
dev1 ansible_host=dev1 ansible_user=ben ansible_become=yes ansible_become_pass=3131
dev2 ansible_host=dev2 ansible_user=ben ansible_become=yes ansible_become_pass=3131
[oci_kr]
#ch2 ansible_host=ch2 ansible_user=ben ansible_become=yes ansible_become_pass=3131 # stale node, removed (2025-09-30)
#ch3 ansible_host=ch3 ansible_user=ben ansible_become=yes ansible_become_pass=3131 # stale node, removed (2025-09-30)
[oci_us]
ash1d ansible_host=ash1d ansible_user=ben ansible_become=yes ansible_become_pass=3131
ash2e ansible_host=ash2e ansible_user=ben ansible_become=yes ansible_become_pass=3131
[oci_a1]
ch4 ansible_host=ch4 ansible_user=ben ansible_become=yes ansible_become_pass=3131
ash3c ansible_host=ash3c ansible_user=ben ansible_become=yes ansible_become_pass=3131
[huawei]
# hcs node retired (2025-09-27)
[google]
benwork ansible_host=benwork ansible_user=ben ansible_become=yes ansible_become_pass=3131
[ditigalocean]
# syd ansible_host=syd ansible_user=ben ansible_become=yes ansible_become_pass=3131 # faulty node, isolated
[faulty_cloud_servers]
# Faulty cloud server nodes - to be resolved via OpenTofu and Consul
# hcs node retired (2025-09-27)
syd ansible_host=syd ansible_user=ben ansible_become=yes ansible_become_pass=3131
[aws]
#aws linux dnf
awsirish ansible_host=awsirish ansible_user=ben ansible_become=yes ansible_become_pass=3131
[proxmox]
pve ansible_host=pve ansible_user=root ansible_become=yes ansible_become_pass=Aa313131@ben
xgp ansible_host=xgp ansible_user=root ansible_become=yes ansible_become_pass=Aa313131@ben
nuc12 ansible_host=nuc12 ansible_user=root ansible_become=yes ansible_become_pass=Aa313131@ben
[lxc]
# Concentrated on three machines - do not upgrade them all at once or they will die; schedule sequentially (Debian/Ubuntu containers using apt)
gitea ansible_host=gitea.tailnet-68f9.ts.net ansible_user=ben ansible_ssh_private_key_file=/root/.ssh/gitea ansible_become=yes ansible_become_pass=3131
mysql ansible_host=mysql ansible_user=root ansible_become=yes ansible_become_pass=313131
postgresql ansible_host=postgresql ansible_user=root ansible_become=yes ansible_become_pass=313131
[nomadlxc]
influxdb ansible_host=influxdb1 ansible_user=root ansible_become=yes ansible_become_pass=313131
warden ansible_host=warden ansible_user=ben ansible_become=yes ansible_become_pass=3131
[semaphore]
#semaphoressh ansible_host=localhost ansible_user=root ansible_become=yes ansible_become_pass=313131 ansible_ssh_pass=313131 # stale node, removed (2025-09-30)
[alpine]
#Alpine Linux containers using apk package manager
redis ansible_host=redis ansible_user=root ansible_become=yes ansible_become_pass=313131
authentik ansible_host=authentik ansible_user=root ansible_become=yes ansible_become_pass=313131
calibreweb ansible_host=calibreweb ansible_user=root ansible_become=yes ansible_become_pass=313131
qdrant ansible_host=qdrant ansible_user=root ansible_become=yes
[vm]
kali ansible_host=kali ansible_user=ben ansible_become=yes ansible_become_pass=3131
[hcp]
hcp1 ansible_host=hcp1 ansible_user=root ansible_become=yes ansible_become_pass=313131
# hcp2 ansible_host=hcp2 ansible_user=root ansible_become=yes ansible_become_pass=313131 # node does not exist, commented out (2025-10-10)
[feiniu]
snail ansible_host=snail ansible_user=houzhongxu ansible_ssh_pass=Aa313131@ben ansible_become=yes ansible_become_pass=Aa313131@ben
[armbian]
onecloud1 ansible_host=100.98.209.50 ansible_user=ben ansible_password=3131 ansible_become_password=3131
de ansible_host=100.120.225.29 ansible_user=ben ansible_password=3131 ansible_become_password=3131
[beijing:children]
nomadlxc
hcp
[all:vars]
ansible_ssh_common_args='-o StrictHostKeyChecking=no'
[nomad_clients:children]
nomadlxc
hcp
oci_a1
huawei
ditigalocean
[nomad_servers:children]
oci_us
oci_kr
semaphore
armbian
[nomad_cluster:children]
nomad_servers
nomad_clients

View File

@ -1,7 +0,0 @@
[target_nodes]
master ansible_host=100.117.106.136 ansible_port=60022 ansible_user=ben ansible_become=yes ansible_become_pass=3131
ash3c ansible_host=100.116.80.94 ansible_user=ben ansible_become=yes ansible_become_pass=3131
semaphore ansible_host=100.116.158.95 ansible_user=ben ansible_become=yes ansible_become_pass=3131
[target_nodes:vars]
ansible_ssh_common_args='-o StrictHostKeyChecking=no'

View File

@ -1,14 +0,0 @@
# Nomad client node configuration
# This file lists the 6 nodes to be configured as Nomad clients
[nomad_clients]
bj-hcp1 ansible_host=bj-hcp1 ansible_user=root ansible_password=313131 ansible_become_password=313131
bj-influxdb ansible_host=bj-influxdb ansible_user=root ansible_password=313131 ansible_become_password=313131
bj-warden ansible_host=bj-warden ansible_user=ben ansible_password=3131 ansible_become_password=3131
bj-hcp2 ansible_host=bj-hcp2 ansible_user=root ansible_password=313131 ansible_become_password=313131
kr-master ansible_host=master ansible_port=60022 ansible_user=ben ansible_password=3131 ansible_become_password=3131
us-ash3c ansible_host=ash3c ansible_user=ben ansible_password=3131 ansible_become_password=3131
[nomad_clients:vars]
ansible_ssh_common_args='-o StrictHostKeyChecking=no'
client_ip="{{ ansible_host }}"

View File

@ -1,12 +0,0 @@
[consul_servers:children]
nomad_servers
[consul_servers:vars]
consul_cert_dir=/etc/consul.d/certs
consul_ca_src=security/certificates/ca.pem
consul_cert_src=security/certificates/consul-server.pem
consul_key_src=security/certificates/consul-server-key.pem
[nomad_cluster:children]
nomad_servers
nomad_clients

View File

@ -1,7 +0,0 @@
[vault_servers]
master ansible_host=100.117.106.136 ansible_user=ben ansible_password=3131 ansible_become_password=3131 ansible_port=60022
ash3c ansible_host=100.116.80.94 ansible_user=ben ansible_password=3131 ansible_become_password=3131
warden ansible_host=warden ansible_user=ben ansible_become=yes ansible_become_pass=3131
[vault_servers:vars]
ansible_ssh_common_args='-o StrictHostKeyChecking=no'

View File

@ -1,50 +0,0 @@
datacenter = "dc1"
data_dir = "/opt/nomad/data"
plugin_dir = "/opt/nomad/plugins"
log_level = "INFO"
name = "onecloud1"
bind_addr = "100.98.209.50"
addresses {
http = "100.98.209.50"
rpc = "100.98.209.50"
serf = "100.98.209.50"
}
ports {
http = 4646
rpc = 4647
serf = 4648
}
server {
enabled = true
bootstrap_expect = 3
retry_join = ["100.81.26.3", "100.103.147.94", "100.90.159.68", "100.86.141.112", "100.98.209.50", "100.120.225.29"]
}
client {
enabled = false
}
plugin "nomad-driver-podman" {
config {
socket_path = "unix:///run/podman/podman.sock"
volumes {
enabled = true
}
}
}
consul {
address = "100.117.106.136:8500,100.116.80.94:8500,100.122.197.112:8500" # master, ash3c, warden
}
vault {
enabled = true
address = "http://100.117.106.136:8200,http://100.116.80.94:8200,http://100.122.197.112:8200" # master, ash3c, warden
token = "hvs.A5Fu4E1oHyezJapVllKPFsWg"
create_from_role = "nomad-cluster"
tls_skip_verify = true
}
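Once a server running this configuration joins, the cluster state can be checked from any member. A sketch, assuming the HTTP API is reachable on the Tailscale address:

```bash
export NOMAD_ADDR=http://100.98.209.50:4646

# All expected servers should appear as alive, with exactly one leader.
nomad server members

# Raft peers should match the retry_join list that actually formed the quorum.
nomad operator raft list-peers
```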

View File

@ -1,202 +0,0 @@
---
- name: Add Warden Server as Nomad Client to Cluster
hosts: warden
become: yes
gather_facts: yes
vars:
nomad_plugin_dir: "/opt/nomad/plugins"
nomad_datacenter: "dc1"
nomad_region: "global"
nomad_servers:
- "100.117.106.136:4647"
- "100.116.80.94:4647"
- "100.97.62.111:4647"
- "100.116.112.45:4647"
- "100.84.197.26:4647"
tasks:
- name: 显示当前处理的节点
debug:
msg: "🔧 将 warden 服务器添加为 Nomad 客户端: {{ inventory_hostname }}"
- name: 检查 Nomad 是否已安装
shell: which nomad || echo "not_found"
register: nomad_check
changed_when: false
- name: 下载并安装 Nomad
block:
- name: 下载 Nomad 1.10.5
get_url:
url: "https://releases.hashicorp.com/nomad/1.10.5/nomad_1.10.5_linux_amd64.zip"
dest: "/tmp/nomad.zip"
mode: '0644'
- name: 解压并安装 Nomad
unarchive:
src: "/tmp/nomad.zip"
dest: "/usr/local/bin/"
remote_src: yes
owner: root
group: root
mode: '0755'
- name: 清理临时文件
file:
path: "/tmp/nomad.zip"
state: absent
when: nomad_check.stdout == "not_found"
- name: 验证 Nomad 安装
shell: nomad version
register: nomad_version_output
- name: 创建 Nomad 配置目录
file:
path: /etc/nomad.d
state: directory
owner: root
group: root
mode: '0755'
- name: 创建 Nomad 数据目录
file:
path: /opt/nomad/data
state: directory
owner: nomad
group: nomad
mode: '0755'
ignore_errors: yes
- name: 创建 Nomad 插件目录
file:
path: "{{ nomad_plugin_dir }}"
state: directory
owner: nomad
group: nomad
mode: '0755'
ignore_errors: yes
- name: 获取服务器 IP 地址
shell: |
ip route get 1.1.1.1 | grep -oP 'src \K\S+'
register: server_ip_result
changed_when: false
- name: 设置服务器 IP 变量
set_fact:
server_ip: "{{ server_ip_result.stdout }}"
- name: 停止 Nomad 服务(如果正在运行)
systemd:
name: nomad
state: stopped
ignore_errors: yes
- name: 创建 Nomad 客户端配置文件
copy:
content: |
# Nomad Client Configuration for warden
datacenter = "{{ nomad_datacenter }}"
data_dir = "/opt/nomad/data"
log_level = "INFO"
bind_addr = "{{ server_ip }}"
server {
enabled = false
}
client {
enabled = true
servers = [
{% for server in nomad_servers %}"{{ server }}"{% if not loop.last %}, {% endif %}{% endfor %}
]
}
plugin_dir = "{{ nomad_plugin_dir }}"
plugin "podman" {
config {
socket_path = "unix:///run/podman/podman.sock"
volumes {
enabled = true
}
}
}
consul {
address = "127.0.0.1:8500"
}
dest: /etc/nomad.d/nomad.hcl
owner: root
group: root
mode: '0644'
- name: 验证 Nomad 配置
shell: nomad config validate /etc/nomad.d/nomad.hcl
register: nomad_validate
failed_when: nomad_validate.rc != 0
- name: 创建 Nomad systemd 服务文件
copy:
content: |
[Unit]
Description=Nomad
Documentation=https://www.nomadproject.io/docs/
Wants=network-online.target
After=network-online.target
[Service]
Type=notify
User=root
Group=root
ExecStart=/usr/local/bin/nomad agent -config=/etc/nomad.d
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
KillSignal=SIGINT
TimeoutStopSec=5
LimitNOFILE=65536
LimitNPROC=32768
Restart=on-failure
RestartSec=2
[Install]
WantedBy=multi-user.target
dest: /etc/systemd/system/nomad.service
mode: '0644'
- name: 重新加载 systemd 配置
systemd:
daemon_reload: yes
- name: 启动并启用 Nomad 服务
systemd:
name: nomad
state: started
enabled: yes
- name: 等待 Nomad 服务启动
wait_for:
port: 4646
host: "{{ server_ip }}"
delay: 5
timeout: 60
- name: 检查 Nomad 客户端状态
shell: nomad node status -self
register: nomad_node_status
retries: 5
delay: 5
until: nomad_node_status.rc == 0
ignore_errors: yes
- name: 显示 Nomad 客户端配置结果
debug:
msg: |
✅ warden 服务器已成功配置为 Nomad 客户端
📦 Nomad 版本: {{ nomad_version_output.stdout.split('\n')[0] }}
🌐 服务器 IP: {{ server_ip }}
🏗️ 数据中心: {{ nomad_datacenter }}
📊 客户端状态: {{ 'SUCCESS' if nomad_node_status.rc == 0 else 'PENDING' }}
🚀 warden 现在是 Nomad 集群的一部分

View File

@ -1,22 +0,0 @@
---
- name: Thorough cleanup of Nomad configuration backup files
hosts: nomad_nodes
become: yes
tasks:
- name: Remove all backup files with various patterns
shell: |
find /etc/nomad.d/ -name "nomad.hcl.*" -not -name "nomad.hcl" -delete
find /etc/nomad.d/ -name "*.bak" -delete
find /etc/nomad.d/ -name "*.backup*" -delete
find /etc/nomad.d/ -name "*.~" -delete
find /etc/nomad.d/ -name "*.broken" -delete
ignore_errors: yes
- name: List remaining files in /etc/nomad.d/
command: ls -la /etc/nomad.d/
register: remaining_files
changed_when: false
- name: Display remaining files
debug:
var: remaining_files.stdout_lines

View File

@ -1,25 +0,0 @@
---
- name: Cleanup Nomad configuration backup files
hosts: nomad_nodes
become: yes
tasks:
- name: Remove backup files from /etc/nomad.d/
file:
path: "{{ item }}"
state: absent
loop:
- "/etc/nomad.d/*.bak"
- "/etc/nomad.d/*.backup"
- "/etc/nomad.d/*.~"
- "/etc/nomad.d/*.broken"
- "/etc/nomad.d/nomad.hcl.*"
ignore_errors: yes
- name: List remaining files in /etc/nomad.d/
command: ls -la /etc/nomad.d/
register: remaining_files
changed_when: false
- name: Display remaining files
debug:
var: remaining_files.stdout_lines

View File

@ -1,39 +0,0 @@
---
- name: 配置Nomad客户端节点
hosts: nomad_clients
become: yes
vars:
nomad_config_dir: /etc/nomad.d
tasks:
- name: 创建Nomad配置目录
file:
path: "{{ nomad_config_dir }}"
state: directory
owner: root
group: root
mode: '0755'
- name: 复制Nomad客户端配置模板
template:
src: ../templates/nomad-client.hcl
dest: "{{ nomad_config_dir }}/nomad.hcl"
owner: root
group: root
mode: '0644'
- name: 启动Nomad服务
systemd:
name: nomad
state: restarted
enabled: yes
daemon_reload: yes
- name: 检查Nomad服务状态
command: systemctl status nomad
register: nomad_status
changed_when: false
- name: 显示Nomad服务状态
debug:
var: nomad_status.stdout_lines

View File

@ -1,44 +0,0 @@
---
- name: 统一配置所有Nomad节点
hosts: nomad_cluster
become: yes
tasks:
- name: 备份当前Nomad配置
copy:
src: /etc/nomad.d/nomad.hcl
dest: /etc/nomad.d/nomad.hcl.bak
remote_src: yes
ignore_errors: yes
- name: 生成统一Nomad配置
template:
src: ../templates/nomad-unified.hcl.j2
dest: /etc/nomad.d/nomad.hcl
owner: root
group: root
mode: '0644'
- name: 重启Nomad服务
systemd:
name: nomad
state: restarted
enabled: yes
daemon_reload: yes
- name: 等待Nomad服务就绪
wait_for:
port: 4646
host: "{{ inventory_hostname }}.tailnet-68f9.ts.net"
delay: 10
timeout: 60
ignore_errors: yes
- name: 检查Nomad服务状态
command: systemctl status nomad
register: nomad_status
changed_when: false
- name: 显示Nomad服务状态
debug:
var: nomad_status.stdout_lines

View File

@ -1,62 +0,0 @@
---
- name: Configure Nomad Dynamic Host Volumes for NFS
hosts: nomad_clients
become: yes
vars:
nfs_server: "snail"
nfs_share: "/fs/1000/nfs/Fnsync"
mount_point: "/mnt/fnsync"
tasks:
- name: Stop Nomad service
systemd:
name: nomad
state: stopped
- name: Update Nomad configuration for dynamic host volumes
blockinfile:
path: /etc/nomad.d/nomad.hcl
marker: "# {mark} DYNAMIC HOST VOLUMES CONFIGURATION"
block: |
client {
# 启用动态host volumes
host_volume "fnsync" {
path = "{{ mount_point }}"
read_only = false
}
# 添加NFS相关的节点元数据
meta {
nfs_server = "{{ nfs_server }}"
nfs_share = "{{ nfs_share }}"
nfs_mounted = "true"
}
}
insertafter: 'client {'
- name: Start Nomad service
systemd:
name: nomad
state: started
enabled: yes
- name: Wait for Nomad to start
wait_for:
port: 4646
delay: 10
timeout: 60
- name: Check Nomad status
command: nomad node status
register: nomad_status
ignore_errors: yes
- name: Display Nomad status
debug:
var: nomad_status.stdout_lines

View File

@ -1,57 +0,0 @@
---
- name: Configure Podman driver for all Nomad client nodes
hosts: target_nodes
become: yes
tasks:
- name: Stop Nomad service
systemd:
name: nomad
state: stopped
- name: Install Podman if not present
package:
name: podman
state: present
ignore_errors: yes
- name: Enable Podman socket
systemd:
name: podman.socket
enabled: yes
state: started
ignore_errors: yes
- name: Update Nomad configuration to use Podman
lineinfile:
path: /etc/nomad.d/nomad.hcl
regexp: '^plugin "docker"'
line: 'plugin "podman" {'
state: present
- name: Add Podman plugin configuration
blockinfile:
path: /etc/nomad.d/nomad.hcl
marker: "# {mark} PODMAN PLUGIN CONFIG"
block: |
plugin "podman" {
config {
socket_path = "unix:///run/podman/podman.sock"
volumes {
enabled = true
}
}
}
insertafter: 'client {'
- name: Start Nomad service
systemd:
name: nomad
state: started
- name: Wait for Nomad to be ready
wait_for:
port: 4646
host: localhost
delay: 5
timeout: 30

View File

@ -1,22 +0,0 @@
---
- name: Configure NOPASSWD sudo for nomad user
hosts: nomad_clients
become: yes
tasks:
- name: Ensure sudoers.d directory exists
file:
path: /etc/sudoers.d
state: directory
owner: root
group: root
mode: '0750'
- name: Allow nomad user passwordless sudo for required commands
copy:
dest: /etc/sudoers.d/nomad
content: |
nomad ALL=(ALL) NOPASSWD: /usr/bin/apt, /usr/bin/systemctl, /bin/mkdir, /bin/chown, /bin/chmod, /bin/mv, /bin/sed, /usr/bin/tee, /usr/sbin/usermod, /usr/bin/unzip, /usr/bin/wget
owner: root
group: root
mode: '0440'
validate: 'visudo -cf %s'

View File

@ -1,226 +0,0 @@
---
- name: 配置 Nomad 集群使用 Tailscale 网络通讯
hosts: nomad_cluster
become: yes
gather_facts: no
vars:
nomad_config_dir: "/etc/nomad.d"
nomad_config_file: "{{ nomad_config_dir }}/nomad.hcl"
tasks:
- name: 获取当前节点的 Tailscale IP
shell: tailscale ip | head -1
register: current_tailscale_ip
changed_when: false
ignore_errors: yes
- name: 计算用于 Nomad 的地址(优先 Tailscale回退到 inventory 或 ansible_host
set_fact:
node_addr: "{{ (current_tailscale_ip.stdout | default('')) is match('^100\\.') | ternary((current_tailscale_ip.stdout | trim), (hostvars[inventory_hostname].tailscale_ip | default(ansible_host))) }}"
- name: 确保 Nomad 配置目录存在
file:
path: "{{ nomad_config_dir }}"
state: directory
owner: root
group: root
mode: '0755'
- name: 生成 Nomad 服务器配置(使用 Tailscale
copy:
dest: "{{ nomad_config_file }}"
owner: root
group: root
mode: '0644'
content: |
datacenter = "{{ nomad_datacenter | default('dc1') }}"
data_dir = "/opt/nomad/data"
log_level = "INFO"
bind_addr = "{{ node_addr }}"
addresses {
http = "{{ node_addr }}"
rpc = "{{ node_addr }}"
serf = "{{ node_addr }}"
}
ports {
http = 4646
rpc = 4647
serf = 4648
}
server {
enabled = true
bootstrap_expect = {{ nomad_bootstrap_expect | default(4) }}
retry_join = [
"100.116.158.95", # semaphore
"100.103.147.94", # ash2e
"100.81.26.3", # ash1d
"100.90.159.68" # ch2
]
encrypt = "{{ nomad_encrypt_key }}"
}
client {
enabled = false
}
plugin "podman" {
config {
socket_path = "unix:///run/podman/podman.sock"
volumes {
enabled = true
}
}
}
consul {
address = "{{ node_addr }}:8500"
}
when: nomad_role == "server"
notify: restart nomad
- name: 生成 Nomad 客户端配置(使用 Tailscale
copy:
dest: "{{ nomad_config_file }}"
owner: root
group: root
mode: '0644'
content: |
datacenter = "{{ nomad_datacenter | default('dc1') }}"
data_dir = "/opt/nomad/data"
log_level = "INFO"
bind_addr = "{{ node_addr }}"
addresses {
http = "{{ node_addr }}"
rpc = "{{ node_addr }}"
serf = "{{ node_addr }}"
}
ports {
http = 4646
rpc = 4647
serf = 4648
}
server {
enabled = false
}
client {
enabled = true
network_interface = "tailscale0"
cpu_total_compute = 0
servers = [
"100.116.158.95:4647", # semaphore
"100.103.147.94:4647", # ash2e
"100.81.26.3:4647", # ash1d
"100.90.159.68:4647" # ch2
]
}
plugin "podman" {
config {
socket_path = "unix:///run/podman/podman.sock"
volumes {
enabled = true
}
}
}
consul {
address = "{{ node_addr }}:8500"
}
when: nomad_role == "client"
notify: restart nomad
- name: 检查 Nomad 二进制文件位置
shell: which nomad || find /usr -name nomad 2>/dev/null | head -1
register: nomad_binary_path
failed_when: nomad_binary_path.stdout == ""
- name: 创建/更新 Nomad systemd 服务文件
copy:
dest: "/etc/systemd/system/nomad.service"
owner: root
group: root
mode: '0644'
content: |
[Unit]
Description=Nomad
Documentation=https://www.nomadproject.io/
Requires=network-online.target
After=network-online.target
[Service]
Type=notify
User=root
Group=root
ExecStart={{ nomad_binary_path.stdout }} agent -config=/etc/nomad.d/nomad.hcl
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
notify: restart nomad
- name: 确保 Nomad 数据目录存在
file:
path: "/opt/nomad/data"
state: directory
owner: root
group: root
mode: '0755'
- name: 重新加载 systemd daemon
systemd:
daemon_reload: yes
- name: 启用并启动 Nomad 服务
systemd:
name: nomad
enabled: yes
state: started
- name: 等待 Nomad 服务启动
wait_for:
port: 4646
host: "{{ node_addr }}"
delay: 5
timeout: 30
ignore_errors: yes
- name: 检查 Nomad 服务状态
shell: systemctl status nomad --no-pager -l
register: nomad_status
ignore_errors: yes
- name: 显示配置结果
debug:
msg: |
✅ 节点 {{ inventory_hostname }} 配置完成
🌐 使用地址: {{ node_addr }}
🎯 角色: {{ nomad_role }}
🔧 Nomad 二进制: {{ nomad_binary_path.stdout }}
📊 服务状态: {{ 'active' if nomad_status.rc == 0 else 'failed' }}
{% if nomad_status.rc != 0 %}
❌ 错误信息:
{{ nomad_status.stdout }}
{{ nomad_status.stderr }}
{% endif %}
handlers:
- name: restart nomad
systemd:
name: nomad
state: restarted
daemon_reload: yes

View File

@ -1,115 +0,0 @@
---
- name: Configure Podman for Nomad Integration
hosts: all
become: yes
gather_facts: yes
tasks:
- name: 显示当前处理的节点
debug:
msg: "🔧 正在为 Nomad 配置 Podman: {{ inventory_hostname }}"
- name: 确保 Podman 已安装
package:
name: podman
state: present
- name: 启用并启动 Podman socket 服务
systemd:
name: podman.socket
enabled: yes
state: started
- name: 创建 Podman 系统配置目录
file:
path: /etc/containers
state: directory
mode: '0755'
- name: 配置 Podman 使用系统 socket
copy:
content: |
[engine]
# 使用系统级 socket 而不是用户级 socket
active_service = "system"
[engine.service_destinations]
[engine.service_destinations.system]
uri = "unix:///run/podman/podman.sock"
dest: /etc/containers/containers.conf
mode: '0644'
- name: 检查是否存在 nomad 用户
getent:
database: passwd
key: nomad
register: nomad_user_check
ignore_errors: yes
- name: 为 nomad 用户创建配置目录
file:
path: "/home/nomad/.config/containers"
state: directory
owner: nomad
group: nomad
mode: '0755'
when: nomad_user_check is succeeded
- name: 为 nomad 用户配置 Podman
copy:
content: |
[engine]
active_service = "system"
[engine.service_destinations]
[engine.service_destinations.system]
uri = "unix:///run/podman/podman.sock"
dest: /home/nomad/.config/containers/containers.conf
owner: nomad
group: nomad
mode: '0644'
when: nomad_user_check is succeeded
- name: 将 nomad 用户添加到 podman 组
user:
name: nomad
groups: podman
append: yes
when: nomad_user_check is succeeded
ignore_errors: yes
- name: 创建 podman 组(如果不存在)
group:
name: podman
state: present
ignore_errors: yes
- name: 设置 podman socket 目录权限
file:
path: /run/podman
state: directory
mode: '0755'
group: podman
ignore_errors: yes
- name: Make the Podman socket accessible to the nomad user
file:
path: /run/podman/podman.sock
mode: '0666'
when: nomad_user_check is succeeded
ignore_errors: yes
- name: 验证 Podman 安装
shell: podman --version
register: podman_version
- name: 测试 Podman 功能
shell: podman info
register: podman_info
ignore_errors: yes
- name: 显示配置结果
debug:
msg: |
✅ 节点 {{ inventory_hostname }} Podman 配置完成
📦 Podman 版本: {{ podman_version.stdout }}
🐳 Podman 状态: {{ 'SUCCESS' if podman_info.rc == 0 else 'WARNING' }}
👤 Nomad 用户: {{ 'FOUND' if nomad_user_check is succeeded else 'NOT FOUND' }}

View File

@ -1,105 +0,0 @@
---
- name: 部署韩国节点Nomad配置
hosts: ch2,ch3
become: yes
gather_facts: no
vars:
nomad_config_dir: "/etc/nomad.d"
nomad_config_file: "{{ nomad_config_dir }}/nomad.hcl"
source_config_dir: "/root/mgmt/infrastructure/configs/server"
tasks:
- name: Derive the short hostname (strip any domain suffix)
set_fact:
short_hostname: "{{ inventory_hostname | regex_replace('\\..*$', '') }}"
- name: 确保 Nomad 配置目录存在
file:
path: "{{ nomad_config_dir }}"
state: directory
owner: root
group: root
mode: '0755'
- name: 部署 Nomad 配置文件到韩国节点
copy:
src: "{{ source_config_dir }}/nomad-{{ short_hostname }}.hcl"
dest: "{{ nomad_config_file }}"
owner: root
group: root
mode: '0644'
backup: yes
notify: restart nomad
- name: 检查 Nomad 二进制文件位置
shell: which nomad || find /usr -name nomad 2>/dev/null | head -1
register: nomad_binary_path
failed_when: nomad_binary_path.stdout == ""
- name: 创建/更新 Nomad systemd 服务文件
copy:
dest: "/etc/systemd/system/nomad.service"
owner: root
group: root
mode: '0644'
content: |
[Unit]
Description=Nomad
Documentation=https://www.nomadproject.io/
Requires=network-online.target
After=network-online.target
[Service]
Type=notify
User=root
Group=root
ExecStart={{ nomad_binary_path.stdout }} agent -config=/etc/nomad.d/nomad.hcl
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
notify: restart nomad
- name: 确保 Nomad 数据目录存在
file:
path: "/opt/nomad/data"
state: directory
owner: root
group: root
mode: '0755'
- name: 重新加载 systemd daemon
systemd:
daemon_reload: yes
- name: 启用并启动 Nomad 服务
systemd:
name: nomad
enabled: yes
state: started
- name: 等待 Nomad 服务启动
wait_for:
port: 4646
host: "{{ ansible_host }}"
delay: 5
timeout: 30
ignore_errors: yes
- name: 显示 Nomad 服务状态
command: systemctl status nomad
register: nomad_status
changed_when: false
- name: 显示 Nomad 服务状态信息
debug:
var: nomad_status.stdout_lines
handlers:
- name: restart nomad
systemd:
name: nomad
state: restarted

View File

@ -1,64 +0,0 @@
---
- name: 部署Nomad配置到所有节点
hosts: nomad_cluster
become: yes
tasks:
- name: 检查节点类型
set_fact:
node_type: "{{ 'server' if inventory_hostname in groups['nomad_servers'] else 'client' }}"
- name: 部署Nomad服务器配置文件
template:
src: nomad-server.hcl.j2
dest: /etc/nomad.d/nomad.hcl
backup: yes
owner: root
group: root
mode: '0644'
when: node_type == 'server'
- name: 部署Nomad客户端配置文件
get_url:
url: "https://gitea.tailnet-68f9.ts.net/ben/mgmt/raw/branch/main/nomad-configs/nodes/{{ inventory_hostname }}.hcl"
dest: /etc/nomad.d/nomad.hcl
backup: yes
owner: root
group: root
mode: '0644'
when: node_type == 'client'
- name: 重启Nomad服务
systemd:
name: nomad
state: restarted
enabled: yes
- name: 等待Nomad服务启动
wait_for:
port: 4646
host: "{{ ansible_host }}"
timeout: 30
when: node_type == 'server'
- name: 等待Nomad客户端服务启动
wait_for:
port: 4646
host: "{{ ansible_host }}"
timeout: 30
when: node_type == 'client'
- name: 显示Nomad服务状态
systemd:
name: nomad
register: nomad_status
- name: 显示服务状态
debug:
msg: "{{ inventory_hostname }} ({{ node_type }}) Nomad服务状态: {{ nomad_status.status.ActiveState }}"

View File

@ -1,168 +0,0 @@
---
- name: 磁盘空间分析 - 使用 ncdu 工具
hosts: all
become: yes
vars:
ncdu_scan_paths:
- "/"
- "/var"
- "/opt"
- "/home"
output_dir: "/tmp/disk-analysis"
tasks:
- name: 安装 ncdu 工具
package:
name: ncdu
state: present
register: ncdu_install
- name: 创建输出目录
file:
path: "{{ output_dir }}"
state: directory
mode: '0755'
- name: 检查磁盘空间使用情况
shell: df -h
register: disk_usage
- name: 显示当前磁盘使用情况
debug:
msg: |
=== {{ inventory_hostname }} 磁盘使用情况 ===
{{ disk_usage.stdout }}
- name: 使用 ncdu 扫描根目录并生成报告
shell: |
ncdu -x -o {{ output_dir }}/ncdu-root-{{ inventory_hostname }}.json /
async: 300
poll: 0
register: ncdu_root_scan
- name: 使用 ncdu 扫描 /var 目录
shell: |
ncdu -x -o {{ output_dir }}/ncdu-var-{{ inventory_hostname }}.json /var
async: 180
poll: 0
register: ncdu_var_scan
when: ansible_mounts | selectattr('mount', 'equalto', '/var') | list | length > 0 or '/var' in ansible_mounts | map(attribute='mount') | list
- name: 使用 ncdu 扫描 /opt 目录
shell: |
ncdu -x -o {{ output_dir }}/ncdu-opt-{{ inventory_hostname }}.json /opt
async: 120
poll: 0
register: ncdu_opt_scan
when: ansible_mounts | selectattr('mount', 'equalto', '/opt') | list | length > 0 or '/opt' in ansible_mounts | map(attribute='mount') | list
- name: 等待根目录扫描完成
async_status:
jid: "{{ ncdu_root_scan.ansible_job_id }}"
register: ncdu_root_result
until: ncdu_root_result.finished
retries: 60
delay: 5
- name: 等待 /var 目录扫描完成
async_status:
jid: "{{ ncdu_var_scan.ansible_job_id }}"
register: ncdu_var_result
until: ncdu_var_result.finished
retries: 36
delay: 5
when: ncdu_var_scan is defined and ncdu_var_scan.ansible_job_id is defined
- name: 等待 /opt 目录扫描完成
async_status:
jid: "{{ ncdu_opt_scan.ansible_job_id }}"
register: ncdu_opt_result
until: ncdu_opt_result.finished
retries: 24
delay: 5
when: ncdu_opt_scan is defined and ncdu_opt_scan.ansible_job_id is defined
- name: 生成磁盘使用分析报告
shell: |
echo "=== {{ inventory_hostname }} 磁盘分析报告 ===" > {{ output_dir }}/disk-report-{{ inventory_hostname }}.txt
echo "生成时间: $(date)" >> {{ output_dir }}/disk-report-{{ inventory_hostname }}.txt
echo "" >> {{ output_dir }}/disk-report-{{ inventory_hostname }}.txt
echo "=== 磁盘使用情况 ===" >> {{ output_dir }}/disk-report-{{ inventory_hostname }}.txt
df -h >> {{ output_dir }}/disk-report-{{ inventory_hostname }}.txt
echo "" >> {{ output_dir }}/disk-report-{{ inventory_hostname }}.txt
echo "=== 最大的目录 (前10个) ===" >> {{ output_dir }}/disk-report-{{ inventory_hostname }}.txt
du -h --max-depth=2 / 2>/dev/null | sort -hr | head -10 >> {{ output_dir }}/disk-report-{{ inventory_hostname }}.txt
echo "" >> {{ output_dir }}/disk-report-{{ inventory_hostname }}.txt
echo "=== /var 目录最大文件 ===" >> {{ output_dir }}/disk-report-{{ inventory_hostname }}.txt
find /var -type f -size +100M -exec ls -lh {} \; 2>/dev/null | head -10 >> {{ output_dir }}/disk-report-{{ inventory_hostname }}.txt
echo "" >> {{ output_dir }}/disk-report-{{ inventory_hostname }}.txt
echo "=== /tmp 目录使用情况 ===" >> {{ output_dir }}/disk-report-{{ inventory_hostname }}.txt
du -sh /tmp/* 2>/dev/null | sort -hr | head -5 >> {{ output_dir }}/disk-report-{{ inventory_hostname }}.txt
echo "" >> {{ output_dir }}/disk-report-{{ inventory_hostname }}.txt
echo "=== 日志文件大小 ===" >> {{ output_dir }}/disk-report-{{ inventory_hostname }}.txt
find /var/log -name "*.log" -type f -size +50M -exec ls -lh {} \; 2>/dev/null >> {{ output_dir }}/disk-report-{{ inventory_hostname }}.txt
- name: 显示分析报告
shell: cat {{ output_dir }}/disk-report-{{ inventory_hostname }}.txt
register: disk_report
- name: 输出磁盘分析结果
debug:
msg: "{{ disk_report.stdout }}"
- name: 检查是否有磁盘使用率超过 80%
shell: df -h | awk 'NR>1 {gsub(/%/, "", $5); if($5 > 80) print $0}'
register: high_usage_disks
- name: 警告高磁盘使用率
debug:
msg: |
⚠️ 警告: {{ inventory_hostname }} 发现高磁盘使用率!
{{ high_usage_disks.stdout }}
when: high_usage_disks.stdout != ""
- name: 创建清理建议
shell: |
echo "=== {{ inventory_hostname }} 清理建议 ===" > {{ output_dir }}/cleanup-suggestions-{{ inventory_hostname }}.txt
echo "" >> {{ output_dir }}/cleanup-suggestions-{{ inventory_hostname }}.txt
echo "1. 检查日志文件:" >> {{ output_dir }}/cleanup-suggestions-{{ inventory_hostname }}.txt
find /var/log -name "*.log" -type f -size +100M -exec echo " 大日志文件: {}" \; 2>/dev/null >> {{ output_dir }}/cleanup-suggestions-{{ inventory_hostname }}.txt
echo "" >> {{ output_dir }}/cleanup-suggestions-{{ inventory_hostname }}.txt
echo "2. 检查临时文件:" >> {{ output_dir }}/cleanup-suggestions-{{ inventory_hostname }}.txt
find /tmp -type f -size +50M -exec echo " 大临时文件: {}" \; 2>/dev/null >> {{ output_dir }}/cleanup-suggestions-{{ inventory_hostname }}.txt
echo "" >> {{ output_dir }}/cleanup-suggestions-{{ inventory_hostname }}.txt
echo "3. 检查包缓存:" >> {{ output_dir }}/cleanup-suggestions-{{ inventory_hostname }}.txt
if [ -d /var/cache/apt ]; then
echo " APT 缓存大小: $(du -sh /var/cache/apt 2>/dev/null | cut -f1)" >> {{ output_dir }}/cleanup-suggestions-{{ inventory_hostname }}.txt
fi
if [ -d /var/cache/yum ]; then
echo " YUM 缓存大小: $(du -sh /var/cache/yum 2>/dev/null | cut -f1)" >> {{ output_dir }}/cleanup-suggestions-{{ inventory_hostname }}.txt
fi
echo "" >> {{ output_dir }}/cleanup-suggestions-{{ inventory_hostname }}.txt
echo "4. 检查容器相关:" >> {{ output_dir }}/cleanup-suggestions-{{ inventory_hostname }}.txt
if command -v podman >/dev/null 2>&1; then
echo " Podman 镜像: $(podman images --format 'table {{.Repository}} {{.Tag}} {{.Size}}' 2>/dev/null | wc -l) 个" >> {{ output_dir }}/cleanup-suggestions-{{ inventory_hostname }}.txt
echo " Podman 容器: $(podman ps -a --format 'table {{.Names}} {{.Status}}' 2>/dev/null | wc -l) 个" >> {{ output_dir }}/cleanup-suggestions-{{ inventory_hostname }}.txt
fi
- name: 显示清理建议
shell: cat {{ output_dir }}/cleanup-suggestions-{{ inventory_hostname }}.txt
register: cleanup_suggestions
- name: 输出清理建议
debug:
msg: "{{ cleanup_suggestions.stdout }}"
- name: 保存 ncdu 文件位置信息
debug:
msg: |
📁 ncdu 扫描文件已保存到:
- 根目录: {{ output_dir }}/ncdu-root-{{ inventory_hostname }}.json
- /var 目录: {{ output_dir }}/ncdu-var-{{ inventory_hostname }}.json (如果存在)
- /opt 目录: {{ output_dir }}/ncdu-opt-{{ inventory_hostname }}.json (如果存在)
💡 使用方法:
ncdu -f {{ output_dir }}/ncdu-root-{{ inventory_hostname }}.json
📊 完整报告: {{ output_dir }}/disk-report-{{ inventory_hostname }}.txt
🧹 清理建议: {{ output_dir }}/cleanup-suggestions-{{ inventory_hostname }}.txt

View File

@ -1,96 +0,0 @@
---
- name: 磁盘清理工具
hosts: all
become: yes
vars:
cleanup_logs: true
cleanup_cache: true
cleanup_temp: true
cleanup_containers: false # use with caution
tasks:
- name: 检查磁盘使用情况 (清理前)
shell: df -h
register: disk_before
- name: 显示清理前磁盘使用情况
debug:
msg: |
=== {{ inventory_hostname }} 清理前磁盘使用情况 ===
{{ disk_before.stdout }}
- name: 清理系统日志 (保留最近7天)
shell: |
journalctl --vacuum-time=7d
find /var/log -name "*.log" -type f -mtime +7 -exec truncate -s 0 {} \;
find /var/log -name "*.log.*" -type f -mtime +7 -delete
when: cleanup_logs | bool
register: log_cleanup
- name: 清理包管理器缓存
block:
- name: 清理 APT 缓存 (Debian/Ubuntu)
shell: |
apt-get clean
apt-get autoclean
apt-get autoremove -y
when: ansible_os_family == "Debian"
- name: 清理 YUM/DNF 缓存 (RedHat/CentOS)
shell: |
if command -v dnf >/dev/null 2>&1; then
dnf clean all
elif command -v yum >/dev/null 2>&1; then
yum clean all
fi
when: ansible_os_family == "RedHat"
when: cleanup_cache | bool
- name: 清理临时文件
shell: |
find /tmp -type f -atime +7 -delete 2>/dev/null || true
find /var/tmp -type f -atime +7 -delete 2>/dev/null || true
rm -rf /tmp/.* 2>/dev/null || true
when: cleanup_temp | bool
- name: 清理 Podman 资源 (谨慎操作)
block:
- name: 停止所有容器
shell: podman stop --all
ignore_errors: yes
- name: 删除未使用的容器
shell: podman container prune -f
ignore_errors: yes
- name: 删除未使用的镜像
shell: podman image prune -f
ignore_errors: yes
- name: 删除未使用的卷
shell: podman volume prune -f
ignore_errors: yes
when: cleanup_containers | bool
- name: 清理核心转储文件
shell: |
find /var/crash -name "core.*" -type f -delete 2>/dev/null || true
find / -name "core" -type f -size +10M -delete 2>/dev/null || true
ignore_errors: yes
- name: 检查磁盘使用情况 (清理后)
shell: df -h
register: disk_after
- name: 显示清理结果
debug:
msg: |
=== {{ inventory_hostname }} 清理完成 ===
清理前:
{{ disk_before.stdout }}
清理后:
{{ disk_after.stdout }}
🧹 清理操作完成!


@ -1,33 +0,0 @@
---
- name: 分发SSH公钥到Nomad客户端节点
hosts: nomad_clients
become: yes
vars:
ssh_public_key: "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMSUUfma8FKEFvH8Nq65XM2PZ9kitfgv1q727cKV9y5Z houzhongxu@seekkey.tech"
tasks:
- name: 确保 .ssh 目录存在
file:
path: "/home/{{ ansible_user }}/.ssh"
state: directory
owner: "{{ ansible_user }}"
group: "{{ ansible_user }}"
mode: '0700'
- name: 添加SSH公钥到 authorized_keys
lineinfile:
path: "/home/{{ ansible_user }}/.ssh/authorized_keys"
line: "{{ ssh_public_key }}"
create: yes
owner: "{{ ansible_user }}"
group: "{{ ansible_user }}"
mode: '0600'
- name: 验证SSH公钥已添加
command: cat "/home/{{ ansible_user }}/.ssh/authorized_keys"
register: ssh_key_check
changed_when: false
- name: 显示SSH公钥内容
debug:
var: ssh_key_check.stdout_lines


@ -1,32 +0,0 @@
---
- name: 分发SSH公钥到新节点
hosts: browser,influxdb1,hcp1,warden
become: yes
vars:
ssh_public_key: "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMSUUfma8FKEFvH8Nq65XM2PZ9kitfgv1q727cKV9y5Z houzhongxu@seekkey.tech"
tasks:
- name: 确保 .ssh 目录存在
file:
path: "/root/.ssh"
state: directory
mode: '0700'
owner: root
group: root
- name: 添加SSH公钥到 authorized_keys
copy:
content: "{{ ssh_public_key }}"
dest: "/root/.ssh/authorized_keys"
mode: '0600'
owner: root
group: root
- name: 验证SSH公钥已添加
command: cat /root/.ssh/authorized_keys
register: ssh_key_check
changed_when: false
- name: 显示SSH公钥内容
debug:
var: ssh_key_check.stdout_lines


@ -1,76 +0,0 @@
---
- name: Distribute Nomad Podman Driver to all nodes
hosts: nomad_cluster
become: yes
vars:
nomad_user: nomad
nomad_data_dir: /opt/nomad/data
nomad_plugins_dir: "{{ nomad_data_dir }}/plugins"
tasks:
- name: Stop Nomad service
systemd:
name: nomad
state: stopped
- name: Create plugins directory
file:
path: "{{ nomad_plugins_dir }}"
state: directory
owner: "{{ nomad_user }}"
group: "{{ nomad_user }}"
mode: '0755'
- name: Copy Nomad Podman driver from local
copy:
src: /tmp/nomad-driver-podman
dest: "{{ nomad_plugins_dir }}/nomad-driver-podman"
owner: "{{ nomad_user }}"
group: "{{ nomad_user }}"
mode: '0755'
- name: Update Nomad configuration for plugin directory
lineinfile:
path: /etc/nomad.d/nomad.hcl
regexp: '^plugin_dir'
line: 'plugin_dir = "{{ nomad_plugins_dir }}"'
insertafter: 'data_dir = "/opt/nomad/data"'
- name: Ensure Podman is installed
package:
name: podman
state: present
- name: Enable Podman socket
systemd:
name: podman.socket
enabled: yes
state: started
ignore_errors: yes
- name: Start Nomad service
systemd:
name: nomad
state: started
enabled: yes
- name: Wait for Nomad to be ready
wait_for:
port: 4646
host: localhost
delay: 10
timeout: 60
- name: Wait for plugins to load
pause:
seconds: 15
- name: Check driver status
shell: |
/usr/local/bin/nomad node status -self | grep -A 10 "Driver Status" || /usr/bin/nomad node status -self | grep -A 10 "Driver Status"
register: driver_status
failed_when: false
- name: Display driver status
debug:
var: driver_status.stdout_lines


@ -1,12 +0,0 @@
- name: Distribute new podman binary to specified nomad_clients
hosts: nomadlxc,hcp,huawei,ditigalocean
gather_facts: false
tasks:
- name: Copy new podman binary to /usr/local/bin
copy:
src: /root/mgmt/configuration/podman-remote-static-linux_amd64
dest: /usr/local/bin/podman
owner: root
group: root
mode: '0755'
become: yes


@ -1,39 +0,0 @@
---
- name: 紧急修复Nomad bootstrap_expect配置
hosts: nomad_servers
become: yes
tasks:
- name: 修复bootstrap_expect为3
lineinfile:
path: /etc/nomad.d/nomad.hcl
regexp: '^ bootstrap_expect = \d+'
line: ' bootstrap_expect = 3'
backup: yes
- name: 重启Nomad服务
systemd:
name: nomad
state: restarted
enabled: yes
- name: 等待Nomad服务启动
wait_for:
port: 4646
host: "{{ ansible_host }}"
timeout: 30
- name: 检查Nomad服务状态
systemd:
name: nomad
register: nomad_status
- name: 显示Nomad服务状态
debug:
msg: "{{ inventory_hostname }} Nomad服务状态: {{ nomad_status.status.ActiveState }}"


@ -1,103 +0,0 @@
---
- name: Fix ch4 Nomad configuration - convert from server to client
hosts: ch4
become: yes
vars:
ansible_host: 100.117.106.136
tasks:
- name: Backup current Nomad config
copy:
src: /etc/nomad.d/nomad.hcl
dest: /etc/nomad.d/nomad.hcl.backup
remote_src: yes
backup: yes
- name: Update Nomad config to client mode
blockinfile:
path: /etc/nomad.d/nomad.hcl
marker: "# {mark} ANSIBLE MANAGED CLIENT CONFIG"
block: |
server {
enabled = false
}
client {
enabled = true
network_interface = "tailscale0"
servers = [
"semaphore.tailnet-68f9.ts.net:4647",
"ash1d.tailnet-68f9.ts.net:4647",
"ash2e.tailnet-68f9.ts.net:4647",
"ch2.tailnet-68f9.ts.net:4647",
"ch3.tailnet-68f9.ts.net:4647",
"onecloud1.tailnet-68f9.ts.net:4647",
"de.tailnet-68f9.ts.net:4647"
]
meta {
consul = "true"
consul_version = "1.21.5"
consul_server = "true"
}
}
insertbefore: '^server \{'
- name: Update client block
blockinfile:
path: /etc/nomad.d/nomad.hcl
marker: "# {mark} ANSIBLE MANAGED CLIENT BLOCK"
block: |
client {
enabled = true
network_interface = "tailscale0"
servers = [
"semaphore.tailnet-68f9.ts.net:4647",
"ash1d.tailnet-68f9.ts.net:4647",
"ash2e.tailnet-68f9.ts.net:4647",
"ch2.tailnet-68f9.ts.net:4647",
"ch3.tailnet-68f9.ts.net:4647",
"onecloud1.tailnet-68f9.ts.net:4647",
"de.tailnet-68f9.ts.net:4647"
]
meta {
consul = "true"
consul_version = "1.21.5"
consul_server = "true"
}
}
insertbefore: '^client \{'
- name: Restart Nomad service
systemd:
name: nomad
state: restarted
enabled: yes
- name: Wait for Nomad to be ready
wait_for:
port: 4646
host: "{{ ansible_default_ipv4.address }}"
delay: 5
timeout: 30
- name: Verify Nomad client status
shell: |
NOMAD_ADDR=http://localhost:4646 nomad node status | grep -q "ready"
register: nomad_ready
failed_when: nomad_ready.rc != 0
retries: 3
delay: 10
- name: Display completion message
debug:
msg: |
✅ Successfully converted ch4 from Nomad server to client
✅ Nomad service restarted
✅ Configuration updated


@ -1,82 +0,0 @@
---
- name: Fix master node - rename to ch4 and restore SSH port 22
hosts: master
become: yes
vars:
new_hostname: ch4
old_hostname: master
tasks:
- name: Backup current hostname
copy:
content: "{{ old_hostname }}"
dest: /etc/hostname.backup
mode: '0644'
when: ansible_hostname == old_hostname
- name: Update hostname to ch4
hostname:
name: "{{ new_hostname }}"
when: ansible_hostname == old_hostname
- name: Update /etc/hostname file
copy:
content: "{{ new_hostname }}"
dest: /etc/hostname
mode: '0644'
when: ansible_hostname == old_hostname
- name: Update /etc/hosts file
lineinfile:
path: /etc/hosts
regexp: '^127\.0\.1\.1.*{{ old_hostname }}'
line: '127.0.1.1 {{ new_hostname }}'
state: present
when: ansible_hostname == old_hostname
- name: Update Tailscale hostname
shell: |
tailscale set --hostname={{ new_hostname }}
when: ansible_hostname == old_hostname
- name: Backup SSH config
copy:
src: /etc/ssh/sshd_config
dest: /etc/ssh/sshd_config.backup
remote_src: yes
backup: yes
- name: Restore SSH port to 22
lineinfile:
path: /etc/ssh/sshd_config
regexp: '^Port '
line: 'Port 22'
state: present
- name: Restart SSH service
systemd:
name: ssh
state: restarted
enabled: yes
- name: Wait for SSH to be ready on port 22
wait_for:
port: 22
host: "{{ ansible_default_ipv4.address }}"
delay: 5
timeout: 30
- name: Test SSH connection on port 22
ping:
delegate_to: "{{ inventory_hostname }}"
vars:
ansible_port: 22
- name: Display completion message
debug:
msg: |
✅ Successfully renamed {{ old_hostname }} to {{ new_hostname }}
✅ SSH port restored to 22
✅ Tailscale hostname updated
🔄 Please update your inventory file to use the new hostname and port


@ -1,73 +0,0 @@
---
- name: 修正Nomad节点的Consul角色配置
hosts: nomad_nodes
become: yes
vars:
consul_addresses: "master.tailnet-68f9.ts.net:8500,ash3c.tailnet-68f9.ts.net:8500,warden.tailnet-68f9.ts.net:8500"
tasks:
- name: 备份原始Nomad配置
copy:
src: /etc/nomad.d/nomad.hcl
dest: /etc/nomad.d/nomad.hcl.bak_{{ ansible_date_time.iso8601 }}
remote_src: yes
- name: 检查节点角色
shell: grep -A 1 "server {" /etc/nomad.d/nomad.hcl | grep "enabled = true" | wc -l
register: is_server
changed_when: false
- name: 检查节点角色
shell: grep -A 1 "client {" /etc/nomad.d/nomad.hcl | grep "enabled = true" | wc -l
register: is_client
changed_when: false
- name: 修正服务器节点的Consul配置
blockinfile:
path: /etc/nomad.d/nomad.hcl
marker: "# {mark} ANSIBLE MANAGED BLOCK - CONSUL CONFIG"
block: |
consul {
address = "{{ consul_addresses }}"
server_service_name = "nomad"
client_service_name = "nomad-client"
auto_advertise = true
server_auto_join = true
client_auto_join = false
}
when: is_server.stdout == "1"
- name: 修正客户端节点的Consul配置
blockinfile:
path: /etc/nomad.d/nomad.hcl
marker: "# {mark} ANSIBLE MANAGED BLOCK - CONSUL CONFIG"
block: |
consul {
address = "{{ consul_addresses }}"
server_service_name = "nomad"
client_service_name = "nomad-client"
auto_advertise = true
server_auto_join = false
client_auto_join = true
}
when: is_client.stdout == "1"
- name: 重启Nomad服务
systemd:
name: nomad
state: restarted
enabled: yes
daemon_reload: yes
- name: 等待Nomad服务启动
wait_for:
port: 4646
host: "{{ ansible_host }}"
timeout: 30
- name: 显示节点角色和配置
debug:
msg: "节点 {{ inventory_hostname }} 是 {{ '服务器' if is_server.stdout == '1' else '客户端' }} 节点Consul配置已更新"


@ -1,43 +0,0 @@
---
- name: 修复 Nomad 服务器 region 配置
hosts: nomad_servers
become: yes
vars:
nomad_config_dir: /etc/nomad.d
tasks:
- name: 备份当前 Nomad 配置
copy:
src: "{{ nomad_config_dir }}/nomad.hcl"
dest: "{{ nomad_config_dir }}/nomad.hcl.backup.{{ ansible_date_time.epoch }}"
remote_src: yes
ignore_errors: yes
- name: 更新 Nomad 配置文件以添加 region 设置
blockinfile:
path: "{{ nomad_config_dir }}/nomad.hcl"
insertafter: '^datacenter = '
block: |
region = "dc1"
marker: "# {mark} Ansible managed region setting"
notify: restart nomad
- name: 更新节点名称以移除 .global 后缀(如果存在)
replace:
path: "{{ nomad_config_dir }}/nomad.hcl"
regexp: 'name = "(.*)\.global(.*)"'
replace: 'name = "\1\2"'
notify: restart nomad
- name: 确保 retry_join 使用正确的 IP 地址
replace:
path: "{{ nomad_config_dir }}/nomad.hcl"
regexp: 'retry_join = \[(.*)\]'
replace: 'retry_join = ["100.81.26.3", "100.103.147.94", "100.90.159.68", "100.116.158.95", "100.98.209.50", "100.120.225.29"]'
notify: restart nomad
handlers:
- name: restart nomad
systemd:
name: nomad
state: restarted


@ -1,71 +0,0 @@
---
- name: Install and configure Consul clients on all nodes
hosts: all
become: yes
vars:
consul_servers:
- "100.117.106.136" # ch4 (韩国)
- "100.122.197.112" # warden (北京)
- "100.116.80.94" # ash3c (美国)
tasks:
- name: Get Tailscale IP address
shell: ip addr show tailscale0 | grep 'inet ' | awk '{print $2}' | cut -d/ -f1
register: tailscale_ip_result
changed_when: false
- name: Set Tailscale IP fact
set_fact:
tailscale_ip: "{{ tailscale_ip_result.stdout }}"
- name: Install Consul
apt:
name: consul
state: present
update_cache: yes
- name: Create Consul data directory
file:
path: /opt/consul/data
state: directory
owner: consul
group: consul
mode: '0755'
- name: Create Consul log directory
file:
path: /var/log/consul
state: directory
owner: consul
group: consul
mode: '0755'
- name: Create Consul config directory
file:
path: /etc/consul.d
state: directory
owner: consul
group: consul
mode: '0755'
- name: Generate Consul client configuration
template:
src: consul-client.hcl.j2
dest: /etc/consul.d/consul.hcl
owner: consul
group: consul
mode: '0644'
notify: restart consul
- name: Enable and start Consul service
systemd:
name: consul
enabled: yes
state: started
daemon_reload: yes
handlers:
- name: restart consul
systemd:
name: consul
state: restarted


@ -1,87 +0,0 @@
---
- name: Configure Nomad Podman Driver
hosts: target_nodes
become: yes
tasks:
- name: Create backup directory
file:
path: /etc/nomad.d/backup
state: directory
mode: '0755'
- name: Backup current nomad.hcl
copy:
src: /etc/nomad.d/nomad.hcl
dest: "/etc/nomad.d/backup/nomad.hcl.bak.{{ ansible_date_time.iso8601 }}"
remote_src: yes
- name: Create plugin directory
file:
path: /opt/nomad/plugins
state: directory
owner: nomad
group: nomad
mode: '0755'
- name: Create symlink for podman driver
file:
src: /usr/bin/nomad-driver-podman
dest: /opt/nomad/plugins/nomad-driver-podman
state: link
- name: Copy podman driver configuration
copy:
src: ../../files/podman-driver.hcl
dest: /etc/nomad.d/podman-driver.hcl
owner: root
group: root
mode: '0644'
- name: Remove existing plugin_dir configuration
lineinfile:
path: /etc/nomad.d/nomad.hcl
regexp: '^plugin_dir = "/opt/nomad/data/plugins"'
state: absent
- name: Configure Nomad to use Podman driver
blockinfile:
path: /etc/nomad.d/nomad.hcl
marker: "# {mark} ANSIBLE MANAGED BLOCK - PODMAN DRIVER"
block: |
plugin_dir = "/opt/nomad/plugins"
plugin "podman" {
config {
volumes {
enabled = true
}
logging {
type = "journald"
}
gc {
container = true
}
}
}
register: nomad_config_result
- name: Restart nomad service
systemd:
name: nomad
state: restarted
enabled: yes
- name: Wait for nomad to start
wait_for:
port: 4646
delay: 10
timeout: 60
- name: Check nomad status
command: nomad node status
register: nomad_status
changed_when: false
- name: Display nomad status
debug:
var: nomad_status.stdout_lines


@ -1,161 +0,0 @@
---
- name: Install and Configure Nomad Podman Driver on Client Nodes
hosts: nomad_clients
become: yes
vars:
nomad_plugin_dir: "/opt/nomad/plugins"
tasks:
- name: Create backup directory with timestamp
set_fact:
backup_dir: "/root/backup/{{ ansible_date_time.date }}_{{ ansible_date_time.hour }}{{ ansible_date_time.minute }}{{ ansible_date_time.second }}"
- name: Create backup directory
file:
path: "{{ backup_dir }}"
state: directory
mode: '0755'
- name: Backup current Nomad configuration
copy:
src: /etc/nomad.d/nomad.hcl
dest: "{{ backup_dir }}/nomad.hcl.backup"
remote_src: yes
ignore_errors: yes
- name: Backup current apt sources
shell: |
cp -r /etc/apt/sources.list* {{ backup_dir }}/
dpkg --get-selections > {{ backup_dir }}/installed_packages.txt
ignore_errors: yes
- name: Create temporary directory for apt
file:
path: /tmp/apt-temp
state: directory
mode: '1777'
- name: Download HashiCorp GPG key
get_url:
url: https://apt.releases.hashicorp.com/gpg
dest: /tmp/hashicorp.gpg
mode: '0644'
environment:
TMPDIR: /tmp/apt-temp
- name: Install HashiCorp GPG key
shell: |
gpg --dearmor < /tmp/hashicorp.gpg > /usr/share/keyrings/hashicorp-archive-keyring.gpg
environment:
TMPDIR: /tmp/apt-temp
- name: Add HashiCorp repository
lineinfile:
path: /etc/apt/sources.list.d/hashicorp.list
line: "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com {{ ansible_distribution_release }} main"
create: yes
mode: '0644'
- name: Update apt cache
apt:
update_cache: yes
environment:
TMPDIR: /tmp/apt-temp
ignore_errors: yes
- name: Install nomad-driver-podman
apt:
name: nomad-driver-podman
state: present
environment:
TMPDIR: /tmp/apt-temp
- name: Create Nomad plugin directory
file:
path: "{{ nomad_plugin_dir }}"
state: directory
owner: nomad
group: nomad
mode: '0755'
- name: Create symlink for nomad-driver-podman in plugin directory
file:
src: /usr/bin/nomad-driver-podman
dest: "{{ nomad_plugin_dir }}/nomad-driver-podman"
state: link
owner: nomad
group: nomad
- name: Get server IP address
shell: |
ip route get 1.1.1.1 | grep -oP 'src \K\S+'
register: server_ip_result
changed_when: false
- name: Set server IP fact
set_fact:
server_ip: "{{ server_ip_result.stdout }}"
- name: Stop Nomad service
systemd:
name: nomad
state: stopped
- name: Create updated Nomad client configuration
copy:
content: |
datacenter = "{{ nomad_datacenter }}"
data_dir = "/opt/nomad/data"
log_level = "INFO"
bind_addr = "{{ server_ip }}"
server {
enabled = false
}
client {
enabled = true
servers = ["100.117.106.136:4647", "100.116.80.94:4647", "100.97.62.111:4647", "100.116.112.45:4647", "100.84.197.26:4647"]
}
plugin_dir = "{{ nomad_plugin_dir }}"
plugin "nomad-driver-podman" {
config {
volumes {
enabled = true
}
recover_stopped = true
}
}
consul {
address = "127.0.0.1:8500"
}
dest: /etc/nomad.d/nomad.hcl
owner: nomad
group: nomad
mode: '0640'
backup: yes
- name: Validate Nomad configuration
shell: nomad config validate /etc/nomad.d/nomad.hcl
register: nomad_validate
failed_when: nomad_validate.rc != 0
- name: Start Nomad service
systemd:
name: nomad
state: started
enabled: yes
- name: Wait for Nomad to be ready
wait_for:
port: 4646
host: "{{ server_ip }}"
delay: 5
timeout: 60
- name: Display backup location
debug:
msg: "Backup created at: {{ backup_dir }}"


@ -1,68 +0,0 @@
---
- name: 在 master 和 ash3c 节点安装 Consul
hosts: master,ash3c
become: yes
vars:
consul_version: "1.21.5"
consul_arch: "arm64" # 因为这两个节点都是 aarch64
tasks:
- name: 检查节点架构
command: uname -m
register: node_arch
changed_when: false
- name: 显示节点架构
debug:
msg: "节点 {{ inventory_hostname }} 架构: {{ node_arch.stdout }}"
- name: 检查是否已安装 consul
command: which consul
register: consul_check
failed_when: false
changed_when: false
- name: 显示当前 consul 状态
debug:
msg: "Consul 状态: {{ 'already installed' if consul_check.rc == 0 else 'not installed' }}"
- name: 删除错误的 consul 二进制文件(如果存在)
file:
path: /usr/local/bin/consul
state: absent
when: consul_check.rc == 0
- name: 更新 APT 缓存
apt:
update_cache: yes
ignore_errors: yes
- name: 安装 consul 通过 APT
apt:
name: consul={{ consul_version }}-1
state: present
- name: 验证 consul 安装
command: consul version
register: consul_version_check
changed_when: false
- name: 显示安装的 consul 版本
debug:
msg: "安装的 Consul 版本: {{ consul_version_check.stdout_lines[0] }}"
- name: 确保 consul 用户存在
user:
name: consul
system: yes
shell: /bin/false
home: /opt/consul
create_home: no
- name: 创建 consul 数据目录
file:
path: /opt/consul
state: directory
owner: consul
group: consul
mode: '0755'


@ -1,91 +0,0 @@
---
- name: Install NFS CSI Plugin for Nomad
hosts: nomad_nodes
become: yes
vars:
nomad_user: nomad
nomad_plugins_dir: /opt/nomad/plugins
csi_driver_version: "v4.0.0"
csi_driver_url: "https://github.com/kubernetes-csi/csi-driver-nfs/releases/download/{{ csi_driver_version }}/csi-nfs-driver"
tasks:
- name: Stop Nomad service
systemd:
name: nomad
state: stopped
- name: Create plugins directory
file:
path: "{{ nomad_plugins_dir }}"
state: directory
owner: "{{ nomad_user }}"
group: "{{ nomad_user }}"
mode: '0755'
- name: Download NFS CSI driver
get_url:
url: "{{ csi_driver_url }}"
dest: "{{ nomad_plugins_dir }}/csi-nfs-driver"
owner: "{{ nomad_user }}"
group: "{{ nomad_user }}"
mode: '0755'
- name: Install required packages for CSI
package:
name:
- nfs-common
- mount
state: present
- name: Create CSI mount directory
file:
path: /opt/nomad/csi
state: directory
owner: "{{ nomad_user }}"
group: "{{ nomad_user }}"
mode: '0755'
- name: Update Nomad configuration for CSI plugin
blockinfile:
path: /etc/nomad.d/nomad.hcl
marker: "# {mark} CSI PLUGIN CONFIGURATION"
block: |
plugin_dir = "{{ nomad_plugins_dir }}"
plugin "csi-nfs" {
type = "csi"
config {
driver_name = "nfs.csi.k8s.io"
mount_dir = "/opt/nomad/csi"
health_timeout = "30s"
log_level = "INFO"
}
}
insertafter: 'data_dir = "/opt/nomad/data"'
- name: Start Nomad service
systemd:
name: nomad
state: started
enabled: yes
- name: Wait for Nomad to start
wait_for:
port: 4646
delay: 10
timeout: 60
- name: Check Nomad status
command: nomad node status
register: nomad_status
ignore_errors: yes
- name: Display Nomad status
debug:
var: nomad_status.stdout_lines


@ -1,131 +0,0 @@
---
- name: Install Nomad by direct download from HashiCorp
hosts: all
become: yes
vars:
nomad_user: "nomad"
nomad_group: "nomad"
nomad_home: "/opt/nomad"
nomad_data_dir: "/opt/nomad/data"
nomad_config_dir: "/etc/nomad.d"
nomad_datacenter: "dc1"
nomad_region: "global"
# nomad_version / nomad_url are referenced by the download tasks below but were not
# defined; the values here are assumed to match the 1.10.5 release used elsewhere in this repo.
nomad_version: "1.10.5"
nomad_url: "https://releases.hashicorp.com/nomad/{{ nomad_version }}/nomad_{{ nomad_version }}_linux_amd64.zip"
nomad_server_addresses:
- "100.116.158.95:4647" # semaphore server address
tasks:
- name: Create nomad user
user:
name: "{{ nomad_user }}"
group: "{{ nomad_group }}"
system: yes
shell: /bin/false
home: "{{ nomad_home }}"
create_home: yes
- name: Create nomad directories
file:
path: "{{ item }}"
state: directory
owner: "{{ nomad_user }}"
group: "{{ nomad_group }}"
mode: '0755'
loop:
- "{{ nomad_home }}"
- "{{ nomad_data_dir }}"
- "{{ nomad_config_dir }}"
- /var/log/nomad
- name: Install unzip package
apt:
name: unzip
state: present
update_cache: yes
- name: Download Nomad binary
get_url:
url: "{{ nomad_url }}"
dest: "/tmp/nomad_{{ nomad_version }}_linux_amd64.zip"
mode: '0644'
timeout: 300
- name: Extract Nomad binary
unarchive:
src: "/tmp/nomad_{{ nomad_version }}_linux_amd64.zip"
dest: /tmp
remote_src: yes
- name: Copy Nomad binary to /usr/local/bin
copy:
src: /tmp/nomad
dest: /usr/local/bin/nomad
mode: '0755'
owner: root
group: root
remote_src: yes
- name: Create Nomad client configuration
template:
src: templates/nomad-client.hcl.j2
dest: "{{ nomad_config_dir }}/nomad.hcl"
owner: "{{ nomad_user }}"
group: "{{ nomad_group }}"
mode: '0640'
- name: Create Nomad systemd service
copy:
content: |
[Unit]
Description=Nomad
Documentation=https://www.nomadproject.io/
Requires=network-online.target
After=network-online.target
ConditionFileNotEmpty={{ nomad_config_dir }}/nomad.hcl
[Service]
Type=notify
User={{ nomad_user }}
Group={{ nomad_group }}
ExecStart=/usr/local/bin/nomad agent -config={{ nomad_config_dir }}
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
dest: /etc/systemd/system/nomad.service
mode: '0644'
- name: Reload systemd daemon
systemd:
daemon_reload: yes
- name: Enable and start Nomad service
systemd:
name: nomad
enabled: yes
state: started
- name: Wait for Nomad to be ready
wait_for:
port: 4646
host: localhost
delay: 5
timeout: 60
- name: Verify Nomad installation
command: /usr/local/bin/nomad version
register: nomad_version_output
- name: Display Nomad version
debug:
msg: "{{ nomad_version_output.stdout }}"
- name: Clean up downloaded files
file:
path: "{{ item }}"
state: absent
loop:
- "/tmp/nomad_{{ nomad_version }}_linux_amd64.zip"
- /tmp/nomad


@ -1,131 +0,0 @@
---
- name: Install Nomad Podman Driver Plugin
hosts: target_nodes
become: yes
vars:
nomad_user: nomad
nomad_data_dir: /opt/nomad/data
nomad_plugins_dir: "{{ nomad_data_dir }}/plugins"
podman_driver_version: "0.6.1"
podman_driver_url: "https://releases.hashicorp.com/nomad-driver-podman/{{ podman_driver_version }}/nomad-driver-podman_{{ podman_driver_version }}_linux_amd64.zip"
tasks:
- name: Stop Nomad service
systemd:
name: nomad
state: stopped
- name: Create plugins directory
file:
path: "{{ nomad_plugins_dir }}"
state: directory
owner: "{{ nomad_user }}"
group: "{{ nomad_user }}"
mode: '0755'
- name: Download Nomad Podman driver
get_url:
url: "{{ podman_driver_url }}"
dest: "/tmp/nomad-driver-podman_{{ podman_driver_version }}_linux_amd64.zip"
mode: '0644'
- name: Extract Nomad Podman driver
unarchive:
src: "/tmp/nomad-driver-podman_{{ podman_driver_version }}_linux_amd64.zip"
dest: "/tmp"
remote_src: yes
- name: Install Nomad Podman driver
copy:
src: "/tmp/nomad-driver-podman"
dest: "{{ nomad_plugins_dir }}/nomad-driver-podman"
owner: "{{ nomad_user }}"
group: "{{ nomad_user }}"
mode: '0755'
remote_src: yes
- name: Update Nomad configuration for plugin directory
blockinfile:
path: /etc/nomad.d/nomad.hcl
marker: "# {mark} PLUGIN DIRECTORY CONFIGURATION"
block: |
plugin_dir = "{{ nomad_plugins_dir }}"
insertafter: 'data_dir = "/opt/nomad/data"'
- name: Fix Podman socket permissions
file:
path: /run/user/1001/podman/podman.sock
mode: '0666'
ignore_errors: yes
- name: Ensure nomad user can access Podman socket
user:
name: "{{ nomad_user }}"
groups: ben
append: yes
- name: Start Nomad service
systemd:
name: nomad
state: started
enabled: yes
- name: Wait for Nomad to be ready
wait_for:
port: 4646
host: localhost
delay: 10
timeout: 60
- name: Verify Nomad is running
systemd:
name: nomad
register: nomad_service_status
- name: Display Nomad service status
debug:
msg: "Nomad service is {{ nomad_service_status.status.ActiveState }}"
- name: Wait for plugins to load
pause:
seconds: 15
- name: Check available drivers
shell: |
sudo -u {{ nomad_user }} /usr/local/bin/nomad node status -self | grep -A 20 "Driver Status"
register: driver_status
failed_when: false
- name: Display driver status
debug:
var: driver_status.stdout_lines
- name: Test Podman driver functionality
shell: |
sudo -u {{ nomad_user }} /usr/local/bin/nomad node status -json | jq -r '.Drivers | keys[]'
register: available_drivers
failed_when: false
- name: Display available drivers
debug:
msg: "Available drivers: {{ available_drivers.stdout_lines | join(', ') }}"
- name: Clean up downloaded files
file:
path: "{{ item }}"
state: absent
loop:
- "/tmp/nomad-driver-podman_{{ podman_driver_version }}_linux_amd64.zip"
- "/tmp/nomad-driver-podman"
- name: Final verification - Check if Podman driver is loaded
shell: |
sudo -u {{ nomad_user }} /usr/local/bin/nomad node status -json | jq -r '.Drivers.podman.Detected'
register: podman_driver_detected
failed_when: false
- name: Display final result
debug:
msg: |
Podman driver installation: {{ 'SUCCESS' if podman_driver_detected.stdout == 'true' else 'NEEDS VERIFICATION' }}
Driver detected: {{ podman_driver_detected.stdout | default('unknown') }}


@ -1,61 +0,0 @@
---
- name: Install Podman Compose on all Nomad cluster nodes
hosts: nomad_cluster
become: yes
tasks:
- name: Display target node
debug:
msg: "正在安装 Podman Compose 到节点: {{ inventory_hostname }}"
- name: Update package cache
apt:
update_cache: yes
ignore_errors: yes
- name: Install Podman and related tools
apt:
name:
- podman
- podman-compose
- buildah
- skopeo
state: present
ignore_errors: yes
- name: Install additional dependencies
apt:
name:
- python3-pip
- python3-setuptools
state: present
ignore_errors: yes
- name: Install podman-compose via pip if package manager failed
pip:
name: podman-compose
state: present
ignore_errors: yes
- name: Verify Podman installation
shell: podman --version
register: podman_version
- name: Verify Podman Compose installation
shell: podman-compose --version
register: podman_compose_version
ignore_errors: yes
- name: Display installation results
debug:
msg: |
✅ 节点 {{ inventory_hostname }} 安装结果:
📦 Podman: {{ podman_version.stdout }}
🐳 Podman Compose: {{ podman_compose_version.stdout if podman_compose_version.rc == 0 else '安装失败或不可用' }}
- name: Ensure Podman socket is enabled
systemd:
name: podman.socket
enabled: yes
state: started
ignore_errors: yes


@ -1,115 +0,0 @@
---
- name: 在Kali Linux上安装和配置VNC服务器
hosts: kali
become: yes
vars:
vnc_password: "3131" # VNC连接密码
vnc_port: "5901" # VNC服务端口
vnc_geometry: "1280x1024" # VNC分辨率
vnc_depth: "24" # 颜色深度
tasks:
- name: 更新APT缓存
apt:
update_cache: yes
- name: 安装VNC服务器和客户端
apt:
name:
- tigervnc-standalone-server
- tigervnc-viewer
- xfce4
- xfce4-goodies
state: present
- name: 创建VNC配置目录
file:
path: /home/ben/.vnc
state: directory
owner: ben
group: ben
mode: '0700'
- name: 设置VNC密码
shell: |
echo "{{ vnc_password }}" | vncpasswd -f > /home/ben/.vnc/passwd
echo "{{ vnc_password }}" | vncpasswd -f > /home/ben/.vnc/passwd2
become_user: ben
- name: 设置VNC密码文件权限
file:
path: /home/ben/.vnc/passwd
owner: ben
group: ben
mode: '0600'
- name: 设置VNC密码文件2权限
file:
path: /home/ben/.vnc/passwd2
owner: ben
group: ben
mode: '0600'
- name: 创建VNC启动脚本
copy:
dest: /home/ben/.vnc/xstartup
content: |
#!/bin/bash
unset SESSION_MANAGER
unset DBUS_SESSION_BUS_ADDRESS
exec startxfce4
owner: ben
group: ben
mode: '0755'
- name: 创建VNC服务文件
copy:
dest: /etc/systemd/system/vncserver@.service
content: |
[Unit]
Description=Start TigerVNC server at startup
After=syslog.target network.target
[Service]
Type=forking
User=ben
Group=ben
WorkingDirectory=/home/ben
PIDFile=/home/ben/.vnc/%H:%i.pid
ExecStartPre=-/bin/sh -c '/usr/bin/vncserver -kill :%i > /dev/null 2>&1'
ExecStart=/usr/bin/vncserver -depth {{ vnc_depth }} -geometry {{ vnc_geometry }} :%i
ExecStop=/usr/bin/vncserver -kill :%i
[Install]
WantedBy=multi-user.target
- name: 重新加载systemd配置
systemd:
daemon_reload: yes
- name: 启用并启动VNC服务
systemd:
name: vncserver@1.service
enabled: yes
state: started
- name: 检查VNC服务状态
command: systemctl status vncserver@1.service
register: vnc_status
ignore_errors: yes
- name: 显示VNC服务状态
debug:
msg: "{{ vnc_status.stdout_lines }}"
- name: 显示VNC连接信息
debug:
msg: |
VNC服务器已成功配置
连接信息:
- 地址: {{ ansible_host }}
- 端口: {{ vnc_port }}
- 密码: {{ vnc_password }}
- 连接命令: vnc://{{ ansible_host }}:{{ vnc_port }}
- 使用macOS屏幕共享应用连接到上述地址


@ -1,36 +0,0 @@
---
# install_vault.yml
- name: Install HashiCorp Vault
hosts: vault_servers
become: yes
tasks:
- name: Check if Vault is already installed
command: which vault
register: vault_check
ignore_errors: yes
changed_when: false
- name: Install Vault using apt
apt:
name: vault
state: present
update_cache: yes
when: vault_check.rc != 0
- name: Create Vault data directory
file:
path: "{{ vault_data_dir | default('/opt/nomad/data/vault/config') }}"
state: directory
owner: root
group: root
mode: '0755'
recurse: yes
- name: Verify Vault installation
command: vault --version
register: vault_version
changed_when: false
- name: Display Vault version
debug:
var: vault_version.stdout


@ -1,42 +0,0 @@
---
- name: 配置Nomad节点NFS挂载
hosts: nomad_nodes
become: yes
vars:
nfs_server: "snail"
nfs_share: "/fs/1000/nfs/Fnsync"
mount_point: "/mnt/fnsync"
tasks:
- name: 安装NFS客户端
package:
name: nfs-common
state: present
- name: 创建挂载目录
file:
path: "{{ mount_point }}"
state: directory
mode: '0755'
- name: 临时挂载NFS共享
mount:
path: "{{ mount_point }}"
src: "{{ nfs_server }}:{{ nfs_share }}"
fstype: nfs4
opts: "rw,relatime,vers=4.2"
state: mounted
- name: 配置开机自动挂载
lineinfile:
path: /etc/fstab
line: "{{ nfs_server }}:{{ nfs_share }} {{ mount_point }} nfs4 rw,relatime,vers=4.2 0 0"
state: present
- name: 验证挂载
command: df -h {{ mount_point }}
register: mount_check
- name: 显示挂载信息
debug:
var: mount_check.stdout_lines


@ -1,86 +0,0 @@
---
- name: Restore /etc/hosts on the client nodes
  hosts: nomad_clients
  become: yes
  vars:
    hosts_entries_to_remove:
      - 100.116.158.95
      - 100.81.26.3
      - 100.103.147.94
      - 100.90.159.68
      - 100.86.141.112
      - 100.98.209.50
      - 100.120.225.29
      - 100.117.106.136
      - 100.116.80.94
      - 100.116.112.45
      - 100.97.62.111
      - 100.122.197.112
  tasks:
    - name: Remove the previously added hostname resolution entries
      lineinfile:
        path: /etc/hosts
        regexp: "^{{ item | regex_escape() }}\\s"
        state: absent
      loop: "{{ hosts_entries_to_remove }}"
    - name: Read the restored /etc/hosts content
      command: cat /etc/hosts
      register: hosts_content
      changed_when: false
    - name: Display /etc/hosts content
      debug:
        var: hosts_content.stdout_lines


@ -1,81 +0,0 @@
---
- name: Setup complete SSH key authentication for browser host
hosts: browser
become: yes
vars:
target_user: ben
ssh_key_comment: "ansible-generated-key-for-{{ inventory_hostname }}"
tasks:
- name: Copy existing Ed25519 SSH public key to target user
copy:
src: /root/.ssh/id_ed25519.pub
dest: /home/{{ target_user }}/.ssh/id_ed25519.pub
owner: "{{ target_user }}"
group: "{{ target_user }}"
mode: '0644'
- name: Copy existing Ed25519 SSH private key to target user
copy:
src: /root/.ssh/id_ed25519
dest: /home/{{ target_user }}/.ssh/id_ed25519
owner: "{{ target_user }}"
group: "{{ target_user }}"
mode: '0600'
- name: Get SSH public key content
command: cat /home/{{ target_user }}/.ssh/id_ed25519.pub
register: ssh_public_key
become_user: "{{ target_user }}"
changed_when: false
- name: Ensure .ssh directory exists for user
file:
path: /home/{{ target_user }}/.ssh
state: directory
owner: "{{ target_user }}"
group: "{{ target_user }}"
mode: '0700'
- name: Add public key to authorized_keys
authorized_key:
user: "{{ target_user }}"
state: present
key: "{{ ssh_public_key.stdout }}"
become_user: "{{ target_user }}"
- name: Configure SSH to prefer key authentication
lineinfile:
path: /etc/ssh/sshd_config
regexp: '^PasswordAuthentication'
line: 'PasswordAuthentication yes'
backup: yes
notify: restart sshd
when: ansible_connection != 'local'
- name: Configure SSH to allow key authentication
lineinfile:
path: /etc/ssh/sshd_config
regexp: '^PubkeyAuthentication'
line: 'PubkeyAuthentication yes'
backup: yes
notify: restart sshd
when: ansible_connection != 'local'
- name: Configure SSH authorized keys file permissions
file:
path: /home/{{ target_user }}/.ssh/authorized_keys
owner: "{{ target_user }}"
group: "{{ target_user }}"
mode: '0600'
- name: Display success message
debug:
msg: "SSH key authentication has been configured for user {{ target_user }} on {{ inventory_hostname }}"
handlers:
- name: restart sshd
systemd:
name: sshd
state: restarted
when: ansible_connection != 'local'


@ -1,62 +0,0 @@
---
- name: Setup SSH key authentication for browser host
hosts: browser
become: yes
vars:
target_user: ben
ssh_key_comment: "ansible-generated-key"
tasks:
- name: Generate SSH key pair if it doesn't exist
user:
name: "{{ target_user }}"
generate_ssh_key: yes
ssh_key_bits: 4096
ssh_key_comment: "{{ ssh_key_comment }}"
become_user: "{{ target_user }}"
- name: Get SSH public key content
command: cat /home/{{ target_user }}/.ssh/id_rsa.pub
register: ssh_public_key
become_user: "{{ target_user }}"
changed_when: false
- name: Display SSH public key for manual configuration
debug:
msg: |
SSH Public Key for {{ inventory_hostname }}:
{{ ssh_public_key.stdout }}
To complete key-based authentication setup:
1. Copy the above public key to the target system's authorized_keys
2. Or use ssh-copy-id command from this system:
ssh-copy-id -i /home/{{ target_user }}/.ssh/id_rsa.pub {{ target_user }}@{{ inventory_hostname }}
- name: Ensure .ssh directory exists for user
file:
path: /home/{{ target_user }}/.ssh
state: directory
owner: "{{ target_user }}"
group: "{{ target_user }}"
mode: '0700'
- name: Configure SSH to prefer key authentication
lineinfile:
path: /etc/ssh/sshd_config
regexp: '^PasswordAuthentication'
line: 'PasswordAuthentication yes'
backup: yes
notify: restart sshd
- name: Configure SSH to allow key authentication
lineinfile:
path: /etc/ssh/sshd_config
regexp: '^PubkeyAuthentication'
line: 'PubkeyAuthentication yes'
backup: yes
notify: restart sshd
handlers:
- name: restart sshd
systemd:
name: sshd
state: restarted


@ -1,43 +0,0 @@
---
- name: 设置Nomad节点NFS挂载
hosts: nomad_nodes
become: yes
vars:
nfs_server: "snail"
nfs_share: "/fs/1000/nfs/Fnsync"
mount_point: "/mnt/fnsync"
tasks:
- name: 安装NFS客户端
package:
name: nfs-common
state: present
- name: 创建挂载目录
file:
path: "{{ mount_point }}"
state: directory
mode: '0755'
- name: 临时挂载NFS共享
mount:
path: "{{ mount_point }}"
src: "{{ nfs_server }}:{{ nfs_share }}"
fstype: nfs4
opts: "rw,relatime,vers=4.2"
state: mounted
- name: 配置开机自动挂载
lineinfile:
path: /etc/fstab
line: "{{ nfs_server }}:{{ nfs_share }} {{ mount_point }} nfs4 rw,relatime,vers=4.2 0 0"
state: present
- name: 验证挂载
command: df -h {{ mount_point }}
register: mount_check
- name: 显示挂载信息
debug:
var: mount_check.stdout_lines


@ -1,187 +0,0 @@
---
- name: 部署 Telegraf 硬盘监控到 Nomad 集群
hosts: all
become: yes
vars:
# Connects to the existing InfluxDB 2.x + Grafana monitoring stack.
# Defaults are set directly; self-referential "{{ var | default(...) }}" play vars
# cause a recursive template loop in Ansible. Override with -e or inventory/group vars.
influxdb_url: "http://influxdb1.tailnet-68f9.ts.net:8086"
# influxdb_token has no default and must be supplied externally (e.g. -e influxdb_token=...)
influxdb_org: "nomad"
influxdb_bucket: "nomad_monitoring"
# Remote Telegraf configuration mode (takes precedence when telegraf_config_url is set)
use_remote_config: true
telegraf_config_url: ""
# Disk monitoring thresholds
disk_usage_warning: 80 # warn at 80% usage
disk_usage_critical: 90 # critical alert at 90% usage
# Collection interval (seconds)
collection_interval: 30
tasks:
- name: 显示正在处理的节点
debug:
msg: "🔧 正在为节点 {{ inventory_hostname }} 安装硬盘监控"
- name: 添加 InfluxData 仓库密钥
apt_key:
url: https://repos.influxdata.com/influxdata-archive_compat.key
state: present
retries: 3
delay: 5
- name: 添加 InfluxData 仓库
apt_repository:
repo: "deb https://repos.influxdata.com/ubuntu {{ ansible_distribution_release }} stable"
state: present
update_cache: yes
retries: 3
delay: 5
- name: 安装 Telegraf
apt:
name: telegraf
state: present
update_cache: yes
retries: 3
delay: 10
- name: 创建 Telegraf 配置目录
file:
path: /etc/telegraf/telegraf.d
state: directory
owner: telegraf
group: telegraf
mode: '0755'
- name: 清理旧的 Telegraf 日志文件(节省硬盘空间)
file:
path: "{{ item }}"
state: absent
loop:
- /var/log/telegraf
- /var/log/telegraf.log
ignore_errors: yes
- name: 禁用 Telegraf 日志目录创建
file:
path: /var/log/telegraf
state: absent
ignore_errors: yes
- name: 创建 Telegraf 环境变量文件
template:
src: telegraf-env.j2
dest: /etc/default/telegraf
owner: root
group: root
mode: '0600'
backup: yes
notify: restart telegraf
- name: 创建 Telegraf systemd 服务文件(支持远程配置)
template:
src: telegraf.service.j2
dest: /etc/systemd/system/telegraf.service
owner: root
group: root
mode: '0644'
backup: yes
notify:
- reload systemd
- restart telegraf
when: telegraf_config_url is defined and telegraf_config_url != ''
- name: 生成 Telegraf 主配置文件(本地配置模式)
template:
src: telegraf.conf.j2
dest: /etc/telegraf/telegraf.conf
owner: telegraf
group: telegraf
mode: '0644'
backup: yes
notify: restart telegraf
when: telegraf_config_url is not defined or telegraf_config_url == ''
- name: 生成硬盘监控配置
template:
src: disk-monitoring.conf.j2
dest: /etc/telegraf/telegraf.d/disk-monitoring.conf
owner: telegraf
group: telegraf
mode: '0644'
backup: yes
notify: restart telegraf
- name: 生成系统监控配置
template:
src: system-monitoring.conf.j2
dest: /etc/telegraf/telegraf.d/system-monitoring.conf
owner: telegraf
group: telegraf
mode: '0644'
backup: yes
notify: restart telegraf
- name: 启用并启动 Telegraf 服务
systemd:
name: telegraf
state: started
enabled: yes
daemon_reload: yes
- name: 验证 Telegraf 状态
systemd:
name: telegraf
register: telegraf_status
- name: 检查 InfluxDB 连接
uri:
url: "{{ influxdb_url }}/ping"
method: GET
timeout: 5
register: influxdb_ping
ignore_errors: yes
delegate_to: localhost
run_once: true
- name: 显示 InfluxDB 连接状态
debug:
msg: "{{ '✅ InfluxDB 连接正常' if influxdb_ping.status == 204 else '❌ InfluxDB 连接失败,请检查配置' }}"
run_once: true
- name: 显示 Telegraf 状态
debug:
msg: "✅ Telegraf 状态: {{ telegraf_status.status.ActiveState }}"
- name: 检查硬盘使用情况
shell: |
df -h | grep -vE '^Filesystem|tmpfs|cdrom|udev' | awk '{print $5 " " $1 " " $6}' | while read output;
do
usage=$(echo $output | awk '{print $1}' | sed 's/%//g')
partition=$(echo $output | awk '{print $2}')
mount=$(echo $output | awk '{print $3}')
if [ $usage -ge {{ disk_usage_warning }} ]; then
echo "⚠️ 警告: $mount ($partition) 使用率 $usage%"
else
echo "✅ $mount ($partition) 使用率 $usage%"
fi
done
register: disk_check
changed_when: false
- name: 显示硬盘检查结果
debug:
msg: "{{ disk_check.stdout_lines }}"
handlers:
- name: reload systemd
systemd:
daemon_reload: yes
- name: restart telegraf
systemd:
name: telegraf
state: restarted


@ -1,76 +0,0 @@
---
- name: 安装并配置新的 Nomad Server 节点
hosts: influxdb1
become: yes
gather_facts: no
tasks:
- name: 更新包缓存
apt:
update_cache: yes
cache_valid_time: 3600
retries: 3
delay: 10
- name: 安装依赖包
apt:
name:
- wget
- curl
- unzip
- podman
- buildah
- skopeo
state: present
retries: 3
delay: 10
- name: 检查 Nomad 是否已安装
shell: which nomad || echo "not_found"
register: nomad_check
changed_when: false
- name: 下载并安装 Nomad
block:
- name: 下载 Nomad 1.10.5
get_url:
url: "https://releases.hashicorp.com/nomad/1.10.5/nomad_1.10.5_linux_amd64.zip"
dest: "/tmp/nomad.zip"
mode: '0644'
- name: 解压 Nomad
unarchive:
src: "/tmp/nomad.zip"
dest: "/usr/bin/"
remote_src: yes
owner: root
group: root
mode: '0755'
- name: 清理临时文件
file:
path: "/tmp/nomad.zip"
state: absent
when: nomad_check.stdout == "not_found"
- name: 验证 Nomad 安装
shell: nomad version
register: nomad_version_output
- name: 显示安装结果
debug:
msg: |
✅ 节点 {{ inventory_hostname }} 软件安装完成
📦 Podman: {{ ansible_facts.packages.podman[0].version if ansible_facts.packages.podman is defined else 'checking...' }}
🎯 Nomad: {{ nomad_version_output.stdout.split('\n')[0] }}
- name: 启用 Podman socket
systemd:
name: podman.socket
enabled: yes
state: started
ignore_errors: yes
- name: 继续完整配置
debug:
msg: "软件安装完成,现在将运行完整的 Nomad 配置..."


@ -1,114 +0,0 @@
---
- name: Setup Xfce desktop environment and Chrome Dev for browser automation
hosts: browser
become: yes
vars:
target_user: ben
tasks:
- name: Update package lists
apt:
update_cache: yes
cache_valid_time: 3600
- name: Install Xfce desktop environment
apt:
name:
- xfce4
- xfce4-goodies
- lightdm
- xorg
- dbus-x11
state: present
- name: Install additional useful packages for desktop environment
apt:
name:
- firefox-esr
- geany
- thunar-archive-plugin
- xfce4-terminal
- gvfs
- fonts-noto
- fonts-noto-cjk
state: present
- name: Download Google Chrome Dev .deb package
get_url:
url: https://dl.google.com/linux/direct/google-chrome-unstable_current_amd64.deb
dest: /tmp/google-chrome-unstable_current_amd64.deb
mode: '0644'
- name: Install Google Chrome Dev
apt:
deb: /tmp/google-chrome-unstable_current_amd64.deb
- name: Clean up downloaded .deb package
file:
path: /tmp/google-chrome-unstable_current_amd64.deb
state: absent
- name: Install Chrome automation dependencies
apt:
name:
- python3-pip
- python3-venv
- python3-dev
- build-essential
- libssl-dev
- libffi-dev
state: present
- name: Install Python packages for browser automation
pip:
name:
- selenium
- webdriver-manager
- pyvirtualdisplay
executable: pip3
- name: Set up Xfce as default desktop environment
copy:
dest: /etc/lightdm/lightdm.conf
content: |
[Seat:*]
autologin-user={{ target_user }}
autologin-user-timeout=0
autologin-session=xfce
user-session=xfce
- name: Ensure user is in necessary groups
user:
name: "{{ target_user }}"
groups:
- audio
- video
- input
- netdev
append: yes
- name: Create .xprofile for user
copy:
dest: /home/{{ target_user }}/.xprofile
content: |
# Start Xfce on login
startxfce4
owner: "{{ target_user }}"
group: "{{ target_user }}"
mode: '0644'
- name: Enable and start lightdm service
systemd:
name: lightdm
enabled: yes
state: started
- name: Display success message
debug:
msg: "Xfce desktop environment and Chrome Dev have been configured for user {{ target_user }} on {{ inventory_hostname }}"
handlers:
- name: restart lightdm
systemd:
name: lightdm
state: restarted


@ -1,33 +0,0 @@
---
- name: 启动所有Nomad服务器形成集群
hosts: nomad_servers
become: yes
tasks:
- name: 检查Nomad服务状态
systemd:
name: nomad
register: nomad_status
- name: 启动Nomad服务如果未运行
systemd:
name: nomad
state: started
enabled: yes
when: nomad_status.status.ActiveState != "active"
- name: 等待Nomad服务启动
wait_for:
port: 4646
host: "{{ ansible_host }}"
timeout: 30
- name: 显示Nomad服务状态
debug:
msg: "{{ inventory_hostname }} Nomad服务状态: {{ nomad_status.status.ActiveState }}"


@ -1,106 +0,0 @@
datacenter = "dc1"
data_dir = "/opt/nomad/data"
plugin_dir = "/opt/nomad/plugins"
log_level = "INFO"
name = "{{ ansible_hostname }}"
bind_addr = "0.0.0.0"
addresses {
http = "{{ ansible_host }}"
rpc = "{{ ansible_host }}"
serf = "{{ ansible_host }}"
}
advertise {
http = "{{ ansible_host }}:4646"
rpc = "{{ ansible_host }}:4647"
serf = "{{ ansible_host }}:4648"
}
ports {
http = 4646
rpc = 4647
serf = 4648
}
server {
enabled = true
bootstrap_expect = 3
server_join {
retry_join = [
"semaphore.tailnet-68f9.ts.net:4648",
"ash1d.tailnet-68f9.ts.net:4648",
"ash2e.tailnet-68f9.ts.net:4648",
"ch2.tailnet-68f9.ts.net:4648",
"ch3.tailnet-68f9.ts.net:4648",
"onecloud1.tailnet-68f9.ts.net:4648",
"de.tailnet-68f9.ts.net:4648",
"hcp1.tailnet-68f9.ts.net:4648"
]
}
}
{% if ansible_hostname == 'hcp1' %}
client {
enabled = true
network_interface = "tailscale0"
servers = [
"semaphore.tailnet-68f9.ts.net:4647",
"ash1d.tailnet-68f9.ts.net:4647",
"ash2e.tailnet-68f9.ts.net:4647",
"ch2.tailnet-68f9.ts.net:4647",
"ch3.tailnet-68f9.ts.net:4647",
"onecloud1.tailnet-68f9.ts.net:4647",
"de.tailnet-68f9.ts.net:4647",
"hcp1.tailnet-68f9.ts.net:4647"
]
host_volume "traefik-certs" {
path = "/opt/traefik/certs"
read_only = false
}
host_volume "fnsync" {
path = "/mnt/fnsync"
read_only = false
}
meta {
consul = "true"
consul_version = "1.21.5"
consul_client = "true"
}
gc_interval = "5m"
gc_disk_usage_threshold = 80
gc_inode_usage_threshold = 70
}
plugin "nomad-driver-podman" {
config {
socket_path = "unix:///run/podman/podman.sock"
volumes {
enabled = true
}
}
}
{% endif %}
consul {
address = "ch4.tailnet-68f9.ts.net:8500,ash3c.tailnet-68f9.ts.net:8500,warden.tailnet-68f9.ts.net:8500"
server_service_name = "nomad"
client_service_name = "nomad-client"
auto_advertise = true
server_auto_join = false
client_auto_join = true
}
telemetry {
collection_interval = "1s"
disable_hostname = false
prometheus_metrics = true
publish_allocation_metrics = true
publish_node_metrics = true
}


@ -1,110 +0,0 @@
# Kali Linux Ansible Test Suite
This directory contains a collection of Ansible playbooks for testing Kali Linux systems.
## Test Playbook List
### 1. kali-health-check.yml
**Purpose**: Quick health check for Kali Linux
**Description**: Runs basic system status checks, covering system information, update status, disk space, installation status of key tools, network connectivity, system load, and the SSH service.
**How to run**:
```bash
cd /root/mgmt/configuration
ansible-playbook -i inventories/production/inventory.ini playbooks/test/kali-health-check.yml
```
### 2. kali-security-tools.yml
**Purpose**: Kali Linux security tool tests
**Description**: Tests the installation and basic functionality of common Kali Linux security tools, including:
- Nmap
- Metasploit Framework
- Wireshark
- John the Ripper
- Hydra
- SQLMap
- Aircrack-ng
- Burp Suite
- Netcat
- Curl
**How to run**:
```bash
cd /root/mgmt/configuration
ansible-playbook -i inventories/production/inventory.ini playbooks/test/kali-security-tools.yml
```
### 3. test-kali.yml
**Purpose**: Full system test for Kali Linux
**Description**: Runs a comprehensive set of system tests, including:
- Collection of basic system information
- Network connectivity tests
- Package manager tests
- Kali tool checks
- System security checks
- System performance tests
- Network tool tests
- Generation of a detailed test report
**How to run**:
```bash
cd /root/mgmt/configuration
ansible-playbook -i inventories/production/inventory.ini playbooks/test/test-kali.yml
```
### 4. kali-full-test-suite.yml
**Purpose**: Complete Kali Linux test suite
**Description**: Runs all of the tests above in sequence for full test coverage of the system.
**How to run**:
```bash
cd /root/mgmt/configuration
ansible-playbook playbooks/test/kali-full-test-suite.yml
```
## Test Results
### Health check
- Results are shown directly in the terminal
- No additional files are generated
### Security tool tests
- A summary of the results is shown in the terminal
- A report file `/tmp/kali_security_tools_report.md` is generated on the Kali system
### Full system test
- Test progress is shown in the terminal
- A `/tmp/kali_test_results/` directory is generated on the Kali system, containing:
  - `system_info.txt`: basic system information
  - `tool_check.txt`: Kali tool check results
  - `security_check.txt`: system security checks
  - `performance.txt`: system performance information
  - `network_tools.txt`: network tool tests
  - `kali_test.log`: full test log
  - `README.md`: test report summary
## Prerequisites
1. Make sure the Kali system is correctly configured in the inventory
2. Make sure Ansible can connect to the Kali system
3. Make sure you have sufficient privileges to run the tests on the Kali system
## Notes
1. Some tests may require network connectivity
2. The full system test can take a fairly long time
3. Test result files are kept in temporary directories on the Kali system
4. Clean up the test result files periodically to save disk space (see the sketch below)
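A minimal cleanup sketch, assuming the report locations documented under "Test Results" above; this playbook is not part of the existing suite and is only an illustration:
```yaml
# Hypothetical cleanup helper; the two paths are the report locations listed above.
- name: Clean up Kali test reports
  hosts: kali
  become: yes
  gather_facts: no
  tasks:
    - name: Remove generated test result files
      file:
        path: "{{ item }}"
        state: absent
      loop:
        - /tmp/kali_test_results
        - /tmp/kali_security_tools_report.md
```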
## Troubleshooting
If a test fails, check:
1. Whether network connectivity is working
2. Whether the Ansible inventory is configured correctly
3. Whether the SSH connection works
4. Whether the Kali system itself is running normally
5. Whether you have sufficient privileges to run the tests
A quick reachability check such as the sketch below can help narrow down connection problems.
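A minimal connectivity check, assuming the same `kali` host group used by the playbooks above:
```yaml
# Hypothetical reachability check; uses only the built-in ping module.
- name: Verify Ansible can reach the Kali host
  hosts: kali
  gather_facts: no
  tasks:
    - name: SSH and Python round-trip test
      ping:
```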
## Custom tests
You can modify the test content in these playbooks or add new test tasks as needed. All playbooks use a modular design, which makes them easy to extend and maintain; an added task might look like the sketch below.
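For example, an extra tool check in the same style as the existing ones (the `gobuster` name is only an illustration, not something the suite currently tests):
```yaml
# Hypothetical additional task for kali-health-check.yml, mirroring the existing tool checks.
- name: Check whether gobuster is installed
  command: which gobuster
  register: gobuster_check
  failed_when: false
  changed_when: false
- name: Show gobuster check result
  debug:
    msg: "gobuster: {{ 'installed' if gobuster_check.rc == 0 else 'not installed' }}"
```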


@ -1,50 +0,0 @@
---
- name: Kali Linux 完整测试套件
hosts: localhost
gather_facts: no
tasks:
- name: 显示测试开始信息
debug:
msg: "开始执行 Kali Linux 完整测试套件"
- name: 执行Kali快速健康检查
command: "ansible-playbook -i ../inventories/production/inventory.ini kali-health-check.yml"
args:
chdir: "/root/mgmt/configuration/playbooks/test"
register: health_check_result
- name: 显示健康检查结果
debug:
msg: "健康检查完成,退出码: {{ health_check_result.rc }}"
- name: 执行Kali安全工具测试
command: "ansible-playbook -i ../inventories/production/inventory.ini kali-security-tools.yml"
args:
chdir: "/root/mgmt/configuration/playbooks/test"
register: security_tools_result
- name: 显示安全工具测试结果
debug:
msg: "安全工具测试完成,退出码: {{ security_tools_result.rc }}"
- name: 执行Kali完整系统测试
command: "ansible-playbook -i ../inventories/production/inventory.ini test-kali.yml"
args:
chdir: "/root/mgmt/configuration/playbooks/test"
register: full_test_result
- name: 显示完整测试结果
debug:
msg: "完整系统测试完成,退出码: {{ full_test_result.rc }}"
- name: 显示测试完成信息
debug:
msg: |
Kali Linux 完整测试套件执行完成!
测试结果摘要:
- 健康检查: {{ '成功' if health_check_result.rc == 0 else '失败' }}
- 安全工具测试: {{ '成功' if security_tools_result.rc == 0 else '失败' }}
- 完整系统测试: {{ '成功' if full_test_result.rc == 0 else '失败' }}
详细测试结果请查看各测试生成的报告文件。


@ -1,86 +0,0 @@
---
- name: Kali Linux 快速健康检查
hosts: kali
become: yes
gather_facts: yes
tasks:
- name: 显示系统基本信息
debug:
msg: |
=== Kali Linux 系统信息 ===
主机名: {{ ansible_hostname }}
操作系统: {{ ansible_distribution }} {{ ansible_distribution_version }}
内核版本: {{ ansible_kernel }}
架构: {{ ansible_architecture }}
CPU核心数: {{ ansible_processor_vcpus }}
内存总量: {{ ansible_memtotal_mb }} MB
- name: 修复损坏的依赖关系
command: apt --fix-broken install -y
when: ansible_os_family == "Debian"
ignore_errors: yes
- name: 检查系统更新状态
apt:
update_cache: yes
upgrade: dist
check_mode: yes
register: update_check
changed_when: false
ignore_errors: yes
- name: 显示系统更新状态
debug:
msg: "{% if update_check.changed %}系统有可用更新{% else %}系统已是最新{% endif %}"
- name: 检查磁盘空间
command: "df -h /"
register: disk_space
- name: 显示根分区磁盘空间
debug:
msg: "根分区使用情况: {{ disk_space.stdout_lines[1] }}"
- name: 检查关键Kali工具
command: "which {{ item }}"
loop:
- nmap
- msfconsole # the metasploit-framework package provides the msfconsole binary, not a "metasploit-framework" command
- wireshark
register: tool_check
ignore_errors: yes
changed_when: false
- name: 显示工具检查结果
debug:
msg: "{% for result in tool_check.results %}{{ result.item }}: {% if result.rc == 0 %}已安装{% else %}未安装{% endif %}{% endfor %}"
- name: 检查网络连接
uri:
url: https://httpbin.org/get
method: GET
timeout: 5
register: network_test
ignore_errors: yes
- name: 显示网络连接状态
debug:
msg: "{% if network_test.failed %}网络连接测试失败{% else %}网络连接正常{% endif %}"
- name: 检查系统负载
command: "uptime"
register: uptime
- name: 显示系统负载
debug:
msg: "系统负载: {{ uptime.stdout }}"
- name: 检查SSH服务状态
systemd:
name: ssh
register: ssh_service
- name: 显示SSH服务状态
debug:
msg: "SSH服务状态: {{ ssh_service.status.ActiveState }}"

Some files were not shown because too many files have changed in this diff.