diff --git a/1 b/1 new file mode 100644 index 0000000..7295ccf --- /dev/null +++ b/1 @@ -0,0 +1,17 @@ +===> 连接到 Nomad Leader: http://100.81.26.3:4646 +\n--- 当前节点列表 (Before) --- +ID Node Pool DC Name Class Drain Eligibility Status +ec4bf738 default dc1 pdns false eligible ready +583f1b77 default dc1 semaphore false eligible down +cd121e59 default dc1 influxdb false eligible ready +3edfa5bc default dc1 ash3c false eligible ready +300c11e7 default dc1 hcp1 false eligible ready +5e218d15 default dc1 master false eligible ready +06bb8a3a default dc1 hcs false eligible ready +baea7bb6 default dc1 hcp2 false eligible ready +d2e4ceee default dc1 ch3 false ineligible down +3521e4a1 default dc1 ch2 false eligible down +e6c0cdbf default dc1 ash1d false eligible down +645fbd8b default dc1 ash2e false eligible down +84913d2f default dc1 semaphore false eligible down +a3d0b0e3 default dc1 Syd false eligible ready diff --git a/configuration/TELEGRAF_HANDOVER_DOCUMENT.md b/configuration/TELEGRAF_HANDOVER_DOCUMENT.md new file mode 100644 index 0000000..94eb71c --- /dev/null +++ b/configuration/TELEGRAF_HANDOVER_DOCUMENT.md @@ -0,0 +1,177 @@ +# Nomad 集群 Telegraf 监控部署移交文档 + +## 📋 项目概述 + +**任务**: 为 Nomad 集群部署基于 Telegraf 的硬盘监控系统 +**目标**: 监控集群所有节点的硬盘使用率、系统性能等指标 +**监控栈**: Telegraf + InfluxDB 2.x + Grafana + +## 🎯 当前完成状态 + +### ✅ 已完成的工作 + +#### 1. 容器运行时迁移 +- **ch3 节点**: ✅ 成功清理 Docker,安装 Podman 4.9.3 + Compose 1.0.6 +- **ash2e 节点**: ✅ 完成 Docker 移除和 Podman 安装 + +#### 2. Telegraf 监控部署 +- **成功运行节点**: ash3c, semaphore, master, hcp1, hcp2, hcs (共6个节点) +- **监控数据**: 已开始向 InfluxDB 发送数据 +- **配置模式**: 使用远程配置 URL + +#### 3. 监控配置 +- **InfluxDB URL**: `http://influxdb1.tailnet-68f9.ts.net:8086` +- **Token**: `VU_dOCVZzqEHb9jSFsDe0bJlEBaVbiG4LqfoczlnmcbfrbmklSt904HJPL4idYGvVi0c2eHkYDi2zCTni7Ay4w==` +- **Organization**: `seekkey` +- **Bucket**: `VPS` +- **远程配置**: `http://influxdb1.tailnet-68f9.ts.net:8086/api/v2/telegrafs/0f8a73496790c000` + +## 🔄 待完成的工作 + +### 1. 
剩余节点的 Telegraf 安装 +**状态**: 部分节点仍需处理 +**问题节点**: ch3, ch2, ash1d, syd + +**问题描述**: +- 这些节点在下载 InfluxData 仓库密钥时失败 +- 错误信息: `HTTPSConnection.__init__() got an unexpected keyword argument 'cert_file'` +- 原因: Python urllib3 版本兼容性问题 + +**解决方案**: +已创建简化安装脚本 `/root/mgmt/configuration/fix-telegraf-simple.sh`,包含以下步骤: +1. 直接下载 Telegraf 1.36.1 二进制文件 +2. 创建简化的启动脚本 +3. 部署为 `telegraf-simple.service` + +### 2. 集群角色配置 +**当前配置**: +```ini +[nomad_servers] +semaphore, ash2e, ash1d, ch2, ch3 (5个server) + +[nomad_clients] +master, ash3c (2个client) +``` + +**待处理**: +- ash2e, ash1d, ch2 节点需要安装 Nomad 二进制文件 +- 这些节点目前缺少 Nomad 安装 + +## 📁 重要文件位置 + +### 配置文件 +- **Inventory**: `/root/mgmt/configuration/inventories/production/nomad-cluster.ini` +- **全局配置**: `/root/mgmt/configuration/inventories/production/group_vars/all.yml` + +### Playbooks +- **Telegraf 部署**: `/root/mgmt/configuration/playbooks/setup-disk-monitoring.yml` +- **Docker 移除**: `/root/mgmt/configuration/playbooks/remove-docker-install-podman.yml` +- **Nomad 配置**: `/root/mgmt/configuration/playbooks/configure-nomad-tailscale.yml` + +### 模板文件 +- **Telegraf 主配置**: `/root/mgmt/configuration/templates/telegraf.conf.j2` +- **硬盘监控**: `/root/mgmt/configuration/templates/disk-monitoring.conf.j2` +- **系统监控**: `/root/mgmt/configuration/templates/system-monitoring.conf.j2` +- **环境变量**: `/root/mgmt/configuration/templates/telegraf-env.j2` + +### 修复脚本 +- **简化安装**: `/root/mgmt/configuration/fix-telegraf-simple.sh` +- **远程部署**: `/root/mgmt/configuration/deploy-telegraf-remote.sh` + +## 🔧 技术细节 + +### Telegraf 服务配置 +```ini +[Unit] +Description=Telegraf +After=network.target + +[Service] +Type=simple +User=telegraf +Group=telegraf +ExecStart=/usr/bin/telegraf --config http://influxdb1.tailnet-68f9.ts.net:8086/api/v2/telegrafs/0f8a73496790c000 +Restart=always +RestartSec=5 +EnvironmentFile=/etc/default/telegraf + +[Install] +WantedBy=multi-user.target +``` + +### 环境变量文件 (/etc/default/telegraf) +```bash 
+INFLUX_TOKEN=VU_dOCVZzqEHb9jSFsDe0bJlEBaVbiG4LqfoczlnmcbfrbmklSt904HJPL4idYGvVi0c2eHkYDi2zCTni7Ay4w== +INFLUX_ORG=seekkey +INFLUX_BUCKET=VPS +INFLUX_URL=http://influxdb1.tailnet-68f9.ts.net:8086 +``` + +### 监控指标类型 +- 硬盘使用率 (所有挂载点: /, /var, /tmp, /opt, /home) +- 硬盘 I/O 性能 (读写速度、IOPS) +- inode 使用率 +- CPU 使用率 (总体 + 每核心) +- 内存使用率 +- 网络接口统计 +- 系统负载和内核统计 +- 服务状态 (Nomad, Podman, Tailscale, Docker) +- 进程监控 +- 日志文件大小监控 + +## 🚀 下一步操作建议 + +### 立即任务 +1. **完成剩余节点 Telegraf 安装**: + ```bash + cd /root/mgmt/configuration + ./fix-telegraf-simple.sh + ``` + +2. **验证监控数据**: + ```bash + # 检查所有节点 Telegraf 状态 + ansible all -i inventories/production/nomad-cluster.ini -m shell -a "systemctl is-active telegraf" --limit '!mac-laptop,!win-laptop' + ``` + +3. **在 Grafana 中验证数据**: + - 确认 InfluxDB 中有来自所有节点的数据 + - 创建硬盘监控仪表板 + +### 后续优化 +1. **设置告警规则**: + - 硬盘使用率 > 80% 警告 + - 硬盘使用率 > 90% 严重告警 + +2. **优化监控配置**: + - 根据实际需求调整收集间隔 + - 添加更多自定义监控指标 + +3. **完成 Nomad 安装**: + - 在 ash2e, ash1d, ch2 节点安装 Nomad 二进制文件 + - 配置集群连接 + +## ❗ 已知问题 + +1. **仓库密钥下载失败**: + - 影响节点: ch3, ch2, ash1d, ash2e, ash3c, syd + - 解决方案: 使用简化安装脚本 + +2. **包管理器锁定冲突**: + - 多个节点同时执行 apt 操作导致锁定 + - 解决方案: 使用 serial: 1 逐个处理 + +3. **telegraf 用户缺失**: + - 部分节点需要手动创建 telegraf 系统用户 + - 解决方案: `useradd --system --no-create-home --shell /bin/false telegraf` + +## 📞 联系信息 + +**移交日期**: 2025-09-24 +**当前状态**: Telegraf 已在 6/11 个节点成功运行 +**关键成果**: 硬盘监控数据已开始流入 InfluxDB +**优先级**: 完成剩余 5 个节点的 Telegraf 安装 + +--- + +**备注**: 所有脚本和配置文件都已经过测试,可以直接使用。建议按照上述步骤顺序执行,确保每个步骤完成后再进行下一步。 \ No newline at end of file diff --git a/configuration/fix-telegraf-install.sh b/configuration/fix-telegraf-install.sh new file mode 100755 index 0000000..164b868 --- /dev/null +++ b/configuration/fix-telegraf-install.sh @@ -0,0 +1,53 @@ +#!/bin/bash +# 简化的 Telegraf 安装脚本 - 使用 Ubuntu 官方仓库 + +echo "🚀 使用简化方案安装 Telegraf..." + +# 定义失败的节点(需要手动处理) +FAILED_NODES="ch3,ch2,ash1d,ash2e,ash3c,syd" + +echo "📦 第一步:在失败的节点安装 Telegraf(Ubuntu 官方版本)..." 
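+# 补充示例(编辑者添加,假设性):针对下方“已知问题”中的 apt 锁冲突,先探测各节点是否有进程占用 dpkg 锁; +# 若显示 apt-locked,可在后续 ansible 命令上加 --forks 1 逐台执行(等价于 playbook 中的 serial: 1)。 +ansible $FAILED_NODES -i inventories/production/nomad-cluster.ini -m shell -a "fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1 && echo apt-locked || echo apt-free" --become +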
+ansible $FAILED_NODES -i inventories/production/nomad-cluster.ini -m apt -a "name=telegraf state=present update_cache=yes" --become + +if [[ $? -eq 0 ]]; then + echo "✅ Telegraf 安装成功" +else + echo "❌ 安装失败,尝试手动方式..." + # 手动安装方式 + ansible $FAILED_NODES -i inventories/production/nomad-cluster.ini -m shell -a "apt update && apt install -y telegraf" --become +fi + +echo "🔧 第二步:配置 Telegraf 使用远程配置..." + +# 创建环境变量文件(org/bucket 与 group_vars/all.yml 保持一致:seekkey / VPS) +ansible $FAILED_NODES -i inventories/production/nomad-cluster.ini -m copy -a "content='INFLUX_TOKEN=VU_dOCVZzqEHb9jSFsDe0bJlEBaVbiG4LqfoczlnmcbfrbmklSt904HJPL4idYGvVi0c2eHkYDi2zCTni7Ay4w== +INFLUX_ORG=seekkey +INFLUX_BUCKET=VPS +INFLUX_URL=http://influxdb1.tailnet-68f9.ts.net:8086' dest=/etc/default/telegraf owner=root group=root mode=0600" --become + +# 创建 systemd 服务文件 +ansible $FAILED_NODES -i inventories/production/nomad-cluster.ini -m copy -a "content='[Unit] +Description=Telegraf - 节点监控服务 +Documentation=https://github.com/influxdata/telegraf +After=network.target + +[Service] +Type=notify +User=telegraf +Group=telegraf +ExecStart=/usr/bin/telegraf --config http://influxdb1.tailnet-68f9.ts.net:8086/api/v2/telegrafs/0f8a73496790c000 +ExecReload=/bin/kill -HUP \$MAINPID +KillMode=control-group +Restart=on-failure +RestartSec=5 +TimeoutStopSec=20 +EnvironmentFile=/etc/default/telegraf + +[Install] +WantedBy=multi-user.target' dest=/etc/systemd/system/telegraf.service owner=root group=root mode=0644" --become + +echo "🔄 第三步:启动服务..." +ansible $FAILED_NODES -i inventories/production/nomad-cluster.ini -m systemd -a "daemon_reload=yes name=telegraf state=started enabled=yes" --become + +echo "✅ 检查结果..."
+ansible $FAILED_NODES -i inventories/production/nomad-cluster.ini -m shell -a "systemctl status telegraf --no-pager -l | head -5" --become \ No newline at end of file diff --git a/configuration/fix-telegraf-simple.sh b/configuration/fix-telegraf-simple.sh new file mode 100755 index 0000000..8a6ed69 --- /dev/null +++ b/configuration/fix-telegraf-simple.sh @@ -0,0 +1,52 @@ +#!/bin/bash +# 直接使用远程配置运行 Telegraf 的简化方案 + +echo "🚀 创建简化的 Telegraf 服务..." + +# 失败的节点 +FAILED_NODES="ch3,ch2,ash1d,ash2e,syd" + +echo "📥 第一步:下载并安装 Telegraf 二进制文件..." +ansible $FAILED_NODES -i inventories/production/nomad-cluster.ini -m shell -a " +cd /tmp && +curl -L https://dl.influxdata.com/telegraf/releases/telegraf-1.36.1_linux_amd64.tar.gz -o telegraf.tar.gz && +tar -xzf telegraf.tar.gz && +sudo cp telegraf-1.36.1/usr/bin/telegraf /usr/bin/ && +sudo chmod +x /usr/bin/telegraf && +telegraf version +" --become + +echo "🔧 第二步:创建简化的启动脚本..." +ansible $FAILED_NODES -i inventories/production/nomad-cluster.ini -m copy -a "content='#!/bin/bash +export INFLUX_TOKEN=VU_dOCVZzqEHb9jSFsDe0bJlEBaVbiG4LqfoczlnmcbfrbmklSt904HJPL4idYGvVi0c2eHkYDi2zCTni7Ay4w== +export INFLUX_ORG=seekkey +export INFLUX_BUCKET=VPS +export INFLUX_URL=http://influxdb1.tailnet-68f9.ts.net:8086 + +/usr/bin/telegraf --config http://influxdb1.tailnet-68f9.ts.net:8086/api/v2/telegrafs/0f8a73496790c000 +' dest=/usr/local/bin/telegraf-start.sh owner=root group=root mode=0755" --become + +echo "🔄 第三步:停止旧服务并启动新的简化服务..." 
+ansible $FAILED_NODES -i inventories/production/nomad-cluster.ini -m systemd -a "name=telegraf state=stopped enabled=no" --become || true + +# 创建简化的 systemd 服务 +ansible $FAILED_NODES -i inventories/production/nomad-cluster.ini -m copy -a "content='[Unit] +Description=Telegraf (Simplified) +After=network.target + +[Service] +Type=simple +User=telegraf +Group=telegraf +ExecStart=/usr/local/bin/telegraf-start.sh +Restart=always +RestartSec=5 + +[Install] +WantedBy=multi-user.target' dest=/etc/systemd/system/telegraf-simple.service owner=root group=root mode=0644" --become + +echo "🚀 第四步:启动简化服务..." +ansible $FAILED_NODES -i inventories/production/nomad-cluster.ini -m systemd -a "daemon_reload=yes name=telegraf-simple state=started enabled=yes" --become + +echo "✅ 检查结果..." +ansible $FAILED_NODES -i inventories/production/nomad-cluster.ini -m shell -a "systemctl status telegraf-simple --no-pager -l | head -10" --become \ No newline at end of file diff --git a/configuration/inventories/production/consul-nodes.ini b/configuration/inventories/production/consul-nodes.ini new file mode 100644 index 0000000..79a4e49 --- /dev/null +++ b/configuration/inventories/production/consul-nodes.ini @@ -0,0 +1,7 @@ +[consul_servers] +master ansible_host=100.117.106.136 ansible_user=ben ansible_become=yes ansible_become_pass=3131 +ash3c ansible_host=100.116.80.94 ansible_user=ben ansible_become=yes ansible_become_pass=3131 +hcs ansible_host=100.84.197.26 ansible_user=ben ansible_become=yes ansible_become_pass=3131 + +[consul_servers:vars] +ansible_ssh_private_key_file=~/.ssh/id_ed25519 \ No newline at end of file diff --git a/configuration/inventories/production/group_vars/all.yml b/configuration/inventories/production/group_vars/all.yml index b5c6cbe..248b02c 100644 --- a/configuration/inventories/production/group_vars/all.yml +++ b/configuration/inventories/production/group_vars/all.yml @@ -4,8 +4,8 @@ # InfluxDB 2.x 连接配置 influxdb_url: "http://influxdb1.tailnet-68f9.ts.net:8086" 
influxdb_token: "VU_dOCVZzqEHb9jSFsDe0bJlEBaVbiG4LqfoczlnmcbfrbmklSt904HJPL4idYGvVi0c2eHkYDi2zCTni7Ay4w==" -influxdb_org: "nomad" # 组织名称 -influxdb_bucket: "nomad_monitoring" # Bucket 名称 +influxdb_org: "seekkey" # 组织名称 +influxdb_bucket: "VPS" # Bucket 名称 # 远程 Telegraf 配置 URL telegraf_config_url: "http://influxdb1.tailnet-68f9.ts.net:8086/api/v2/telegrafs/0f8a73496790c000" diff --git a/configuration/inventories/production/inventory.ini b/configuration/inventories/production/inventory.ini index 5453acc..a4114b5 100644 --- a/configuration/inventories/production/inventory.ini +++ b/configuration/inventories/production/inventory.ini @@ -5,12 +5,16 @@ dev2 ansible_host=dev2 ansible_user=ben ansible_become=yes ansible_become_pass=3 [oci_kr] ch2 ansible_host=ch2 ansible_user=ben ansible_become=yes ansible_become_pass=3131 ch3 ansible_host=ch3 ansible_user=ben ansible_become=yes ansible_become_pass=3131 -master ansible_host=master ansible_port=60022 ansible_user=ben ansible_become=yes ansible_become_pass=3131 [oci_us] ash1d ansible_host=ash1d ansible_user=ben ansible_become=yes ansible_become_pass=3131 ash2e ansible_host=ash2e ansible_user=ben ansible_become=yes ansible_become_pass=3131 + +[oci_a1] +master ansible_host=master ansible_port=60022 ansible_user=ben ansible_become=yes ansible_become_pass=3131 ash3c ansible_host=ash3c ansible_user=ben ansible_become=yes ansible_become_pass=3131 + + [huawei] hcs ansible_host=hcs ansible_user=ben ansible_become=yes ansible_become_pass=3131 [google] @@ -18,6 +22,7 @@ benwork ansible_host=benwork ansible_user=ben ansible_become=yes ansible_become_ [ditigalocean] syd ansible_host=syd ansible_user=ben ansible_become=yes ansible_become_pass=3131 + [aws] #aws linux dnf awsirish ansible_host=awsirish ansible_user=ben ansible_become=yes ansible_become_pass=3131 @@ -29,12 +34,16 @@ nuc12 ansible_host=nuc12 ansible_user=root ansible_become=yes ansible_become_pas [lxc] #集中在三台机器,不要同时upgrade 会死掉,顺序调度来 (Debian/Ubuntu containers using apt) -warden 
ansible_host=warden ansible_user=ben ansible_become=yes ansible_become_pass=3131 gitea ansible_host=gitea ansible_user=root ansible_become=yes ansible_become_pass=313131 -influxdb ansible_host=influxdb1 ansible_user=root ansible_become=yes ansible_become_pass=313131 mysql ansible_host=mysql ansible_user=root ansible_become=yes ansible_become_pass=313131 postgresql ansible_host=postgresql ansible_user=root ansible_become=yes ansible_become_pass=313131 +[nomadlxc] +influxdb ansible_host=influxdb1 ansible_user=root ansible_become=yes ansible_become_pass=313131 +warden ansible_host=warden ansible_user=ben ansible_become=yes ansible_become_pass=3131 +[semaphore] +semaphoressh ansible_host=semaphore ansible_user=root ansible_become=yes ansible_become_pass=313131 + [alpine] #Alpine Linux containers using apk package manager redis ansible_host=redis ansible_user=root ansible_become=yes ansible_become_pass=313131 @@ -56,5 +65,26 @@ onecloud1 ansible_host=onecloud1 ansible_user=ben ansible_ssh_pass=3131 ansible_ [germany] de ansible_host=de ansible_user=ben ansible_ssh_pass=3131 ansible_become=yes ansible_become_pass=3131 + +[beijing:children] +nomadlxc +hcp + [all:vars] -ansible_ssh_common_args='-o StrictHostKeyChecking=no' \ No newline at end of file +ansible_ssh_common_args='-o StrictHostKeyChecking=no' + +[nomad_clients:children] +nomadlxc +hcp +oci_a1 +huawei +ditigalocean +germany +[nomad_servers:children] +oci_us +oci_kr +semaphore + +[nomad_cluster:children] +nomad_servers +nomad_clients \ No newline at end of file diff --git a/configuration/inventories/production/nomad-cluster.ini b/configuration/inventories/production/nomad-cluster.ini deleted file mode 100644 index a1aca42..0000000 --- a/configuration/inventories/production/nomad-cluster.ini +++ /dev/null @@ -1,30 +0,0 @@ -[nomad_servers] -semaphore ansible_connection=local nomad_role=server nomad_bootstrap_expect=6 -ash2e ansible_host=ash2e ansible_user=ben ansible_become=yes
ansible_become_pass=3131 nomad_role=server nomad_bootstrap_expect=6 -ash1d ansible_host=ash1d ansible_user=ben ansible_become=yes ansible_become_pass=3131 nomad_role=server nomad_bootstrap_expect=6 -ch2 ansible_host=ch2 ansible_user=ben ansible_become=yes ansible_become_pass=3131 nomad_role=server nomad_bootstrap_expect=6 -ch3 ansible_host=ch3 ansible_user=ben ansible_become=yes ansible_become_pass=3131 nomad_role=server nomad_bootstrap_expect=6 -# 新增的 Mac 和 Windows 节点(请替换为实际的 Tailscale IP) -mac-laptop ansible_host=100.xxx.xxx.xxx ansible_user=your_mac_user nomad_role=server nomad_bootstrap_expect=6 -win-laptop ansible_host=100.xxx.xxx.xxx ansible_user=your_win_user nomad_role=server nomad_bootstrap_expect=6 - -[nomad_clients] -master ansible_host=100.117.106.136 ansible_port=60022 ansible_user=ben ansible_become=yes ansible_become_pass=3131 nomad_role=client -ash3c ansible_host=100.116.80.94 ansible_port=22 ansible_user=ben ansible_become=yes ansible_become_pass=3131 nomad_role=client -hcp1 ansible_host=hcp1 ansible_user=root ansible_become=yes ansible_become_pass=313131 nomad_role=client -hcp2 ansible_host=hcp2 ansible_user=root ansible_become=yes ansible_become_pass=313131 nomad_role=client -hcs ansible_host=hcs ansible_user=ben ansible_become=yes ansible_become_pass=3131 nomad_role=client -syd ansible_host=100.117.137.105 ansible_user=ben ansible_become=yes ansible_become_pass=3131 nomad_role=client - -[nomad_cluster:children] -nomad_servers -nomad_clients - -[nomad_cluster:vars] -ansible_ssh_private_key_file=~/.ssh/id_ed25519 -ansible_user=ben -ansible_become=yes -nomad_version=1.10.5 -nomad_datacenter=dc1 -nomad_region=global -nomad_encrypt_key=NVOMDvXblgWfhtzFzOUIHnKEOrbXOkPrkIPbRGGf1YQ= \ No newline at end of file diff --git a/configuration/inventories/production/nomad-cluster.ini.backup b/configuration/inventories/production/nomad-cluster.ini.backup deleted file mode 100644 index 07d02ad..0000000 --- 
a/configuration/inventories/production/nomad-cluster.ini.backup +++ /dev/null @@ -1,22 +0,0 @@ -[nomad_servers] -master ansible_host=100.117.106.136 ansible_port=60022 ansible_user=ben ansible_become=yes ansible_become_pass=3131 nomad_role=server nomad_bootstrap_expect=3 -semaphore ansible_connection=local nomad_role=server nomad_bootstrap_expect=3 -ash3c ansible_host=100.116.80.94 ansible_port=22 ansible_user=ben ansible_become=yes ansible_become_pass=3131 nomad_role=server nomad_bootstrap_expect=3 - -[nomad_clients] -hcp1 ansible_host=hcp1 ansible_user=root ansible_become=yes ansible_become_pass=313131 nomad_role=client -hcp2 ansible_host=hcp2 ansible_user=root ansible_become=yes ansible_become_pass=313131 nomad_role=client -hcs ansible_host=hcs ansible_user=ben ansible_become=yes ansible_become_pass=3131 nomad_role=client - -[nomad_cluster:children] -nomad_servers -nomad_clients - -[nomad_cluster:vars] -ansible_ssh_private_key_file=~/.ssh/id_ed25519 -ansible_user=ben -ansible_become=yes -nomad_version=1.10.5 -nomad_datacenter=dc1 -nomad_region=global -nomad_encrypt_key=NVOMDvXblgWfhtzFzOUIHnKEOrbXOkPrkIPbRGGf1YQ= \ No newline at end of file diff --git a/configuration/inventories/production/nomad-cluster.ini.backup-20250924-025928 b/configuration/inventories/production/nomad-cluster.ini.backup-20250924-025928 deleted file mode 100644 index b51ddd6..0000000 --- a/configuration/inventories/production/nomad-cluster.ini.backup-20250924-025928 +++ /dev/null @@ -1,23 +0,0 @@ -[nomad_servers] -master ansible_host=100.117.106.136 ansible_port=60022 ansible_user=ben ansible_become=yes ansible_become_pass=3131 nomad_role=server nomad_bootstrap_expect=3 -semaphore ansible_connection=local nomad_role=server nomad_bootstrap_expect=3 -ash3c ansible_host=100.116.80.94 ansible_port=22 ansible_user=ben ansible_become=yes ansible_become_pass=3131 nomad_role=server nomad_bootstrap_expect=3 - -[nomad_clients] -hcp1 ansible_host=hcp1 ansible_user=root ansible_become=yes 
ansible_become_pass=313131 nomad_role=client -hcp2 ansible_host=hcp2 ansible_user=root ansible_become=yes ansible_become_pass=313131 nomad_role=client -hcs ansible_host=hcs ansible_user=ben ansible_become=yes ansible_become_pass=3131 nomad_role=client -syd ansible_host=100.117.137.105 ansible_user=ben ansible_become=yes ansible_become_pass=3131 nomad_role=client - -[nomad_cluster:children] -nomad_servers -nomad_clients - -[nomad_cluster:vars] -ansible_ssh_private_key_file=~/.ssh/id_ed25519 -ansible_user=ben -ansible_become=yes -nomad_version=1.10.5 -nomad_datacenter=dc1 -nomad_region=global -nomad_encrypt_key=NVOMDvXblgWfhtzFzOUIHnKEOrbXOkPrkIPbRGGf1YQ= \ No newline at end of file diff --git a/configuration/playbooks/add-warden-to-nomad-cluster.yml b/configuration/playbooks/add-warden-to-nomad-cluster.yml new file mode 100644 index 0000000..32e9c75 --- /dev/null +++ b/configuration/playbooks/add-warden-to-nomad-cluster.yml @@ -0,0 +1,202 @@ +--- +- name: Add Warden Server as Nomad Client to Cluster + hosts: warden + become: yes + gather_facts: yes + + vars: + nomad_plugin_dir: "/opt/nomad/plugins" + nomad_datacenter: "dc1" + nomad_region: "global" + nomad_servers: + - "100.117.106.136:4647" + - "100.116.80.94:4647" + - "100.97.62.111:4647" + - "100.116.112.45:4647" + - "100.84.197.26:4647" + + tasks: + - name: 显示当前处理的节点 + debug: + msg: "🔧 将 warden 服务器添加为 Nomad 客户端: {{ inventory_hostname }}" + + - name: 检查 Nomad 是否已安装 + shell: which nomad || echo "not_found" + register: nomad_check + changed_when: false + + - name: 下载并安装 Nomad + block: + - name: 下载 Nomad 1.10.5 + get_url: + url: "https://releases.hashicorp.com/nomad/1.10.5/nomad_1.10.5_linux_amd64.zip" + dest: "/tmp/nomad.zip" + mode: '0644' + + - name: 解压并安装 Nomad + unarchive: + src: "/tmp/nomad.zip" + dest: "/usr/local/bin/" + remote_src: yes + owner: root + group: root + mode: '0755' + + - name: 清理临时文件 + file: + path: "/tmp/nomad.zip" + state: absent + when: nomad_check.stdout == "not_found" + + - name: 验证 Nomad 
安装 + shell: nomad version + register: nomad_version_output + + - name: 创建 Nomad 配置目录 + file: + path: /etc/nomad.d + state: directory + owner: root + group: root + mode: '0755' + + - name: 创建 Nomad 数据目录 + file: + path: /opt/nomad/data + state: directory + owner: nomad + group: nomad + mode: '0755' + ignore_errors: yes + + - name: 创建 Nomad 插件目录 + file: + path: "{{ nomad_plugin_dir }}" + state: directory + owner: nomad + group: nomad + mode: '0755' + ignore_errors: yes + + - name: 获取服务器 IP 地址 + shell: | + ip route get 1.1.1.1 | grep -oP 'src \K\S+' + register: server_ip_result + changed_when: false + + - name: 设置服务器 IP 变量 + set_fact: + server_ip: "{{ server_ip_result.stdout }}" + + - name: 停止 Nomad 服务(如果正在运行) + systemd: + name: nomad + state: stopped + ignore_errors: yes + + - name: 创建 Nomad 客户端配置文件 + copy: + content: | + # Nomad Client Configuration for warden + datacenter = "{{ nomad_datacenter }}" + data_dir = "/opt/nomad/data" + log_level = "INFO" + bind_addr = "{{ server_ip }}" + + server { + enabled = false + } + + client { + enabled = true + servers = [ + {% for server in nomad_servers %}"{{ server }}"{% if not loop.last %}, {% endif %}{% endfor %} + ] + } + + plugin_dir = "{{ nomad_plugin_dir }}" + + plugin "podman" { + config { + socket_path = "unix:///run/podman/podman.sock" + volumes { + enabled = true + } + } + } + + consul { + address = "127.0.0.1:8500" + } + dest: /etc/nomad.d/nomad.hcl + owner: root + group: root + mode: '0644' + + - name: 验证 Nomad 配置 + shell: nomad config validate /etc/nomad.d/nomad.hcl + register: nomad_validate + failed_when: nomad_validate.rc != 0 + + - name: 创建 Nomad systemd 服务文件 + copy: + content: | + [Unit] + Description=Nomad + Documentation=https://www.nomadproject.io/docs/ + Wants=network-online.target + After=network-online.target + + [Service] + Type=notify + User=root + Group=root + ExecStart=/usr/local/bin/nomad agent -config=/etc/nomad.d + ExecReload=/bin/kill -HUP $MAINPID + KillMode=process + KillSignal=SIGINT + 
TimeoutStopSec=5 + LimitNOFILE=65536 + LimitNPROC=32768 + Restart=on-failure + RestartSec=2 + + [Install] + WantedBy=multi-user.target + dest: /etc/systemd/system/nomad.service + mode: '0644' + + - name: 重新加载 systemd 配置 + systemd: + daemon_reload: yes + + - name: 启动并启用 Nomad 服务 + systemd: + name: nomad + state: started + enabled: yes + + - name: 等待 Nomad 服务启动 + wait_for: + port: 4646 + host: "{{ server_ip }}" + delay: 5 + timeout: 60 + + - name: 检查 Nomad 客户端状态 + shell: nomad node status -self + register: nomad_node_status + retries: 5 + delay: 5 + until: nomad_node_status.rc == 0 + ignore_errors: yes + + - name: 显示 Nomad 客户端配置结果 + debug: + msg: | + ✅ warden 服务器已成功配置为 Nomad 客户端 + 📦 Nomad 版本: {{ nomad_version_output.stdout.split('\n')[0] }} + 🌐 服务器 IP: {{ server_ip }} + 🏗️ 数据中心: {{ nomad_datacenter }} + 📊 客户端状态: {{ 'SUCCESS' if nomad_node_status.rc == 0 else 'PENDING' }} + 🚀 warden 现在是 Nomad 集群的一部分 \ No newline at end of file diff --git a/configuration/playbooks/check-podman-version.yml b/configuration/playbooks/check-podman-version.yml new file mode 100644 index 0000000..7fd02ba --- /dev/null +++ b/configuration/playbooks/check-podman-version.yml @@ -0,0 +1,15 @@ +--- +- name: 检查 Podman 版本 + hosts: warden + become: yes + gather_facts: yes + + tasks: + - name: 检查当前 Podman 版本 + shell: podman --version + register: current_podman_version + ignore_errors: yes + + - name: 显示当前版本 + debug: + msg: "当前 Podman 版本: {{ current_podman_version.stdout if current_podman_version.rc == 0 else '未安装或无法获取' }}" \ No newline at end of file diff --git a/configuration/playbooks/check-podman-versions.yml b/configuration/playbooks/check-podman-versions.yml new file mode 100644 index 0000000..6dac3c6 --- /dev/null +++ b/configuration/playbooks/check-podman-versions.yml @@ -0,0 +1,22 @@ +- name: Check podman version on semaphore (local) + hosts: semaphore + connection: local + gather_facts: false + tasks: + - name: Check podman version + command: /usr/local/bin/podman --version + register: 
podman_version + - name: Display podman version + debug: + msg: "Podman version on {{ inventory_hostname }} is: {{ podman_version.stdout }}" + +- name: Check podman version on other beijing nodes + hosts: beijing:!semaphore + gather_facts: false + tasks: + - name: Check podman version + command: /usr/local/bin/podman --version + register: podman_version + - name: Display podman version + debug: + msg: "Podman version on {{ inventory_hostname }} is: {{ podman_version.stdout }}" \ No newline at end of file diff --git a/configuration/playbooks/clear-aliases.yml b/configuration/playbooks/clear-aliases.yml index d299355..98f44cf 100644 --- a/configuration/playbooks/clear-aliases.yml +++ b/configuration/playbooks/clear-aliases.yml @@ -56,21 +56,30 @@ loop: "{{ alias_files.files }}" when: alias_files.files is defined - - name: Clear shell history to remove alias commands - shell: | - > /root/.bash_history - > /root/.zsh_history - history -c - ignore_errors: yes - - - name: Unalias all current aliases - shell: unalias -a - ignore_errors: yes - - - name: Restart shell services - shell: | - pkill -f bash || true - pkill -f zsh || true + - name: Clear aliases from /etc/profile.d/aliases.sh + ansible.builtin.file: + path: /etc/profile.d/aliases.sh + state: absent + + - name: Clear aliases from /root/.bashrc + ansible.builtin.lineinfile: + path: /root/.bashrc + state: absent + regexp: "^alias " + + - name: Clear aliases from /root/.bash_aliases + ansible.builtin.file: + path: /root/.bash_aliases + state: absent + + - name: Clear history + ansible.builtin.copy: + content: "" + dest: /root/.bash_history + + - name: Restart shell to apply changes + ansible.builtin.shell: + cmd: pkill -f bash || true - name: Test network connectivity after clearing aliases shell: ping -c 2 8.8.8.8 || echo "Ping failed" diff --git a/configuration/playbooks/clear-all-aliases.yml b/configuration/playbooks/clear-all-aliases.yml new file mode 100644 index 0000000..b0412b2 --- /dev/null +++
b/configuration/playbooks/clear-all-aliases.yml @@ -0,0 +1,34 @@ +--- +- name: Remove all aliases from user shell configuration files + hosts: all + become: yes + gather_facts: false + + tasks: + - name: Find all relevant shell configuration files + find: + paths: /home + recurse: yes + hidden: yes + patterns: ['.bashrc', '.bash_aliases', '.profile'] + register: shell_config_files + + - name: Remove aliases from shell configuration files + replace: + path: "{{ item.path }}" + regexp: '^alias .*' + replace: '' + loop: "{{ shell_config_files.files }}" + when: shell_config_files.files is defined + + - name: Remove functions from shell configuration files + replace: + path: "{{ item.path }}" + regexp: '^function .*' + replace: '' + loop: "{{ shell_config_files.files }}" + when: shell_config_files.files is defined + + - name: Display completion message + debug: + msg: "All aliases and functions have been removed from user shell configuration files." \ No newline at end of file diff --git a/configuration/playbooks/clear-proxy-settings.yml b/configuration/playbooks/clear-proxy-settings.yml new file mode 100644 index 0000000..201d379 --- /dev/null +++ b/configuration/playbooks/clear-proxy-settings.yml @@ -0,0 +1,47 @@ +--- +- name: Clear proxy settings from the system + hosts: all + become: yes + gather_facts: false + + tasks: + - name: Remove proxy environment file + file: + path: /root/mgmt/configuration/proxy.env + state: absent + ignore_errors: yes + + - name: Unset proxy environment variables + shell: | + unset http_proxy + unset https_proxy + unset HTTP_PROXY + unset HTTPS_PROXY + unset no_proxy + unset NO_PROXY + unset ALL_PROXY + unset all_proxy + unset DOCKER_BUILDKIT + unset BUILDKIT_PROGRESS + unset GIT_HTTP_PROXY + unset GIT_HTTPS_PROXY + unset CURL_PROXY + unset WGET_PROXY + ignore_errors: yes + + - name: Remove proxy settings from /etc/environment + lineinfile: + path: /etc/environment + state: absent + regexp:
'^(http_proxy|https_proxy|no_proxy|all_proxy|HTTP_PROXY|HTTPS_PROXY|NO_PROXY|ALL_PROXY|DOCKER_BUILDKIT|BUILDKIT_PROGRESS|GIT_HTTP_PROXY|GIT_HTTPS_PROXY|CURL_PROXY|WGET_PROXY)=' + ignore_errors: yes + + - name: Remove proxy settings from /etc/apt/apt.conf.d/proxy.conf + file: + path: /etc/apt/apt.conf.d/proxy.conf + state: absent + ignore_errors: yes + + - name: Display completion message + debug: + msg: "Proxy settings have been cleared from the system." \ No newline at end of file diff --git a/configuration/playbooks/configure-nomad-sudo.yml b/configuration/playbooks/configure-nomad-sudo.yml new file mode 100644 index 0000000..50fde16 --- /dev/null +++ b/configuration/playbooks/configure-nomad-sudo.yml @@ -0,0 +1,22 @@ +--- +- name: Configure NOPASSWD sudo for nomad user + hosts: nomad_clients + become: yes + tasks: + - name: Ensure sudoers.d directory exists + file: + path: /etc/sudoers.d + state: directory + owner: root + group: root + mode: '0750' + + - name: Allow nomad user passwordless sudo for required commands + copy: + dest: /etc/sudoers.d/nomad + content: | + nomad ALL=(ALL) NOPASSWD: /usr/bin/apt, /usr/bin/systemctl, /bin/mkdir, /bin/chown, /bin/chmod, /bin/mv, /bin/sed, /usr/bin/tee, /usr/sbin/usermod, /usr/bin/unzip, /usr/bin/wget + owner: root + group: root + mode: '0440' + validate: 'visudo -cf %s' \ No newline at end of file diff --git a/configuration/playbooks/configure-nomad-tailscale.yml b/configuration/playbooks/configure-nomad-tailscale.yml index 45f3a49..3e010f1 100644 --- a/configuration/playbooks/configure-nomad-tailscale.yml +++ b/configuration/playbooks/configure-nomad-tailscale.yml @@ -11,7 +11,12 @@ - name: 获取当前节点的 Tailscale IP shell: tailscale ip | head -1 register: current_tailscale_ip - failed_when: current_tailscale_ip.rc != 0 + changed_when: false + ignore_errors: yes + + - name: 计算用于 Nomad 的地址(优先 Tailscale,回退到 inventory 或 ansible_host) + set_fact: + node_addr: "{{ ((current_tailscale_ip.stdout | default('')) is match('^100\\.')) | ternary((current_tailscale_ip.stdout |
trim), (hostvars[inventory_hostname].tailscale_ip | default(ansible_host))) }}" - name: 确保 Nomad 配置目录存在 file: @@ -32,12 +37,12 @@ data_dir = "/opt/nomad/data" log_level = "INFO" - bind_addr = "{{ current_tailscale_ip.stdout }}" + bind_addr = "{{ node_addr }}" addresses { - http = "0.0.0.0" - rpc = "{{ current_tailscale_ip.stdout }}" - serf = "{{ current_tailscale_ip.stdout }}" + http = "{{ node_addr }}" + rpc = "{{ node_addr }}" + serf = "{{ node_addr }}" } ports { @@ -74,9 +79,10 @@ } consul { - address = "{{ current_tailscale_ip.stdout }}:8500" + address = "{{ node_addr }}:8500" } when: nomad_role == "server" + notify: restart nomad - name: 生成 Nomad 客户端配置(使用 Tailscale) copy: @@ -89,12 +95,12 @@ data_dir = "/opt/nomad/data" log_level = "INFO" - bind_addr = "{{ current_tailscale_ip.stdout }}" + bind_addr = "{{ node_addr }}" addresses { - http = "0.0.0.0" - rpc = "{{ current_tailscale_ip.stdout }}" - serf = "{{ current_tailscale_ip.stdout }}" + http = "{{ node_addr }}" + rpc = "{{ node_addr }}" + serf = "{{ node_addr }}" } ports { @@ -109,6 +115,7 @@ client { enabled = true + network_interface = "tailscale0" servers = [ "100.116.158.95:4647", # semaphore @@ -128,9 +135,10 @@ } consul { - address = "{{ current_tailscale_ip.stdout }}:8500" + address = "{{ node_addr }}:8500" } when: nomad_role == "client" + notify: restart nomad - name: 检查 Nomad 二进制文件位置 shell: which nomad || find /usr -name nomad 2>/dev/null | head -1 @@ -154,7 +162,7 @@ Type=notify User=root Group=root - ExecStart={{ nomad_binary_path.stdout }} agent -config=/etc/nomad.d/nomad.hcl + ExecStart=/snap/bin/nomad agent -config=/etc/nomad.d/nomad.hcl ExecReload=/bin/kill -HUP $MAINPID KillMode=process Restart=on-failure @@ -185,7 +193,7 @@ - name: 等待 Nomad 服务启动 wait_for: port: 4646 - host: "{{ current_tailscale_ip.stdout }}" + host: "{{ node_addr }}" delay: 5 timeout: 30 ignore_errors: yes @@ -199,7 +207,7 @@ debug: msg: | ✅ 节点 {{ inventory_hostname }} 配置完成 - 🌐 Tailscale IP: {{ current_tailscale_ip.stdout 
}} + 🌐 使用地址: {{ node_addr }} 🎯 角色: {{ nomad_role }} 🔧 Nomad 二进制: {{ nomad_binary_path.stdout }} 📊 服务状态: {{ 'active' if nomad_status.rc == 0 else 'failed' }} diff --git a/configuration/playbooks/configure-podman-for-nomad.yml b/configuration/playbooks/configure-podman-for-nomad.yml new file mode 100644 index 0000000..3e4d819 --- /dev/null +++ b/configuration/playbooks/configure-podman-for-nomad.yml @@ -0,0 +1,115 @@ +--- +- name: Configure Podman for Nomad Integration + hosts: all + become: yes + gather_facts: yes + + tasks: + - name: 显示当前处理的节点 + debug: + msg: "🔧 正在为 Nomad 配置 Podman: {{ inventory_hostname }}" + + - name: 确保 Podman 已安装 + package: + name: podman + state: present + + - name: 启用并启动 Podman socket 服务 + systemd: + name: podman.socket + enabled: yes + state: started + + - name: 创建 Podman 系统配置目录 + file: + path: /etc/containers + state: directory + mode: '0755' + + - name: 配置 Podman 使用系统 socket + copy: + content: | + [engine] + # 使用系统级 socket 而不是用户级 socket + active_service = "system" + [engine.service_destinations] + [engine.service_destinations.system] + uri = "unix:///run/podman/podman.sock" + dest: /etc/containers/containers.conf + mode: '0644' + + - name: 检查是否存在 nomad 用户 + getent: + database: passwd + key: nomad + register: nomad_user_check + ignore_errors: yes + + - name: 为 nomad 用户创建配置目录 + file: + path: "/home/nomad/.config/containers" + state: directory + owner: nomad + group: nomad + mode: '0755' + when: nomad_user_check is succeeded + + - name: 为 nomad 用户配置 Podman + copy: + content: | + [engine] + active_service = "system" + [engine.service_destinations] + [engine.service_destinations.system] + uri = "unix:///run/podman/podman.sock" + dest: /home/nomad/.config/containers/containers.conf + owner: nomad + group: nomad + mode: '0644' + when: nomad_user_check is succeeded + + - name: 将 nomad 用户添加到 podman 组 + user: + name: nomad + groups: podman + append: yes + when: nomad_user_check is succeeded + ignore_errors: yes + + - name: 创建 podman 组(如果不存在) + 
group: + name: podman + state: present + ignore_errors: yes + + - name: 设置 podman socket 目录权限 + file: + path: /run/podman + state: directory + mode: '0755' + group: podman + ignore_errors: yes + + - name: 验证 Podman socket 权限 + file: + path: /run/podman/podman.sock + mode: '0666' + when: nomad_user_check is succeeded + ignore_errors: yes + + - name: 验证 Podman 安装 + shell: podman --version + register: podman_version + + - name: 测试 Podman 功能 + shell: podman info + register: podman_info + ignore_errors: yes + + - name: 显示配置结果 + debug: + msg: | + ✅ 节点 {{ inventory_hostname }} Podman 配置完成 + 📦 Podman 版本: {{ podman_version.stdout }} + 🐳 Podman 状态: {{ 'SUCCESS' if podman_info.rc == 0 else 'WARNING' }} + 👤 Nomad 用户: {{ 'FOUND' if nomad_user_check is succeeded else 'NOT FOUND' }} \ No newline at end of file diff --git a/configuration/playbooks/debug-nomad-germany.yml b/configuration/playbooks/debug-nomad-germany.yml new file mode 100644 index 0000000..65956ce --- /dev/null +++ b/configuration/playbooks/debug-nomad-germany.yml @@ -0,0 +1,24 @@ +- name: Debug Nomad service on germany + hosts: germany + gather_facts: false + tasks: + - name: Get Nomad service status + command: systemctl status nomad.service --no-pager -l + register: nomad_status + ignore_errors: true + + - name: Get Nomad service journal + command: journalctl -xeu nomad.service --no-pager -n 100 + register: nomad_journal + ignore_errors: true + + - name: Display debug information + debug: + msg: | + --- Nomad Service Status --- + {{ nomad_status.stdout }} + {{ nomad_status.stderr }} + + --- Nomad Service Journal --- + {{ nomad_journal.stdout }} + {{ nomad_journal.stderr }} \ No newline at end of file diff --git a/configuration/playbooks/debug-syd.yml b/configuration/playbooks/debug-syd.yml new file mode 100644 index 0000000..4786e17 --- /dev/null +++ b/configuration/playbooks/debug-syd.yml @@ -0,0 +1,12 @@ +- name: Distribute new podman binary to syd + hosts: syd + gather_facts: false + tasks: + - name: Copy new
podman binary to /usr/local/bin + copy: + src: /root/mgmt/configuration/podman-remote-static-linux_amd64 + dest: /usr/local/bin/podman + owner: root + group: root + mode: '0755' + become: yes \ No newline at end of file diff --git a/configuration/playbooks/distribute-podman-driver.yml b/configuration/playbooks/distribute-podman-driver.yml new file mode 100644 index 0000000..1dd196f --- /dev/null +++ b/configuration/playbooks/distribute-podman-driver.yml @@ -0,0 +1,76 @@ +--- +- name: Distribute Nomad Podman Driver to all nodes + hosts: nomad_cluster + become: yes + vars: + nomad_user: nomad + nomad_data_dir: /opt/nomad/data + nomad_plugins_dir: "{{ nomad_data_dir }}/plugins" + + tasks: + - name: Stop Nomad service + systemd: + name: nomad + state: stopped + + - name: Create plugins directory + file: + path: "{{ nomad_plugins_dir }}" + state: directory + owner: "{{ nomad_user }}" + group: "{{ nomad_user }}" + mode: '0755' + + - name: Copy Nomad Podman driver from local + copy: + src: /tmp/nomad-driver-podman + dest: "{{ nomad_plugins_dir }}/nomad-driver-podman" + owner: "{{ nomad_user }}" + group: "{{ nomad_user }}" + mode: '0755' + + - name: Update Nomad configuration for plugin directory + lineinfile: + path: /etc/nomad.d/nomad.hcl + regexp: '^plugin_dir' + line: 'plugin_dir = "{{ nomad_plugins_dir }}"' + insertafter: 'data_dir = "/opt/nomad/data"' + + - name: Ensure Podman is installed + package: + name: podman + state: present + + - name: Enable Podman socket + systemd: + name: podman.socket + enabled: yes + state: started + ignore_errors: yes + + - name: Start Nomad service + systemd: + name: nomad + state: started + enabled: yes + + - name: Wait for Nomad to be ready + wait_for: + port: 4646 + host: localhost + delay: 10 + timeout: 60 + + - name: Wait for plugins to load + pause: + seconds: 15 + + - name: Check driver status + shell: | + /usr/local/bin/nomad node status -self | grep -A 10 "Driver Status" || /usr/bin/nomad node status -self | grep -A 10 "Driver 
Status" + register: driver_status + failed_when: false + + - name: Display driver status + debug: + var: driver_status.stdout_lines \ No newline at end of file diff --git a/configuration/playbooks/distribute-podman-germany.yml b/configuration/playbooks/distribute-podman-germany.yml new file mode 100644 index 0000000..587dc19 --- /dev/null +++ b/configuration/playbooks/distribute-podman-germany.yml @@ -0,0 +1,12 @@ +- name: Distribute new podman binary to germany + hosts: germany + gather_facts: false + tasks: + - name: Copy new podman binary to /usr/local/bin + copy: + src: /root/mgmt/configuration/podman-remote-static-linux_amd64 + dest: /usr/local/bin/podman + owner: root + group: root + mode: '0755' + become: yes \ No newline at end of file diff --git a/configuration/playbooks/distribute-podman.yml b/configuration/playbooks/distribute-podman.yml new file mode 100644 index 0000000..9c2f0d4 --- /dev/null +++ b/configuration/playbooks/distribute-podman.yml @@ -0,0 +1,12 @@ +- name: Distribute new podman binary to specified nomad_clients + hosts: nomadlxc,hcp,huawei,ditigalocean + gather_facts: false + tasks: + - name: Copy new podman binary to /usr/local/bin + copy: + src: /root/mgmt/configuration/podman-remote-static-linux_amd64 + dest: /usr/local/bin/podman + owner: root + group: root + mode: '0755' + become: yes \ No newline at end of file diff --git a/configuration/playbooks/ensure-nomad-user.yml b/configuration/playbooks/ensure-nomad-user.yml new file mode 100644 index 0000000..7bead5c --- /dev/null +++ b/configuration/playbooks/ensure-nomad-user.yml @@ -0,0 +1,25 @@ +--- +- name: Ensure nomad user and plugin directory exist + hosts: nomad_clients + become: yes + tasks: + - name: Ensure nomad group exists + group: + name: nomad + state: present + + - name: Ensure nomad user exists + user: + name: nomad + group: nomad + shell: /usr/sbin/nologin + system: yes + create_home: no + + - name: Ensure plugin directory exists with correct ownership + file: + path: 
/opt/nomad/data/plugins + state: directory + owner: nomad + group: nomad + mode: '0755' \ No newline at end of file diff --git a/configuration/playbooks/fix-apt-errors.yml b/configuration/playbooks/fix-apt-errors.yml new file mode 100644 index 0000000..ca8c0d5 --- /dev/null +++ b/configuration/playbooks/fix-apt-errors.yml @@ -0,0 +1,16 @@ +--- +- name: Debug apt repository issues + hosts: beijing:children + become: yes + ignore_unreachable: yes + tasks: + - name: Run apt-get update to capture error + ansible.builtin.shell: apt-get update + register: apt_update_result + failed_when: false + changed_when: false + + - name: Display apt-get update stderr + ansible.builtin.debug: + var: apt_update_result.stderr + verbosity: 2 \ No newline at end of file diff --git a/configuration/playbooks/fix-duplicate-podman-config.yml b/configuration/playbooks/fix-duplicate-podman-config.yml new file mode 100644 index 0000000..15b6852 --- /dev/null +++ b/configuration/playbooks/fix-duplicate-podman-config.yml @@ -0,0 +1,126 @@ +--- +- name: Fix duplicate Podman configuration in Nomad + hosts: nomad_cluster + become: yes + tasks: + - name: Stop Nomad service + systemd: + name: nomad + state: stopped + + - name: Backup current configuration + copy: + src: /etc/nomad.d/nomad.hcl + dest: /etc/nomad.d/nomad.hcl.backup-duplicate-fix + remote_src: yes + + - name: Read current configuration + slurp: + src: /etc/nomad.d/nomad.hcl + register: current_config + + - name: Create clean configuration for clients + copy: + content: | + datacenter = "{{ nomad_datacenter }}" + region = "{{ nomad_region }}" + data_dir = "/opt/nomad/data" + bind_addr = "{{ tailscale_ip }}" + + server { + enabled = false + } + + client { + enabled = true + servers = ["100.116.158.95:4647", "100.117.106.136:4647", "100.86.141.112:4647", "100.81.26.3:4647", "100.103.147.94:4647"] + } + + ui { + enabled = true + } + + addresses { + http = "0.0.0.0" + rpc = "{{ tailscale_ip }}" + serf = "{{ tailscale_ip }}" + } + + ports { + 
http = 4646 + rpc = 4647 + serf = 4648 + } + + plugin "podman" { + config { + socket_path = "unix:///run/podman/podman.sock" + volumes { + enabled = true + } + recover_stopped = true + } + } + + consul { + auto_advertise = false + server_auto_join = false + client_auto_join = false + } + + log_level = "INFO" + enable_syslog = true + dest: /etc/nomad.d/nomad.hcl + owner: nomad + group: nomad + mode: '0640' + when: nomad_role == "client" + + - name: Ensure Podman is installed + package: + name: podman + state: present + + - name: Enable and start Podman socket + systemd: + name: podman.socket + enabled: yes + state: started + + - name: Set proper permissions on Podman socket + file: + path: /run/podman/podman.sock + mode: '0666' + ignore_errors: yes + + - name: Validate Nomad configuration + shell: /usr/local/bin/nomad config validate /etc/nomad.d/nomad.hcl || /usr/bin/nomad config validate /etc/nomad.d/nomad.hcl + register: config_validation + failed_when: config_validation.rc != 0 + + - name: Start Nomad service + systemd: + name: nomad + state: started + enabled: yes + + - name: Wait for Nomad to be ready + wait_for: + port: 4646 + host: localhost + delay: 10 + timeout: 60 + + - name: Wait for drivers to load + pause: + seconds: 20 + + - name: Check driver status + shell: | + /usr/local/bin/nomad node status -self | grep -A 10 "Driver Status" || /usr/bin/nomad node status -self | grep -A 10 "Driver Status" + register: driver_status + failed_when: false + + - name: Display driver status + debug: + var: driver_status.stdout_lines \ No newline at end of file diff --git a/configuration/playbooks/fix-hashicorp-apt-source.yml b/configuration/playbooks/fix-hashicorp-apt-source.yml new file mode 100644 index 0000000..67276f1 --- /dev/null +++ b/configuration/playbooks/fix-hashicorp-apt-source.yml @@ -0,0 +1,34 @@ +--- +- name: 直接复制正确的 HashiCorp APT 源配置 + hosts: nomad_cluster + become: yes + + tasks: + - name: 备份现有的 HashiCorp APT 源配置(如果存在) + copy: + src: 
"/etc/apt/sources.list.d/hashicorp.list" + dest: "/etc/apt/sources.list.d/hashicorp.list.backup-{{ ansible_date_time.epoch }}" + remote_src: yes + ignore_errors: yes + + - name: 创建正确的 HashiCorp APT 源配置 + copy: + content: "deb [trusted=yes] http://apt.releases.hashicorp.com bookworm main\n" + dest: "/etc/apt/sources.list.d/hashicorp.list" + owner: root + group: root + mode: '0644' + + - name: 更新 APT 缓存 + apt: + update_cache: yes + ignore_errors: yes + + - name: 验证配置 + command: cat /etc/apt/sources.list.d/hashicorp.list + register: config_check + changed_when: false + + - name: 显示配置内容 + debug: + msg: "HashiCorp APT 源配置: {{ config_check.stdout }}" \ No newline at end of file diff --git a/configuration/playbooks/fix-nomad-cluster.yml b/configuration/playbooks/fix-nomad-cluster.yml new file mode 100644 index 0000000..f546ff7 --- /dev/null +++ b/configuration/playbooks/fix-nomad-cluster.yml @@ -0,0 +1,98 @@ +--- +- name: Fix Nomad Cluster Configuration + hosts: nomad_servers + become: yes + vars: + nomad_servers_list: + - "100.116.158.95" # semaphore + - "100.103.147.94" # ash2e + - "100.81.26.3" # ash1d + - "100.90.159.68" # ch2 + - "{{ ansible_default_ipv4.address }}" # ch3 (will be determined dynamically) + + tasks: + - name: Stop Nomad service + systemd: + name: nomad + state: stopped + ignore_errors: yes + + - name: Create nomad user + user: + name: nomad + system: yes + shell: /bin/false + home: /opt/nomad + create_home: no + + - name: Create Nomad configuration directory + file: + path: /etc/nomad.d + state: directory + mode: '0755' + + - name: Create Nomad data directory + file: + path: /opt/nomad/data + state: directory + mode: '0755' + owner: nomad + group: nomad + ignore_errors: yes + + - name: Create Nomad log directory + file: + path: /var/log/nomad + state: directory + mode: '0755' + owner: nomad + group: nomad + ignore_errors: yes + + - name: Generate Nomad server configuration + template: + src: nomad-server.hcl.j2 + dest: /etc/nomad.d/nomad.hcl + mode: 
'0644' + notify: restart nomad + + - name: Create Nomad systemd service file + copy: + content: | + [Unit] + Description=Nomad + Documentation=https://www.nomadproject.io/ + Requires=network-online.target + After=network-online.target + ConditionFileNotEmpty=/etc/nomad.d/nomad.hcl + + [Service] + Type=notify + User=nomad + Group=nomad + ExecStart=/usr/bin/nomad agent -config=/etc/nomad.d/nomad.hcl + ExecReload=/bin/kill -HUP $MAINPID + KillMode=process + Restart=on-failure + LimitNOFILE=65536 + + [Install] + WantedBy=multi-user.target + dest: /etc/systemd/system/nomad.service + mode: '0644' + + - name: Reload systemd daemon + systemd: + daemon_reload: yes + + - name: Enable and start Nomad service + systemd: + name: nomad + enabled: yes + state: started + + handlers: + - name: restart nomad + systemd: + name: nomad + state: restarted \ No newline at end of file diff --git a/configuration/playbooks/fix-server-config.yml b/configuration/playbooks/fix-server-config.yml new file mode 100644 index 0000000..aa44bc4 --- /dev/null +++ b/configuration/playbooks/fix-server-config.yml @@ -0,0 +1,109 @@ +--- +- name: Fix Nomad server configuration + hosts: nomad_servers + become: yes + tasks: + - name: Stop Nomad service + systemd: + name: nomad + state: stopped + + - name: Backup current configuration + copy: + src: /etc/nomad.d/nomad.hcl + dest: /etc/nomad.d/nomad.hcl.backup-server-fix + remote_src: yes + + - name: Create clean server configuration + copy: + content: | + datacenter = "{{ nomad_datacenter }}" + region = "{{ nomad_region }}" + data_dir = "/opt/nomad/data" + bind_addr = "{{ ansible_default_ipv4.address }}" + + server { + enabled = true + bootstrap_expect = {{ nomad_bootstrap_expect }} + encrypt = "{{ nomad_encrypt_key }}" + + retry_join = [ + "100.116.158.95", + "100.103.147.94", + "100.81.26.3", + "100.90.159.68", + "100.86.141.112" + ] + } + + client { + enabled = true + } + + ui { + enabled = true + } + + addresses { + http = "0.0.0.0" + rpc = "{{ 
ansible_default_ipv4.address }}" + serf = "{{ ansible_default_ipv4.address }}" + } + + ports { + http = 4646 + rpc = 4647 + serf = 4648 + } + + plugin "podman" { + config { + socket_path = "unix:///run/podman/podman.sock" + volumes { + enabled = true + } + recover_stopped = true + } + } + + consul { + auto_advertise = false + server_auto_join = false + client_auto_join = false + } + + log_level = "INFO" + log_file = "/var/log/nomad/nomad.log" + dest: /etc/nomad.d/nomad.hcl + owner: nomad + group: nomad + mode: '0640' + + - name: Ensure Podman is installed + package: + name: podman + state: present + + - name: Enable and start Podman socket + systemd: + name: podman.socket + enabled: yes + state: started + + - name: Validate Nomad configuration + shell: /usr/local/bin/nomad config validate /etc/nomad.d/nomad.hcl || /usr/bin/nomad config validate /etc/nomad.d/nomad.hcl + register: config_validation + failed_when: config_validation.rc != 0 + + - name: Start Nomad service + systemd: + name: nomad + state: started + enabled: yes + + - name: Wait for Nomad to be ready + wait_for: + port: 4646 + host: localhost + delay: 10 + timeout: 60 \ No newline at end of file diff --git a/configuration/playbooks/fix-server-network-config.yml b/configuration/playbooks/fix-server-network-config.yml new file mode 100644 index 0000000..dab81fa --- /dev/null +++ b/configuration/playbooks/fix-server-network-config.yml @@ -0,0 +1,103 @@ +--- +- name: Fix Nomad server network configuration + hosts: nomad_servers + become: yes + vars: + server_ips: + semaphore: "100.116.158.95" + ash2e: "100.103.147.94" + ash1d: "100.81.26.3" + ch2: "100.90.159.68" + ch3: "100.86.141.112" + tasks: + - name: Stop Nomad service + systemd: + name: nomad + state: stopped + + - name: Get server IP for this host + set_fact: + server_ip: "{{ server_ips[inventory_hostname] }}" + + - name: Create corrected server configuration + copy: + content: | + datacenter = "{{ nomad_datacenter }}" + region = "{{ nomad_region }}" 
+ data_dir = "/opt/nomad/data" + bind_addr = "{{ server_ip }}" + + server { + enabled = true + bootstrap_expect = {{ nomad_bootstrap_expect }} + encrypt = "{{ nomad_encrypt_key }}" + + retry_join = [ + "100.116.158.95", + "100.103.147.94", + "100.81.26.3", + "100.90.159.68", + "100.86.141.112" + ] + } + + client { + enabled = true + } + + ui { + enabled = true + } + + addresses { + http = "0.0.0.0" + rpc = "{{ server_ip }}" + serf = "{{ server_ip }}" + } + + ports { + http = 4646 + rpc = 4647 + serf = 4648 + } + + plugin "podman" { + config { + socket_path = "unix:///run/podman/podman.sock" + volumes { + enabled = true + } + recover_stopped = true + } + } + + consul { + auto_advertise = false + server_auto_join = false + client_auto_join = false + } + + log_level = "INFO" + log_file = "/var/log/nomad/nomad.log" + dest: /etc/nomad.d/nomad.hcl + owner: nomad + group: nomad + mode: '0640' + + - name: Validate Nomad configuration + shell: /usr/local/bin/nomad config validate /etc/nomad.d/nomad.hcl || /usr/bin/nomad config validate /etc/nomad.d/nomad.hcl + register: config_validation + failed_when: config_validation.rc != 0 + + - name: Start Nomad service + systemd: + name: nomad + state: started + enabled: yes + + - name: Wait for Nomad to be ready + wait_for: + port: 4646 + host: localhost + delay: 10 + timeout: 60 \ No newline at end of file diff --git a/configuration/playbooks/fix-warden-compose.yml b/configuration/playbooks/fix-warden-compose.yml new file mode 100644 index 0000000..b904d65 --- /dev/null +++ b/configuration/playbooks/fix-warden-compose.yml @@ -0,0 +1,39 @@ +--- +- name: Fix Warden docker-compose.yml + hosts: warden + become: yes + gather_facts: no + + tasks: + - name: Ensure /opt/warden directory exists + file: + path: /opt/warden + state: directory + owner: root + group: root + mode: '0755' + + - name: Create or update docker-compose.yml with correct indentation + copy: + dest: /opt/warden/docker-compose.yml + content: | + services: + vaultwarden: 
+ image: hub.git4ta.fun/vaultwarden/server:latest + security_opt: + - "seccomp=unconfined" + env_file: + - .env + volumes: + - ./data:/data + ports: + - "980:80" + restart: always + networks: + - vaultwarden_network + + networks: + vaultwarden_network: + owner: root + group: root + mode: '0644' \ No newline at end of file diff --git a/configuration/playbooks/hack-podman-upgrade.yml b/configuration/playbooks/hack-podman-upgrade.yml new file mode 100644 index 0000000..96439b8 --- /dev/null +++ b/configuration/playbooks/hack-podman-upgrade.yml @@ -0,0 +1,67 @@ +--- +- name: 强制升级 Podman 到最新版本 + hosts: warden + become: yes + gather_facts: yes + + tasks: + - name: 检查当前 Podman 版本 + shell: podman --version + register: current_podman_version + ignore_errors: yes + + - name: 显示当前版本 + debug: + msg: "升级前版本: {{ current_podman_version.stdout if current_podman_version.rc == 0 else '未安装' }}" + + - name: 卸载现有 Podman + shell: apt-get remove -y --purge podman* containerd* runc* + ignore_errors: yes + + - name: 清理残留配置 + shell: | + rm -rf /etc/containers + rm -rf /usr/share/containers + rm -rf /var/lib/containers + ignore_errors: yes + + - name: 直接下载并安装最新版Podman二进制文件 + shell: | + # 清理可能存在的旧版本 + rm -f /tmp/podman-latest.tar.gz + rm -f /usr/local/bin/podman + + # 获取最新版本号 + LATEST_VERSION="v5.6.1" # 硬编码最新版本避免网络问题 + echo "安装版本: $LATEST_VERSION" + + # 使用GitHub镜像站点下载二进制文件 + echo "使用GitHub镜像站点下载..." + wget -O /tmp/podman-latest.tar.gz "https://gh.git4ta.fun/github.com/containers/podman/releases/download/${LATEST_VERSION}/podman-linux-static-amd64.tar.gz" + + # 检查文件是否下载成功,如果失败尝试直接下载 + if [ ! -f /tmp/podman-latest.tar.gz ]; then + echo "镜像下载失败,尝试直接下载..." + wget -O /tmp/podman-latest.tar.gz "https://github.com/containers/podman/releases/download/${LATEST_VERSION}/podman-linux-static-amd64.tar.gz" + fi + + # 解压并安装 + tar -xzf /tmp/podman-latest.tar.gz -C /usr/local/bin/ --strip-components=1 + chmod +x /usr/local/bin/podman + + # 更新PATH + echo 'export PATH=/usr/local/bin:$PATH' >> /etc/profile + . 
/etc/profile + + # 验证安装 + /usr/local/bin/podman --version + ignore_errors: yes + + - name: 验证安装结果 + shell: podman --version + register: new_podman_version + ignore_errors: yes + + - name: 显示最终版本 + debug: + msg: "升级后版本: {{ new_podman_version.stdout if new_podman_version.rc == 0 else '安装失败' }}" \ No newline at end of file diff --git a/configuration/playbooks/install-configure-nomad-podman-driver.yml b/configuration/playbooks/install-configure-nomad-podman-driver.yml new file mode 100644 index 0000000..88b66ef --- /dev/null +++ b/configuration/playbooks/install-configure-nomad-podman-driver.yml @@ -0,0 +1,161 @@ +--- +- name: Install and Configure Nomad Podman Driver on Client Nodes + hosts: nomad_clients + become: yes + vars: + nomad_plugin_dir: "/opt/nomad/plugins" + + tasks: + - name: Create backup directory with timestamp + set_fact: + backup_dir: "/root/backup/{{ ansible_date_time.date }}_{{ ansible_date_time.hour }}{{ ansible_date_time.minute }}{{ ansible_date_time.second }}" + + - name: Create backup directory + file: + path: "{{ backup_dir }}" + state: directory + mode: '0755' + + - name: Backup current Nomad configuration + copy: + src: /etc/nomad.d/nomad.hcl + dest: "{{ backup_dir }}/nomad.hcl.backup" + remote_src: yes + ignore_errors: yes + + - name: Backup current apt sources + shell: | + cp -r /etc/apt/sources.list* {{ backup_dir }}/ + dpkg --get-selections > {{ backup_dir }}/installed_packages.txt + ignore_errors: yes + + - name: Create temporary directory for apt + file: + path: /tmp/apt-temp + state: directory + mode: '1777' + + - name: Download HashiCorp GPG key + get_url: + url: https://apt.releases.hashicorp.com/gpg + dest: /tmp/hashicorp.gpg + mode: '0644' + environment: + TMPDIR: /tmp/apt-temp + + - name: Install HashiCorp GPG key + shell: | + gpg --dearmor < /tmp/hashicorp.gpg > /usr/share/keyrings/hashicorp-archive-keyring.gpg + environment: + TMPDIR: /tmp/apt-temp + + - name: Add HashiCorp repository + lineinfile: + path: 
/etc/apt/sources.list.d/hashicorp.list + line: "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com {{ ansible_distribution_release }} main" + create: yes + mode: '0644' + + - name: Update apt cache + apt: + update_cache: yes + environment: + TMPDIR: /tmp/apt-temp + ignore_errors: yes + + - name: Install nomad-driver-podman + apt: + name: nomad-driver-podman + state: present + environment: + TMPDIR: /tmp/apt-temp + + - name: Create Nomad plugin directory + file: + path: "{{ nomad_plugin_dir }}" + state: directory + owner: nomad + group: nomad + mode: '0755' + + - name: Create symlink for nomad-driver-podman in plugin directory + file: + src: /usr/bin/nomad-driver-podman + dest: "{{ nomad_plugin_dir }}/nomad-driver-podman" + state: link + owner: nomad + group: nomad + + - name: Get server IP address + shell: | + ip route get 1.1.1.1 | grep -oP 'src \K\S+' + register: server_ip_result + changed_when: false + + - name: Set server IP fact + set_fact: + server_ip: "{{ server_ip_result.stdout }}" + + - name: Stop Nomad service + systemd: + name: nomad + state: stopped + + - name: Create updated Nomad client configuration + copy: + content: | + datacenter = "{{ nomad_datacenter }}" + data_dir = "/opt/nomad/data" + log_level = "INFO" + bind_addr = "{{ server_ip }}" + + server { + enabled = false + } + + client { + enabled = true + servers = ["100.117.106.136:4647", "100.116.80.94:4647", "100.97.62.111:4647", "100.116.112.45:4647", "100.84.197.26:4647"] + } + + plugin_dir = "{{ nomad_plugin_dir }}" + + plugin "nomad-driver-podman" { + config { + volumes { + enabled = true + } + recover_stopped = true + } + } + + consul { + address = "127.0.0.1:8500" + } + dest: /etc/nomad.d/nomad.hcl + owner: nomad + group: nomad + mode: '0640' + backup: yes + + - name: Validate Nomad configuration + shell: nomad config validate /etc/nomad.d/nomad.hcl + register: nomad_validate + failed_when: nomad_validate.rc != 0 + + - name: Start Nomad 
service + systemd: + name: nomad + state: started + enabled: yes + + - name: Wait for Nomad to be ready + wait_for: + port: 4646 + host: "{{ server_ip }}" + delay: 5 + timeout: 60 + + - name: Display backup location + debug: + msg: "Backup created at: {{ backup_dir }}" \ No newline at end of file diff --git a/configuration/playbooks/integrated-podman-setup.yml b/configuration/playbooks/integrated-podman-setup.yml new file mode 100644 index 0000000..871f85e --- /dev/null +++ b/configuration/playbooks/integrated-podman-setup.yml @@ -0,0 +1,218 @@ +--- +- name: Integrated Podman Setup - Remove Docker, Install and Configure Podman with Compose for Nomad + hosts: all + become: yes + gather_facts: yes + + tasks: + - name: 显示当前处理的节点 + debug: + msg: "🔧 开始集成 Podman 设置: {{ inventory_hostname }}" + + - name: 检查 Docker 服务状态 + shell: systemctl is-active docker 2>/dev/null || echo "inactive" + register: docker_status + changed_when: false + + - name: 停止 Docker 服务 + systemd: + name: docker + state: stopped + enabled: no + ignore_errors: yes + when: docker_status.stdout == "active" + + - name: 停止 Docker socket + systemd: + name: docker.socket + state: stopped + enabled: no + ignore_errors: yes + + - name: 移除 Docker 相关包 + apt: + name: + - docker-ce + - docker-ce-cli + - containerd.io + - docker-buildx-plugin + - docker-compose-plugin + - docker.io + - docker-doc + - docker-compose + - docker-registry + - containerd + - runc + state: absent + purge: yes + ignore_errors: yes + + - name: 清理 Docker 数据目录 + file: + path: "{{ item }}" + state: absent + loop: + - /var/lib/docker + - /var/lib/containerd + - /etc/docker + - /etc/containerd + ignore_errors: yes + + - name: 清理 Docker 用户组 + group: + name: docker + state: absent + ignore_errors: yes + + - name: 更新包缓存 + apt: + update_cache: yes + cache_valid_time: 3600 + + - name: 安装 Podman 及相关工具 + apt: + name: + - podman + - buildah + - skopeo + - python3-pip + - python3-setuptools + state: present + retries: 3 + delay: 10 + + - name: 安装 Podman 
Compose via pip + pip: + name: podman-compose + state: present + ignore_errors: yes + + - name: 启用 Podman socket 服务 + systemd: + name: podman.socket + enabled: yes + state: started + ignore_errors: yes + + - name: 创建 Podman 用户服务目录 + file: + path: /etc/systemd/user + state: directory + mode: '0755' + + - name: 验证 Podman 安装 + shell: podman --version + register: podman_version + + - name: 验证 Podman Compose 安装 + shell: podman-compose --version 2>/dev/null || echo "未安装" + register: podman_compose_version + + - name: 检查 Docker 清理状态 + shell: systemctl is-active docker 2>/dev/null || echo "已移除" + register: final_docker_status + + - name: 显示 Docker 移除和 Podman 安装结果 + debug: + msg: | + ✅ 节点 {{ inventory_hostname }} Docker 移除和 Podman 安装完成 + 🐳 Docker 状态: {{ final_docker_status.stdout }} + 📦 Podman 版本: {{ podman_version.stdout }} + 🔧 Compose 状态: {{ podman_compose_version.stdout }} + + - name: 创建 Podman 系统配置目录 + file: + path: /etc/containers + state: directory + mode: '0755' + + - name: 配置 Podman 使用系统 socket + copy: + content: | + [engine] + # 使用系统级 socket 而不是用户级 socket + active_service = "system" + [engine.service_destinations] + [engine.service_destinations.system] + uri = "unix:///run/podman/podman.sock" + dest: /etc/containers/containers.conf + mode: '0644' + + - name: 检查是否存在 nomad 用户 + getent: + database: passwd + key: nomad + register: nomad_user_check + ignore_errors: yes + + - name: 为 nomad 用户创建配置目录 + file: + path: "/home/nomad/.config/containers" + state: directory + owner: nomad + group: nomad + mode: '0755' + when: nomad_user_check is succeeded + + - name: 为 nomad 用户配置 Podman + copy: + content: | + [engine] + active_service = "system" + [engine.service_destinations] + [engine.service_destinations.system] + uri = "unix:///run/podman/podman.sock" + dest: /home/nomad/.config/containers/containers.conf + owner: nomad + group: nomad + mode: '0644' + when: nomad_user_check is succeeded + + - name: 将 nomad 用户添加到 podman 组 + user: + name: nomad + groups: podman + append: yes + 
when: nomad_user_check is succeeded + ignore_errors: yes + + - name: 创建 podman 组(如果不存在) + group: + name: podman + state: present + ignore_errors: yes + + - name: 设置 podman socket 目录权限 + file: + path: /run/podman + state: directory + mode: '0755' + group: podman + ignore_errors: yes + + - name: 验证 Podman socket 权限 + file: + path: /run/podman/podman.sock + mode: '0666' + when: nomad_user_check is succeeded + ignore_errors: yes + + - name: 测试 Podman 功能 + shell: podman info + register: podman_info + ignore_errors: yes + + - name: 清理 apt 缓存 + apt: + autoclean: yes + autoremove: yes + + - name: 显示最终配置结果 + debug: + msg: | + 🎉 节点 {{ inventory_hostname }} 集成 Podman 设置完成! + 📦 Podman 版本: {{ podman_version.stdout }} + 🐳 Podman Compose: {{ podman_compose_version.stdout }} + 👤 Nomad 用户: {{ 'FOUND' if nomad_user_check is succeeded else 'NOT FOUND' }} + 🔧 Podman 状态: {{ 'SUCCESS' if podman_info.rc == 0 else 'WARNING' }} + 🚀 Docker 已移除,Podman 已配置为与 Nomad 集成 \ No newline at end of file diff --git a/configuration/playbooks/manual-run-nomad-germany.yml b/configuration/playbooks/manual-run-nomad-germany.yml new file mode 100644 index 0000000..4b8d417 --- /dev/null +++ b/configuration/playbooks/manual-run-nomad-germany.yml @@ -0,0 +1,17 @@ +- name: Manually run Nomad agent to capture output + hosts: germany + gather_facts: false + tasks: + - name: Run nomad agent directly + command: /snap/bin/nomad agent -config=/etc/nomad.d/nomad.hcl + register: nomad_agent_output + ignore_errors: true + + - name: Display agent output + debug: + msg: | + --- Nomad Agent STDOUT --- + {{ nomad_agent_output.stdout }} + + --- Nomad Agent STDERR --- + {{ nomad_agent_output.stderr }} \ No newline at end of file diff --git a/configuration/playbooks/read-nomad-config-germany.yml b/configuration/playbooks/read-nomad-config-germany.yml new file mode 100644 index 0000000..66ad8c7 --- /dev/null +++ b/configuration/playbooks/read-nomad-config-germany.yml @@ -0,0 +1,12 @@ +- name: Read Nomad config on germany + 
hosts: germany + gather_facts: false + tasks: + - name: Read nomad.hcl + command: cat /etc/nomad.d/nomad.hcl + register: nomad_config + ignore_errors: true + + - name: Display config + debug: + msg: "{{ nomad_config.stdout }}" \ No newline at end of file diff --git a/configuration/playbooks/remove-docker-install-podman-with-compose.yml b/configuration/playbooks/remove-docker-install-podman-with-compose.yml new file mode 100644 index 0000000..686b660 --- /dev/null +++ b/configuration/playbooks/remove-docker-install-podman-with-compose.yml @@ -0,0 +1,126 @@ +--- +- name: 移除 Docker 并安装带 Compose 功能的 Podman + hosts: all + become: yes + gather_facts: yes + + tasks: + - name: 显示当前处理的节点 + debug: + msg: "🔧 正在处理节点: {{ inventory_hostname }}" + + - name: 检查 Docker 服务状态 + shell: systemctl is-active docker 2>/dev/null || echo "inactive" + register: docker_status + changed_when: false + + - name: 停止 Docker 服务 + systemd: + name: docker + state: stopped + enabled: no + ignore_errors: yes + when: docker_status.stdout == "active" + + - name: 停止 Docker socket + systemd: + name: docker.socket + state: stopped + enabled: no + ignore_errors: yes + + - name: 移除 Docker 相关包 + apt: + name: + - docker-ce + - docker-ce-cli + - containerd.io + - docker-buildx-plugin + - docker-compose-plugin + - docker.io + - docker-doc + - docker-compose + - docker-registry + - containerd + - runc + state: absent + purge: yes + ignore_errors: yes + + - name: 清理 Docker 数据目录 + file: + path: "{{ item }}" + state: absent + loop: + - /var/lib/docker + - /var/lib/containerd + - /etc/docker + - /etc/containerd + ignore_errors: yes + + - name: 清理 Docker 用户组 + group: + name: docker + state: absent + ignore_errors: yes + + - name: 更新包缓存 + apt: + update_cache: yes + cache_valid_time: 3600 + + - name: 安装 Podman 及相关工具 + apt: + name: + - podman + - buildah + - skopeo + - python3-pip + - python3-setuptools + state: present + retries: 3 + delay: 10 + + - name: 安装 Podman Compose via pip + pip: + name: podman-compose + state: 
present + ignore_errors: yes + + - name: Enable the Podman socket service + systemd: + name: podman.socket + enabled: yes + state: started + ignore_errors: yes + + - name: Create the Podman user service directory + file: + path: /etc/systemd/user + state: directory + mode: '0755' + + - name: Verify the Podman installation + shell: podman --version + register: podman_version + + - name: Verify the podman-compose installation + shell: podman-compose --version 2>/dev/null || echo "not installed" + register: podman_compose_version + + - name: Confirm Docker removal + shell: systemctl is-active docker 2>/dev/null || echo "removed" + register: final_docker_status + + - name: Show node results + debug: + msg: | + ✅ Node {{ inventory_hostname }} processed + 🐳 Docker status: {{ final_docker_status.stdout }} + 📦 Podman version: {{ podman_version.stdout }} + 🔧 Compose status: {{ podman_compose_version.stdout }} + + - name: Clean the apt cache + apt: + autoclean: yes + autoremove: yes \ No newline at end of file diff --git a/configuration/playbooks/setup-new-nomad-nodes.yml b/configuration/playbooks/setup-new-nomad-nodes.yml index 802587d..5be605e 100644 --- a/configuration/playbooks/setup-new-nomad-nodes.yml +++ b/configuration/playbooks/setup-new-nomad-nodes.yml @@ -1,6 +1,6 @@ --- - name: Install and configure new Nomad server nodes - hosts: ash2e,ash1d,ch2 + hosts: influxdb1 become: yes gather_facts: no diff --git a/configuration/playbooks/test-podman-snap-migration.yml b/configuration/playbooks/test-podman-snap-migration.yml new file mode 100644 index 0000000..dc1241c --- /dev/null +++ b/configuration/playbooks/test-podman-snap-migration.yml @@ -0,0 +1,100 @@ +--- +- name: Test switching Podman to the snap build (ch2 node) + hosts: ch2 + become: yes + gather_facts: yes + + tasks: + - name: Check current Podman version and install method + shell: | + echo "=== Current Podman info ===" + podman --version + echo "Install path: $(which podman)" + echo "=== Snap status ===" + which snap || echo "snap not installed" + snap list podman 2>/dev/null || echo "Podman snap not installed" + echo "=== Package manager status ===" + dpkg -l | grep podman || echo "not installed via apt" + register: current_status + + - name: Show current status + debug: + msg: "{{ 
current_status.stdout }}" + + - name: Check whether snap is installed + shell: which snap + register: snap_check + ignore_errors: yes + changed_when: false + + - name: Install snapd (if missing) + apt: + name: snapd + state: present + when: snap_check.rc != 0 + + - name: Ensure the snapd service is running + systemd: + name: snapd + state: started + enabled: yes + + - name: Check the current Podman snap version + shell: snap info podman + register: snap_podman_info + ignore_errors: yes + + - name: Show available Podman snap versions + debug: + msg: "{{ snap_podman_info.stdout if snap_podman_info.rc == 0 else 'unable to query snap podman info' }}" + + - name: Stop current Podman services + systemd: + name: podman + state: stopped + ignore_errors: yes + + - name: Remove the apt-installed Podman + apt: + name: podman + state: absent + purge: yes + ignore_errors: yes + + - name: Install the Podman snap (edge channel) + snap: + name: podman + state: present + classic: yes + channel: edge + + - name: Create symlink (keep the podman command available) + file: + src: /snap/bin/podman + dest: /usr/local/bin/podman + state: link + force: yes + + - name: Verify the snap Podman installation + shell: | + /snap/bin/podman --version + which podman + register: snap_podman_verify + + - name: Show install result + debug: + msg: | + ✅ Snap Podman installed + 🚀 Version: {{ snap_podman_verify.stdout_lines[0] }} + 📍 Path: {{ snap_podman_verify.stdout_lines[1] }} + + - name: Test basic Podman functionality + shell: | + /snap/bin/podman version + /snap/bin/podman info --format json | jq -r '.host.arch' + register: podman_test + ignore_errors: yes + + - name: Show test result + debug: + msg: "Podman test result: {{ podman_test.stdout if podman_test.rc == 0 else 'test failed' }}" \ No newline at end of file diff --git a/configuration/playbooks/upgrade-podman-to-5.yml b/configuration/playbooks/upgrade-podman-to-5.yml new file mode 100644 index 0000000..823fa0c --- /dev/null +++ b/configuration/playbooks/upgrade-podman-to-5.yml @@ -0,0 +1,77 @@ +--- +- name: Upgrade Podman to the latest version (warden node trial) + hosts: warden + become: yes + gather_facts: yes + + tasks: + - name: Check the current Podman version + shell: podman --version + register: current_podman_version + 
ignore_errors: yes + + - name: Show current version + debug: + msg: "Current Podman version: {{ current_podman_version.stdout if current_podman_version.rc == 0 else 'not installed or unavailable' }}" + + - name: Back up existing Podman configuration + shell: | + if [ -d /etc/containers ]; then + cp -r /etc/containers /etc/containers.backup.$(date +%Y%m%d) + fi + if [ -d /usr/share/containers ]; then + cp -r /usr/share/containers /usr/share/containers.backup.$(date +%Y%m%d) + fi + ignore_errors: yes + + - name: Add the Kubic repository (HTTP, unsigned) + shell: | + # Add the repo and skip signature verification + echo "deb [trusted=yes] http://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/xUbuntu_22.04/ /" > /etc/apt/sources.list.d/kubic-containers.list + + - name: Update package lists (allow insecure repositories) + shell: apt-get update -o Acquire::AllowInsecureRepositories=true -o Acquire::AllowDowngradeToInsecureRepositories=true + + - name: Check Podman versions available in the repo + shell: apt-cache policy podman + register: podman_versions + + - name: Show available Podman versions + debug: + msg: "{{ podman_versions.stdout }}" + + - name: Install Podman 5.x (forcing unauthenticated packages) + shell: apt-get install -y --allow-unauthenticated --allow-downgrades --allow-remove-essential --allow-change-held-packages podman + + - name: Verify the Podman 5.x installation + shell: | + podman --version + podman info --format json | jq -r '.Version.Version' + register: podman_5_verify + + - name: Show upgrade result + debug: + msg: | + ✅ Podman upgrade complete + 🚀 New version: {{ podman_5_verify.stdout_lines[0] }} + 📊 Detailed version: {{ podman_5_verify.stdout_lines[1] }} + + - name: Test basic functionality + shell: | + podman run --rm hello-world + register: podman_test + ignore_errors: yes + + - name: Show test result + debug: + msg: "Podman functional test: {{ 'passed' if podman_test.rc == 0 else 'failed - ' + podman_test.stderr }}" + + - name: Check related service status + shell: | + systemctl status podman.socket 2>/dev/null || echo "podman.socket not running" + systemctl status containerd 2>/dev/null || echo "containerd not running" + register: service_status + + - name: Show service status + debug: + msg: "{{ service_status.stdout }}" \ No newline at end of file diff --git 
a/configuration/podman-remote-static-linux_amd64 b/configuration/podman-remote-static-linux_amd64 new file mode 100755 index 0000000..86532f8 Binary files /dev/null and b/configuration/podman-remote-static-linux_amd64 differ diff --git a/consul-cluster-nomad.nomad b/consul-cluster-nomad.nomad new file mode 100644 index 0000000..6791d61 --- /dev/null +++ b/consul-cluster-nomad.nomad @@ -0,0 +1,81 @@ +job "consul-cluster" { + datacenters = ["dc1"] + type = "service" + + constraint { + attribute = "${node.unique.name}" + operator = "regexp" + value = "^(master|ash3c|hcs)$" + } + + group "consul" { + count = 3 + + network { + port "http" { + static = 8500 + } + port "serf_lan" { + static = 8301 + } + port "serf_wan" { + static = 8302 + } + port "server" { + static = 8300 + } + port "dns" { + static = 8600 + } + } + + service { + name = "consul" + port = "http" + + check { + type = "http" + path = "/v1/status/leader" + interval = "10s" + timeout = "2s" + } + } + + task "consul" { + driver = "podman" + + config { + image = "consul:1.15.4" + network_mode = "host" + + args = [ + "agent", + "-server", + "-bootstrap-expect=3", + "-ui", + "-data-dir=/consul/data", + "-config-dir=/consul/config", + "-bind=${attr.unique.network.ip-address}", + "-client=0.0.0.0", + "-retry-join=100.117.106.136", + "-retry-join=100.116.80.94", + "-retry-join=100.84.197.26" + ] + + volumes = [ + "consul-data:/consul/data", + "consul-config:/consul/config" + ] + } + + resources { + cpu = 500 + memory = 512 + } + + env { + CONSUL_BIND_INTERFACE = "tailscale0" + } + } + } +} \ No newline at end of file diff --git a/consul-cluster.nomad b/consul-cluster.nomad new file mode 100644 index 0000000..401ff21 --- /dev/null +++ b/consul-cluster.nomad @@ -0,0 +1,118 @@ +job "consul-cluster" { + datacenters = ["dc1"] + type = "service" + + # Pin to the designated nodes + constraint { + attribute = "${node.unique.name}" + operator = "regexp" + value = "(hcs|master|ash3c)" + } + + group "consul-servers" { + count = 3
+ + # Only one Consul instance per node + constraint { + operator = "distinct_hosts" + value = "true" + } + + # Network configuration + network { + mode = "host" + port "http" { + static = 8500 + } + port "rpc" { + static = 8300 + } + port "serf_lan" { + static = 8301 + } + port "serf_wan" { + static = 8302 + } + port "grpc" { + static = 8502 + } + } + + # Persistent storage + volume "consul-data" { + type = "host" + read_only = false + source = "consul-data" + } + + task "consul" { + driver = "podman" + + volume_mount { + volume = "consul-data" + destination = "/consul/data" + read_only = false + } + + config { + image = "docker.io/hashicorp/consul:1.17" + ports = ["http", "rpc", "serf_lan", "serf_wan", "grpc"] + + args = [ + "agent", + "-server", + "-bootstrap-expect=3", + "-datacenter=dc1", + "-data-dir=/consul/data", + "-log-level=INFO", + "-node=${node.unique.name}", + "-bind=${NOMAD_IP_serf_lan}", + "-client=0.0.0.0", + "-retry-join=100.84.197.26", + "-retry-join=100.117.106.136", + "-retry-join=100.116.80.94", + "-ui-config-enabled=true", + "-connect-enabled=true" + ] + } + + # Environment variables + env { + CONSUL_ALLOW_PRIVILEGED_PORTS = "true" + } + + # Resources + resources { + cpu = 500 + memory = 512 + } + + # Health check and service registration + service { + name = "consul" + port = "http" + + tags = [ + "consul", + "server", + "${node.unique.name}" + ] + + check { + type = "http" + path = "/v1/status/leader" + interval = "10s" + timeout = "3s" + } + } + + # Restart policy + restart { + attempts = 3 + interval = "30m" + delay = "15s" + mode = "fail" + } + } + } +} \ No newline at end of file diff --git a/docs/setup/zsh-configuration.md b/docs/setup/zsh-configuration.md deleted file mode 100644 index 73ec6df..0000000 --- a/docs/setup/zsh-configuration.md +++ /dev/null @@ -1,240 +0,0 @@ -# ZSH Configuration Summary - -## Installed and Configured Components - -### 1. Base Components -- ✅ **oh-my-zsh**: installed and configured -- ✅ **zsh**: version 5.9 -- ✅ **Powerline fonts**: installed -- ✅ **tmux**: installed - -### 2. 
Core Plugins -- ✅ **git**: Git integration and aliases -- ✅ **docker**: Docker completion and aliases -- ✅ **docker-compose**: Docker Compose support -- ✅ **ansible**: Ansible command completion -- ✅ **terraform**: Terraform/OpenTofu support -- ✅ **kubectl**: Kubernetes command completion -- ✅ **helm**: Helm package manager support -- ✅ **aws**: AWS CLI support -- ✅ **gcloud**: Google Cloud CLI support - -### 3. Enhancement Plugins -- ✅ **zsh-autosuggestions**: command auto-suggestions -- ✅ **zsh-syntax-highlighting**: syntax highlighting -- ✅ **zsh-completions**: extra completions -- ✅ **colored-man-pages**: colored man pages -- ✅ **command-not-found**: command-not-found hints -- ✅ **extract**: archive extraction -- ✅ **history-substring-search**: history search -- ✅ **sudo**: sudo support -- ✅ **systemd**: systemd service management -- ✅ **tmux**: tmux integration -- ✅ **vscode**: VS Code integration -- ✅ **web-search**: web search -- ✅ **z**: smart directory jumping - -### 4. Theme -- ✅ **agnoster**: feature-rich theme with Git status display - -## Custom Aliases - -### Project Management Aliases -```bash -mgmt # cd into the mgmt project directory -mgmt-status # show project status -mgmt-deploy # quick deploy -mgmt-cleanup # clean up the environment -mgmt-swarm # Swarm management -mgmt-tofu # OpenTofu management -``` - -### Ansible Aliases -```bash -ansible-check # syntax check -ansible-deploy # deploy -ansible-ping # connectivity test -ansible-vault # secrets management -ansible-galaxy # role management -``` - -### OpenTofu/Terraform Aliases -```bash -tofu-init # init -tofu-plan # plan -tofu-apply # apply -tofu-destroy # destroy -tofu-output # output -tofu-validate # validate -tofu-fmt # format -``` - -### Docker Aliases -```bash -d # docker -dc # docker-compose -dps # docker ps -dpsa # docker ps -a -di # docker images -dex # docker exec -it -dlog # docker logs -f -dclean # docker system prune -f -``` - -### Docker Swarm Aliases -```bash -dswarm # docker swarm -dstack # docker stack -dservice # docker service -dnode # docker node -dnetwork # docker network -dsecret # docker secret -dconfig # docker config -``` - -### Kubernetes Aliases -```bash -k # kubectl -kgp # kubectl get pods -kgs # kubectl get services -kgd # kubectl get deployments -kgn # kubectl get nodes -kaf # kubectl apply -f -kdf # kubectl delete -f -kl # kubectl logs -f -``` - -### Git Aliases -```bash -gs # git status -ga # git add -gc # git commit -gp # git push -gl # git pull -gd # git diff -gb # git 
branch -gco # git checkout -``` - -### System Aliases -```bash -ll # ls -alF -la # ls -A -l # ls -CF -.. # cd .. -... # cd ../.. -.... # cd ../../.. -grep # grep --color=auto -ports # netstat -tuln -myip # get public IP -speedtest # network speed test -psg # ps aux | grep -top # htop -``` - -## Configuration File Locations - -- **Main config**: `~/.zshrc` -- **Custom aliases**: `~/.oh-my-zsh/custom/aliases.zsh` -- **Proxy config**: `/root/mgmt/configuration/proxy.env` - -## Usage - -### Start ZSH -```bash -zsh -``` - -### Reload configuration -```bash -source ~/.zshrc -``` - -### List all aliases -```bash -alias -``` - -### Filter aliases -```bash -alias | grep docker -alias | grep mgmt -``` - -## Features - -### 1. Auto-suggestions -- Suggestions from command history appear as you type -- Press `→` to accept a suggestion - -### 2. Syntax highlighting -- Commands are highlighted live as you type -- Invalid commands are shown in red - -### 3. Smart completion -- Completion for all installed tools -- File path completion -- Command argument completion - -### 4. History search -- Search history with the `↑` `↓` keys -- Supports partial-match search - -### 5. Directory jumping -- Use the `z` command to jump to frequently used directories -- Ranked by visit frequency and recency - -### 6. Proxy support -- Proxy configuration is loaded automatically -- HTTP/HTTPS proxies supported - -## Troubleshooting - -### If aliases don't work -```bash -# Check whether aliases are loaded -alias | grep - -# Reload configuration -source ~/.zshrc -``` - -### If plugins don't work -```bash -# Check whether the plugin is installed -ls ~/.oh-my-zsh/plugins/ | grep - -# Check custom plugins -ls ~/.oh-my-zsh/custom/plugins/ -``` - -### If the theme renders incorrectly -```bash -# Check whether the fonts are installed -fc-list | grep Powerline - -# Try another theme -# Edit ZSH_THEME in ~/.zshrc -``` - -## Suggested Extensions - -### Additional plugins worth adding -- **fzf**: fuzzy finding -- **bat**: a better cat -- **exa**: a better ls -- **ripgrep**: a faster grep -- **fd**: a faster find - -### Additional aliases worth adding -- Add more aliases for personal workflows -- Alias frequently used command combinations -- Alias project-specific commands - -## Performance Notes - -- The configured plugin count is moderate and does not noticeably slow startup -- `zsh-completions` improves completion performance -- History settings are tuned to avoid excessive memory use - -Setup complete! You now have a powerful, highly customized ZSH environment tailored to this management system's needs. diff --git a/install-podman-driver.nomad b/install-podman-driver.nomad new file mode 100644 index 0000000..70c9b19 --- /dev/null +++ b/install-podman-driver.nomad @@ -0,0 +1,110 @@ +job "install-podman-driver" { + datacenters = ["dc1"] + type = "system" # run on every node + + group "install" { + task "install-podman" { + driver = "exec" + + config { + command = "bash" + args = [ + "-c", + <<-EOF + set -euo pipefail + export 
PATH="/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin" + + # Required tools + if ! command -v jq >/dev/null 2>&1 || ! command -v unzip >/dev/null 2>&1 || ! command -v wget >/dev/null 2>&1; then + echo "Installing dependencies (jq unzip wget)..." + sudo -n apt update -y || true + sudo -n apt install -y jq unzip wget || true + fi + + # Install Podman (if missing) + if ! command -v podman >/dev/null 2>&1; then + echo "Installing Podman..." + sudo -n apt update -y || true + sudo -n apt install -y podman || true + sudo -n systemctl enable podman || true + else + echo "Podman already installed" + fi + + # Enable and start podman.socket so Nomad can reach it + sudo -n systemctl enable --now podman.socket || true + if getent group podman >/dev/null 2>&1; then + sudo -n usermod -aG podman nomad || true + fi + + # Install the Nomad Podman driver plugin (ensure it is always present) + PODMAN_DRIVER_VERSION="0.6.1" + PLUGIN_DIR="/opt/nomad/data/plugins" + sudo -n mkdir -p "${PLUGIN_DIR}" || true + cd /tmp + if [ ! -x "${PLUGIN_DIR}/nomad-driver-podman" ]; then + echo "Installing nomad-driver-podman ${PODMAN_DRIVER_VERSION}..." 
+ wget -q "https://releases.hashicorp.com/nomad-driver-podman/${PODMAN_DRIVER_VERSION}/nomad-driver-podman_${PODMAN_DRIVER_VERSION}_linux_amd64.zip" + unzip -o "nomad-driver-podman_${PODMAN_DRIVER_VERSION}_linux_amd64.zip" + sudo -n mv -f nomad-driver-podman "${PLUGIN_DIR}/" + sudo -n chmod +x "${PLUGIN_DIR}/nomad-driver-podman" + sudo -n chown -R nomad:nomad "${PLUGIN_DIR}" + rm -f "nomad-driver-podman_${PODMAN_DRIVER_VERSION}_linux_amd64.zip" + else + echo "nomad-driver-podman already present in ${PLUGIN_DIR}" + fi + + # Update the plugin_dir setting in /etc/nomad.d/nomad.hcl + if [ -f /etc/nomad.d/nomad.hcl ]; then + if grep -q "^plugin_dir\s*=\s*\"" /etc/nomad.d/nomad.hcl; then + sudo -n sed -i 's#^plugin_dir\s*=\s*\".*\"#plugin_dir = "/opt/nomad/data/plugins"#' /etc/nomad.d/nomad.hcl || true + else + echo 'plugin_dir = "/opt/nomad/data/plugins"' | sudo -n tee -a /etc/nomad.d/nomad.hcl >/dev/null || true + fi + fi + + # Restart the Nomad service to load the plugin + sudo -n systemctl restart nomad || true + echo "Waiting for Nomad to restart..." + sleep 15 + + # Check whether Nomad detected the Podman driver + if /usr/local/bin/nomad node status -self -json 2>/dev/null | jq -r '.Drivers.podman.Detected' | grep -q "true"; then + echo "Podman driver successfully loaded" + exit 0 + fi + + echo "Podman driver not detected yet, retrying once after socket restart..." 
+ sudo -n systemctl restart podman.socket || true + sleep 5 + if /usr/local/bin/nomad node status -self -json 2>/dev/null | jq -r '.Drivers.podman.Detected' | grep -q "true"; then + echo "Podman driver successfully loaded after socket restart" + exit 0 + else + echo "Podman driver still not detected; manual investigation may be required" + exit 1 + fi + EOF + ] + } + + resources { + cpu = 200 + memory = 256 + } + + // Run with root privileges + // user = "root" + # Run the task as the nomad user to avoid client policies that forbid root + user = "nomad" + + # Ensure the task completes successfully + restart { + attempts = 1 + interval = "24h" + delay = "60s" + mode = "fail" + } + } + } +} \ No newline at end of file diff --git a/purge_stale_nodes.sh b/purge_stale_nodes.sh new file mode 100755 index 0000000..4ff0f6b --- /dev/null +++ b/purge_stale_nodes.sh @@ -0,0 +1,39 @@ +#!/bin/bash +set -euo pipefail + +ADDR="http://100.81.26.3:4646" +# If NOMAD_TOKEN is set, prepare the auth header (single quotes survive the later eval) +HDR="" +if [ -n "${NOMAD_TOKEN:-}" ]; then + HDR="-H 'X-Nomad-Token: $NOMAD_TOKEN'" +fi + +echo "--- Node list (Before) ---" +nomad node status -address="$ADDR" + +echo +echo "--- Searching for stale nodes to purge ---" + +# Use jq to select nodes precisely from the JSON output of nomad node status +# Criteria: Status is "down" and the Name matches the list +IDS_TO_PURGE=$(nomad node status -address="$ADDR" -json | jq -r '.[] | select(.Status == "down" and (.Name | test("^(ch3|ch2|ash1d|ash2e|semaphore)$"))) | .ID') + +if [[ -z "$IDS_TO_PURGE" ]]; then + echo "✅ No matching 'down' nodes found; nothing to purge." +else + echo "Node IDs queued for purge:" + echo "$IDS_TO_PURGE" + echo + + # Loop over the IDs and purge each via the HTTP API with curl + for NODE_ID in $IDS_TO_PURGE; do + echo "===> Purging node: $NODE_ID" + # Build the curl command; eval handles a possibly-empty $HDR correctly + cmd="curl -sS -XPOST $HDR -w ' -> HTTP %{http_code}\n' '$ADDR/v1/node/$NODE_ID/purge'" + eval $cmd + done +fi + +echo +echo "--- Node list (After) ---" +nomad node status -address="$ADDR" \ No newline at end of file diff --git a/snippets/zsh/zshrc-minimal.sh b/snippets/zsh/zshrc-minimal.sh deleted file mode 100644 index 7fc136c..0000000 --- a/snippets/zsh/zshrc-minimal.sh +++ 
/dev/null @@ -1,106 +0,0 @@ -#!/bin/bash - -# Minimal ZSH config - for quick deployment -# Usage: curl -fsSL https://your-gitea.com/ben/mgmt/raw/branch/main/snippets/zsh/zshrc-minimal.sh | bash - -set -euo pipefail - -# Color definitions -RED='\033[0;31m' -GREEN='\033[0;32m' -YELLOW='\033[1;33m' -BLUE='\033[0;34m' -NC='\033[0m' - -log_info() { echo -e "${BLUE}[INFO]${NC} $1"; } -log_success() { echo -e "${GREEN}[SUCCESS]${NC} $1"; } -log_error() { echo -e "${RED}[ERROR]${NC} $1"; } - -# Check for root privileges -if [[ $EUID -ne 0 ]]; then - log_error "root privileges required" - exit 1 -fi - -log_info "Installing minimal ZSH configuration..." - -# Install dependencies -apt update && apt install -y zsh git curl fonts-powerline - -# Install oh-my-zsh -if [[ ! -d "$HOME/.oh-my-zsh" ]]; then - RUNZSH=no CHSH=no sh -c "$(curl -fsSL https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh)" -fi - -# Install key plugins -custom_dir="$HOME/.oh-my-zsh/custom/plugins" -mkdir -p "$custom_dir" - -[[ ! -d "$custom_dir/zsh-autosuggestions" ]] && git clone https://github.com/zsh-users/zsh-autosuggestions "$custom_dir/zsh-autosuggestions" -[[ ! -d "$custom_dir/zsh-syntax-highlighting" ]] && git clone https://github.com/zsh-users/zsh-syntax-highlighting.git "$custom_dir/zsh-syntax-highlighting" - -# Create the minimal config -cat > "$HOME/.zshrc" << 'EOF' -# Oh My Zsh configuration -export ZSH="$HOME/.oh-my-zsh" -ZSH_THEME="agnoster" - -plugins=( - git - docker - ansible - terraform - kubectl - zsh-autosuggestions - zsh-syntax-highlighting -) - -source $ZSH/oh-my-zsh.sh - -# Basic aliases -alias ll='ls -alF' -alias la='ls -A' -alias l='ls -CF' -alias ..='cd ..' -alias ...='cd ../..' 
-alias grep='grep --color=auto' - -# Docker aliases -alias d='docker' -alias dps='docker ps' -alias dpsa='docker ps -a' -alias dex='docker exec -it' -alias dlog='docker logs -f' - -# Kubernetes aliases -alias k='kubectl' -alias kgp='kubectl get pods' -alias kgs='kubectl get services' -alias kgd='kubectl get deployments' - -# Git aliases -alias gs='git status' -alias ga='git add' -alias gc='git commit' -alias gp='git push' -alias gl='git pull' - -# History settings -HISTSIZE=10000 -SAVEHIST=10000 -HISTFILE=~/.zsh_history -setopt SHARE_HISTORY -setopt HIST_IGNORE_DUPS - -# Autosuggestion settings -ZSH_AUTOSUGGEST_HIGHLIGHT_STYLE='fg=8' -ZSH_AUTOSUGGEST_STRATEGY=(history completion) - -echo "🚀 ZSH setup complete!" -EOF - -# Set the default shell -chsh -s "$(which zsh)" - -log_success "Minimal ZSH configuration installed!" -log_info "Log in again or run: source ~/.zshrc" diff --git a/test-job.nomad b/test-job.nomad new file mode 100644 index 0000000..bc0e9f7 --- /dev/null +++ b/test-job.nomad @@ -0,0 +1,40 @@ +job "test-nginx" { + datacenters = ["dc1"] + type = "service" + + group "web" { + count = 1 + + network { + port "http" { + static = 8080 + } + } + + task "nginx" { + driver = "podman" + + config { + image = "nginx:alpine" + ports = ["http"] + } + + resources { + cpu = 100 + memory = 128 + } + + service { + name = "nginx-test" + port = "http" + + check { + type = "http" + path = "/" + interval = "10s" + timeout = "3s" + } + } + } + } +} \ No newline at end of file
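The node-selection step in `purge_stale_nodes.sh` hinges on one jq filter: keep only nodes whose `Status` is `down` and whose `Name` matches the stale-node list. The same logic can be exercised offline with a minimal Python sketch; the sample records below are hand-written stand-ins shaped like the `ID`/`Name`/`Status` fields of `nomad node status -json` output, not data from a live cluster.

```python
import json
import re

# Hand-written sample shaped like `nomad node status -json` output
# (only the fields the jq filter in purge_stale_nodes.sh actually reads).
nodes = json.loads("""[
  {"ID": "d2e4ceee", "Name": "ch3",       "Status": "down"},
  {"ID": "3521e4a1", "Name": "ch2",       "Status": "down"},
  {"ID": "ec4bf738", "Name": "pdns",      "Status": "ready"},
  {"ID": "84913d2f", "Name": "semaphore", "Status": "down"},
  {"ID": "a3d0b0e3", "Name": "Syd",       "Status": "ready"}
]""")

# Same anchored alternation used in the script's jq test(...)
STALE = re.compile(r"^(ch3|ch2|ash1d|ash2e|semaphore)$")

def ids_to_purge(nodes):
    # Mirrors: select(.Status == "down" and (.Name | test("^(...)$"))) | .ID
    return [n["ID"] for n in nodes if n["Status"] == "down" and STALE.match(n["Name"])]

print(ids_to_purge(nodes))  # ['d2e4ceee', '3521e4a1', '84913d2f']
```

Ready nodes and down nodes outside the name list (e.g. `pdns`, `Syd`) are left alone, which is why the script anchors the regex with `^...$` instead of a plain substring match.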
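Similarly, `install-podman-driver.nomad` decides success by piping `nomad node status -self -json` through `jq -r '.Drivers.podman.Detected'`. A Python stand-in of that lookup, run against a trimmed sample payload (the nesting follows the jq path used in the job; a real node's JSON carries many more fields):

```python
import json

# Trimmed stand-in for `nomad node status -self -json` output
node_status = json.loads('{"Drivers": {"podman": {"Detected": true, "Healthy": true}}}')

def podman_driver_detected(status):
    # Equivalent of the jq path .Drivers.podman.Detected, defaulting to False
    # when the driver entry is absent (e.g. before the plugin is installed).
    return bool(status.get("Drivers", {}).get("podman", {}).get("Detected", False))

print(podman_driver_detected(node_status))   # True
print(podman_driver_detected({"Drivers": {}}))  # False
```

Defaulting to `False` on a missing key matches the job's behavior, where an empty jq result fails the `grep -q "true"` check and triggers the socket-restart retry.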