feat(monitoring): add Telegraf monitoring configuration and disk monitoring script

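refactor(containers): migrate from Docker to Podman and update Nomad configuration
fix(config): fix proxy and alias configuration issues
docs(docs): update configuration file and script comments
chore(cleanup): remove unused Consul and Docker files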
2025-09-24 03:46:30 +00:00
parent 3f45ad8361
commit d0e7f64c1d
92 changed files with 3552 additions and 7737 deletions
--- a/docs/disk-management.md
+++ b/docs/disk-management.md
@@ -0,0 +1,169 @@
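+# Disk Management Tools Guide
+
+## 🔧 Tool Overview
+
+Three main disk management tools are provided for dealing with low disk space:
+
+### 1. Disk analysis tool (`disk-analysis-ncdu.yml`)
+Uses `ncdu` to analyze disk usage in depth and generate a detailed report.
+
+### 2. Disk cleanup tool (`disk-cleanup.yml`)
+Automatically cleans up system junk files, logs, caches, and similar data.
+
+### 3. Disk monitoring script (`disk-monitor.sh`)
+Checks disk usage on all nodes with a single command.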
+
+## 🚀 Quick Start
+
+### Monitor disk usage on all nodes
+```bash
+# Use the default threshold of 85%
+./scripts/utilities/disk-monitor.sh
+
+# Use a custom threshold of 90%
+./scripts/utilities/disk-monitor.sh 90
+```
+
+### Analyze disk usage on all nodes or a specific node
+```bash
+# Analyze all nodes
+ansible-playbook -i configuration/inventories/production/nomad-cluster.ini \
+  configuration/playbooks/disk-analysis-ncdu.yml
+
+# Analyze a specific node
+ansible-playbook -i configuration/inventories/production/nomad-cluster.ini \
+  configuration/playbooks/disk-analysis-ncdu.yml --limit semaphore
+```
+
+### Free up disk space
+```bash
+# Clean up all nodes (safe mode: container cleanup stays disabled)
+ansible-playbook -i configuration/inventories/production/nomad-cluster.ini \
+  configuration/playbooks/disk-cleanup.yml
+
+# Clean up a specific node
+ansible-playbook -i configuration/inventories/production/nomad-cluster.ini \
+  configuration/playbooks/disk-cleanup.yml --limit ash3c
+
+# Include container cleanup (use with caution: this stops all containers)
+ansible-playbook -i configuration/inventories/production/nomad-cluster.ini \
+  configuration/playbooks/disk-cleanup.yml -e cleanup_containers=true
+```
+
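+## 📊 Analysis Reports
+
+### ncdu file locations
+After the analysis finishes, the ncdu scan files are saved in the `/tmp/disk-analysis/` directory on each node:
+
+- `ncdu-root-<hostname>.json` - scan results for the root directory
+- `ncdu-var-<hostname>.json` - scan results for /var
+- `ncdu-opt-<hostname>.json` - scan results for /opt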
+
+### View the ncdu reports
+```bash
+# Browse the interactive report on the target node
+ncdu -f /tmp/disk-analysis/ncdu-root-semaphore.json
+
+# View the text report
+cat /tmp/disk-analysis/disk-report-semaphore.txt
+
+# View the cleanup suggestions
+cat /tmp/disk-analysis/cleanup-suggestions-semaphore.txt
+```
+
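+## 🧹 Cleanup Options
+
+### Default cleanup items
+- ✅ **System logs**: removes log files older than 7 days
+- ✅ **Package caches**: clears the APT/YUM caches
+- ✅ **Temporary files**: removes temporary files older than 7 days
+- ✅ **Core dumps**: deletes core dump files
+
+### Optional cleanup items
+- ⚠️ **Container cleanup**: must be enabled manually (`cleanup_containers=true`)
+  - Stops all containers
+  - Deletes unused containers, images, and volumes
+
+### Custom cleanup parameters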
+```bash
+ansible-playbook configuration/playbooks/disk-cleanup.yml \
+  -e cleanup_logs=false \
+  -e cleanup_cache=true \
+  -e cleanup_temp=true \
+  -e cleanup_containers=false
+```
+
+## 🚨 Emergency Handling
+
+### Disk usage > 95%
+```bash
+# 1. Immediately look for very large files (over 1 GiB)
+ansible all -i configuration/inventories/production/nomad-cluster.ini \
+  -m shell -a "find / -type f -size +1G -exec ls -lhS {} + 2>/dev/null | head -5"
+
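+# 2. Emergency cleanup (cleanup_containers=true stops all containers)
+ansible-playbook configuration/playbooks/disk-cleanup.yml \
+  -e cleanup_containers=true
+
+# 3. Manually truncate a large file (replace the example path with the real one)
+ansible all -m shell -a "truncate -s 0 /var/log/large.log"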
+```
+
+### Common locations of large files
+- `/var/log/` - system logs
+- `/tmp/` - temporary files
+- `/var/cache/` - package manager caches
+- `/opt/nomad/data/` - Nomad data
+- `~/.local/share/containers/` - Podman data (rootless storage)
+
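+## 📈 Routine Maintenance
+
+### Daily monitoring
+```bash
+# Add to crontab (runs every day at 09:00)
+0 9 * * * /root/mgmt/scripts/utilities/disk-monitor.sh 85
+```
+
+### Weekly cleanup
+```bash
+# Automatic cleanup every Sunday at 02:00
+0 2 * * 0 cd /root/mgmt && ansible-playbook configuration/playbooks/disk-cleanup.yml
+```
+
+### Monthly deep analysis
+```bash
+# Generate a detailed report on the 1st of every month at 03:00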
+0 3 1 * * cd /root/mgmt && ansible-playbook configuration/playbooks/disk-analysis-ncdu.yml
+```
+
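+## 🔍 Troubleshooting
+
+### ncdu installation fails
+```bash
+# Install manually
+ansible all -m package -a "name=ncdu state=present" --become
+```
+
+### Scan times out
+```bash
+# Raise the timeout to 600 seconds
+ansible-playbook configuration/playbooks/disk-analysis-ncdu.yml -e ansible_timeout=600
+```
+
+### Permission problems
+```bash
+# Make sure privilege escalation (sudo) is enabled
+ansible-playbook configuration/playbooks/disk-analysis-ncdu.yml --become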
+```
+
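+## 💡 Best Practices
+
+1. **Regular monitoring**: check disk usage every day
+2. **Preventive cleanup**: clean up proactively once usage exceeds 80%
+3. **Log rotation**: configure an appropriate log rotation policy
+4. **Container management**: regularly remove unused container images
+5. **Alerting**: set alert thresholds for disk usage
+
+---
+
+💡 **Tip**: Run `./scripts/utilities/disk-monitor.sh` to quickly check the status of all nodes!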
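
The guide in this diff describes `disk-monitor.sh` as a threshold-based usage check (default 85), but the script itself is not shown in this hunk. Purely as an illustration of the threshold logic, here is a minimal sketch for a single local node. The function name `check_usage` and the `WARN` output format are hypothetical, and the real script may work quite differently, for example by querying every node through Ansible.

```bash
#!/bin/sh
# Hypothetical helper, NOT the repository's disk-monitor.sh.
# Reads `df -P` style output on stdin and prints one line for every
# mount point whose usage is at or above the threshold (default 85).
# Limitation: mount points containing spaces are not handled.
check_usage() {
    threshold="${1:-85}"
    awk -v limit="$threshold" '
        NR > 1 {                      # skip the df header line
            use = $5                  # capacity column, e.g. "90%"
            sub(/%$/, "", use)        # strip the trailing percent sign
            if (use + 0 >= limit + 0) {
                printf "WARN %s %s%%\n", $6, use
            }
        }'
}
```

On a node, this could be fed from the POSIX output format of `df`, for example `df -P | check_usage 90`. Reading from stdin rather than calling `df` inside the function keeps the parsing step easy to try against saved output.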