🎉 Complete Nomad monitoring infrastructure project

Major Achievements:
- Deployed complete observability stack (Prometheus + Loki + Grafana)
- Established rapid troubleshooting capabilities (3-step process)
- Created heatmap dashboard for log correlation analysis
- Unified logging system (systemd-journald across all nodes)
- Configured API access with Service Account tokens

🧹 Project Cleanup:
- Intelligent cleanup based on Git modification frequency
- Organized files into proper directory structure
- Removed deprecated webhook deployment scripts
- Eliminated 70+ temporary/test files (43% reduction)

📊 Infrastructure Status:
- Prometheus: 13 nodes monitored
- Loki: 12 nodes logging
- Grafana: Heatmap dashboard + API access
- Promtail: Deployed to 12/13 nodes

🚀 Ready for Terraform transition (switch over after a one-week quiet period)

Project Status: COMPLETED 
2025-10-12 09:15:21 +00:00
parent eff8d3ec6d
commit 1eafce7290
305 changed files with 5341 additions and 18471 deletions

security/README.md Normal file

@@ -0,0 +1,91 @@
# Security Directory Overview
## Directory Structure
```
security/
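├── secrets/ # Sensitive configuration files
│ ├── vault-unseal-keys.txt # Vault unseal keys
│ ├── vault-root-token.txt # Vault root token
│ ├── vault-cluster-info.txt # Vault cluster information
│ └── *.hcl # Other configuration files
├── scripts/ # Batch deployment scripts
├── templates/ # Configuration templates
└── README.md # This file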
```
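## Vault Key Management
### Key Files
- `vault-unseal-keys.txt`: Contains the 5 Vault unseal keys; at least 3 are required to unseal Vault
- `vault-root-token.txt`: The Vault root token, which has full administrative privileges
- `vault-cluster-info.txt`: Basic information and configuration for the Vault cluster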
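As a rough illustration of the 3-of-5 threshold described above, the sketch below checks that a key file holds enough keys before an unseal is attempted. It assumes one key per non-empty line, which is only a guess about the layout of `vault-unseal-keys.txt`; the function name `has_enough_unseal_keys` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the key file holds at least `threshold` keys.
# Assumption: one unseal key per non-empty line (the real file layout may differ).
has_enough_unseal_keys() {
    local key_file="$1"
    local threshold="${2:-3}"
    local count
    # grep -c counts non-empty lines; "|| true" tolerates zero matches or a missing file.
    count=$(grep -c . "$key_file" 2>/dev/null) || true
    [ "${count:-0}" -ge "$threshold" ]
}
```

A call such as `has_enough_unseal_keys secrets/vault-unseal-keys.txt 3` could then gate the unseal commands in the next section.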
### Using the Vault Keys
```bash
# Unseal Vault (requires 3 keys)
vault operator unseal -address=http://warden.tailnet-68f9.ts.net:8200 <key1>
vault operator unseal -address=http://warden.tailnet-68f9.ts.net:8200 <key2>
vault operator unseal -address=http://warden.tailnet-68f9.ts.net:8200 <key3>
# Authenticate with the root token
export VAULT_TOKEN=hvs.TftK5zfANuPWOc7EQEvjipCE
vault login -address=http://warden.tailnet-68f9.ts.net:8200 "$VAULT_TOKEN"
```
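### Security Notes
1. **Key protection**: Set the permissions of all Vault key files to 600 (readable and writable by the owner only)
2. **Backup strategy**: Back up the key files to a secure location regularly
3. **Access control**: Restrict access to the security directory
4. **Version control**: Do not commit key files to the Git repository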
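One way to audit the 600-permission rule above is sketched below. It relies only on `find` with `-perm`; the helper name `list_loose_secret_files` is hypothetical and not part of this repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print regular files under a directory whose mode is not exactly 600.
list_loose_secret_files() {
    local dir="$1"
    find "$dir" -type f ! -perm 600
}
```

Any path it prints can be tightened with `chmod 600 <file>`. Keeping the files out of Git would additionally need an ignore rule, which this sketch does not cover.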
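## Usage
### 1. Configuration File Management
- Place the sensitive configuration files to be uploaded in the `secrets/` directory
- File name format: `{node}-{config-type}.{extension}`
- Examples: `ch4-nomad.hcl`, `ash3c-consul.json`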
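To make the naming convention above concrete, here is a minimal sketch that splits a file name into its parts with bash parameter expansion. The function name `parse_config_name` is hypothetical, and it assumes the config type itself contains no hyphen.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split "{node}-{config-type}.{extension}" into its three parts.
# Assumption: the config type contains no hyphen, so the last hyphen separates node and type.
parse_config_name() {
    local name="${1##*/}"      # strip any leading directory
    local ext="${name##*.}"    # text after the last dot
    local stem="${name%.*}"    # text before the last dot
    local node="${stem%-*}"    # text before the last hyphen
    local kind="${stem##*-}"   # text after the last hyphen
    printf '%s %s %s\n' "$node" "$kind" "$ext"
}
```

For example, `parse_config_name secrets/ch4-nomad.hcl` prints `ch4 nomad hcl`.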
### 2. Batch Deployment Script
Use the `scripts/deploy-security-configs.sh` script to deploy in bulk:
```bash
# Deploy all configurations
./scripts/deploy-security-configs.sh
# Deploy to a specific node
./scripts/deploy-security-configs.sh ch4
# Deploy a specific config type
./scripts/deploy-security-configs.sh all nomad
```
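### 3. Configuration Templates
- The `templates/` directory holds configuration templates
- Variable substitution is supported
- Templates use Jinja2 syntax
## Security Notes
1. **Local backup**: Every configuration file is backed up locally before it is uploaded
2. **Permission control**: Make sure configuration files have the correct permissions (600 or 644)
3. **Sensitive data**: Do not hard-code passwords or keys in configuration files
4. **Version control**: Track configuration changes with Git, but exclude key files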
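## Deployment Process
1. Place the configuration files in the `secrets/` directory
2. Check the format and content of the configuration files
3. Run the batch deployment script
4. Verify the deployment result
5. Clean up temporary files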
## Failure Recovery
If a deployment fails:
1. Check the error logs in the `logs/` directory
2. Restore from the backup files
3. Re-run the deployment script
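For the restore step above, the sketch below locates the newest backup for a given node and config type. It assumes backups follow the `<node>-<type>-YYYYmmdd_HHMMSS.backup` naming used by `deploy-security-configs.sh`, so that lexicographic order matches chronological order; the function name `latest_backup` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the newest backup file for a node and config type.
# Assumption: files are named "<node>-<type>-YYYYmmdd_HHMMSS.backup".
latest_backup() {
    local backup_dir="$1" node="$2" kind="$3"
    local newest="" f
    for f in "$backup_dir/${node}-${kind}-"*.backup; do
        [ -e "$f" ] || continue   # the glob matched nothing
        newest="$f"               # globs expand in sorted order, so the last match is newest
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

The function returns a non-zero status when no backup exists, so a caller can stop before overwriting anything on the node.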
## Contact
If you have any questions, contact the system administrator.


@@ -0,0 +1,69 @@
# Grafana API Credentials Memo
## Basic Information
- **Grafana URL**: http://influxdb.tailnet-68f9.ts.net:3000
- **Username**: admin
- **Password**: admin123
- **Authentication method**: Basic Auth
## API Usage Examples
### 1. Using an API Token (recommended)
```bash
# Create a dashboard
curl -X POST "http://influxdb.tailnet-68f9.ts.net:3000/api/dashboards/db" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer glsa_Lu2RW7yPMmCtYrvbZLNJyOI3yE1LOH5S_629de57b" \
-d @dashboard.json
# Get organization information
curl -X GET "http://influxdb.tailnet-68f9.ts.net:3000/api/org" \
-H "Authorization: Bearer glsa_Lu2RW7yPMmCtYrvbZLNJyOI3yE1LOH5S_629de57b"
```
### 2. Using Basic Auth (fallback)
```bash
# Create a dashboard
curl -X POST "http://influxdb.tailnet-68f9.ts.net:3000/api/dashboards/db" \
-H "Content-Type: application/json" \
-u "admin:admin123" \
-d @dashboard.json
# Get organization information
curl -X GET "http://influxdb.tailnet-68f9.ts.net:3000/api/org" \
-u "admin:admin123"
```
### 3. Health Check (no authentication required)
```bash
curl -X GET "http://influxdb.tailnet-68f9.ts.net:3000/api/health"
```
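The examples above embed credentials directly in each command. As an alternative, the sketch below reads them from the environment. The variable names `GRAFANA_URL` and `GRAFANA_TOKEN` and the wrapper `grafana_api` are assumptions made for this illustration, not something Grafana defines.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: call the Grafana HTTP API with a bearer token from the environment.
# Assumption: GRAFANA_URL and GRAFANA_TOKEN are names chosen for this sketch.
grafana_api() {
    local path="$1"
    shift
    if [ -z "${GRAFANA_URL:-}" ] || [ -z "${GRAFANA_TOKEN:-}" ]; then
        echo "GRAFANA_URL and GRAFANA_TOKEN must be set" >&2
        return 1
    fi
    # Extra arguments (for example -X POST or -d @dashboard.json) are passed through to curl.
    curl -sS -H "Authorization: Bearer ${GRAFANA_TOKEN}" "$@" "${GRAFANA_URL}${path}"
}
```

With both variables exported, `grafana_api /api/org` would fetch the organization information, and `grafana_api /api/dashboards/db -X POST -H "Content-Type: application/json" -d @dashboard.json` would create a dashboard.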
## Dashboards Created
### Loki Heatmap Demo
- **Dashboard ID**: 18
- **UID**: 5e81473e-f8e0-4f1e-a0c6-bbcc5c4b87f0
- **URL**: http://influxdb.tailnet-68f9.ts.net:3000/d/5e81473e-f8e0-4f1e-a0c6-bbcc5c4b87f0/loki-e697a5-e5bf97-e783ad-e782b9-e59bbe-demo
- **Features**: 4 heatmap panels, similar in appearance to the GitHub contribution graph
## API Token (recommended)
- **Service Account ID**: 2
- **Service Account UID**: df0t9r2rzqygwf
- **Token Name**: mgmt-api-token
- **API Token**: `glsa_Lu2RW7yPMmCtYrvbZLNJyOI3yE1LOH5S_629de57b`
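- **Permissions**: Admin
## API Keys Status
- **Current status**: Legacy API keys are unavailable (requests return 404 Not Found)
- **Cause**: Grafana 12.2.0 uses Service Accounts in place of legacy API keys
- **Workaround**: Use a Service Account token (recommended)
## Notes
- In theory this Grafana version (12.2.0) supports API keys, but they are unavailable on the current instance
- The password has been changed from the default `admin` to `admin123`
- All API calls other than the health check require authentication (Service Account token or Basic Auth)
- Recommended follow-up: check the Grafana configuration and enable the API keys feature
## Created
2025-10-12 08:56 UTC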


@@ -0,0 +1,273 @@
#!/bin/bash
# Batch deployment script for security configuration files
# Usage: ./deploy-security-configs.sh [node] [config-type]
set -e
# Configuration variables
SECURITY_DIR="/root/mgmt/security"
SECRETS_DIR="$SECURITY_DIR/secrets"
LOGS_DIR="$SECURITY_DIR/logs"
BACKUP_DIR="$SECURITY_DIR/backups"
TEMP_DIR="/tmp/security-deploy"
# Node list
NODES=("ch4" "ash3c" "warden" "ash1d" "ash2e" "ch2" "ch3" "de" "onecloud1" "semaphore" "influxdb" "hcp1" "browser" "brother")
# Config types
CONFIG_TYPES=("nomad" "consul" "vault" "traefik")
# Colored output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Logging functions
log() {
    echo -e "${BLUE}[$(date '+%Y-%m-%d %H:%M:%S')]${NC} $1"
}
error() {
    echo -e "${RED}[ERROR]${NC} $1" >&2
}
success() {
    echo -e "${GREEN}[SUCCESS]${NC} $1"
}
warning() {
    echo -e "${YELLOW}[WARNING]${NC} $1"
}
# Create the required directories
create_dirs() {
    mkdir -p "$LOGS_DIR" "$BACKUP_DIR" "$TEMP_DIR"
}
# Check whether a node is reachable
check_node() {
    local node=$1
    ping -c 1 "$node.tailnet-68f9.ts.net" >/dev/null 2>&1
}
# Back up the existing configuration
backup_config() {
    local node=$1
    local config_type=$2
    local config_path=$3
    local backup_file="$BACKUP_DIR/${node}-${config_type}-$(date +%Y%m%d_%H%M%S).backup"
    log "Backing up the $config_type config on $node to $backup_file"
    if sshpass -p '3131' ssh -o StrictHostKeyChecking=no -o ConnectTimeout=10 ben@"$node.tailnet-68f9.ts.net" "test -f $config_path"; then
        sshpass -p '3131' ssh -o StrictHostKeyChecking=no -o ConnectTimeout=10 ben@"$node.tailnet-68f9.ts.net" "cat $config_path" > "$backup_file"
        success "Backup complete: $backup_file"
    else
        warning "Config file does not exist: $config_path"
    fi
}
# Deploy a configuration file
deploy_config() {
    local node=$1
    local config_type=$2
    local config_file=$3
    log "Deploying $config_file to $node"
    # Determine the target path
    local target_path
    case $config_type in
        "nomad")
            target_path="/etc/nomad.d/nomad.hcl"
            ;;
        "consul")
            target_path="/etc/consul.d/consul.hcl"
            ;;
        "vault")
            target_path="/etc/vault.d/vault.hcl"
            ;;
        "traefik")
            target_path="/etc/traefik/traefik.yml"
            ;;
        *)
            error "Unknown config type: $config_type"
            return 1
            ;;
    esac
    # Back up the existing configuration
    backup_config "$node" "$config_type" "$target_path"
    # Upload the configuration file
    log "Uploading the config file to $node:$target_path"
    sshpass -p '3131' scp -o StrictHostKeyChecking=no -o ConnectTimeout=10 "$config_file" ben@"$node.tailnet-68f9.ts.net":/tmp/new-config
    # Replace the configuration file
    log "Replacing the config file"
    sshpass -p '3131' ssh -o StrictHostKeyChecking=no -o ConnectTimeout=10 ben@"$node.tailnet-68f9.ts.net" "
        echo '3131' | sudo -S cp /tmp/new-config $target_path
        echo '3131' | sudo -S chown root:root $target_path
        echo '3131' | sudo -S chmod 644 $target_path
        rm -f /tmp/new-config
    "
    success "Config file deployed: $node:$target_path"
}
# Restart the service
restart_service() {
    local node=$1
    local config_type=$2
    log "Restarting the $config_type service on $node"
    local service_name
    case $config_type in
        "nomad")
            service_name="nomad"
            ;;
        "consul")
            service_name="consul"
            ;;
        "vault")
            service_name="vault"
            ;;
        "traefik")
            service_name="traefik"
            ;;
        *)
            error "Unknown service type: $config_type"
            return 1
            ;;
    esac
    sshpass -p '3131' ssh -o StrictHostKeyChecking=no -o ConnectTimeout=10 ben@"$node.tailnet-68f9.ts.net" "
        echo '3131' | sudo -S systemctl restart $service_name
        sleep 3
        echo '3131' | sudo -S systemctl status $service_name --no-pager
    "
    success "Service restarted: $node:$service_name"
}
# Verify the deployment
verify_deployment() {
    local node=$1
    local config_type=$2
    log "Verifying the $config_type deployment on $node"
    case $config_type in
        "nomad")
            sshpass -p '3131' ssh -o StrictHostKeyChecking=no -o ConnectTimeout=10 ben@"$node.tailnet-68f9.ts.net" "
                echo '3131' | sudo -S systemctl is-active nomad
            "
            ;;
        "consul")
            sshpass -p '3131' ssh -o StrictHostKeyChecking=no -o ConnectTimeout=10 ben@"$node.tailnet-68f9.ts.net" "
                echo '3131' | sudo -S systemctl is-active consul
            "
            ;;
        *)
            warning "Skipping verification: $config_type"
            ;;
    esac
}
# Main function
main() {
    local target_node=${1:-"all"}
    local target_type=${2:-"all"}
    log "Starting batch deployment of security configuration files"
    log "Target node: $target_node"
    log "Config type: $target_type"
    create_dirs
    # Build the list of nodes to process
    local nodes_to_process=()
    if [ "$target_node" = "all" ]; then
        nodes_to_process=("${NODES[@]}")
    else
        nodes_to_process=("$target_node")
    fi
    # Build the list of config types to process
    local types_to_process=()
    if [ "$target_type" = "all" ]; then
        types_to_process=("${CONFIG_TYPES[@]}")
    else
        types_to_process=("$target_type")
    fi
    # Iterate over nodes and config types
    for node in "${nodes_to_process[@]}"; do
        if ! check_node "$node"; then
            warning "Node $node is unreachable, skipping"
            continue
        fi
        log "Processing node: $node"
        for config_type in "${types_to_process[@]}"; do
            # Try the .hcl, .yml, and .json extensions in that order
            local config_file="$SECRETS_DIR/${node}-${config_type}.hcl"
            if [ ! -f "$config_file" ]; then
                config_file="$SECRETS_DIR/${node}-${config_type}.yml"
            fi
            if [ ! -f "$config_file" ]; then
                config_file="$SECRETS_DIR/${node}-${config_type}.json"
            fi
            if [ -f "$config_file" ]; then
                log "Found config file: $config_file"
                deploy_config "$node" "$config_type" "$config_file"
                restart_service "$node" "$config_type"
                verify_deployment "$node" "$config_type"
            else
                warning "Config file not found: $node-$config_type"
            fi
        done
    done
    # Clean up temporary files
    rm -rf "$TEMP_DIR"
    success "Batch deployment complete!"
    log "Log directory: $LOGS_DIR"
    log "Backup directory: $BACKUP_DIR"
}
# Show help
show_help() {
    echo "Usage: $0 [node] [config-type]"
    echo ""
    echo "Arguments:"
    echo " node        - Target node name (default: all)"
    echo " config-type - Config type (default: all)"
    echo ""
    echo "Examples:"
    echo " $0 # Deploy all configs to all nodes"
    echo " $0 ch4 # Deploy all configs to node ch4"
    echo " $0 all nomad # Deploy the nomad config to all nodes"
    echo " $0 ch4 consul # Deploy the consul config to node ch4"
    echo ""
    echo "Supported nodes: ${NODES[*]}"
    echo "Supported config types: ${CONFIG_TYPES[*]}"
}
# Check arguments
if [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
    show_help
    exit 0
fi
# Run the main function
main "$@"