Update personal branch #26
develop into xingyuanxin_branch 3 months ago
@@ -0,0 +1,125 @@
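# Wang Zuwang (王祖旺): Personal Weekly Plan
Following the direction of big data technology, this week focuses on in-depth study of distributed storage and compute frameworks, laying the groundwork for building big data processing capability.

## Core Study Tasks

### 1. In-Depth Study of the HDFS Distributed File System
**Study focus**
#### HDFS architecture
- NameNode metadata management
- DataNode block storage implementation
- Read/write paths and consistency guarantees
- Replica placement policy and rack awareness

#### Advanced features
- HDFS Federation architecture
- Snapshots
- Transparent encryption
- Erasure coding schemes

#### Operations and administration
- Balancer (load balancing tool)
- Disk Balancer (disk balancing)
- Access control (ACL) configuration
- Audit log analysis

**Task schedule**
- Monday: study the NameNode HA implementation and the ZKFC mechanism
- Tuesday: configure erasure coding hands-on and run performance tests
- Wednesday: analyze the RPC communication model in the HDFS source code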

### 2. Hands-On Study of the Hadoop Ecosystem
**Study focus**
#### YARN in depth
- Resource scheduling algorithms (Fair/Capacity)
- NodeManager resource isolation
- How the ApplicationMaster works
- Using the Timeline Server

#### Ecosystem components
- HBase integration with HDFS
- Hive data warehousing in practice
- ZooKeeper coordination service
- Flume data collection

**Task schedule**
- Thursday: set up a YARN HA cluster and test failover
- Friday: tune a Hive on Spark configuration hands-on
- Saturday morning: deploy and test an HBase cluster

### 3. Spark Core Engine
**Study focus**
#### Internals
- Properties of RDDs (resilient distributed datasets)
- DAG scheduling and execution plans
- Memory management
- Shuffle optimization strategies

#### Development practice
- DataFrame API programming
- Spark SQL optimization techniques
- Structured Streaming
- Performance tuning parameters

**Task schedule**
- Saturday afternoon: write Spark Core performance test cases
- Sunday: finish a Structured Streaming real-time processing demo
- Sunday evening: study the Spark shuffle source code

## Study Resources and Reference Material
**Core books**
- The 《Hadoop技术内幕》 (Hadoop Internals) series
- 《Spark权威指南》 (Spark: The Definitive Guide)
- 《大数据处理之道》 (The Way of Big Data Processing)

**Technical documentation**
- Official Apache technical white papers
- HDFS Architecture Guide
- Spark Performance Tuning Guide

**Lab environment**
- 3-node virtual machine cluster (8 cores / 16 GB RAM)
- CDH 6.3.2 distribution
- Spark 3.1.3
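
## Outcomes and Deliverables
**Expected results this week**
1. HDFS technical analysis report (with performance test data)
2. Deployment documentation for the Hadoop ecosystem components
3. A collection of Spark core example code
4. Mind-map summary of the technical principles

**Capability goals**
- Master HDFS advanced features and tuning methods
- Be able to integrate and deploy Hadoop ecosystem components
- Be fluent in developing with the core Spark APIs
- Understand how distributed compute scheduling works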
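
## Execution Strategy
**Time management**
- Weekdays: 19:00-23:00 (4 h)
- Weekends: 9:00-12:00 and 14:00-18:00 (7 h)
- 30 minutes of review every morning

**Study methods**
- Source-code analysis paired with hands-on verification
- Learning driven by performance benchmarks
- Comparative study of technical approaches
- Summaries written up as technical blog posts

**Progress tracking**
- Record progress in the GitHub repository daily
- Give a demo after finishing each module
- Track key problems as issues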
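
## Risk Contingency
**Potential challenges**
- Insufficient cluster resources affecting experiments
- Version compatibility problems
- Difficulty understanding complex concepts

**Mitigations**
- Keep the core components running first
- Use Docker to simplify the environment
- Cross-check multiple sources while studying
- Ask the technical community for help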
@@ -0,0 +1,135 @@
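# Shen Yongjia (沈永佳) - Week 6 Work Plan

## Basic Information
- **Name**: Shen Yongjia (沈永佳)
- **Week**: Week 6
- **Date**: Week 6 of 2025
- **Project**: Hadoop-Based Fault Detection and Automatic Recovery System

## Goals for This Week

Based on the project's core task specification, this week focuses on system architecture design and on defining the frontend-backend interface specification, laying the groundwork for the development phase that follows.

## Detailed Task Plan

### 1. Define the Frontend-Backend Interface Specification [New Task]
**Priority**: High
**Estimated effort**: 2 days
**Task description**:
- Write a unified API specification document
- Define RESTful API design standards
- Standardize request/response data formats
- Define error-code and status-code standards

**Work items**:
- **Interface naming conventions**:
  - Use RESTful style throughout
  - Use plural nouns in resource paths
  - Versioning strategy (`/api/v1/`)
- **Data format conventions**:
  - JSON everywhere
  - Standardized timestamp format
  - Pagination parameter conventions
- **Core interface definitions**:
  - Log upload: `POST /api/v1/log/upload`
  - Cluster status: `GET /api/v1/cluster/status`
  - Fault diagnosis: `POST /api/v1/llm/diagnose`
  - Repair execution: `POST /api/v1/repair/execute`
- **Error handling conventions**:
  - Standard use of HTTP status codes
  - Business error code definitions
  - Uniform error message format

**Deliverables**:
- API specification document
- Mock data examples for the interfaces
- Postman test collection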
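
The response-format, pagination, and error-code items in task 1 could take a shape like the following. This is a minimal sketch, not the project's actual specification: the envelope fields (`code`, `message`, `data`, `timestamp`), the numeric error codes, and the `paginate` helper are all assumptions made for illustration, and it uses only the standard library rather than FastAPI or Pydantic.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Any, Optional

# Hypothetical business error codes; the plan only says such codes will be defined.
ERR_OK = 0
ERR_INVALID_PARAM = 40001
ERR_NODE_UNREACHABLE = 50001


@dataclass
class ApiResponse:
    """Uniform envelope for every endpoint (field names are assumptions)."""
    code: int                   # business error code, 0 means success
    message: str                # short human-readable summary
    data: Optional[Any] = None  # payload, None on errors
    timestamp: str = ""         # ISO 8601 in UTC

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False)


def make_response(code, message, data=None, now=None):
    """Build an envelope, stamping it with the current UTC time by default."""
    now = now or datetime.now(timezone.utc)
    return ApiResponse(code, message, data, now.strftime("%Y-%m-%dT%H:%M:%SZ"))


def paginate(items, page=1, page_size=20):
    """Slice a list and return it with paging metadata (page is 1-based)."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    start = (page - 1) * page_size
    return {
        "items": items[start:start + page_size],
        "page": page,
        "page_size": page_size,
        "total": len(items),
    }


# Example: the third page of 45 log ids, wrapped in a success envelope.
response = make_response(ERR_OK, "ok", paginate(list(range(45)), page=3))
print(response.to_json())
```

Keeping the business code separate from the HTTP status lets the frontend branch on one field regardless of which endpoint answered; whether that split suits this project is a design decision for the specification review.
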

### 2. Set Up the Backend API Framework
**Priority**: High
**Estimated effort**: 1.5 days
**Task description**:
Build the backend service skeleton on the FastAPI framework.

**Work items**:
- Initialize the FastAPI project and configure dependencies
- Configure database connection pools (MySQL + Redis)
- Configure middleware (CORS, authentication, logging)
- Design the base routing structure
- Define data models (Pydantic)

**Technical points**:
- Use FastAPI + Uvicorn
- Integrate mysql-connector-python and redis
- Implement a uniform response format
- Add request parameter validation

### 3. Design the Structured Log Processing Module
**Priority**: Medium
**Estimated effort**: 1 day
**Task description**:
Design the logic that turns Hadoop logs into structured records.

**Work items**:
- Analyze the log formats of the Hadoop components
- Design the regular-expression matching rules
- Implement the log parsing function `preprocess_log()`
- Define the structured data model

**Technical specification**:
- Log format: `(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (\w+): (.*)`
- Output fields: timestamp, log_level, component, content
- Support filtering by ERROR/WARN level
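
The specification above is concrete enough to sketch `preprocess_log()` directly. The version below is only an illustration under assumptions: the function name, the regular expression, and the four output fields come from the plan, while the return type (a `dict`, or `None` for lines that do not match) and the `filter_alerts` helper are hypothetical.

```python
import re
from typing import Optional

# Pattern copied from the technical specification in this plan.
LOG_PATTERN = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (\w+): (.*)"
)


def preprocess_log(line: str) -> Optional[dict]:
    """Parse one log line into the four output fields, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    timestamp, log_level, component, content = match.groups()
    return {
        "timestamp": timestamp,
        "log_level": log_level,
        "component": component,
        "content": content,
    }


def filter_alerts(lines, levels=("ERROR", "WARN")):
    """Hypothetical helper: keep parsed records whose level is in `levels`."""
    records = (preprocess_log(line) for line in lines)
    return [r for r in records if r is not None and r["log_level"] in levels]


sample = [
    "2025-02-03 10:15:30 ERROR DataNode: Disk is full on /data1",
    "2025-02-03 10:15:31 INFO NameNode: Heartbeat received",
    "not a log line",
]
for record in filter_alerts(sample):
    print(record)
```

One caveat worth checking against real samples: Hadoop's stock log4j layout typically writes timestamps with comma-separated milliseconds (`10:15:30,123`) and fully qualified logger names containing dots, and neither would match the pattern as written. The log samples collected for the compatibility work may therefore call for extending the expression.
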

### 4. Design the Database Schema
**Priority**: Medium
**Estimated effort**: 0.5 days
**Task description**:
Design the MySQL table structure.

**Table design**:
- **cluster_node**: node information table
  - Fields: node_id, ip_address, role, status, last_heartbeat
- **fault_record**: fault record table
  - Fields: fault_id, fault_type, occur_time, diagnosis, fix_script, fix_status, exec_log
- **operation_log**: operation log table
  - Fields: log_id, user_id, operation, timestamp, operation_data, result
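
To make the three tables easier to review, here is a minimal sketch that creates them in an in-memory SQLite database. SQLite stands in for MySQL only so the example runs without a server. The table and column names come from the plan; every column type, key, and `NOT NULL` constraint is an assumption, as are the `create_schema` and `table_columns` helpers.

```python
import sqlite3

# Column types and constraints are assumptions; the plan lists names only.
SCHEMA = """
CREATE TABLE cluster_node (
    node_id        INTEGER PRIMARY KEY,
    ip_address     TEXT NOT NULL,
    role           TEXT NOT NULL,
    status         TEXT NOT NULL,
    last_heartbeat TEXT
);
CREATE TABLE fault_record (
    fault_id   INTEGER PRIMARY KEY,
    fault_type TEXT NOT NULL,
    occur_time TEXT NOT NULL,
    diagnosis  TEXT,
    fix_script TEXT,
    fix_status TEXT,
    exec_log   TEXT
);
CREATE TABLE operation_log (
    log_id         INTEGER PRIMARY KEY,
    user_id        INTEGER NOT NULL,
    operation      TEXT NOT NULL,
    timestamp      TEXT NOT NULL,
    operation_data TEXT,
    result         TEXT
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the three tables in the given connection."""
    conn.executescript(SCHEMA)


def table_columns(conn: sqlite3.Connection, table: str) -> list:
    """Return the column names of a table, in declaration order."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute(
    "INSERT INTO fault_record (fault_type, occur_time, fix_status) VALUES (?, ?, ?)",
    ("DataNode failure", "2025-02-03 10:15:30", "pending"),
)
print(table_columns(conn, "fault_record"))
```

A MySQL version would likely use `DATETIME` for the time columns and `AUTO_INCREMENT` keys, and might move `exec_log` into its own table if one fault can have several repair attempts.
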
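
## Risk Identification and Mitigation

### Technical Risks
1. **Complexity of defining the interface specification**
   - Risk: coordinating interfaces across modules is difficult
   - Mitigation: communicate with team members early and set up a review process

2. **Variety of log formats**
   - Risk: log formats differ between Hadoop versions
   - Mitigation: collect log samples from multiple versions and design for compatibility

### Schedule Risks
1. **Inaccurate effort estimates**
   - Risk: defining the interface specification may take longer than expected
   - Mitigation: work iteratively and define the core interfaces first

## Milestones for This Week

- **Wednesday**: first draft of the frontend-backend interface specification complete
- **Friday**: backend framework set up and basic endpoints implemented
- **Sunday**: log processing module design and database schema complete

## Preview of Next Week

1. Implement the Flume log collection configuration
2. Complete the backend API implementation
3. Start building the frontend page framework
4. Integrate the LLM diagnosis interface

## Learning and Improvement

- Study advanced FastAPI features in depth
- Research best practices for Hadoop log analysis
- Learn RESTful API design conventions
- Survey monitoring approaches for distributed systems

---
**Note**: This plan is based on the project's core task specification and will be updated promptly if anything changes.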
@@ -0,0 +1,152 @@
@startuml
title Fault Detection System - Prototype UI Design
skinparam defaultFontName Microsoft YaHei
skinparam backgroundColor #F8F9FA

!define LIGHTBLUE #E3F2FD
!define LIGHTGREEN #E8F5E8
!define LIGHTYELLOW #FFF3E0
!define LIGHTRED #FFEBEE

package "Main Layout" {
  rectangle "Top Navigation Bar" as TopNav {
    rectangle "Logo" as Logo
    rectangle "Cluster Status" as NavStatus
    rectangle "Log Query" as NavLogs
    rectangle "Fault Diagnosis" as NavDiagnose
    rectangle "Auto Repair" as NavRepair
    rectangle "System Configuration" as NavConfig
    rectangle "User Info" as UserInfo
  }

  rectangle "Sidebar" as Sidebar {
    rectangle "Dashboard" as MenuDash
    rectangle "Cluster Monitoring" as MenuCluster
    rectangle "Log Management" as MenuLogMgmt
    rectangle "Fault Handling" as MenuFault
    rectangle "Repair History" as MenuHistory
    rectangle "System Settings" as MenuSettings
  }

  rectangle "Main Content Area" as MainContent {
    rectangle "Content Display Area" as ContentArea
  }
}

package "Dashboard Page" as Dashboard {
  rectangle "Cluster Overview Card" as ClusterOverview LIGHTBLUE {
    rectangle "Node Status Summary" as NodeStats
    rectangle "Resource Utilization" as ResourceUsage
    rectangle "Alert Count" as AlertCount
  }

  rectangle "Real-Time Monitoring Charts" as RealtimeCharts LIGHTGREEN {
    rectangle "CPU Usage Trend" as CPUChart
    rectangle "Memory Usage Trend" as MemoryChart
    rectangle "Disk I/O Trend" as DiskChart
  }

  rectangle "Recent Faults" as RecentFaults LIGHTYELLOW {
    rectangle "Fault ID | Type | Time | Status" as FaultTable
  }
}

package "Log Query Page" as LogQuery {
  rectangle "Search Criteria" as SearchConditions LIGHTBLUE {
    rectangle "Time Range Picker" as TimeRange
    rectangle "Log Level Filter" as LogLevel
    rectangle "Keyword Search" as KeywordSearch
    rectangle "Host Filter" as HostFilter
  }

  rectangle "Log List" as LogList LIGHTGREEN {
    rectangle "Time | Host | Level | Message" as LogTable
    rectangle "Pagination Controls" as Pagination
  }

  rectangle "Action Buttons" as LogActions LIGHTYELLOW {
    rectangle "Export Logs" as ExportLogs
    rectangle "Start Diagnosis" as StartDiagnose
  }
}

package "Fault Diagnosis Page" as DiagnosePage {
  rectangle "Diagnosis Settings" as DiagnoseConfig LIGHTBLUE {
    rectangle "Select Log Range" as SelectLogs
    rectangle "Diagnosis Type" as DiagnoseType
    rectangle "AI Model" as ModelSelect
  }

  rectangle "Diagnosis Progress" as DiagnoseProgress LIGHTGREEN {
    rectangle "Progress Bar" as ProgressBar
    rectangle "Current Status" as CurrentStatus
    rectangle "Estimated Time" as EstimatedTime
  }

  rectangle "Diagnosis Result" as DiagnoseResult LIGHTYELLOW {
    rectangle "Fault Type" as FaultType
    rectangle "Root Cause Analysis" as RootCause
    rectangle "Repair Suggestions" as RepairSuggestion
    rectangle "Risk Level" as RiskLevel
  }
}

package "Auto Repair Page" as RepairPage {
  rectangle "Repair Plan" as RepairPlan LIGHTBLUE {
    rectangle "Repair Script Preview" as ScriptPreview
    rectangle "Impact Assessment" as ImpactAssessment
    rectangle "Rollback Plan" as RollbackPlan
  }

  rectangle "Risk Confirmation" as RiskConfirm LIGHTRED {
    rectangle "Risk Level Display" as RiskDisplay
    rectangle "Confirmation Checkbox" as ConfirmCheckbox
    rectangle "Execute Button" as ExecuteButton
  }

  rectangle "Execution Monitor" as ExecutionMonitor LIGHTGREEN {
    rectangle "Live Execution Log" as ExecutionLogs
    rectangle "Execution Status" as ExecutionStatus
    rectangle "Stop Button" as StopButton
  }
}

package "Modal Components" as Modals {
  rectangle "High-Risk Confirmation Modal" as HighRiskModal LIGHTRED {
    rectangle "Warning Icon" as WarningIcon
    rectangle "Risk Description" as RiskDescription
    rectangle "Confirm Button" as ConfirmBtn
    rectangle "Cancel Button" as CancelBtn
  }

  rectangle "Execution Result Modal" as ResultModal LIGHTGREEN {
    rectangle "Execution Status Icon" as StatusIcon
    rectangle "Result Summary" as ResultSummary
    rectangle "Detailed Log" as DetailedLogs
    rectangle "Close Button" as CloseBtn
  }
}

package "Interaction Flow" as InteractionFlow {
  LogQuery --> DiagnosePage : Select logs and start diagnosis
  DiagnosePage --> RepairPage : Diagnosis complete, repair plan generated
  RepairPage --> HighRiskModal : High-risk repairs require confirmation
  HighRiskModal --> ExecutionMonitor : Execution starts after confirmation
  ExecutionMonitor --> ResultModal : Show result when execution finishes

  Dashboard --> LogQuery : Click a fault to view related logs
  Dashboard --> RepairPage : Quick repair entry point
}

package "Responsive Design Notes" as ResponsiveDesign {
  note as ResponsiveNote
    Mobile adaptation:
    - Top navigation collapses into a hamburger menu
    - Sidebar can be collapsed/expanded
    - Tables scroll horizontally
    - Charts adapt to the screen width
    - Modals display full screen
  end note
}

@enduml
@@ -0,0 +1,29 @@
@startuml
title Log Diagnosis and Automatic Repair Flow

actor User
participant Frontend as FE
participant FastAPI as API
participant Flume
database MySQL as DB
queue Redis
participant LLM

Flume -> API : Push structured logs
API -> DB : Write fault_record
FE -> API : Query /api/logs/query
API -> FE : Return log list

API -> LLM : call_llm_diagnose(logs)
LLM --> API : Return FixCommand (JSON)
API -> DB : Write exec_log
API -> Redis : Cache/publish repair task
API -> FE : Push diagnosis result over WebSocket

FE -> API : /api/repair/execute
API -> "Repair Script" : Run Shell/Hadoop commands
"Repair Script" -> API : stdout/stderr
API -> DB : Update exec_log
API -> FE : Return execution result

@enduml
@@ -0,0 +1,25 @@
@startuml
title Overall Architecture of the Fault Detection System

node "Hadoop Cluster" {
  [NameNode]
  [DataNode] as DN1
  [DataNode] as DN2
}

cloud "Flume Agents" as Flume
Flume --> DN1 : Collect HDFS/YARN logs
Flume --> DN2 : Collect HDFS/YARN logs

component "FastAPI Service" as API
database "MySQL" as DB
queue "Redis" as Cache
API --> DB : Write/query fault records
API --> Cache : State cache/queue
API --> "LLM Diagnose" : Call the LLM\nReturn FixCommand

component "Frontend Web (Vue/React + ECharts)" as FE
FE --> API : /api/cluster/status\n/api/logs/query\n/api/diagnosis/result\n/api/repair/execute
API --> FE : Push status/diagnosis results over WebSocket

@enduml
@@ -0,0 +1,45 @@
@startuml
title Log Diagnosis and Automatic Repair - Activity Diagram
skinparam defaultFontName Microsoft YaHei

start
:Flume collects logs;
:FastAPI receives and parses logs;
:Save FaultRecord to MySQL;

partition "User/System Trigger" {
  if (Diagnosis needed?) then (yes)
    :Aggregate related logs;
    :Build prompt;
    :Call LLM for diagnosis;
    :Generate FixCommand (JSON);
    :Safety check (block high-risk commands);
  else (no)
    :Wait for new logs/user request;
    stop
  endif
}

if (Risk level == high?) then (yes)
  :Frontend modal asks for manual confirmation;
  if (User confirms execution?) then (yes)
    :Proceed with repair;
  else (no)
    :Record and notify that the repair was not executed;
    stop
  endif
endif

:Pre-repair checks (config/paths/permissions);
if (Pre-checks passed?) then (yes)
  :Run repair script;
  :Capture stdout/stderr;
  :Save ExecLog to MySQL;
  :Update status in Redis and push over WebSocket;
else (no)
  :Record failure reason;
endif

:Return result to frontend;
stop
@enduml
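
The decision points in this activity diagram (safety check, high-risk confirmation, pre-repair check) can be read as one small gating function. The sketch below is an illustration under assumptions, not the project's implementation: the deny-list patterns, the `is_safe` and `next_step` names, the returned state strings, and the choice to reject a script outright when the safety check fails are all hypothetical, since the diagram does not say what happens to a blocked command.

```python
import re
from typing import Optional

# Hypothetical deny-list; the diagram only says high-risk commands are blocked.
DENY_PATTERNS = [
    re.compile(r"\brm\s+-rf\s+/(\s|$)"),           # wiping the root filesystem
    re.compile(r"\bhdfs\s+namenode\s+-format\b"),  # reformatting the NameNode
    re.compile(r"\bmkfs\b"),                       # creating a new filesystem
]


def is_safe(fix_script: str) -> bool:
    """Return False if the script matches any deny-list pattern."""
    return not any(p.search(fix_script) for p in DENY_PATTERNS)


def next_step(
    fix_script: str,
    risk_level: str,
    user_confirmed: Optional[bool] = None,
    precheck_ok: bool = True,
) -> str:
    """Walk the diagram's branches in order and name the resulting state."""
    if not is_safe(fix_script):
        return "rejected"
    if risk_level == "high":
        if user_confirmed is None:   # the user has not been asked yet
            return "needs_confirmation"
        if not user_confirmed:       # the user declined in the modal
            return "not_executed"
    if not precheck_ok:
        return "precheck_failed"
    return "execute"


print(next_step("ssh dn 'clean_temp.sh'", "medium"))
print(next_step("hdfs namenode -format", "low"))
print(next_step("systemctl restart hadoop-datanode", "high"))
```

A deny-list is the simplest reading of "block high-risk commands"; an allow-list of known repair scripts would be stricter and may be the safer choice when the script text comes from an LLM.
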
@@ -0,0 +1,38 @@
@startuml
title Fault Detection System - Use Case Diagram
skinparam defaultFontName Microsoft YaHei

actor "Operations Engineer" as Ops
actor "Frontend User" as User
actor "Test Engineer" as QA

rectangle "Fault Detection System" {
  usecase "View Cluster Status" as UC_Status
  usecase "Query Logs" as UC_QueryLogs
  usecase "Start Fault Diagnosis" as UC_Diagnose
  usecase "Run Automatic Repair" as UC_Repair
  usecase "View Execution Logs" as UC_ExecLogs
  usecase "Configure Flume Collection" as UC_ConfigFlume
  usecase "Configure Alert Thresholds" as UC_ConfigAlert
  usecase "Export Fault and Diagnosis Reports" as UC_Export
  usecase "Generate FixCommand" as UC_FixCmd
  usecase "Command Safety Check" as UC_SafeCheck

  User --> UC_Status
  User --> UC_QueryLogs
  User --> UC_Diagnose
  User --> UC_Repair
  User --> UC_ExecLogs

  Ops --> UC_ConfigFlume
  Ops --> UC_ConfigAlert
  Ops --> UC_Repair
  Ops --> UC_Status

  QA --> UC_QueryLogs
  QA --> UC_Export

  UC_Diagnose --> UC_FixCmd : <<include>>
  UC_Repair --> UC_SafeCheck : <<include>>
}
@enduml
@@ -0,0 +1,130 @@
@startuml
title Fault Detection and Automatic Repair - Class Diagram
skinparam backgroundColor #FFFFFF
skinparam defaultFontName Microsoft YaHei
skinparam classAttributeIconSize 0

class FlumeAgent {
  +config : Map
  +start()
  +stop()
}

class LogEvent {
  +timestamp : datetime
  +host : string
  +source : string
  +level : string
  +message : string
  +raw : text
}

class FastAPIService {
  +ingestLog(e: LogEvent)
  +getClusterStatus()
  +queryLogs(filter)
  +diagnose(logs)
  +executeRepair(cmd: FixCommand)
}

class DiagnosisService {
  +callLLM(logs) : FixCommand
  +validateCommand(cmd: FixCommand) : bool
}

class LLMClient {
  +apiKey : string
  +endpoint : string
  +invoke(prompt) : string
}

class FixCommand {
  +fault_type : string
  +reason : string
  +fix_script : string
  +risk_level : RiskLevel
}

enum RiskLevel {
  low
  medium
  high
}

class RepairExecutor {
  +run(script) : ExecResult
  +precheck() : bool
}

class ExecResult {
  +stdout : text
  +stderr : text
  +exitCode : int
}

class FaultRecord {
  +id : int
  +fault_type : string
  +reason : string
  +timestamp : datetime
  +node : string
}

class ExecLog {
  +id : int
  +record_id : int
  +stdout : text
  +stderr : text
  +timestamp : datetime
}

class MySQLClient {
  +saveFault(record: FaultRecord)
  +saveExecLog(log: ExecLog)
  +queryLogs(filter)
}

class RedisCache {
  +set(key, value)
  +publish(channel, msg)
  +get(key)
}

class ClusterStatus {
  +nodesUp : int
  +nodesDown : int
  +hdfsUsage : float
  +yarnActiveApps : int
}

class FrontendWeb {
  +viewStatus()
  +queryLogs()
  +requestDiagnosis()
  +executeRepair()
}

FlumeAgent --> FastAPIService : push(LogEvent)
FastAPIService --> DiagnosisService : diagnose(logs)
DiagnosisService --> LLMClient : call_llm_diagnose
DiagnosisService --> FixCommand : returns
FastAPIService --> RepairExecutor : execute(FixCommand)
RepairExecutor --> ExecResult : returns
FastAPIService --> MySQLClient : save FaultRecord/ExecLog
FastAPIService --> RedisCache : cache/publish status
FrontendWeb --> FastAPIService : REST/WebSocket
FastAPIService --> ClusterStatus : compose
MySQLClient --> FaultRecord
MySQLClient --> ExecLog
FixCommand --> RiskLevel

note right of FixCommand
  JSON example:
  {
    "fault_type": "DataNode failure",
    "reason": "disk full",
    "fix_script": "ssh dn 'clean_temp.sh'",
    "risk_level": "medium"
  }
end note
@enduml
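
As a companion to the `FixCommand` and `RiskLevel` types in this diagram, the following sketch shows one way the LLM's JSON reply might be parsed and validated before it reaches `RepairExecutor`. The field names and the three risk levels come from the diagram; the `from_json` constructor, the frozen dataclass, and the error handling are assumptions for illustration.

```python
import json
from dataclasses import dataclass
from enum import Enum


class RiskLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class FixCommand:
    fault_type: str
    reason: str
    fix_script: str
    risk_level: RiskLevel

    @classmethod
    def from_json(cls, text: str) -> "FixCommand":
        """Parse a JSON object; raise ValueError on missing or invalid fields."""
        raw = json.loads(text)
        required = ("fault_type", "reason", "fix_script", "risk_level")
        missing = [key for key in required if key not in raw]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        return cls(
            fault_type=str(raw["fault_type"]),
            reason=str(raw["reason"]),
            fix_script=str(raw["fix_script"]),
            # RiskLevel(...) raises ValueError for anything outside low/medium/high.
            risk_level=RiskLevel(raw["risk_level"]),
        )


reply = """
{
  "fault_type": "DataNode failure",
  "reason": "disk full",
  "fix_script": "ssh dn 'clean_temp.sh'",
  "risk_level": "medium"
}
"""
command = FixCommand.from_json(reply)
print(command.risk_level)
```

Rejecting unknown risk levels at parse time means a malformed model reply fails early instead of slipping past the high-risk confirmation step.
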
@@ -0,0 +1,25 @@
@startuml
title Deployment Topology

node "On-Prem / Cloud" {
  node "Hadoop Cluster" {
    [NameNode]
    [DataNodes...]
  }

  node "Logging Layer" {
    [Flume Agents]
  }

  node "Application Layer" {
    [FastAPI]
    [LLM Connector]
    [Nginx for Frontend]
  }

  node "Storage/Caching" {
    [MySQL]
    [Redis]
  }
}
@enduml
@@ -0,0 +1 @@
java -jar tools/plantuml.jar -tpng "doc\project\diagrams\*.puml"