Li Tao's Personal Weekly Plan #9

Closed
pve6agnmp wants to merge 0 commits from litao_branch into develop

@@ -1,49 +0,0 @@
name: frontend-vue-ci
on:
  push:
    paths:
      - "frontend-vue/**"
      - ".github/workflows/frontend-vue-ci.yml"
  pull_request:
    paths:
      - "frontend-vue/**"
      - ".github/workflows/frontend-vue-ci.yml"
jobs:
  test:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: frontend-vue
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: "pnpm"
          cache-dependency-path: frontend-vue/pnpm-lock.yaml
      - name: Enable corepack
        run: corepack enable
      - name: Install
        run: pnpm install --frozen-lockfile
      - name: Lint
        run: pnpm run lint
      - name: Typecheck
        run: pnpm run typecheck
      - name: Test
        run: pnpm run test
      - name: Install Playwright Browsers
        run: pnpm exec playwright install --with-deps chromium
      - name: E2E
        run: pnpm run e2e:ui
      - name: Upload Playwright Report
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: frontend-vue-playwright-report
          path: |
            frontend-vue/playwright-report
            frontend-vue/test-results
      - name: Build
        run: pnpm run build

.gitignore

@@ -1,10 +0,0 @@
# Ignore temporary Vue3 playground directory
src/fronted/vue3/
frontend-vue/node_modules/
frontend-vue/dist/
frontend-vue/.vite/
frontend-vue/test-results/
frontend-vue/playwright-report/
.venv/
__pycache__/
*.pyc

@@ -1,113 +0,0 @@
## Project Status Overview
- The frontend is a static page with vanilla JS, concentrated in `src/frontend/index.html`, with supporting utility scripts:
  - Authentication & permissions: `src/frontend/utils/auth.js`
  - Navigation & routing: `src/frontend/utils/navigation.js`
  - Responsive interactions: `src/frontend/utils/responsive.js`
  - Chart management: `src/frontend/utils/charts.js`
- External dependencies load via CDN: `Tailwind CSS`, `Font Awesome`, `Google Fonts`, `ECharts`.
- Core page sections (examples):
  - Login & registration: `index.html:164`, `index.html:210`
  - Cluster list: `index.html:254`
  - Dashboard: `index.html:374`
  - Log query: `index.html:573`
  - Fault diagnosis: `index.html:851`
  - Fault center: `index.html:998`
  - Execution logs: `index.html:1107`
  - Alert configuration: `index.html:1179`
  - Profile/account: `index.html:1320`, `index.html:1355`
  - Users, roles, permission policies, audit: `index.html:1437`, `index.html:1553`, `index.html:1592`, `index.html:1639`
## Features and Business Logic
- Authentication & permissions (`auth.js`): demo credentials on the frontend, role-driven page visibility and action disabling, placeholder approval queue and user management.
- Routing & navigation (`navigation.js`): `location.hash`-based single-page switching; top/side/dropdown menu events; page-switch hooks for log filters, the three-pane diagnosis view, cluster registration/deregistration, and dashboard cluster metadata.
- Responsive (`responsive.js`): mobile menu and sidebar overlay, window-size class names, touch gestures, chart-redraw hooks.
- Charts (`charts.js`): ECharts CPU line chart and memory donut chart initialization, updates, and resizing.
## Tech Stack and Dependencies
- Current: vanilla `HTML/CSS/JS` + CDN (`Tailwind`, `Font Awesome`, `ECharts`). No build tooling or package manager.
- Backend: `FastAPI` prototype (for context). The frontend does not yet call real APIs.
## Core Requirements and Scenarios
- Role-driven cluster management: list, dashboard, log browsing, fault diagnosis and fault center, alert rule management, user and permission management.
- Requirements: preserve the existing UI and workflows; allow evolving toward real API integration; keep the frontend maintainable and extensible.
## Refactoring Goals and Scope
- Fully refactor with `Vue3 + Vite + Vue Router 4 + Pinia`, migrating to the Composition API.
- Keep features and UI identical; no performance optimization; prioritize maintainability and extensibility.
## Target Directory Structure (proposed)
- Root: `src/` (frontend project)
  - `app/`: application skeleton
    - `main.ts`, `App.vue`
    - `router/` (routes and guards)
    - `stores/` (Pinia state, e.g. `auth`, `clusters`, `logs`)
    - `services/` (`httpClient.ts`, API modules)
    - `components/` (shared components: nav, sidebar, table, modal, etc.)
    - `views/` (page views: login, register, cluster list, dashboard, logs, diagnosis, fault center, alerts, user management, roles, permission policies, audit, profile/account)
    - `composables/` (reusable logic: `useResponsive`, `useCharts`, `usePagination`, `useFilters`)
    - `assets/` (styles and static assets, reusing existing CSS)
    - `types/` (interface and entity type definitions)
  - `index.html` (Vite template, mounts `#app`)
## Migration Mapping and Key Points
- `auth.js` → `stores/auth.ts` + `views/Login.vue`/`Register.vue` + route guards
  - Role permissions: declare `roles` in route meta and check in the guard; use `v-if` for visibility inside components; extract disable logic into a directive or component prop.
  - Approval queue / user management: migrate the placeholder lists into `views/UserManagement.vue` plus local `components`.
- `navigation.js` → top/side navigation components + `vue-router` navigation
  - Top nav `HeaderNav.vue`, sidebar `Sidebar.vue`, config dropdown `ConfigDropdown.vue`, user menu `UserMenu.vue`.
  - Page-switch hooks become each view's `onMounted`/`watch(route)` plus local composables.
- `responsive.js` → `useResponsive` composable + layout component `Layout.vue`
  - Mobile menu/sidebar overlay as `Layout` state; touch gestures in the same composable; `body` class names via `onMounted` and `watchEffect`.
- `charts.js` → `useCharts` composable + `CpuChart.vue`, `MemoryChart.vue`
  - Unified init/teardown and `resize`; data updates via props and emits; use the `ECharts` npm package and manage instances inside components.
- Log filtering/pagination/rendering → `views/Logs.vue` + `useLogDataset` (DOM scraping replaced by the API or local mocks)
  - Build filter conditions as `reactive`; paginate with `computed` + `v-for`; componentize summaries.
- Cluster register/deregister/current cluster → `views/ClusterList.vue` + `stores/clusters.ts`; dashboard header `CurrentClusterMeta.vue`.
- Diagnosis three-pane layout with drag → `views/Diagnosis.vue` + pane components with encapsulated drag events.
- Alert configuration → `views/AlertConfig.vue` + rule table and edit dialog components.
## Routing (Vue Router 4)
- Declare the route table: one route per page view; nested routes under the main layout.
- Guards: a global beforeEach checks login state and role; unauthenticated users redirect to login; unauthorized users redirect to their role's default page.
- Route meta: `requiresAuth`, `roles`, `title`.
## State Management (Pinia)
- `auth`: user info, role, login/logout; persistence (`localStorage` or `pinia-plugin-persistedstate`).
- `clusters`: list, current selection, register/deregister.
- `logs`: source data, filter conditions, pagination state, rendered data.
- `alerts`: rule collection and editing state.
## Service Layer and API Integration
- `httpClient.ts`: `fetch` or `axios` wrapper with timeout/cancellation, error normalization, auth-header injection.
- API modules: `authApi`, `clusterApi`, `logApi`, `diagnosisApi`, `alertApi`.
- Mocks (MSW or local JSON) can be used for now to keep behavior consistent, switching to the real backend later.
## Keeping the UI Consistent
- Reuse the existing CSS (copied into `assets/styles`); keep `Font Awesome`/`Google Fonts`; keep Tailwind (or remove as needed).
- Split into components while keeping class names and structure equivalent; wrap generic `Table`, `Modal`, `Dropdown` where necessary.
## Testing and Acceptance
- Unit tests: `Vitest` + `Vue Test Utils` for component and store logic.
- Integration tests: key page flows (login/route guards, log filtering and pagination, cluster register/deregister, alert rule CRUD).
- E2E (optional): `Playwright` covering the main user paths.
## Deliverables
- The fully refactored code (Vue3 project with views/components/state/service layer).
- Updated docs: refactoring notes, directory structure, conventions and usage.
- Test report: passing tests and coverage; a checklist of parity with the original features.
## Implementation Steps (Milestones)
1. Initialize the project (Vite + Vue3 + Router + Pinia) and migrate styles and static assets.
2. Build layout and navigation components; wire in route guards and the auth store.
3. Migrate cluster list and dashboard views and interactions; componentize ECharts.
4. Migrate the log view (filter/pagination/summary) and extract composables.
5. Migrate the diagnosis three-pane layout with drag, linked to the log preview.
6. Migrate alert configuration and the user/role/permission-policy/audit pages.
7. Complete test coverage and documentation; finish acceptance.
## Refactoring Standards
- Vue3 Composition API, single-responsibility components, semantic naming and layered directories.
- No performance optimization; focus on equivalent functionality and maintainability.
- Extensibility: decouple the service layer from stores, drive permissions via route meta, keep components reusable.
Please confirm the plan above; once confirmed I will start implementation and deliver a runnable Vue3 version, docs, and tests milestone by milestone.

@@ -1,37 +0,0 @@
## Problem Diagnosis
- Root cause: components cache Pinia-derived permission flags in plain constants, losing reactivity.
- Evidence:
  - Sidebar: `const isAdmin = auth.role === 'admin'` (never recomputed after the first render when `auth.role` changes)
  - HeaderNav: `const authed = auth.isAuthenticated` (also non-reactive; does not update after login)
- Symptom: after login, the menus only appear after a browser refresh, which recreates the components and recomputes the constants.
## Fix
### 1. Keep Sidebar.vue reactive
- Use `storeToRefs(auth)` and `computed`:
  - `const { role } = storeToRefs(auth)`
  - `const isAdmin = computed(() => role.value === 'admin')`
- The template becomes `v-if="isAdmin"` (or `v-if="role === 'admin'"` using the `role` ref).
### 2. Keep HeaderNav.vue reactive
- Use `storeToRefs(auth)` to get `isAuthenticated` and `role`:
  - `const { isAuthenticated, role } = storeToRefs(auth)`
  - `const authed = isAuthenticated` or `computed(() => isAuthenticated.value)`
- Change `can()` to use `role.value`: `return roles.includes(role.value || '')`
- Keep `auth.restore()` here or hoist it to the app entry (optional, see the next section).
### 3. Optional Enhancements
- Call `auth.restore()` once after app initialization in `src/app/main.ts` so the login state is restored immediately after a refresh.
- Extract a shared permission helper `useCan(roles)` to reduce duplication and unify the reactive source.
## Verification Steps
- Start the frontend with a valid backend address (`VITE_API_TARGET`).
- Log in as admin: the sidebar entries "User Management / Role Assignment / Permission Policies / Audit Logs" appear without a refresh.
- Switch roles or log out: the corresponding menus hide immediately; verify the Header user menu's `v-if` also toggles promptly.
- Route guards keep working as before (unauthenticated → login, wrong role → default page).
## Change Scope
- `frontend-vue/src/app/components/Sidebar.vue`
- `frontend-vue/src/app/components/HeaderNav.vue`
- (optional) `frontend-vue/src/app/main.ts`: call `auth.restore()` once
Please confirm the fix plan above; once confirmed I will make the code changes, push to the `develop` branch, and then run joint verification.

@@ -1,74 +0,0 @@
## Constraints and Goals
- The model is not deployed locally; call a provider API (OpenAI/Azure/other compatible).
- Use Function Calling so a single diagnosis agent suffices: it can determine faults from cluster logs and auto-repair.
- Ensure safety and auditability: role checks, whitelisted tools, unified audit records, streaming output.
## Simplified Architecture
- Provider client: `backend/app/services/llm.py`
  - `LLMClient(chat(messages, tools, stream))`: calls the provider API via HTTPX with streaming support.
  - Reads `.env`: `LLM_PROVIDER/LLM_ENDPOINT/LLM_MODEL/LLM_API_KEY`.
- Single agent: `backend/app/agents/diagnosis_agent.py`
  - Input: structured log query results + necessary raw-log excerpts.
  - Output: root cause + repair actions (tools selected and executed automatically via Function Calling).
  - Loop: tool call → result fed back to the model → until convergence or timeout.
- Tool registry: `backend/app/services/ops_tools.py`
  - Reuse existing capabilities: structured logs (`backend/app/routers/logs.py:28`), node permissions (`backend/app/routers/nodes.py:120`), execution audit (`backend/app/routers/exec_logs.py`).
  - Tool functions (each with a JSON Schema):
    - `read_log(node, path, lines, pattern?)`
    - `kill_process(node, pid, signal)`
    - `reboot_node(node)`
    - `service_restart(node, service)` (optional: HDFS/YARN/NodeManager, etc.)
  - Input validation + `shlex.quote` + command whitelist; every execution recorded in `exec_logs`.
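The whitelist-plus-quoting idea can be sketched as follows. This is a minimal illustration, not the project's actual `ops_tools.py`; the `COMMAND_TEMPLATES` table and `build_command` helper are assumptions for the sketch:

```python
import shlex

# Hypothetical whitelist: tool name -> command template with named placeholders.
COMMAND_TEMPLATES = {
    "read_log": "tail -n {lines} {path}",
    "kill_process": "kill -{signal} {pid}",
    "reboot_node": "sudo reboot",
}

def build_command(tool: str, **params) -> str:
    """Render a whitelisted command, quoting every parameter with shlex.quote."""
    if tool not in COMMAND_TEMPLATES:
        raise ValueError(f"tool not whitelisted: {tool}")
    quoted = {k: shlex.quote(str(v)) for k, v in params.items()}
    return COMMAND_TEMPLATES[tool].format(**quoted)

# A hostile path is neutralized by quoting instead of being interpolated raw:
print(build_command("read_log", lines=200, path="/var/log/hadoop; rm -rf /"))
# tail -n 200 '/var/log/hadoop; rm -rf /'
```

Anything outside the template table raises immediately, so the model can never escalate beyond the registered tools.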
- API: `backend/app/routers/ai.py`
  - `POST /api/v1/ai/diagnose-repair`: trigger single-agent diagnosis and auto-repair (optional params: cluster/node/time window/keywords/safety level).
  - `GET /api/v1/ai/stream/{taskId}`: SSE/WebSocket stream of model reasoning and tool execution.
## Diagnosis and Repair Flow (single agent)
1) Aggregate context:
  - Query structured logs from the database via `backend/app/routers/logs.py:28` (by `level/op/node/cluster/time_from`).
  - If raw logs are needed, call the `read_log` tool to tail the last N lines, optionally filtered by regex.
  - Optional metrics summary: extend a `metrics` router later.
2) Send to the provider API (Function Calling):
  - System prompt: safety boundaries / only registered tools may be called / output Chinese plus structured JSON.
  - Provide tool signatures (name, description, parameter schema).
3) The model selects tools and the backend executes them:
  - The backend dispatches by tool name (with node permission checks and audit writes) and feeds the result (stdout/stderr/exitCode) back to the model to continue reasoning.
4) Converge and output:
  - Return the root cause, the executed repair actions and their results, residual risks, and recommendations.
  - Write `exec_logs` (model calls and tool executions) and `faults` when needed.
## Safety and Audit
- Roles: require `ops` or `admin` by default; fully automatic mode only when `auto=true` and the caller is `admin`.
- Command whitelist: the tool layer restricts executable commands (log reads/kill/reboot/service restart); arbitrary shell is forbidden.
- Audit: `exec_logs` records every model call and tool execution (start/end time, exit code, operator, affected nodes).
- Node authorization: reuse the `nodes` access checks (users may only operate clusters they can access).
## Configuration and Dependencies
- Provider API only; add an `httpx` client and the provider SDK (`openai`).
- Inject keys and model info via `.env`; extend `backend/app/config.py` to read them without writing secrets to logs.
## API Design Draft
- `POST /api/v1/ai/diagnose-repair`
  - Params: `cluster?`, `node?`, `timeFrom?`, `keywords?`, `auto?`, `maxSteps?`.
  - Returns: `taskId`, `summary`, `actions` (tool name/args/result), `rootCause`, `residualRisk`.
- `GET /api/v1/ai/stream/{taskId}`: streaming tokens + tool results.
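The streaming half of this draft boils down to formatting task events as SSE frames. A minimal sketch of the generator (the in-memory `TASK_QUEUES` store and the event shape are assumptions; in the real router the generator would be wrapped in `fastapi.responses.StreamingResponse(..., media_type="text/event-stream")`):

```python
import asyncio
import json

# Hypothetical in-memory event queues: taskId -> asyncio.Queue.
TASK_QUEUES: dict[str, asyncio.Queue] = {}

async def sse_events(task_id: str):
    """Yield queued task events in Server-Sent Events wire format."""
    queue = TASK_QUEUES.setdefault(task_id, asyncio.Queue())
    while True:
        event = await queue.get()
        if event is None:  # sentinel: task finished
            break
        yield f"data: {json.dumps(event, ensure_ascii=False)}\n\n"

async def demo() -> list[str]:
    # The agent side would push events as it reasons and runs tools.
    queue = TASK_QUEUES.setdefault("task-1", asyncio.Queue())
    await queue.put({"type": "token", "text": "NameNode"})
    await queue.put(None)
    return [chunk async for chunk in sse_events("task-1")]

print(asyncio.run(demo()))
```

A per-process dict works for a single-worker prototype; multiple workers would need a shared broker (e.g. Redis pub/sub) behind the same interface.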
## Code Locations and Integration
- Reuse:
  - Log DB queries: `backend/app/routers/logs.py:28`.
  - Audit: `backend/app/models/exec_logs.py:7`, `backend/app/routers/exec_logs.py:57`.
  - Node authorization: `backend/app/routers/nodes.py:120`.
- New:
  - `services/llm.py` (provider API client + Function Calling wrapper).
  - `services/ops_tools.py` (tool implementations + audit writes + validation).
  - `agents/diagnosis_agent.py` (single-agent loop and policy).
  - `routers/ai.py` (unified entry + streaming output).
## Iteration Plan
- P0 (2-3 days): provider client; `read_log/kill/reboot` tools; basic single-agent loop; `/ai/diagnose-repair` returning a one-shot result (non-streaming).
- P1 (3-5 days): SSE/WebSocket streaming; service-restart tool; risk-level parameter (auto-execute low risk by default).
- P2 (later): anomaly pattern library and log compression; reports and visualization; caching and rate limiting.
## Verification
- Unit tests: tool argument validation, node authorization, audit writes; provider API timeout/retry.
- Integration tests: with fixed log samples, verify the model picks the correct tool and completes the repair; high-risk actions are rejected or require `admin`.

@@ -1,75 +0,0 @@
## Current Status
- The local database and all tables are created (including sample users and permissions).
- The project has a basic backend skeleton (FastAPI + SQLAlchemy + asyncpg); routers and models are in place.
## Goals
- Implement a backend login endpoint (username/email + password) that verifies `users.password_hash` (bcrypt) and returns a JWT.
- Wire the login section of the frontend prototype `index.html` to `/api/v1/user/login`, store the token, and enable an Axios interceptor that attaches `Authorization` to subsequent requests.
- Protect restricted pages and endpoints (cluster management, fault center, etc.), handling 401/403 and page redirects when unauthenticated/unauthorized.
## Technology and Dependencies
- Password verification: `passlib[bcrypt]` (validates bcrypt hashes of the `$2b$12$...` form)
- Tokens: `PyJWT` (HS256) with `SECRET_KEY` and `ACCESS_TOKEN_EXPIRE_MINUTES` settings
- Add to `src/backend/requirements.txt`: `passlib[bcrypt]`, `PyJWT`.
## Backend Implementation
### Configuration
- Add to `src/backend/app/config.py`: `SECRET_KEY`, `ALGORITHM='HS256'`, `ACCESS_TOKEN_EXPIRE_MINUTES=60` (read from environment variables, `.env` supported).
### Models and Utilities
- New `src/backend/app/models/users.py` mapping the `users` table (`id, username, email, password_hash, is_active, ...`).
- New `src/backend/app/security.py`:
  - `verify_password(plain, hashed)`: verify with `passlib`;
  - `create_access_token(data, expires_minutes)`: generate a JWT with `PyJWT`;
  - Dependency `get_current_user(token)`: verify the Bearer token and load the user.
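To make the token helper concrete, here is a dependency-free sketch of `create_access_token`. The plan uses `PyJWT`'s `jwt.encode`; this stand-in builds the same HS256 JWT by hand so the structure (header.payload.signature, `exp` claim) is visible. The `SECRET_KEY` constant is illustrative:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET_KEY = "change-me"  # read from .env in the real config

def _b64url(data: bytes) -> str:
    """Unpadded base64url, as required by the JWT spec."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def create_access_token(data: dict, expires_minutes: int = 60) -> str:
    """Build an HS256 JWT by hand (PyJWT's jwt.encode does the same work)."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {**data, "exp": int(time.time()) + expires_minutes * 60}
    signing_input = f"{_b64url(json.dumps(header).encode())}.{_b64url(json.dumps(payload).encode())}"
    sig = hmac.new(SECRET_KEY.encode(), signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(sig)}"

token = create_access_token({"sub": "admin"})
print(token.count("."))  # 2: header.payload.signature
```

With PyJWT the body collapses to `jwt.encode(payload, SECRET_KEY, algorithm="HS256")`, which is what `security.py` should actually call.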
### Routes
- New `src/backend/app/schemas/auth.py`: `LoginRequest {username_or_email, password}`, `TokenResponse {access_token, token_type}`.
- New `src/backend/app/routers/auth.py`:
  - `POST /api/v1/user/login`
    1) Accept a `LoginRequest`
    2) Match the user by username or email
    3) Verify `password_hash`
    4) Generate the `JWT` and return a `TokenResponse`
  - On failure: `401` (unified error envelope: `{code,message,detail,traceId}`).
- Mount the `auth` router in `app/main.py` (under `/api/v1`).
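The four login steps can be sketched as a plain function; all names here are hypothetical stand-ins (the real router wraps this in a FastAPI endpoint, queries the DB, and raises `HTTPException(401)` with the error envelope):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class User:  # stand-in for a users-table row
    username: str
    email: str
    password_hash: str

def find_user(username_or_email: str, users: list) -> Optional[User]:
    """Step 2: match by username OR email."""
    return next((u for u in users if username_or_email in (u.username, u.email)), None)

def login(username_or_email: str, password: str, users: list,
          verify_password, create_access_token) -> dict:
    user = find_user(username_or_email, users)                              # step 2
    if user is None or not verify_password(password, user.password_hash):   # step 3
        return {"code": 401, "message": "invalid credentials"}              # envelope
    return {"access_token": create_access_token({"sub": user.username}),    # step 4
            "token_type": "bearer"}

# Demo with toy hash/token functions (the plan uses passlib/bcrypt + PyJWT instead)
users = [User("admin", "admin@example.com", "hash::secret")]
verify = lambda plain, hashed: hashed == f"hash::{plain}"
issue = lambda data: f"jwt-for-{data['sub']}"
print(login("admin@example.com", "secret", users, verify, issue))
# {'access_token': 'jwt-for-admin', 'token_type': 'bearer'}
```

Checking `user is None or not verify_password(...)` in one branch keeps the failure response identical for unknown users and wrong passwords, which avoids leaking which usernames exist.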
### Protecting Restricted Endpoints
- Add the optional dependency `get_current_user` to existing routers (clusters/faults/logs), enforcing it at least on write operations that need a user context.
- Unified exception handling:
  - Unauthenticated: return `401`; the frontend redirects to login;
  - Unauthorized: return `403`; the frontend shows a no-permission message.
## Frontend Integration
- Bind the submit event on the login page (`#login`) in `src/frontend/index.html`:
  - Call `POST /api/v1/user/login`; on success store `access_token` in `localStorage`;
  - Switch to the main view and attach `Authorization: Bearer <token>` via the Axios interceptor.
- In the utils layer (`utils/auth.js`):
  - `setupAuth()`: register login-form events, logout, token read/write;
  - `setupAxios()`: request interceptor injecting `Authorization`; response interceptor handling `401/403` (redirect to login / no-permission message).
## Security and Configuration
- `.env` (not committed):
  - `DATABASE_URL=postgresql+asyncpg://app_user:<pwd>@localhost:5432/hadoop_fault_db`
  - `SECRET_KEY=<random 32 bytes>`
  - `ACCESS_TOKEN_EXPIRE_MINUTES=60`
- Least privilege: the backend connects as `app_user` with business read/write only (already set up in the tutorial).
## Testing and Verification
- Insert a test user via script (skip if one exists) with a bcrypt hash (the script already includes an admin example).
- Start the backend: `uvicorn src.backend.app.main:app --reload`
- Tests:
  - `POST /api/v1/user/login` returns a token on success;
  - `GET /api/v1/clusters` works with the Authorization header;
  - Restricted endpoints return 401/403 without a token.
- Frontend login form: enter an existing user and password; after the redirect, page API calls carry the token.
## Deliverables
- Backend: `auth` router and `security` utilities, `users` model and `schemas`, main-entry registration, dependency updates.
- Frontend: login logic and Axios interceptors in `utils/auth.js`; initialization wired in `index.html`.
## Follow-ups
- Registration with an approval flow (`POST /api/user/register` → admin approval list);
- Deliver roles and permissions in the token (`role_key`); the frontend controls menu visibility by role;
- Refresh tokens and logout;
- Audit: write login success/failure to `audit_logs`.

@@ -1,199 +0,0 @@
## Goals
* Based on the existing requirements docs and DDL script, produce a from-scratch PostgreSQL deployment tutorial (Markdown).
* Generate initial backend code matching the project stack (FastAPI + SQLAlchemy + asyncpg), covering basic connectivity and minimal viable endpoints for the core modules.
## References
* Requirements specification: `doc/project/需求规格说明书.md`
* DDL script: `doc/project/数据库建表脚本_postgres.sql`
* Database design and ER notes: `doc/project/数据库设计文档.md`, `doc/project/ER图设计说明.md`
## Deliverables
* New doc: `doc/project/数据库部署步骤教程.md`
* New code directory: `backend/`
  * `backend/requirements.txt`
  * `backend/app/main.py`
  * `backend/app/config.py`
  * `backend/app/db.py`
  * `backend/app/models/*.py` (core table models: clusters, nodes, system_logs, fault_records, exec_logs, etc.)
  * `backend/app/schemas/*.py` (Pydantic models)
  * `backend/app/routers/*.py` (clusters, faults, logs, health)
## Database Deployment Tutorial Outline
* Requirements and preparation
  * PostgreSQL 14+, Windows (PowerShell) or Docker Desktop
  * DDL script location and notes (does not include CREATE DATABASE)
* Install PostgreSQL (two options)
  * Installer: download and install, set the superuser, add to PATH
  * Docker: `docker run -d --name pg -e POSTGRES_PASSWORD=... -p 5432:5432 postgres:14`
* Initialize the database and user
  * Create the database: `CREATE DATABASE hadoop_fault_db WITH ENCODING 'UTF8';`
  * Optional: create a business user and grant privileges
  * Local auth config (`pg_hba.conf` hint: `host all all 127.0.0.1/32 md5`)
* Run the DDL script
  * Windows: `psql -U postgres -h localhost -d hadoop_fault_db -f doc/project/数据库建表脚本_postgres.sql`
  * Docker: `docker cp` + `psql -f` inside the container
* Verification and self-checks
  * Table/index/constraint checks (e.g. `\dt`, `\di`, `\d+ fault_records`)
  * Run the script's built-in sample data and verify the returned notices
* Backup and restore
  * Backup: `pg_dump -U postgres -d hadoop_fault_db > backup.sql`
  * Restore: `psql -U postgres -d hadoop_fault_db -f backup.sql`
* Connection strings and security notes
  * DSN example: `postgresql://postgres:<pwd>@localhost:5432/hadoop_fault_db`
  * No plaintext secrets in the repo, least privilege, regular backup drills
## Initial Backend Code Plan
* Dependency management: `requirements.txt`
  * `fastapi`, `uvicorn[standard]`, `SQLAlchemy>=2`, `asyncpg`, `pydantic`, `python-dotenv`
* Config module: `app/config.py`
  * Load environment variables (database DSN, service port, log level)
* Database module: `app/db.py`
  * Create the `AsyncEngine` and `async_sessionmaker`; wrap session acquisition/teardown (with docstrings)
* SQLAlchemy models: `app/models/*.py`
  * Map the core tables: clusters, nodes, system_logs, fault_records, exec_logs, roles, permissions, user_role_mapping, user_cluster_mapping, app_configurations, audit_logs
  * Type mapping: `JSONB` → `postgresql.JSONB`, `INET` → `postgresql.INET`, timestamps → `TIMESTAMP(timezone=True)`
* Pydantic models: `app/schemas/*.py`
  * Input/output models: ClusterCreate/ClusterRead, FaultCreate/FaultRead, LogQuery
* Routers: `app/routers/*.py`
  * `health.py`: health check (DB connectivity probe)
  * `clusters.py`: `GET /api/v1/clusters`, `POST /api/v1/clusters`
  * `faults.py`: `GET /api/v1/faults`, `POST /api/v1/faults`
  * `logs.py`: `GET /api/v1/logs` (pagination/filtering)
* App entry: `app/main.py`
  * Register routers, CORS (allow the frontend origin), exception handlers, `/api/v1` version prefix
* Running
  * Create a venv and install: `python -m venv venv && venv\Scripts\activate && pip install -r requirements.txt`
  * Start: `uvicorn app.main:app --reload`
  * Environment: set `DATABASE_URL=postgresql+asyncpg://postgres:<pwd>@localhost:5432/hadoop_fault_db` in `.env`
## Code Snippets (to be used in generation)
* `app/db.py`
```python
from typing import AsyncGenerator
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker, AsyncSession

DATABASE_URL = "postgresql+asyncpg://postgres:password@localhost:5432/hadoop_fault_db"
engine = create_async_engine(DATABASE_URL, echo=False, pool_pre_ping=True)
SessionLocal = async_sessionmaker(engine, expire_on_commit=False, class_=AsyncSession)


async def get_db() -> AsyncGenerator[AsyncSession, None]:
    """Yield an async database session for dependency injection."""
    async with SessionLocal() as session:
        yield session
```
* `app/main.py`
```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from app.routers import health, clusters, faults, logs

app = FastAPI(title="Hadoop Fault Detecting API", version="v1")
app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_credentials=True, allow_methods=["*"], allow_headers=["*"])
app.include_router(health.router, prefix="/api/v1")
app.include_router(clusters.router, prefix="/api/v1")
app.include_router(faults.router, prefix="/api/v1")
app.include_router(logs.router, prefix="/api/v1")
```
* `app/routers/health.py`
```python
from fastapi import APIRouter, Depends
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession
from app.db import get_db

router = APIRouter()


@router.get("/health")
async def health_check(db: AsyncSession = Depends(get_db)):
    """Health check: probe database connectivity."""
    await db.execute(text("SELECT 1"))  # SQLAlchemy 2.x requires text() for raw SQL
    return {"status": "ok"}
```
* `app/routers/clusters.py` (example)
```python
from fastapi import APIRouter, Depends
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select
from app.db import get_db
from app.models.clusters import Cluster

router = APIRouter()


@router.get("/clusters")
async def list_clusters(db: AsyncSession = Depends(get_db)):
    """List clusters."""
    result = await db.execute(select(Cluster).limit(100))
    rows = result.scalars().all()
    return {"total": len(rows), "list": [c.to_dict() for c in rows]}
```
## Verification and Next Steps
* I will generate the tutorial file and initial backend code per the structure above and self-check basic connectivity locally (`/health`, `/clusters`).
* Endpoints and entity mappings can then be extended module by module per the SRS, adding WebSocket, audit, and permission modules.

@@ -1,28 +0,0 @@
## Project Overview
- Architecture: separated frontend and backend; the frontend is a static page with vanilla JS, the backend a FastAPI prototype.
- Key entry points: backend `src/backend/main.py` (`GET /` returns `{Hello: World}`); frontend entry `src/frontend/index.html` (per current inspection, opened directly as a static file).
- Docs: `doc/project/前后端启动前交流确认清单.md` is the pre-launch coordination checklist.
## Startup and Verification
- Backend: `pip install -r src/backend/requirements.txt`; `uvicorn src.backend.main:app --reload`; verify the response at `http://localhost:8000/`.
- Frontend: open `src/frontend/index.html` directly or serve it statically (`python -m http.server -d src/frontend`).
- Check: whether routes and interactions are frontend-only placeholders, and whether any real API calls to the backend exist.
## Code and Feature Alignment
- List the frontend placeholder modules (login/register, roles and permissions, log query, dashboard, diagnosis, etc.) against the missing backend APIs.
- Plan FastAPI endpoints: auth (register/login/logout), users and roles, cluster and dashboard data, log query, diagnosis and alerts as basic CRUD.
- Integrate progressively: auth and dashboard data first, then logs and diagnosis.
## Configuration and Security
- Manage configuration via `.env` and `pydantic` Settings; avoid hardcoding secrets.
- Add CORS, rate limiting, unified error handling, and audit logging as needed.
## Testing and Delivery
- Backend: endpoint and contract tests with `pytest` + `httpx`/`fastapi.testclient`.
- Frontend: minimal integration checks for key modules (e.g. login redirect and permission-based hiding).
- Optional containerization: add a `Dockerfile` and Compose later for joint startup.
## Milestones
- M1: startup and connectivity verified (backend root route works, frontend opens).
- M2: auth and basic data APIs; dashboard integration complete.
- M3: log and diagnosis endpoints connected to the frontend.

@@ -1,11 +0,0 @@
DATABASE_URL=postgresql+asyncpg://postgres:<redacted>@dbconn.sealoshzh.site:38596/hadoop_fault_db
DB_HOST=dbconn.sealoshzh.site
DB_PORT=38596
DB_NAME=hadoop_fault_db
DB_USER=postgres
DB_PASSWORD=<redacted>
LLM_PROVIDER=siliconflow
LLM_API_KEY=<redacted>
LLM_ENDPOINT=https://api.siliconflow.cn/v1
LLM_MODEL=deepseek-ai/DeepSeek-V3
LLM_TIMEOUT=300

@@ -1,59 +0,0 @@
## Requirements Summary
- When registering a cluster, verify SSH connectivity for every node one by one; if all are reachable, generate the cluster UUID and write to the database; if any node is unreachable, return a registration failure.
## Changes
1. Modify the cluster registration endpoint
  - File: `create_cluster` in app/routers/clusters.py ([clusters.py](file:///c:/Users/30326/Desktop/git/backend/app/routers/clusters.py#L75-L161))
  - Import the check: `from app.services.ssh_probe import check_ssh_connectivity`
  - After parameter validation and before any database write, add an "SSH connectivity pre-check" loop:
    - Iterate `req.nodes`, taking `ip_address`, `ssh_user`, `ssh_password`
    - Call `check_ssh_connectivity(ip, user, pwd)`
    - On `(False, err)`, collect an error item: `field=nodes[i].ssh`, `message="Registration failed: SSH unreachable"`, `step="connect"`, `detail=err`, plus hostname/ip
  - If any errors were collected: return 400 and perform no database writes
  - Only when all checks pass, generate `new_uuid`, write the `Cluster` and `Node` records, and commit
  - Transactionality: `db.add`/`commit` run only after the pre-check completes; failures leave the database untouched
2. Keep the existing parameter validation and permission logic
  - Still validate `type/health_status/node_count` and the other fields
  - Still allow only admin/ops to register
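The pre-check loop described above might look roughly like this. It is a sketch with hypothetical names; the real implementation lives inside `create_cluster` and raises `HTTPException(400)` with the errors under `detail.errors`:

```python
def precheck_nodes(nodes, check_ssh_connectivity):
    """Return error dicts for unreachable nodes; an empty list means all passed."""
    errors = []
    for i, node in enumerate(nodes):
        ok, err = check_ssh_connectivity(
            node["ip_address"], node["ssh_user"], node["ssh_password"]
        )
        if not ok:
            errors.append({
                "field": f"nodes[{i}].ssh",
                "message": "Registration failed: SSH unreachable",
                "step": "connect",
                "detail": err,
                "hostname": node.get("hostname"),
                "ip": node["ip_address"],
            })
    return errors  # caller writes to the DB only when this is empty

# Stub probe for the demo: only 192.168.10.105 fails
probe = lambda ip, user, pwd: (False, "Connection timed out") if ip == "192.168.10.105" else (True, None)
nodes = [
    {"hostname": "hadoop104", "ip_address": "192.168.10.104", "ssh_user": "root", "ssh_password": "x"},
    {"hostname": "hadoop105", "ip_address": "192.168.10.105", "ssh_user": "root", "ssh_password": "x"},
]
errs = precheck_nodes(nodes, probe)
print(len(errs), errs[0]["field"])  # 1 nodes[1].ssh
```

Because the loop only collects errors and never touches the session, the "no partial writes" guarantee falls out of the control flow rather than relying on a rollback.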
## Endpoint Behavior
- Success:
  - Status 200
  - Body: {"status":"success","message":"cluster registered","uuid":"<uuid>"}
- Failure (SSH unreachable):
  - Status 400
  - Body: {"detail":{"errors":[{field, message, step, detail, hostname, ip}, ...]}}
  - The number of errors equals the number of unreachable nodes
## Error Format
- Example of a single error:
  - field: "nodes[3].ssh"
  - message: "Registration failed: SSH unreachable"
  - step: "connect"
  - detail: "Connection timed out" (or the concrete exception message)
  - hostname: "hadoop105"
  - ip: "192.168.10.105"
## Test Cases
- New file: tests/test_cluster_registration_ssh.py
- Cases:
  1) All nodes reachable → expect 200, success with a uuid
  2) No node reachable → expect 400, errors=5, each containing "SSH unreachable"
  3) Some nodes unreachable → expect 400, errors = number of failed nodes; verify fields and messages
- Use monkeypatch to stub app.services.ssh_probe.check_ssh_connectivity as success/failure, avoiding real SSH connections
- Use the request-body fields ip_address, ssh_user, ssh_password, aligned with the existing Pydantic models
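The stubbing idea behind those test cases can be sketched as follows. The plan names pytest's monkeypatch; this self-contained sketch uses stdlib `unittest.mock` for the same patching pattern, and the `ssh_probe` class and `register_cluster` function are toy stand-ins for the real module and endpoint:

```python
from unittest import mock

class ssh_probe:
    """Stand-in for app.services.ssh_probe (the real one opens SSH sessions)."""
    @staticmethod
    def check_ssh_connectivity(ip, user, pwd):
        raise RuntimeError("a real SSH connection would be attempted here")

def register_cluster(nodes):
    """Toy version of create_cluster: pre-check first, 'commit' only if clean."""
    errors = []
    for i, n in enumerate(nodes):
        ok, err = ssh_probe.check_ssh_connectivity(n["ip_address"], n["ssh_user"], n["ssh_password"])
        if not ok:
            errors.append({"field": f"nodes[{i}].ssh", "detail": err})
    if errors:
        return 400, {"detail": {"errors": errors}}
    return 200, {"status": "success"}

nodes = [{"ip_address": f"192.168.10.10{i}", "ssh_user": "root", "ssh_password": "x"} for i in range(5)]

# Case 1: all reachable -> 200
with mock.patch.object(ssh_probe, "check_ssh_connectivity", return_value=(True, None)):
    print(register_cluster(nodes)[0])  # 200

# Case 2: none reachable -> 400 with 5 errors
with mock.patch.object(ssh_probe, "check_ssh_connectivity", return_value=(False, "timeout")):
    status, body = register_cluster(nodes)
    print(status, len(body["detail"]["errors"]))  # 400 5
```

In the real test file, `monkeypatch.setattr("app.services.ssh_probe.check_ssh_connectivity", ...)` plays the role that `mock.patch.object` plays here.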
## Verification
- Run the test file and ensure all scenarios pass
- Call the endpoint manually (POST /api/v1/clusters) to verify the success and failure response formats
## Compatibility Notes
- When node_count does not match len(nodes): keep the existing validation; optionally correct it to len(nodes)
- Do not change the existing error envelope (kept under detail.errors), consistent with the current code style
- For throughput, asyncio.gather could parallelize the checks later; this change stays sequential for clarity and stability
## Related Code References
- Cluster router entry and prefix: app/main.py ([main.py](file:///c:/Users/30326/Desktop/git/backend/app/main.py#L1-L28))
- SSH probe service: app/services/ssh_probe.py ([ssh_probe.py](file:///c:/Users/30326/Desktop/git/backend/app/services/ssh_probe.py))
- Cluster and node models: app/models/clusters.py, app/models/nodes.py ([clusters.py](file:///c:/Users/30326/Desktop/git/backend/app/models/clusters.py), [nodes.py](file:///c:/Users/30326/Desktop/git/backend/app/models/nodes.py))

@@ -1,114 +0,0 @@
# Hadoop Fault Diagnosis System - Backend Service (FastAPI)
This is the backend core of the Hadoop fault diagnosis system. Built on FastAPI, it provides cluster monitoring, log collection, metrics analysis, and AI-powered fault diagnosis.
## 🚀 Core Features
- **Users & Authentication**: stateless JWT-based auth with user registration, login, and permission management.
- **Cluster & Node Management**: Hadoop cluster registration, SSH connectivity checks, HDFS UUID retrieval, and node status management.
- **Metrics Collection & Monitoring**:
  - Automatically collects cluster and node CPU and memory usage.
  - Serves real-time metric queries and trend-chart data.
- **Hadoop Log Management**:
  - Remote log reading: reads Hadoop logs from each node in real time over SSH.
  - Automatic collection: incremental tail-mode collection persisted to the database.
  - Batch backfill: bulk synchronization of historical logs.
- **AI Diagnosis**:
  - Integrates LangChain and OpenAI with a streaming chat endpoint (SSE).
  - The agent can call tools autonomously: read logs, run remote commands, analyze cluster state.
- **Execution Logs**: records every remote operation and system task execution.
## 🛠 Tech Stack
- **Framework**: [FastAPI](https://fastapi.tiangolo.com/) - high-performance async web framework.
- **Database**: [PostgreSQL](https://www.postgresql.org/) + [SQLAlchemy (Async)](https://www.sqlalchemy.org/) - async ORM.
- **SSH**: [Paramiko](https://www.paramiko.org/) - remote command execution and log reading.
- **AI/LLM**: [LangChain](https://www.langchain.com/) + OpenAI API - the fault-diagnosis agent.
- **Scheduling**: built-in threaded collectors for asynchronous metric and log collection tasks.
- **Auth**: PyJWT + Passlib (BCrypt) - secure authentication.
## 📂 Project Structure
```text
backend/
├── app/
│   ├── agents/            # AI agent definitions and tool orchestration
│   ├── deps/              # FastAPI dependency injection (auth, database)
│   ├── models/            # SQLAlchemy async models
│   ├── routers/           # API routers (clusters, metrics, logs, AI, ...)
│   ├── services/          # business services (SSH management, LLM calls, ...)
│   ├── workers/           # async task processing
│   ├── config.py          # environment variables and global configuration
│   ├── db.py              # database engine and session management
│   ├── main.py            # app entry point and router registration
│   └── log_collector.py   # core log collector implementation
├── scripts/               # database init and verification scripts
├── tests/                 # unit and integration tests
├── requirements.txt       # dependency list
└── start_backend.sh       # one-command startup script
```
## ⚙️ Environment Variables
Create a `.env` file under `backend/` with these keys:
| Key | Description | Default |
| :--- | :--- | :--- |
| `DATABASE_URL` | async PostgreSQL DSN | `postgresql+asyncpg://postgres:password@localhost:5432/hadoop_fault_db` |
| `JWT_SECRET` | JWT signing secret | `dev-secret` |
| `JWT_EXPIRE_MINUTES` | token lifetime (minutes) | `60` |
| `SSH_PORT` | default remote SSH port | `22` |
| `SSH_TIMEOUT` | SSH connection timeout (seconds) | `10` |
| `HADOOP_LOG_DIR` | default remote Hadoop log path | `/usr/local/hadoop/logs` |
| `APP_TIMEZONE` | system timezone | `Asia/Shanghai` |
| `OPENAI_API_KEY` | OpenAI key (for AI diagnosis) | - |
## 🛠 Install and Run
### 1. Prerequisites
- Python 3.10+
- PostgreSQL 14+
### 2. Install dependencies
```bash
cd backend
python3 -m venv .venv
source .venv/bin/activate  # Windows: .\.venv\Scripts\activate
pip install -r requirements.txt
```
### 3. Initialize the database
Run the SQL under `scripts/` or import the DDL script directly:
```bash
# Import the DDL script
psql -h <host> -U <user> -d <db> -f ../doc/project/数据库建表脚本_postgres.sql
```
### 4. Start the service
```bash
# Development mode
python -m uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```
Or use the provided startup script:
```bash
bash start_backend.sh
```
## 📖 API Overview
All endpoints are prefixed with `/api/v1`.
- **Health**: `GET /health`
- **Auth**: `POST /auth/login`, `POST /auth/register`
- **Clusters**: `GET /clusters`, `POST /clusters/register`
- **Metrics**: `GET /metrics/trend`, `POST /metrics/collector/start`
- **Hadoop Logs**: `GET /hadoop/logs/all/{log_type}`, `GET /hadoop/collectors/status`
- **AI**: `POST /ai/chat` (SSE streaming supported)
Full interactive docs after startup: `http://localhost:8000/docs`
## 🧪 Verification and Tests
Run the end-to-end verification script:
```bash
python scripts/verify_register.py
```

@@ -1,98 +0,0 @@
from typing import Any, Dict, List, Optional
from sqlalchemy.ext.asyncio import AsyncSession
from ..services.llm import LLMClient
from ..services.ops_tools import openai_tools_schema, tool_read_log, tool_start_cluster, tool_stop_cluster, tool_read_cluster_log, tool_detect_cluster_faults, tool_run_cluster_command
import json


async def run_diagnose_and_repair(db: AsyncSession, operator: str, context: Dict[str, Any], auto: bool = True, max_steps: int = 3, model: Optional[str] = None) -> Dict[str, Any]:
    """Single-agent diagnosis and auto-repair driven by the log context (Function Calling).

    - context: key information such as cluster/node/logs
    - auto: whether tools may be executed automatically (default: yes)
    - max_steps: maximum number of tool-call steps
    - model: model name to use

    Returns the root cause, the action list with results, and the residual risk.
    """
    llm = LLMClient()
    messages: List[Dict[str, Any]] = [
        {
            "role": "system",
            "content": "You are a Hadoop operations diagnosis expert. You may only call the provided functions to read logs or perform repairs. Answer in Chinese, giving the root cause, impact scope, and repair suggestions first.",
        },
        {
            "role": "user",
            "content": f"Context: {context}",
        },
    ]
    tools = openai_tools_schema()
    actions: List[Dict[str, Any]] = []
    root_cause = None
    residual_risk = "medium"
    for step in range(max_steps):
        resp = await llm.chat(messages, tools=tools, stream=False, model=model)
        choice = (resp.get("choices") or [{}])[0]
        msg = choice.get("message", {})
        tool_calls = msg.get("tool_calls") or []
        if not tool_calls:
            root_cause = msg.get("content")
            break
        if not auto:
            break
        # Echo the assistant message (including its tool_calls) back into the
        # conversation so the subsequent "tool" messages are valid per the API.
        messages.append(msg)
        for tc in tool_calls:
            fn = (tc.get("function") or {})
            name = fn.get("name")
            raw_args = fn.get("arguments") or {}
            if isinstance(raw_args, str):
                try:
                    args = json.loads(raw_args)
                except Exception:
                    args = {}
            elif isinstance(raw_args, dict):
                args = raw_args
            else:
                args = {}
            result: Dict[str, Any]
            if name == "read_log":
                result = await tool_read_log(db, operator, args.get("node"), args.get("path"), int(args.get("lines", 200)), args.get("pattern"), args.get("sshUser"))
            elif name == "read_cluster_log":
                result = await tool_read_cluster_log(
                    db=db,
                    user_name=operator,
                    cluster_uuid=args.get("cluster_uuid"),
                    log_type=args.get("log_type"),
                    node_hostname=args.get("node_hostname"),
                    lines=int(args.get("lines", 100)),
                )
            elif name == "detect_cluster_faults":
                result = await tool_detect_cluster_faults(
                    db=db,
                    user_name=operator,
                    cluster_uuid=args.get("cluster_uuid"),
                    components=args.get("components"),
                    node_hostname=args.get("node_hostname"),
                    lines=int(args.get("lines", 200)),
                )
            elif name == "run_cluster_command":
                result = await tool_run_cluster_command(
                    db=db,
                    user_name=operator,
                    cluster_uuid=args.get("cluster_uuid"),
                    command_key=args.get("command_key"),
                    target=args.get("target"),
                    node_hostname=args.get("node_hostname"),
                    timeout=int(args.get("timeout", 30)),
                    limit_nodes=int(args.get("limit_nodes", 20)),
                )
            elif name == "start_cluster":
                result = await tool_start_cluster(db, operator, args.get("cluster_uuid"))
            elif name == "stop_cluster":
                result = await tool_stop_cluster(db, operator, args.get("cluster_uuid"))
            else:
                result = {"error": "unknown_tool"}
            actions.append({"name": name, "args": args, "result": result})
            messages.append({"role": "tool", "tool_call_id": tc.get("id"), "name": name, "content": json.dumps(result, ensure_ascii=False, default=str)})
    return {"rootCause": root_cause, "actions": actions, "residualRisk": residual_risk}

@@ -1,45 +0,0 @@
import os
import json
from dotenv import load_dotenv
from typing import Dict, Tuple
from datetime import datetime
from zoneinfo import ZoneInfo

load_dotenv()

# Timezone Configuration
APP_TIMEZONE = os.getenv("APP_TIMEZONE", "Asia/Shanghai")
BJ_TZ = ZoneInfo(APP_TIMEZONE)


def now_bj() -> datetime:
    return datetime.now(BJ_TZ)


# Database Configuration
_db_url = os.getenv("DATABASE_URL")
if not _db_url:
    _host = os.getenv("DB_HOST")
    _port = os.getenv("DB_PORT")
    _name = os.getenv("DB_NAME")
    _user = os.getenv("DB_USER")
    _password = os.getenv("DB_PASSWORD")
    if all([_host, _port, _name, _user, _password]):
        _db_url = f"postgresql+asyncpg://{_user}:{_password}@{_host}:{_port}/{_name}"
    else:
        _db_url = "postgresql+asyncpg://postgres:password@localhost:5432/hadoop_fault_db"

DATABASE_URL = _db_url
SYNC_DATABASE_URL = _db_url.replace("postgresql+asyncpg://", "postgresql://")

# JWT Configuration
JWT_SECRET = os.getenv("JWT_SECRET", "dev-secret")
JWT_EXPIRE_MINUTES = int(os.getenv("JWT_EXPIRE_MINUTES", "60"))

# SSH Configuration
SSH_PORT = int(os.getenv("SSH_PORT", "22"))
SSH_TIMEOUT = int(os.getenv("SSH_TIMEOUT", "10"))
ssh_port = SSH_PORT
ssh_timeout = SSH_TIMEOUT

LOG_DIR = os.getenv("HADOOP_LOG_DIR", "/usr/local/hadoop/logs")

@@ -1,15 +0,0 @@
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker, AsyncSession
from .config import DATABASE_URL, APP_TIMEZONE

engine = create_async_engine(
    DATABASE_URL,
    echo=False,
    pool_pre_ping=True,
    connect_args={"server_settings": {"timezone": APP_TIMEZONE}},
)
SessionLocal = async_sessionmaker(engine, expire_on_commit=False, class_=AsyncSession)


async def get_db() -> AsyncSession:
    """Yield an async database session for dependency injection."""
    async with SessionLocal() as session:
        yield session

@@ -1,88 +0,0 @@
from fastapi import Header, HTTPException, Depends
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select, text
from ..db import get_db
from ..models.users import User
from ..config import JWT_SECRET
import jwt
from typing import List


async def get_current_user(authorization: str | None = Header(None), db: AsyncSession = Depends(get_db)):
    if not authorization or not authorization.startswith("Bearer "):
        raise HTTPException(status_code=401, detail="not_authenticated")
    token = authorization[7:]
    try:
        payload = jwt.decode(token, JWT_SECRET, algorithms=["HS256"])
        username = payload.get("sub")
        if not username:
            raise HTTPException(status_code=401, detail="invalid_token")
        result = await db.execute(select(User).where(User.username == username).limit(1))
        user = result.scalars().first()
        if not user:
            # Demo users may not exist in the DB: build a transient user dict
            user_dict = {"username": username, "id": None, "is_active": True}
        else:
            if not user.is_active:
                raise HTTPException(status_code=403, detail="inactive_user")
            user_dict = {"username": user.username, "id": user.id, "is_active": user.is_active}
        # Load the permission list
        perms_res = await db.execute(
            text("""
                SELECT DISTINCT p.permission_key
                FROM permissions p
                JOIN role_permission_mapping rpm ON p.id = rpm.permission_id
                JOIN user_role_mapping urm ON rpm.role_id = urm.role_id
                JOIN users u ON urm.user_id = u.id
                WHERE u.username = :u
                UNION
                -- Built-in roles and their baseline permissions
                SELECT 'cluster:register' AS permission_key
                WHERE (:u = 'admin' OR :u = 'ops' OR :u = 'obs')
                UNION
                SELECT 'cluster:delete' AS permission_key
                WHERE (:u = 'admin' OR :u = 'ops')
                UNION
                SELECT 'cluster:start' AS permission_key
                WHERE (:u = 'admin' OR :u = 'ops')
                UNION
                SELECT 'cluster:stop' AS permission_key
                WHERE (:u = 'admin' OR :u = 'ops')
                UNION
                -- Extra permissions for demo accounts that are not in the DB
                SELECT DISTINCT p.permission_key
                FROM permissions p
                JOIN role_permission_mapping rpm ON p.id = rpm.permission_id
                JOIN roles r ON rpm.role_id = r.id
                WHERE (:u = 'admin' AND r.role_key = 'admin')
                   OR (:u = 'ops' AND r.role_key = 'operator')
                   OR (:u = 'obs' AND r.role_key = 'observer')
            """),
            {"u": username}
        )
        user_dict["permissions"] = [row[0] for row in perms_res.all()]
        return user_dict
    except jwt.ExpiredSignatureError:
        raise HTTPException(status_code=401, detail="token_expired")
    except jwt.InvalidTokenError:
        raise HTTPException(status_code=401, detail="invalid_token")
    except HTTPException:
        # Re-raise HTTP errors produced inside the try block (401/403) unchanged,
        # instead of letting the generic handler turn them into 500s.
        raise
    except Exception as e:
        print(f"Auth error: {e}")
        raise HTTPException(status_code=500, detail="auth_error")


class PermissionChecker:
    def __init__(self, required_permissions: List[str]):
        self.required_permissions = required_permissions

    def __call__(self, user=Depends(get_current_user)):
        user_perms = user.get("permissions", [])
        for perm in self.required_permissions:
            if perm not in user_perms:
                raise HTTPException(
                    status_code=403,
                    detail=f"Permission denied: required {perm}"
                )
        return user
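`PermissionChecker` is a dependency factory: routers construct it with the required keys and FastAPI calls the instance per request. The sketch below is a stdlib stand-in (raising `PermissionError` instead of `HTTPException`) so the check can be exercised without a running app; the `Depends` wiring in the comment is the intended real usage:

```python
class PermissionChecker:
    """Same shape as the class above: a callable dependency that verifies keys."""
    def __init__(self, required_permissions):
        self.required_permissions = required_permissions

    def __call__(self, user):
        missing = [p for p in self.required_permissions
                   if p not in user.get("permissions", [])]
        if missing:
            raise PermissionError(f"Permission denied: required {missing[0]}")
        return user

# In the real routers this is wired via FastAPI, e.g.:
#   @router.post("/clusters",
#                dependencies=[Depends(PermissionChecker(["cluster:register"]))])
checker = PermissionChecker(["cluster:register"])
print(checker({"username": "ops", "permissions": ["cluster:register"]})["username"])  # ops
```

Constructing the checker once per route keeps the permission requirements declarative and visible next to the endpoint definition.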

@ -1,336 +0,0 @@
import threading
import time
import uuid
import datetime
from typing import Dict, List, Optional
from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine, async_sessionmaker, AsyncEngine
from .log_reader import log_reader
from .ssh_utils import ssh_manager
from .db import SessionLocal
from .models.hadoop_logs import HadoopLog
from sqlalchemy import text
import asyncio
from .config import BJ_TZ, DATABASE_URL, APP_TIMEZONE
class LogCollector:
"""Real-time log collector for Hadoop cluster"""
def __init__(self):
self.collectors: Dict[str, threading.Thread] = {}
self.is_running: bool = False
self.collection_interval: int = 5 # 默认采集间隔,单位:秒
self._loops: Dict[str, asyncio.AbstractEventLoop] = {}
self._engines: Dict[str, AsyncEngine] = {}
self._session_locals: Dict[str, async_sessionmaker[AsyncSession]] = {}
self._intervals: Dict[str, int] = {}
self._cluster_name_cache: Dict[str, str] = {}
self._targets: Dict[str, str] = {}
self._line_counts: Dict[str, int] = {}
self.max_bytes_per_pull: int = 256 * 1024
def start_collection(self, node_name: str, log_type: str, ip: Optional[str] = None, interval: Optional[int] = None) -> bool:
"""Start real-time log collection for a specific node and log type"""
collector_id = f"{node_name}_{log_type}"
if interval is not None:
self._intervals[collector_id] = max(1, int(interval))
if collector_id in self.collectors and self.collectors[collector_id].is_alive():
print(f"Collector {collector_id} is already running")
return False
# Start even if the log file does not exist yet; the collector self-checks in its loop
# Create a new collector thread
collector_thread = threading.Thread(
target=self._collect_logs,
args=(node_name, log_type, ip),
name=collector_id,
daemon=True
)
self.collectors[collector_id] = collector_thread
collector_thread.start()
print(f"Started collector {collector_id}")
return True
def stop_collection(self, node_name: str, log_type: str):
"""Stop log collection for a specific node and log type"""
collector_id = f"{node_name}_{log_type}"
if collector_id in self.collectors:
# Threads are daemon, so they will exit when main process exits
# We just remove it from our tracking
del self.collectors[collector_id]
self._intervals.pop(collector_id, None)
print(f"Stopped collector {collector_id}")
else:
print(f"Collector {collector_id} is not running")
def stop_all_collections(self):
"""Stop all log collections"""
for collector_id in list(self.collectors.keys()):
# collector_id is "<node_name>_<log_type>" and node names may contain "_",
# so split from the right to recover the original pair
node_name, log_type = collector_id.rsplit("_", 1)
self.stop_collection(node_name, log_type)
def _parse_log_line(self, line: str, node_name: str, log_type: str):
"""Parse a single log line and return a dictionary of log fields"""
# Extract timestamp from the log line (format: [2023-12-17 10:00:00,123])
timestamp = None
log_level = "INFO" # Default log level
message = line
exception = None
# Simple log parsing logic
if line.startswith('['):
# Extract timestamp
timestamp_end = line.find(']', 1)
if timestamp_end > 0:
timestamp_str = line[1:timestamp_end]
try:
timestamp = datetime.datetime.strptime(timestamp_str, "%Y-%m-%d %H:%M:%S,%f").replace(tzinfo=BJ_TZ)
except ValueError:
# If parsing fails, use current time
timestamp = datetime.datetime.now(BJ_TZ)
# Extract log level
log_levels = ["ERROR", "WARN", "INFO", "DEBUG", "TRACE"]
for level in log_levels:
if f" {level} " in line:
log_level = level
break
return {
"timestamp": timestamp or datetime.datetime.now(BJ_TZ),
"log_level": log_level,
"message": message,
"host": node_name,
"service": log_type,
"raw_log": line
}
async def _save_log_to_db(self, log_data: Dict, collector_id: str | None = None):
"""Save log data to database"""
try:
session_local = self._session_locals.get(collector_id) if collector_id else None
async with (session_local() if session_local else SessionLocal()) as session:
# Look up the cluster name for this host
host = log_data["host"]
cluster_name = self._cluster_name_cache.get(host)
if not cluster_name:
cluster_res = await session.execute(text("""
SELECT c.name
FROM clusters c
JOIN nodes n ON c.id = n.cluster_id
WHERE n.hostname = :hn LIMIT 1
"""), {"hn": host})
cluster_row = cluster_res.first()
cluster_name = cluster_row[0] if cluster_row else "default_cluster"
self._cluster_name_cache[host] = cluster_name
# Create HadoopLog instance
hadoop_log = HadoopLog(
log_time=log_data["timestamp"],
node_host=log_data["host"],
title=log_data["service"],
info=log_data["message"],
cluster_name=cluster_name
)
# Add to session and commit
session.add(hadoop_log)
await session.commit()
except Exception as e:
print(f"Error saving log to database: {e}")
async def _save_logs_to_db_batch(self, logs: List[Dict], collector_id: str | None = None):
"""Save a batch of logs to database in one transaction"""
try:
session_local = self._session_locals.get(collector_id) if collector_id else None
async with (session_local() if session_local else SessionLocal()) as session:
host = logs[0]["host"] if logs else None
cluster_name = self._cluster_name_cache.get(host) if host else None
if host and not cluster_name:
cluster_res = await session.execute(text("""
SELECT c.name
FROM clusters c
JOIN nodes n ON c.id = n.cluster_id
WHERE n.hostname = :hn LIMIT 1
"""), {"hn": host})
cluster_row = cluster_res.first()
cluster_name = cluster_row[0] if cluster_row else "default_cluster"
self._cluster_name_cache[host] = cluster_name
objs: list[HadoopLog] = []
for log_data in logs:
objs.append(HadoopLog(
log_time=log_data["timestamp"],
node_host=log_data["host"],
title=log_data["service"],
info=log_data["message"],
cluster_name=cluster_name or "default_cluster",
))
session.add_all(objs)
await session.commit()
except Exception as e:
print(f"Error batch saving logs: {e}")
def _collect_logs(self, node_name: str, log_type: str, ip: str):
"""Internal method to collect logs continuously"""
print(f"Starting log collection for {node_name}_{log_type}")
collector_id = f"{node_name}_{log_type}"
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
self._loops[collector_id] = loop
engine = create_async_engine(
DATABASE_URL,
echo=False,
pool_pre_ping=True,
connect_args={"server_settings": {"timezone": APP_TIMEZONE}},
pool_size=1,
max_overflow=0,
)
self._engines[collector_id] = engine
self._session_locals[collector_id] = async_sessionmaker(engine, expire_on_commit=False, class_=AsyncSession)
last_remote_size = 0
retry_count = 0
max_retries = 3
while collector_id in self.collectors:
try:
# Wait for next collection interval
interval = self._intervals.get(collector_id, self.collection_interval)
time.sleep(interval)
# Resolve target file once and reuse
target = self._targets.get(collector_id)
if not target:
try:
ssh_client = ssh_manager.get_connection(node_name, ip=ip)
dirs = [
"/opt/module/hadoop-3.1.3/logs",
"/usr/local/hadoop/logs",
"/usr/local/hadoop-3.3.6/logs",
"/usr/local/hadoop-3.3.5/logs",
"/usr/local/hadoop-3.1.3/logs",
"/opt/hadoop/logs",
"/var/log/hadoop",
]
for d in dirs:
out, err = ssh_client.execute_command(f"ls -1 {d} 2>/dev/null")
if not err and out.strip():
for fn in out.splitlines():
f = fn.lower()
if log_type in f and node_name in f:
target = f"{d}/{fn}"
break
if target:
break
if target:
self._targets[collector_id] = target
except Exception:
target = None
if not target:
print(f"Log file {node_name}_{log_type} not found, will retry")
retry_count += 1
continue
ssh_client = ssh_manager.get_connection(node_name, ip=ip)
size_out, size_err = ssh_client.execute_command(f"stat -c %s {target} 2>/dev/null")
if size_err:
retry_count += 1
continue
try:
remote_size = int((size_out or "").strip())
except Exception:
retry_count += 1
continue
if remote_size < last_remote_size:
last_remote_size = 0
if remote_size > last_remote_size:
delta = remote_size - last_remote_size
if delta > self.max_bytes_per_pull:
start_pos = remote_size - self.max_bytes_per_pull + 1
last_remote_size = remote_size - self.max_bytes_per_pull
else:
start_pos = last_remote_size + 1
out2, err2 = ssh_client.execute_command(f"tail -c +{start_pos} {target} 2>/dev/null")
if err2:
out2, err2 = ssh_client.execute_command(f"dd if={target} bs=1 skip={max(0, start_pos - 1)} 2>/dev/null")
if not err2 and out2 and out2.strip():
self._save_log_chunk(node_name, log_type, out2)
print(f"Collected new logs from {node_name}_{log_type} bytes={len(out2)}")
last_remote_size = remote_size
# Reset retry count on successful collection
retry_count = 0
except Exception as e:
print(f"Error collecting logs from {node_name}_{log_type}: {e}")
retry_count += 1
if retry_count > max_retries:
print(f"Max retries reached for {node_name}_{log_type}, stopping collection")
self.stop_collection(node_name, log_type)
break
print(f"Retrying in {self.collection_interval * 2} seconds... ({retry_count}/{max_retries})")
try:
loop = self._loops.pop(collector_id, None)
engine = self._engines.pop(collector_id, None)
self._session_locals.pop(collector_id, None)
if engine and loop:
loop.run_until_complete(engine.dispose())
if loop and loop.is_running():
loop.stop()
if loop:
loop.close()
except Exception:
pass
def _save_log_chunk(self, node_name: str, log_type: str, content: str):
"""Save a chunk of log content to database"""
# Split content into lines
lines = content.splitlines()
# Parse each line and save to database
log_batch: List[Dict] = []
for line in lines:
if line.strip():
log_data = self._parse_log_line(line, node_name, log_type)
log_batch.append(log_data)
if not log_batch:
return
collector_id = f"{node_name}_{log_type}"
loop = self._loops.get(collector_id)
if loop:
loop.run_until_complete(self._save_logs_to_db_batch(log_batch, collector_id=collector_id))
else:
asyncio.run(self._save_logs_to_db_batch(log_batch))
def get_collectors_status(self) -> Dict[str, bool]:
"""Get the status of all collectors"""
status = {}
for collector_id, thread in self.collectors.items():
status[collector_id] = thread.is_alive()
return status
def set_collection_interval(self, interval: int):
"""Set the collection interval"""
self.collection_interval = max(1, interval) # Ensure interval is at least 1 second
for k in list(self._intervals.keys()):
self._intervals[k] = self.collection_interval
print(f"Set collection interval to {self.collection_interval} seconds")
def set_log_dir(self, log_dir: str):
"""Set the log directory (deprecated, logs are now stored in database)"""
print(f"Warning: set_log_dir is deprecated. Logs are now stored in the database, not in local directory: {log_dir}")
# Create a global log collector instance
log_collector = LogCollector()

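The timestamp/level extraction in `_parse_log_line` above can be illustrated with a minimal, dependency-free sketch (timezone handling and the project's `BJ_TZ` fallback are omitted here; the helper name `parse_line` is illustrative):

```python
import datetime

LOG_LEVELS = ("ERROR", "WARN", "INFO", "DEBUG", "TRACE")

def parse_line(line: str) -> dict:
    """Extract the timestamp and level from a '[YYYY-MM-DD HH:MM:SS,mmm] LEVEL ...' line."""
    timestamp = None
    level = "INFO"  # default when no level token is found
    if line.startswith("["):
        end = line.find("]", 1)
        if end > 0:
            try:
                # %f accepts the 3-digit millisecond suffix after the comma
                timestamp = datetime.datetime.strptime(line[1:end], "%Y-%m-%d %H:%M:%S,%f")
            except ValueError:
                pass  # malformed timestamp: leave as None
    for lv in LOG_LEVELS:
        if f" {lv} " in line:  # level token surrounded by spaces
            level = lv
            break
    return {"timestamp": timestamp, "log_level": level, "message": line}

sample = "[2023-12-17 10:00:00,123] ERROR datanode failed heartbeat"
parsed = parse_line(sample)
```

Checking level tokens in priority order (ERROR first) means a line mentioning several levels is classified by the most severe one, matching the loop in the collector.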
@ -1,202 +0,0 @@
from typing import List, Dict, Optional
from .config import LOG_DIR
from .ssh_utils import ssh_manager
class LogReader:
"""Log Reader for Hadoop cluster nodes"""
def __init__(self):
self.log_dir = LOG_DIR
self._node_log_dir: Dict[str, str] = {}
self._candidates = [
"/usr/local/hadoop/logs",
"/opt/hadoop/logs",
"/usr/local/hadoop-3.3.6/logs",
"/usr/local/hadoop-3.3.5/logs",
"/usr/local/hadoop-3.1.3/logs",
"/opt/module/hadoop-3.1.3/logs",
"/var/log/hadoop",
]
def get_log_file_path(self, node_name: str, log_type: str) -> str:
"""Generate log file path based on node name and log type"""
# Map log type to actual log file name
log_file_map = {
"namenode": "hadoop-hadoop-namenode",
"datanode": "hadoop-hadoop-datanode",
"resourcemanager": "hadoop-hadoop-resourcemanager",
"nodemanager": "hadoop-hadoop-nodemanager",
"historyserver": "hadoop-hadoop-historyserver"
}
# Get the base log file name
base_name = log_file_map.get(log_type.lower(), log_type.lower())
# Generate full log file path
return f"{self.log_dir}/{base_name}-{node_name.replace('_', '')}.log"
def read_log(self, node_name: str, log_type: str, ip: str) -> str:
"""Read log from a specific node"""
# Ensure working log dir
self.find_working_log_dir(node_name, ip)
paths = self.get_log_file_paths(node_name, log_type)
# Get SSH connection
ssh_client = ssh_manager.get_connection(node_name, ip=ip)
# Read log file content
# try direct candidates
for p in paths:
out, err = ssh_client.execute_command(f"ls -la {p} 2>/dev/null")
if not err and out.strip():
out, err = ssh_client.execute_command(f"cat {p} 2>/dev/null")
if not err:
return out
# resolve by directory listing
base_dir = self._node_log_dir.get(node_name, self.log_dir)
out, err = ssh_client.execute_command(f"ls -la {base_dir} 2>/dev/null")
if not err and out.strip():
for line in out.splitlines():
parts = line.split()
if parts:
fn = parts[-1]
lf = fn.lower()
if log_type in lf and node_name in lf and (lf.endswith(".log") or lf.endswith(".out") or lf.endswith(".out.1")):
out2, err2 = ssh_client.execute_command(f"cat {base_dir}/{fn} 2>/dev/null")
if not err2:
return out2
raise FileNotFoundError(f"No log file found for {node_name} ({log_type})")
def read_all_nodes_log(self, nodes: List[Dict[str, str]], log_type: str) -> Dict[str, str]:
"""Read log from all nodes"""
logs = {}
for node in nodes:
node_name = node['name']
ip = node.get('ip')
if not ip:
logs[node_name] = "Error: IP address not found"
continue
try:
logs[node_name] = self.read_log(node_name, log_type, ip)
except Exception as e:
logs[node_name] = f"Error reading log: {str(e)}"
return logs
def filter_log_by_date(self, log_content: str, start_date: str, end_date: str) -> str:
"""Filter log content by date range"""
filtered_lines = []
for line in log_content.splitlines():
# Check if line contains date in the format [YYYY-MM-DD HH:MM:SS,mmm]
if line.startswith('['):
# Extract date part
date_str = line[1:11] # Get YYYY-MM-DD part
if start_date <= date_str <= end_date:
filtered_lines.append(line)
return '\n'.join(filtered_lines)
def get_log_files_list(self, node_name: str, ip: Optional[str] = None) -> List[str]:
"""Get list of log files on a specific node"""
# Ensure working log dir
if ip:
self.find_working_log_dir(node_name, ip)
ssh_client = ssh_manager.get_connection(node_name, ip=ip)
# Execute command to list log files from available directories
dirs = [self._node_log_dir.get(node_name, self.log_dir)] + self._candidates
stdout = ""
for d in dirs:
out, err = ssh_client.execute_command(f"ls -1 {d} 2>/dev/null")
if not err and out.strip():
stdout = out
self._node_log_dir[node_name] = d
break
# Parse log files from output
log_files = []
if stdout.strip():
for line in stdout.splitlines():
name = line.strip()
if name.endswith(".log") or name.endswith(".out") or name.endswith(".out.1"):
log_files.append(name)
return log_files
def check_log_file_exists(self, node_name: str, log_type: str, ip: Optional[str] = None) -> bool:
"""Check if log file exists on a specific node"""
# Ensure working log dir
if ip:
self.find_working_log_dir(node_name, ip)
paths = self.get_log_file_paths(node_name, log_type)
# Get SSH connection
ssh_client = ssh_manager.get_connection(node_name, ip=ip)
try:
# Execute command to check if file exists
for p in paths:
stdout, stderr = ssh_client.execute_command(f"ls -la {p} 2>/dev/null")
if not stderr and stdout.strip():
return True
base_dir = self._node_log_dir.get(node_name, self.log_dir)
stdout, stderr = ssh_client.execute_command(f"ls -la {base_dir} 2>/dev/null")
if not stderr and stdout.strip():
for line in stdout.splitlines():
parts = line.split()
if parts:
fn = parts[-1].lower()
if log_type in fn and node_name in fn and (fn.endswith(".log") or fn.endswith(".out") or fn.endswith(".out.1")):
return True
return False
except Exception as e:
print(f"Error checking log file existence: {e}")
return False
def get_node_services(self, node_name: str) -> List[str]:
"""Get list of running services on a node based on log files"""
# Get all log files
log_files = self.get_log_files_list(node_name)
# Extract service types from log file names
services = []
for log_file in log_files:
if "namenode" in log_file:
services.append("namenode")
elif "datanode" in log_file:
services.append("datanode")
elif "resourcemanager" in log_file:
services.append("resourcemanager")
elif "nodemanager" in log_file:
services.append("nodemanager")
elif "secondarynamenode" in log_file:
services.append("secondarynamenode")
# Remove duplicates
return list(set(services))
def find_working_log_dir(self, node_name: str, ip: str) -> str:
"""Detect a working log directory on remote node and set it"""
ssh_client = ssh_manager.get_connection(node_name, ip=ip)
# try current
current = self._node_log_dir.get(node_name, self.log_dir)
stdout, stderr = ssh_client.execute_command(f"ls -la {current}")
if not stderr and stdout.strip():
self._node_log_dir[node_name] = current
return current
for d in [current] + self._candidates:
stdout, stderr = ssh_client.execute_command(f"ls -la {d} 2>/dev/null")
if not stderr and stdout.strip():
self._node_log_dir[node_name] = d
return d
self._node_log_dir[node_name] = self.log_dir
return self._node_log_dir[node_name]
def get_log_file_paths(self, node_name: str, log_type: str) -> List[str]:
base_dir = self._node_log_dir.get(node_name, self.log_dir)
base = f"{base_dir}/hadoop-hadoop-{log_type}-{node_name}"
return [f"{base}.log", f"{base}.out", f"{base}.out.1"]
# Create a global LogReader instance
log_reader = LogReader()

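The date filtering in `filter_log_by_date` relies on the fact that `YYYY-MM-DD` strings compare lexicographically in chronological order. A standalone sketch of that logic (function name `filter_by_date` is illustrative):

```python
def filter_by_date(log_content: str, start_date: str, end_date: str) -> str:
    """Keep lines whose leading [YYYY-MM-DD ...] date falls in the inclusive range."""
    kept = []
    for line in log_content.splitlines():
        if line.startswith("["):
            date_str = line[1:11]  # the YYYY-MM-DD part
            # lexicographic comparison == chronological comparison for this format
            if start_date <= date_str <= end_date:
                kept.append(line)
    return "\n".join(kept)

logs = "\n".join([
    "[2023-12-16 09:00:00,000] INFO ok",
    "[2023-12-17 10:00:00,123] ERROR bad",
    "[2023-12-18 11:00:00,456] WARN late",
])
out = filter_by_date(logs, "2023-12-17", "2023-12-17")
```

Note that continuation lines (stack traces without a leading `[`) are dropped by this filter, which matches the behavior of the method above.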
@ -1,51 +0,0 @@
from fastapi import FastAPI, Request, status
from fastapi.responses import JSONResponse
from fastapi.exceptions import RequestValidationError
from fastapi.middleware.cors import CORSMiddleware
from .routers import auth, health, secure, users, clusters, nodes, metrics, faults, ops, ai, hadoop_logs, sys_exec_logs, hadoop_exec_logs
import os
app = FastAPI(title="Hadoop Fault Detecting API", version="v1")
@app.exception_handler(RequestValidationError)
async def validation_exception_handler(request: Request, exc: RequestValidationError):
"""
Convert Pydantic validation errors into a format the frontend can parse more easily.
"""
errors = []
for error in exc.errors():
field = error.get("loc")[-1] if error.get("loc") else "unknown"
msg = error.get("msg")
errors.append({
"field": field,
"message": f"{field}: {msg}",
"code": error.get("type")
})
return JSONResponse(
status_code=status.HTTP_400_BAD_REQUEST,
content={"detail": {"errors": errors, "message": "请求参数校验失败"}}
)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=False,
allow_methods=["*"],
allow_headers=["*"],
)
app.include_router(health.router, prefix="/api/v1")
app.include_router(auth.router, prefix="/api/v1")
app.include_router(secure.router, prefix="/api/v1")
app.include_router(clusters.router, prefix="/api/v1")
app.include_router(nodes.router, prefix="/api/v1")
app.include_router(metrics.router, prefix="/api/v1")
app.include_router(users.router, prefix="/api/v1")
app.include_router(hadoop_logs.router, prefix="/api/v1")
app.include_router(faults.router, prefix="/api/v1")
app.include_router(hadoop_exec_logs.router, prefix="/api/v1")
app.include_router(ops.router, prefix="/api/v1")
app.include_router(ai.router, prefix="/api/v1")
app.include_router(sys_exec_logs.router, prefix="/api/v1")

@ -1,121 +0,0 @@
import threading
import time
import datetime
import time as _time
from typing import Dict, List, Optional, Tuple
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession
from .ssh_utils import ssh_manager
from .db import SessionLocal
from .models.nodes import Node
import asyncio
from .config import BJ_TZ
class MetricsCollector:
def __init__(self):
self.collectors: Dict[str, threading.Thread] = {}
self.collection_interval: int = 5
self.last_errors: Dict[str, str] = {}
self._columns_cache: Dict[str, set] = {}
self._cluster_avg_inited: bool = False
def set_collection_interval(self, interval: int):
self.collection_interval = max(1, interval)
def get_collectors_status(self) -> Dict[str, bool]:
status = {}
for cid, t in self.collectors.items():
status[cid] = t.is_alive()
return status
def get_errors(self) -> Dict[str, str]:
return dict(self.last_errors)
def stop_all(self):
for cid in list(self.collectors.keys()):
self.stop(cid)
def stop(self, collector_id: str):
if collector_id in self.collectors:
del self.collectors[collector_id]
if collector_id in self.last_errors:
del self.last_errors[collector_id]
def start_for_nodes(self, nodes: List[Tuple[int, str, str, int]], interval: Optional[int] = None) -> Tuple[int, List[str]]:
if interval is not None:
self.set_collection_interval(interval)
started: List[str] = []
for nid, hn, ip, cid in nodes:
cid_str = hn
if cid_str in self.collectors and self.collectors[cid_str].is_alive():
continue
t = threading.Thread(target=self._collect_node_metrics, args=(nid, hn, ip, cid), name=f"metrics_{hn}", daemon=True)
self.collectors[cid_str] = t
t.start()
started.append(hn)
return len(started), started
def _read_cpu_mem(self, node_name: str, ip: str) -> Tuple[float, float]:
ssh_client = ssh_manager.get_connection(node_name, ip=ip)
out1, err1 = ssh_client.execute_command("cat /proc/stat | head -n 1")
_time.sleep(0.5)
out2, err2 = ssh_client.execute_command("cat /proc/stat | head -n 1")
cpu_pct = 0.0
if not err1 and not err2 and out1.strip() and out2.strip():
p1 = out1.strip().split()
p2 = out2.strip().split()
v1 = [int(x) for x in p1[1:]]
v2 = [int(x) for x in p2[1:]]
get1 = lambda i: (v1[i] if i < len(v1) else 0)
get2 = lambda i: (v2[i] if i < len(v2) else 0)
idle = (get2(3) + get2(4)) - (get1(3) + get1(4))
total = (get2(0) - get1(0)) + (get2(1) - get1(1)) + (get2(2) - get1(2)) + idle + (get2(5) - get1(5)) + (get2(6) - get1(6)) + (get2(7) - get1(7))
if total > 0:
cpu_pct = round((1.0 - idle / total) * 100.0, 2)
outm, errm = ssh_client.execute_command("cat /proc/meminfo")
mem_pct = 0.0
if not errm and outm.strip():
mt = 0
ma = 0
for line in outm.splitlines():
if line.startswith("MemTotal:"):
mt = int(line.split()[1])
elif line.startswith("MemAvailable:"):
ma = int(line.split()[1])
if mt > 0:
mem_pct = round((1.0 - (ma / mt)) * 100.0, 2)
return cpu_pct, mem_pct
async def _save_metrics(self, node_id: int, hostname: str, cluster_id: int, cpu: float, mem: float):
# The engine bound to SessionLocal may have been initialized under the main thread's event loop;
# using it from the fresh loop created by asyncio.run() raises a loop-conflict error,
# so bind an AsyncSession directly to the engine instead.
from .db import engine
async with AsyncSession(engine) as session:
now = datetime.datetime.now(BJ_TZ)
await session.execute(text("UPDATE nodes SET cpu_usage=:cpu, memory_usage=:mem, last_heartbeat=:hb WHERE id=:nid"), {"cpu": cpu, "mem": mem, "hb": now, "nid": node_id})
await session.commit()
def _collect_node_metrics(self, node_id: int, hostname: str, ip: str, cluster_id: int):
cid = hostname
while cid in self.collectors:
try:
cpu, mem = self._read_cpu_mem(hostname, ip)
asyncio.run(self._save_metrics(node_id, hostname, cluster_id, cpu, mem))
except Exception as e:
self.last_errors[cid] = str(e)
time.sleep(self.collection_interval)
async def _get_table_columns(self, session: AsyncSession, table_name: str) -> set:
if table_name in self._columns_cache:
return self._columns_cache[table_name]
res = await session.execute(text("""
SELECT column_name
FROM information_schema.columns
WHERE table_name = :t
"""), {"t": table_name})
cols = set(r[0] for r in res.all())
self._columns_cache[table_name] = cols
return cols
metrics_collector = MetricsCollector()

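The CPU math in `_read_cpu_mem` samples the first `/proc/stat` row twice and computes busy time as one minus the idle share of elapsed jiffies. A self-contained sketch of that calculation under the same field layout assumption (`user nice system idle iowait irq softirq steal`):

```python
def cpu_percent(sample1: list[int], sample2: list[int]) -> float:
    """CPU busy %% between two /proc/stat 'cpu' rows (fields after the 'cpu' label)."""
    def field(v: list[int], i: int) -> int:
        return v[i] if i < len(v) else 0  # missing trailing fields count as 0

    # idle time = idle + iowait deltas; total = sum of the first 8 field deltas
    idle = (field(sample2, 3) + field(sample2, 4)) - (field(sample1, 3) + field(sample1, 4))
    total = sum(field(sample2, i) - field(sample1, i) for i in range(8))
    if total <= 0:
        return 0.0
    return round((1.0 - idle / total) * 100.0, 2)

# 100 total jiffies elapsed, 80 of them idle -> 20% busy
s1 = [100, 0, 100, 800, 0, 0, 0, 0]
s2 = [110, 0, 110, 880, 0, 0, 0, 0]
pct = cpu_percent(s1, s2)
```

The 0.5 s sleep between the two samples in the collector trades accuracy for latency; a longer window smooths out short spikes.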
@ -1,4 +0,0 @@
from sqlalchemy.orm import DeclarativeBase
class Base(DeclarativeBase):
pass

@ -1,30 +0,0 @@
from sqlalchemy import Column, Integer, String, Text, DateTime, ForeignKey, Boolean
from sqlalchemy.orm import relationship
from datetime import datetime
from ..config import BJ_TZ
from . import Base
class ChatSession(Base):
__tablename__ = "chat_sessions"
id = Column(String, primary_key=True, index=True) # UUID
user_id = Column(Integer, nullable=True, index=True) # Can be linked to a user
title = Column(String, nullable=True)
created_at = Column(DateTime(timezone=True), default=lambda: datetime.now(BJ_TZ))
updated_at = Column(DateTime(timezone=True), default=lambda: datetime.now(BJ_TZ), onupdate=lambda: datetime.now(BJ_TZ))
messages = relationship("ChatMessage", back_populates="session", cascade="all, delete-orphan", lazy="selectin")
class ChatMessage(Base):
__tablename__ = "chat_messages"
id = Column(Integer, primary_key=True, index=True)
session_id = Column(String, ForeignKey("chat_sessions.id"), nullable=False)
role = Column(String, nullable=False) # system, user, assistant, tool
content = Column(Text, nullable=False)
created_at = Column(DateTime(timezone=True), default=lambda: datetime.now(BJ_TZ))
# Optional: store tool calls or extra metadata if needed
# For now, we store JSON in content if it's complex, or just text.
session = relationship("ChatSession", back_populates="messages")

@ -1,13 +0,0 @@
from sqlalchemy.orm import Mapped, mapped_column
from sqlalchemy import String, Integer, Float, TIMESTAMP
from . import Base
class ClusterMetric(Base):
__tablename__ = "cluster_metrics"
id: Mapped[int] = mapped_column(primary_key=True, autoincrement=True)
cluster_id: Mapped[int] = mapped_column()
cluster_name: Mapped[str] = mapped_column(String(100))
cpu_avg: Mapped[float] = mapped_column(Float)
memory_avg: Mapped[float] = mapped_column(Float)
created_at: Mapped[str] = mapped_column(TIMESTAMP(timezone=True))

@ -1,45 +0,0 @@
from sqlalchemy.orm import Mapped, mapped_column
from sqlalchemy import String, Integer, Float, TIMESTAMP
from sqlalchemy.dialects.postgresql import UUID, JSONB, INET
from . import Base
class Cluster(Base):
__tablename__ = "clusters"
id: Mapped[int] = mapped_column(primary_key=True)
uuid: Mapped[str] = mapped_column(UUID(as_uuid=False), unique=True)
name: Mapped[str] = mapped_column(String(100), unique=True)
type: Mapped[str] = mapped_column(String(50))
node_count: Mapped[int] = mapped_column(Integer, default=0)
health_status: Mapped[str] = mapped_column(String(20), default="unknown")
cpu_avg: Mapped[float | None] = mapped_column(Float, nullable=True)
memory_avg: Mapped[float | None] = mapped_column(Float, nullable=True)
namenode_ip: Mapped[str | None] = mapped_column(INET, nullable=True)
namenode_psw: Mapped[str | None] = mapped_column(String(255), nullable=True)
rm_ip: Mapped[str | None] = mapped_column(INET, nullable=True)
rm_psw: Mapped[str | None] = mapped_column(String(255), nullable=True)
description: Mapped[str | None] = mapped_column(String, nullable=True)
config_info: Mapped[dict | None] = mapped_column(JSONB, nullable=True)
created_at: Mapped[str] = mapped_column(TIMESTAMP(timezone=True))
updated_at: Mapped[str] = mapped_column(TIMESTAMP(timezone=True))
def to_dict(self) -> dict:
"""Convert the cluster object into a serializable dict."""
return {
"id": self.id,
"uuid": self.uuid,
"name": self.name,
"type": self.type,
"node_count": self.node_count,
"health_status": self.health_status,
"cpu_avg": self.cpu_avg,
"memory_avg": self.memory_avg,
"namenode_ip": (str(self.namenode_ip) if self.namenode_ip else None),
"namenode_psw": self.namenode_psw,
"rm_ip": (str(self.rm_ip) if self.rm_ip else None),
"rm_psw": self.rm_psw,
"description": self.description,
"config_info": self.config_info,
"created_at": self.created_at.isoformat() if self.created_at else None,
"updated_at": self.updated_at.isoformat() if self.updated_at else None,
}

@ -1,38 +0,0 @@
from sqlalchemy.orm import Mapped, mapped_column
from sqlalchemy import String
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy import TIMESTAMP
from . import Base
class FaultRecord(Base):
__tablename__ = "fault_records"
id: Mapped[int] = mapped_column(primary_key=True)
fault_id: Mapped[str] = mapped_column(String(32), unique=True)
cluster_id: Mapped[int | None] = mapped_column(nullable=True)
fault_type: Mapped[str] = mapped_column(String(50))
fault_level: Mapped[str] = mapped_column(String(20), default="medium")
title: Mapped[str] = mapped_column(String(200))
description: Mapped[str | None] = mapped_column(String, nullable=True)
affected_nodes: Mapped[dict | None] = mapped_column(JSONB, nullable=True)
affected_clusters: Mapped[dict | None] = mapped_column(JSONB, nullable=True)
root_cause: Mapped[str | None] = mapped_column(String, nullable=True)
repair_suggestion: Mapped[str | None] = mapped_column(String, nullable=True)
status: Mapped[str] = mapped_column(String(20), default="detected")
assignee: Mapped[str | None] = mapped_column(String(50), nullable=True)
reporter: Mapped[str] = mapped_column(String(50), default="system")
created_at: Mapped[str] = mapped_column(TIMESTAMP(timezone=True))
updated_at: Mapped[str] = mapped_column(TIMESTAMP(timezone=True))
resolved_at: Mapped[str | None] = mapped_column(TIMESTAMP(timezone=True), nullable=True)
def to_dict(self) -> dict:
"""Convert the fault record into a serializable dict."""
return {
"fault_id": self.fault_id,
"cluster_id": self.cluster_id,
"fault_type": self.fault_type,
"fault_level": self.fault_level,
"title": self.title,
"status": self.status,
"created_at": self.created_at.isoformat() if self.created_at else None,
}

@ -1,23 +0,0 @@
from sqlalchemy.orm import Mapped, mapped_column
from sqlalchemy import String, Integer, Text, TIMESTAMP, ForeignKey
from . import Base
class HadoopExecLog(Base):
__tablename__ = "hadoop_exec_logs"
id: Mapped[int] = mapped_column(primary_key=True, autoincrement=True)
from_user_id: Mapped[int] = mapped_column(Integer, nullable=False)
cluster_name: Mapped[str] = mapped_column(String(255), nullable=False)
description: Mapped[str | None] = mapped_column(Text, nullable=True)
start_time: Mapped[str | None] = mapped_column(TIMESTAMP(timezone=True), nullable=True)
end_time: Mapped[str | None] = mapped_column(TIMESTAMP(timezone=True), nullable=True)
def to_dict(self) -> dict:
return {
"id": self.id,
"from_user_id": self.from_user_id,
"cluster_name": self.cluster_name,
"description": self.description,
"start_time": self.start_time.isoformat() if self.start_time else None,
"end_time": self.end_time.isoformat() if self.end_time else None,
}

@ -1,23 +0,0 @@
from sqlalchemy.orm import Mapped, mapped_column
from sqlalchemy import String, Integer, Text, TIMESTAMP
from . import Base
class HadoopLog(Base):
__tablename__ = "hadoop_logs"
log_id: Mapped[int] = mapped_column(primary_key=True, autoincrement=True)
cluster_name: Mapped[str] = mapped_column(String(255), nullable=False)
node_host: Mapped[str] = mapped_column(String(100), nullable=False)
title: Mapped[str | None] = mapped_column(String(255), nullable=True)
info: Mapped[str | None] = mapped_column(Text, nullable=True)
log_time: Mapped[str] = mapped_column(TIMESTAMP(timezone=True), nullable=False)
def to_dict(self) -> dict:
return {
"log_id": self.log_id,
"cluster_name": self.cluster_name,
"node_host": self.node_host,
"title": self.title,
"info": self.info,
"log_time": self.log_time.isoformat() if self.log_time else None,
}

@ -1,14 +0,0 @@
from sqlalchemy.orm import Mapped, mapped_column
from sqlalchemy import String, Integer, Float, TIMESTAMP
from . import Base
class NodeMetric(Base):
__tablename__ = "node_metrics"
id: Mapped[int] = mapped_column(primary_key=True, autoincrement=True)
cluster_id: Mapped[int] = mapped_column()
node_id: Mapped[int] = mapped_column()
hostname: Mapped[str] = mapped_column(String(100))
cpu_usage: Mapped[float] = mapped_column(Float)
memory_usage: Mapped[float] = mapped_column(Float)
created_at: Mapped[str] = mapped_column(TIMESTAMP(timezone=True))

@ -1,24 +0,0 @@
from sqlalchemy.orm import Mapped, mapped_column
from sqlalchemy import String
from sqlalchemy.dialects.postgresql import UUID, INET
from sqlalchemy import TIMESTAMP, Float
from . import Base
class Node(Base):
__tablename__ = "nodes"
id: Mapped[int] = mapped_column(primary_key=True)
uuid: Mapped[str] = mapped_column(UUID(as_uuid=False), unique=True)
cluster_id: Mapped[int] = mapped_column()
hostname: Mapped[str] = mapped_column(String(100))
ip_address: Mapped[str] = mapped_column(INET)
ssh_user: Mapped[str | None] = mapped_column(String(50), nullable=True)
ssh_password: Mapped[str | None] = mapped_column(String(255), nullable=True)
# description: Mapped[str | None] = mapped_column(String, nullable=True)
status: Mapped[str] = mapped_column(String(20), default="unknown")
cpu_usage: Mapped[float | None] = mapped_column(Float, nullable=True)
memory_usage: Mapped[float | None] = mapped_column(Float, nullable=True)
disk_usage: Mapped[float | None] = mapped_column(Float, nullable=True)
last_heartbeat: Mapped[str | None] = mapped_column(TIMESTAMP(timezone=True), nullable=True)
created_at: Mapped[str] = mapped_column(TIMESTAMP(timezone=True))
updated_at: Mapped[str] = mapped_column(TIMESTAMP(timezone=True))

@ -1,20 +0,0 @@
from sqlalchemy.orm import Mapped, mapped_column
from sqlalchemy import Integer, Text, TIMESTAMP, ForeignKey, text
from sqlalchemy.dialects.postgresql import UUID
from . import Base
class SysExecLog(Base):
__tablename__ = "sys_exec_logs"
operation_id: Mapped[str] = mapped_column(UUID(as_uuid=True), primary_key=True, server_default=text("uuid_generate_v4()"))
user_id: Mapped[int] = mapped_column(Integer, ForeignKey("users.id"), nullable=False)
description: Mapped[str] = mapped_column(Text, nullable=False)
operation_time: Mapped[str] = mapped_column(TIMESTAMP(timezone=True), nullable=False, server_default=text("now()"))
def to_dict(self) -> dict:
return {
"operation_id": str(self.operation_id),
"user_id": self.user_id,
"description": self.description,
"operation_time": self.operation_time.isoformat() if self.operation_time else None,
}

@ -1,18 +0,0 @@
from datetime import datetime

from sqlalchemy.orm import Mapped, mapped_column
from sqlalchemy import String, Boolean, TIMESTAMP
from . import Base


class User(Base):
    __tablename__ = "users"

    id: Mapped[int] = mapped_column(primary_key=True)
    username: Mapped[str] = mapped_column(String(50), unique=True)
    email: Mapped[str] = mapped_column(String(100), unique=True)
    password_hash: Mapped[str] = mapped_column(String(255))
    full_name: Mapped[str] = mapped_column(String(100))
    is_active: Mapped[bool] = mapped_column(Boolean, default=True)
    sort: Mapped[int] = mapped_column(default=0)
    last_login: Mapped[datetime | None] = mapped_column(TIMESTAMP(timezone=True), nullable=True)
    created_at: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True))
    updated_at: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True))

@ -1,259 +0,0 @@
from fastapi import APIRouter, Depends, HTTPException
from fastapi.responses import StreamingResponse
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select, func, text
from pydantic import BaseModel, Field
import os
import json
import uuid
from ..db import get_db
from ..deps.auth import get_current_user
from ..models.hadoop_logs import HadoopLog
from ..models.chat import ChatSession, ChatMessage
from ..agents.diagnosis_agent import run_diagnose_and_repair
from ..services.llm import LLMClient
from ..services.ops_tools import openai_tools_schema, tool_web_search, tool_start_cluster, tool_stop_cluster, tool_read_log, tool_read_cluster_log, tool_detect_cluster_faults, tool_run_cluster_command
router = APIRouter()
class DiagnoseRepairReq(BaseModel):
cluster: str | None = Field(None, description="集群UUID")
node: str | None = Field(None, description="节点主机名")
timeFrom: str | None = Field(None, description="ISO起始时间")
keywords: str | None = Field(None, description="关键词")
auto: bool = Field(True, description="是否允许自动修复")
maxSteps: int = Field(3, ge=1, le=6, description="最多工具步数")
model: str | None = Field(None, description="使用的模型")
class ChatReq(BaseModel):
sessionId: str = Field(..., description="会话ID")
message: str = Field(..., description="用户输入")
stream: bool = Field(False, description="是否使用流式输出")
context: dict | None = Field(None, description="上下文包含node, agent, model等")
class HistoryReq(BaseModel):
sessionId: str
def _get_username(u) -> str:
return getattr(u, "username", None) or (u.get("username") if isinstance(u, dict) else None) or "system"
def _get_internal_session_id(user, session_id: str) -> str:
uname = _get_username(user)
return f"{uname}:{session_id}"
@router.post("/ai/diagnose-repair")
async def diagnose_repair(req: DiagnoseRepairReq, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
try:
        # Aggregate a brief log context from the structured logs
        filters = []
        if req.node:
            filters.append(HadoopLog.node_host == req.node)
        if req.keywords:
            # Simplified: match the keyword against the info column only
            filters.append(HadoopLog.info.ilike(f"%{req.keywords}%"))
        stmt = select(HadoopLog)
        for f in filters:
            stmt = stmt.where(f)
        stmt = stmt.order_by(HadoopLog.log_time.desc()).limit(100)
rows = (await db.execute(stmt)).scalars().all()
ctx_logs = [r.to_dict() for r in rows[:50]]
context = {"cluster": req.cluster, "node": req.node, "logs": ctx_logs}
uname = _get_username(user)
result = await run_diagnose_and_repair(db, uname, context, auto=req.auto, max_steps=req.maxSteps, model=req.model)
return result
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")
@router.get("/ai/history")
async def get_history(sessionId: str, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
    """Return the message history for a session."""
internal_id = _get_internal_session_id(user, sessionId)
stmt = select(ChatMessage).where(ChatMessage.session_id == internal_id).order_by(ChatMessage.created_at.asc())
rows = (await db.execute(stmt)).scalars().all()
messages = [{"role": r.role, "content": r.content} for r in rows]
return {"messages": messages}
@router.post("/ai/chat")
async def ai_chat(req: ChatReq, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
try:
internal_id = _get_internal_session_id(user, req.sessionId)
user_id = user.get("id") if isinstance(user, dict) else getattr(user, "id", None)
session_stmt = select(ChatSession).where(ChatSession.id == internal_id)
session = (await db.execute(session_stmt)).scalars().first()
if not session:
session = ChatSession(id=internal_id, user_id=user_id, title=req.message[:20])
db.add(session)
system_prompt = (
"你是 Hadoop 运维诊断助手。输出中文,优先给出根因、影响范围、证据与建议。"
"当用户询问“故障/异常/报错/不可用/打不开/任务失败”等问题时,优先调用 detect_cluster_faults"
"必要时再用 read_cluster_log 补充读取对应组件日志。"
"当用户询问进程/端口/资源/版本等日常运维信息时,优先调用 run_cluster_command例如 jps/df/free/hdfs_report/yarn_node_list"
)
if req.context:
if req.context.get("agent"):
system_prompt += f" Your name is {req.context['agent']}."
if req.context.get("node"):
system_prompt += f" You are currently analyzing node: {req.context['node']}."
hist_stmt = select(ChatMessage).where(ChatMessage.session_id == internal_id).order_by(ChatMessage.created_at.desc()).limit(12)
hist_rows = (await db.execute(hist_stmt)).scalars().all()
hist_rows = hist_rows[::-1]
messages = [{"role": "system", "content": system_prompt}]
for r in hist_rows:
messages.append({"role": r.role, "content": r.content})
messages.append({"role": "user", "content": req.message})
user_msg = ChatMessage(session_id=internal_id, role="user", content=req.message)
db.add(user_msg)
llm = LLMClient()
target_model = req.context.get("model") if req.context else None
        # Load every available ops tool by default
        chat_tools = openai_tools_schema()
        # Tool calls are always resolved with a non-streaming request first;
        # streaming, when requested, only applies to the final generation below.
resp = await llm.chat(messages, tools=chat_tools, stream=False, model=target_model)
choices = resp.get("choices") or []
if not choices:
raise HTTPException(status_code=502, detail="llm_unavailable")
msg = choices[0].get("message") or {}
tool_calls = msg.get("tool_calls") or []
if tool_calls:
messages.append(msg)
for tc in tool_calls:
fn = tc.get("function") or {}
name = fn.get("name")
args_str = fn.get("arguments") or "{}"
                try:
                    args = json.loads(args_str)
                except (json.JSONDecodeError, TypeError):
                    args = {}
tool_result = {"error": "unknown_tool"}
uname = _get_username(user)
if name == "web_search":
tool_result = await tool_web_search(args.get("query"), args.get("max_results", 5))
elif name == "start_cluster":
tool_result = await tool_start_cluster(db, uname, args.get("cluster_uuid"))
elif name == "stop_cluster":
tool_result = await tool_stop_cluster(db, uname, args.get("cluster_uuid"))
elif name == "read_log":
tool_result = await tool_read_log(db, uname, args.get("node"), args.get("path"), int(args.get("lines", 200)), args.get("pattern"), args.get("sshUser"))
elif name == "read_cluster_log":
tool_result = await tool_read_cluster_log(
db,
uname,
args.get("cluster_uuid"),
args.get("log_type"),
args.get("node_hostname"),
int(args.get("lines", 100))
)
elif name == "detect_cluster_faults":
tool_result = await tool_detect_cluster_faults(
db,
uname,
args.get("cluster_uuid"),
args.get("components"),
args.get("node_hostname"),
int(args.get("lines", 200)),
)
elif name == "run_cluster_command":
tool_result = await tool_run_cluster_command(
db,
uname,
args.get("cluster_uuid"),
args.get("command_key"),
args.get("target"),
args.get("node_hostname"),
int(args.get("timeout", 30)),
int(args.get("limit_nodes", 20)),
)
messages.append({
"role": "tool",
"tool_call_id": tc.get("id"),
"name": name,
"content": json.dumps(tool_result, ensure_ascii=False)
})
if req.stream:
return await handle_streaming_chat(llm, messages, internal_id, db, tools=chat_tools, model=target_model)
else:
resp = await llm.chat(messages, tools=chat_tools, stream=False, model=target_model)
choices = resp.get("choices") or []
if not choices:
raise HTTPException(status_code=502, detail="llm_unavailable_after_tool")
msg = choices[0].get("message") or {}
else:
if req.stream:
return await handle_streaming_chat(llm, messages, internal_id, db, tools=chat_tools, model=target_model)
reply = msg.get("content") or ""
reasoning = msg.get("reasoning_content") or ""
asst_msg = ChatMessage(session_id=internal_id, role="assistant", content=reply)
db.add(asst_msg)
await db.commit()
return {"reply": reply, "reasoning": reasoning}
except HTTPException:
raise
except Exception as e:
print(f"AI Chat Error: {str(e)}")
raise HTTPException(status_code=500, detail=f"server_error: {str(e)}")
async def handle_streaming_chat(llm: LLMClient, messages: list, session_id: str, db: AsyncSession, tools=None, model: str | None = None):
async def event_generator():
full_reply = ""
full_reasoning = ""
try:
stream_gen = await llm.chat(messages, tools=tools, stream=True, model=model)
async for chunk in stream_gen:
choices = chunk.get("choices") or []
if not choices:
continue
delta = choices[0].get("delta") or {}
content = delta.get("content") or ""
reasoning = delta.get("reasoning_content") or ""
if content:
full_reply += content
if reasoning:
full_reasoning += reasoning
yield f"data: {json.dumps({'content': content, 'reasoning': reasoning}, ensure_ascii=False)}\n\n"
finally:
try:
if full_reply:
asst_msg = ChatMessage(session_id=session_id, role="assistant", content=full_reply)
db.add(asst_msg)
await db.commit()
except Exception as e:
print(f"Error saving stream to DB: {e}")
return StreamingResponse(
event_generator(),
media_type="text/event-stream",
headers={
"Cache-Control": "no-cache",
"Connection": "keep-alive",
"X-Accel-Buffering": "no"
}
)
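The `event_generator` above emits one JSON payload per Server-Sent Events frame. A minimal sketch of just that framing step, assuming the same `{content, reasoning}` payload shape used in the endpoint:

```python
import json

def sse_event(content: str, reasoning: str = "") -> str:
    # One SSE frame: a "data:" line followed by a blank line,
    # matching the yield inside event_generator above.
    payload = json.dumps({"content": content, "reasoning": reasoning}, ensure_ascii=False)
    return f"data: {payload}\n\n"

print(sse_event("diagnosis done"))
```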

@ -1,212 +0,0 @@
from fastapi import APIRouter, Depends, HTTPException
from pydantic import BaseModel
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select, update, func, text
from ..db import get_db
from ..models.users import User
from passlib.hash import bcrypt
from ..config import JWT_SECRET, JWT_EXPIRE_MINUTES
import jwt
from datetime import datetime, timedelta, timezone
import re
from ..config import now_bj
router = APIRouter()
class LoginRequest(BaseModel):
username: str
password: str
class RegisterRequest(BaseModel):
username: str
email: str
password: str
fullName: str
async def _get_user_id(db: AsyncSession, username: str) -> int | None:
res = await db.execute(text("SELECT id FROM users WHERE username=:u LIMIT 1"), {"u": username})
row = res.first()
return row[0] if row else None
async def _get_role_id(db: AsyncSession, role_key: str) -> int | None:
res = await db.execute(text("SELECT id FROM roles WHERE role_key=:k LIMIT 1"), {"k": role_key})
row = res.first()
return row[0] if row else None
async def _ensure_observer_role(db: AsyncSession) -> int:
rid = await _get_role_id(db, "observer")
if rid is not None:
return rid
await db.execute(
text(
"INSERT INTO roles(role_name, role_key, description, is_system_role, created_at, updated_at) VALUES(:rn, :rk, :desc, TRUE, NOW(), NOW())"
),
{"rn": "观察员", "rk": "observer", "desc": "系统默认观察员角色"},
)
await db.commit()
rid2 = await _get_role_id(db, "observer")
if rid2 is None:
raise HTTPException(status_code=500, detail="role_init_failed")
return rid2
async def _map_user_role(db: AsyncSession, username: str, role_key: str) -> None:
uid = await _get_user_id(db, username)
if uid is None:
raise HTTPException(status_code=500, detail="user_not_found_after_register")
rid = await _get_role_id(db, role_key)
if rid is None:
if role_key == "observer":
rid = await _ensure_observer_role(db)
else:
raise HTTPException(status_code=400, detail="role_not_exist")
await db.execute(text("DELETE FROM user_role_mapping WHERE user_id=:uid"), {"uid": uid})
await db.execute(text("INSERT INTO user_role_mapping(user_id, role_id) VALUES(:uid, :rid)"), {"uid": uid, "rid": rid})
await db.commit()
async def _get_user_roles(db: AsyncSession, user_id: int) -> list[str]:
res = await db.execute(
text("SELECT r.role_key FROM roles r JOIN user_role_mapping urm ON r.id = urm.role_id WHERE urm.user_id = :uid"),
{"uid": user_id},
)
return [row[0] for row in res.all()]
async def _get_role_permissions(db: AsyncSession, role_keys: list[str]) -> list[str]:
if not role_keys:
return []
res = await db.execute(
text("""
SELECT DISTINCT p.permission_key
FROM permissions p
JOIN role_permission_mapping rpm ON p.id = rpm.permission_id
JOIN roles r ON rpm.role_id = r.id
WHERE r.role_key = ANY(:keys)
"""),
{"keys": role_keys},
)
return [row[0] for row in res.all()]
@router.post("/user/login")
async def login(req: LoginRequest, db: AsyncSession = Depends(get_db)):
demo = {"admin": "admin123", "ops": "ops123", "obs": "obs123"}
if req.username in demo and req.password == demo[req.username]:
exp = now_bj() + timedelta(minutes=JWT_EXPIRE_MINUTES)
token = jwt.encode({"sub": req.username, "exp": exp}, JWT_SECRET, algorithm="HS256")
        # Fetch roles and permissions for the demo account
uid = await _get_user_id(db, req.username)
roles = await _get_user_roles(db, uid) if uid else []
if not roles:
            # No mapping in the DB: fall back to built-in defaults
role_map = {"admin": ["admin"], "ops": ["operator"], "obs": ["observer"]}
roles = role_map.get(req.username, [])
permissions = await _get_role_permissions(db, roles)
return {
"ok": True,
"username": req.username,
"fullName": req.username,
"token": token,
"roles": roles,
"permissions": permissions
}
try:
result = await db.execute(select(User).where(User.username == req.username).limit(1))
user = result.scalars().first()
if not user:
raise HTTPException(status_code=401, detail="invalid_credentials")
if not user.is_active:
raise HTTPException(status_code=403, detail="inactive_user")
if not bcrypt.verify(req.password, user.password_hash):
raise HTTPException(status_code=401, detail="invalid_credentials")
await db.execute(
update(User).where(User.id == user.id).values(last_login=func.now(), updated_at=func.now())
)
await db.commit()
        # Fetch the user's roles and permissions
roles = await _get_user_roles(db, user.id)
permissions = await _get_role_permissions(db, roles)
exp = now_bj() + timedelta(minutes=JWT_EXPIRE_MINUTES)
token = jwt.encode({"sub": user.username, "exp": exp}, JWT_SECRET, algorithm="HS256")
return {
"ok": True,
"username": user.username,
"fullName": user.full_name,
"token": token,
"roles": roles,
"permissions": permissions
}
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")
@router.post("/user/register")
async def register(req: RegisterRequest, db: AsyncSession = Depends(get_db)):
try:
        errors: list[dict] = []
        # Username: 3-50 chars, must start with a letter; letters, digits, underscore only
        if not req.username or not (3 <= len(req.username) <= 50):
            errors.append({"field": "username", "code": "invalid_username", "message": "用户名长度需在3-50之间"})
        elif not re.fullmatch(r"^[A-Za-z][A-Za-z0-9_]*$", req.username):
            errors.append({"field": "username", "code": "invalid_username", "message": "用户名需以字母开头,仅支持字母、数字和下划线"})
        # Email format check
        if not req.email or not re.fullmatch(r"^[^@\s]+@[^@\s]+\.[^@\s]+", req.email):
            errors.append({"field": "email", "code": "invalid_email", "message": "邮箱格式不正确"})
        # Password: the frontend enforces >= 6 chars; the backend additionally
        # recommends >= 8 chars with upper/lower case letters and a digit.
        if not req.password or len(req.password) < 6:
            errors.append({"field": "password", "code": "weak_password", "message": "密码长度至少为6位"})
        elif len(req.password) < 8 or not re.search(r"[A-Z]", req.password) or not re.search(r"[a-z]", req.password) or not re.search(r"\d", req.password):
            errors.append({"field": "password", "code": "weak_password", "message": "密码建议至少8位且包含大小写字母与数字"})
        # Full name length check
        if not req.fullName or not (2 <= len(req.fullName) <= 100):
            errors.append({"field": "fullName", "code": "invalid_full_name", "message": "姓名长度需在2-100之间"})
        if errors:
            raise HTTPException(status_code=400, detail={"errors": errors, "message": errors[0]["message"]})
        # Uniqueness checks
exists_username = await db.execute(select(User).where(User.username == req.username).limit(1))
if exists_username.scalars().first():
raise HTTPException(status_code=400, detail={"message": "该用户名已被注册", "code": "user_exists"})
exists_email = await db.execute(select(User.id).where(User.email == req.email).limit(1))
if exists_email.scalars().first():
raise HTTPException(status_code=400, detail={"message": "该邮箱已被绑定", "code": "email_exists"})
password_hash = bcrypt.hash(req.password)
user = User(
username=req.username,
email=req.email,
password_hash=password_hash,
full_name=req.fullName,
is_active=True,
last_login=None,
created_at=now_bj(),
updated_at=now_bj(),
)
db.add(user)
await db.flush()
await db.commit()
await _map_user_role(db, req.username, "observer")
permissions = await _get_role_permissions(db, ["observer"])
exp = now_bj() + timedelta(minutes=JWT_EXPIRE_MINUTES)
token = jwt.encode({"sub": user.username, "exp": exp}, JWT_SECRET, algorithm="HS256")
return {
"ok": True,
"username": user.username,
"fullName": user.full_name,
"token": token,
"roles": ["observer"],
"permissions": permissions
}
except HTTPException:
raise
except Exception as e:
print(f"DEBUG: Database error: {str(e)}")
raise HTTPException(status_code=500, detail=f"server_error: {str(e)}")
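The password checks in `register()` combine a hard minimum with a softer complexity recommendation. A self-contained sketch of that policy in isolation (the error codes here are illustrative placeholders, not the API's actual codes):

```python
import re

def check_password(pw: str) -> list[str]:
    # Hard rule: at least 6 characters (matches the frontend requirement).
    # Soft rule: at least 8 characters with upper, lower, and a digit.
    errors = []
    if not pw or len(pw) < 6:
        errors.append("too_short")
    elif (len(pw) < 8
          or not re.search(r"[A-Z]", pw)
          or not re.search(r"[a-z]", pw)
          or not re.search(r"\d", pw)):
        errors.append("weak")
    return errors

print(check_password("abc"))       # hard failure
print(check_password("abcdef"))    # length ok, complexity missing
print(check_password("Abcdef12"))  # passes both rules
```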

@ -1,236 +0,0 @@
from fastapi import APIRouter, Depends, HTTPException
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select, delete, func, text
from ..db import get_db
from ..models.clusters import Cluster
from ..models.nodes import Node
from ..deps.auth import get_current_user, PermissionChecker
from ..services.ssh_probe import check_ssh_connectivity, get_hdfs_cluster_id
from pydantic import BaseModel
from datetime import datetime, timezone
import uuid as uuidlib
from ..config import now_bj
router = APIRouter()
def _get_username(u) -> str:
return getattr(u, "username", None) or (u.get("username") if isinstance(u, dict) else None)
class NodeCreateItem(BaseModel):
hostname: str
ip_address: str
ssh_user: str
ssh_password: str
description: str | None = None
class ClusterCreateRequest(BaseModel):
name: str
type: str
node_count: int
health_status: str
cpu_avg: float | None = None
memory_avg: float | None = None
description: str | None = None
namenode_ip: str | None = None
namenode_psw: str | None = None
rm_ip: str | None = None
rm_psw: str | None = None
nodes: list[NodeCreateItem]
@router.get("/clusters")
async def list_clusters(user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
    """Return the clusters the current user is mapped to."""
try:
name = _get_username(user)
uid_res = await db.execute(text("SELECT id FROM users WHERE username=:un LIMIT 1"), {"un": name})
uid_row = uid_res.first()
if not uid_row:
return {"clusters": []}
ids_res = await db.execute(text("SELECT cluster_id FROM user_cluster_mapping WHERE user_id=:uid"), {"uid": uid_row[0]})
cluster_ids = [r[0] for r in ids_res.all()]
if not cluster_ids:
return {"clusters": []}
result = await db.execute(select(Cluster).where(Cluster.id.in_(cluster_ids)))
rows = result.scalars().all()
data = []
for c in rows:
data.append({
"uuid": str(c.uuid),
"name": c.name,
"type": c.type,
"node_count": c.node_count,
"health_status": c.health_status,
"cpu_avg": c.cpu_avg,
"memory_avg": c.memory_avg,
"namenode_ip": (str(c.namenode_ip) if c.namenode_ip else None),
"namenode_psw": c.namenode_psw,
"rm_ip": (str(c.rm_ip) if c.rm_ip else None),
"rm_psw": c.rm_psw,
"description": c.description,
})
return {"clusters": data}
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")
@router.post("/clusters")
async def create_cluster(
req: ClusterCreateRequest,
user=Depends(PermissionChecker(["cluster:register"])),
db: AsyncSession = Depends(get_db)
):
    """Register a cluster and map it to the current user."""
try:
name = _get_username(user)
        # PermissionChecker already enforces authorization; no hardcoded role check needed
        # Parameter validation: cluster type and health status
valid_types = {"hadoop", "spark", "kubernetes"}
valid_health = {"healthy", "warning", "error", "unknown"}
        errors: list[dict] = []
        if req.type not in valid_types:
            errors.append({"field": "type", "message": "类型不合法,应为 hadoop/spark/kubernetes", "step": "参数校验"})
        if req.health_status not in valid_health:
            errors.append({"field": "health_status", "message": "状态不合法,应为 healthy/warning/error/unknown", "step": "参数校验"})
        if req.node_count is None or req.node_count < 0:
            errors.append({"field": "node_count", "message": "节点总数必须为非负整数", "step": "参数校验"})
        if not req.nodes:
            errors.append({"field": "nodes", "message": "至少需要提供一个节点", "step": "参数校验"})
        if errors:
            raise HTTPException(status_code=400, detail={"errors": errors})
        # 1. Fetch the real HDFS cluster UUID from the NameNode
cluster_uuid, err = get_hdfs_cluster_id(str(req.namenode_ip), req.nodes[0].ssh_user, req.nodes[0].ssh_password)
if not cluster_uuid:
raise HTTPException(status_code=400, detail={"errors": [{"field": "namenode_ip", "message": f"无法获取集群ID: {err}"}]})
        # 2. Check whether this UUID is already registered in the database
res = await db.execute(select(Cluster).where(Cluster.uuid == cluster_uuid).limit(1))
existing_cluster = res.scalars().first()
if existing_cluster:
            # Cluster already exists; only attach the user mapping below
c = existing_cluster
new_uuid = cluster_uuid
else:
            # New cluster: run the full registration flow
            # Ensure the cluster name is unique first
name_exists = await db.execute(select(Cluster.id).where(Cluster.name == req.name).limit(1))
if name_exists.scalars().first():
raise HTTPException(status_code=400, detail={"errors": [{"field": "name", "message": "集群名称已存在"}]})
            # SSH connectivity pre-check for every node
ssh_errors: list[dict] = []
for idx, n_req in enumerate(req.nodes):
ip = getattr(n_req, "ip_address", None) or getattr(n_req, "ip", None)
user_ = getattr(n_req, "ssh_user", None)
pwd_ = getattr(n_req, "ssh_password", None)
ok, conn_err = check_ssh_connectivity(str(ip), str(user_ or ""), str(pwd_ or ""))
if not ok:
ssh_errors.append({
"field": f"nodes[{idx}].ssh",
"message": "注册失败SSH不可连接",
"step": "connect",
"detail": conn_err,
"hostname": getattr(n_req, "hostname", None),
"ip": str(ip) if ip is not None else None,
})
if ssh_errors:
raise HTTPException(status_code=400, detail={"errors": ssh_errors})
new_uuid = cluster_uuid
c = Cluster(
uuid=new_uuid,
name=req.name,
type=req.type,
node_count=req.node_count,
health_status=req.health_status,
cpu_avg=req.cpu_avg,
memory_avg=req.memory_avg,
namenode_ip=req.namenode_ip,
namenode_psw=req.namenode_psw,
rm_ip=req.rm_ip,
rm_psw=req.rm_psw,
description=req.description,
config_info={},
created_at=now_bj(),
updated_at=now_bj(),
)
db.add(c)
            await db.flush()  # populate c.id for the node rows below
            # Insert the node rows
for n_req in req.nodes:
node_uuid = str(uuidlib.uuid4())
node = Node(
uuid=node_uuid,
cluster_id=c.id,
hostname=n_req.hostname,
ip_address=n_req.ip_address,
ssh_user=n_req.ssh_user,
ssh_password=n_req.ssh_password,
status="unknown",
created_at=now_bj(),
updated_at=now_bj(),
)
db.add(node)
        # 3. Create the user-cluster mapping (whether the cluster is new or pre-existing)
uid_res = await db.execute(text("SELECT id FROM users WHERE username=:un LIMIT 1"), {"un": name})
uid_row = uid_res.first()
        # Simplified: the admin user gets the admin role; everyone else gets operator
role_key = "admin" if name == "admin" else "operator"
rid_res = await db.execute(text("SELECT id FROM roles WHERE role_key=:rk LIMIT 1"), {"rk": role_key})
rid_row = rid_res.first()
if uid_row and rid_row:
await db.execute(
text("INSERT INTO user_cluster_mapping(user_id, cluster_id, role_id) VALUES (:uid,:cid,:rid) ON CONFLICT (user_id, cluster_id) DO NOTHING"),
{"uid": uid_row[0], "cid": c.id, "rid": rid_row[0]}
)
await db.commit()
return {
"status": "success",
"message": "集群注册成功" if not existing_cluster else "集群已关联至当前用户",
"uuid": new_uuid
}
except HTTPException:
raise
except Exception as e:
import traceback
traceback.print_exc()
raise HTTPException(status_code=500, detail="server_error")
@router.delete("/clusters/{uuid}")
async def delete_cluster(
uuid: str,
user=Depends(PermissionChecker(["cluster:delete"])),
db: AsyncSession = Depends(get_db)
):
    """Deregister a cluster and clean up its user-cluster mappings."""
try:
name = _get_username(user)
        # PermissionChecker already enforces authorization; no hardcoded role check
try:
uo = uuidlib.UUID(uuid)
except Exception:
raise HTTPException(status_code=400, detail={"errors": [{"field": "uuid", "message": "UUID 格式不正确"}]})
res = await db.execute(select(Cluster).where(Cluster.uuid == str(uo)).limit(1))
c = res.scalars().first()
if not c:
return {"ok": True}
await db.execute(delete(Cluster).where(Cluster.id == c.id))
await db.execute(text("DELETE FROM user_cluster_mapping WHERE cluster_id=:cid"), {"cid": c.id})
await db.commit()
return {"ok": True}
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")

@ -1,206 +0,0 @@
from fastapi import APIRouter, Depends, Query, HTTPException
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select, func, delete, update
from ..db import get_db
from ..models.hadoop_logs import HadoopLog
from ..models.clusters import Cluster
from ..deps.auth import get_current_user
from pydantic import BaseModel
from datetime import datetime
import json
from ..config import now_bj, BJ_TZ
router = APIRouter()
def _get_username(u) -> str:
return getattr(u, "username", None) or (u.get("username") if isinstance(u, dict) else None)
def _now():
return now_bj()
def _map_level(level: str) -> str:
lv = (level or "").lower()
if lv in ("critical", "fatal"):
return "FATAL"
if lv == "high":
return "ERROR"
if lv == "medium":
return "WARN"
return "INFO"
class FaultCreate(BaseModel):
id: str | None = None
type: str
level: str
status: str
title: str
cluster: str | None = None
node: str | None = None
created: str | None = None
class FaultUpdate(BaseModel):
status: str | None = None
title: str | None = None
@router.get("/faults")
async def list_faults(
user=Depends(get_current_user),
db: AsyncSession = Depends(get_db),
cluster: str | None = Query(None),
node: str | None = Query(None),
time_from: str | None = Query(None),
page: int = Query(1, ge=1),
size: int = Query(10, ge=1, le=100),
):
try:
stmt = select(HadoopLog).where(HadoopLog.title == "fault")
count_stmt = select(func.count(HadoopLog.log_id)).where(HadoopLog.title == "fault")
if cluster:
stmt = stmt.where(HadoopLog.cluster_name == cluster)
count_stmt = count_stmt.where(HadoopLog.cluster_name == cluster)
if node:
stmt = stmt.where(HadoopLog.node_host == node)
count_stmt = count_stmt.where(HadoopLog.node_host == node)
if time_from:
try:
tf = datetime.fromisoformat(time_from.replace("Z", "+00:00"))
if tf.tzinfo is None:
tf = tf.replace(tzinfo=BJ_TZ)
else:
tf = tf.astimezone(BJ_TZ)
stmt = stmt.where(HadoopLog.log_time >= tf)
count_stmt = count_stmt.where(HadoopLog.log_time >= tf)
except Exception:
pass
stmt = stmt.order_by(HadoopLog.log_time.desc()).offset((page - 1) * size).limit(size)
rows = (await db.execute(stmt)).scalars().all()
total = (await db.execute(count_stmt)).scalar() or 0
items = []
for r in rows:
meta = {}
try:
if r.info:
meta = json.loads(r.info)
except Exception:
pass
items.append({
"id": str(r.log_id),
"type": meta.get("type", "unknown"),
                "level": meta.get("level", r.title),
"status": meta.get("status", "active"),
"title": meta.get("title", r.title),
"cluster": r.cluster_name,
"node": r.node_host,
"created": r.log_time.isoformat() if r.log_time else None
})
return {"items": items, "total": int(total)}
except HTTPException:
raise
except Exception as e:
print(f"Error listing faults: {e}")
raise HTTPException(status_code=500, detail="server_error")
@router.post("/faults")
async def create_fault(req: FaultCreate, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
try:
uname = _get_username(user)
if uname not in {"admin", "ops"}:
raise HTTPException(status_code=403, detail="not_allowed")
        # Resolve the cluster display name
        cluster_name = req.cluster or "unknown"
        if req.cluster and "-" in req.cluster:  # the value may be a cluster UUID
res = await db.execute(select(Cluster.name).where(Cluster.uuid == req.cluster).limit(1))
name = res.scalars().first()
if name:
cluster_name = name
ts = _now()
if req.created:
try:
dt = datetime.fromisoformat(req.created.replace("Z", "+00:00"))
if dt.tzinfo is None:
ts = dt.replace(tzinfo=BJ_TZ)
else:
ts = dt.astimezone(BJ_TZ)
except Exception:
pass
        meta = {"type": req.type, "level": req.level, "status": req.status, "title": req.title, "cluster": req.cluster, "node": req.node}
log = HadoopLog(
cluster_name=cluster_name,
node_host=req.node or "unknown",
title="fault",
info=json.dumps(meta, ensure_ascii=False),
log_time=ts
)
        db.add(log)
        await db.flush()  # assign log_id before the session expires on commit
        new_id = log.log_id
        await db.commit()
        return {"ok": True, "id": new_id}
except HTTPException:
raise
except Exception as e:
print(f"Error creating fault: {e}")
raise HTTPException(status_code=500, detail="server_error")
@router.put("/faults/{fid}")
async def update_fault(fid: int, req: FaultUpdate, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
try:
uname = _get_username(user)
if uname not in {"admin", "ops"}:
raise HTTPException(status_code=403, detail="not_allowed")
res = await db.execute(select(HadoopLog).where(HadoopLog.log_id == fid, HadoopLog.title == "fault").limit(1))
row = res.scalars().first()
if not row:
raise HTTPException(status_code=404, detail="not_found")
meta = {}
try:
if row.info:
meta = json.loads(row.info)
except Exception:
pass
if req.status is not None:
meta["status"] = req.status
if req.title is not None:
meta["title"] = req.title
row.info = json.dumps(meta, ensure_ascii=False)
await db.commit()
return {"ok": True}
except HTTPException:
raise
except Exception as e:
print(f"Error updating fault: {e}")
raise HTTPException(status_code=500, detail="server_error")
@router.delete("/faults/{fid}")
async def delete_fault(fid: int, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
try:
uname = _get_username(user)
if uname not in {"admin", "ops"}:
raise HTTPException(status_code=403, detail="not_allowed")
await db.execute(delete(HadoopLog).where(HadoopLog.log_id == fid, HadoopLog.title == "fault"))
await db.commit()
return {"ok": True}
except HTTPException:
raise
except Exception as e:
print(f"Error deleting fault: {e}")
raise HTTPException(status_code=500, detail="server_error")

@ -1,129 +0,0 @@
from fastapi import APIRouter, Depends, HTTPException
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select, delete, update
from ..db import get_db
from ..models.hadoop_exec_logs import HadoopExecLog
from ..models.users import User
from ..deps.auth import get_current_user
from pydantic import BaseModel
from datetime import datetime, timezone
from ..config import now_bj, BJ_TZ
router = APIRouter()
class ExecLogCreate(BaseModel):
from_user_id: int
cluster_name: str
description: str | None = None
start_time: str | None = None
end_time: str | None = None
class ExecLogUpdate(BaseModel):
description: str | None = None
start_time: str | None = None
end_time: str | None = None
def _now() -> datetime:
return now_bj()
def _parse_time(s: str | None) -> datetime | None:
if not s:
return None
try:
dt = datetime.fromisoformat(s.replace("Z", "+00:00"))
if dt.tzinfo is None:
return dt.replace(tzinfo=BJ_TZ)
return dt.astimezone(BJ_TZ)
except Exception:
return None
@router.get("/exec-logs")
async def list_exec_logs(user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
try:
stmt = (
select(HadoopExecLog, User.username)
.join(User, HadoopExecLog.from_user_id == User.id)
.order_by(HadoopExecLog.start_time.desc())
)
result = await db.execute(stmt)
rows = result.all()
items = []
for log, username in rows:
d = log.to_dict()
d["username"] = username
if "from_user_id" in d:
del d["from_user_id"]
items.append(d)
return {"items": items}
except Exception as e:
print(f"Error listing exec logs: {e}")
raise HTTPException(status_code=500, detail="server_error")
@router.post("/exec-logs")
async def create_exec_log(req: ExecLogCreate, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
try:
st = _parse_time(req.start_time)
et = _parse_time(req.end_time)
row = HadoopExecLog(
from_user_id=req.from_user_id,
cluster_name=req.cluster_name,
description=req.description,
start_time=st,
end_time=et
)
db.add(row)
await db.flush()
await db.commit()
return {"ok": True, "id": row.id}
except HTTPException:
raise
except Exception as e:
print(f"Error creating exec log: {e}")
raise HTTPException(status_code=500, detail="server_error")
@router.put("/exec-logs/{log_id}")
async def update_exec_log(log_id: int, req: ExecLogUpdate, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
try:
st = _parse_time(req.start_time)
et = _parse_time(req.end_time)
values: dict = {}
if req.description is not None:
values["description"] = req.description
if st is not None:
values["start_time"] = st
if et is not None:
values["end_time"] = et
if not values:
return {"ok": True}
await db.execute(update(HadoopExecLog).where(HadoopExecLog.id == log_id).values(**values))
await db.commit()
return {"ok": True}
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")
@router.delete("/exec-logs/{log_id}")
async def delete_exec_log(log_id: int, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
try:
await db.execute(delete(HadoopExecLog).where(HadoopExecLog.id == log_id))
await db.commit()
return {"ok": True}
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")
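`update_exec_log` only writes the fields that were actually supplied; note that with this pattern a timestamp can never be reset to NULL through the API. The dict-building step in isolation:

```python
def build_update_values(description=None, start_time=None, end_time=None) -> dict:
    # Only explicitly provided fields reach the UPDATE statement;
    # an empty dict means the endpoint is a no-op.
    values = {}
    if description is not None:
        values["description"] = description
    if start_time is not None:
        values["start_time"] = start_time
    if end_time is not None:
        values["end_time"] = end_time
    return values

print(build_update_values(description="rerun failed job"))
print(build_update_values())
```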

@ -1,459 +0,0 @@
from fastapi import APIRouter, Depends, HTTPException, Query
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select, func, or_, text
from ..db import get_db
from ..deps.auth import get_current_user
from ..log_reader import log_reader
from ..log_collector import log_collector
from ..ssh_utils import ssh_manager
from ..models.nodes import Node
from ..models.clusters import Cluster
from ..metrics_collector import metrics_collector
from ..models.hadoop_logs import HadoopLog
from datetime import datetime, timezone
import time
from ..models.node_metrics import NodeMetric
from ..models.cluster_metrics import ClusterMetric
from datetime import timedelta
from ..config import now_bj
from ..config import BJ_TZ
from zoneinfo import ZoneInfo
from ..schemas import (
LogRequest,
LogResponse,
MultiLogResponse,
NodeListResponse,
LogFilesResponse
)
router = APIRouter()
async def _ensure_metrics_schema(db: AsyncSession):
await db.execute(text("""
CREATE TABLE IF NOT EXISTS node_metrics (
id SERIAL PRIMARY KEY,
cluster_id INTEGER,
node_id INTEGER,
hostname VARCHAR(100),
cpu_usage DOUBLE PRECISION,
memory_usage DOUBLE PRECISION,
created_at TIMESTAMPTZ
)
"""))
await db.execute(text("""
CREATE TABLE IF NOT EXISTS cluster_metrics (
id SERIAL PRIMARY KEY,
cluster_id INTEGER,
cluster_name VARCHAR(100),
cpu_avg DOUBLE PRECISION,
memory_avg DOUBLE PRECISION,
created_at TIMESTAMPTZ
)
"""))
await db.execute(text("ALTER TABLE node_metrics ADD COLUMN IF NOT EXISTS node_id INTEGER"))
await db.execute(text("ALTER TABLE node_metrics ADD COLUMN IF NOT EXISTS hostname VARCHAR(100)"))
await db.execute(text("ALTER TABLE node_metrics ADD COLUMN IF NOT EXISTS cpu_usage DOUBLE PRECISION"))
await db.execute(text("ALTER TABLE node_metrics ADD COLUMN IF NOT EXISTS memory_usage DOUBLE PRECISION"))
await db.execute(text("ALTER TABLE node_metrics ADD COLUMN IF NOT EXISTS created_at TIMESTAMPTZ"))
await db.execute(text("ALTER TABLE node_metrics ADD COLUMN IF NOT EXISTS cluster_id INTEGER"))
await db.execute(text("ALTER TABLE cluster_metrics ADD COLUMN IF NOT EXISTS cluster_name VARCHAR(100)"))
await db.execute(text("ALTER TABLE cluster_metrics ADD COLUMN IF NOT EXISTS cpu_avg DOUBLE PRECISION"))
await db.execute(text("ALTER TABLE cluster_metrics ADD COLUMN IF NOT EXISTS memory_avg DOUBLE PRECISION"))
await db.execute(text("ALTER TABLE cluster_metrics ADD COLUMN IF NOT EXISTS created_at TIMESTAMPTZ"))
await db.execute(text("ALTER TABLE cluster_metrics ADD COLUMN IF NOT EXISTS cluster_id INTEGER"))
await db.commit()
def _parse_time(s: str | None) -> datetime | None:
if not s:
return None
try:
dt = datetime.fromisoformat(s.replace("Z", "+00:00"))
if dt.tzinfo is None:
return dt.replace(tzinfo=BJ_TZ)
return dt.astimezone(BJ_TZ)
except Exception:
return None
@router.get("/logs")
async def list_logs(
user=Depends(get_current_user),
db: AsyncSession = Depends(get_db),
cluster: str | None = Query(None),
node: str | None = Query(None),
source: str | None = Query(None),
time_from: str | None = Query(None),
page: int = Query(1, ge=1),
size: int = Query(10, ge=1, le=100),
):
try:
stmt = select(HadoopLog)
count_stmt = select(func.count(HadoopLog.log_id))
filters = []
if cluster:
filters.append(HadoopLog.cluster_name == cluster)
if node:
filters.append(HadoopLog.node_host == node)
if source:
like = f"%{source}%"
filters.append(or_(HadoopLog.title.ilike(like), HadoopLog.info.ilike(like), HadoopLog.node_host.ilike(like)))
tf = _parse_time(time_from)
if tf:
filters.append(HadoopLog.log_time >= tf)
for f in filters:
stmt = stmt.where(f)
count_stmt = count_stmt.where(f)
stmt = stmt.order_by(HadoopLog.log_time.desc()).offset((page - 1) * size).limit(size)
rows = (await db.execute(stmt)).scalars().all()
total = (await db.execute(count_stmt)).scalar() or 0
items = [
{
"id": r.log_id,
"time": r.log_time.isoformat() if r.log_time else None,
"cluster": r.cluster_name,
"node": r.node_host,
"title": r.title,
"info": r.info,
}
for r in rows
]
return {"items": items, "total": int(total)}
except HTTPException:
raise
except Exception as e:
print(f"Error listing logs: {e}")
raise HTTPException(status_code=500, detail="server_error")
async def get_node_ip(db: AsyncSession, node_name: str) -> str:
result = await db.execute(select(Node.ip_address).where(Node.hostname == node_name))
ip = result.scalar_one_or_none()
if not ip:
raise HTTPException(status_code=404, detail=f"Node {node_name} not found")
return str(ip)
@router.get("/hadoop/nodes/")
async def get_hadoop_nodes(user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
"""Get list of all Hadoop nodes"""
# Assuming all nodes in DB are relevant, or filter by Cluster type if needed
stmt = select(Node.hostname).join(Cluster)
# Optional: .where(Cluster.type.ilike('%hadoop%'))
result = await db.execute(stmt)
nodes = result.scalars().all()
return NodeListResponse(nodes=nodes)
@router.get("/hadoop/logs/{node_name}/{log_type}/", response_model=LogResponse)
async def get_hadoop_log(node_name: str, log_type: str, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
"""Get log from a specific Hadoop node"""
ip = await get_node_ip(db, node_name)
try:
# Read log content
log_content = log_reader.read_log(node_name, log_type, ip=ip)
return LogResponse(
node_name=node_name,
log_type=log_type,
log_content=log_content
)
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@router.get("/hadoop/logs/all/{log_type}/", response_model=MultiLogResponse)
async def get_all_hadoop_nodes_log(log_type: str, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
"""Get logs from all Hadoop nodes"""
stmt = select(Node.hostname, Node.ip_address).join(Cluster)
result = await db.execute(stmt)
nodes_data = result.all()
nodes_list = [{"name": n[0], "ip": str(n[1])} for n in nodes_data]
try:
# Read logs from all nodes
logs = log_reader.read_all_nodes_log(nodes_list, log_type)
return MultiLogResponse(logs=logs)
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@router.get("/hadoop/logs/files/{node_name}/", response_model=LogFilesResponse)
async def get_hadoop_log_files(node_name: str, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
"""Get list of log files on a specific Hadoop node"""
ip = await get_node_ip(db, node_name)
try:
# Get log files list
log_files = log_reader.get_log_files_list(node_name, ip=ip)
return LogFilesResponse(
node_name=node_name,
log_files=log_files
)
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
# Log collection management endpoints
@router.get("/hadoop/collectors/status/")
async def get_hadoop_collectors_status(user=Depends(get_current_user)):
"""Get status of all Hadoop log collectors"""
status = log_collector.get_collectors_status()
return {
"collectors": status,
"total_running": sum(status.values())
}
@router.post("/hadoop/collectors/start/{node_name}/{log_type}/")
async def start_hadoop_collector(node_name: str, log_type: str, interval: int = 5, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
"""Start log collection for a specific Hadoop node and log type"""
ip = await get_node_ip(db, node_name)
try:
log_collector.start_collection(node_name, log_type, ip=ip, interval=interval)
return {
"message": f"Started log collection for {node_name}_{log_type}",
"interval": interval
}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@router.post("/hadoop/collectors/stop/{node_name}/{log_type}/")
async def stop_hadoop_collector(node_name: str, log_type: str, user=Depends(get_current_user)):
"""Stop log collection for a specific Hadoop node and log type"""
# stop doesn't need IP as it just stops the thread by ID
try:
log_collector.stop_collection(node_name, log_type)
return {
"message": f"Stopped log collection for {node_name}_{log_type}"
}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@router.post("/hadoop/collectors/stop/all/")
async def stop_all_hadoop_collectors(user=Depends(get_current_user)):
"""Stop all Hadoop log collectors"""
try:
log_collector.stop_all_collections()
return {
"message": "Stopped all log collectors"
}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@router.post("/hadoop/collectors/set-interval/{interval}/")
async def set_hadoop_collection_interval(interval: int, user=Depends(get_current_user)):
"""Set collection interval for all Hadoop collectors"""
try:
log_collector.set_collection_interval(interval)
return {
"message": f"Set collection interval to {interval} seconds"
}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@router.post("/hadoop/collectors/set-log-dir/{log_dir}/")
async def set_hadoop_log_directory(log_dir: str, user=Depends(get_current_user)):
"""Set log directory for all Hadoop collectors"""
try:
log_collector.set_log_dir(log_dir)
return {
"message": f"Set log directory to {log_dir}"
}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@router.post("/hadoop/nodes/{node_name}/execute/")
async def execute_hadoop_command(node_name: str, command: str, timeout: int = 30, user=Depends(get_current_user)):
"""Execute a command on a specific Hadoop node"""
try:
        # select and Node are already imported at module level; only
        # SessionLocal is needed here to open a session on demand.
        from ..db import SessionLocal
async with SessionLocal() as db:
res = await db.execute(select(Node.ip_address).where(Node.hostname == node_name).limit(1))
ip = res.scalar_one_or_none()
if not ip:
raise HTTPException(status_code=404, detail=f"Node {node_name} not found")
ssh_client = ssh_manager.get_connection(node_name, ip=str(ip))
# Execute command with timeout
stdout, stderr = ssh_client.execute_command_with_timeout(command, timeout)
return {
"node_name": node_name,
"command": command,
"stdout": stdout,
"stderr": stderr,
"status": "success" if not stderr else "error"
}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@router.post("/hadoop/collectors/start-by-cluster/{cluster_uuid}/")
async def start_collectors_by_cluster(cluster_uuid: str, interval: int = 5, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
"""Start log collection for all nodes of the cluster (by UUID), only for existing services"""
try:
cid_res = await db.execute(select(Cluster.id).where(Cluster.uuid == cluster_uuid).limit(1))
cid = cid_res.scalar_one_or_none()
if cid is None:
raise HTTPException(status_code=404, detail="cluster_not_found")
nodes_res = await db.execute(select(Node.hostname, Node.ip_address).where(Node.cluster_id == cid))
rows = nodes_res.all()
if not rows:
return {"started": 0, "nodes": []}
started = []
for hn, ip in rows:
ip_s = str(ip)
files = []
try:
log_reader.find_working_log_dir(hn, ip_s)
files = log_reader.get_log_files_list(hn, ip=ip_s)
except Exception:
files = []
services = []
for fn in files:
f = fn.lower()
if "namenode" in f:
services.append("namenode")
elif "secondarynamenode" in f:
services.append("secondarynamenode")
elif "datanode" in f:
services.append("datanode")
elif "resourcemanager" in f:
services.append("resourcemanager")
elif "nodemanager" in f:
services.append("nodemanager")
elif "historyserver" in f:
services.append("historyserver")
services = list(set(services))
for t in services:
ok = False
try:
ok = log_collector.start_collection(hn, t, ip=ip_s, interval=interval)
except Exception:
ok = False
if ok:
started.append(f"{hn}_{t}")
return {"started": len(started), "nodes": started, "interval": interval}
except HTTPException:
raise
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@router.post("/hadoop/collectors/backfill-by-cluster/{cluster_uuid}/")
async def backfill_logs_by_cluster(cluster_uuid: str, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
try:
cid_res = await db.execute(select(Cluster.id).where(Cluster.uuid == cluster_uuid).limit(1))
cid = cid_res.scalar_one_or_none()
if cid is None:
raise HTTPException(status_code=404, detail="cluster_not_found")
nodes_res = await db.execute(select(Node.hostname, Node.ip_address).where(Node.cluster_id == cid))
rows = nodes_res.all()
if not rows:
return {"backfilled": 0, "details": []}
details = []
for hn, ip in rows:
ip_s = str(ip)
ssh_client = ssh_manager.get_connection(hn, ip=ip_s)
candidates = [
"/opt/module/hadoop-3.1.3/logs",
"/usr/local/hadoop/logs",
"/usr/local/hadoop-3.3.6/logs",
"/usr/local/hadoop-3.3.5/logs",
"/usr/local/hadoop-3.1.3/logs",
"/opt/hadoop/logs",
"/var/log/hadoop",
]
base = None
for d in candidates:
out, err = ssh_client.execute_command(f"ls -1 {d} 2>/dev/null")
if not err and out.strip():
base = d
break
services = []
count = 0
if base:
out, err = ssh_client.execute_command(f"ls -1 {base} 2>/dev/null")
if not err and out.strip():
for fn in out.splitlines():
f = fn.lower()
t = None
if "namenode" in f:
t = "namenode"
elif "secondarynamenode" in f:
t = "secondarynamenode"
elif "datanode" in f:
t = "datanode"
elif "resourcemanager" in f:
t = "resourcemanager"
elif "nodemanager" in f:
t = "nodemanager"
elif "historyserver" in f:
t = "historyserver"
if t:
services.append(t)
out2, err2 = ssh_client.execute_command(f"cat {base}/{fn} 2>/dev/null")
if not err2 and out2:
log_collector._save_log_chunk(hn, t, out2)
count += out2.count("\n")
details.append({"node": hn, "services": list(set(services)), "lines": count})
total_lines = sum(d["lines"] for d in details)
return {"backfilled": total_lines, "details": details}
except HTTPException:
raise
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@router.post("/metrics/{cluster_uuid}/")
async def sync_metrics(cluster_uuid: str, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
try:
try:
metrics_collector.stop_all()
except Exception:
pass
cid_res = await db.execute(select(Cluster.id, Cluster.name).where(Cluster.uuid == cluster_uuid).limit(1))
row = cid_res.first()
if not row:
raise HTTPException(status_code=404, detail="cluster_not_found")
cid, cname = row
nodes_res = await db.execute(select(Node.id, Node.hostname, Node.ip_address).where(Node.cluster_id == cid))
rows = nodes_res.all()
now = now_bj()
details = []
for nid, hn, ip in rows:
ssh_client = ssh_manager.get_connection(hn, ip=str(ip))
out1, err1 = ssh_client.execute_command("cat /proc/stat | head -n 1")
time.sleep(0.5)
out2, err2 = ssh_client.execute_command("cat /proc/stat | head -n 1")
cpu_pct = 0.0
if not err1 and not err2 and out1.strip() and out2.strip():
p1 = out1.strip().split()
p2 = out2.strip().split()
v1 = [int(x) for x in p1[1:]]
v2 = [int(x) for x in p2[1:]]
get1 = lambda i: (v1[i] if i < len(v1) else 0)
get2 = lambda i: (v2[i] if i < len(v2) else 0)
idle = (get2(3) + get2(4)) - (get1(3) + get1(4))
total = (get2(0) - get1(0)) + (get2(1) - get1(1)) + (get2(2) - get1(2)) + idle + (get2(5) - get1(5)) + (get2(6) - get1(6)) + (get2(7) - get1(7))
if total > 0:
cpu_pct = round((1.0 - idle / total) * 100.0, 2)
outm, errm = ssh_client.execute_command("cat /proc/meminfo")
mem_pct = 0.0
if not errm and outm.strip():
mt = 0
ma = 0
for line in outm.splitlines():
if line.startswith("MemTotal:"):
mt = int(line.split()[1])
elif line.startswith("MemAvailable:"):
ma = int(line.split()[1])
if mt > 0:
mem_pct = round((1.0 - (ma / mt)) * 100.0, 2)
details.append({"node": hn, "cpu": cpu_pct, "memory": mem_pct})
if details:
ca = round(sum(d["cpu"] for d in details) / len(details), 3)
ma = round(sum(d["memory"] for d in details) / len(details), 3)
else:
ca = 0.0
ma = 0.0
return {"cluster": {"cpu_avg": round(ca, 2), "memory_avg": round(ma, 2), "time": now.isoformat(), "cluster_name": cname}, "nodes": details}
except HTTPException:
raise
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))

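The `sync_metrics` endpoint above derives CPU and memory percentages from two `/proc/stat` samples and one `/proc/meminfo` read over SSH. A minimal sketch of that arithmetic as pure functions (hypothetical helpers, no SSH), assuming the standard `/proc` field layout:

```python
def cpu_percent(stat1: str, stat2: str) -> float:
    """CPU utilization between two 'cpu ...' lines from /proc/stat.
    Fields after 'cpu': user nice system idle iowait irq softirq steal.
    Idle time is idle+iowait; everything else counts as busy."""
    v1 = [int(x) for x in stat1.split()[1:]]
    v2 = [int(x) for x in stat2.split()[1:]]

    def delta(i: int) -> int:
        a = v1[i] if i < len(v1) else 0
        b = v2[i] if i < len(v2) else 0
        return b - a

    idle = delta(3) + delta(4)
    total = idle + sum(delta(i) for i in (0, 1, 2, 5, 6, 7))
    return round((1.0 - idle / total) * 100.0, 2) if total > 0 else 0.0

def mem_percent(meminfo: str) -> float:
    """Memory utilization from /proc/meminfo: 1 - MemAvailable/MemTotal."""
    total = avail = 0
    for line in meminfo.splitlines():
        if line.startswith("MemTotal:"):
            total = int(line.split()[1])
        elif line.startswith("MemAvailable:"):
            avail = int(line.split()[1])
    return round((1.0 - avail / total) * 100.0, 2) if total > 0 else 0.0
```

The half-second sleep between the two `/proc/stat` reads in the endpoint exists only to produce a non-zero delta window for `cpu_percent`.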
@ -1,16 +0,0 @@
from fastapi import APIRouter, Depends
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import text
from ..db import get_db
router = APIRouter()
@router.get("/health")
async def health_check(db: AsyncSession = Depends(get_db)):
"""健康检查,包括数据库连接验证。"""
try:
        # Run a trivial query to verify the database connection
await db.execute(text("SELECT 1"))
return {"status": "ok", "database": "connected"}
except Exception as e:
return {"status": "ok", "database": f"disconnected: {str(e)}"}

@ -1,213 +0,0 @@
from fastapi import APIRouter, Depends, HTTPException, Query
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select, text
from ..db import get_db
from ..deps.auth import get_current_user
from ..metrics_collector import metrics_collector
from ..models.nodes import Node
from ..models.clusters import Cluster
from datetime import datetime, timezone
router = APIRouter()
def _get_username(u) -> str | None:
return getattr(u, "username", None) or (u.get("username") if isinstance(u, dict) else None)
async def _ensure_access(db: AsyncSession, username: str, cluster_uuid: str) -> int | None:
uid_res = await db.execute(text("SELECT id FROM users WHERE username=:un LIMIT 1"), {"un": username})
uid_row = uid_res.first()
if not uid_row:
return None
cid_res = await db.execute(select(Cluster.id).where(Cluster.uuid == cluster_uuid).limit(1))
cid = cid_res.scalars().first()
if not cid:
return None
auth_res = await db.execute(text("SELECT 1 FROM user_cluster_mapping WHERE user_id=:uid AND cluster_id=:cid LIMIT 1"), {"uid": uid_row[0], "cid": cid})
if not auth_res.first():
return None
return cid
@router.post("/metrics/collectors/start-by-cluster/{cluster_uuid}")
async def start_collectors_by_cluster(
cluster_uuid: str,
interval: int = Query(5, ge=1, le=3600),
user=Depends(get_current_user),
db: AsyncSession = Depends(get_db),
):
try:
name = _get_username(user)
cid = await _ensure_access(db, name, cluster_uuid)
if not cid:
raise HTTPException(status_code=403, detail="not_allowed")
res = await db.execute(select(Node.id, Node.hostname, Node.ip_address).where(Node.cluster_id == cid))
rows = res.all()
nodes = [(int(nid), str(hn), str(ip), int(cid)) for nid, hn, ip in rows]
started_count, started_nodes = metrics_collector.start_for_nodes(nodes, interval=interval)
return {
"started": int(started_count),
"nodes": started_nodes,
"interval": int(metrics_collector.collection_interval),
}
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")
@router.get("/metrics/collectors/status")
async def get_collectors_status(
cluster: str | None = Query(None),
user=Depends(get_current_user),
db: AsyncSession = Depends(get_db),
):
"""查询指标采集器的状态"""
try:
name = _get_username(user)
        # Even if validation fails or an error occurs, return a friendly
        # 200-shaped response instead of letting the endpoint crash
try:
status = metrics_collector.get_collectors_status()
errors = metrics_collector.get_errors()
interval = int(metrics_collector.collection_interval)
            # Filter by cluster UUID when one is provided
            if cluster:
                # Fetch the node list for that cluster
cid = await _ensure_access(db, name, cluster)
if cid:
res = await db.execute(select(Node.hostname).where(Node.cluster_id == cid))
cluster_nodes = set(str(hn) for (hn,) in res.all())
status = {k: v for k, v in status.items() if k in cluster_nodes}
errors = {k: v for k, v in errors.items() if k in cluster_nodes}
else:
                    # Return empty results instead of an error when access is denied
status = {}
errors = {}
return {
"is_running": any(status.values()) if status else False,
"active_collectors_count": int(sum(1 for v in status.values() if v)),
"interval": interval,
"collectors": status,
"errors": errors
}
except Exception as inner_e:
return {
"is_running": False,
"active_collectors_count": 0,
"interval": 5,
"collectors": {},
"errors": {"system": str(inner_e)}
}
except Exception as e:
        # Top-level exception catch
return {
"is_running": False,
"active_collectors_count": 0,
"interval": 5,
"collectors": {},
"errors": {"fatal": str(e)}
}
@router.post("/metrics/collectors/stop-by-cluster/{cluster_uuid}")
async def stop_collectors_by_cluster(
cluster_uuid: str,
user=Depends(get_current_user),
db: AsyncSession = Depends(get_db),
):
try:
name = _get_username(user)
cid = await _ensure_access(db, name, cluster_uuid)
if not cid:
raise HTTPException(status_code=403, detail="not_allowed")
res = await db.execute(select(Node.hostname).where(Node.cluster_id == cid))
hostnames = [str(hn) for (hn,) in res.all()]
stopped = []
for hn in hostnames:
if hn in metrics_collector.collectors:
metrics_collector.stop(hn)
stopped.append(hn)
return {"stopped": int(len(stopped)), "nodes": stopped}
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")
@router.get("/metrics/cpu_trend")
async def cpu_trend(cluster: str = Query(...), user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
"""获取指定集群的 CPU 使用率趋势数据。"""
try:
name = _get_username(user)
cid = await _ensure_access(db, name, cluster)
if not cid:
raise HTTPException(status_code=403, detail="not_allowed")
res = await db.execute(select(Node.cpu_usage).where(Node.cluster_id == cid))
vals = [v for v in res.scalars().all() if v is not None]
base = sum(vals) / len(vals) if vals else 30.0
pattern = [-10, -5, 0, 5, 10, 5, 0]
series = [max(0, min(100, int(round(base + d)))) for d in pattern]
return {"times": ["00:00","04:00","08:00","12:00","16:00","20:00","24:00"], "values": series}
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")
@router.get("/metrics/memory_usage")
async def memory_usage(cluster: str = Query(...), user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
"""获取指定集群的内存使用情况(单位:百分比)。"""
try:
name = _get_username(user)
cid = await _ensure_access(db, name, cluster)
if not cid:
raise HTTPException(status_code=403, detail="not_allowed")
res = await db.execute(select(Node.memory_usage).where(Node.cluster_id == cid))
vals = [v for v in res.scalars().all() if v is not None]
used = round(sum(vals) / len(vals), 1) if vals else 30.0
free = round(max(0.0, 100.0 - used), 1)
return {"used": used, "free": free}
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")
@router.get("/metrics/cpu_trend_node")
async def cpu_trend_node(cluster: str = Query(...), node: str = Query(...), user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
"""获取指定节点的 CPU 使用率趋势数据。"""
try:
name = _get_username(user)
cid = await _ensure_access(db, name, cluster)
if not cid:
raise HTTPException(status_code=403, detail="not_allowed")
res = await db.execute(select(Node.cpu_usage).where(Node.cluster_id == cid, Node.hostname == node).limit(1))
v = res.scalars().first()
base = float(v) if v is not None else 30.0
pattern = [-10, -5, 0, 5, 10, 5, 0]
series = [max(0, min(100, int(round(base + d)))) for d in pattern]
return {"times": ["00:00","04:00","08:00","12:00","16:00","20:00","24:00"], "values": series}
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")
@router.get("/metrics/memory_usage_node")
async def memory_usage_node(cluster: str = Query(...), node: str = Query(...), user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
"""获取指定节点的内存使用情况(单位:百分比)。"""
try:
name = _get_username(user)
cid = await _ensure_access(db, name, cluster)
if not cid:
raise HTTPException(status_code=403, detail="not_allowed")
res = await db.execute(select(Node.memory_usage).where(Node.cluster_id == cid, Node.hostname == node).limit(1))
v = res.scalars().first()
used = round(float(v), 1) if v is not None else 30.0
free = round(max(0.0, 100.0 - used), 1)
return {"used": used, "free": free}
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")

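The `cpu_trend` endpoints above do not return a real time series: they synthesize a 7-point curve from the node's (or cluster average) current usage. A minimal sketch of that placeholder logic:

```python
PATTERN = (-10, -5, 0, 5, 10, 5, 0)  # fixed offsets for 00:00 .. 24:00

def synth_trend(base: float, pattern: tuple = PATTERN) -> list[int]:
    """Build the placeholder trend series: base usage plus fixed offsets,
    clamped to the valid percentage range [0, 100]."""
    return [max(0, min(100, int(round(base + d)))) for d in pattern]

assert synth_trend(30.0) == [20, 25, 30, 35, 40, 35, 30]
```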
@ -1,120 +0,0 @@
from fastapi import APIRouter, Depends, HTTPException, Query
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select, update, delete, func, text
from ..db import get_db
from ..deps.auth import get_current_user
from ..models.nodes import Node
from ..models.clusters import Cluster
from pydantic import BaseModel
from datetime import datetime, timezone
from ..config import now_bj
router = APIRouter()
def _get_username(u) -> str | None:
return getattr(u, "username", None) or (u.get("username") if isinstance(u, dict) else None)
def _status_to_contract(s: str) -> str:
if s == "healthy":
return "running"
if s == "unhealthy":
return "stopped"
return s or "unknown"
def _fmt_percent(v: float | None) -> str:
if v is None:
return "-"
return f"{int(round(v))}%"
def _fmt_updated(ts: datetime | None) -> str:
if not ts:
return "-"
now = now_bj()
diff = int((now - ts).total_seconds())
if diff < 60:
return "刚刚"
if diff < 3600:
return f"{diff // 60}分钟前"
return f"{diff // 3600}小时前"
class NodeDetail(BaseModel):
name: str
metrics: dict
@router.get("/nodes")
async def list_nodes(cluster: str = Query(...), user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
"""拉取指定集群的节点列表。"""
try:
name = _get_username(user)
uid_res = await db.execute(text("SELECT id FROM users WHERE username=:un LIMIT 1"), {"un": name})
uid_row = uid_res.first()
if not uid_row:
return {"nodes": []}
cid_res = await db.execute(select(Cluster.id).where(Cluster.uuid == cluster).limit(1))
cid = cid_res.scalars().first()
if not cid:
return {"nodes": []}
auth_res = await db.execute(text("SELECT 1 FROM user_cluster_mapping WHERE user_id=:uid AND cluster_id=:cid LIMIT 1"), {"uid": uid_row[0], "cid": cid})
if not auth_res.first():
raise HTTPException(status_code=403, detail="not_allowed")
result = await db.execute(select(Node).where(Node.cluster_id == cid).limit(500))
rows = result.scalars().all()
data = [
{
"name": n.hostname,
"ip": str(getattr(n, "ip_address", "")) if getattr(n, "ip_address", None) else None,
"status": _status_to_contract(n.status),
"cpu": _fmt_percent(n.cpu_usage),
"mem": _fmt_percent(n.memory_usage),
"updated": _fmt_updated(n.last_heartbeat),
}
for n in rows
]
return {"nodes": data}
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")
@router.get("/nodes/{name}")
async def node_detail(name: str, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
"""查询节点详情。"""
try:
name_u = _get_username(user)
uid_res = await db.execute(text("SELECT id FROM users WHERE username=:un LIMIT 1"), {"un": name_u})
uid_row = uid_res.first()
if not uid_row:
raise HTTPException(status_code=404, detail="not_found")
        # Only return the node if it belongs to a cluster the user can access
ids_res = await db.execute(text("SELECT cluster_id FROM user_cluster_mapping WHERE user_id=:uid"), {"uid": uid_row[0]})
cluster_ids = [r[0] for r in ids_res.all()]
if not cluster_ids:
raise HTTPException(status_code=404, detail="not_found")
res = await db.execute(select(Node).where(Node.hostname == name, Node.cluster_id.in_(cluster_ids)).limit(1))
n = res.scalars().first()
if not n:
raise HTTPException(status_code=404, detail="not_found")
return NodeDetail(
name=n.hostname,
metrics={
"cpu": _fmt_percent(n.cpu_usage),
"mem": _fmt_percent(n.memory_usage),
"disk": _fmt_percent(n.disk_usage),
"status": _status_to_contract(n.status),
"ip": str(getattr(n, "ip_address", "")) if getattr(n, "ip_address", None) else None,
"lastHeartbeat": getattr(n, "last_heartbeat", None).isoformat() if getattr(n, "last_heartbeat", None) else None,
},
).model_dump()
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")

@ -1,287 +0,0 @@
from fastapi import APIRouter, Depends, HTTPException
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select, text
from pydantic import BaseModel, Field
from datetime import datetime, timezone
import shlex
import uuid as uuidlib
import asyncio
from ..db import get_db
from ..deps.auth import get_current_user, PermissionChecker
from ..models.nodes import Node
from ..models.clusters import Cluster
from ..models.sys_exec_logs import SysExecLog
from ..models.hadoop_exec_logs import HadoopExecLog
from ..services.runner import run_remote_command
from ..ssh_utils import SSHClient
from ..config import now_bj
router = APIRouter()
def _now() -> datetime:
"""返回当前 UTC 时间。"""
return now_bj()
def _get_username(u) -> str:
"""提取用户名。"""
return getattr(u, "username", None) or (u.get("username") if isinstance(u, dict) else None) or "system"
def _require_ops(u):
"""校验用户是否具有运维权限。"""
name = _get_username(u)
if name not in {"admin", "ops"}:
raise HTTPException(status_code=403, detail="not_allowed")
async def _find_accessible_node(db: AsyncSession, user_name: str, hostname: str) -> Node | None:
"""在用户可访问的集群中查找指定主机名的节点。"""
uid_res = await db.execute(text("SELECT id FROM users WHERE username=:un LIMIT 1"), {"un": user_name})
uid_row = uid_res.first()
if not uid_row:
return None
ids_res = await db.execute(text("SELECT cluster_id FROM user_cluster_mapping WHERE user_id=:uid"), {"uid": uid_row[0]})
cluster_ids = [r[0] for r in ids_res.all()]
if not cluster_ids:
return None
res = await db.execute(select(Node).where(Node.hostname == hostname, Node.cluster_id.in_(cluster_ids)).limit(1))
return res.scalars().first()
def _gen_exec_id() -> str:
"""生成执行记录ID。"""
return uuidlib.uuid4().hex[:32]
class ReadLogReq(BaseModel):
    node: str = Field(..., description="Target node hostname")
    path: str = Field(..., description="Log file path")
    lines: int = Field(200, ge=1, le=5000, description="Number of lines to read")
    pattern: str | None = Field(None, description="Optional filter regex")
    sshUser: str | None = Field(None, description="SSH username (optional)")
    timeout: int = Field(20, ge=1, le=120, description="Command timeout in seconds")
async def _write_exec_log(db: AsyncSession, operation_id: str, description: str, user_id: int):
"""写入系统操作日志。"""
row = SysExecLog(
user_id=user_id,
description=description,
operation_time=_now()
)
db.add(row)
await db.flush()
await db.commit()
@router.post("/ops/read-log")
async def read_log(req: ReadLogReq, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
"""读取远端日志文件内容,支持可选筛选。"""
try:
_require_ops(user)
uname = _get_username(user)
        # user_id is assumed to come from the user object (fallback: 1)
        user_id = getattr(user, "id", 1)
node = await _find_accessible_node(db, uname, req.node)
if not node:
raise HTTPException(status_code=404, detail="node_not_found")
path_q = shlex.quote(req.path)
cmd = f"tail -n {req.lines} {path_q}"
if req.pattern:
pat_q = shlex.quote(req.pattern)
cmd = f"{cmd} | grep -E {pat_q}"
start = _now()
code, out, err = await run_remote_command(str(getattr(node, "ip_address", "")), req.sshUser or "", cmd, timeout=req.timeout)
desc = f"Read log: {req.path} on {req.node} (Exit: {code})"
await _write_exec_log(db, None, desc, user_id)
if code != 0:
raise HTTPException(status_code=500, detail="exec_failed")
        lines = out.splitlines()
return {"exitCode": code, "lines": lines}
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")
async def _write_hadoop_exec_log(db: AsyncSession, user_id: int, cluster_name: str, description: str, start_time: datetime, end_time: datetime):
"""写入 Hadoop 执行审计日志。"""
row = HadoopExecLog(
from_user_id=user_id,
cluster_name=cluster_name,
description=description,
start_time=start_time,
end_time=end_time
)
db.add(row)
await db.flush()
await db.commit()
@router.post("/ops/clusters/{cluster_uuid}/start")
async def start_cluster(
cluster_uuid: str,
user=Depends(PermissionChecker(["cluster:start"])),
db: AsyncSession = Depends(get_db)
):
"""启动集群:在 NameNode 执行 hsfsstart在 ResourceManager 执行 yarnstart。"""
try:
        # Validate the UUID format
try:
uuidlib.UUID(cluster_uuid)
except ValueError:
raise HTTPException(status_code=400, detail="invalid_uuid_format")
uname = _get_username(user)
user_id = getattr(user, "id", 1)
        # 1. Look up the cluster
res = await db.execute(select(Cluster).where(Cluster.uuid == cluster_uuid).limit(1))
cluster = res.scalars().first()
if not cluster:
raise HTTPException(status_code=404, detail="cluster_not_found")
        # 2. Resolve the SSH user (from an associated node, default "hadoop")
node_res = await db.execute(select(Node).where(Node.cluster_id == cluster.id).limit(1))
node = node_res.scalars().first()
ssh_user = node.ssh_user if node and node.ssh_user else "hadoop"
start_time = _now()
logs = []
        # 3. Run start-dfs.sh on the NameNode
if cluster.namenode_ip and cluster.namenode_psw:
try:
def run_nn_start():
with SSHClient(str(cluster.namenode_ip), ssh_user, cluster.namenode_psw) as client:
return client.execute_command("start-dfs.sh")
out, err = await asyncio.to_thread(run_nn_start)
logs.append(f"NameNode ({cluster.namenode_ip}) start: {out} {err}")
except Exception as e:
logs.append(f"NameNode ({cluster.namenode_ip}) start failed: {str(e)}")
        # 4. Run start-yarn.sh on the ResourceManager
if cluster.rm_ip and cluster.rm_psw:
try:
def run_rm_start():
with SSHClient(str(cluster.rm_ip), ssh_user, cluster.rm_psw) as client:
return client.execute_command("start-yarn.sh")
out, err = await asyncio.to_thread(run_rm_start)
logs.append(f"ResourceManager ({cluster.rm_ip}) start: {out} {err}")
except Exception as e:
logs.append(f"ResourceManager ({cluster.rm_ip}) start failed: {str(e)}")
end_time = _now()
# 5. Update cluster status: mark healthy only if no step logged a failure
has_failed = any("failed" in log.lower() for log in logs)
if not has_failed:
cluster.health_status = "healthy"
else:
cluster.health_status = "error"
cluster.updated_at = end_time
await db.flush()
# 6. Record the audit log
full_desc = " | ".join(logs)
await _write_hadoop_exec_log(db, user_id, cluster.name, f"Start Cluster: {full_desc}", start_time, end_time)
return {"status": "success", "logs": logs}
except HTTPException:
raise
except Exception as e:
print(f"Error starting cluster: {e}")
raise HTTPException(status_code=500, detail="server_error")
@router.post("/ops/clusters/{cluster_uuid}/stop")
async def stop_cluster(
cluster_uuid: str,
user=Depends(PermissionChecker(["cluster:stop"])),
db: AsyncSession = Depends(get_db)
):
"""停止集群:在 NameNode 执行 hsfsstop在 ResourceManager 执行 yarnstop。"""
try:
# Validate UUID format
try:
uuidlib.UUID(cluster_uuid)
except ValueError:
raise HTTPException(status_code=400, detail="invalid_uuid_format")
user_id = getattr(user, "id", 1)
# 1. Look up the cluster
res = await db.execute(select(Cluster).where(Cluster.uuid == cluster_uuid).limit(1))
cluster = res.scalars().first()
if not cluster:
raise HTTPException(status_code=404, detail="cluster_not_found")
# 2. Resolve the SSH user
node_res = await db.execute(select(Node).where(Node.cluster_id == cluster.id).limit(1))
node = node_res.scalars().first()
ssh_user = node.ssh_user if node and node.ssh_user else "hadoop"
start_time = _now()
logs = []
# 3. Run stop-dfs.sh on the NameNode
if cluster.namenode_ip and cluster.namenode_psw:
try:
def run_nn_stop():
with SSHClient(str(cluster.namenode_ip), ssh_user, cluster.namenode_psw) as client:
return client.execute_command("stop-dfs.sh")
out, err = await asyncio.to_thread(run_nn_stop)
logs.append(f"NameNode ({cluster.namenode_ip}) stop: {out} {err}")
except Exception as e:
logs.append(f"NameNode ({cluster.namenode_ip}) stop failed: {str(e)}")
# 4. Run stop-yarn.sh on the ResourceManager
if cluster.rm_ip and cluster.rm_psw:
try:
def run_rm_stop():
with SSHClient(str(cluster.rm_ip), ssh_user, cluster.rm_psw) as client:
return client.execute_command("stop-yarn.sh")
out, err = await asyncio.to_thread(run_rm_stop)
logs.append(f"ResourceManager ({cluster.rm_ip}) stop: {out} {err}")
except Exception as e:
logs.append(f"ResourceManager ({cluster.rm_ip}) stop failed: {str(e)}")
end_time = _now()
# 5. Update cluster status
cluster.health_status = "unknown"
cluster.updated_at = end_time
await db.flush()
# 6. Record the audit log
full_desc = " | ".join(logs)
await _write_hadoop_exec_log(db, user_id, cluster.name, f"Stop Cluster: {full_desc}", start_time, end_time)
return {"status": "success", "logs": logs}
except HTTPException:
raise
except Exception as e:
print(f"Error stopping cluster: {e}")
raise HTTPException(status_code=500, detail="server_error")
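The start/stop handlers above offload each blocking SSH call with `asyncio.to_thread` and collect per-host results instead of letting one failure abort the loop. A minimal sketch of that pattern, using a hypothetical `run_remote` stand-in rather than the project's `SSHClient`:

```python
import asyncio

def run_remote(command: str) -> tuple[str, str]:
    # Stand-in for a blocking SSH call; returns (stdout, stderr).
    return (f"ran: {command}", "")

async def start_services(commands: list[str]) -> list[str]:
    logs = []
    for cmd in commands:
        try:
            # Offload the blocking call so the event loop stays responsive.
            out, err = await asyncio.to_thread(run_remote, cmd)
            logs.append(f"{cmd}: {out} {err}".strip())
        except Exception as e:
            # Record the failure and keep going, as the handlers do.
            logs.append(f"{cmd} failed: {e}")
    return logs

logs = asyncio.run(start_services(["start-dfs.sh", "start-yarn.sh"]))
```

As in the handlers, a downstream check (e.g. scanning `logs` for "failed") can then decide the overall status.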

@ -1,10 +0,0 @@
from fastapi import APIRouter, Depends
from ..deps.auth import get_current_user
router = APIRouter()
@router.get("/user/me")
async def me(user = Depends(get_current_user)):
if isinstance(user, dict):
return {"username": user.get("username"), "fullName": user.get("full_name"), "isActive": user.get("is_active")}
return {"username": user.username, "fullName": user.full_name, "isActive": user.is_active}

@ -1,61 +0,0 @@
from fastapi import APIRouter, Depends, HTTPException, Query
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select, delete, func
from ..db import get_db
from ..models.sys_exec_logs import SysExecLog
from ..deps.auth import get_current_user
from pydantic import BaseModel
from datetime import datetime
router = APIRouter()
class SysExecLogCreate(BaseModel):
user_id: int
description: str
@router.get("/sys-exec-logs")
async def list_sys_exec_logs(
user=Depends(get_current_user),
db: AsyncSession = Depends(get_db),
page: int = Query(1, ge=1),
size: int = Query(10, ge=1, le=100),
):
try:
stmt = select(SysExecLog).order_by(SysExecLog.operation_time.desc()).offset((page - 1) * size).limit(size)
count_stmt = select(func.count(SysExecLog.operation_id))
rows = (await db.execute(stmt)).scalars().all()
total = (await db.execute(count_stmt)).scalar() or 0
return {
"items": [r.to_dict() for r in rows],
"total": int(total)
}
except Exception as e:
print(f"Error listing sys exec logs: {e}")
raise HTTPException(status_code=500, detail="server_error")
@router.post("/sys-exec-logs")
async def create_sys_exec_log(req: SysExecLogCreate, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
try:
row = SysExecLog(
user_id=req.user_id,
description=req.description
)
db.add(row)
await db.commit()
return {"ok": True, "operation_id": str(row.operation_id)}
except Exception as e:
print(f"Error creating sys exec log: {e}")
raise HTTPException(status_code=500, detail="server_error")
@router.delete("/sys-exec-logs/{operation_id}")
async def delete_sys_exec_log(operation_id: str, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
try:
# Note: operation_id is UUID
await db.execute(delete(SysExecLog).where(SysExecLog.operation_id == operation_id))
await db.commit()
return {"ok": True}
except Exception as e:
print(f"Error deleting sys exec log: {e}")
raise HTTPException(status_code=500, detail="server_error")
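The list endpoint above converts 1-based `page`/`size` query parameters into an OFFSET/LIMIT pair for the SQL query. A minimal sketch of that conversion:

```python
def page_slice(page: int, size: int) -> tuple[int, int]:
    # page is 1-based; returns (offset, limit) for an OFFSET/LIMIT query,
    # matching .offset((page - 1) * size).limit(size) above.
    return (page - 1) * size, size
```

With `page=3, size=10` this skips the first 20 rows and returns the next 10.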

@ -1,278 +0,0 @@
from fastapi import APIRouter, Depends, HTTPException
from pydantic import BaseModel
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select, update, delete, func, text
from ..db import get_db
from ..models.users import User
from ..deps.auth import get_current_user
from passlib.hash import bcrypt
from datetime import datetime, timezone
import re
from ..config import now_bj
router = APIRouter()
ROLE_OVERRIDES: dict[str, str] = {}
class CreateUserRequest(BaseModel):
username: str
email: str
role: str
status: str
sort: int = 0
class UpdateUserRequest(BaseModel):
role: str | None = None
status: str | None = None
sort: int | None = None
class ChangePasswordRequest(BaseModel):
currentPassword: str
newPassword: str
def _status_to_active(status: str) -> bool:
return status == "enabled"
def _active_to_status(active: bool) -> str:
return "enabled" if active else "disabled"
async def _get_user_id(db: AsyncSession, username: str) -> int | None:
res = await db.execute(text("SELECT id FROM users WHERE username=:u LIMIT 1"), {"u": username})
row = res.first()
return row[0] if row else None
async def _get_role_id(db: AsyncSession, role_key: str) -> int | None:
res = await db.execute(text("SELECT id FROM roles WHERE role_key=:k LIMIT 1"), {"k": role_key})
row = res.first()
return row[0] if row else None
async def _get_role_key(db: AsyncSession, username: str) -> str | None:
res = await db.execute(
text(
"SELECT r.role_key FROM roles r JOIN user_role_mapping m ON r.id=m.role_id JOIN users u ON u.id=m.user_id WHERE u.username=:u LIMIT 1"
),
{"u": username},
)
row = res.first()
return row[0] if row else None
async def _set_user_role(db: AsyncSession, username: str, role_key: str) -> bool:
uid = await _get_user_id(db, username)
if uid is None:
return False
rid = await _get_role_id(db, role_key)
if rid is None:
return False
await db.execute(text("DELETE FROM user_role_mapping WHERE user_id=:uid"), {"uid": uid})
await db.execute(text("INSERT INTO user_role_mapping(user_id, role_id) VALUES(:uid, :rid)"), {"uid": uid, "rid": rid})
await db.commit()
return True
def _role_or_default(username: str) -> str:
if username in ROLE_OVERRIDES:
return ROLE_OVERRIDES[username]
if username == "admin":
return "admin"
if username == "ops":
return "operator"
if username == "obs":
return "observer"
return "observer"
def _get_username(u) -> str:
return getattr(u, "username", None) or (u.get("username") if isinstance(u, dict) else None)
def _require_permission(user, permission: str):
perms = user.get("permissions", []) if isinstance(user, dict) else getattr(user, "permissions", [])
if permission not in perms:
raise HTTPException(status_code=403, detail=f"Permission denied: {permission}")
@router.get("/users")
async def list_users(user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
try:
_require_permission(user, "auth:manage")
result = await db.execute(select(User).order_by(User.sort.desc()).limit(500))
rows = result.scalars().all()
users = []
for u in rows:
rk = await _get_role_key(db, u.username)
users.append(
{
"username": u.username,
"email": u.email,
"role": rk or "observer",
"status": _active_to_status(u.is_active),
"sort": u.sort,
}
)
return {"users": users}
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")
@router.post("/users")
async def create_user(req: CreateUserRequest, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
try:
_require_permission(user, "auth:manage")
errors: list[dict] = []
if not (3 <= len(req.username) <= 50) or not re.fullmatch(r"^[A-Za-z][A-Za-z0-9_]{2,49}$", req.username or ""):
errors.append({"field": "username", "message": "Username must start with a letter and contain only letters, digits, or underscores (3-50 characters)"})
if not re.fullmatch(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", req.email or ""):
errors.append({"field": "email", "message": "Invalid email format"})
if req.role not in {"admin", "operator", "observer"}:
errors.append({"field": "role", "message": "Role must be admin, operator, or observer"})
if req.status not in {"enabled", "pending", "disabled"}:
errors.append({"field": "status", "message": "Status must be enabled, pending, or disabled"})
if errors:
raise HTTPException(status_code=400, detail={"errors": errors})
exists_username = await db.execute(select(User.id).where(User.username == req.username).limit(1))
if exists_username.scalars().first():
raise HTTPException(status_code=409, detail={"errors": [{"field": "username", "message": "Username already exists"}]})
exists_email = await db.execute(select(User.id).where(User.email == req.email).limit(1))
if exists_email.scalars().first():
raise HTTPException(status_code=409, detail={"errors": [{"field": "email", "message": "Email already exists"}]})
temp_password = "TempPass#123"
password_hash = bcrypt.hash(temp_password)
now = now_bj()
user_obj = User(
username=req.username,
email=req.email,
password_hash=password_hash,
full_name=req.username,
is_active=_status_to_active(req.status),
sort=req.sort,
last_login=None,
created_at=now,
updated_at=now,
)
db.add(user_obj)
await db.flush()
await db.commit()
ok = await _set_user_role(db, req.username, req.role)
if not ok:
raise HTTPException(status_code=400, detail={"errors": [{"field": "role", "message": "Role does not exist"}]})
return {"ok": True}
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")
@router.patch("/users/{username}")
async def update_user(username: str, req: UpdateUserRequest, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
try:
_require_permission(user, "auth:manage")
result = await db.execute(select(User).where(User.username == username).limit(1))
u = result.scalars().first()
if not u:
raise HTTPException(status_code=404, detail="not_found")
updates = {}
if req.status is not None:
if req.status not in {"enabled", "disabled"}:
raise HTTPException(status_code=400, detail="invalid_status")
updates["is_active"] = _status_to_active(req.status)
if req.sort is not None:
updates["sort"] = req.sort
if req.role is not None:
if req.role not in {"admin", "operator", "observer"}:
raise HTTPException(status_code=400, detail={"errors": [{"field": "role", "message": "Role not allowed"}]})
ok = await _set_user_role(db, username, req.role)
if not ok:
raise HTTPException(status_code=400, detail={"errors": [{"field": "role", "message": "Role does not exist"}]})
if updates:
updates["updated_at"] = func.now()
await db.execute(update(User).where(User.id == u.id).values(**updates))
await db.commit()
return {"ok": True}
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")
@router.delete("/users/{username}")
async def delete_user(username: str, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
try:
_require_permission(user, "auth:manage")
result = await db.execute(select(User).where(User.username == username).limit(1))
u = result.scalars().first()
if not u:
ROLE_OVERRIDES.pop(username, None)
return {"ok": True}
await db.execute(delete(User).where(User.id == u.id))
await db.commit()
ROLE_OVERRIDES.pop(username, None)
return {"ok": True}
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")
@router.get("/users/with-roles")
async def list_users_with_roles(user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
try:
_require_permission(user, "auth:manage")
res = await db.execute(
text(
"SELECT u.username,u.email,u.is_active,r.role_key FROM users u LEFT JOIN user_role_mapping m ON u.id=m.user_id LEFT JOIN roles r ON r.id=m.role_id LIMIT 500"
)
)
rows = res.all()
users = [
{
"username": r[0],
"email": r[1],
"role": r[3] or "observer",
"status": _active_to_status(r[2]),
}
for r in rows
]
return {"users": users}
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")
@router.patch("/user/password")
async def change_password(req: ChangePasswordRequest, user=Depends(get_current_user), db: AsyncSession = Depends(get_db)):
try:
username = _get_username(user)
# Protect demo accounts
if username in {"admin", "ops", "obs"}:
raise HTTPException(status_code=400, detail="demo_user_cannot_change_password")
# Password strength check
if not (8 <= len(req.newPassword) <= 128) or not re.search(r"[A-Z]", req.newPassword) or not re.search(r"[a-z]", req.newPassword) or not re.search(r"\d", req.newPassword):
raise HTTPException(status_code=400, detail="weak_new_password")
# Look up the real user
res = await db.execute(select(User).where(User.username == username).limit(1))
u = res.scalars().first()
if not u:
raise HTTPException(status_code=401, detail="user_not_found")
# Verify the current password
if not bcrypt.verify(req.currentPassword, u.password_hash):
raise HTTPException(status_code=400, detail="invalid_current_password")
# Update the password
new_hash = bcrypt.hash(req.newPassword)
await db.execute(update(User).where(User.id == u.id).values(password_hash=new_hash, updated_at=func.now()))
await db.commit()
return {"ok": True}
except HTTPException:
raise
except Exception:
raise HTTPException(status_code=500, detail="server_error")
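The password policy enforced above (8-128 characters with at least one uppercase letter, one lowercase letter, and one digit) can be factored into a standalone predicate, sketched here with a hypothetical helper name:

```python
import re

def is_strong_password(pw: str) -> bool:
    # Mirrors the endpoint's policy: length 8-128 with at least one
    # uppercase letter, one lowercase letter, and one digit.
    return (
        8 <= len(pw) <= 128
        and re.search(r"[A-Z]", pw) is not None
        and re.search(r"[a-z]", pw) is not None
        and re.search(r"\d", pw) is not None
    )
```

Extracting the check makes it unit-testable without spinning up the endpoint.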

@ -1,39 +0,0 @@
from pydantic import BaseModel
from typing import List, Dict, Optional
class LogRequest(BaseModel):
"""Log request model"""
node_name: str
log_type: str
start_date: Optional[str] = None
end_date: Optional[str] = None
class SaveLogRequest(BaseModel):
"""Save log request model"""
node_name: str
log_type: str
local_file_path: str
class LogResponse(BaseModel):
"""Log response model"""
node_name: str
log_type: str
log_content: str
class MultiLogResponse(BaseModel):
"""Multiple logs response model"""
logs: Dict[str, str]
class SaveLogResponse(BaseModel):
"""Save log response model"""
message: str
local_file_path: str
class NodeListResponse(BaseModel):
"""Node list response model"""
nodes: List[str]
class LogFilesResponse(BaseModel):
"""Log files list response model"""
node_name: str
log_files: List[str]

@ -1,33 +0,0 @@
import asyncio
import argparse
from sqlalchemy import select
from app.db import SessionLocal
from app.models.nodes import Node
from app.models.clusters import Cluster
from app.metrics_collector import metrics_collector
async def run(uuid: str):
async with SessionLocal() as session:
cid_res = await session.execute(select(Cluster.id).where(Cluster.uuid == uuid).limit(1))
cid = cid_res.scalars().first()
if not cid:
print("NO_CLUSTER")
return
res = await session.execute(select(Node.id, Node.hostname, Node.ip_address).where(Node.cluster_id == cid))
rows = res.all()
if not rows:
print("NO_NODES")
return
for nid, hn, ip in rows:
cpu, mem = metrics_collector._read_cpu_mem(hn, str(ip))
await metrics_collector._save_metrics(nid, hn, cid, cpu, mem)
print("DONE", len(rows))
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--cluster", required=True)
args = parser.parse_args()
asyncio.run(run(args.cluster))
if __name__ == "__main__":
main()

@ -1,27 +0,0 @@
import os
import asyncio
from sqlalchemy import text
from app.db import engine
async def main():
uuid = os.environ.get("CLUSTER_UUID")
async with engine.begin() as conn:
await conn.execute(text("ALTER TABLE clusters ADD COLUMN IF NOT EXISTS cpu_avg double precision"))
await conn.execute(text("ALTER TABLE clusters ADD COLUMN IF NOT EXISTS memory_avg double precision"))
await conn.execute(text("ALTER TABLE clusters ADD COLUMN IF NOT EXISTS last_avg_at timestamptz"))
async with engine.begin() as conn:
if uuid:
res = await conn.execute(text("SELECT id FROM clusters WHERE uuid=:u LIMIT 1"), {"u": uuid})
row = res.first()
if row:
cid = row[0]
avg = await conn.execute(text("SELECT AVG(cpu_usage), AVG(memory_usage) FROM nodes WHERE cluster_id=:cid"), {"cid": cid})
ar = avg.first()
await conn.execute(text("UPDATE clusters SET cpu_avg=:ca, memory_avg=:ma, last_avg_at=NOW() WHERE id=:cid"), {"ca": float(ar[0] or 0.0), "ma": float(ar[1] or 0.0), "cid": cid})
else:
avg = await conn.execute(text("SELECT AVG(cpu_usage), AVG(memory_usage) FROM nodes"))
ar = avg.first()
await conn.execute(text("UPDATE clusters SET cpu_avg=:ca, memory_avg=:ma, last_avg_at=NOW()"), {"ca": float(ar[0] or 0.0), "ma": float(ar[1] or 0.0)})
if __name__ == "__main__":
asyncio.run(main())

@ -1,12 +0,0 @@
import asyncio
from sqlalchemy import text
from app.db import engine
async def main():
async with engine.begin() as conn:
res = await conn.execute(text('SELECT id, hostname, cpu_usage, memory_usage, last_heartbeat FROM nodes ORDER BY id LIMIT 5'))
for row in res.all():
print('NODE', row)
if __name__ == '__main__':
asyncio.run(main())

@ -1,2 +0,0 @@
from app.config import DATABASE_URL
print(DATABASE_URL)

@ -1,14 +0,0 @@
import asyncio
from sqlalchemy import text
from app.db import engine
async def main():
async with engine.begin() as conn:
c = await conn.execute(text('SELECT COUNT(*) FROM hadoop_logs'))
print('HADOOP_LOGS_COUNT', c.scalar() or 0)
rows = await conn.execute(text('SELECT cluster_name,node_host,title,log_time FROM hadoop_logs ORDER BY log_id DESC LIMIT 5'))
for r in rows.all():
print('LOG', r)
if __name__ == "__main__":
asyncio.run(main())

@ -1,17 +0,0 @@
import os
import asyncio
from sqlalchemy import text
from app.db import engine
async def main():
uuid = os.environ.get("CLUSTER_UUID")
async with engine.begin() as conn:
if uuid:
res = await conn.execute(text("SELECT cpu_avg, memory_avg FROM clusters WHERE uuid=:u LIMIT 1"), {"u": uuid})
else:
res = await conn.execute(text("SELECT cpu_avg, memory_avg FROM clusters LIMIT 1"))
row = res.first()
print("CLUSTER_AVG_STORED", (float(row[0]) if row and row[0] is not None else 0.0), (float(row[1]) if row and row[1] is not None else 0.0))
if __name__ == "__main__":
asyncio.run(main())

@ -1,12 +0,0 @@
import asyncio
from sqlalchemy import text
from app.db import engine
async def main():
async with engine.begin() as conn:
await conn.execute(text("ALTER TABLE clusters DROP COLUMN IF EXISTS cpu_avg"))
await conn.execute(text("ALTER TABLE clusters DROP COLUMN IF EXISTS memory_avg"))
await conn.execute(text("ALTER TABLE clusters DROP COLUMN IF EXISTS last_avg_at"))
if __name__ == "__main__":
asyncio.run(main())

@ -1,66 +0,0 @@
import os
import asyncio
import time
from sqlalchemy import select, text
from app.db import SessionLocal, engine
from app.models.clusters import Cluster
from app.models.nodes import Node
from app.log_reader import log_reader
from app.log_collector import log_collector
async def run(cluster_uuid: str, interval: int = 3, duration: int = 10):
async with engine.begin() as conn:
res = await conn.execute(text("SELECT id FROM clusters WHERE uuid=:u LIMIT 1"), {"u": cluster_uuid})
row = res.first()
if not row:
print("CLUSTER_NOT_FOUND")
return
cid = row[0]
before = await conn.execute(text("SELECT COUNT(*) FROM hadoop_logs"))
print("HADOOP_LOGS_BEFORE", before.scalar() or 0)
async with SessionLocal() as session:
nodes_res = await session.execute(select(Node.hostname, Node.ip_address).where(Node.cluster_id == cid))
nodes = [(r[0], str(r[1])) for r in nodes_res.all()]
started = []
for hn, ip in nodes:
try:
log_reader.find_working_log_dir(hn, ip)
files = log_reader.get_log_files_list(hn, ip=ip)
except Exception:
files = []
services = set()
for f in files:
lf = f.lower()
if "namenode" in lf:
services.add("namenode")
elif "secondarynamenode" in lf:
services.add("secondarynamenode")
elif "datanode" in lf:
services.add("datanode")
elif "resourcemanager" in lf:
services.add("resourcemanager")
elif "nodemanager" in lf:
services.add("nodemanager")
elif "historyserver" in lf:
services.add("historyserver")
for t in services:
ok = log_collector.start_collection(hn, t, ip=ip, interval=interval)
if ok:
started.append(f"{hn}_{t}")
time.sleep(duration)
log_collector.stop_all_collections()
async with engine.begin() as conn:
after = await conn.execute(text("SELECT COUNT(*) FROM hadoop_logs"))
print("HADOOP_LOGS_AFTER", after.scalar() or 0)
last = await conn.execute(text("SELECT cluster_name, node_host, title, log_time FROM hadoop_logs ORDER BY log_id DESC LIMIT 5"))
for row in last.all():
print("LOG", row)
def main():
uuid = os.environ.get("CLUSTER_UUID")
interval = int(os.environ.get("LOG_INTERVAL", "3"))
duration = int(os.environ.get("LOG_DURATION", "10"))
asyncio.run(run(uuid, interval=interval, duration=duration))
if __name__ == "__main__":
main()

@ -1,11 +0,0 @@
import asyncio
from sqlalchemy import text
from app.db import SessionLocal
async def main():
async with SessionLocal() as session:
res = await session.execute(text('SELECT 1'))
print('OK', res.scalar())
if __name__ == '__main__':
asyncio.run(main())

@ -1,38 +0,0 @@
import asyncio
import os
from sqlalchemy import text
from app.db import engine
async def main():
uuid = os.environ.get("CLUSTER_UUID")
async with engine.begin() as conn:
cid = None
if uuid:
res = await conn.execute(text("SELECT id FROM clusters WHERE uuid=:u LIMIT 1"), {"u": uuid})
row = res.first()
cid = row[0] if row else None
if cid:
res1 = await conn.execute(text("SELECT COUNT(*) FROM nodes WHERE cluster_id=:cid AND last_heartbeat IS NOT NULL"), {"cid": cid})
else:
res1 = await conn.execute(text("SELECT COUNT(*) FROM nodes WHERE last_heartbeat IS NOT NULL"))
c1 = res1.scalar() or 0
print('NODES_WITH_HEARTBEAT_BEFORE', c1)
await asyncio.sleep(10)
async with engine.begin() as conn:
if cid:
res2 = await conn.execute(text("SELECT COUNT(*) FROM nodes WHERE cluster_id=:cid AND last_heartbeat IS NOT NULL"), {"cid": cid})
res3 = await conn.execute(text("SELECT hostname, cpu_usage, memory_usage, last_heartbeat FROM nodes WHERE cluster_id=:cid ORDER BY last_heartbeat DESC NULLS LAST LIMIT 5"), {"cid": cid})
avg = await conn.execute(text("SELECT AVG(cpu_usage), AVG(memory_usage) FROM nodes WHERE cluster_id=:cid"), {"cid": cid})
else:
res2 = await conn.execute(text("SELECT COUNT(*) FROM nodes WHERE last_heartbeat IS NOT NULL"))
res3 = await conn.execute(text("SELECT hostname, cpu_usage, memory_usage, last_heartbeat FROM nodes ORDER BY last_heartbeat DESC NULLS LAST LIMIT 5"))
avg = await conn.execute(text("SELECT AVG(cpu_usage), AVG(memory_usage) FROM nodes"))
c2 = res2.scalar() or 0
print('NODES_WITH_HEARTBEAT_AFTER', c2)
for row in res3.all():
print('NODE', row)
ar = avg.first()
print('CLUSTER_AVG', float(ar[0] or 0.0), float(ar[1] or 0.0))
if __name__ == '__main__':
asyncio.run(main())

@ -1,51 +0,0 @@
from __future__ import annotations
from ..config import SSH_TIMEOUT
from ..ssh_utils import SSHClient
def collect_cluster_uuid(host: str, user: str, password: str, timeout: int | None = None) -> tuple[str | None, str | None, str | None]:
cli = None
try:
cli = SSHClient(str(host), user or "", password or "")
out, err = cli.execute_command_with_timeout(
"hdfs getconf -confKey dfs.namenode.name.dir",
timeout or SSH_TIMEOUT,
)
if not out or not out.strip():
return None, "probe_name_dirs", (err or "empty_output")
name_dir = out.strip().split(",")[0]
if name_dir.startswith("file://"):
name_dir = name_dir[7:]
version_path = f"{name_dir.rstrip('/')}/current/VERSION"
version_out, version_err = cli.execute_command_with_timeout(
f"cat {version_path}",
timeout or SSH_TIMEOUT,
)
if not version_out or not version_out.strip():
return None, "read_version", (version_err or "empty_output")
cluster_id = None
for line in version_out.splitlines():
if "clusterID" in line:
parts = line.strip().split("=", 1)
if len(parts) == 2 and parts[0].strip() == "clusterID":
cluster_id = parts[1].strip()
break
if not cluster_id:
return None, "parse_cluster_id", version_out.strip()
if cluster_id.startswith("CID-"):
cluster_id = cluster_id[4:]
return cluster_id, None, None
except Exception as e:
return None, "connect_or_exec", str(e)
finally:
try:
if cli:
cli.close()
except Exception:
pass

@ -1,120 +0,0 @@
import os
import json
from typing import Any, Dict, Iterable, List, Optional
from dotenv import load_dotenv
try:
import httpx
except Exception: # pragma: no cover
httpx = None
load_dotenv()
_shared_async_client: Any = None
def _get_async_client() -> Any:
global _shared_async_client
if httpx is None:
return None
if _shared_async_client is None:
_shared_async_client = httpx.AsyncClient(
headers={},
limits=httpx.Limits(max_keepalive_connections=20, max_connections=50),
http2=True,
)
return _shared_async_client
_DEFAULT_ENDPOINTS: Dict[str, str] = {
"openai": "https://api.openai.com/v1/chat/completions",
"siliconflow": "https://api.siliconflow.cn/v1/chat/completions",
"deepseek": "https://api.deepseek.com/v1/chat/completions",
}
_DEFAULT_MODELS: Dict[str, str] = {
"openai": "gpt-4o-mini",
"siliconflow": "deepseek-ai/DeepSeek-V3",
"deepseek": "deepseek-v3",
"r1": "Pro/deepseek-ai/DeepSeek-R1",
}
def _clean_str(s: str) -> str:
if s is None:
return ""
s = s.strip()
if (s.startswith("`") and s.endswith("`")) or (s.startswith('"') and s.endswith('"')) or (s.startswith("'") and s.endswith("'")):
s = s[1:-1].strip()
return s
def _normalize_endpoint(ep: str) -> str:
if not ep:
return ep
s = _clean_str(ep).rstrip("/")
if s.endswith("/v1"):
return s + "/chat/completions"
if s.endswith("/chat/completions"):
return s
return s
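`_normalize_endpoint` accepts either a bare `/v1` base URL or a full `/chat/completions` URL and produces the latter. A compact restatement of that behavior:

```python
def normalize_endpoint(ep: str) -> str:
    # Accept a bare /v1 base URL or a full /chat/completions URL;
    # trailing slashes are tolerated, anything else passes through.
    s = ep.strip().rstrip("/")
    if s.endswith("/v1"):
        return s + "/chat/completions"
    return s
```

This lets `LLM_ENDPOINT` be configured either way without breaking the request path.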
class LLMClient:
def __init__(self):
self.provider = os.getenv("LLM_PROVIDER", "openai").strip().lower()
raw_endpoint = os.getenv("LLM_ENDPOINT", "") or _DEFAULT_ENDPOINTS.get(self.provider, _DEFAULT_ENDPOINTS["openai"])
self.endpoint = _normalize_endpoint(raw_endpoint)
self.model = _clean_str(os.getenv("LLM_MODEL", _DEFAULT_MODELS.get(self.provider, "gpt-4o-mini")))
api_key = os.getenv("LLM_API_KEY") or os.getenv("OPENAI_API_KEY") or os.getenv("DEEPSEEK_API_KEY") or os.getenv("SILICONFLOW_API_KEY") or ""
self.api_key = api_key
self.simulate = os.getenv("LLM_SIMULATE", "false").lower() == "true"
self.timeout = int(os.getenv("LLM_TIMEOUT", "300"))
def _headers(self) -> Dict[str, str]:
return {
"Authorization": f"Bearer {self.api_key}" if self.api_key else "",
"Content-Type": "application/json",
}
async def chat(self, messages: List[Dict[str, Any]], tools: Optional[List[Dict[str, Any]]] = None, stream: bool = False, model: Optional[str] = None) -> Any:
if self.simulate or httpx is None:
if stream:
async def _sim_stream():
yield {"choices": [{"delta": {"content": "模拟流式输出检测到错误日志建议重启或kill相关进程"}, "index": 0}]}
return _sim_stream()
return {
"choices": [
{
"message": {
"role": "assistant",
"content": "模拟输出检测到错误日志建议重启或kill相关进程",
"tool_calls": [],
}
}
]
}
target_model = model or self.model
payload: Dict[str, Any] = {"model": target_model, "messages": messages, "stream": stream}
if tools:
payload["tools"] = tools
payload["tool_choice"] = "auto"
if stream:
async def _stream_gen():
client = _get_async_client()
async with client.stream("POST", self.endpoint, headers=self._headers(), json=payload, timeout=self.timeout) as resp:
resp.raise_for_status()
async for line in resp.aiter_lines():
if not line or not line.startswith("data: "):
continue
data_str = line[6:].strip()
if data_str == "[DONE]":
break
try:
yield json.loads(data_str)
except json.JSONDecodeError:
continue
return _stream_gen()
client = _get_async_client()
resp = await client.post(self.endpoint, headers=self._headers(), json=payload, timeout=self.timeout)
resp.raise_for_status()
return resp.json()
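The streaming branch above parses OpenAI-style server-sent events: only lines prefixed `data: ` carry payloads, `[DONE]` terminates the stream, and malformed chunks are skipped. That parsing can be sketched in isolation:

```python
import json

def parse_sse_lines(lines):
    # Yield decoded JSON chunks from an OpenAI-style SSE stream,
    # stopping at the [DONE] sentinel and skipping malformed lines.
    for line in lines:
        if not line or not line.startswith("data: "):
            continue
        data = line[6:].strip()
        if data == "[DONE]":
            return
        try:
            yield json.loads(data)
        except json.JSONDecodeError:
            continue

chunks = list(parse_sse_lines([
    'data: {"choices": [{"delta": {"content": "hi"}}]}',
    "",
    "data: [DONE]",
    'data: {"never": "reached"}',
]))
```

Anything after the sentinel is ignored, which is why the generator returns rather than continuing.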

@ -1,820 +0,0 @@
import shlex
import asyncio
from typing import Any, Dict, List, Optional, Tuple
from datetime import datetime, timezone
import json
import re
import httpx
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select, text
import requests
from bs4 import BeautifulSoup
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
from ..models.nodes import Node
from ..models.clusters import Cluster
from ..models.hadoop_exec_logs import HadoopExecLog
from ..ssh_utils import SSHClient, ssh_manager
from ..log_reader import log_reader
from ..config import now_bj
def _now() -> datetime:
"""返回当前 UTC 时间。"""
return now_bj()
async def _find_accessible_node(db: AsyncSession, user_name: str, hostname: str) -> Optional[Node]:
"""校验用户对节点的访问权限,并返回节点对象。"""
uid_res = await db.execute(text("SELECT id FROM users WHERE username=:un LIMIT 1"), {"un": user_name})
uid_row = uid_res.first()
if not uid_row:
return None
ids_res = await db.execute(text("SELECT cluster_id FROM user_cluster_mapping WHERE user_id=:uid"), {"uid": uid_row[0]})
cluster_ids = [r[0] for r in ids_res.all()]
if not cluster_ids:
return None
res = await db.execute(select(Node).where(Node.hostname == hostname, Node.cluster_id.in_(cluster_ids)).limit(1))
return res.scalars().first()
async def _user_has_cluster_access(db: AsyncSession, user_name: str, cluster_id: int) -> bool:
uid_res = await db.execute(text("SELECT id FROM users WHERE username=:un LIMIT 1"), {"un": user_name})
uid_row = uid_res.first()
if not uid_row:
return False
ok_res = await db.execute(
text("SELECT 1 FROM user_cluster_mapping WHERE user_id=:uid AND cluster_id=:cid LIMIT 1"),
{"uid": uid_row[0], "cid": cluster_id},
)
return ok_res.first() is not None
async def _write_exec_log(db: AsyncSession, exec_id: str, command_type: str, status: str, start: datetime, end: Optional[datetime], exit_code: Optional[int], operator: str, stdout: Optional[str] = None, stderr: Optional[str] = None):
"""写入执行审计日志。"""
# 查找 from_user_id 和 cluster_name
uid_res = await db.execute(text("SELECT id FROM users WHERE username=:un LIMIT 1"), {"un": operator})
uid_row = uid_res.first()
from_user_id = uid_row[0] if uid_row else 1
# Resolve the cluster name (simplified: take the user's first associated cluster)
cluster_res = await db.execute(text("""
SELECT c.name
FROM clusters c
JOIN user_cluster_mapping m ON c.id = m.cluster_id
WHERE m.user_id = :uid LIMIT 1
"""), {"uid": from_user_id})
cluster_row = cluster_res.first()
cluster_name = cluster_row[0] if cluster_row else "default_cluster"
row = HadoopExecLog(
from_user_id=from_user_id,
cluster_name=cluster_name,
description=f"[{command_type}] {exec_id}",
start_time=start,
end_time=end
)
db.add(row)
await db.flush()
await db.commit()
async def tool_read_log(db: AsyncSession, user_name: str, node: str, path: str, lines: int = 200, pattern: Optional[str] = None, ssh_user: Optional[str] = None, timeout: int = 20) -> Dict[str, Any]:
"""工具:读取远端日志并可选筛选。"""
n = await _find_accessible_node(db, user_name, node)
if not n:
return {"error": "node_not_found"}
if not getattr(n, "ssh_password", None):
return {"error": "ssh_password_not_configured"}
path_q = shlex.quote(path)
cmd = f"tail -n {lines} {path_q}"
if pattern:
pat_q = shlex.quote(pattern)
cmd = f"{cmd} | grep -E {pat_q}"
start = _now()
bash_cmd = f"bash -lc {shlex.quote(cmd)}"
def _run():
client = ssh_manager.get_connection(
str(getattr(n, "hostname", node)),
ip=str(getattr(n, "ip_address", "")),
username=(ssh_user or getattr(n, "ssh_user", None) or "hadoop"),
password=str(getattr(n, "ssh_password", "")),
)
return client.execute_command_with_timeout_and_status(bash_cmd, timeout=timeout)
code, out, err = await asyncio.to_thread(_run)
end = _now()
exec_id = f"tool_{start.timestamp():.0f}"
await _write_exec_log(db, exec_id, "read_log", ("success" if code == 0 else "failed"), start, end, code, user_name, out, err)
return {"execId": exec_id, "exitCode": code, "stdout": out, "stderr": err}
async def _fetch_page_text(client: httpx.AsyncClient, url: str) -> str:
"""Fetch and extract text content from a URL."""
try:
# Skip if not a valid http url
if not url.startswith("http"):
return ""
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
resp = await client.get(url, headers=headers, follow_redirects=True)
if resp.status_code == 200:
soup = BeautifulSoup(resp.text, "html.parser")
# Remove scripts and styles
for script in soup(["script", "style", "nav", "footer", "header"]):
script.decompose()
text = soup.get_text(separator="\n", strip=True)
# Limit text length
return text[:2000]
except Exception:
pass
return ""
async def tool_web_search(query: str, max_results: int = 5) -> Dict[str, Any]:
"""工具联网搜索Baidu并读取网页内容。"""
try:
results = []
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1"
}
url = "https://www.baidu.com/s"
params = {"wd": query}
# Use sync requests for search page (stable)
resp = requests.get(url, params=params, headers=headers, timeout=10, verify=False)
if resp.status_code == 200:
soup = BeautifulSoup(resp.text, "html.parser")
# Baidu results are usually in div with class c-container
for item in soup.select("div.c-container, div.result.c-container")[:max_results]:
title_elem = item.select_one("h3")
if not title_elem:
continue
title = title_elem.get_text(strip=True)
link_elem = item.select_one("a")
href = link_elem.get("href") if link_elem else ""
# Abstract/Snippet
snippet = item.get_text(strip=True).replace(title, "")[:200]
results.append({
"title": title,
"href": href,
"body": snippet,
"full_content": "" # Placeholder
})
# Fetch full content for top 2 results
if results:
async with httpx.AsyncClient(timeout=10, verify=False) as client:
tasks = []
# Only fetch top 2 to avoid long wait
for r in results[:2]:
tasks.append(_fetch_page_text(client, r["href"]))
contents = await asyncio.gather(*tasks)
for i, content in enumerate(contents):
if content:
results[i]["full_content"] = content
# Append note to body to indicate full content is available
results[i]["body"] += "\n[Full content fetched]"
# Add current system time to help with "now" queries
current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S %A")
return {"query": query, "current_time": current_time, "results": results}
except Exception as e:
return {"error": str(e)}
async def tool_start_cluster(db: AsyncSession, user_name: str, cluster_uuid: str) -> Dict[str, Any]:
"""工具:启动 Hadoop 集群。"""
# 1. 权限与用户
uid_res = await db.execute(text("SELECT id FROM users WHERE username=:un LIMIT 1"), {"un": user_name})
uid_row = uid_res.first()
user_id = uid_row[0] if uid_row else 1
# 2. Look up the cluster
res = await db.execute(select(Cluster).where(Cluster.uuid == cluster_uuid).limit(1))
cluster = res.scalars().first()
if not cluster:
return {"error": "cluster_not_found"}
# 3. Determine the SSH user (taken from an associated node, defaulting to hadoop)
node_res = await db.execute(select(Node).where(Node.cluster_id == cluster.id).limit(1))
node = node_res.scalars().first()
ssh_user = node.ssh_user if node and node.ssh_user else "hadoop"
start_time = _now()
logs = []
# 4. Run start-dfs.sh on the NameNode
if cluster.namenode_ip and cluster.namenode_psw:
try:
def run_nn_start():
with SSHClient(str(cluster.namenode_ip), ssh_user, cluster.namenode_psw) as client:
return client.execute_command("start-dfs.sh")
out, err = await asyncio.to_thread(run_nn_start)
logs.append(f"NameNode ({cluster.namenode_ip}) start: {out} {err}")
except Exception as e:
logs.append(f"NameNode ({cluster.namenode_ip}) start failed: {str(e)}")
# 5. Run start-yarn.sh on the ResourceManager
if cluster.rm_ip and cluster.rm_psw:
try:
def run_rm_start():
with SSHClient(str(cluster.rm_ip), ssh_user, cluster.rm_psw) as client:
return client.execute_command("start-yarn.sh")
out, err = await asyncio.to_thread(run_rm_start)
logs.append(f"ResourceManager ({cluster.rm_ip}) start: {out} {err}")
except Exception as e:
logs.append(f"ResourceManager ({cluster.rm_ip}) start failed: {str(e)}")
end_time = _now()
# 6. Update cluster status (improved: mark as error if any step logged a failure)
has_failed = any("failed" in log.lower() for log in logs)
if not has_failed:
cluster.health_status = "healthy"
else:
cluster.health_status = "error"
cluster.updated_at = end_time
await db.flush()
# 7. Record the execution log
full_desc = " | ".join(logs)
exec_row = HadoopExecLog(
from_user_id=user_id,
cluster_name=cluster.name,
description=f"AI Tool Start Cluster: {full_desc}",
start_time=start_time,
end_time=end_time
)
db.add(exec_row)
await db.commit()
return {"status": "success", "logs": logs}
async def tool_stop_cluster(db: AsyncSession, user_name: str, cluster_uuid: str) -> Dict[str, Any]:
"""工具:停止 Hadoop 集群。"""
uid_res = await db.execute(text("SELECT id FROM users WHERE username=:un LIMIT 1"), {"un": user_name})
uid_row = uid_res.first()
user_id = uid_row[0] if uid_row else 1
res = await db.execute(select(Cluster).where(Cluster.uuid == cluster_uuid).limit(1))
cluster = res.scalars().first()
if not cluster:
return {"error": "cluster_not_found"}
node_res = await db.execute(select(Node).where(Node.cluster_id == cluster.id).limit(1))
node = node_res.scalars().first()
ssh_user = node.ssh_user if node and node.ssh_user else "hadoop"
start_time = _now()
logs = []
if cluster.namenode_ip and cluster.namenode_psw:
try:
def run_nn_stop():
with SSHClient(str(cluster.namenode_ip), ssh_user, cluster.namenode_psw) as client:
return client.execute_command("stop-dfs.sh")
out, err = await asyncio.to_thread(run_nn_stop)
logs.append(f"NameNode ({cluster.namenode_ip}) stop: {out} {err}")
except Exception as e:
logs.append(f"NameNode ({cluster.namenode_ip}) stop failed: {str(e)}")
if cluster.rm_ip and cluster.rm_psw:
try:
def run_rm_stop():
with SSHClient(str(cluster.rm_ip), ssh_user, cluster.rm_psw) as client:
return client.execute_command("stop-yarn.sh")
out, err = await asyncio.to_thread(run_rm_stop)
logs.append(f"ResourceManager ({cluster.rm_ip}) stop: {out} {err}")
except Exception as e:
logs.append(f"ResourceManager ({cluster.rm_ip}) stop failed: {str(e)}")
end_time = _now()
cluster.health_status = "unknown"
cluster.updated_at = end_time
await db.flush()
full_desc = " | ".join(logs)
exec_row = HadoopExecLog(
from_user_id=user_id,
cluster_name=cluster.name,
description=f"AI Tool Stop Cluster: {full_desc}",
start_time=start_time,
end_time=end_time
)
db.add(exec_row)
await db.commit()
return {"status": "success", "logs": logs}
async def tool_read_cluster_log(
db: AsyncSession,
user_name: str,
cluster_uuid: str,
log_type: str,
node_hostname: Optional[str] = None,
lines: int = 100,
) -> Dict[str, Any]:
"""读取集群中特定服务类型的日志。"""
import uuid as uuidlib
try:
uuidlib.UUID(cluster_uuid)
except ValueError:
return {"status": "error", "message": "invalid_uuid_format"}
stmt = select(Cluster).where(Cluster.uuid == cluster_uuid)
result = await db.execute(stmt)
cluster = result.scalar_one_or_none()
if not cluster:
return {"status": "error", "message": "cluster_not_found"}
if not await _user_has_cluster_access(db, user_name, int(cluster.id)):
return {"status": "error", "message": "cluster_forbidden"}
target_ip: Optional[str] = None
target_hostname: Optional[str] = node_hostname
ssh_user: Optional[str] = None
ssh_password: Optional[str] = None
if log_type.lower() == "namenode":
target_ip = str(cluster.namenode_ip) if cluster.namenode_ip else None
ssh_password = cluster.namenode_psw
if not target_hostname:
node_stmt = select(Node).where(Node.ip_address == cluster.namenode_ip)
node_res = await db.execute(node_stmt)
node_obj = node_res.scalar_one_or_none()
target_hostname = node_obj.hostname if node_obj else "namenode"
if node_obj and node_obj.ssh_user:
ssh_user = node_obj.ssh_user
elif log_type.lower() == "resourcemanager":
target_ip = str(cluster.rm_ip) if cluster.rm_ip else None
ssh_password = cluster.rm_psw
if not target_hostname:
node_stmt = select(Node).where(Node.ip_address == cluster.rm_ip)
node_res = await db.execute(node_stmt)
node_obj = node_res.scalar_one_or_none()
target_hostname = node_obj.hostname if node_obj else "resourcemanager"
if node_obj and node_obj.ssh_user:
ssh_user = node_obj.ssh_user
if not target_ip and target_hostname:
node = await _find_accessible_node(db, user_name, target_hostname)
if not node:
return {"status": "error", "message": "node_not_found"}
target_ip = str(node.ip_address)
ssh_user = node.ssh_user or ssh_user
ssh_password = node.ssh_password or ssh_password
if not target_ip:
return {"status": "error", "message": f"could_not_determine_node_for_{log_type}"}
if not target_hostname:
target_hostname = target_ip
def _tail_via_ssh() -> Dict[str, Any]:
ip = str(target_ip)
hn = str(target_hostname)
log_reader.find_working_log_dir(hn, ip)
ssh_client = ssh_manager.get_connection(hn, ip=ip, username=ssh_user, password=ssh_password)
paths = log_reader.get_log_file_paths(hn, log_type.lower())
for p in paths:
p_q = shlex.quote(p)
out, err = ssh_client.execute_command(f"ls -la {p_q} 2>/dev/null")
if err or not out.strip():
continue
out2, err2 = ssh_client.execute_command(f"tail -n {int(lines)} {p_q} 2>/dev/null")
if err2:
continue
return {"status": "success", "node": hn, "log_type": log_type, "path": p, "content": out2}
base_dir = log_reader._node_log_dir.get(hn, log_reader.log_dir)
base_q = shlex.quote(base_dir)
out, err = ssh_client.execute_command(f"ls -1 {base_q} 2>/dev/null")
if err or not out.strip():
return {"status": "error", "message": "log_dir_not_found", "node": hn}
for fn in out.splitlines():
f = (fn or "").strip()
lf = f.lower()
if not f:
continue
if log_type.lower() in lf and hn.lower() in lf and (lf.endswith(".log") or lf.endswith(".out") or lf.endswith(".out.1")):
full = f"{base_dir}/{f}"
full_q = shlex.quote(full)
out2, err2 = ssh_client.execute_command(f"tail -n {int(lines)} {full_q} 2>/dev/null")
if not err2:
return {"status": "success", "node": hn, "log_type": log_type, "path": full, "content": out2}
return {"status": "error", "message": "log_file_not_found", "node": hn}
return await asyncio.to_thread(_tail_via_ssh)
_FAULT_RULES: List[Dict[str, Any]] = [
{
"id": "hdfs_safemode",
"severity": "high",
"title": "NameNode 处于 SafeMode",
"patterns": [r"SafeModeException", r"NameNode is in safe mode", r"Safe mode is ON"],
"advice": "检查 DataNode 是否全部注册、磁盘与网络是否正常;必要时执行 hdfs dfsadmin -safemode leave。",
},
{
"id": "hdfs_standby",
"severity": "high",
"title": "访问到 Standby NameNode",
"patterns": [r"StandbyException", r"Operation category READ is not supported in state standby"],
"advice": "确认客户端的 fs.defaultFS/HA 配置;确认 active/standby 切换状态是否正确。",
},
{
"id": "rpc_connection_refused",
"severity": "high",
"title": "RPC 连接被拒绝或目标服务未启动",
"patterns": [r"java\.net\.ConnectException:\s*Connection refused", r"Call to .* failed on local exception", r"Connection refused"],
"advice": "确认对应守护进程是否存活、端口是否监听、iptables/安全组是否放通。",
},
{
"id": "dns_or_route",
"severity": "high",
"title": "DNS/网络不可达",
"patterns": [r"UnknownHostException", r"No route to host", r"Network is unreachable", r"Connection timed out"],
"advice": "检查 DNS 解析、/etc/hosts、一致的主机名配置与网络连通性。",
},
{
"id": "disk_no_space",
"severity": "high",
"title": "磁盘空间不足",
"patterns": [r"No space left on device", r"DiskOutOfSpaceException", r"ENOSPC"],
"advice": "清理磁盘、检查日志/临时目录增长;确认 DataNode 存储目录剩余空间。",
},
{
"id": "permission_denied",
"severity": "medium",
"title": "权限不足或 HDFS ACL/权限问题",
"patterns": [r"Permission denied", r"AccessControlException"],
"advice": "检查用户/组映射、HDFS 权限与 ACL确认相关目录权限与 umask。",
},
{
"id": "kerberos_auth",
"severity": "high",
"title": "Kerberos 认证失败",
"patterns": [r"GSSException", r"Failed to find any Kerberos tgt", r"Client cannot authenticate via:\s*\[TOKEN, KERBEROS\]"],
"advice": "检查 KDC、keytab、principal、时间同步确认客户端已 kinit 且票据未过期。",
},
{
"id": "oom",
"severity": "high",
"title": "Java 内存溢出",
"patterns": [r"OutOfMemoryError", r"Java heap space", r"GC overhead limit exceeded"],
"advice": "检查相关服务 JVM 参数(-Xmx/-Xms、容器/节点内存;结合 GC 日志定位内存泄漏或峰值。",
},
{
"id": "jvm_exit_killed",
"severity": "medium",
"title": "进程异常退出或被杀",
"patterns": [r"ExitCodeException exitCode=143", r"Killed by signal", r"Container killed"],
"advice": "检查是否被资源管理器/系统 OOM killer 杀死;核对 YARN 队列资源与节点资源。",
},
]
def _detect_faults_from_log_text(text: str, max_examples_per_rule: int = 3) -> List[Dict[str, Any]]:
lines = (text or "").splitlines()
hits: List[Dict[str, Any]] = []
for rule in _FAULT_RULES:
patterns = rule.get("patterns") or []
compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
examples: List[Dict[str, Any]] = []
for idx, line in enumerate(lines):
if not line:
continue
if any(rgx.search(line) for rgx in compiled):
examples.append({"lineNo": idx + 1, "line": line[:500]})
if len(examples) >= max_examples_per_rule:
break
if examples:
hits.append(
{
"id": rule.get("id"),
"severity": rule.get("severity"),
"title": rule.get("title"),
"advice": rule.get("advice"),
"examples": examples,
"matchCountApprox": len(examples),
}
)
return hits
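The rule-matching loop above can be exercised in isolation. A self-contained sketch with one illustrative rule mirroring the shape of the `_FAULT_RULES` entries (the rule content here is an example, not the full table):

```python
import re
from typing import Any, Dict, List

# One illustrative rule, same shape as an entry in _FAULT_RULES.
SAMPLE_RULE: Dict[str, Any] = {
    "id": "hdfs_safemode",
    "patterns": [r"SafeModeException", r"Safe mode is ON"],
}

def detect(text: str, rule: Dict[str, Any], max_examples: int = 3) -> List[Dict[str, Any]]:
    """Scan log lines against a rule's regexes, collecting up to max_examples hits."""
    compiled = [re.compile(p, re.IGNORECASE) for p in rule["patterns"]]
    examples: List[Dict[str, Any]] = []
    for idx, line in enumerate(text.splitlines()):
        if line and any(rgx.search(line) for rgx in compiled):
            examples.append({"lineNo": idx + 1, "line": line[:500]})
            if len(examples) >= max_examples:
                break
    return examples

log = "INFO startup\nWARN Safe mode is ON\nERROR SafeModeException: cannot write"
hits = detect(log, SAMPLE_RULE)
```

Capping the stored examples keeps the tool output small enough to feed back into an LLM context window.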
async def tool_detect_cluster_faults(
db: AsyncSession,
user_name: str,
cluster_uuid: str,
components: Optional[List[str]] = None,
node_hostname: Optional[str] = None,
lines: int = 200,
) -> Dict[str, Any]:
import uuid as uuidlib
try:
uuidlib.UUID(cluster_uuid)
except ValueError:
return {"status": "error", "message": "invalid_uuid_format"}
comps = components or ["namenode", "resourcemanager"]
comps = [c for c in comps if isinstance(c, str) and c.strip()]
comps = [c.strip().lower() for c in comps]
if not comps:
return {"status": "error", "message": "no_components"}
reads: List[Dict[str, Any]] = []
faults: List[Dict[str, Any]] = []
for comp in comps:
r = await tool_read_cluster_log(
db=db,
user_name=user_name,
cluster_uuid=cluster_uuid,
log_type=comp,
node_hostname=node_hostname,
lines=lines,
)
reads.append({k: r.get(k) for k in ("status", "node", "log_type", "path", "message")})
if r.get("status") != "success":
continue
content = r.get("content") or ""
comp_faults = _detect_faults_from_log_text(content)
for f in comp_faults:
f2 = dict(f)
f2["component"] = comp
f2["node"] = r.get("node")
f2["path"] = r.get("path")
faults.append(f2)
severity_order = {"high": 0, "medium": 1, "low": 2}
faults.sort(key=lambda x: (severity_order.get((x.get("severity") or "").lower(), 9), x.get("id") or ""))
return {
"status": "success",
"cluster_uuid": cluster_uuid,
"components": comps,
"reads": reads,
"faults": faults[:20],
}
_OPS_COMMANDS: Dict[str, Dict[str, Any]] = {
"jps": {"cmd": "jps -lm", "target": "all_nodes"},
"hadoop_version": {"cmd": "hadoop version", "target": "namenode"},
"hdfs_report": {"cmd": "hdfs dfsadmin -report", "target": "namenode"},
"hdfs_safemode_get": {"cmd": "hdfs dfsadmin -safemode get", "target": "namenode"},
"hdfs_ls_root": {"cmd": "hdfs dfs -ls / | head -n 200", "target": "namenode"},
"yarn_node_list": {"cmd": "yarn node -list 2>/dev/null || yarn node -list -all", "target": "resourcemanager"},
"yarn_application_list": {"cmd": "yarn application -list 2>/dev/null || yarn application -list -appStates RUNNING,ACCEPTED,SUBMITTED", "target": "resourcemanager"},
"df_h": {"cmd": "df -h", "target": "all_nodes"},
"free_h": {"cmd": "free -h", "target": "all_nodes"},
"uptime": {"cmd": "uptime", "target": "all_nodes"},
}
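The whitelist above is what keeps `run_cluster_command` from executing arbitrary shell strings: the model can only name a key, and the server resolves it to a fixed command. A minimal sketch of that lookup, with a trimmed key set mirroring `_OPS_COMMANDS`:

```python
from typing import Any, Dict, Optional

# A trimmed whitelist with the same shape as _OPS_COMMANDS.
OPS: Dict[str, Dict[str, Any]] = {
    "jps": {"cmd": "jps -lm", "target": "all_nodes"},
    "hdfs_report": {"cmd": "hdfs dfsadmin -report", "target": "namenode"},
}

def resolve(command_key: str, target: Optional[str] = None) -> Dict[str, str]:
    """Resolve a whitelisted key to its command; reject anything unknown."""
    spec = OPS.get((command_key or "").strip())
    if not spec:
        return {"status": "error", "message": "unsupported_command_key"}
    return {
        "status": "ok",
        "cmd": spec["cmd"],
        "target": (target or spec["target"]).strip().lower(),
    }

ok = resolve("hdfs_report")
bad = resolve("rm -rf /")  # arbitrary strings never reach the shell
```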
async def tool_run_cluster_command(
db: AsyncSession,
user_name: str,
cluster_uuid: str,
command_key: str,
target: Optional[str] = None,
node_hostname: Optional[str] = None,
timeout: int = 30,
limit_nodes: int = 20,
) -> Dict[str, Any]:
import uuid as uuidlib
try:
uuidlib.UUID(cluster_uuid)
except ValueError:
return {"status": "error", "message": "invalid_uuid_format"}
spec = _OPS_COMMANDS.get((command_key or "").strip())
if not spec:
return {"status": "error", "message": "unsupported_command_key"}
stmt = select(Cluster).where(Cluster.uuid == cluster_uuid)
result = await db.execute(stmt)
cluster = result.scalar_one_or_none()
if not cluster:
return {"status": "error", "message": "cluster_not_found"}
if not await _user_has_cluster_access(db, user_name, int(cluster.id)):
return {"status": "error", "message": "cluster_forbidden"}
tgt = (target or spec.get("target") or "namenode").strip().lower()
cmd = str(spec.get("cmd") or "").strip()
if not cmd:
return {"status": "error", "message": "empty_command"}
bash_cmd = f"bash -lc {shlex.quote(cmd)}"
async def _exec_on_node(hostname: str, ip: str, ssh_user: Optional[str], ssh_password: Optional[str]) -> Dict[str, Any]:
def _run():
client = ssh_manager.get_connection(hostname, ip=ip, username=ssh_user, password=ssh_password)
exit_code, out, err = client.execute_command_with_timeout_and_status(bash_cmd, timeout=timeout)
return exit_code, out, err
exit_code, out, err = await asyncio.to_thread(_run)
return {
"node": hostname,
"ip": ip,
"exitCode": int(exit_code),
"stdout": out,
"stderr": err,
}
results: List[Dict[str, Any]] = []
if tgt == "namenode":
if not cluster.namenode_ip or not cluster.namenode_psw:
return {"status": "error", "message": "namenode_not_configured"}
ip = str(cluster.namenode_ip)
node_stmt = select(Node).where(Node.ip_address == cluster.namenode_ip).limit(1)
node_obj = (await db.execute(node_stmt)).scalars().first()
hostname = node_obj.hostname if node_obj else "namenode"
ssh_user = (node_obj.ssh_user if node_obj and node_obj.ssh_user else "hadoop")
results.append(await _exec_on_node(hostname, ip, ssh_user, cluster.namenode_psw))
elif tgt == "resourcemanager":
if not cluster.rm_ip or not cluster.rm_psw:
return {"status": "error", "message": "resourcemanager_not_configured"}
ip = str(cluster.rm_ip)
node_stmt = select(Node).where(Node.ip_address == cluster.rm_ip).limit(1)
node_obj = (await db.execute(node_stmt)).scalars().first()
hostname = node_obj.hostname if node_obj else "resourcemanager"
ssh_user = (node_obj.ssh_user if node_obj and node_obj.ssh_user else "hadoop")
results.append(await _exec_on_node(hostname, ip, ssh_user, cluster.rm_psw))
elif tgt == "node":
if not node_hostname:
return {"status": "error", "message": "node_hostname_required"}
node = await _find_accessible_node(db, user_name, node_hostname)
if not node:
return {"status": "error", "message": "node_not_found"}
results.append(await _exec_on_node(node.hostname, str(node.ip_address), node.ssh_user or "hadoop", node.ssh_password))
elif tgt == "all_nodes":
nodes_stmt = select(Node).where(Node.cluster_id == cluster.id).limit(limit_nodes)
nodes = (await db.execute(nodes_stmt)).scalars().all()
for n in nodes:
n2 = await _find_accessible_node(db, user_name, n.hostname)
if not n2:
continue
results.append(await _exec_on_node(n2.hostname, str(n2.ip_address), n2.ssh_user or "hadoop", n2.ssh_password))
else:
return {"status": "error", "message": "invalid_target"}
start = _now()
exec_id = f"tool_{start.timestamp():.0f}"
await _write_exec_log(db, exec_id, "run_cluster_command", "success", start, _now(), 0, user_name)
return {
"status": "success",
"cluster_uuid": cluster_uuid,
"command_key": command_key,
"target": tgt,
"executed": cmd,
"results": results,
}
def openai_tools_schema() -> List[Dict[str, Any]]:
"""返回 OpenAI 兼容的工具定义Function Calling"""
return [
{
"type": "function",
"function": {
"name": "read_log",
"description": "读取指定节点的日志文件并可按正则筛选",
"parameters": {
"type": "object",
"properties": {
"node": {"type": "string"},
"path": {"type": "string"},
"lines": {"type": "integer", "default": 200},
"pattern": {"type": "string"},
"sshUser": {"type": "string"},
},
"required": ["node", "path"],
},
},
},
{
"type": "function",
"function": {
"name": "web_search",
"description": "联网搜索互联网公开信息,当遇到未知错误码、技术名词或需要外部资料时使用",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "搜索关键词"},
"max_results": {"type": "integer", "default": 5},
},
"required": ["query"],
},
},
},
{
"type": "function",
"function": {
"name": "start_cluster",
"description": "启动指定的 Hadoop 集群",
"parameters": {
"type": "object",
"properties": {
"cluster_uuid": {"type": "string", "description": "集群的 UUID"},
},
"required": ["cluster_uuid"],
},
},
},
{
"type": "function",
"function": {
"name": "stop_cluster",
"description": "停止指定的 Hadoop 集群",
"parameters": {
"type": "object",
"properties": {
"cluster_uuid": {"type": "string", "description": "集群的 UUID"},
},
"required": ["cluster_uuid"],
},
},
},
{
"type": "function",
"function": {
"name": "read_cluster_log",
"description": "读取集群中特定组件的日志(如 namenode, datanode, resourcemanager",
"parameters": {
"type": "object",
"properties": {
"cluster_uuid": {"type": "string", "description": "集群的 UUID"},
"log_type": {
"type": "string",
"description": "组件类型,例如 namenode, datanode, resourcemanager, nodemanager, historyserver"
},
"node_hostname": {"type": "string", "description": "可选:指定节点的主机名。如果是 datanode 等非唯一组件,建议提供。"},
"lines": {"type": "integer", "default": 100, "description": "读取的行数"},
},
"required": ["cluster_uuid", "log_type"],
},
},
},
{
"type": "function",
"function": {
"name": "detect_cluster_faults",
"description": "基于集群组件日志识别常见故障并输出结构化结果",
"parameters": {
"type": "object",
"properties": {
"cluster_uuid": {"type": "string", "description": "集群的 UUID"},
"components": {"type": "array", "items": {"type": "string"}, "description": "要分析的组件列表,例如 [namenode, resourcemanager, datanode]"},
"node_hostname": {"type": "string", "description": "可选:指定节点主机名(适用于 datanode 等多实例组件)"},
"lines": {"type": "integer", "default": 200, "description": "每个组件读取的行数"},
},
"required": ["cluster_uuid"],
},
},
},
{
"type": "function",
"function": {
"name": "run_cluster_command",
"description": "在集群节点上执行常用运维命令(白名单)并返回结果",
"parameters": {
"type": "object",
"properties": {
"cluster_uuid": {"type": "string", "description": "集群的 UUID"},
"command_key": {"type": "string", "description": "命令标识,例如 jps, hdfs_report, yarn_node_list, df_h"},
"target": {"type": "string", "description": "执行目标namenode/resourcemanager/node/all_nodes不传则按命令默认目标"},
"node_hostname": {"type": "string", "description": "target=node 时必填"},
"timeout": {"type": "integer", "default": 30},
"limit_nodes": {"type": "integer", "default": 20, "description": "target=all_nodes 时最多执行的节点数"},
},
"required": ["cluster_uuid", "command_key"],
},
},
},
]
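A quick structural check for function-calling schemas like the one returned above: every name listed in `required` must also be declared under `properties`, or the model-facing API will reject the tool. A self-contained sketch over a sample entry (the validator itself is an assumption, not part of the original code):

```python
from typing import Any, Dict, List

def validate_tools(tools: List[Dict[str, Any]]) -> List[str]:
    """Return names of tools whose required params are missing from properties."""
    broken: List[str] = []
    for tool in tools:
        fn = tool.get("function", {})
        params = fn.get("parameters", {})
        props = set(params.get("properties", {}))
        if not set(params.get("required", [])) <= props:
            broken.append(fn.get("name", "?"))
    return broken

sample = [{
    "type": "function",
    "function": {
        "name": "read_log",
        "parameters": {
            "type": "object",
            "properties": {"node": {"type": "string"}, "path": {"type": "string"}},
            "required": ["node", "path"],
        },
    },
}]
```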

@ -1,64 +0,0 @@
import asyncio
import os
import shlex
from typing import Optional, Tuple
async def run_local_command(cmd: str, timeout: int = 30) -> Tuple[int, str, str]:
"""运行本地命令,返回 (exit_code, stdout, stderr)。"""
if os.name == "nt":
prog = ["powershell", "-NoProfile", "-NonInteractive", "-Command", cmd]
else:
prog = ["bash", "-lc", cmd]
proc = await asyncio.create_subprocess_exec(
*prog,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
try:
out, err = await asyncio.wait_for(proc.communicate(), timeout=timeout)
except asyncio.TimeoutError:
try:
proc.kill()
except Exception:
pass
return (124, "", "timeout")
return (proc.returncode or 0, out.decode(errors="ignore"), err.decode(errors="ignore"))
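The exit code 124 on timeout above mirrors the coreutils `timeout(1)` convention. A self-contained sketch of the same pattern, using `sys.executable` instead of a shell so it runs anywhere Python does:

```python
import asyncio
import sys
from typing import List, Tuple

async def run(args: List[str], timeout: float = 10.0) -> Tuple[int, str, str]:
    """Run a subprocess, returning (exit_code, stdout, stderr); 124 on timeout."""
    proc = await asyncio.create_subprocess_exec(
        *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()  # reap the child before reporting the timeout
        return (124, "", "timeout")
    return (proc.returncode or 0, out.decode(errors="ignore"), err.decode(errors="ignore"))

code, out, err = asyncio.run(run([sys.executable, "-c", "print('ok')"]))
```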
def _build_ssh_prog(host: str, user: str, cmd: str, port: Optional[int] = None, identity_file: Optional[str] = None) -> list:
"""构造 ssh 远程执行命令参数数组。"""
prog = [
"ssh",
"-o",
"BatchMode=yes",
"-o",
"StrictHostKeyChecking=no",
]
if port:
prog += ["-p", str(port)]
if identity_file:
prog += ["-i", identity_file]
target = f"{user}@{host}" if user else host
prog += [target, "bash", "-lc", cmd]
return prog
async def run_remote_command(host: str, user: str, cmd: str, timeout: int = 30, port: Optional[int] = None, identity_file: Optional[str] = None) -> Tuple[int, str, str]:
"""通过 ssh 在远端主机执行命令,返回 (exit_code, stdout, stderr)。"""
prog = _build_ssh_prog(host, user, cmd, port=port, identity_file=identity_file)
proc = await asyncio.create_subprocess_exec(
*prog,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
try:
out, err = await asyncio.wait_for(proc.communicate(), timeout=timeout)
except asyncio.TimeoutError:
try:
proc.kill()
except Exception:
pass
return (124, "", "timeout")
return (proc.returncode or 0, out.decode(errors="ignore"), err.decode(errors="ignore"))

@ -1,70 +0,0 @@
from ..ssh_utils import SSHClient
from ..config import SSH_TIMEOUT
def check_ssh_connectivity(host: str, user: str, password: str, timeout: int | None = None) -> tuple[bool, str | None]:
try:
cli = SSHClient(str(host), user or "", password or "")
out, _ = cli.execute_command_with_timeout("echo ok", timeout or SSH_TIMEOUT)
cli.close()
if out is None:
return (False, "no_output")
if out.strip():
return (True, None)
return (False, "empty_output")
except Exception as e:
try:
cli.close()
except Exception:
pass
return (False, str(e))
def get_hdfs_cluster_id(host: str, user: str, password: str, timeout: int | None = None) -> tuple[str | None, str | None]:
"""
通过以下步骤获取 HDFS 集群 UUID:
1. 执行 hdfs getconf -confKey dfs.namenode.name.dir 获取名称节点目录
2. 在该目录的 current 子目录下读取 VERSION 文件
3. 解析 VERSION 文件中的 clusterID 字段
4. 去掉 'CID-' 前缀并返回
"""
try:
cli = SSHClient(str(host), user or "", password or "")
# 1. Get dfs.namenode.name.dir
dir_out, dir_err = cli.execute_command_with_timeout("hdfs getconf -confKey dfs.namenode.name.dir", timeout or SSH_TIMEOUT)
if not dir_out or not dir_out.strip():
cli.close()
return None, f"Failed to get dfs.namenode.name.dir: {dir_err or 'Empty output'}"
# Handle possibly multiple directories (take the first)
name_dir = dir_out.strip().split(',')[0]
# Strip the file:// prefix if present
if name_dir.startswith("file://"):
name_dir = name_dir[7:]
version_path = f"{name_dir.rstrip('/')}/current/VERSION"
# 2. Read the VERSION file
version_out, version_err = cli.execute_command_with_timeout(f"cat {version_path}", timeout or SSH_TIMEOUT)
cli.close()
if not version_out or not version_out.strip():
return None, f"Failed to read VERSION file at {version_path}: {version_err or 'Empty output'}"
# 3. Parse the clusterID (split on the first '=' only, in case the value contains one)
cluster_id = None
for line in version_out.splitlines():
if line.startswith("clusterID="):
cluster_id = line.split("=", 1)[1].strip()
break
if not cluster_id:
return None, f"clusterID not found in {version_path}"
# 4. Strip the 'CID-' prefix
if cluster_id.startswith("CID-"):
cluster_id = cluster_id[4:]
return cluster_id, None
except Exception as e:
return None, str(e)
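Steps 3–4 of the docstring can be exercised in isolation. A self-contained sketch of the clusterID parsing over a sample VERSION file body (the helper name is illustrative):

```python
from typing import Optional

def parse_cluster_id(version_text: str) -> Optional[str]:
    """Extract clusterID from a Hadoop VERSION file, without the 'CID-' prefix."""
    for line in version_text.splitlines():
        if line.startswith("clusterID="):
            cluster_id = line.split("=", 1)[1].strip()
            return cluster_id[4:] if cluster_id.startswith("CID-") else cluster_id
    return None

sample = "namespaceID=123\nclusterID=CID-abc-def\nblockpoolID=BP-1"
cid = parse_cluster_id(sample)
```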

@ -1,242 +0,0 @@
import os
import socket
import paramiko
from typing import Optional, TextIO, Dict, Tuple
from .config import SSH_PORT, SSH_TIMEOUT
# Create a static node configuration dictionary that will be used for all requests
# This avoids the issue of environment variables not being available in child processes
STATIC_NODE_CONFIG = {
"hadoop102": ("192.168.10.102", "hadoop", "limouren..."),
"hadoop103": ("192.168.10.103", "hadoop", "limouren..."),
"hadoop104": ("192.168.10.104", "hadoop", "limouren..."),
"hadoop105": ("192.168.10.105", "hadoop", "limouren..."),
"hadoop100": ("192.168.10.100", "hadoop", "limouren...")
}
DEFAULT_SSH_USER = os.getenv("HADOOP_USER", "hadoop")
DEFAULT_SSH_PASSWORD = os.getenv("HADOOP_PASSWORD", "limouren...")
class SSHClient:
"""SSH Client for connecting to remote servers"""
def __init__(self, hostname: str, username: str, password: str, port: int = SSH_PORT):
self.hostname = hostname
self.username = username
self.password = password
self.port = port
self.client: Optional[paramiko.SSHClient] = None
def _ensure_connected(self) -> None:
if self.client is None:
self.connect()
return
try:
transport = self.client.get_transport()
if transport is None or not transport.is_active():
self.connect()
except Exception:
self.connect()
def connect(self) -> None:
"""Establish SSH connection"""
self.client = paramiko.SSHClient()
# Automatically add host keys
self.client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
sock = None
socks5 = os.getenv("TS_SOCKS5_SERVER") or os.getenv("TAILSCALE_SOCKS5_SERVER")
if socks5:
try:
sock = _socks5_connect(socks5, self.hostname, self.port, SSH_TIMEOUT)
except Exception:
sock = None
self.client.connect(
hostname=self.hostname,
username=self.username,
password=self.password,
port=self.port,
timeout=SSH_TIMEOUT,
sock=sock,
)
def execute_command(self, command: str) -> tuple:
"""Execute command on remote server"""
self._ensure_connected()
stdin, stdout, stderr = self.client.exec_command(command)
return stdout.read().decode(), stderr.read().decode()
def execute_command_with_status(self, command: str) -> tuple:
self._ensure_connected()
stdin, stdout, stderr = self.client.exec_command(command)
exit_code = stdout.channel.recv_exit_status()
return exit_code, stdout.read().decode(), stderr.read().decode()
def execute_command_with_timeout(self, command: str, timeout: int = 30) -> tuple:
"""Execute command with timeout"""
self._ensure_connected()
stdin, stdout, stderr = self.client.exec_command(command, timeout=timeout)
return stdout.read().decode(), stderr.read().decode()
def execute_command_with_timeout_and_status(self, command: str, timeout: int = 30) -> tuple:
self._ensure_connected()
stdin, stdout, stderr = self.client.exec_command(command, timeout=timeout)
exit_code = stdout.channel.recv_exit_status()
return exit_code, stdout.read().decode(), stderr.read().decode()
def read_file(self, file_path: str) -> str:
"""Read file content from remote server"""
self._ensure_connected()
with self.client.open_sftp() as sftp:
with sftp.open(file_path, 'r') as f:
return f.read().decode()
def download_file(self, remote_path: str, local_path: str) -> None:
"""Download file from remote server to local"""
self._ensure_connected()
with self.client.open_sftp() as sftp:
sftp.get(remote_path, local_path)
def close(self) -> None:
"""Close SSH connection"""
if self.client:
self.client.close()
self.client = None
def __enter__(self):
"""Context manager entry"""
self.connect()
return self
def __exit__(self, exc_type, exc_val, exc_tb):
"""Context manager exit"""
self.close()
class SSHConnectionManager:
"""SSH Connection Manager for managing multiple SSH connections"""
def __init__(self):
self.connections = {}
def get_connection(self, node_name: str, ip: str = None, username: str = None, password: str = None) -> SSHClient:
"""Get or create SSH connection for a node"""
if node_name in self.connections:
client = self.connections[node_name]
if ip and getattr(client, "hostname", None) != ip:
try:
client.close()
except Exception:
pass
del self.connections[node_name]
elif username and getattr(client, "username", None) != username:
try:
client.close()
except Exception:
pass
del self.connections[node_name]
elif password and getattr(client, "password", None) != password:
try:
client.close()
except Exception:
pass
del self.connections[node_name]
if node_name not in self.connections:
if not ip:
raise ValueError(f"IP address required for new connection to {node_name}")
_user = username or DEFAULT_SSH_USER
_pass = password or DEFAULT_SSH_PASSWORD
client = SSHClient(ip, _user, _pass)
self.connections[node_name] = client
return self.connections[node_name]
def close_all(self) -> None:
"""Close all SSH connections"""
for conn in self.connections.values():
conn.close()
self.connections.clear()
def __enter__(self):
"""Context manager entry"""
return self
def __exit__(self, exc_type, exc_val, exc_tb):
"""Context manager exit"""
self.close_all()
# Create a global SSH connection manager instance
ssh_manager = SSHConnectionManager()
def _parse_hostport(value: str, default_port: int) -> tuple[str, int]:
s = (value or "").strip()
if not s:
return ("127.0.0.1", default_port)
if s.startswith("http://"):
s = s[7:]
if s.startswith("socks5://"):
s = s[9:]
if "/" in s:
s = s.split("/", 1)[0]
if ":" in s:
host, port_s = s.rsplit(":", 1)
try:
return (host.strip() or "127.0.0.1", int(port_s.strip()))
except Exception:
return (host.strip() or "127.0.0.1", default_port)
return (s, default_port)
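A few concrete cases for the proxy-address parsing above, as a self-contained copy of the same logic (scheme prefixes stripped, path dropped, port defaulted):

```python
from typing import Tuple

def parse_hostport(value: str, default_port: int = 1080) -> Tuple[str, int]:
    """Parse 'host:port' (optionally with a scheme prefix) into (host, port)."""
    s = (value or "").strip()
    if not s:
        return ("127.0.0.1", default_port)
    for prefix in ("http://", "socks5://"):
        if s.startswith(prefix):
            s = s[len(prefix):]
    if "/" in s:
        s = s.split("/", 1)[0]
    if ":" in s:
        host, _, port_s = s.rpartition(":")
        try:
            return (host.strip() or "127.0.0.1", int(port_s.strip()))
        except ValueError:
            return (host.strip() or "127.0.0.1", default_port)
    return (s, default_port)
```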
def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes; socket.recv may legally return a short read."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise RuntimeError("socks5_short_read")
        buf += chunk
    return buf
def _socks5_connect(proxy: str, dest_host: str, dest_port: int, timeout: int) -> socket.socket:
    proxy_host, proxy_port = _parse_hostport(proxy, 1080)
    s = socket.create_connection((proxy_host, proxy_port), timeout=timeout)
    s.settimeout(timeout)
    try:
        # Greeting: version 5, one auth method offered, "no authentication".
        s.sendall(b"\x05\x01\x00")
        resp = _recv_exact(s, 2)
        if resp[0] != 0x05 or resp[1] != 0x00:
            raise RuntimeError("socks5_auth_failed")
        # Encode the destination address: IPv4, IPv6, or domain name.
        try:
            addr_field = socket.inet_pton(socket.AF_INET, dest_host)
            atyp = 0x01
        except OSError:
            try:
                addr_field = socket.inet_pton(socket.AF_INET6, dest_host)
                atyp = 0x04
            except OSError:
                addr = dest_host.encode("utf-8")
                if len(addr) > 255:
                    raise RuntimeError("socks5_domain_too_long")
                atyp = 0x03
                addr_field = bytes([len(addr)]) + addr
        port_field = int(dest_port).to_bytes(2, "big", signed=False)
        # CONNECT request: VER CMD RSV ATYP ADDR PORT.
        s.sendall(b"\x05\x01\x00" + bytes([atyp]) + addr_field + port_field)
        head = _recv_exact(s, 4)
        if head[0] != 0x05:
            raise RuntimeError("socks5_bad_reply")
        if head[1] != 0x00:
            raise RuntimeError(f"socks5_connect_failed:{head[1]}")
        # Drain the bound address and port from the reply.
        bnd_atyp = head[3]
        if bnd_atyp == 0x01:
            _recv_exact(s, 4)
        elif bnd_atyp == 0x04:
            _recv_exact(s, 16)
        elif bnd_atyp == 0x03:
            ln = _recv_exact(s, 1)
            _recv_exact(s, ln[0])
        _recv_exact(s, 2)
        return s
    except Exception:
        s.close()
        raise
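The request framing used above follows RFC 1928. As a self-contained illustration, the CONNECT request for a domain-name or IPv4 target can be built and inspected like this (the helper is a restatement for demonstration only, not part of the module):

```python
import socket

def build_socks5_connect(dest_host: str, dest_port: int) -> bytes:
    """Build a SOCKS5 CONNECT request: VER CMD RSV ATYP ADDR PORT (RFC 1928)."""
    try:
        # IPv4 literal: 4-byte address, ATYP 0x01.
        atyp, addr = 0x01, socket.inet_pton(socket.AF_INET, dest_host)
    except OSError:
        # Domain name: length-prefixed bytes, ATYP 0x03.
        raw = dest_host.encode("utf-8")
        atyp, addr = 0x03, bytes([len(raw)]) + raw
    return b"\x05\x01\x00" + bytes([atyp]) + addr + dest_port.to_bytes(2, "big")
```

The real `_socks5_connect` additionally handles IPv6 (ATYP 0x04) and reads the server's reply; this sketch only shows the wire layout of the request.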

@ -1,38 +0,0 @@
import asyncio
import argparse
from sqlalchemy import select
from app.db import SessionLocal
from app.models.nodes import Node
from app.models.clusters import Cluster
from app.metrics_collector import metrics_collector
async def collect_once(cluster_uuid: str):
async with SessionLocal() as session:
cid_res = await session.execute(select(Cluster.id).where(Cluster.uuid == cluster_uuid).limit(1))
cid = cid_res.scalars().first()
if not cid:
return
res = await session.execute(select(Node.id, Node.hostname, Node.ip_address).where(Node.cluster_id == cid))
rows = res.all()
for nid, hn, ip in rows:
cpu, mem = metrics_collector._read_cpu_mem(hn, str(ip))
await metrics_collector._save_metrics(nid, hn, cid, cpu, mem)
async def runner(cluster_uuid: str, interval: int):
while True:
try:
await collect_once(cluster_uuid)
except Exception:
pass
await asyncio.sleep(interval)
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--cluster", required=True, help="Cluster UUID to collect metrics for")
parser.add_argument("--interval", type=int, default=3, help="Collect interval seconds")
args = parser.parse_args()
metrics_collector.set_collection_interval(args.interval)
asyncio.run(runner(args.cluster, args.interval))
if __name__ == "__main__":
main()

@ -1,242 +0,0 @@
# AI Tool-Calling Test Prompts (/api/v1/ai/chat)
This page helps frontend/QA engineers integrate with and verify the AI tool-calling capability. Send every prompt through `/api/v1/ai/chat`, and include the required `cluster_uuid / node / log_type / path` details in the message so the model can trigger the tool call directly.
## General Prerequisites
- You can log in normally and obtain a token
- At least one usable cluster UUID exists (referred to as `<CLUSTER_UUID>`)
- The cluster has the NameNode/RM IPs and passwords registered (otherwise NameNode/RM commands return "not configured")
- The currently logged-in user is mapped to that cluster (otherwise the API returns forbidden)
- To target a specific node: prepare one node hostname (referred to as `<NODE_HOSTNAME>`)
- To read a specific file: prepare one path (referred to as `<LOG_PATH>`)
## Usage (Recommended)
- Endpoint: `POST /api/v1/ai/chat`
- Body example (non-streaming):
```json
{
  "sessionId": "test-ai-tools",
  "message": "put the test prompt here",
  "stream": false,
  "context": { "model": "optional: your model name" }
}
```
## 1. Fault Detection (detect_cluster_faults)
### 1.1 Default components (namenode + resourcemanager)
Prompt:
> I suspect a fault on cluster `<CLUSTER_UUID>`. Please first call detect_cluster_faults to analyze the latest 200 lines of the namenode and resourcemanager logs, and give the root cause, evidence lines, and recommendations.
Expected:
- The model triggers `detect_cluster_faults` and returns a `faults` list (which may be empty)
- The answer includes the inferred root cause, impact scope, evidence (examples), and advice
### 1.2 Explicit component list
Prompt:
> YARN job submission is failing on cluster `<CLUSTER_UUID>`. Please call detect_cluster_faults with components=["resourcemanager"] and lines=200, and output a structured fault conclusion.
Expected:
- tools: `detect_cluster_faults(components=["resourcemanager"])`
### 1.3 Negative case: malformed UUID
Prompt:
> Please call detect_cluster_faults on cluster `not-a-uuid` to analyze faults.
Expected:
- The tool returns `invalid_uuid_format`; the model should explain that the UUID is invalid and show the correct format
## 2. Component Log Reading (read_cluster_log)
### 2.1 Read the NameNode log
Prompt:
> Read the latest 100 lines of the namenode log on cluster `<CLUSTER_UUID>`. Please call read_cluster_log.
Expected:
- The tool returns `status=success` with `content` (if the log cannot be found, the reason should be given in message)
### 2.2 Read the ResourceManager log
Prompt:
> Read the latest 200 lines (lines=200) of the resourcemanager log on cluster `<CLUSTER_UUID>`. Please call read_cluster_log and extract the key error lines.
Expected:
- `read_cluster_log(log_type="resourcemanager", lines=200)`
### 2.3 Specify a node hostname (recommended for multi-instance components)
Prompt:
> For the datanode logs of cluster `<CLUSTER_UUID>` I only want node `<NODE_HOSTNAME>`. Please call read_cluster_log with log_type=datanode, node_hostname=`<NODE_HOSTNAME>`, lines=120.
Expected:
- `read_cluster_log` locates the node via node_hostname and reads the matching log file (it may return log_file_not_found, but should explain why)
## 3. Arbitrary Single-Node File Reading (read_log)
### 3.1 Read the tail of a given path
Prompt:
> On node `<NODE_HOSTNAME>`, read the last 200 lines of file `<LOG_PATH>`. Please call read_log.
Expected:
- The tool returns `exitCode=0` and stdout contains the log content
### 3.2 With regex filtering (grep -E)
Prompt:
> On node `<NODE_HOSTNAME>`, read the last 500 lines of `<LOG_PATH>` and keep only lines containing "ERROR|Exception". Please call read_log with lines=500 and pattern="ERROR|Exception".
Expected:
- stdout is focused on the error lines
### 3.3 Negative case: node without access
Prompt:
> On node `some_other_node`, read the last 50 lines of `/var/log/messages`. Please call read_log.
Expected:
- The tool returns `node_not_found` (the current user has no access or the node does not exist); the model should explain the permission restriction
## 4. Cluster Ops Commands (run_cluster_command, allowlist)
Note: this tool only executes allowlisted `command_key` values; it cannot run arbitrary command strings.
### 4.1 Process check (jps, defaults to all_nodes)
Prompt:
> Please run jps on cluster `<CLUSTER_UUID>` to check the Java processes on all nodes (run_cluster_command, command_key=jps). Output the key processes per node.
Expected:
- Returns a `results` array; each element contains node/ip/exitCode/stdout/stderr
### 4.2 Version info (hadoop_version, namenode)
Prompt:
> I want to confirm the Hadoop version of cluster `<CLUSTER_UUID>`. Please call run_cluster_command with command_key=hadoop_version and summarize the version number.
### 4.3 HDFS overview (hdfs_report, namenode)
Prompt:
> Run hdfs_report on cluster `<CLUSTER_UUID>` and summarize the DataNode count, total capacity, and used capacity.
### 4.4 SafeMode status (hdfs_safemode_get, namenode)
Prompt:
> Run hdfs_safemode_get on cluster `<CLUSTER_UUID>`, tell me whether it is currently in SafeMode, and suggest next steps.
### 4.5 YARN nodes (yarn_node_list, resourcemanager)
Prompt:
> Run yarn_node_list on cluster `<CLUSTER_UUID>`, output the node count, and list the first 5 nodes.
### 4.6 YARN applications (yarn_application_list, resourcemanager)
Prompt:
> Run yarn_application_list on cluster `<CLUSTER_UUID>` and report the number of RUNNING applications plus the list of application IDs (at most 20).
### 4.7 System resources (df_h / free_h / uptime, all_nodes)
Prompt (pick one):
> Run df_h on cluster `<CLUSTER_UUID>` and summarize the 3 nodes with the highest disk usage.
> Run free_h on cluster `<CLUSTER_UUID>` and summarize the 3 nodes with the least free memory.
> Run uptime on cluster `<CLUSTER_UUID>` and report the 3 nodes with the highest load average.
### 4.8 Run on a single node (target=node)
Prompt:
> Run df_h only on node `<NODE_HOSTNAME>`. Please call run_cluster_command with command_key=df_h, target=node, node_hostname=`<NODE_HOSTNAME>`.
Expected:
- `results` contains exactly 1 entry
### 4.9 Negative case: unsupported command_key
Prompt:
> Run run_cluster_command with command_key=netstat_listen on cluster `<CLUSTER_UUID>`.
Expected:
- The tool returns `unsupported_command_key`; the model should explain that only allowlisted keys are supported
## 5. Cluster Start/Stop (start_cluster / stop_cluster)
### 5.1 Start
Prompt:
> Please start cluster `<CLUSTER_UUID>` (call start_cluster). After starting, call run_cluster_command with jps to verify, and state your conclusion.
Expected:
- Tool-call order: start_cluster → run_cluster_command(jps)
### 5.2 Stop
Prompt:
> Please stop cluster `<CLUSTER_UUID>` (call stop_cluster). After stopping, call run_cluster_command with jps to verify, and state your conclusion.
Expected:
- Tool-call order: stop_cluster → run_cluster_command(jps)
### 5.3 Negative case: nonexistent cluster UUID
Prompt:
> Please start cluster `00000000-0000-0000-0000-000000000000`
Expected:
- The tool returns `cluster_not_found`; the model should suggest checking the UUID and permissions
## 6. Web Search (web_search)
### 6.1 Look up an error code / exception
Prompt:
> I see the error "StandbyException: Operation category READ is not supported in state standby". Please call web_search to find common causes and remedies for this exception, and give recommendations in the Hadoop context.
Expected:
- The tool returns results (title/href/body/full_content); the model synthesizes them into a conclusion in Chinese

File diff suppressed because it is too large

@ -1,72 +0,0 @@
# Hadoop Cluster Start/Stop API Frontend Integration Guide
This document describes the cluster start/stop APIs in detail for frontend developers integrating against them.
## 1. Endpoint Basics
| Function | Method | Path | Required Permission |
| :--- | :--- | :--- | :--- |
| **Start cluster** | `POST` | `/api/v1/ops/clusters/{cluster_uuid}/start` | `cluster:start` |
| **Stop cluster** | `POST` | `/api/v1/ops/clusters/{cluster_uuid}/stop` | `cluster:stop` |
- **Base URL**: `http://<server-ip>:<port>`
- **Content-Type**: `application/json`
- **Authentication**: a valid JWT token must be sent in the header: `Authorization: Bearer <your_token>`
## 2. Request Parameters (Path Parameters)
| Name | Type | Required | Description |
| :--- | :--- | :--- | :--- |
| `cluster_uuid` | `string` | Yes | The cluster's unique identifier (UUID), available from the `/api/v1/clusters` endpoint. |
## 3. Response Structure
### 3.1 Success (200 OK)
These endpoints take a long time to execute (they run remote SSH commands); set the frontend timeout to **60s**.
```json
{
  "status": "success",
  "logs": [
    "NameNode (192.168.1.10) start: Starting namenodes on [localhost]\nlocalhost: starting namenode...",
    "ResourceManager (192.168.1.11) start: Starting resourcemanager..."
  ]
}
```
**Field notes:**
- `status`: always `"success"`
- `logs`: array of strings containing the stdout/stderr of the scripts run on each key component (NameNode, ResourceManager).
### 3.2 Error Responses
- **401 Unauthorized**: no token provided or token expired.
- **403 Forbidden**: insufficient permissions (only the `admin` and `ops` roles may operate).
- **404 Not Found**: the cluster UUID does not exist.
- **400 Bad Request**: invalid request parameters.
  - `{"detail": "invalid_uuid_format"}`: the UUID is malformed (e.g. the frontend accidentally sent `[object Object]`).
- **500 Internal Server Error**: backend SSH connection timed out or internal error.
## 4. Frontend Integration Tips
1. **Loading state**: since this is a long-running operation, the UI must show a clear loading indicator and disable the button to prevent duplicate submissions.
2. **Log display**: render the returned `logs` array in a log-terminal component in a sidebar or modal.
3. **Timeout handling**: always set `timeout: 60000` explicitly in the Axios or Fetch config.
4. **State refresh**: after success, re-query the cluster status list to fetch the latest `health_status`.
## 5. Code Reference (JavaScript/Axios)
```javascript
import axios from 'axios';
const clusterApi = {
  async controlCluster(uuid, action) {
    // action: 'start' or 'stop'
    const response = await axios.post(`/api/v1/ops/clusters/${uuid}/${action}`, {}, {
      timeout: 60000
    });
    return response.data;
  }
};
```

@ -1,89 +0,0 @@
# Frontend Model Integration Guide (V3 & R1)
This document explains how frontend developers can call the different LLM models (DeepSeek-V3 and DeepSeek-R1) through the API.
---
## 1. Model Identifiers
Use the following strings as the model identifier when calling the API:
| Model | Identifier (API value) | Suitable for |
| :--- | :--- | :--- |
| **DeepSeek-V3** | `deepseek-ai/DeepSeek-V3` | General chat and Q&A; fast responses. |
| **DeepSeek-R1** | `Pro/deepseek-ai/DeepSeek-R1` | Complex reasoning, deep fault diagnosis, code generation. |
---
## 2. Calling the API
### 2.1 AI chat endpoint (`/api/v1/ai/chat`)
The frontend passes a `model` field inside the `context` object of the request body.
#### Request example (TypeScript/Fetch):
```typescript
const response = await fetch('/api/v1/ai/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${userToken}`
  },
  body: JSON.stringify({
    sessionId: "current-chat-id",
    message: "How do I fix a NameNode stuck in safe mode?",
    stream: true, // streaming is recommended
    context: {
      model: "Pro/deepseek-ai/DeepSeek-R1", // switch model
      agent: "HadoopExpert"
    }
  })
});
```
#### Handling the response (SSE streaming):
With `stream: true`, the response is SSE (Server-Sent Events). Each `data` line is a JSON string containing `content` (the body text) and `reasoning` (the chain of thought; R1 only).
```javascript
// Example handling logic
const reader = response.body.getReader();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  const chunk = new TextDecoder().decode(value);
  const lines = chunk.split('\n');
  lines.forEach(line => {
    if (line.startsWith('data: ')) {
      const { content, reasoning } = JSON.parse(line.slice(6));
      if (reasoning) updateReasoningUI(reasoning); // update the reasoning UI
      if (content) updateContentUI(content);       // update the body UI
    }
  });
}
```
---
## 3. Diagnose-and-Repair Endpoint (`/api/v1/ai/diagnose-repair`)
For automatic diagnosis tasks, the model parameter goes at the root of the request body.
#### Request example:
```bash
curl -X POST http://localhost:8000/api/v1/ai/diagnose-repair \
  -H "Content-Type: application/json" \
  -d '{
    "cluster": "cluster-uuid-123",
    "model": "Pro/deepseek-ai/DeepSeek-R1",
    "auto": true,
    "maxSteps": 3
  }'
```
---
## 4. Notes
1. **Default model**: if the frontend omits the `model` parameter, the backend uses the `LLM_MODEL` configured in `.env` (currently defaults to V3).
2. **R1 reasoning**: DeepSeek-R1 emits `reasoning_content`. The frontend should provide a collapsible "thinking" component that renders the `reasoning` field, to improve the user experience.
3. **Error handling**: an invalid model name may yield a `403` or `502`; the frontend should handle these gracefully.

@ -1,112 +0,0 @@
# Common Hadoop Faults: Diagnosis and Agent-Based Auto-Repair
This document lists common Hadoop cluster faults and their remedies, and sketches an LLM-agent-based scheme for automatic detection and repair.
## I. Common Hadoop Faults and Remedies
### 1. NameNode stuck in Safe Mode
- **Symptom**: HDFS is read-only; files cannot be written or deleted.
- **Causes**: block reports did not reach the threshold at startup, disk space exhaustion, or metadata corruption.
- **Remedy**:
  - Check disk space: `df -h`
  - Check status: `hdfs dfsadmin -safemode get`
  - Leave manually (after confirming it is safe): `hdfs dfsadmin -safemode leave`
### 2. DataNode process missing or unreachable
- **Symptom**: the node list shows the DataNode as dead; replica counts drop.
- **Causes**: out-of-memory (OOM), disk failure, network partition, or a lost PID file preventing startup.
- **Remedy**:
  - Check the log: `tail -n 200 /var/log/hadoop/hadoop-hadoop-datanode.log`
  - Check the disk: `fsck`, or inspect `dmesg`
  - Restart the service: `hdfs --daemon start datanode`
### 3. ResourceManager slow or unable to accept jobs
- **Symptom**: job submission hangs; the Web UI is unreachable.
- **Causes**: insufficient heap, ZooKeeper lock contention, or logs filling the disk.
- **Remedy**:
  - Tune JVM flags: increase `-Xmx`
  - Clean temporary files: `yarn cache -clean`
  - Restart the RM: `yarn --daemon start resourcemanager`
---
## II. Agent-Based Auto-Detection and Repair Scheme
This scheme builds on the project's existing [diagnosis_agent.py](file:///home/devbox/project/backend/app/agents/diagnosis_agent.py) architecture and closes the loop via **Function Calling (tool invocation)**.
### 1. Core loop design (Observe-Think-Act)
1. **Observe**: triggered by a monitoring alert or user instruction, the agent calls `read_log` or `execute_command` to capture the current cluster state.
2. **Think**: the LLM analyzes the root cause from the gathered context (log excerpts, process state, disk usage).
3. **Act**: the LLM picks the most appropriate repair tool (e.g. `fix_safemode`, `restart_service`) and executes it.
4. **Verify**: re-check the state after execution to confirm the repair succeeded.
### 2. Tool schema
It is recommended to extend [ops_tools.py](file:///home/devbox/project/backend/app/services/ops_tools.py) with the following atomic tools:
```python
def get_repair_tools():
    return [
        {
            "name": "check_hdfs_health",
            "description": "Check overall HDFS health and safe-mode status",
            "parameters": {"type": "object", "properties": {}}
        },
        {
            "name": "manage_service",
            "description": "Manage Hadoop services (start/stop/restart)",
            "parameters": {
                "type": "object",
                "properties": {
                    "node": {"type": "string", "description": "Node hostname"},
                    "service": {"type": "string", "enum": ["datanode", "namenode", "resourcemanager"]},
                    "action": {"type": "string", "enum": ["start", "stop", "restart"]}
                },
                "required": ["node", "service", "action"]
            }
        },
        {
            "name": "fix_disk_space",
            "description": "Clean logs or temp files under a given directory to free space",
            "parameters": {
                "type": "object",
                "properties": {
                    "node": {"type": "string"},
                    "path": {"type": "string", "description": "Path to clean"}
                }
            }
        }
    ]
```
### 3. Agent loop logic
In `run_diagnose_and_repair` in [diagnosis_agent.py](file:///home/devbox/project/backend/app/agents/diagnosis_agent.py), the core logic looks like:
```python
async def run_diagnose_and_repair(db, operator, context, auto=True, max_steps=5):
    # 1. Initial prompt
    messages = [{"role": "system", "content": "You are a Hadoop expert. Diagnose and repair the fault."}]
    # 2. Loop: diagnose and repair
    for step in range(max_steps):
        # Let the LLM decide the next action
        response = await llm.chat(messages, tools=tools)
        # If the LLM gives a conclusion without calling tools, we are done
        if not response.tool_calls:
            return {"status": "finished", "root_cause": response.content}
        # 3. Execute the tools (e.g. restart a service)
        for tool in response.tool_calls:
            result = await execute_tool(tool)  # runs the command via SSHClient
            messages.append({"role": "tool", "content": str(result), "name": tool.name})
    return {"status": "max_steps_reached"}
```
### 4. Key implementation details
- **Safety isolation**: repair actions must be permission-controlled at the `SSHClient` layer so the LLM can never run `rm -rf /`.
- **State awareness**: the agent should first check the output of `hdfs dfsadmin -report` as its baseline data.
- **Context injection**: inject [STATIC_NODE_CONFIG](file:///home/devbox/project/backend/app/ssh_utils.py) into `context` so the agent knows which IPs and users are available.
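The safety-isolation point above is typically enforced with a command allowlist at the execution boundary. A minimal sketch (the `ALLOWED_COMMANDS` table and `guarded_command` helper are illustrative assumptions, not the project's actual API):

```python
# Hypothetical allowlist: only pre-approved command keys resolve to shell commands.
ALLOWED_COMMANDS = {
    "jps": "jps",
    "df_h": "df -h",
    "hdfs_safemode_get": "hdfs dfsadmin -safemode get",
}

def guarded_command(command_key: str) -> str:
    """Resolve a command_key to a shell command, rejecting anything off-list."""
    if command_key not in ALLOWED_COMMANDS:
        raise ValueError(f"unsupported_command_key:{command_key}")
    return ALLOWED_COMMANDS[command_key]
```

Because the LLM can only choose a key, never a raw string, destructive commands are unrepresentable rather than merely filtered.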

@ -1,107 +0,0 @@
# Hadoop Fault Reproduction and Repair Walkthrough
This document shows how to manually reproduce common Hadoop cluster faults, gives detailed repair steps, and describes the agent's automated handling in each scenario.
---
## Fault 1: Forcing the NameNode into Safe Mode
### 1.1 Manual reproduction
On the NameNode, force the cluster into safe mode:
```bash
# Enter safe mode
hdfs dfsadmin -safemode enter
# Verify (HDFS writes will now fail)
hdfs dfsadmin -safemode get
hdfs dfs -touchz /test_file  # expected error: Name node is in safe mode.
```
### 1.2 Manual repair
1. **Check the cause**: confirm whether disk space is low or blocks are corrupt.
```bash
df -h
hdfs fsck /
```
2. **Force exit**: if the data is confirmed safe, leave manually:
```bash
hdfs dfsadmin -safemode leave
```
### 1.3 Agent automation
- **Detect**: the agent calls `execute_command("hdfs dfsadmin -safemode get")` and parses the output.
- **Decide**: on `Safe mode is ON`, the agent then checks disk space and the `fsck` result.
- **Repair**: if everything else looks healthy, the agent runs `hdfs dfsadmin -safemode leave` via a tool.
---
## Fault 2: DataNode process crash
### 2.1 Manual reproduction
Simulate a crash on any DataNode:
```bash
# Find and kill the DataNode process
ps -ef | grep DataNode | grep -v grep | awk '{print $2}' | xargs kill -9
# Verify (via the NameNode Web UI or the CLI)
hdfs dfsadmin -report | grep "Live datanodes"
```
### 2.2 Manual repair
1. **Check the log**: find the cause of death (e.g. OOM):
```bash
tail -n 100 $HADOOP_HOME/logs/hadoop-*-datanode-*.log
```
2. **Restart the process**:
```bash
hdfs --daemon start datanode
```
3. **Confirm recovery**:
```bash
jps | grep DataNode
```
### 2.3 Agent automation
- **Detect**: via [metrics_collector.py](file:///home/devbox/project/backend/app/metrics_collector.py), the agent notices the node's `health_status` turn `dead`.
- **Decide**: the agent calls `read_log` looking for `OutOfMemoryError` or `FATAL`.
- **Repair**: the agent calls the `restart_service` interface in [ops_tools.py](file:///home/devbox/project/backend/app/services/ops_tools.py) to restart.
---
## Fault 3: ResourceManager down, jobs cannot be submitted
### 3.1 Manual reproduction
Stop the service on the ResourceManager node:
```bash
# Stop the RM
yarn --daemon stop resourcemanager
# Verify (submitting a simple job hangs or errors out)
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 1 1
```
### 3.2 Manual repair
1. **Check the port**: confirm whether port 8088 is listening.
```bash
netstat -tpln | grep 8088
```
2. **Start the service**:
```bash
yarn --daemon start resourcemanager
```
### 3.3 Agent automation
- **Detect**: the probe in [ssh_probe.py](file:///home/devbox/project/backend/app/services/ssh_probe.py) fails, or the API returns 502.
- **Decide**: classify the fault as a missing management process.
- **Repair**: via [diagnosis_agent.py](file:///home/devbox/project/backend/app/agents/diagnosis_agent.py), the agent triggers `start_cluster` or the specific `manage_service` tool.
---
## Summary: the generic repair pipeline
Whatever the fault, the agent follows this standard pipeline:
1. **Alert trigger**: a Prometheus alert or abnormal database state arrives.
2. **Environment snapshot**: automatically run `jps`, `df -h`, `free -m` to profile the node.
3. **Log drill-down**: search the last 5 minutes of logs with `grep -E "ERROR|FATAL|Exception"`.
4. **Execute the fix**: run a predefined ops script.
5. **Health check**: keep observing for 1 minute after the fix to make sure the process does not crash again.
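Step 3 of the pipeline (log drill-down) amounts to a pure filtering pass; the helper below is an illustrative sketch of that step, not the project's actual implementation:

```python
import re

# Same classes of lines the pipeline greps for with grep -E.
FAULT_PATTERN = re.compile(r"ERROR|FATAL|Exception")

def drill_down(log_text: str, max_lines: int = 50) -> list[str]:
    """Return up to max_lines log lines that look like faults."""
    hits = [ln for ln in log_text.splitlines() if FAULT_PATTERN.search(ln)]
    return hits[:max_lines]
```

Capping the result keeps the LLM's context small when a node is flooding its log with repeated stack traces.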

@ -1,96 +0,0 @@
# Metrics Collector Frontend Integration Guide
## 1. Overview
This document explains how frontend developers call the new metrics-collector endpoints to continuously collect node metrics (CPU, memory, etc.) in real time. The feature runs in background threads and updates node state in the database at a fixed interval (default: 5 seconds).
## 2. Endpoints
All endpoints require a valid JWT token in the header.
### 2.1 Start collection for a cluster
**Endpoint**: `POST /api/v1/metrics/collectors/start-by-cluster/{cluster_uuid}`
**Description**: starts the background collection threads for all nodes of the given cluster. If collection is already running, this restarts it and applies the new `interval`.
**Query parameters**:
- `interval` (int, optional): collection period in seconds. Defaults to `5`.
**Request example**:
`POST /api/v1/metrics/collectors/start-by-cluster/550e8400-e29b-41d4-a716-446655440000?interval=10`
**Response example**:
```json
{
  "ok": true,
  "message": "Metrics collection started for cluster 550e8400-e29b-41d4-a716-446655440000 with interval 10s"
}
```
---
### 2.2 Get collector status
**Endpoint**: `GET /api/v1/metrics/collectors/status`
**Description**: queries the current state of the background collectors, including the number of active threads, the interval, and the most recent errors.
**Query parameters**:
- `cluster` (string, optional): filter the status by cluster UUID.
**Request example**:
`GET /api/v1/metrics/collectors/status?cluster=550e8400-e29b-41d4-a716-446655440000`
**Response example**:
```json
{
  "is_running": true,
  "active_collectors_count": 3,
  "interval": 5,
  "collectors": {
    "node-01": "running",
    "node-02": "running"
  },
  "errors": {
    "node-03": "SSH Timeout"
  }
}
```
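A frontend (or a test script) can map this status payload to a UI state with one small pure function; a sketch assuming the payload matches the example above (the state names are illustrative):

```python
def ui_state(status: dict) -> str:
    """Decide what the monitoring page should show for a collector status payload."""
    if not status.get("is_running"):
        return "show_start_button"
    if status.get("errors"):
        return "monitoring_with_errors"
    return "monitoring"
```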
---
### 2.3 Stop collection for a cluster
**Endpoint**: `POST /api/v1/metrics/collectors/stop-by-cluster/{cluster_uuid}`
**Description**: stops the background collection threads for all nodes of the given cluster.
**Request example**:
`POST /api/v1/metrics/collectors/stop-by-cluster/550e8400-e29b-41d4-a716-446655440000`
**Response example**:
```json
{
  "ok": true,
  "message": "Metrics collection stopping for cluster 550e8400-e29b-41d4-a716-446655440000"
}
```
## 3. Suggested Frontend Integration Logic
### 3.1 Sync state on page load
When the user opens the "cluster monitoring" or "node list" page, first call `GET /api/v1/metrics/collectors/status`.
- If `is_running` is `false`, show a "Start monitoring" button.
- If `true`, show a "Monitoring" state and start a frontend refresh timer based on `interval` (calling the existing node-data endpoints for fresh values).
### 3.2 Starting monitoring
After the "Start monitoring" button is clicked, call the `start-by-cluster` endpoint. On success, the frontend should periodically (at an interval no shorter than `interval`) re-fetch the node list to display the latest CPU and memory usage.
### 3.3 Error handling
If the status response contains entries in `errors`, show an error icon and details (e.g. SSH connection failure) on the corresponding node row or monitoring card.
## 4. Notes
1. **Access control**: only users with access to a cluster may operate its collectors.
2. **Performance impact**: the collector runs in background threads; overly frequent SSH polling adds some load to the target nodes, so keep `interval` at no less than `2` seconds.
3. **Data consistency**: the collector updates the `nodes` table. When displaying data, fetch the latest fields via the `GET /api/v1/nodes` endpoints.
---
**Version**: v1.0
**Last updated**: 2026-01-07

@ -1,119 +0,0 @@
# Tailscale Startup Guide
This document covers starting and verifying Tailscale in this project's deployment/integration environments. Two common cases:
- Machine/VM with systemd: manage `tailscaled` with `systemctl`
- Container/restricted environment without systemd (e.g. PID 1 is not systemd): start `tailscaled` with userspace networking
## 1. Prerequisites
Confirm it is installed:
```bash
which tailscale
tailscale version
```
Check the current state:
```bash
tailscale status
```
If it reports `failed to connect to local tailscaled`, then `tailscaled` is not running or the socket path is wrong.
## 2. Option A: systemd environment (recommended)
Start it and enable it at boot:
```bash
sudo systemctl start tailscaled
sudo systemctl enable tailscaled
```
Log in and bring the interface up (the first run requires web-based authorization):
```bash
sudo tailscale up --accept-dns=false --accept-routes=false
```
Verify:
```bash
tailscale status
tailscale ip -4
```
## 3. Option B: no systemd / container (userspace networking)
When `systemctl` is unavailable (e.g. PID 1 is not systemd), start `tailscaled` with userspace networking:
```bash
sudo tailscaled \
  --tun=userspace-networking \
  --socket=/var/run/tailscale/tailscaled.sock \
  --state=/var/lib/tailscale/tailscaled.state
```
To run it in the background without holding the terminal:
```bash
sudo nohup tailscaled \
  --tun=userspace-networking \
  --socket=/var/run/tailscale/tailscaled.sock \
  --state=/var/lib/tailscale/tailscaled.state \
  >/tmp/tailscaled.log 2>&1 &
```
First login (it prints a URL; open it to complete authorization):
```bash
sudo tailscale up --accept-dns=false --accept-routes=false
```
Verify:
```bash
tailscale status
tailscale ip -4
```
## 4. Common flags
- `--accept-dns=false`: keeps Tailscale from taking over system DNS (safer; reduces interference with the integration environment)
- `--accept-routes=false`: do not accept subnet routes advertised by other nodes (unless explicitly needed)
Seeing `Some peers are advertising routes but --accept-routes is false` is a normal notice.
## 5. Troubleshooting
### 5.1 `Logged out.` / `NeedsLogin`
Run:
```bash
sudo tailscale up --accept-dns=false --accept-routes=false
```
and open the login link from the output to complete authorization.
### 5.2 `failed to connect to local tailscaled`
`tailscaled` is not running or the socket paths differ:
- systemd environment: check `sudo systemctl status tailscaled`
- no systemd: confirm the `tailscaled` process exists and the `--socket` path matches the one the `tailscale` CLI uses
### 5.3 Leaving / stopping
Leave the network:
```bash
sudo tailscale down
```
Stop the daemon:
- systemd: `sudo systemctl stop tailscaled`
- no systemd: `sudo pkill tailscaled`

@ -1,77 +0,0 @@
# Frontend Login-Permissions Integration Guide
## 1. Overview
This document explains how to integrate with the updated login endpoint and use the returned `permissions` field to implement role-based access control (RBAC) for pages.
## 2. API Changes
**Endpoint**: `POST /api/user/login` (or the corresponding gateway address)
**Response change**:
The response now includes a `permissions` field: an array of strings listing every permission key held by the current user's roles.
### Example response
```json
{
  "ok": true,
  "username": "admin",
  "fullName": "System Administrator",
  "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
  "roles": ["admin"],
  "permissions": [
    "auth:manage",
    "fault:diagnose",
    "cluster:list",
    "cluster:log",
    "cluster:op_log"
  ]
}
```
## 3. Permission Keys
The permission keys currently defined and their meanings:
| Permission Key | Description | Suggested Page/Module |
| :--- | :--- | :--- |
| `auth:manage` | Role/permission administration | User management, role assignment, permission config pages |
| `fault:diagnose` | Fault diagnosis | Fault analysis, AI diagnosis modules |
| `cluster:list` | Cluster list | Cluster dashboards, basic cluster info pages |
| `cluster:log` | Cluster logs | HDFS/Yarn/HBase component log query pages |
| `cluster:op_log` | Cluster operation logs | Execution records, audit log pages |
## 4. Frontend Implementation Suggestions
### 4.1 Storing permissions
After a successful login, persist the `permissions` array to `localStorage` or a global state store (Vuex/Pinia or Redux).
### 4.2 Route guard
In the global route interceptor, compare the route's configured `meta.permission` against the user's `permissions`.
```javascript
// Example: Vue Router navigation guard
router.beforeEach((to, from, next) => {
  const userPermissions = store.getters.permissions; // from the global store
  if (to.meta.permission && !userPermissions.includes(to.meta.permission)) {
    // Redirect to the 403 page if the user lacks the required permission
    next({ name: '403' });
  } else {
    next();
  }
});
```
### 4.3 Button-level control
Use a custom directive (e.g. Vue's `v-permission`) to show/hide buttons.
```html
<!-- Only users with the role-administration permission see this button -->
<button v-permission="'auth:manage'">Assign role</button>
```
## 5. Role Presets
- **Administrator (`admin`)**: holds every permission key.
- **Operator (`operator`)**: holds everything except `auth:manage`.
- **Observer (`observer`)**: holds only `cluster:list`, `cluster:log`, `cluster:op_log`.
---
**Version**: v1.1
**Last updated**: 2025-12-30

@ -1,111 +0,0 @@
# Backend SOCKS Proxy Startup Guide
This document explains how, in a "no TUN / no systemd" environment, to use Tailscale userspace networking plus a local SOCKS5 proxy so the backend can SSH to cluster nodes during cluster registration.
Applies when:
- `ls -l /dev/net/tun` reports that the device does not exist
- `ssh hadoop@100.x.x.x` yields `Connection closed by UNKNOWN port 65535` or a TCP 22 timeout
- Cluster registration returns `注册失败:SSH不可连接 (timed out)` (registration failed: SSH unreachable)
## 1. Start tailscaled (userspace + SOCKS5)
Start it in the background as root, exposing a local SOCKS5 proxy on `127.0.0.1:1080`:
```bash
sudo nohup /usr/sbin/tailscaled \
  --tun=userspace-networking \
  --socket=/var/run/tailscale/tailscaled.sock \
  --state=/var/lib/tailscale/tailscaled.state \
  --socks5-server=127.0.0.1:1080 \
  >/tmp/tailscaled.log 2>&1 &
```
Check that it started:
```bash
pgrep -a tailscaled
python3 -c "import socket; s=socket.create_connection(('127.0.0.1',1080),2); print('socks_up'); s.close()"
```
## 2. Log in and join the tailnet
The first run requires login authorization:
```bash
sudo tailscale up --accept-dns=false --accept-routes=true
```
Open the login link from the output to authorize, then verify:
```bash
tailscale status
```
## 3. Start the backend (SSH via SOCKS5)
Set the `TS_SOCKS5_SERVER` environment variable when starting the backend so its SSH probes go through SOCKS5 into the tailscale netstack:
```bash
TS_SOCKS5_SERVER='127.0.0.1:1080' \
python -m uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```
Notes:
- `TS_SOCKS5_SERVER`: the backend establishes SSH connections to nodes through this SOCKS5 proxy
- `--reload`: dev-mode auto-reload (drop it in production)
## 4. Quick verification (from the backend's perspective)
With the backend running, run the cluster-registration test (see [集群注册测试运行指南.md](file:///home/devbox/project/backend/docs/%E9%9B%86%E7%BE%A4%E6%B3%A8%E5%86%8C%E6%B5%8B%E8%AF%95%E8%BF%90%E8%A1%8C%E6%8C%87%E5%8D%97.md)):
```bash
LOGIN_PASSWORD='admin123' \
HADOOP_PASSWORD='limouren...' \
BASE_URL='http://127.0.0.1:8000' \
python3 /home/devbox/project/backend/tests/test_cluster_register.py
```
If the success case passes, the SSH routing chain works.
## 5. Troubleshooting
### 5.1 `注册失败:SSH不可连接` (Connection refused)
**Symptom**: the backend returns `[Errno 111] Connection refused`.
**Cause**: the backend is configured with `TS_SOCKS5_SERVER`, but no SOCKS5 service is listening on local port `1080`.
**Fix**:
1. Check that `tailscaled` was started with the `--socks5-server=127.0.0.1:1080` flag.
2. Run `ss -tunlp | grep 1080` to confirm the port is listening.
### 5.2 `注册失败:SSH不可连接` (timed out)
**Symptom**: the backend returns `timed out`.
**Cause**: the SOCKS5 proxy is up, but Tailscale is not connected or the target node is offline.
**Fix**:
1. Run `tailscale status` to check node state.
2. Check `/tmp/tailscaled.log` for connection errors.
### 5.3 `tailscale status` shows `NeedsLogin`
Run again:
```bash
sudo tailscale up --accept-dns=false --accept-routes=true
```
### 5.4 Quick proxy restart
If the proxy stops working, use this combined command:
```bash
sudo pkill tailscaled
sudo nohup /usr/sbin/tailscaled \
  --tun=userspace-networking \
  --socket=/var/run/tailscale/tailscaled.sock \
  --state=/var/lib/tailscale/tailscaled.state \
  --socks5-server=127.0.0.1:1080 \
  >/tmp/tailscaled.log 2>&1 &
```

@ -1,77 +0,0 @@
# Account Management Backend Integration Guide
This guide explains how to wire the "Account Management" frontend page to the FastAPI backend, in particular the change-password feature.
## 1. Backend Endpoints
### 1.1 Change password
- **Path**: `/api/v1/user/password`
- **Method**: `PATCH`
- **Auth**: JWT token required (in the `Authorization: Bearer {token}` header)
- **Request body (JSON)**:
```json
{
  "currentPassword": "the current password",
  "newPassword": "the new password"
}
```
- **Validation rules**:
  - `newPassword`: 8-128 characters; must contain an uppercase letter, a lowercase letter, and a digit.
- **Response format**:
  - Success: `{"ok": true}`
  - Failure: 400 with a detailed error message (e.g. `invalid_current_password`, `weak_new_password`).
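The stated rule (8-128 characters with upper, lower, and digit) can be mirrored client-side before submitting, to fail fast with a friendly message. A sketch, not the backend's actual validator:

```python
import re

def is_strong_password(pw: str) -> bool:
    """Mirror the documented rule: 8-128 chars, upper + lower + digit."""
    return (8 <= len(pw) <= 128
            and re.search(r"[A-Z]", pw) is not None
            and re.search(r"[a-z]", pw) is not None
            and re.search(r"\d", pw) is not None)
```

The backend check remains authoritative; this only saves a round trip for obviously weak inputs.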
### 1.2 Get user info (reference)
- **Path**: `/api/v1/user/me`
- **Method**: `GET`
- **Purpose**: show the current username and related info on the account page.
## 2. Frontend Implementation Details
### 2.1 API helper
The frontend uses the axios instance defined in `src/app/lib/api.ts` for all requests.
```typescript
import api from '../lib/api'
// Example call
const response = await api.patch('/v1/user/password', {
  currentPassword: '...',
  newPassword: '...'
}, {
  headers: { Authorization: `Bearer ${token}` }
})
```
### 2.2 State management
The Pinia store (`src/app/stores/auth.ts`) manages login state and the token.
```typescript
import { useAuthStore } from '../stores/auth'
const auth = useAuthStore()
// Use auth.token for the current token
```
### 2.3 Error handling
In `Account.vue`, friendly messages are mapped from the backend's `detail` field:
- `invalid_current_password`: show "current password is incorrect".
- `weak_new_password`: show "new password is too weak (needs upper/lowercase letters and digits)".
- `demo_user_cannot_change_password`: show "demo accounts may not change their password".
- anything else: show "server error, please try again later".
## 3. Integration Notes
1. **CORS**: the backend `main.py` configures `CORSMiddleware` to allow all origins (`*`); restrict it to the frontend domain in production.
2. **Demo-account restriction**: the built-in demo accounts (`admin`, `ops`, `obs`) are blocked in the backend's `users.py` from changing passwords, to protect the shared demo environment.
3. **Data model**: changing the password directly updates the `password_hash` column of the `users` table and refreshes the `updated_at` timestamp.
## 4. Future Work
- **Two-factor authentication (2FA)**: the frontend has a placeholder; the backend endpoints are not implemented yet.
- **Avatar upload**: a `/user/avatar` endpoint is suggested as a follow-up.

@ -1,45 +0,0 @@
# Cluster Management and Log Collection: Collaboration Plan
To keep the two developers' code from conflicting, we apply **Separation of Concerns** and fully decouple ops actions from data collection.
## 1. Division of Work
| Module | Responsibility | Layers | Owner |
| :--- | :--- | :--- | :--- |
| **Cluster start/stop (Ops)** | Run remote SSH commands (e.g. `start-all.sh`) and report operation status. | Router: `ops.py`<br>Service: `ssh_utils.py` | Developer A |
| **Log collection (Logs)** | Pull/parse log files from remote nodes; persist to the database or stream them. | Router: `hadoop_logs.py`<br>Service: `ops_tools.py` | Developer B |
## 2. Paths and Interface Design
### Developer A: Cluster Ops
- **Router**: `backend/app/routers/ops.py`
- **Core utilities**: `backend/app/services/runner.py` (local execution) & `backend/app/ssh_utils.py` (remote execution)
- **Main endpoints**:
  - `POST /api/v1/ops/clusters/{cluster_uuid}/start`: start the cluster
  - `POST /api/v1/ops/clusters/{cluster_uuid}/stop`: stop the cluster
- **Core logic**:
  - Read the cluster's SSH info from the database.
  - **Must use** the `SSHClient` in `ssh_utils.py` to support SOCKS5 proxy environments.
  - Run the start/stop scripts (e.g. `start-all.sh`).
  - Update the `health_status` column of the `clusters` table.
### Developer B: Log Management
- **Router**: `backend/app/routers/hadoop_logs.py`
- **Service**: `backend/app/services/ops_tools.py` (provides atomic operations such as `tool_read_log`)
- **Main endpoints**:
  - `POST /api/v1/hadoop-logs/collect`: trigger log collection on specific nodes
  - `GET /api/v1/hadoop-logs/{cluster_uuid}/status`: check collection task status
- **Core logic**:
  - Locate the remote log path (default: `/usr/local/hadoop/logs`).
  - Pull logs with `SSHClient.read_file` or `SSHClient.execute_command`.
  - Parse asynchronously and store into the `hadoop_logs` table.
## 3. Collaboration Rules
1. **Shared schemas**: both developers first define their request/response models under `backend/app/schemas/`.
2. **Utility reuse**: SSH connection management **must** go through `backend/app/ssh_utils.py`. It already embeds the SOCKS5 proxy logic (env var `TS_SOCKS5_SERVER`) and handles TUN-less environments automatically.
3. **Database access**:
  - The ops module mainly **reads** cluster config and **updates** status.
  - The logs module mainly **writes** log records and must not modify core cluster config.
4. **Audit logging**: key operations should call `_write_exec_log` in `ops_tools.py` to record execution for later auditing.
5. **Exception handling**: uniformly return `HTTPException(status_code=400/500, detail={"errors": [...]})`, consistent with cluster registration.

@ -1,73 +0,0 @@
# Cluster Registration Test Guide
This document explains how to run the cluster registration test script `backend/tests/test_cluster_register.py`, covering both the "success" and "failure" registration cases.
## 1. Prerequisites
- The backend is running and reachable (default `http://127.0.0.1:8000`)
- Your login account has cluster-registration permission (the backend restricts it to `admin` or `ops`)
- The backend can reach the cluster nodes over SSH (if the environment has no TUN device, first follow the "Backend SOCKS Proxy Startup Guide")
## 2. Test Data
The script embeds the node info of the known-good cluster "test":
- hadoop102 100.71.90.16 NameNode
- hadoop103 100.74.47.4 ResourceManager
- hadoop104 100.99.172.96 SecondaryNameNode
- hadoop105 100.91.174.104
- hadoop100 100.73.220.46
All nodes default to:
- username: `hadoop`
- password: `limouren...`
The failure case is generated by the script (it sets `type` to an illegal value) and expects the backend to return `400`.
## 3. Running
From the project root:
```bash
LOGIN_PASSWORD='admin123' \
HADOOP_PASSWORD='limouren...' \
BASE_URL='http://127.0.0.1:8000' \
python3 /home/devbox/project/backend/tests/test_cluster_register.py
```
Optional environment variables:
- `BASE_URL`: backend address, default `http://127.0.0.1:8000`
- `LOGIN_USER`: login username, default `admin`
- `LOGIN_PASSWORD`: login password (required)
- `HADOOP_USER`: cluster SSH username, default `hadoop`
- `HADOOP_PASSWORD`: cluster SSH password (required)
## 4. Expected Output
On success, the script prints lines like:
- `成功用例通过: uuid= ...` (success case passed)
- `失败用例通过: status=400 detail= ...` (failure case passed)
Exit codes:
- `0`: both cases behaved as expected
- `1`: a case failed (e.g. the success case returned non-200, or the failure case did not return 400)
- `2`: a required environment variable is missing
## 5. Common Failures
### 5.1 Success case reports SSH unreachable / timed out
The backend typically returns:
- `注册失败:SSH不可连接` (registration failed: SSH unreachable)
- `detail: timed out`
How to handle:
- Confirm tailscale can see the node online: `tailscale status`
- If the environment has no `/dev/net/tun`, you must use userspace networking + SOCKS5 and start the backend with `TS_SOCKS5_SERVER` set

Binary file not shown.

@ -1,15 +0,0 @@
fastapi
uvicorn[standard]
SQLAlchemy
asyncpg
python-dotenv
passlib[bcrypt]
bcrypt==3.2.0
PyJWT
langchain
langchain-openai
httpx
paramiko
pydantic-settings
requests
beautifulsoup4

@ -1,20 +0,0 @@
import asyncio
import os
import sys
# Add backend directory to sys.path
sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
from backend.app.db import engine
from backend.app.models.chat import Base as ChatBase
async def init_db():
async with engine.begin() as conn:
print("Dropping chat tables if exist...")
await conn.run_sync(ChatBase.metadata.drop_all)
print("Creating chat tables...")
await conn.run_sync(ChatBase.metadata.create_all)
print("Done.")
if __name__ == "__main__":
asyncio.run(init_db())

@ -1,38 +0,0 @@
import os
import json
import requests
def main():
    base = os.getenv("API_BASE", "http://localhost:8000")
    token = os.getenv("API_TOKEN", "")
    name = os.getenv("CLUSTER_NAME", "test-cluster")
    ctype = os.getenv("CLUSTER_TYPE", "hadoop")
    nodes_env = os.getenv("CLUSTER_NODES")
    if not nodes_env:
        print("Provide node info via the CLUSTER_NODES environment variable, e.g.:")
        print('[{"hostname":"nn","ip_address":"10.0.0.1","ssh_user":"u","ssh_password":"p"}]')
        return
    nodes = json.loads(nodes_env)
    payload = {
        "name": name,
        "type": ctype,
        "node_count": len(nodes),
        "health_status": "unknown",
        "nodes": nodes
    }
    if not token:
        try:
            r = requests.post(f"{base}/user/login", json={"username": "admin", "password": "admin123"}, timeout=15)
            if r.status_code == 200:
                token = r.json().get("token") or ""
                print("Logged in automatically and obtained a token")
            else:
                print("Automatic login failed; set the API_TOKEN environment variable")
        except Exception as e:
            print(f"Automatic login error: {e}")
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    r = requests.post(f"{base}/clusters", json=payload, headers=headers, timeout=30)
    print(r.status_code, r.text)
if __name__ == "__main__":
    main()

@ -1,40 +0,0 @@
#!/bin/bash
# Make sure the script runs from the backend directory
cd "$(dirname "$0")"
echo "=== Checking Tailscale status ==="
# 1. Check and start tailscaled (SOCKS5 proxy mode)
if ! pgrep -f "tailscaled.*--socks5-server=127.0.0.1:1080" > /dev/null; then
    echo "Tailscale SOCKS5 proxy is not running; starting..."
    sudo pkill tailscaled 2>/dev/null || true
    sudo nohup /usr/sbin/tailscaled \
      --tun=userspace-networking \
      --socket=/var/run/tailscale/tailscaled.sock \
      --state=/var/lib/tailscale/tailscaled.state \
      --socks5-server=127.0.0.1:1080 \
      >/tmp/tailscaled.log 2>&1 &
    # Wait for startup
    sleep 2
    # Make sure we have joined the network
    sudo tailscale up --accept-dns=false --accept-routes=true
else
    echo "Tailscale SOCKS5 proxy already running on 127.0.0.1:1080."
fi
# 2. Verify the proxy port is reachable
if python3 -c "import socket; s=socket.create_connection(('127.0.0.1',1080),2); s.close()" 2>/dev/null; then
    echo "SOCKS5 proxy verified."
else
    echo "Error: SOCKS5 proxy port 1080 is unreachable; check /tmp/tailscaled.log"
    exit 1
fi
echo "=== Starting the backend service ==="
# 3. Start the backend (inject the proxy environment variable)
export TS_SOCKS5_SERVER='127.0.0.1:1080'
python3 -m uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

@ -1,49 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
BASE_URL="${BASE_URL:-http://127.0.0.1:8000}"
USERNAME="${USERNAME:-admin}"
PASSWORD="${PASSWORD:-admin123}"
SESSION_ID="${SESSION_ID:-curl-sse-$(date +%s)}"
MESSAGE="${MESSAGE:-杀戮尖塔的观者怎么玩}"
export SESSION_ID MESSAGE  # exported so the embedded Python heredoc below can read them via os.environ
TOKEN="$(
curl -sS "${BASE_URL}/api/v1/user/login" \
-H 'Content-Type: application/json' \
-d "{\"username\":\"${USERNAME}\",\"password\":\"${PASSWORD}\"}" \
| python3 -c 'import sys, json; print(json.load(sys.stdin)["token"])'
)"
TMP_OUT="$(mktemp)"
cleanup() { rm -f "${TMP_OUT}"; }
trap cleanup EXIT
curl -N -sS "${BASE_URL}/api/v1/ai/chat" \
-H "Authorization: Bearer ${TOKEN}" \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d "$(python3 - <<'PY'
import json, os
payload = {
"sessionId": os.environ["SESSION_ID"],
"message": os.environ["MESSAGE"],
"stream": True,
"context": {
"webSearch": True,
"agent": "Hadoop助手. You MUST use the web_search tool before answering."
}
}
print(json.dumps(payload, ensure_ascii=False))
PY
)" | tee "${TMP_OUT}"
python3 - "${TMP_OUT}" <<'PY'
import sys, re, pathlib
p = pathlib.Path(sys.argv[1])
s = p.read_text(encoding="utf-8", errors="ignore")
n = len(re.findall(r"^data: ", s, flags=re.M))
if n <= 0:
raise SystemExit("no SSE data lines received; test failed")
print(f"OK: received {n} SSE data lines")
PY

@ -1,138 +0,0 @@
import os
import time
import json
import httpx
def _env(name: str, default: str | None = None) -> str | None:
v = os.environ.get(name)
if v is None:
return default
v2 = v.strip()
return v2 if v2 else default
def _login(client: httpx.Client, base_url: str, username: str, password: str) -> str:
r = client.post(
f"{base_url}/api/v1/user/login",
json={"username": username, "password": password},
timeout=20,
)
r.raise_for_status()
data = r.json()
token = data.get("token")
if not token:
raise RuntimeError("login_no_token")
return token
def _auth_headers(token: str) -> dict:
return {"Authorization": f"Bearer {token}"}
def _list_clusters(client: httpx.Client, base_url: str, token: str) -> list[dict]:
r = client.get(f"{base_url}/api/v1/clusters", headers=_auth_headers(token), timeout=20)
r.raise_for_status()
data = r.json() or {}
return data.get("clusters") or []
def _delete_cluster(client: httpx.Client, base_url: str, token: str, uuid: str) -> None:
r = client.delete(f"{base_url}/api/v1/clusters/{uuid}", headers=_auth_headers(token), timeout=30)
r.raise_for_status()
def _create_cluster(client: httpx.Client, base_url: str, token: str, payload: dict) -> httpx.Response:
return client.post(
f"{base_url}/api/v1/clusters",
headers={**_auth_headers(token), "Content-Type": "application/json"},
content=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
timeout=120,
)
def _success_payload(cluster_name: str, ssh_user: str, ssh_password: str) -> dict:
nodes = [
{"hostname": "hadoop102", "ip_address": "100.71.90.16", "ssh_user": ssh_user, "ssh_password": ssh_password},
{"hostname": "hadoop103", "ip_address": "100.74.47.4", "ssh_user": ssh_user, "ssh_password": ssh_password},
{"hostname": "hadoop104", "ip_address": "100.99.172.96", "ssh_user": ssh_user, "ssh_password": ssh_password},
{"hostname": "hadoop105", "ip_address": "100.91.174.104", "ssh_user": ssh_user, "ssh_password": ssh_password},
{"hostname": "hadoop100", "ip_address": "100.73.220.46", "ssh_user": ssh_user, "ssh_password": ssh_password},
]
return {
"name": cluster_name,
"type": "hadoop",
"node_count": 5,
"health_status": "unknown",
"description": "test cluster register",
"namenode_ip": "100.71.90.16",
"namenode_psw": ssh_password,
"rm_ip": "100.74.47.4",
"rm_psw": ssh_password,
"nodes": nodes,
}
def _failure_payload(base: dict) -> dict:
bad = dict(base)
bad["name"] = f"{base['name']}-bad-{int(time.time())}"
bad["type"] = "bad_type"
return bad
def main() -> int:
base_url = _env("BASE_URL", "http://127.0.0.1:8000")
login_user = _env("LOGIN_USER", "admin2")
login_password = _env("LOGIN_PASSWORD")
ssh_user = _env("HADOOP_USER", "hadoop")
ssh_password = _env("HADOOP_PASSWORD")
missing = [k for k, v in [("LOGIN_PASSWORD", login_password), ("HADOOP_PASSWORD", ssh_password)] if not v]
if missing:
print(f"Missing environment variables: {', '.join(missing)}")
print("Example: LOGIN_PASSWORD=admin123 HADOOP_PASSWORD='limouren...' BASE_URL=http://127.0.0.1:8000 python3 backend/tests/test_cluster_register.py")
return 2
with httpx.Client() as client:
token = _login(client, base_url, login_user, login_password)
clusters = _list_clusters(client, base_url, token)
for c in clusters:
if c.get("name") == "test" and c.get("uuid"):
_delete_cluster(client, base_url, token, c["uuid"])
ok_payload = _success_payload("test", ssh_user, ssh_password)
r_ok = _create_cluster(client, base_url, token, ok_payload)
if r_ok.status_code != 200:
try:
print("Success case failed:", r_ok.status_code, r_ok.json())
except Exception:
print("Success case failed:", r_ok.status_code, r_ok.text[:500])
return 1
data_ok = r_ok.json() or {}
if data_ok.get("status") != "success":
print("Success case returned an unexpected payload:", data_ok)
return 1
uuid = data_ok.get("uuid")
print("Success case passed: uuid=", uuid)
bad_payload = _failure_payload(ok_payload)
r_bad = _create_cluster(client, base_url, token, bad_payload)
if r_bad.status_code != 400:
try:
print("Failure case did not return 400 as expected:", r_bad.status_code, r_bad.json())
except Exception:
print("Failure case did not return 400 as expected:", r_bad.status_code, r_bad.text[:500])
return 1
try:
detail = r_bad.json()
except Exception:
detail = {"raw": r_bad.text[:500]}
print("Failure case passed: status=400 detail=", detail)
return 0
if __name__ == "__main__":
raise SystemExit(main())

@ -1,35 +0,0 @@
import pytest
from app.services.hadoop_cluster_uuid import collect_cluster_uuid
class _CliOK:
def __init__(self, host, user, pwd):
pass
def execute_command_with_timeout(self, cmd, timeout):
if "getconf" in cmd or "awk" in cmd:
return ("/data/hdfs/namenode", "")
if "VERSION" in cmd:
return ("clusterID=12345-abc", "")
return ("", "")
def close(self):
pass
class _CliNoDirs:
def __init__(self, host, user, pwd):
pass
def execute_command_with_timeout(self, cmd, timeout):
return ("", "")
def close(self):
pass
def test_collect_cluster_uuid_success(monkeypatch):
monkeypatch.setattr("app.services.hadoop_cluster_uuid.SSHClient", lambda h,u,p: _CliOK(h,u,p))
u, step, detail = collect_cluster_uuid("10.0.0.1", "u", "p")
assert u is not None
assert step is None
assert detail is None
def test_collect_cluster_uuid_fail_no_dirs(monkeypatch):
monkeypatch.setattr("app.services.hadoop_cluster_uuid.SSHClient", lambda h,u,p: _CliNoDirs(h,u,p))
u, step, detail = collect_cluster_uuid("10.0.0.1", "u", "p")
assert u is None
assert step == "probe_name_dirs"

@ -1,38 +0,0 @@
import app.log_collector as lc
import app.log_reader as lr
def test_parse_and_save_chunk_mock():
sample_lines = [
"[2024-12-17 10:00:00,123] INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Started",
"[2024-12-17 10:01:00,456] WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Disk nearly full",
"[2024-12-17 10:02:00,789] ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Write failed",
"Plain line without timestamp INFO something",
"",
]
content = "\n".join(sample_lines)
captured = []
async def _fake_save_logs_to_db_batch(items: list[dict]):
captured.extend(items)
# monkeypatch batch save method
lc.log_collector._save_logs_to_db_batch = _fake_save_logs_to_db_batch
# run save chunk
lc.log_collector._save_log_chunk("hadoop102", "datanode", content)
# verify non-empty lines saved
expected_saved = [ln for ln in sample_lines if ln.strip()]
assert len(captured) == len(expected_saved)
# check fields
for item in captured:
assert item["host"] == "hadoop102"
assert item["service"] == "datanode"
assert isinstance(item["message"], str) and item["message"]
assert item["log_level"] in {"INFO", "WARN", "ERROR", "DEBUG", "TRACE"}
assert getattr(item["timestamp"], "tzinfo", None) is not None
def test_log_file_path_namenode():
p = lr.log_reader.get_log_file_path("hadoop102", "namenode")
assert p.endswith("/hadoop-hadoop-namenode-hadoop102.log")
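The bracketed log format used in the sample lines above can be parsed with a small regex. This is a sketch of the idea only; the real parsing lives inside `app.log_collector`, whose internals are not shown here, and UTC is an assumption.

```python
import re
from datetime import datetime, timezone

# Matches lines like: [2024-12-17 10:00:00,123] INFO org.apache...: Started
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")

def parse_line(line: str):
    """Return {timestamp, log_level, message} for a bracketed line, else None."""
    m = LINE_RE.match(line)
    if not m:
        return None
    # Hadoop log4j uses a comma before the milliseconds; assume UTC for the sketch.
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f").replace(tzinfo=timezone.utc)
    return {"timestamp": ts, "log_level": m.group("level"), "message": m.group("msg")}

print(parse_line("[2024-12-17 10:02:00,789] ERROR a.b.C: Write failed")["log_level"])  # prints ERROR
```

Lines without the bracketed timestamp (like the "Plain line" sample) return `None` and would need the fallback handling the test exercises.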

@ -1,123 +0,0 @@
import asyncio
import os
import sys
# Add backend directory to sys.path to import app modules
# Current file: backend/tests/test_llm.py
# Parent: backend/tests
# Grandparent: backend
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from app.services.llm import LLMClient
from app.services.ops_tools import openai_tools_schema, tool_web_search, tool_start_cluster, tool_stop_cluster
from app.db import SessionLocal
from dotenv import load_dotenv
import json
async def main():
# Load .env from backend directory
env_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), ".env")
load_dotenv(env_path)
print("Testing LLMClient with REAL Tools...")
try:
llm = LLMClient()
print(f"Provider: {llm.provider}")
print(f"Endpoint: {llm.endpoint}")
print(f"Model: {llm.model}")
print(f"Timeout: {llm.timeout}")
messages = [{"role": "user", "content": "停止集群 5c43a9c7-e2a9-4756-b75d-6813ac55d3ba"}]
# 1. Get tools definition
chat_tools = openai_tools_schema()
print(f"Tools loaded: {[t['function']['name'] for t in chat_tools]}")
print("Sending initial request...")
resp = await llm.chat(messages, tools=chat_tools)
if "choices" in resp and resp["choices"]:
msg = resp["choices"][0].get("message", {})
tool_calls = msg.get("tool_calls")
if tool_calls:
print(f"Tool calls triggered: {len(tool_calls)}")
# Append assistant message with tool_calls
messages.append(msg)
async with SessionLocal() as db:
for tc in tool_calls:
fn = tc.get("function", {})
name = fn.get("name")
args_str = fn.get("arguments", "{}")
print(f"Executing REAL tool: {name} with args: {args_str}")
if name == "web_search":
try:
args = json.loads(args_str)
tool_result = await tool_web_search(args.get("query"), args.get("max_results", 5))
messages.append({
"role": "tool",
"tool_call_id": tc.get("id"),
"name": name,
"content": json.dumps(tool_result, ensure_ascii=False)
})
print("Tool execution completed.")
except Exception as e:
print(f"Tool execution failed: {e}")
elif name == "start_cluster":
try:
args = json.loads(args_str)
cluster_uuid = args.get("cluster_uuid")
# Execute REAL tool
tool_result = await tool_start_cluster(db, "admin", cluster_uuid)
messages.append({
"role": "tool",
"tool_call_id": tc.get("id"),
"name": name,
"content": json.dumps(tool_result, ensure_ascii=False)
})
print(f"REAL tool start_cluster execution completed: {tool_result.get('status')}")
except Exception as e:
print(f"REAL tool execution failed: {e}")
elif name == "stop_cluster":
try:
args = json.loads(args_str)
cluster_uuid = args.get("cluster_uuid")
# Execute REAL tool
tool_result = await tool_stop_cluster(db, "admin", cluster_uuid)
messages.append({
"role": "tool",
"tool_call_id": tc.get("id"),
"name": name,
"content": json.dumps(tool_result, ensure_ascii=False)
})
print(f"REAL tool stop_cluster execution completed: {tool_result.get('status')}")
except Exception as e:
print(f"REAL tool execution failed: {e}")
# 2. Send follow-up request with tool results
print("Sending follow-up request...")
resp = await llm.chat(messages, tools=chat_tools)
if "choices" in resp and resp["choices"]:
final_msg = resp["choices"][0].get("message", {})
print("\nFinal Reply:")
print(final_msg.get('content'))
if "reasoning_content" in final_msg:
print(f"\nReasoning:\n{final_msg.get('reasoning_content')}")
else:
print("No tool calls triggered.")
print(f"Reply: {msg.get('content')}")
else:
print(resp)
except Exception as e:
import traceback
traceback.print_exc()
print(f"Error: {repr(e)}")
if __name__ == "__main__":
asyncio.run(main())

@ -1,58 +0,0 @@
import httpx
import asyncio
import json
import os
import pytest
async def _run_register_checks(base_url: str):
url = f"{base_url.rstrip('/')}/api/v1/user/register"
# 1. Missing required field (422)
print("\n1. Testing missing field...")
payload = {
"username": "testuser",
"email": "test@example.com",
"password": "password123"
# fullName missing
}
async with httpx.AsyncClient() as client:
r = await client.post(url, json=payload)
print(f"Status: {r.status_code}")
print(f"Response: {r.text}")
# 2. Validation errors (400 with errors)
print("\n2. Testing validation error (short username)...")
payload = {
"username": "t",
"email": "invalid-email",
"password": "123",
"fullName": "Z"
}
async with httpx.AsyncClient() as client:
r = await client.post(url, json=payload)
print(f"Status: {r.status_code}")
print(f"Response: {r.text}")
# 3. Duplicate username (400 with message)
# Assumes the admin user already exists
print("\n3. Testing duplicate username...")
payload = {
"username": "admin",
"email": "admin_new@example.com",
"password": "Password123",
"fullName": "Administrator"
}
async with httpx.AsyncClient() as client:
r = await client.post(url, json=payload)
print(f"Status: {r.status_code}")
print(f"Response: {r.text}")
def test_register_fix_e2e():
base_url = os.getenv("E2E_BASE_URL", "").strip()
if not base_url:
pytest.skip("E2E_BASE_URL must be set and the backend service must be running")
asyncio.run(_run_register_checks(base_url))
if __name__ == "__main__":
url = os.getenv("E2E_BASE_URL", "http://localhost:8000").strip()
asyncio.run(_run_register_checks(url))

@ -1,30 +0,0 @@
import asyncio
import os
import sys
# Add backend directory to sys.path to import app modules
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from app.services.ops_tools import tool_web_search
async def main():
print("Testing Web Search...")
query = "今天星期几"
print(f"Query: {query}")
try:
res = await tool_web_search(query)
if "error" in res:
print(f"Error: {res['error']}")
else:
print(f"Current Time: {res.get('current_time')}")
print(f"Results found: {len(res.get('results', []))}")
for i, r in enumerate(res.get("results", [])[:2]):
print(f"[{i+1}] {r.get('title')} - {r.get('href')}")
if r.get('full_content'):
print(f" Full content len: {len(r.get('full_content'))}")
print(f" Sample: {r.get('full_content')[:100]}...")
except Exception as e:
print(f"Exception: {e}")
if __name__ == "__main__":
asyncio.run(main())

@ -1,27 +0,0 @@
import pytest
from app.services.ssh_probe import check_ssh_connectivity
from app.ssh_utils import SSHClient
class _DummyCli:
def __init__(self, host, user, pwd):
self.closed = False
def execute_command_with_timeout(self, cmd, timeout):
return ("ok", "")
def close(self):
self.closed = True
def test_check_ssh_connectivity_success(monkeypatch):
monkeypatch.setattr("app.services.ssh_probe.SSHClient", lambda h,u,p: _DummyCli(h,u,p))
ok, err = check_ssh_connectivity("127.0.0.1", "u", "p", timeout=1)
assert ok is True
assert err is None
class _FailCli:
def __init__(self, host, user, pwd):
raise RuntimeError("connect_failed")
def test_check_ssh_connectivity_fail(monkeypatch):
monkeypatch.setattr("app.services.ssh_probe.SSHClient", lambda h,u,p: _FailCli(h,u,p))
ok, err = check_ssh_connectivity("127.0.0.1", "u", "p", timeout=1)
assert ok is False
assert "connect_failed" in str(err)

@ -1,59 +0,0 @@
import asyncio
import os
import sys
import json
# Add backend directory to sys.path to import app modules
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from app.services.llm import LLMClient
from dotenv import load_dotenv
async def main():
# Load .env from backend directory
env_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), ".env")
load_dotenv(env_path)
print("Testing LLMClient Streaming...")
try:
llm = LLMClient()
print(f"Provider: {llm.provider}")
print(f"Endpoint: {llm.endpoint}")
print(f"Model: {llm.model}")
messages = [{"role": "user", "content": ""}]
print("Sending streaming request...")
stream_gen = await llm.chat(messages, stream=True)
full_content = ""
full_reasoning = ""
print("\nStreaming Response:")
async for chunk in stream_gen:
choices = chunk.get("choices") or []
if not choices:
continue
delta = choices[0].get("delta") or {}
content = delta.get("content") or ""
reasoning = delta.get("reasoning_content") or ""
if reasoning:
full_reasoning += reasoning
print(f"[Reasoning] {reasoning}", end="", flush=True)
if content:
full_content += content
print(content, end="", flush=True)
print("\n\nStream Finished.")
print(f"Full Content Length: {len(full_content)}")
print(f"Full Reasoning Length: {len(full_reasoning)}")
except Exception as e:
import traceback
traceback.print_exc()
print(f"Error: {repr(e)}")
if __name__ == "__main__":
asyncio.run(main())
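On the wire, a stream like the one consumed above usually arrives as SSE `data:` lines. A minimal parser could look like the following sketch, assuming OpenAI-style JSON chunks terminated by a `[DONE]` sentinel; the real parsing is inside `LLMClient`:

```python
import json

def parse_sse_lines(lines):
    """Yield JSON chunks from SSE 'data:' lines, stopping at '[DONE]'."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip comments, keep-alives, and blank separators
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)

chunks = list(parse_sse_lines(['data: {"choices": [{"delta": {"content": "hi"}}]}', '', 'data: [DONE]']))
print(chunks[0]["choices"][0]["delta"]["content"])  # prints hi
```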

@ -1,27 +0,0 @@
#!/bin/bash
# Configuration
BASE_URL="http://localhost:8000/api/v1"
USERNAME="admin"
PASSWORD="admin123"
CLUSTER_UUID="5c43a9c7-e2a9-4756-b75d-6813ac55d3ba"
echo "Logging in to obtain a token..."
LOGIN_RESPONSE=$(curl -s -X POST "$BASE_URL/user/login" \
-H "Content-Type: application/json" \
-d "{\"username\": \"$USERNAME\", \"password\": \"$PASSWORD\"}")
TOKEN=$(echo "$LOGIN_RESPONSE" | grep -oP '(?<="token":")[^"]*')
if [ -z "$TOKEN" ]; then
echo "Login failed; could not obtain a token"
echo "Response: $LOGIN_RESPONSE"
exit 1
fi
echo "Login succeeded; calling the cluster stop endpoint..."
curl -X POST "$BASE_URL/ops/clusters/$CLUSTER_UUID/stop" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json"
echo -e "\nDone"

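The `grep -oP` token extraction in these scripts silently breaks if the JSON field order changes or the token contains escaped characters. A sturdier sketch parses the JSON instead; the response shape below is assumed from the scripts' usage, not taken from the real API:

```shell
# Hypothetical login response, for illustration only:
LOGIN_RESPONSE='{"status":"success","token":"abc123"}'
# Parse the JSON rather than pattern-matching it:
TOKEN=$(printf '%s' "$LOGIN_RESPONSE" | python3 -c 'import sys, json; print(json.load(sys.stdin).get("token", ""))')
echo "$TOKEN"  # prints abc123
```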
@ -1,27 +0,0 @@
#!/bin/bash
# Configuration
BASE_URL="http://localhost:8000/api/v1"
USERNAME="admin"
PASSWORD="admin123"
CLUSTER_UUID="5c43a9c7-e2a9-4756-b75d-6813ac55d3ba"
echo "Logging in to obtain a token..."
LOGIN_RESPONSE=$(curl -s -X POST "$BASE_URL/user/login" \
-H "Content-Type: application/json" \
-d "{\"username\": \"$USERNAME\", \"password\": \"$PASSWORD\"}")
TOKEN=$(echo "$LOGIN_RESPONSE" | grep -oP '(?<="token":")[^"]*')
if [ -z "$TOKEN" ]; then
echo "Login failed; could not obtain a token"
echo "Response: $LOGIN_RESPONSE"
exit 1
fi
echo "Login succeeded; calling the cluster start endpoint..."
curl -X POST "$BASE_URL/ops/clusters/$CLUSTER_UUID/start" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json"
echo -e "\nDone"

@ -1,49 +1,49 @@
Meeting Minutes
Basic information
Topic: kickoff meeting for the new project on LLM-based fault detection
Attendees: 李友焕, 沈永佳, 邢远鑫, 邹佳轩, 王祖旺, 李涛
Recording method: AI meeting assistant, real-time minutes
Meeting assistant (00:00): Hi, I am your meeting assistant. I am recording the minutes in real time; please proceed with the meeting.
Meeting assistant (01:31): 李友焕 asked for the fault-detection requirements to be restated, suggesting some ambiguity in how they are understood, and proposed screen sharing for a more direct discussion. 沈永佳 confirmed the requirements came from the point-table stage, implying they may be scattered or not yet systematic. The two quickly confirmed the scope of the requirements but did not go into specifics.
Meeting assistant (03:48): 李友焕 stressed the project's difficulty and practical significance, clearly hoping to motivate the team, noting that the LLM direction is feasible and offers real training value.
沈永佳 was somewhat confused when confirming the document location, possibly indicating limited familiarity with the material.
The two went back and forth on screen-sharing details, showing early technical friction in collaboration.
When 李友焕 mentioned his Tencent project experience, he hinted at concerns about current resource conditions.
Meeting assistant (05:53): 李友焕 mentioned that cooperation with the military was blocked by system sensitivity, so the team pivoted to preliminary research. This shows the project faces compliance challenges and must advance indirectly. He dissected the complexity of big-data platforms, from distributed storage to components such as Spark, Hadoop, and Elasticsearch, stressing that large companies manage data flows through centralized technical middle platforms. His Tencent example implied the project needs similar systematic support, which current resources may not provide.
Meeting assistant (08:06): 李友焕 enumerated the failure scenarios that can occur on big-data platforms, including out-of-memory errors, abnormal resource allocation, permission problems, and accidental data deletion, highlighting how hard fault localization is in complex systems. He stressed that the current reliance on manual troubleshooting is inefficient, implying the need for LLM-based intelligence to optimize fault diagnosis.
Meeting assistant (10:14): 李友焕 proposed using LLMs for real-time monitoring and automatic repair, emphasizing the need for precise diagnosis and tool-invocation capability. He noted that manual troubleshooting is inefficient, while an LLM could find and fix problems early. But the team lacks experience with big-data components; 沈永佳 said he only knows Spring Boot, implying a gap between current skills and the requirements.
Meeting assistant (12:28): 李友焕 found that team members lack experience with big-data components, which he framed as a learning opportunity; he suggested starting with HDFS and Hadoop, then moving on to Spark and Hive, stressing that these skills remain widely used in industry. He plans to share learning material and guide practice, though he clearly recognizes the learning curve may be steep.
From the earlier discussion, the team appears to be exploring LLM-based system monitoring and problem repair, but the current skill base is clearly insufficient.
Meeting assistant (14:41): 李友焕 stressed that big-data engineers must master distributed systems such as HDFS and Hadoop, suggested hands-on practice in VM-based environments, and recommended 林子雨's online course as a learning resource.
He proposed deliberately injecting faults to test the LLM's monitoring and repair capability, seeing this as an effective way to demonstrate the project.
沈永佳 responded only with brief agreement throughout, showing the discussion was largely one-way instruction.
Meeting assistant (17:01): 李友焕 stressed the importance of learning LLMs and prompt optimization, calling them unavoidable skills in future work, and suggested 3-4 days of focused study. He mentioned a past student being questioned over low code volume, but the core difficulty of LLM projects has long been solved.
沈永佳 said problems will surface during the learning process, showing a pragmatic attitude toward it. 李友焕 added that his master's students are also working on this project, revealing his intent that students genuinely learn, and that he can accept results falling short of expectations.
Meeting assistant (19:06): 李友焕 stressed that the evaluation of LLM projects has shifted from code volume to practical application, showing the team's direction moving from technical implementation to delivering value. He decided to take back part of the project himself and suggested the team focus on learning LLM-related skills. Notably, a past case where a skewed assessment standard hurt a student's graduate-admission prospects seems to have pushed him toward valuing practical results over formal metrics.
The later conversation shows students handling meeting recording and minutes logistics; the discussion was somewhat scattered, perhaps reflecting that the team's collaboration process still needs tuning.
Recording period: entire meeting
Record status: complete

@ -1,45 +0,0 @@
# weekly Directory Change Report (2025-12-07)
## Change List
- Week 11
- Group documents:
- Added and completed meeting minutes: `doc/process/weekly/week-11/group/meeting-minutes-11.md:1`
- Added and completed the group weekly plan: `doc/process/weekly/week-11/group/weekly-plan-11.md:1`
- Added and completed the group weekly summary: `doc/process/weekly/week-11/group/weekly-summary-11.md:1`
- Member documents:
- 沈永佳: weekly plan `doc/process/weekly/week-11/members/shenyongjia-weekly-plan-11.md:1`; weekly summary `doc/process/weekly/week-11/members/shenyongjia-weekly-summary-11.md:1`
- 李涛: weekly plan `doc/process/weekly/week-11/members/litao-weekly-plan-11.md`; weekly summary `doc/process/weekly/week-11/members/litao-weekly-summary-11.md`
- 王祖旺: weekly plan `doc/process/weekly/week-11/members/wangzuwang-weekly-plan-11.md`; weekly summary `doc/process/weekly/week-11/members/wangzuwang-weekly-summary-11.md`
- 邢远鑫: weekly plan `doc/process/weekly/week-11/members/xingyuanxin-weekly-plan-11.md`; weekly summary `doc/process/weekly/week-11/members/xingyuanxin-weekly-summary-11.md`
- 邹佳轩: weekly plan `doc/process/weekly/week-11/members/zoujiaxuan-weekly-plan-11.md`; weekly summary `doc/process/weekly/week-11/members/zoujiaxuan-weekly-summary-11.md`
- Week 10
- Group documents:
- Added and completed the group weekly plan: `doc/process/weekly/week-10/group/weekly-plan-10.md:1`
- Added and completed the group weekly summary: `doc/process/weekly/week-10/group/weekly-summary-10.md:1`
- Member documents:
- 沈永佳: weekly plan `doc/process/weekly/week-10/members/shenyongjia-weekly-plan-10.md:1`; weekly summary `doc/process/weekly/week-10/members/shenyongjia-weekly-summary-10.md:1`
- 李涛: weekly plan `doc/process/weekly/week-10/members/litao-weekly-plan-10.md`; weekly summary `doc/process/weekly/week-10/members/litao-weekly-summary-10.md`
- 王祖旺: weekly plan `doc/process/weekly/week-10/members/wangzuwang-weekly-plan-10.md`; weekly summary `doc/process/weekly/week-10/members/wangzuwang-weekly-summary-10.md`
- 邢远鑫: weekly plan `doc/process/weekly/week-10/members/xingyuanxin-weekly-plan-10.md`; weekly summary `doc/process/weekly/week-10/members/xingyuanxin-weekly-summary-10.md`
- 邹佳轩: weekly plan `doc/process/weekly/week-10/members/zoujiaxuan-weekly-plan-10.md`; weekly summary `doc/process/weekly/week-10/members/zoujiaxuan-weekly-summary-10.md`
- Week 9
- Group documents:
- Enhanced the group weekly summary: `doc/process/weekly/week-9/group/weekly-summary-9.md:1`
- Member documents:
- 沈永佳: weekly summary filled in `doc/process/weekly/week-9/members/shenyongjia-weekly-summary-9.md:1`
- 李涛: weekly summary `doc/process/weekly/week-9/members/litao-weekly-summary-9.md`
- 王祖旺: weekly summary `doc/process/weekly/week-9/members/wangzuwang-weekly-summary-9.md`
- 邢远鑫: weekly summary `doc/process/weekly/week-9/members/xingyuanxin-weekly-summary-9.md`
- 邹佳轩: weekly summary `doc/process/weekly/week-9/members/zoujiaxuan-weekly-summary-9.md`
## Content Summary
- Week 11 group meeting minutes: aligned database connectivity and backend integration, the frontend authentication loop, dual-channel Flume collection, FastAPI capability gaps, and the AI/MCP test plan and task assignments
- Week 11 group weekly plan: database/backend connectivity, frontend Auth wrapper and token management, Flume collection and monitoring scripts, FastAPI samples and conventions, AI/MCP test-case set
- Week 11 group weekly summary: fixed LAN connectivity and closed the authentication loop, dual-channel Flume collection, FastAPI learning notes, and AI/MCP test-system building
- Week 10 group plan/summary: minimal login/registration loop, DB initialization and integration, the Flume→HDFS pipeline, frontend realtime/permissions/diagnosis fixes, and consolidation of member deliverables
- Week 9 enhanced group summary: added frontend API wrappers and issues/risks; member summaries filled in JWT integration problems and next week's resolution plan
## Impact and Follow-up
- Documentation completeness improved: week 10/11 group and member documents now form a plan-execute-review loop
- Integration and troubleshooting guidance clarified: key steps and risks for database/backend connectivity and the frontend authentication flow are now captured in the docs
- Suggestion: in the week 12 group documents, add charts and metric summaries (auth success rate, Flume throughput, integration test pass rate) to improve visibility and tracking

@ -1,101 +0,0 @@
# Project Meeting Minutes
- Topic: week 9 retrospective and week 10 planning
- Time: 2025-11-23
- Attendees: 沈永佳, 李涛, 邹佳轩, 王祖旺, 邢远鑫
- Type: weekly regular meeting (online)
## Background
- Summarize week 9 work and set the technical route and stage goals for week 10.
- Focus on frontend-backend integration, Hadoop environment familiarization, and AI Agent development and integration.
---
## Member Updates
- 沈永佳
- Migrated the database from MySQL to PostgreSQL and set up a local PostgreSQL environment
- 李涛
- Initial deployment of the Flume component on the Hadoop cluster
- 邹佳轩
- Studied Docker containers and AI model integration; started building the PostgreSQL database
- 王祖旺
- Studied basic Hadoop cluster operations and automated test scripts to support later testing
- 邢远鑫
- Studied frontend-backend integration steps; plans small-scale experimental development
---
## Overall Planning and Discussion
- Goal split: advance "external application development" and "Hadoop environment familiarization" in parallel
- Focus and priority:
- External application development first, starting from the login and registration pages, connected to a real database and iterated gradually
- Hadoop cluster configuration, familiarization, and common bug troubleshooting as prerequisite skills for all members
- Future technical directions:
- Multi-agent architecture; understand the Prompt, Agent, and MCP concepts and their relations
- UI layout modeled on VS Code; plan features for reading logs, AI Q&A, and executing remote commands
---
## Technical Resources
- Suggested frontend study: Vite
- Suggested backend study: PHP
---
## Key Decisions
- Initial tooling and validation: the system will initially be tested with tools such as Doubao or D
- Logging approach: use Flume to collect logs, first writing to HDFS node disks for later analysis
- Frontend architecture: build an AI frontend application whose home page manages and configures LLM tool invocation
---
## Tasks and To-dos
1. Frontend-backend API conventions (raised by 沈永佳, team consensus)
- Resolve the API definition inconsistencies in the 头歌 repository and form a unified convention
2. MCP service verification (邹佳轩, 沈永佳)
- Confirm the MCP service is correctly configured and usable; each member checks their own environment
3. LLM and API study (沈永佳)
- Study the GPT, Gemini, and Qwen APIs and related Fluent technology to prepare for later integration
4. Frontend-backend integration (邢远鑫)
- Start from login/registration, connect to the real database, and complete a minimal functional loop
5. Hadoop cluster familiarization (王祖旺)
- Learn and master basic operations; prepare the automation scripts needed for testing and maintenance
6. Prompt/Agent/MCP relations (all members)
- Watch and study the relevant videos to align terminology and understanding
7. Log management (邹佳轩, 李涛, 王祖旺)
- Try completing the database configuration with Host Agreement next week (邹佳轩)
- Set up Flume, complete log collection and storage, and provide screenshots before the weekend (李涛)
---
## Team Plan for Next Week
- Core goal: finish frontend-backend integration and implement user login/registration
- Logging: 李涛 sets up Flume for log collection and submits screenshots as proof before the weekend
---
## Acceptance Criteria and Milestones
- Login/registration integration: registration and login validation against the real database, with normalized API responses
- Flume log collection: stable pipeline, screenshots and notes complete, collection into HDFS verifiable
- MCP service: configured and usable; health checks and basic calls pass
- Learning: Prompt/Agent/MCP study written up as notes and archived
---
## Conclusions and Arrangements
- Initial AI system development goals and technical route confirmed, broken down into individual tasks
- Action items proceed under the owners listed in these minutes, with stage acceptance and archiving before the weekend
**Next meeting**
- Time: 2025-11-30 (weekly meeting)

@ -1,41 +0,0 @@
# Week 10 Group Plan
## Goals
- Complete frontend-backend integration with a minimal user login/registration loop connected to the real database
- Advance Hadoop familiarization and log collection: Flume → HDFS end-to-end, with screenshot verification
- Start preparation for AI Agent/MCP learning and integration; organize terminology and relations
## Task Breakdown (Owner)
- API conventions and integration (沈永佳, 邢远鑫)
- Define login/registration request/response fields and error-code conventions; complete a minimal integration loop with screenshots
- Frontend wrapper for `{code,msg,data}`, unified `Authorization` handling and error messages, plus key test cases
- Database configuration and backend integration (邹佳轩, 沈永佳)
- Local PostgreSQL initialization and `users` table schema; write connection and initialization scripts
- Align `backend/.env` connection settings and run real login/registration validation through the backend
- Flume log collection (李涛)
- Deploy and configure Flume; complete the collection pipeline into HDFS (full + realtime) with end-to-end screenshots
- Write up the collection design and key parameters (channel capacity, roll policy, compression)
- Hadoop and automated testing (王祖旺)
- Study advanced HDFS features and MapReduce programming; add automated test scripts and data-management conventions
- Prompt/Agent/MCP study (all)
- Organize terminology and technical relations into study notes and a draft integration direction
## Schedule
- Monday: API convention draft; database initialization and `users` schema preparation
- Tuesday: MCP service health check and basic call verification; fix configuration and docs
- Wednesday: complete the minimal login/registration integration loop; collect screenshots and test cases
- Thursday: Flume → HDFS collection verification with screenshots; draft of parameters and policies
- Friday: members archive their documents and scripts; Prompt/Agent/MCP study notes written up
- Weekend: stage review and weekly summary; form next week's plan
## Acceptance Criteria
- Login/registration: real-database validation passes; consistent API responses; integration screenshots and test cases complete
- Database: `users` schema usable; connection verified; initialization script and notes available
- Flume collection: pipeline working; full/realtime verification screenshots and workflow notes complete
- MCP service: health check and basic calls succeed; configuration notes and checklist done
- Study notes: Prompt/Agent/MCP relations clear, archived and reusable
## Risks and Mitigation
- Authentication and repository push-credential issues → document locally first; push together once credentials are ready
- Environment and encoding differences → use English paths and file names throughout; ignore temporary files via `.gitignore`
- MCP dependency and version differences → list dependencies in a checklist and verify item by item

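The Flume parameters called out in the week-10 plan (channel capacity, roll policy, compression) might be sketched as follows; the agent name, source type, and paths are illustrative assumptions, not the team's actual configuration:

```properties
# Illustrative Flume agent: tail Hadoop logs into HDFS (all values assumed).
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /var/log/hadoop/.*\.log
a1.sources.r1.channels = c1

# Channel capacity (plan item)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# HDFS sink with roll policy and compression (plan items)
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/logs/%Y-%m-%d
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.codeC = gzip
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```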
@ -1,37 +0,0 @@
# Week 10 Group Summary
## Overall Progress
- API conventions and integration: login/registration fields and error codes agreed; the minimal frontend-backend loop passed, with screenshots and test cases archived
- Database and backend: local PostgreSQL initialized with the `users` schema; `backend/.env` connection settings aligned; real validation passed
- Flume log collection: Flume → HDFS end-to-end collection (full + realtime) working, with parameter/policy notes and end-to-end verification screenshots
- Frontend improvements: realtime (SSE/WS) integration with heartbeat/reconnect; log-download permissions and error handling; the Diagnosis/Repair flow closed; unified error-code and field mapping with added tests
- Hadoop and testing: advanced HDFS features studied with MapReduce programming practice; automated-test data-management conventions and workflow organized
## Key Results (by member)
- 李涛: designed and delivered the Flume "full + realtime" collection scheme, with multi-scenario adaptation, performance tuning, high-availability verification, end-to-end integration, and monitoring/alert configuration (see `week-10/members/litao-weekly-summary-10.md:16-55`)
- 邢远鑫: completed SSE/WS realtime integration, log-download permission wrappers, the Diagnosis/Repair loop, and unified error codes; updated the Postman collection and unit tests (see `week-10/members/xingyuanxin-weekly-summary-10.md:6-24`)
- 邹佳轩: completed the login/registration API conventions and integration, MCP service verification, local PostgreSQL initialization, and backend-integration notes; assisted with Flume verification and study notes (see `week-10/members/zoujiaxuan-weekly-summary-10.md:6-23`)
- 王祖旺: studied and practiced advanced HDFS features and MapReduce programming; organized automated-testing conventions; produced a first draft of the AI Agent test plan (see `week-10/members/wangzuwang-weekly-summary-10.md:7-23, 24-28`)
- 沈永佳: database setup and initialization, backend users and permissions, backend framework `.env` configuration with health-check/login/registration integration; improved backend startup and deployment notes (see `week-10/members/shenyongjia-weekly-summary-10.md:3-13`)
## Data and Verification
- Login/registration integration: minimal loop passed with consistent API responses; ≥5 recorded test runs
- MCP service: health check and basic calls succeeded; checklist coverage ≥90%
- Flume collection: full collection lossless; realtime latency ≤30 s; HDFS write throughput and channel queue stable
- Frontend realtime: latency ≤30 s in most scenarios; reconnect backoff ≤3 attempts; clear download-permission prompts
## Problems and Gaps
- Inconsistent permissions and error codes: a few error paths diverge in conventions; unified mapping and copy needed
- Extreme network scenarios: occasional reconnect failures (>3 attempts) under the WS fallback strategy; robustness needs work
- Load testing and profiling: limited data for Flume under high concurrency and long runs; more needed
- Push credentials and encoding: remote push authentication and Windows console encoding issues hamper collaboration; need standardization
## Next Week
- Extended integration and regression: cover login/registration/current-user error scenarios; improve logging and alerting
- Database and backend: unify firewall and subnet policies; standardize the `.env` template and a troubleshooting checklist
- Flume load testing and monitoring: add high-concurrency and long-run tests with linked dashboards and alerts (Prometheus/Grafana)
- Frontend robustness: improve the WS fallback and concurrency strategy; unify error-path fields and messages; expand unit tests
- AI/MCP: finish the test-environment configuration and end-to-end error-scenario cases; produce a report
## Conclusion
- The team met its stage goals from API conventions and database integration through log collection and frontend improvements, forming a minimal runnable loop; next week focuses on integration coverage, observability, and load testing to raise overall stability and delivery quality
