
hpc-runtime-doctor

HeshamFS
Updated 2 days ago

About

This Claude Skill diagnoses runtime and scheduler issues for High-Performance Computing (HPC) materials simulations. It analyzes problems like MPI/OpenMP layout, GPU usage, environment modules, and resource mismatches when jobs fail or underperform on a cluster. Use it to get a diagnosis, environment checklist, and a safe retry plan for portability and performance issues.

Quick Install

Claude Code

Recommended
Primary method
npx skills add HeshamFS/materials-simulation-skills -a claude-code
Plugin command (alternative method)
/plugin add https://github.com/HeshamFS/materials-simulation-skills
Git clone (alternative method)
git clone https://github.com/HeshamFS/materials-simulation-skills.git ~/.claude/skills/hpc-runtime-doctor

Copy and paste this command into Claude Code to install the skill.

Skill Documentation

HPC Runtime Doctor

Goal

Turn cluster symptoms into a resource-layout diagnosis, environment checklist, and safe retry plan.

Requirements

  • Python 3.10+
  • No external dependencies
  • Works on Linux, macOS, and Windows

Inputs to Gather

Input                 Description               Example
Scheduler             SLURM, PBS, LSF, local    slurm
Nodes/tasks/threads   Runtime layout            2 nodes, 128 tasks, 2 threads
GPUs                  GPUs requested            4
Symptoms              Observed failure          oom,killed,slow-gpu
MPI/OpenMP/GPU use    Parallel modes            mpi+openmp+gpu
Walltime              Requested time            12:00:00
Scratch               Whether scratch is used   true
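
When gathering the walltime input, it can help to normalize it to seconds so it is easy to compare against observed runtimes. The following is a minimal sketch, not part of the bundled script; the function name `walltime_to_seconds` is hypothetical, and it assumes the plain HH:MM:SS form shown in the example (schedulers may accept other forms that this sketch does not handle).

```python
def walltime_to_seconds(walltime: str) -> int:
    """Convert an HH:MM:SS walltime string to a number of seconds.

    Assumption: only the HH:MM:SS form from the example is supported.
    """
    parts = walltime.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected HH:MM:SS, got {walltime!r}")
    hours, minutes, seconds = (int(p) for p in parts)
    if hours < 0 or not (0 <= minutes < 60) or not (0 <= seconds < 60):
        raise ValueError(f"walltime out of range: {walltime!r}")
    return hours * 3600 + minutes * 60 + seconds


print(walltime_to_seconds("12:00:00"))  # 43200
```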

Decision Guidance

  • Check resource layout before changing physics settings.
  • Confirm module/compiler/MPI/CUDA consistency before debugging solver behavior.
  • Treat missing restart files and scratch cleanup as workflow failures, not physics failures.
  • For GPU jobs, confirm the executable was built with the requested accelerator backend.
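
As an illustration of the first point, the arithmetic behind a resource-layout check can be sketched as follows. This is an assumed, simplified model and not the logic of `hpc_runtime_doctor.py`: the `Layout` class and `summarize_layout` function are hypothetical names, and the sketch only derives per-node figures from the nodes, tasks, and threads inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical container for the layout inputs listed above."""
    nodes: int
    tasks: int
    cpus_per_task: int


def summarize_layout(layout: Layout) -> dict:
    """Derive per-node figures to compare against the cluster's node size."""
    # Ceiling division: the busiest node holds this many tasks.
    max_tasks_per_node = -(-layout.tasks // layout.nodes)
    return {
        "max_tasks_per_node": max_tasks_per_node,
        "uneven_tasks": layout.tasks % layout.nodes != 0,
        "cores_needed_per_node": max_tasks_per_node * layout.cpus_per_task,
        "total_cores": layout.tasks * layout.cpus_per_task,
    }


# The example layout from the inputs table: 2 nodes, 128 tasks, 2 threads.
print(summarize_layout(Layout(nodes=2, tasks=128, cpus_per_task=2)))
```

For the example layout this gives 64 tasks and 128 cores per node, so a node with fewer than 128 cores would be oversubscribed before any physics setting comes into play.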

Script Outputs

scripts/hpc_runtime_doctor.py emits:

  • resource_layout
  • diagnoses
  • environment_checks
  • retry_plan
  • scheduler_notes
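
A consumer of the `--json` output might first confirm that all five sections are present. The sketch below assumes the names above appear as top-level keys of a single JSON object; the source does not specify the nesting or the shape of each value, so treat that as an assumption. The helper name `missing_sections` is hypothetical.

```python
import json

# Section names taken from the list above.
EXPECTED_KEYS = (
    "resource_layout",
    "diagnoses",
    "environment_checks",
    "retry_plan",
    "scheduler_notes",
)


def missing_sections(report_text: str) -> list[str]:
    """Return the expected section names absent from a JSON report.

    Assumption: the report is one JSON object with the sections at top level.
    """
    report = json.loads(report_text)
    if not isinstance(report, dict):
        raise ValueError("expected a JSON object at the top level")
    return [key for key in EXPECTED_KEYS if key not in report]


# Illustrative, made-up report containing only two of the sections.
sample = json.dumps({"resource_layout": {}, "diagnoses": []})
print(missing_sections(sample))
```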

Workflow

python3 skills/hpc-deployment/hpc-runtime-doctor/scripts/hpc_runtime_doctor.py \
  --scheduler slurm \
  --nodes 2 \
  --tasks 128 \
  --cpus-per-task 2 \
  --gpus 4 \
  --symptoms oom,slow-gpu \
  --uses-mpi \
  --uses-openmp \
  --uses-gpu \
  --json
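
The same invocation can be assembled from Python when the layout values come from another tool. This sketch only builds the argument list, using the flags shown in the command above; the function name `build_command` is hypothetical, and the script path assumes the command is run from the repository root.

```python
import sys

# Path as shown in the workflow command; assumes the repository root as cwd.
SCRIPT = "skills/hpc-deployment/hpc-runtime-doctor/scripts/hpc_runtime_doctor.py"


def build_command(
    scheduler: str,
    nodes: int,
    tasks: int,
    cpus_per_task: int,
    gpus: int,
    symptoms: list[str],
    uses_mpi: bool = False,
    uses_openmp: bool = False,
    uses_gpu: bool = False,
) -> list[str]:
    """Build the argument list for the doctor script with JSON output."""
    cmd = [
        sys.executable, SCRIPT,
        "--scheduler", scheduler,
        "--nodes", str(nodes),
        "--tasks", str(tasks),
        "--cpus-per-task", str(cpus_per_task),
        "--gpus", str(gpus),
        "--symptoms", ",".join(symptoms),
    ]
    # Boolean inputs are passed as bare flags, as in the shell example.
    if uses_mpi:
        cmd.append("--uses-mpi")
    if uses_openmp:
        cmd.append("--uses-openmp")
    if uses_gpu:
        cmd.append("--uses-gpu")
    cmd.append("--json")
    return cmd


print(build_command("slurm", 2, 128, 2, 4, ["oom", "slow-gpu"], True, True, True))
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`; passing a list rather than a shell string avoids quoting issues in the symptom values.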

Error Handling

Invalid resource counts stop with exit code 2. Unknown symptoms are preserved as custom items for human review.
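
One way such behavior could be implemented is sketched below. It is not the bundled script's code: the parser, the helper names, and the `KNOWN_SYMPTOMS` set (limited to the three symptoms named in this document) are assumptions for illustration. It relies on the standard `argparse` behavior of exiting with status 2 when an argument fails validation.

```python
import argparse

# Assumption: only the symptoms named in this document; the real list may differ.
KNOWN_SYMPTOMS = {"oom", "killed", "slow-gpu"}


def positive_int(text: str) -> int:
    """argparse type callback that rejects zero, negatives, and non-integers."""
    try:
        value = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"not an integer: {text!r}")
    if value < 1:
        raise argparse.ArgumentTypeError(f"must be >= 1, got {value}")
    return value


def split_symptoms(raw: str) -> tuple[list[str], list[str]]:
    """Separate recognized symptoms from custom ones kept for human review."""
    items = [s.strip() for s in raw.split(",") if s.strip()]
    known = [s for s in items if s in KNOWN_SYMPTOMS]
    custom = [s for s in items if s not in KNOWN_SYMPTOMS]
    return known, custom


def parse(argv: list[str]) -> argparse.Namespace:
    """Hypothetical parser covering a subset of the documented flags."""
    parser = argparse.ArgumentParser(prog="doctor-sketch")
    parser.add_argument("--nodes", type=positive_int, default=1)
    parser.add_argument("--tasks", type=positive_int, default=1)
    parser.add_argument("--symptoms", default="")
    return parser.parse_args(argv)  # exits with status 2 on invalid input


args = parse(["--nodes", "2", "--tasks", "128", "--symptoms", "oom,weird-hang"])
print(split_symptoms(args.symptoms))  # (['oom'], ['weird-hang'])
```

Here `weird-hang` is a made-up symptom: it is not rejected, only routed to the custom list.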

Limitations

This skill does not query a live scheduler. It diagnoses from the submitted layout and symptoms.

Security

  • Inputs are scalar CLI values and booleans only.
  • The script does not execute scheduler commands or inspect environment variables.
  • The skill uses Bash only to run its bundled script.

References

  • See references/hpc_runtime_patterns.md for scheduler and runtime diagnosis patterns.

Version History

  • 1.0.0: Initial HPC runtime diagnosis skill.

GitHub Repository

HeshamFS/materials-simulation-skills
Path: skills/hpc-deployment/hpc-runtime-doctor
agent-skills, agents, cli-tools, computational-science, llm, materials-science
