simulation-failure-triage
About
This skill helps developers triage failed materials simulations by diagnosing common issues like nonconvergence, NaN/Inf errors, and unstable timesteps. It proposes safe, defensible retry ladders and immediate actions for recovery. Use it when you encounter a suspicious or failed simulation and need a structured first response.
Quick Install
Claude Code
Recommendednpx skills add HeshamFS/materials-simulation-skills -a claude-code/plugin add https://github.com/HeshamFS/materials-simulation-skillsgit clone https://github.com/HeshamFS/materials-simulation-skills.git ~/.claude/skills/simulation-failure-triageCopy and paste this command in Claude Code to install this skill
Documentation
Simulation Failure Triage
Goal
Classify common simulation failure signatures and return immediate actions, retry ladders, and stop conditions.
Requirements
- Python 3.10+
- No external dependencies
- Works on Linux, macOS, and Windows
Inputs to Gather
| Input | Description | Example |
|---|---|---|
| Code | Simulation code | LAMMPS, VASP, MOOSE, QE |
| Stage | Setup, runtime, postprocess | runtime |
| Symptoms | Failure signs | nan,pressure-blowup |
| Log text or file | Error evidence | Lost atoms, ZBRENT |
| Recent change | Last modified setting | larger timestep |
Decision Guidance
- First preserve evidence: logs, inputs, executable version, and scheduler output.
- Separate setup errors from numerical instability and physical model issues.
- Retry with a single controlled change.
- Stop retrying when the result becomes scientifically meaningless or a required model input is missing.
Script Outputs
scripts/failure_triage.py emits:
likely_causesimmediate_actionsretry_ladderstop_conditionsevidence
Workflow
python3 skills/robustness/simulation-failure-triage/scripts/failure_triage.py \
--code LAMMPS \
--stage runtime \
--symptoms nan,pressure-blowup \
--recent-change "increased timestep" \
--json
Error Handling
Invalid stages or oversized log files stop with exit code 2. Unknown symptoms are retained as custom evidence.
Limitations
This skill gives first-response triage. It does not guarantee that a failed simulation can be repaired.
Security
- Log files are read with a 10 MB size cap.
- Log text is truncated and never executed.
- The script does not run external solvers.
- The skill uses
Bashonly to run its bundled script.
References
- See
references/failure_patterns.mdfor common failure signatures and retry ladders.
Version History
- 1.0.0: Initial cross-code simulation failure triage skill.
GitHub Repository
Related Skills
qmd
Developmentqmd is a local search and indexing CLI tool that enables developers to index and search through local files using hybrid search combining BM25, vector embeddings, and reranking. It supports both command-line usage and MCP (Model Context Protocol) mode for integration with Claude. The tool uses Ollama for embeddings and stores indexes locally, making it ideal for searching documentation or codebases directly from the terminal.
subagent-driven-development
DevelopmentThis skill executes implementation plans by dispatching a fresh subagent for each independent task, with code review between tasks. It enables fast iteration while maintaining quality gates through this review process. Use it when working on mostly independent tasks within the same session to ensure continuous progress with built-in quality checks.
mcporter
DevelopmentThe mcporter skill enables developers to manage and call Model Context Protocol (MCP) servers directly from Claude. It provides commands to list available servers, call their tools with arguments, and handle authentication and daemon lifecycle. Use this skill for integrating and testing MCP server functionality in your development workflow.
adk-deployment-specialist
DevelopmentThis skill deploys and orchestrates Vertex AI ADK agents using A2A protocol, managing AgentCard discovery, task submission, and supporting tools like Code Execution Sandbox and Memory Bank. It enables building multi-agent systems with sequential, parallel, or loop orchestration patterns in Python, Java, or Go. Use it when asked to deploy ADK agents or orchestrate agent workflows on Google Cloud.
