
simulation-failure-triage

HeshamFS
Updated 2 days ago

About

This skill helps developers triage failed materials simulations by diagnosing common issues like nonconvergence, NaN/Inf errors, and unstable timesteps. It proposes safe, defensible retry ladders and immediate actions for recovery. Use it when you encounter a suspicious or failed simulation and need a structured first response.

Quick Install

Claude Code

Primary (recommended)
npx skills add HeshamFS/materials-simulation-skills -a claude-code

Plugin command (alternative)
/plugin add https://github.com/HeshamFS/materials-simulation-skills

Git clone (alternative)
git clone https://github.com/HeshamFS/materials-simulation-skills.git ~/.claude/skills/simulation-failure-triage

Copy and paste this command in Claude Code to install this skill

Documentation

Simulation Failure Triage

Goal

Classify common simulation failure signatures and return immediate actions, retry ladders, and stop conditions.

Requirements

  • Python 3.10+
  • No external dependencies
  • Works on Linux, macOS, and Windows

Inputs to Gather

Input | Description | Example
--- | --- | ---
Code | Simulation code | LAMMPS, VASP, MOOSE, QE
Stage | Setup, runtime, postprocess | runtime
Symptoms | Failure signs | nan,pressure-blowup
Log text or file | Error evidence | Lost atoms, ZBRENT
Recent change | Last modified setting | larger timestep

Decision Guidance

  • First preserve evidence: logs, inputs, executable version, and scheduler output.
  • Separate setup errors from numerical instability and physical model issues.
  • Retry with a single controlled change.
  • Stop retrying when the result becomes scientifically meaningless or a required model input is missing.
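The "single controlled change" rule can be made concrete with a small sketch. It is illustrative only: the `Rung` class, the setting names, and the values are hypothetical and are not taken from `scripts/failure_triage.py` or from any solver's input format. Each rung changes exactly one setting relative to the previous attempt, so a recovery can be attributed to one cause.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rung:
    """One retry attempt: a single setting changes relative to the previous attempt."""
    setting: str
    value: object
    reason: str


def build_attempts(baseline: dict, ladder: list[Rung]) -> list[dict]:
    """Return the full settings for each retry, applying one change per rung."""
    attempts = []
    current = dict(baseline)
    for rung in ladder:
        if rung.setting not in current:
            raise KeyError(f"unknown setting: {rung.setting}")
        # Build a new dict so earlier attempts and the baseline stay untouched.
        current = {**current, rung.setting: rung.value}
        attempts.append(current)
    return attempts


def changed_keys(before: dict, after: dict) -> set[str]:
    """Names of the settings whose values differ between two attempts."""
    return {key for key in after if before.get(key) != after[key]}


# Hypothetical settings and values, for illustration only (not recommendations).
baseline = {"timestep_fs": 2.0, "neighbor_skin_A": 2.0}
ladder = [
    Rung("timestep_fs", 1.0, "undo the recent timestep increase"),
    Rung("neighbor_skin_A", 3.0, "second, independent change"),
]

attempts = build_attempts(baseline, ladder)
previous = baseline
for rung, attempt in zip(ladder, attempts):
    print(rung.reason, "->", sorted(changed_keys(previous, attempt)))
    previous = attempt
```

Keeping the changes cumulative but one per rung is a design choice in this sketch; a ladder that resets to the baseline before each rung would isolate causes even more strictly.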

Script Outputs

scripts/failure_triage.py emits:

  • likely_causes
  • immediate_actions
  • retry_ladder
  • stop_conditions
  • evidence
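A consumer of the report might validate it before acting on it. The sketch below assumes that the five fields listed above appear as top-level keys of a single JSON object; that layout is an assumption, not something the skill's documentation states. The sample payload is invented for illustration.

```python
import json

# Field names come from the list above; their placement at the top level
# of the JSON object is an assumption.
EXPECTED_KEYS = (
    "likely_causes",
    "immediate_actions",
    "retry_ladder",
    "stop_conditions",
    "evidence",
)


def load_report(raw: str) -> dict:
    """Parse a triage report and fail loudly if an expected field is absent."""
    report = json.loads(raw)
    if not isinstance(report, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in EXPECTED_KEYS if key not in report]
    if missing:
        raise ValueError(f"report is missing fields: {', '.join(missing)}")
    return report


# Hypothetical payload; the real script's values will differ.
sample = json.dumps({key: [] for key in EXPECTED_KEYS})
report = load_report(sample)
print(sorted(report))
```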

Workflow

python3 skills/robustness/simulation-failure-triage/scripts/failure_triage.py \
  --code LAMMPS \
  --stage runtime \
  --symptoms nan,pressure-blowup \
  --recent-change "increased timestep" \
  --json

Error Handling

An invalid stage or an oversized log file stops the script with exit code 2. Unknown symptoms are not rejected; they are retained as custom evidence.
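When the script is called from another program, exit code 2 can be separated from other failures. The wrapper below is a sketch under assumptions: `run_triage` and `TriageInputError` are hypothetical names, and it assumes the script prints JSON to stdout and an error message to stderr. To keep the example runnable without the skill installed, it uses stand-in commands; in practice `command` would be the invocation shown under Workflow.

```python
import json
import subprocess
import sys


class TriageInputError(Exception):
    """Raised when the command exits with code 2 (invalid stage or oversized log)."""


def run_triage(command: list[str]) -> dict:
    """Run a triage command and return its JSON output as a dict."""
    completed = subprocess.run(command, capture_output=True, text=True)
    if completed.returncode == 2:
        # Assumption: the error message, if any, is written to stderr.
        raise TriageInputError(completed.stderr.strip() or "invalid input (exit code 2)")
    completed.check_returncode()  # any other nonzero code raises CalledProcessError
    return json.loads(completed.stdout)


# Stand-in commands that imitate a successful run and an exit-code-2 run.
ok_cmd = [sys.executable, "-c", "import json; print(json.dumps({'likely_causes': []}))"]
bad_cmd = [sys.executable, "-c", "import sys; sys.stderr.write('invalid stage'); sys.exit(2)"]

report = run_triage(ok_cmd)
try:
    run_triage(bad_cmd)
    message = None
except TriageInputError as exc:
    message = str(exc)
print(report, message)
```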

Limitations

This skill gives first-response triage. It does not guarantee that a failed simulation can be repaired.

Security

  • Log files are read with a 10 MB size cap.
  • Log text is truncated and never executed.
  • The script does not run external solvers.
  • The skill uses Bash only to run its bundled script.
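The first two points can be sketched as a size check followed by truncation. This is not the skill's actual implementation: the function name, the character limit, the choice to keep the end of the log, and the reading of "10 MB" as 10 * 1024 * 1024 bytes are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

MAX_LOG_BYTES = 10 * 1024 * 1024  # "10 MB" from the text; the exact byte count is an assumption
MAX_EVIDENCE_CHARS = 2000         # hypothetical truncation length


def read_log_capped(path, max_bytes=MAX_LOG_BYTES, max_chars=MAX_EVIDENCE_CHARS) -> str:
    """Return at most max_chars from the end of a log, refusing oversized files."""
    path = Path(path)
    size = path.stat().st_size
    if size > max_bytes:
        # Check the size before reading so a huge file is never loaded into memory.
        raise ValueError(f"log is {size} bytes, above the {max_bytes}-byte cap")
    # The text is only ever treated as data: decoded, sliced, and returned.
    text = path.read_text(encoding="utf-8", errors="replace")
    return text[-max_chars:]


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "run.log"
    log.write_text("step ok\n" * 50 + "ERROR: Lost atoms\n", encoding="utf-8")
    tail = read_log_capped(log, max_chars=40)
    try:
        read_log_capped(log, max_bytes=16)
        rejected = False
    except ValueError:
        rejected = True

print(repr(tail), rejected)
```

Keeping the tail rather than the head reflects the common case where the failure message is near the end of the log; a real tool might keep both ends.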

References

  • See references/failure_patterns.md for common failure signatures and retry ladders.

Version History

  • 1.0.0: Initial cross-code simulation failure triage skill.

GitHub Repository

HeshamFS/materials-simulation-skills
Path: skills/robustness/simulation-failure-triage
Topics: agent-skills, agents, cli-tools, computational-science, llm, materials-science
