write-incident-runbook
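About
This Claude skill generates structured incident runbooks to standardize and improve response procedures. It produces documents containing diagnostic steps, resolution actions, escalation paths, and communication templates. Use it to reduce mean time to resolution (MTTR) for recurring alerts, train team members, and link alerts directly to resolution steps.
Quick Install
Claude Code
Recommended: npx skills add pjt222/agent-almanac -a claude-code
/plugin add https://github.com/pjt222/agent-almanac
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/write-incident-runbook
Copy and paste the command into Claude Code to install the skill.
Documentation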
Write Incident Runbook
Actionable runbooks → guide responders through incident diagnosis + resolution.
Use When
- Doc response procedures for recurring alerts|incidents
- Standardize response across on-call rotation
- Reduce MTTR via clear diagnostic steps
- Training for new team on incident handling
- Establish escalation paths + comm protocols
- Migrate tribal knowledge → written
- Link alerts → resolution (alert annotations)
In
- Required: Incident|alert name|desc
- Required: Historical incident data + resolution patterns
- Optional: Diagnostic queries (Prometheus, logs, traces)
- Optional: Escalation contacts + comm channels
- Optional: Prev incident post-mortems
Do
Step 1: Choose Template
See Extended Examples for complete template files.
Select per incident type + complexity.
Basic runbook template structure:
# [Alert/Incident Name] Runbook
## Overview | Severity | Symptoms
## Diagnostic Steps | Resolution Steps
## Escalation | Communication | Prevention | Related
Advanced SRE template (excerpt):
# [Service Name] - [Incident Type] Runbook
## Metadata
- Service, Owner, Severity, On-Call, Last Updated
## Diagnostic Phase
### Quick Health Check (< 5 min): Dashboard, error rate, deployments
### Detailed Investigation (5-20 min): Metrics, logs, traces, failure patterns
# ... (see EXAMPLES.md for complete template)
Key components:
- Metadata: Service ownership, severity, on-call rotation
- Diagnostic Phase: Quick checks → detailed → failure patterns
- Resolution: Immediate mitigation → root cause fix → verify
- Escalation: Criteria + contact paths
- Comm: Internal|external templates
- Prevention: Short|long-term actions
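The basic template structure can be turned into a skeleton generator so that every new runbook starts from the same sections. This is a minimal sketch: the section names come from the basic template above, while `render_skeleton` and the `TODO` placeholder are hypothetical names chosen for illustration.

```python
# Section names are taken from the basic template; everything else is illustrative.
BASIC_SECTIONS = [
    "Overview", "Severity", "Symptoms",
    "Diagnostic Steps", "Resolution Steps",
    "Escalation", "Communication", "Prevention", "Related",
]


def render_skeleton(alert_name: str, sections=BASIC_SECTIONS) -> str:
    """Return a Markdown runbook skeleton with one placeholder per section."""
    lines = [f"# {alert_name} Runbook", ""]
    for section in sections:
        lines += [f"## {section}", "TODO", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_skeleton("HighErrorRate"))
```

Starting from a generated skeleton makes missing sections visible as leftover `TODO` markers during review.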
Got: Template selected matches incident complexity, sections appropriate for service type.
If err:
- Start basic, iterate per incident patterns
- Review industry examples (Google SRE books, vendor runbooks)
- Adapt per team feedback after first use
Step 2: Diagnostic Procedures
See Extended Examples for complete diagnostic queries and decision trees.
Step-by-step investigation w/ specific queries.
6-step checklist:
- Verify Service Health: Health endpoint + uptime metrics
  curl -I https://api.example.com/health  # Expected: HTTP 200 OK
  up{job="api-service"}  # Expected: 1 for all instances
- Check Error Rate: Current % + breakdown by endpoint
  sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100  # Expected: < 1%
- Analyze Logs: Recent errs + top err msgs from Loki
  {job="api-service"} |= "error" | json | level="error"
- Resource Util: CPU, memory, conn pool status
  avg(rate(container_cpu_usage_seconds_total{pod=~"api-service.*"}[5m])) * 100  # Expected: < 70%
- Recent Changes: Deployments, git commits, infra changes
- Dependencies: Downstream service health, DB|API latency
Failure pattern decision tree (excerpt):
- Service down? → Check all pods|instances
- Error rate elevated? → Check specific err types (5xx, gateway, DB, timeouts)
- When started? → After deployment (rollback), gradual (resource leak), sudden (traffic|dep)
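The decision tree excerpt can be read as a small lookup from observations to next steps. The sketch below assumes three hypothetical onset labels (`after_deployment`, `gradual`, `sudden`) and a hypothetical function name `triage`; the suggested steps paraphrase the tree above.

```python
def triage(service_down: bool, error_rate_elevated: bool, onset: str) -> list[str]:
    """Map observations to next steps, following the excerpted decision tree.

    onset is one of "after_deployment", "gradual", "sudden" (hypothetical labels).
    """
    steps = []
    if service_down:
        steps.append("Check all pods/instances")
    if error_rate_elevated:
        steps.append("Check specific error types (5xx, gateway, DB, timeouts)")
    onset_hints = {
        "after_deployment": "Suspect the deployment: consider rollback",
        "gradual": "Suspect a resource leak",
        "sudden": "Suspect a traffic spike or dependency failure",
    }
    if onset in onset_hints:
        steps.append(onset_hints[onset])
    return steps


if __name__ == "__main__":
    for step in triage(False, True, "after_deployment"):
        print("-", step)
```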
Got: Diagnostic procedures specific, expected vs actual vals, guides responder.
If err:
- Test queries in actual monitoring before doc
- Screenshots of dashboards for visual ref
- "Common mistakes" section for missed steps
- Iterate per responder feedback
Step 3: Resolution Procedures
See Extended Examples for all 5 resolution options with full commands and rollback procedures.
Step-by-step remediation w/ rollback.
5 resolution options (brief):
- Rollback Deployment (fastest): For post-deployment errs
  kubectl rollout undo deployment/api-service
  Verify → Monitor → Confirm (err rate < 1%, latency normal, no alerts)
- Scale Up: High CPU|memory, conn pool exhaustion
  current=$(kubectl get deployment/api-service -o jsonpath='{.spec.replicas}')  # read current replica count
  kubectl scale deployment/api-service --replicas=$((current * 3/2))
- Restart Service: Memory leaks, stuck conns, cache corruption
  kubectl rollout restart deployment/api-service
- Feature Flag | Circuit Breaker: Specific feature errs|external dep failures
  kubectl set env deployment/api-service FEATURE_NAME=false
- DB Remediation: Conns, slow queries, pool exhaustion
  -- Kill long-running queries, restart connection pool, increase pool size
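The scale-up option multiplies the replica count by 3/2. Shell integer arithmetic truncates, so `$((current * 3/2))` yields 1 when `current` is 1 and the deployment does not grow. A minimal sketch of a rounding-up variant follows; `scale_up_replicas` and the `max_replicas` cap are hypothetical additions, not part of the original command.

```python
import math


def scale_up_replicas(current: int, factor: float = 1.5, max_replicas: int = 50) -> int:
    """Return a replica target of roughly current * factor, rounded up.

    max_replicas is a hypothetical safety cap, not part of the original command.
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    target = math.ceil(current * factor)
    # Always add at least one replica, but never exceed the cap.
    return min(max(target, current + 1), max_replicas)


if __name__ == "__main__":
    for n in (1, 3, 4, 40):
        print(n, "->", scale_up_replicas(n))
```

The result could then be passed to `kubectl scale deployment/api-service --replicas=<target>`.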
Universal verify checklist:
- Err rate < 1%
- Latency P99 < threshold
- Throughput at baseline
- Resource healthy (CPU < 70%, Memory < 80%)
- Deps healthy
- User-facing tests pass
- No active alerts
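The verify checklist can be expressed as a single function that reports which checks still fail. This is a sketch under stated assumptions: the metric keys are hypothetical, the P99 threshold is supplied by the caller because the checklist leaves it open, and "throughput at baseline" is interpreted here as within 10% of baseline.

```python
def failed_checks(m: dict, p99_threshold_ms: float) -> list[str]:
    """Return the names of verification checks that do not pass."""
    checks = {
        "error rate < 1%": m["error_rate_pct"] < 1.0,
        "P99 latency < threshold": m["p99_ms"] < p99_threshold_ms,
        # Assumption: "at baseline" means within 10% of the baseline rate.
        "throughput at baseline": abs(m["rps"] - m["baseline_rps"]) <= 0.1 * m["baseline_rps"],
        "CPU < 70%": m["cpu_pct"] < 70.0,
        "memory < 80%": m["memory_pct"] < 80.0,
        "dependencies healthy": m["deps_healthy"],
        "user-facing tests pass": m["smoke_tests_pass"],
        "no active alerts": m["active_alerts"] == 0,
    }
    return [name for name, ok in checks.items() if not ok]


if __name__ == "__main__":
    sample = {
        "error_rate_pct": 0.4, "p99_ms": 310.0, "rps": 980.0, "baseline_rps": 1000.0,
        "cpu_pct": 85.0, "memory_pct": 61.0, "deps_healthy": True,
        "smoke_tests_pass": True, "active_alerts": 0,
    }
    print(failed_checks(sample, p99_threshold_ms=500.0))
```

An empty result means every item on the checklist passes and the incident can move to monitoring.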
Rollback: Resolution worsens → pause|cancel → revert → reassess
Got: Resolution clear, verify checks, rollback options per action.
If err:
- Granular steps for complex
- Screenshots|diagrams for multi-step
- Doc cmd outs (expected vs actual)
- Separate runbook for complex resolution
Step 4: Escalation Paths
See Extended Examples for full escalation levels and contact directory template.
When + how to escalate.
Escalate immediately:
- Customer-facing outage > 15 min
- SLO err budget > 10% depleted
- Data loss|corruption|security breach suspected
- Can't ID root cause in 20 min
- Mitigation fails|worsens
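The immediate-escalation criteria above are simple enough to encode as a check that returns the reasons that currently apply. The `IncidentState` fields and the function name are hypothetical; the thresholds are the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    outage_min: float = 0.0              # minutes of customer-facing outage so far
    budget_depleted_pct: float = 0.0     # percent of SLO error budget consumed
    data_or_security_risk: bool = False  # data loss, corruption, or breach suspected
    investigation_min: float = 0.0       # minutes spent without a root cause
    mitigation_failed: bool = False      # mitigation failed or made things worse


def escalation_reasons(s: IncidentState) -> list[str]:
    """Return the immediate-escalation criteria that the incident currently meets."""
    reasons = []
    if s.outage_min > 15:
        reasons.append("customer-facing outage > 15 min")
    if s.budget_depleted_pct > 10:
        reasons.append("SLO error budget > 10% depleted")
    if s.data_or_security_risk:
        reasons.append("data loss, corruption, or security breach suspected")
    if s.investigation_min >= 20:
        reasons.append("root cause not identified in 20 min")
    if s.mitigation_failed:
        reasons.append("mitigation failed or worsened the incident")
    return reasons


if __name__ == "__main__":
    print(escalation_reasons(IncidentState(outage_min=16, mitigation_failed=True)))
```

Any non-empty result means escalate now; the list doubles as the opening line of the escalation message.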
5 escalation levels:
- Primary On-Call (5 min response): Deploy fixes, rollback, scale (up to 30 min solo)
- Secondary On-Call (auto after 15 min): Investigation support
- Team Lead (architectural): DB changes, vendor escalation, > 1 hour
- Incident Commander (cross-team): Multi teams, customer comms, > 2 hours
- Executive (C-level): Major impact (>50% users), SLA breach, media|PR, > 4 hours
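The time-based part of the five levels can be sketched as a threshold table. This covers elapsed time only; impact-based triggers such as an SLA breach or more than 50% of users affected would need separate handling. The names `TIME_THRESHOLDS` and `engaged_levels` are hypothetical.

```python
# (minutes elapsed, level) pairs derived from the time limits in the five levels.
TIME_THRESHOLDS = [
    (15, "Secondary On-Call"),
    (60, "Team Lead"),
    (120, "Incident Commander"),
    (240, "Executive"),
]


def engaged_levels(elapsed_min: float) -> list[str]:
    """Return every level that should be involved after elapsed_min minutes."""
    # Primary On-Call is engaged from the start of the incident.
    levels = ["Primary On-Call"]
    levels += [name for threshold, name in TIME_THRESHOLDS if elapsed_min > threshold]
    return levels


if __name__ == "__main__":
    for minutes in (10, 90, 300):
        print(minutes, engaged_levels(minutes))
```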
Process:
- Notify target: status, impact, actions taken, help needed, dashboard link
- Handoff: timeline, actions, access, remain available
- No silence: update every 15 min, ask questions, feedback
Contact directory: Table w/ role, Slack, phone, PagerDuty for:
- Platform|DB|Security|Network teams
- Incident Commander
- External vendors (AWS, DB vendor, CDN provider)
Got: Clear escalation criteria, contact info accessible, paths align w/ org.
If err:
- Validate contact current (test quarterly)
- Decision tree for when to escalate
- Examples of escalation msgs
- Doc response time per level
Step 5: Comm Templates
See Extended Examples for all internal and external templates with full formatting.
Pre-written msgs for incident updates.
Internal (Slack #incident-response):
- Initial Declaration:
  🚨 INCIDENT: [Title] | Severity: [Critical/High/Medium]
  Impact: [users/services] | Owner: @username | Dashboard: [link]
  Quick Summary: [1-2 sentences] | Next update: 15 min
- Progress Update (every 15-30 min):
  📊 UPDATE #N | Status: [Investigating/Mitigating/Monitoring]
  Actions: [what we tried and outcomes]
  Theory: [what we think is happening]
  Next: [planned actions]
- Mitigation Complete:
  ✅ MITIGATION | Metrics: Error [before→after], Latency [before→after]
  Root Cause: [brief or "investigating"] | Monitoring 30min before resolved
- Resolution:
  🎉 RESOLVED | Duration: [time] | Root Cause + Impact + Follow-up actions
- False Alarm: No impact, no follow-up
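Filling the initial declaration from code keeps the wording consistent under pressure. The sketch below is one possible rendering: the field names, the line breaks in the template, and the function name are assumptions, while the severity values and the 15-minute update interval come from the template above.

```python
ALLOWED_SEVERITIES = {"Critical", "High", "Medium"}

# Line breaks here are an assumption about how the message should be laid out.
INITIAL_DECLARATION = (
    "🚨 INCIDENT: {title} | Severity: {severity}\n"
    "Impact: {impact} | Owner: @{owner} | Dashboard: {dashboard}\n"
    "Quick Summary: {summary} | Next update: {next_update_min} min"
)


def initial_declaration(title, severity, impact, owner, dashboard, summary,
                        next_update_min=15):
    """Render the initial incident declaration, rejecting unknown severities."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(ALLOWED_SEVERITIES)}")
    return INITIAL_DECLARATION.format(
        title=title, severity=severity, impact=impact, owner=owner,
        dashboard=dashboard, summary=summary, next_update_min=next_update_min,
    )


if __name__ == "__main__":
    print(initial_declaration(
        "API 5xx spike", "High", "checkout users", "alice",
        "https://grafana.example.com/d/service-overview",
        "Error rate rose after the latest deployment.",
    ))
```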
External (status page):
- Initial: Investigating, started time, next update in 15 min
- Progress: ID'd cause (customer-friendly), implementing fix, est resolution
- Resolution: Resolved time, root cause (simple), duration, prevention
Customer email template: Timeline, impact, resolution, prevention, compensation (if applicable)
Got: Templates save time, consistent comm, reduce cognitive load on responders.
If err:
- Customize to company comm style
- Pre-fill w/ common incident types
- Slack workflow|bot to populate auto
- Review during retrospectives
Step 6: Link Runbook → Monitoring
See Extended Examples for complete Prometheus alert configuration and Grafana dashboard JSON.
Integrate w/ alerts + dashboards.
Add runbook links to Prometheus alerts:
- alert: HighErrorRate
  annotations:
    runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
    dashboard_url: "https://grafana.example.com/d/service-overview"
    incident_channel: "#incident-platform"
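A small lint step can catch alerts that ship without a runbook link. The sketch below works on rules that have already been loaded into Python dicts (for example from the rule file's YAML); treating `runbook_url` and `dashboard_url` as required is an assumption based on the annotations shown above, and `missing_annotations` is a hypothetical name.

```python
REQUIRED_ANNOTATIONS = ("runbook_url", "dashboard_url")


def missing_annotations(rules: list[dict]) -> dict[str, list[str]]:
    """Map alert name -> required annotation keys that are absent or empty."""
    problems = {}
    for rule in rules:
        annotations = rule.get("annotations") or {}
        missing = [key for key in REQUIRED_ANNOTATIONS if not annotations.get(key)]
        if missing:
            problems[rule.get("alert", "<unnamed>")] = missing
    return problems


if __name__ == "__main__":
    rules = [
        {"alert": "HighErrorRate", "annotations": {
            "runbook_url": "https://wiki.example.com/runbooks/high-error-rate",
            "dashboard_url": "https://grafana.example.com/d/service-overview",
        }},
        {"alert": "HighLatency", "annotations": {"dashboard_url": ""}},
    ]
    print(missing_annotations(rules))
```

Running a check like this in CI keeps new alerts from reaching on-call without a linked runbook.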
Embed quick diagnostic links in runbook:
- Service Overview Dashboard
- Error Rate Last 1h (Prometheus direct link)
- Recent Error Logs (Loki|Grafana Explore)
- Recent Deployments (GitHub|CI)
- PagerDuty Incidents
Grafana dashboard panel w/ runbook links (md panel listing all incident runbooks w/ on-call + escalation)
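Direct links with the query pre-filled can be built rather than hand-edited. The sketch below assumes the `/graph` path and the `g0.expr`, `g0.tab`, and `g0.range_input` parameters used by the Prometheus expression browser; confirm the URL shape against the Prometheus version in use. The host name and the function name are hypothetical.

```python
from urllib.parse import urlencode


def prometheus_graph_link(base_url: str, expr: str, range_input: str = "1h") -> str:
    """Build an expression-browser link with the query and time range pre-filled.

    The /graph path and g0.* parameter names are assumptions; verify them
    against your Prometheus version before embedding links in a runbook.
    """
    params = {"g0.expr": expr, "g0.tab": "0", "g0.range_input": range_input}
    return f"{base_url.rstrip('/')}/graph?{urlencode(params)}"


if __name__ == "__main__":
    error_rate = (
        'sum(rate(http_requests_total{status=~"5.."}[5m])) '
        "/ sum(rate(http_requests_total[5m])) * 100"
    )
    print(prometheus_graph_link("https://prometheus.example.com/", error_rate))
```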
Got: Responders access runbooks direct from alerts|dashboards, diagnostic queries pre-filled, one-click access.
If err:
- Verify URLs accessible w/o VPN|login
- URL shorteners for complex Grafana|Prometheus
- Test links quarterly → no break
- Browser bookmarks for frequent
Check
- Runbook follows consistent template
- Diagnostic procedures w/ specific queries + expected vals
- Resolution actionable w/ clear cmds
- Escalation criteria + contacts current
- Comm templates for internal + external
- Linked from monitoring alerts + dashboards
- Tested during incident sim or actual
- Responder feedback incorporated
- Revision history tracked w/ dates + authors
- Accessible w/o auth (or cached offline)
Traps
- Too generic: Vague "check the logs" w/o specific queries → not actionable. Be specific.
- Outdated: Refs old systems|cmds → useless. Review quarterly.
- No verify: Resolution w/o verify → false positives. Include "how to confirm fixed."
- Missing rollback: Every action → rollback plan. Don't trap responders in a worse state.
- Assume knowledge: Expert-only → excludes juniors. Write for least experienced on rotation.
- No ownership: No owners → stale. Assign team|person responsible.
- Hidden behind auth: Inaccessible during VPN|SSO issues → useless during crisis. Cache copies or public wiki.
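The "Outdated" and "No ownership" traps can be caught mechanically by flagging runbooks that have not been touched within a review interval. In this sketch "quarterly" is approximated as 90 days, and the mapping of runbook names to last-updated dates is a hypothetical input (for example, read from the Last Updated field in the metadata).

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # "quarterly", approximated as 90 days


def stale_runbooks(last_updated: dict[str, date], today: date) -> list[str]:
    """Return the names of runbooks not updated within the review interval."""
    return sorted(
        name for name, updated in last_updated.items()
        if today - updated > REVIEW_INTERVAL
    )


if __name__ == "__main__":
    inventory = {
        "high-error-rate": date(2024, 1, 1),
        "db-pool-exhaustion": date(2024, 5, 1),
    }
    print(stale_runbooks(inventory, today=date(2024, 6, 1)))
```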
→
- configure-alerting-rules: Link runbooks to alert annotations for immediate access
- build-grafana-dashboards: Embed runbook links in dashboards + diagnostic panels
- setup-prometheus-monitoring: Diagnostic queries from Prometheus in runbook procedures
- define-slo-sli-sla: Reference SLO impact in incident severity classification