
write-incident-runbook

Name: write-incident-runbook
Author: pjt222


Updated 1 month ago

9 views

Meta, word, ai

About

This Claude skill creates structured incident runbooks to standardize and improve response procedures. It generates documents with diagnostic steps, remediation actions, escalation paths, and communication templates. Use it to reduce MTTR for recurring alerts, train team members, and link alerts directly to remediation steps.

Quick install

Claude Code

Recommended

Primary

npx skills add pjt222/agent-almanac -a claude-code

Plugin command (alternative)

/plugin add https://github.com/pjt222/agent-almanac

Git clone (alternative)

git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/write-incident-runbook

Copy and paste this command into Claude Code to install this skill

Documentation

Write Incident Runbook

Actionable runbooks → guide responders through incident diagnosis + resolution.

Use When

Doc response procedures for recurring alerts|incidents
Standardize response across on-call rotation
Reduce MTTR via clear diagnostic steps
Training for new team on incident handling
Establish escalation paths + comm protocols
Migrate tribal knowledge → written
Link alerts → resolution (alert annotations)

In

Required: Incident|alert name|desc
Required: Historical incident data + resolution patterns
Optional: Diagnostic queries (Prometheus, logs, traces)
Optional: Escalation contacts + comm channels
Optional: Prev incident post-mortems

Do

Step 1: Choose Template

See Extended Examples for complete template files.

Select per incident type + complexity.

Basic runbook template structure:

# [Alert/Incident Name] Runbook
## Overview | Severity | Symptoms
## Diagnostic Steps | Resolution Steps
## Escalation | Communication | Prevention | Related

Advanced SRE template (excerpt):

# [Service Name] - [Incident Type] Runbook

## Metadata
- Service, Owner, Severity, On-Call, Last Updated

## Diagnostic Phase
### Quick Health Check (< 5 min): Dashboard, error rate, deployments
### Detailed Investigation (5-20 min): Metrics, logs, traces, failure patterns
# ... (see EXAMPLES.md for complete template)

Key components:

Metadata: Service ownership, severity, on-call rotation
Diagnostic Phase: Quick checks → detailed → failure patterns
Resolution: Immediate mitigation → root cause fix → verify
Escalation: Criteria + contact paths
Comm: Internal|external templates
Prevention: Short|long-term actions
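
The basic template can be generated instead of copied by hand. A minimal Python sketch, assuming a hypothetical helper `render_basic_runbook` (not part of the skill). Section names come from the basic template; giving each one its own heading and a `TODO` body is an assumption.

```python
# Section names are taken from the basic template structure.
# The helper and the one-heading-per-section layout are hypothetical.
BASIC_SECTIONS = [
    "Overview", "Severity", "Symptoms",
    "Diagnostic Steps", "Resolution Steps",
    "Escalation", "Communication", "Prevention", "Related",
]


def render_basic_runbook(alert_name: str) -> str:
    """Return an empty Markdown runbook skeleton for one alert or incident."""
    lines = [f"# {alert_name} Runbook", ""]
    for section in BASIC_SECTIONS:
        lines.append(f"## {section}")
        lines.append("TODO")  # placeholder for the author to fill in
        lines.append("")
    return "\n".join(lines)
```

Usage: `print(render_basic_runbook("HighErrorRate"))` writes a skeleton to stdout, ready to paste into a wiki page.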

Got: Template selected matches incident complexity, sections appropriate for service type.

If err:

Start basic, iterate per incident patterns
Review industry examples (Google SRE books, vendor runbooks)
Adapt per team feedback after first use

Step 2: Diagnostic Procedures

See Extended Examples for complete diagnostic queries and decision trees.

Step-by-step investigation w/ specific queries.

6-step checklist:

Verify Service Health: Health endpoint + uptime metrics

curl -I https://api.example.com/health  # Expected: HTTP 200 OK

up{job="api-service"}  # Expected: 1 for all instances

Check Error Rate: Current % + breakdown by endpoint

sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100  # Expected: < 1%

Analyze Logs: Recent errs + top err msgs from Loki

{job="api-service"} |= "error" | json | level="error"

Resource Util: CPU, memory, conn pool status

avg(rate(container_cpu_usage_seconds_total{pod=~"api-service.*"}[5m])) * 100
# Expected: < 70%

Recent Changes: Deployments, git commits, infra changes
Dependencies: Downstream service health, DB|API latency
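
The error-rate check in the checklist is plain arithmetic: 5xx requests divided by all requests, times 100, compared against the 1% expectation. A small Python sketch of the same calculation, assuming the two counts have already been fetched from monitoring (function names are hypothetical):

```python
def error_rate_percent(total_requests: float, server_errors: float) -> float:
    """Share of requests that returned 5xx, as a percentage."""
    if total_requests <= 0:
        return 0.0  # no traffic in the window: nothing to measure
    return server_errors / total_requests * 100


def is_error_rate_healthy(total_requests: float, server_errors: float,
                          threshold_percent: float = 1.0) -> bool:
    """True when the rate is below the checklist expectation (< 1% by default)."""
    return error_rate_percent(total_requests, server_errors) < threshold_percent
```

Writing the threshold next to the formula gives responders the "expected vs actual" comparison without mental math.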

Failure pattern decision tree (excerpt):

Service down? → Check all pods|instances
Error rate elevated? → Check specific err types (5xx, gateway, DB, timeouts)
When started? → After deployment (rollback), gradual (resource leak), sudden (traffic|dep)
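
The decision tree excerpt can be sketched as a lookup. This is an illustration only: the function name, the onset labels (`after_deployment`, `gradual`, `sudden`), and the order in which the branches are tested are assumptions, not part of the skill.

```python
def suggest_next_step(service_down: bool, error_rate_elevated: bool, onset: str) -> str:
    """Map the three questions of the decision tree excerpt to a next step."""
    if service_down:
        return "check all pods/instances"
    if not error_rate_elevated:
        return "no failure pattern matched; continue detailed investigation"
    # Onset labels are hypothetical names for the three "When started?" branches.
    by_onset = {
        "after_deployment": "rollback",
        "gradual": "suspect resource leak",
        "sudden": "check traffic spike or dependency failure",
    }
    return by_onset.get(onset, "check specific error types (5xx, gateway, DB, timeouts)")
```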

Got: Diagnostic procedures specific, expected vs actual vals, guides responder.

If err:

Test queries in actual monitoring before doc
Screenshots of dashboards for visual ref
"Common mistakes" section for missed steps
Iterate per responder feedback

Step 3: Resolution Procedures

See Extended Examples for all 5 resolution options with full commands and rollback procedures.

Step-by-step remediation w/ rollback.

5 resolution options (brief):

Rollback Deployment (fastest): For post-deployment errs
```
kubectl rollout undo deployment/api-service
```
Verify → Monitor → Confirm (err rate < 1%, latency normal, no alerts)

Scale Up: High CPU|memory, conn pool exhaustion

current=$(kubectl get deployment/api-service -o jsonpath='{.spec.replicas}')
kubectl scale deployment/api-service --replicas=$((current * 3/2))  # 1.5x, rounded down

Restart Service: Memory leaks, stuck conns, cache corruption
```
kubectl rollout restart deployment/api-service
```
Feature Flag | Circuit Breaker: Specific feature errs|external dep failures
```
kubectl set env deployment/api-service FEATURE_NAME=false
```

DB Remediation: Conns, slow queries, pool exhaustion

-- Kill long-running queries, restart connection pool, increase pool size
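
The Scale Up option multiplies the replica count by 3/2 using shell integer arithmetic, which rounds down. A Python sketch of the same arithmetic (helper name is hypothetical) shows the edge case:

```python
def scaled_replicas(current: int) -> int:
    """Mirror the shell expression $((current * 3/2)): integer math, rounds down."""
    if current < 1:
        raise ValueError("current replica count must be at least 1")
    return current * 3 // 2
```

With a single replica the result is still 1, so the command would not scale anything. A runbook for small deployments may want to state a minimum target explicitly.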

Universal verify checklist:

Rollback: Resolution worsens → pause|cancel → revert → reassess
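
The verify and rollback rules can be sketched as one decision after each resolution attempt. A minimal Python illustration, assuming the error rate is the signal for "worsens" and reusing the confirm criteria listed under Rollback Deployment (err rate < 1%, latency normal, no alerts). The function name and return labels are hypothetical.

```python
def next_action(before_error_rate: float, after_error_rate: float,
                latency_normal: bool, firing_alerts: int) -> str:
    """Decide what to do after applying a resolution step."""
    if after_error_rate > before_error_rate:
        return "revert"  # resolution made things worse: pause, revert, reassess
    if after_error_rate < 1.0 and latency_normal and firing_alerts == 0:
        return "confirm"  # all confirm criteria met
    return "monitor"  # not worse, not yet healthy: keep watching
```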

Got: Resolution clear, verify checks, rollback options per action.

If err:

Granular steps for complex
Screenshots|diagrams for multi-step
Doc cmd outs (expected vs actual)
Separate runbook for complex resolution

Step 4: Escalation Paths

See Extended Examples for full escalation levels and contact directory template.

When + how to escalate.

Escalate immediately:

Customer-facing outage > 15 min
SLO err budget > 10% depleted
Data loss|corruption|security breach suspected
Can't ID root cause in 20 min
Mitigation fails|worsens
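
The immediate-escalation criteria above can be sketched as a checklist function that returns every rule that fired. Parameter names are hypothetical; thresholds are the ones listed.

```python
from typing import List


def escalation_reasons(outage_minutes: float,
                       budget_depleted_percent: float,
                       data_or_security_risk: bool,
                       minutes_without_root_cause: float,
                       mitigation_failed: bool) -> List[str]:
    """Return the immediate-escalation criteria that currently apply."""
    reasons = []
    if outage_minutes > 15:
        reasons.append("customer-facing outage > 15 min")
    if budget_depleted_percent > 10:
        reasons.append("SLO error budget > 10% depleted")
    if data_or_security_risk:
        reasons.append("data loss, corruption, or security breach suspected")
    if minutes_without_root_cause >= 20:
        reasons.append("root cause not identified in 20 min")
    if mitigation_failed:
        reasons.append("mitigation failed or worsened the incident")
    return reasons
```

A non-empty result means escalate now. The list doubles as the "why" for the escalation message.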

5 escalation levels:

Primary On-Call (5 min response): Deploy fixes, rollback, scale (up to 30 min solo)
Secondary On-Call (auto after 15 min): Investigation support
Team Lead (architectural): DB changes, vendor escalation, > 1 hour
Incident Commander (cross-team): Multi teams, customer comms, > 2 hours
Executive (C-level): Major impact (>50% users), SLA breach, media|PR, > 4 hours
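
The time thresholds on the five levels can be sketched as a lookup by elapsed incident time. This is a simplification under stated assumptions: it treats "auto after 15 min" as elapsed time since the incident started, and it ignores the non-time triggers (architectural changes, cross-team impact, >50% users). The function name is hypothetical.

```python
# (minutes threshold, level) pairs, highest first; values from the level list.
LEVELS = [
    (240, "Executive"),
    (120, "Incident Commander"),
    (60, "Team Lead"),
    (15, "Secondary On-Call"),
]


def level_for_elapsed(minutes: float) -> str:
    """Return the highest escalation level whose time threshold has passed."""
    for threshold, level in LEVELS:
        if minutes > threshold:
            return level
    return "Primary On-Call"
```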

Process:

Notify target: status, impact, actions taken, help needed, dashboard link
Handoff: timeline, actions, access, remain available
No silence: update every 15 min, ask questions, feedback

Contact directory: Table w/ role, Slack, phone, PagerDuty for:

Platform|DB|Security|Network teams
Incident Commander
External vendors (AWS, DB vendor, CDN provider)

Got: Clear escalation criteria, contact info accessible, paths align w/ org.

If err:

Validate contact current (test quarterly)
Decision tree for when to escalate
Examples of escalation msgs
Doc response time per level

Step 5: Comm Templates

See Extended Examples for all internal and external templates with full formatting.

Pre-written msgs for incident updates.

Internal (Slack #incident-response):

Initial Declaration:

🚨 INCIDENT: [Title] | Severity: [Critical/High/Medium]
Impact: [users/services] | Owner: @username | Dashboard: [link]
Quick Summary: [1-2 sentences] | Next update: 15 min

Progress Update (every 15-30 min):

📊 UPDATE #N | Status: [Investigating/Mitigating/Monitoring]
Actions: [what we tried and outcomes]
Theory: [what we think is happening]
Next: [planned actions]

Mitigation Complete:

✅ MITIGATION | Metrics: Error [before→after], Latency [before→after]
Root Cause: [brief or "investigating"] | Monitoring 30min before resolved

Resolution:

🎉 RESOLVED | Duration: [time] | Root Cause + Impact + Follow-up actions

False Alarm: No impact, no follow-up
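
Pre-filling a template can be as simple as string formatting. A Python sketch for the Initial Declaration message, assuming a hypothetical helper `initial_declaration`; the field layout follows the template, and the severity check uses the three values it lists.

```python
ALLOWED_SEVERITIES = {"Critical", "High", "Medium"}


def initial_declaration(title: str, severity: str, impact: str, owner: str,
                        dashboard: str, summary: str,
                        next_update_min: int = 15) -> str:
    """Fill the Initial Declaration template with incident details."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(ALLOWED_SEVERITIES)}")
    return (
        f"🚨 INCIDENT: {title} | Severity: {severity}\n"
        f"Impact: {impact} | Owner: @{owner} | Dashboard: {dashboard}\n"
        f"Quick Summary: {summary} | Next update: {next_update_min} min"
    )
```

The returned text can be pasted into the incident channel or handed to whatever bot posts on the team's behalf.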

External (status page):

Initial: Investigating, started time, next update in 15 min
Progress: ID'd cause (customer-friendly), implementing fix, est resolution
Resolution: Resolved time, root cause (simple), duration, prevention

Customer email template: Timeline, impact, resolution, prevention, compensation (if applicable)

Got: Templates save time, consistent comm, reduce cognitive load on responders.

If err:

Customize to company comm style
Pre-fill w/ common incident types
Slack workflow|bot to populate auto
Review during retrospectives

Step 6: Link Runbook → Monitoring

See Extended Examples for complete Prometheus alert configuration and Grafana dashboard JSON.

Integrate w/ alerts + dashboards.

Add runbook links to Prometheus alerts:

- alert: HighErrorRate
  annotations:
    runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
    dashboard_url: "https://grafana.example.com/d/service-overview"
    incident_channel: "#incident-platform"
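
A quick lint can catch alerts that ship without runbook links. A Python sketch, assuming the alert rules have already been parsed into dictionaries (e.g. by a YAML loader, not shown). The helper name and the choice of required annotations are assumptions; the annotation keys are the ones in the example above.

```python
from typing import Dict, List

# Annotation keys taken from the alert example; which ones are mandatory is a team choice.
REQUIRED_ANNOTATIONS = ("runbook_url", "dashboard_url")


def alerts_missing_links(rules: List[dict]) -> Dict[str, List[str]]:
    """Map alert name -> required annotations that are absent or empty."""
    missing = {}
    for rule in rules:
        annotations = rule.get("annotations") or {}
        absent = [key for key in REQUIRED_ANNOTATIONS if not annotations.get(key)]
        if absent:
            missing[rule.get("alert", "<unnamed>")] = absent
    return missing
```

Running this in CI against the rules file is one way to keep alert-to-runbook links from going missing as alerts are added.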

Embed quick diagnostic links in runbook:

Service Overview Dashboard
Error Rate Last 1h (Prometheus direct link)
Recent Error Logs (Loki|Grafana Explore)
Recent Deployments (GitHub|CI)
PagerDuty Incidents

Grafana dashboard panel w/ runbook links (md panel listing all incident runbooks w/ on-call + escalation)

Got: Responders access runbooks direct from alerts|dashboards, diagnostic queries pre-filled, one-click access.

If err:

Verify URLs accessible w/o VPN|login
URL shorteners for complex Grafana|Prometheus
Test links quarterly → no break
Browser bookmarks for frequent

Check

Traps

Too generic: Vague "check the logs" w/o specific queries → not actionable. Specific.
Outdated: Refs old systems|cmds → useless. Quarterly review.
No verify: Resolution w/o verify → false positives. "How to confirm fixed."
Missing rollback: Every action → rollback plan. Don't trap responders in worse state.
Assume knowledge: Expert-only → excludes juniors. Write for least experienced on rotation.
No ownership: No owners → stale. Assign team|person responsible.
Hidden behind auth: Inaccessible during VPN|SSO issues → useless during crisis. Cache copies or public wiki.
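
Two of the traps above (Outdated, No ownership) are easy to check mechanically. A Python sketch, assuming each runbook records a last-updated date and an owner, as in the advanced template's Metadata section. The helper name is hypothetical and "quarterly" is approximated as 90 days.

```python
from datetime import date, timedelta
from typing import List, Optional

REVIEW_INTERVAL = timedelta(days=90)  # assumption: "quarterly" taken as 90 days


def runbook_problems(last_updated: date, owner: Optional[str], today: date) -> List[str]:
    """Return the traps this runbook currently falls into."""
    problems = []
    if today - last_updated > REVIEW_INTERVAL:
        problems.append("outdated: last review older than a quarter")
    if not owner:
        problems.append("no ownership: assign a team or person")
    return problems
```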

→

configure-alerting-rules — Link runbooks to alert annotations for immediate access
build-grafana-dashboards — Embed runbook links in dashboards + diagnostic panels
setup-prometheus-monitoring — Diagnostic queries from Prometheus in runbook procedures
define-slo-sli-sla — Reference SLO impact in incident severity classification

GitHub repository

pjt222/agent-almanac

Path: i18n/caveman-ultra/skills/write-incident-runbook

agents, agentskills, ai-assisted-development, claude-code, skills, teams

FAQ

Frequently asked questions

What is the write-incident-runbook skill?

write-incident-runbook is a Claude Skill by pjt222. Skills package instructions and resources that Claude loads on demand, so Claude can perform write-incident-runbook-related tasks without extra prompting.

How do I install write-incident-runbook?

Use the install commands on this page: add write-incident-runbook to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does write-incident-runbook belong to?

write-incident-runbook is in the Meta category, tagged word and ai.

Is write-incident-runbook free to use?

Yes. write-incident-runbook is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.

Similar skills

content-collections

Meta

This skill provides a production-tested setup for Content Collections, a TypeScript-first tool that transforms Markdown/MDX files into type-safe data collections with Zod validation. Use it when building blogs, documentation sites, or content-driven applications on Vite + React to ensure type safety and automatic content validation. It covers everything from Vite plugin setup and MDX compilation to deployment optimization and schema validation.


polymarket

Meta

This skill lets developers build applications on the Polymarket prediction-market platform, including API integration for trading and market data. It also provides real-time data streaming over WebSocket for tracking live trades and market activity. Use it to implement trading strategies or build tools that process real-time market updates.


creating-opencode-plugins

Meta

This skill helps developers build OpenCode plugins that hook into more than 25 event types, such as commands, files, and LSP operations. It provides the plugin structure, event API specifications, and implementation patterns for JavaScript/TypeScript modules. Use it when you need to intercept, monitor, or extend the OpenCode AI assistant lifecycle with custom event-driven logic.


sglang

Meta

SGLang is a high-performance serving framework for large language models (LLMs), specializing in fast structured generation of JSON, regex, and agent workflows using RadixAttention prefix caching. It delivers significantly faster inference, especially for tasks with repeated prefixes, making it well suited to complex structured outputs and multi-turn conversations. Choose SGLang over alternatives such as vLLM when you need constrained decoding or are building applications with heavy prefix sharing.
