SKILL·63D042

incident-response

Name: incident-response
Author: rampstackco

Updated 1 month ago

Other / general

About

This Claude Skill manages active production incidents from detection to resolution, providing structured guidance for triage, mitigation, and communication. It triggers on terms such as outage or P0/P1, or when something goes wrong in production, and offers tool-agnostic support for incident commanders and on-call responders. Use it for active incidents, not for post-mortem analysis or planned launches.

Quick install

Claude Code

Recommended

Primary

npx skills add rampstackco/claude-skills -a claude-code

Plugin command (alternative)

/plugin add https://github.com/rampstackco/claude-skills

Git clone (alternative)

git clone https://github.com/rampstackco/claude-skills.git ~/.claude/skills/incident-response

Copy and paste this command into Claude Code to install this skill

Documentation

Incident Response

Manage active production incidents from detection to resolution. Stack-agnostic. Tool-agnostic.

This skill is for active incidents and the incident response process itself. For after-the-fact analysis, use after-action-report. For planned launches, use launch-runbook.

When to use

An active incident is happening
Building incident response procedures
Defining severity levels
Setting up on-call rotations
Training a team on incident response

When NOT to use

Post-incident retrospective (use after-action-report)
Planned launches (use launch-runbook)
Pre-launch issue triage (use qa-testing)

Required inputs

Awareness of the incident (alert, customer report, internal observation)
Access to production systems and monitoring
Roles and authorities clearly defined
Communication channels operational

The framework: 5 phases

1. Detection

How the incident becomes known.

Detection sources:

Automated alerts (monitoring, SLO violations, error rate spikes)
Customer reports (support tickets, social media, status page subscribers)
Internal observation (engineer notices something off)
Third-party (security researchers, partners)

On detection:

Acknowledge within target time (typically 5 to 15 minutes for critical)
Assess severity (see severity rubric below)
Page the on-call if not already paged
Open the incident channel
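
As a rough illustration of the acknowledgement target, the sketch below checks whether an alert missed its window. The 15-minute default is simply the upper end of the range mentioned above; the function name `ack_breached` and its parameters are hypothetical and not part of the skill.

```python
from datetime import datetime, timedelta
from typing import Optional


def ack_breached(alerted_at: datetime,
                 acked_at: Optional[datetime],
                 now: datetime,
                 target: timedelta = timedelta(minutes=15)) -> bool:
    """Return True if the alert missed its acknowledgement target.

    An alert that is still unacknowledged counts as breached once
    `now` has passed the deadline.
    """
    deadline = alerted_at + target
    effective = acked_at if acked_at is not None else now
    return effective > deadline
```

A check like this could feed an escalation step (for example, paging a secondary on-call), though how escalation works depends on the paging tool in use.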

2. Triage

Establish severity and impact.

Severity rubric:

Severity	Definition	Response
SEV-1 (Critical)	Major customer-facing functionality broken. Data integrity at risk. Security breach.	All-hands. Incident commander. Active war room. Public communication required.
SEV-2 (Major)	Significant degradation. Some customers affected. Revenue impact.	Incident commander assigned. Active response. Internal communication. May or may not need public communication.
SEV-3 (Minor)	Limited impact. Workaround available. Affecting a small group of users.	Standard on-call response. Single owner.
SEV-4 (Low)	Cosmetic, edge-case, or low-frequency. No urgent action needed.	Tracked as bug. Addressed in normal queue.

Severity can change. Re-evaluate as more info emerges.
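
One way to make the rubric repeatable is to encode it as a small decision function. The sketch below is an assumption-laden illustration: the `Impact` field names are hypothetical yes/no triage answers, and it assumes that any single condition in a row is enough to qualify for that severity.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # Critical
    SEV2 = 2  # Major
    SEV3 = 3  # Minor
    SEV4 = 4  # Low


@dataclass
class Impact:
    # Hypothetical yes/no answers a responder gives at triage.
    major_functionality_broken: bool = False
    data_integrity_at_risk: bool = False
    security_breach: bool = False
    significant_degradation: bool = False
    revenue_impact: bool = False
    limited_impact_with_workaround: bool = False


def assess_severity(impact: Impact) -> Severity:
    """Map triage answers to the rubric, checking the most severe rows first."""
    if (impact.major_functionality_broken
            or impact.data_integrity_at_risk
            or impact.security_breach):
        return Severity.SEV1
    if impact.significant_degradation or impact.revenue_impact:
        return Severity.SEV2
    if impact.limited_impact_with_workaround:
        return Severity.SEV3
    return Severity.SEV4
```

Because severity can change, the function is meant to be re-run whenever the triage answers are updated, not called once at detection.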

3. Mitigation

Stop the bleeding before fixing the cause.

Mitigation patterns (faster than full fix):

Rollback (revert recent deploy)
Feature flag off (disable the broken feature without deploy)
Failover (route to healthy replica or region)
Scale up (more capacity to absorb the load)
Throttle (reject some traffic to protect the rest)
Graceful degradation (turn off non-essential features to keep core functional)
Maintenance mode (last resort, blocks all users)

Mitigation principle: Stop user impact first. Cause analysis second.
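
To make the throttle pattern concrete, here is a minimal load-shedding sketch. It is one possible approach, not something the skill prescribes: `should_shed`, `request_key`, and `shed_percent` are hypothetical names, and hashing the caller key is an illustrative design choice.

```python
import zlib


def should_shed(request_key: str, shed_percent: int) -> bool:
    """Return True if this request should be rejected.

    Hashing the key keeps the decision stable per caller, so the same
    callers are consistently served or rejected while throttling is on.
    """
    if not 0 <= shed_percent <= 100:
        raise ValueError("shed_percent must be between 0 and 100")
    bucket = zlib.crc32(request_key.encode("utf-8")) % 100
    return bucket < shed_percent
```

Raising `shed_percent` gradually lets responders protect the remaining traffic without jumping straight to maintenance mode.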

4. Communication

Three audiences during an incident:

Internal team:

Real-time updates in incident channel
Cadence: every 15 minutes minimum during active incident
Format: timestamped status updates with what we know, what we're doing, ETA

Internal stakeholders:

Higher-level updates to broader org
Cadence: every 30 to 60 minutes
Format: business-impact framing, not technical detail

External / customers:

Status page updates
Cadence: every 30 minutes minimum during active incident
Format: plain language, no blame, what users are experiencing, what to expect

Communication principles:

Acknowledge before you have answers ("We're aware and investigating")
Update on schedule even if no progress ("Still investigating, no new information")
Never speculate publicly about cause
Confirm resolution explicitly when restored
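
The cadences listed for the three audiences can be tracked with a small helper that reports who is overdue for an update. This is a sketch under stated assumptions: the audience keys and the function name are hypothetical, and for internal stakeholders it uses the upper end (60 minutes) of the 30 to 60 minute range.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Maximum gap between updates per audience, taken from the cadences above.
MAX_GAP: Dict[str, timedelta] = {
    "internal_team": timedelta(minutes=15),
    "stakeholders": timedelta(minutes=60),  # assumed: upper end of the range
    "customers": timedelta(minutes=30),
}


def overdue_audiences(last_update: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the audiences whose last update is older than their allowed gap."""
    return sorted(
        audience
        for audience, gap in MAX_GAP.items()
        if now - last_update[audience] > gap
    )
```

A communications lead could run a check like this on a timer, posting "Still investigating, no new information" when an audience comes due and nothing has changed.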

5. Resolution

Verified fix, customers restored, incident closed.

Resolution criteria:

Mitigation in place and verified
Root cause identified (or explicitly deferred to the after-action report, AAR)
All affected systems back to normal
Customers can resume normal use
Final status update posted (internal and external)
Incident channel can be closed (or archived for AAR)
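
The resolution criteria can be treated as a gate before closing the incident. The sketch below is illustrative only: the criterion keys are hypothetical names for the first five items above, and closing the channel is treated as the outcome of the gate, not as an input.

```python
from typing import Dict, List

# Hypothetical keys for the first five resolution criteria above.
RESOLUTION_CRITERIA = (
    "mitigation_verified",
    "root_cause_identified_or_deferred",
    "systems_back_to_normal",
    "customers_can_resume",
    "final_status_update_posted",
)


def unmet_criteria(checks: Dict[str, bool]) -> List[str]:
    """Return criteria not yet confirmed; an empty list means ready to close."""
    return [name for name in RESOLUTION_CRITERIA if not checks.get(name, False)]
```

Criteria that are missing from `checks` count as unmet, which guards against announcing resolution before every item has been explicitly confirmed.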

After closure:

Schedule AAR within 1 to 2 weeks
Capture initial timeline while memories are fresh
Track follow-up action items

Roles during an incident

Role	Responsibility
Incident commander (IC)	Owns the response. Calls decisions. Assigns work. Not necessarily the most technical person; needs to coordinate.
Communications lead	Owns internal and external messaging. Reduces IC's communication burden.
Operations lead	Drives the technical investigation and mitigation. Often the most senior on-call engineer.
Scribe	Captures the timeline as the incident unfolds. Critical for AAR.
Subject matter experts	Pulled in as needed. Service owners, database experts, security experts.

For small teams or low-severity incidents, one person can hold multiple roles. Each role's responsibilities should still be explicit.

Decision-making during an incident

The IC's authority:

Call rollback or other mitigations
Pull additional people in
Escalate severity
Make the call when the options are unclear

Non-decisions to avoid:

"Let's wait and see" when mitigations are available and impact is occurring
Discussing root cause while users are actively impacted (mitigate first)
Premature resolution announcements before verification
Death-by-committee (pull in lots of people, no one decides)

When in doubt: act. A wrong action that can be rolled back beats inaction while users suffer.

Status page communication patterns

Initial:

"We are investigating reports of [issue]. Updates to follow."

Identified:

"We have identified the issue affecting [scope]. Engineers are working on a fix. Next update by [time]."

Monitoring:

"A fix has been applied. We are monitoring to confirm resolution. Next update by [time]."

Resolved:

"This incident has been resolved. Service has been restored. A full incident report will be posted within [timeframe]."

Patterns to avoid:

Vague language ("experiencing some issues" - what kind?)
Missing affected scope ("login is down" - everywhere or just one region?)
Missing time commitments
"Should be resolved soon" without verification
Using "back up" before verification
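
The four status page messages above lend themselves to simple templating. In this sketch the stage keys and the placeholder names (`issue`, `scope`, `next_update`, `timeframe`) are assumptions chosen to mirror the bracketed slots; the helper name `render_status` is hypothetical.

```python
STATUS_TEMPLATES = {
    "initial": "We are investigating reports of {issue}. Updates to follow.",
    "identified": ("We have identified the issue affecting {scope}. "
                   "Engineers are working on a fix. Next update by {next_update}."),
    "monitoring": ("A fix has been applied. We are monitoring to confirm "
                   "resolution. Next update by {next_update}."),
    "resolved": ("This incident has been resolved. Service has been restored. "
                 "A full incident report will be posted within {timeframe}."),
}


def render_status(stage: str, **fields: str) -> str:
    """Fill in a template for the given stage.

    A missing field raises KeyError, so a message cannot be published
    with its scope or time commitment left out.
    """
    return STATUS_TEMPLATES[stage].format(**fields)
```

Failing loudly on a missing field addresses two of the patterns to avoid: missing affected scope and missing time commitments.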

Workflow

Acknowledge. First responder acknowledges within target time.
Assess severity. Use the rubric. Open the appropriate response channel.
Assign roles. IC, comms, ops at minimum.
Communicate. Initial status update. Internal channel active.
Investigate. Logs, metrics, recent changes. The four most common causes: a recent deploy, a configuration change, a third-party dependency change, a load spike.
Mitigate. Stop the bleeding. Don't wait for full root cause.
Verify mitigation. Don't trust dashboards alone; test the user flow.
Communicate resolution. Internal and external.
Close incident. Final timeline noted. Action items tracked.
Schedule AAR. Within 1 to 2 weeks.
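
For the investigation step, a common first move is to list what changed shortly before detection. The sketch below assumes a hypothetical `Change` record and a 2-hour lookback window; neither comes from the skill, and the window should be tuned to the team's deploy frequency.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Change(NamedTuple):
    at: datetime
    kind: str      # e.g. "deploy", "config", "dependency"
    summary: str


def suspect_changes(changes: List[Change],
                    detected_at: datetime,
                    lookback: timedelta = timedelta(hours=2)) -> List[Change]:
    """Return changes made within `lookback` before detection, newest first."""
    window_start = detected_at - lookback
    recent = [c for c in changes if window_start <= c.at <= detected_at]
    return sorted(recent, key=lambda c: c.at, reverse=True)
```

Listing the newest change first matches the mitigation principle: the most recent deploy or configuration change is usually the cheapest thing to roll back while the cause is still unknown.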

Failure patterns

No clear IC. Multiple people debugging in parallel, no coordination. Slower to mitigate, easier to make conflicting changes.
Skipping mitigation, going straight to root cause. Users keep suffering while engineers debug.
Premature "all clear." Announcing resolution before verification.
Communication silence. Users don't know if anyone is working on it.
Status updates too vague. "We're working on it" with no detail.
Speculating publicly about cause. Often wrong, always damaging trust.
Pulling in too many people. Coordination overhead exceeds value.
No scribe. The timeline gets lost. AAR has to reconstruct from chat logs.
Skipping AAR for "minor" incidents. Patterns get missed. Lessons get re-learned.
Blame culture. People hide mistakes, incidents take longer.

Output format

During an active incident: incident channel updates and status page updates as per the framework above.

After incident close: a brief incident summary feeding into the AAR.

# Incident: [Brief title]

**Date:** [YYYY-MM-DD]
**Severity:** [SEV-1 / 2 / 3 / 4]
**Duration:** [Detection to resolution]
**Customer impact:** [Who, how many, how]

## Summary
[1 to 2 paragraphs]

## Timeline
[Timestamped events]

## Mitigation
[What was done]

## Action items
[Follow-ups, with owners]

## AAR scheduled for
[Date]
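
The Duration and Timeline fields of the summary can be filled from the scribe's notes. This is a minimal sketch: the function names are hypothetical, and it assumes events are recorded as (timestamp, text) pairs within a single day, since only hours and minutes are printed.

```python
from datetime import datetime
from typing import List, Tuple


def render_timeline(events: List[Tuple[datetime, str]]) -> str:
    """Render events as timestamped lines in chronological order."""
    ordered = sorted(events, key=lambda event: event[0])
    return "\n".join(f"{ts:%H:%M} {text}" for ts, text in ordered)


def duration_minutes(detected_at: datetime, resolved_at: datetime) -> int:
    """Whole minutes from detection to resolution."""
    return int((resolved_at - detected_at).total_seconds() // 60)
```

Sorting inside `render_timeline` means the scribe can append notes in whatever order they arrive during the incident.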

Reference files

references/incident-playbook.md - Severity definitions, roles, status page templates, decision rubrics.

GitHub repository

rampstackco/claude-skills

Path: skills/incident-response

Tags: agent-skills, ai-agents, anthropic, claude, claude-ai, claude-code

Frequently asked questions

What is the incident-response skill?

incident-response is a Claude Skill by rampstackco. Skills package instructions and resources that Claude loads on demand, so Claude can perform incident-response-related tasks without extra prompting.

How do I install incident-response?

Use the install commands on this page: add incident-response to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does incident-response belong to?

incident-response is in the operations category, tagged general.

Is incident-response free to use?

Yes. incident-response is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.
