chaos-engineering-resilience
About
This skill applies chaos engineering principles to test distributed systems by injecting controlled failures such as network faults or instance outages. It helps validate fault tolerance and disaster recovery by measuring system behavior against defined steady-state metrics. Use it when you need to build confidence in fault tolerance or verify resilience before production.
Documentation
Chaos Engineering & Resilience Testing
<default_to_action> When testing system resilience or injecting failures:
- DEFINE steady state (normal metrics: error rate, latency, throughput)
- HYPOTHESIZE system continues in steady state during failure
- INJECT real-world failures (network, instance, disk, CPU)
- OBSERVE and measure deviation from steady state
- FIX weaknesses discovered, document runbooks, repeat
Quick Chaos Steps:
- Start small: Dev → Staging → 1% prod → gradual rollout
- Define clear rollback triggers (error_rate > 5%)
- Measure blast radius, never exceed planned scope
- Document findings → runbooks → improved resilience
Critical Success Factors:
- Controlled experiments with automatic rollback
- Steady state must be measurable
- Start in non-production, graduate to production </default_to_action>
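The define, hypothesize, inject, observe loop above can be sketched as a small harness. This is a minimal illustration, not the skill's actual implementation: `runExperiment` and every callback it takes (`readMetrics`, `inject`, `rollback`, `isSteady`, `shouldAbort`) are hypothetical names that you would wire to your own monitoring and fault-injection tooling.

```javascript
// Hypothetical harness: all names here are illustrative assumptions.
async function runExperiment({ readMetrics, inject, rollback, isSteady, shouldAbort, samples }) {
  // DEFINE: confirm the system is in steady state before injecting anything.
  const baseline = await readMetrics();
  if (!isSteady(baseline)) {
    return { status: 'skipped', reason: 'system not in steady state', observations: [] };
  }

  const observations = [];
  let status = 'hypothesis-held';

  // INJECT: start the failure.
  await inject();
  try {
    // OBSERVE: sample metrics while the failure is active.
    for (let i = 0; i < samples; i++) {
      const metrics = await readMetrics();
      observations.push(metrics);
      if (shouldAbort(metrics)) {
        // Rollback trigger hit: stop observing immediately.
        status = 'aborted';
        break;
      }
      if (!isSteady(metrics)) {
        status = 'hypothesis-failed';
      }
    }
  } finally {
    // Always remove the failure, even if observation throws.
    await rollback();
  }
  return { status, observations };
}
```

Putting `rollback` in a `finally` block is the design choice that matters here: the injected failure is removed whether the run completes, aborts, or throws.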
Quick Reference Card
When to Use
- Distributed systems validation
- Disaster recovery testing
- Building confidence in fault tolerance
- Pre-production resilience verification
Failure Types to Inject
| Category | Failures | Tools |
|---|---|---|
| Network | Latency, packet loss, partition | tc, toxiproxy |
| Infrastructure | Instance termination, disk failure, CPU exhaustion | Chaos Monkey |
| Application | Exceptions, slow responses, memory leaks | Gremlin, LitmusChaos |
| Dependencies | Service outage, timeout | WireMock |
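As one concrete case from the Network row, latency and packet loss can be injected on Linux with `tc` and its `netem` queueing discipline. The helper below only builds the command strings and does not run them. The function name `netemCommands` and the device name `eth0` are assumptions for illustration, and the commands require root on the target host.

```javascript
// Hypothetical helper: builds tc/netem command strings, does not execute them.
function netemCommands({ device, delayMs, lossPercent }) {
  // Reject anything that is not a plain interface name before it reaches a shell.
  if (!/^[A-Za-z0-9_.-]+$/.test(device)) {
    throw new Error(`invalid device name: ${device}`);
  }
  const parts = [];
  if (delayMs > 0) parts.push(`delay ${delayMs}ms`);
  if (lossPercent > 0) parts.push(`loss ${lossPercent}%`);
  if (parts.length === 0) {
    throw new Error('specify delayMs and/or lossPercent');
  }
  return {
    // Attach a netem qdisc that adds the requested delay and/or loss.
    inject: `tc qdisc add dev ${device} root netem ${parts.join(' ')}`,
    // Remove the root qdisc again to restore normal traffic.
    rollback: `tc qdisc del dev ${device} root`,
  };
}

const cmds = netemCommands({ device: 'eth0', delayMs: 500, lossPercent: 1 });
console.log(cmds.inject);
console.log(cmds.rollback);
```

Returning the rollback command together with the inject command keeps the two paired, so an experiment never starts without a known way to undo it.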
Blast Radius Progression
Dev (safe, learn) → Staging (validate) → 1% prod (careful) → 10% → 50% → 100% (full confidence)
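One way to encode this progression is a fixed list of stages that an experiment advances through only after it passes at the current scope. The stage labels and the helpers `nextStage` and `targetCount` are hypothetical. Rounding down is an assumption chosen so that the number of affected instances never exceeds the planned percentage.

```javascript
// Hypothetical stage labels mirroring the progression described above.
const STAGES = ['dev', 'staging', 'prod-1%', 'prod-10%', 'prod-50%', 'prod-100%'];

// Advance one stage only when the experiment passed; otherwise stay put
// so the weakness can be fixed and the experiment repeated at the same scope.
function nextStage(current, experimentPassed) {
  const i = STAGES.indexOf(current);
  if (i === -1) throw new Error(`unknown stage: ${current}`);
  if (!experimentPassed) return current;
  return STAGES[Math.min(i + 1, STAGES.length - 1)];
}

// Number of instances to target for a given blast radius.
// Rounds down, so a result of 0 means the fleet is too small for that stage.
function targetCount(totalInstances, percent) {
  if (percent < 0 || percent > 100) throw new Error(`invalid percent: ${percent}`);
  return Math.floor((totalInstances * percent) / 100);
}

console.log(nextStage('staging', true));
console.log(targetCount(250, 10));
```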
Steady State Metrics
| Metric | Normal | Alert Threshold |
|---|---|---|
| Error rate | < 0.1% | > 1% |
| p99 latency | < 200ms | > 500ms |
| Throughput | baseline | > 20% below baseline |
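The alert thresholds in the table can be checked mechanically. The sketch below assumes error rate is expressed as a fraction (1% is `0.01`), latency in milliseconds, and the throughput alert as a drop of more than 20% below a measured baseline. The function and field names are illustrative, not part of the skill.

```javascript
// Alert thresholds taken from the table above; field names are assumptions.
const ALERT_THRESHOLDS = {
  errorRate: 0.01,        // alert above 1%
  p99LatencyMs: 500,      // alert above 500 ms
  throughputDrop: 0.2,    // alert when more than 20% below baseline
};

// Returns the names of the metrics that breach their alert threshold.
function findViolations(baseline, current, thresholds = ALERT_THRESHOLDS) {
  const violations = [];
  if (current.errorRate > thresholds.errorRate) {
    violations.push('errorRate');
  }
  if (current.p99LatencyMs > thresholds.p99LatencyMs) {
    violations.push('p99Latency');
  }
  const minThroughput = baseline.throughput * (1 - thresholds.throughputDrop);
  if (current.throughput < minThroughput) {
    violations.push('throughput');
  }
  return violations;
}

const baseline = { throughput: 1000 };
const current = { errorRate: 0.002, p99LatencyMs: 620, throughput: 790 };
console.log(findViolations(baseline, current));
```

An empty array means the system is still within its alert thresholds. Any non-empty result is a measurable deviation to record in the experiment findings.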
Chaos Experiment Structure
// Chaos experiment definition
const experiment = {
  name: 'Database latency injection',
  hypothesis: 'System handles 500ms DB latency gracefully',
  steadyState: {
    errorRate: '< 0.1%',
    p99Latency: '< 300ms'
  },
  method: {
    type: 'network-latency',
    target: 'database',
    delay: '500ms',
    duration: '5m'
  },
  rollback: {
    automatic: true,
    trigger: 'errorRate > 5%'
  }
};
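The `rollback.trigger` string in the definition above has to be evaluated against live metrics at some point. How the skill's tooling does this is not shown here, so the following is only a sketch of one possible approach: parse expressions of the form `metric operator number[%]` and compare them with a metrics object whose values are fractions. `parseTrigger` and `shouldRollback` are hypothetical names.

```javascript
// Hypothetical parser for trigger strings such as 'errorRate > 5%'.
function parseTrigger(expr) {
  const match = /^\s*(\w+)\s*(>=|<=|>|<)\s*(\d+(?:\.\d+)?)\s*(%?)\s*$/.exec(expr);
  if (!match) throw new Error(`cannot parse trigger: ${expr}`);
  const [, metric, op, num, pct] = match;
  // A trailing '%' means the number is a percentage; convert it to a fraction.
  const value = pct ? Number(num) / 100 : Number(num);
  return { metric, op, value };
}

const COMPARE = {
  '>': (a, b) => a > b,
  '<': (a, b) => a < b,
  '>=': (a, b) => a >= b,
  '<=': (a, b) => a <= b,
};

// Returns true when the current metrics satisfy the rollback trigger.
function shouldRollback(expr, metrics) {
  const { metric, op, value } = parseTrigger(expr);
  if (!(metric in metrics)) throw new Error(`no reading for metric: ${metric}`);
  return COMPARE[op](metrics[metric], value);
}

console.log(shouldRollback('errorRate > 5%', { errorRate: 0.08 }));
console.log(shouldRollback('errorRate > 5%', { errorRate: 0.02 }));
```

Throwing on a missing metric is deliberate: an automatic rollback that silently evaluates to "no rollback" because a reading is absent would defeat its purpose.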
Agent-Driven Chaos
// qe-chaos-engineer runs controlled experiments
await Task("Chaos Experiment", {
  target: 'payment-service',
  failure: 'terminate-random-instance',
  blastRadius: '10%',
  duration: '5m',
  steadyStateHypothesis: {
    metric: 'success-rate',
    threshold: 0.99
  },
  autoRollback: true
}, "qe-chaos-engineer");
// Validates:
// - System recovers automatically
// - Error rate stays within threshold
// - No data loss
// - Alerts triggered appropriately
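To make the `steadyStateHypothesis` above concrete, the check "success-rate stays at or above 0.99" can be computed from request counts gathered during the experiment. This is an illustrative sketch only. `evaluateHypothesis` and the `succeeded`/`failed` counters are assumed names, and the agent's real evaluation logic is not shown in this document.

```javascript
// Hypothetical evaluation of a success-rate hypothesis from request counts.
function evaluateHypothesis(hypothesis, counts) {
  const total = counts.succeeded + counts.failed;
  if (total === 0) {
    // Without traffic nothing was validated, so do not report success.
    return { held: false, observed: null, reason: 'no traffic observed' };
  }
  const observed = counts.succeeded / total;
  return { held: observed >= hypothesis.threshold, observed };
}

const hypothesis = { metric: 'success-rate', threshold: 0.99 };
console.log(evaluateHypothesis(hypothesis, { succeeded: 9950, failed: 50 }));
console.log(evaluateHypothesis(hypothesis, { succeeded: 9800, failed: 200 }));
```

Treating zero traffic as a failed validation avoids a false sense of confidence when the experiment ran against an idle service.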
Agent Coordination Hints
Memory Namespace
aqe/chaos-engineering/
├── experiments/* - Experiment definitions & results
├── steady-states/* - Baseline measurements
├── runbooks/* - Generated recovery procedures
└── blast-radius/* - Impact analysis
Fleet Coordination
const chaosFleet = await FleetManager.coordinate({
  strategy: 'chaos-engineering',
  agents: [
    'qe-chaos-engineer',         // Experiment execution
    'qe-performance-tester',     // Baseline metrics
    'qe-production-intelligence' // Production monitoring
  ],
  topology: 'sequential'
});
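What `topology: 'sequential'` does inside `FleetManager` is not documented here. As a rough mental model only, assuming it means "one step at a time, each step seeing the results of the previous ones", the pattern looks like the sketch below. `runSequential` and the step names are hypothetical and are not part of the fleet API.

```javascript
// Hypothetical sequential runner: each step receives the accumulated context
// and returns an object whose fields are merged into it.
async function runSequential(steps, initialContext) {
  let context = { ...initialContext };
  const completed = [];
  for (const { name, run } of steps) {
    const result = await run(context);
    context = { ...context, ...result };
    completed.push(name);
  }
  return { context, completed };
}

// Usage with stand-in steps (real agents would do the actual work).
const steps = [
  { name: 'measure-baseline', run: async () => ({ baselineThroughput: 100 }) },
  { name: 'run-experiment', run: async (ctx) => ({ observedThroughput: ctx.baselineThroughput - 5 }) },
];

runSequential(steps, {}).then(({ context, completed }) => {
  console.log(completed.join(' -> '), context);
});
```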
Related Skills
- shift-right-testing - Production testing
- performance-testing - Load testing
- test-environment-management - Environment stability
Remember
Break things on purpose to prevent unplanned outages. Find weaknesses before users do. Define steady state, inject failures, measure impact, fix weaknesses, create runbooks. Start small, increase blast radius gradually.
With Agents: qe-chaos-engineer automates chaos experiments with blast radius control, automatic rollback, and comprehensive resilience validation. Generates runbooks from experiment results.
Quick Install
/plugin add https://github.com/proffesor-for-testing/agentic-qe/tree/main/chaos-engineering-resilience
Copy and paste this command in Claude Code to install this skill.
Related Skills
moai-project-config-manager
Testing: This skill provides complete CRUD operations for config.json files with built-in validation and merge strategies. It handles project initialization, configuration updates, and includes intelligent backup and recovery features. Use it for robust configuration management with error handling in your development workflows.
moai-project-config-manager
Testing: This Claude Skill provides complete CRUD operations for config.json files with built-in validation and merge strategies. It handles project initialization, configuration updates, and management with intelligent backup and error recovery. Use it for reliable project configuration workflows including safe modifications and rollback capabilities.
test-environment-management
Other: This Claude Skill manages test infrastructure using infrastructure as code, Docker/Kubernetes for consistent environments, and service virtualization. It helps developers ensure environment parity with production and optimize testing costs through auto-shutdown and spot instances. Use it when provisioning test environments or managing testing infrastructure.
test-design-techniques
Other: This Claude Skill provides systematic test design techniques including boundary value analysis, equivalence partitioning, and decision tables. It helps developers create comprehensive test cases while reducing redundancy through methods like pairwise testing. Use it when designing tests for complex business rules, stateful behavior, or ensuring systematic coverage.
