chaos-engineering-resilience
About
This skill applies chaos engineering principles to test distributed systems by injecting controlled failures like network or instance outages. It helps validate fault tolerance and disaster recovery by measuring system behavior against defined steady-state metrics. Use it when building confidence in resilience or performing resilience testing.
Quick Install
Claude Code
Recommended: /plugin add https://github.com/proffesor-for-testing/agentic-qe
git clone https://github.com/proffesor-for-testing/agentic-qe.git ~/.claude/skills/chaos-engineering-resilience
Copy and paste this command into Claude Code to install the skill.
Skill Documentation
Chaos Engineering & Resilience Testing
<default_to_action> When testing system resilience or injecting failures:
- DEFINE steady state (normal metrics: error rate, latency, throughput)
- HYPOTHESIZE system continues in steady state during failure
- INJECT real-world failures (network, instance, disk, CPU)
- OBSERVE and measure deviation from steady state
- FIX weaknesses discovered, document runbooks, repeat
Quick Chaos Steps:
- Start small: Dev → Staging → 1% prod → gradual rollout
- Define clear rollback triggers (error_rate > 5%)
- Measure blast radius; never exceed the planned scope
- Document findings → runbooks → improved resilience
Critical Success Factors:
- Controlled experiments with automatic rollback
- Steady state must be measurable
- Start in non-production, graduate to production </default_to_action>
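The define, hypothesize, inject, observe loop above can be sketched as a small driver. This is a minimal illustration under stated assumptions, not the skill's implementation: `Metrics`, `ExperimentHooks`, and `runExperiment` are hypothetical names, and the hooks stand in for whatever monitoring and failure-injection tooling is actually in use.

```typescript
// Hypothetical types; none of these names come from the agentic-qe API.
interface Metrics {
  errorRate: number;     // fraction of failed requests, 0..1
  p99LatencyMs: number;
}

interface ExperimentHooks {
  measure: () => Metrics; // reads current metrics from monitoring
  inject: () => void;     // starts the failure
  rollback: () => void;   // stops the failure and restores the system
}

interface ExperimentResult {
  baseline: Metrics;
  observed: Metrics[];
  rolledBackEarly: boolean;
  hypothesisHeld: boolean;
}

function runExperiment(
  hooks: ExperimentHooks,
  steadyState: (m: Metrics) => boolean,
  rollbackTrigger: (m: Metrics) => boolean,
  samples: number
): ExperimentResult {
  // DEFINE: refuse to start unless the system is already in steady state.
  const baseline = hooks.measure();
  if (!steadyState(baseline)) {
    throw new Error("System is not in steady state; refusing to inject failure");
  }

  const observed: Metrics[] = [];
  let rolledBackEarly = false;

  // INJECT, then OBSERVE while the failure is active.
  hooks.inject();
  try {
    for (let i = 0; i < samples; i++) {
      const m = hooks.measure();
      observed.push(m);
      if (rollbackTrigger(m)) {
        rolledBackEarly = true;
        break;
      }
    }
  } finally {
    // Rollback always runs, even if measuring throws.
    hooks.rollback();
  }

  // HYPOTHESIZE: the hypothesis holds only if every sample stayed in steady state.
  const hypothesisHeld = !rolledBackEarly && observed.every(steadyState);
  return { baseline, observed, rolledBackEarly, hypothesisHeld };
}
```

Placing the rollback in `finally` is one way to honor "controlled experiments with automatic rollback": the failure is removed whether the loop finishes, trips the trigger, or errors out.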
Quick Reference Card
When to Use
- Distributed systems validation
- Disaster recovery testing
- Building confidence in fault tolerance
- Pre-production resilience verification
Failure Types to Inject
| Category | Failures | Tools |
|---|---|---|
| Network | Latency, packet loss, partition | tc, toxiproxy |
| Infrastructure | Instance kill, disk failure, CPU | Chaos Monkey |
| Application | Exceptions, slow responses, leaks | Gremlin, LitmusChaos |
| Dependencies | Service outage, timeout | WireMock |
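For the Application row, exception injection can also be approximated in-process without a dedicated tool. The wrapper below is a rough sketch only: `withFaultInjection` is a hypothetical helper and is not part of Gremlin, LitmusChaos, or any other tool listed in the table. The random source is passed in so that tests can make it deterministic.

```typescript
// Hypothetical helper for illustration; not an API of any listed tool.
function withFaultInjection<A extends unknown[], R>(
  fn: (...args: A) => R,
  failureRate: number,                 // fraction of calls that should fail, 0..1
  random: () => number = Math.random   // injectable for deterministic tests
): (...args: A) => R {
  if (failureRate < 0 || failureRate > 1) {
    throw new RangeError("failureRate must be between 0 and 1");
  }
  return (...args: A): R => {
    // Fail before calling the real function, as a crashed dependency would.
    if (random() < failureRate) {
      throw new Error("chaos: injected failure");
    }
    return fn(...args);
  };
}
```

Wrapping a single client call with a low `failureRate` keeps the blast radius small while still exercising the caller's error handling.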
Blast Radius Progression
Dev (safe) → Staging → 1% prod → 10% → 50% → 100%
Goal at each stage: learn in dev, validate in staging, proceed carefully at 1% prod, and reach full confidence by 100%.
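One way to enforce this progression is to gate each stage on the previous one passing. The sketch below assumes a hypothetical `runStage` callback that runs the experiment at a given scope and reports whether steady state held; the stage labels are taken from the progression above.

```typescript
// Stage labels follow the progression above; everything else is hypothetical.
const STAGES = ["Dev", "Staging", "1% prod", "10%", "50%", "100%"] as const;
type Stage = (typeof STAGES)[number];

interface ProgressionResult {
  completed: Stage[];
  haltedAt: Stage | null; // null means every stage passed
}

function progressBlastRadius(runStage: (stage: Stage) => boolean): ProgressionResult {
  const completed: Stage[] = [];
  for (const stage of STAGES) {
    // Stop at the first stage where steady state did not hold;
    // later (larger) stages are never attempted.
    if (!runStage(stage)) {
      return { completed, haltedAt: stage };
    }
    completed.push(stage);
  }
  return { completed, haltedAt: null };
}
```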
Steady State Metrics
| Metric | Normal | Alert Threshold |
|---|---|---|
| Error rate | < 0.1% | > 1% |
| p99 latency | < 200ms | > 500ms |
| Throughput | baseline | > 20% below baseline |
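The alert thresholds in the table translate directly into a check. This is a minimal sketch: `Sample` and `findAlerts` are hypothetical names, and the throughput row is read as "more than 20% below the measured baseline".

```typescript
// Hypothetical shape for one metrics reading.
interface Sample {
  errorRate: number;      // fraction of failed requests, 0..1
  p99LatencyMs: number;
  throughputRps: number;  // requests per second
}

// Returns the names of the metrics that crossed their alert threshold.
function findAlerts(sample: Sample, baselineThroughputRps: number): string[] {
  const alerts: string[] = [];
  if (sample.errorRate > 0.01) alerts.push("error rate");        // > 1%
  if (sample.p99LatencyMs > 500) alerts.push("p99 latency");     // > 500ms
  if (sample.throughputRps < baselineThroughputRps * 0.8) {      // > 20% drop
    alerts.push("throughput");
  }
  return alerts;
}
```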
Chaos Experiment Structure
// Chaos experiment definition
const experiment = {
  name: 'Database latency injection',
  hypothesis: 'System handles 500ms DB latency gracefully',
  steadyState: {
    errorRate: '< 0.1%',
    p99Latency: '< 300ms'
  },
  method: {
    type: 'network-latency',
    target: 'database',
    delay: '500ms',
    duration: '5m'
  },
  rollback: {
    automatic: true,
    trigger: 'errorRate > 5%'
  }
};
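The rollback trigger is written as a string such as 'errorRate > 5%', so something has to turn it into a comparison. The parser below is one possible sketch, not how agentic-qe evaluates triggers: `parseCondition` and `isTriggered` are hypothetical, and the grammar (metric name, comparator, number, optional `%` or `ms` unit) is an assumption based only on the strings shown in the definition.

```typescript
type Comparator = "<" | ">" | "<=" | ">=";

interface Condition {
  metric: string;
  op: Comparator;
  value: number; // percentages are stored as fractions (5% -> 0.05)
}

// Parses strings like 'errorRate > 5%'. Assumed grammar, not a documented format.
function parseCondition(text: string): Condition {
  const match = /^\s*([A-Za-z_]\w*)\s*(<=|>=|<|>)\s*(\d+(?:\.\d+)?)\s*(%|ms)?\s*$/.exec(text);
  if (!match) {
    throw new Error(`Unrecognized condition: ${text}`);
  }
  const metric = match[1] as string;
  const op = match[2] as Comparator;
  const raw = Number(match[3]);
  const value = match[4] === "%" ? raw / 100 : raw;
  return { metric, op, value };
}

// Evaluates a parsed condition against current readings keyed by metric name.
function isTriggered(condition: Condition, readings: Record<string, number>): boolean {
  const actual = readings[condition.metric];
  if (actual === undefined) {
    throw new Error(`No reading for metric: ${condition.metric}`);
  }
  switch (condition.op) {
    case "<": return actual < condition.value;
    case ">": return actual > condition.value;
    case "<=": return actual <= condition.value;
    case ">=": return actual >= condition.value;
  }
}
```

Throwing on a missing reading is a deliberate choice here: a trigger that silently evaluates to false when monitoring is down would disable automatic rollback exactly when it is needed.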
Agent-Driven Chaos
// qe-chaos-engineer runs controlled experiments
await Task("Chaos Experiment", {
  target: 'payment-service',
  failure: 'terminate-random-instance',
  blastRadius: '10%',
  duration: '5m',
  steadyStateHypothesis: {
    metric: 'success-rate',
    threshold: 0.99
  },
  autoRollback: true
}, "qe-chaos-engineer");

// Validates:
// - System recovers automatically
// - Error rate stays within threshold
// - No data loss
// - Alerts triggered appropriately
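The steadyStateHypothesis above compares a success rate against a threshold of 0.99. How the agent computes that rate is not shown in the source; the sketch below assumes the simplest reading, the fraction of successful requests observed during the experiment window, with hypothetical function names.

```typescript
// Fraction of successful requests; each entry is true for a success.
function successRate(outcomes: boolean[]): number {
  if (outcomes.length === 0) {
    // An empty window says nothing about the system, so do not report 100%.
    throw new Error("No request outcomes recorded");
  }
  const successes = outcomes.filter((ok) => ok).length;
  return successes / outcomes.length;
}

// True when the observed success rate meets the hypothesis threshold.
function hypothesisHolds(outcomes: boolean[], threshold: number = 0.99): boolean {
  return successRate(outcomes) >= threshold;
}
```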
Agent Coordination Hints
Memory Namespace
aqe/chaos-engineering/
├── experiments/* - Experiment definitions & results
├── steady-states/* - Baseline measurements
├── runbooks/* - Generated recovery procedures
└── blast-radius/* - Impact analysis
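Keys under this namespace can be built with a small helper so that agents agree on the layout. This is an illustrative sketch only: the prefix and section names mirror the tree above, but `memoryKey` and its id validation rule are assumptions rather than part of agentic-qe.

```typescript
// Prefix and sections mirror the tree above; the helper itself is hypothetical.
const NAMESPACE = "aqe/chaos-engineering";
const SECTIONS = ["experiments", "steady-states", "runbooks", "blast-radius"] as const;
type Section = (typeof SECTIONS)[number];

function memoryKey(section: Section, id: string): string {
  // Reject slashes and empty ids so a key cannot escape its section.
  if (!/^[A-Za-z0-9._-]+$/.test(id)) {
    throw new Error(`Invalid id for memory key: ${id}`);
  }
  return `${NAMESPACE}/${section}/${id}`;
}
```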
Fleet Coordination
const chaosFleet = await FleetManager.coordinate({
  strategy: 'chaos-engineering',
  agents: [
    'qe-chaos-engineer',         // Experiment execution
    'qe-performance-tester',     // Baseline metrics
    'qe-production-intelligence' // Production monitoring
  ],
  topology: 'sequential'
});
Related Skills
- shift-right-testing - Production testing
- performance-testing - Load testing
- test-environment-management - Environment stability
Remember
Break things on purpose to prevent unplanned outages. Find weaknesses before users do. Define steady state, inject failures, measure impact, fix weaknesses, create runbooks. Start small, increase blast radius gradually.
With Agents: qe-chaos-engineer automates chaos experiments with blast radius control, automatic rollback, and resilience validation. It also generates runbooks from experiment results.
