chaos-engineering-resilience

proffesor-for-testing
Updated today
Tags: other, chaos, resilience, fault-injection, distributed-systems, recovery, netflix

About

This skill applies chaos engineering principles to test distributed systems by injecting controlled failures like network or instance outages. It helps validate fault tolerance and disaster recovery by measuring system behavior against defined steady-state metrics. Use it when building confidence in resilience or performing resilience testing.

Quick Install

Claude Code

Plugin command (recommended)
/plugin add https://github.com/proffesor-for-testing/agentic-qe
Alternative: Git clone
git clone https://github.com/proffesor-for-testing/agentic-qe.git ~/.claude/skills/chaos-engineering-resilience

Copy and paste this command into Claude Code to install the skill.

Skill Documentation

Chaos Engineering & Resilience Testing

<default_to_action> When testing system resilience or injecting failures:

  1. DEFINE steady state (normal metrics: error rate, latency, throughput)
  2. HYPOTHESIZE system continues in steady state during failure
  3. INJECT real-world failures (network, instance, disk, CPU)
  4. OBSERVE and measure deviation from steady state
  5. FIX weaknesses discovered, document runbooks, repeat

Quick Chaos Steps:

  • Start small: Dev → Staging → 1% prod → gradual rollout
  • Define clear rollback triggers (error_rate > 5%)
  • Measure blast radius, never exceed planned scope
  • Document findings → runbooks → improved resilience

Critical Success Factors:

  • Controlled experiments with automatic rollback
  • Steady state must be measurable
  • Start in non-production, graduate to production </default_to_action>
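
The five steps above can be sketched as a single control loop. This is a minimal illustration, not code from agentic-qe: `runExperiment` and every callback name (`readMetrics`, `inject`, `rollback`, `isSteady`, `shouldAbort`) are hypothetical, and the loop is synchronous only to keep the sketch short. A real harness would await each call and pause between samples.

```javascript
// Hypothetical helper: all names here are illustrative assumptions.
function runExperiment({ readMetrics, inject, rollback, isSteady, shouldAbort, samples }) {
  // Steps 1-2: confirm the system is in steady state before touching it.
  const baseline = readMetrics();
  if (!isSteady(baseline)) {
    return { status: 'skipped', observations: [] };
  }

  // Step 3: inject the failure.
  inject();

  const observations = [];
  let status = 'hypothesis-held';
  try {
    for (let i = 0; i < samples; i++) {
      // Step 4: observe and compare against steady state.
      const metrics = readMetrics();
      observations.push(metrics);
      if (shouldAbort(metrics)) {
        status = 'aborted'; // rollback trigger hit, stop sampling
        break;
      }
      if (!isSteady(metrics)) {
        status = 'hypothesis-refuted'; // keep sampling, but record the deviation
      }
    }
  } finally {
    // The fault is always removed, even if a callback throws.
    rollback();
  }

  // Step 5 (fix, document, repeat) happens outside the loop, using `observations`.
  return { status, observations };
}
```

Putting `rollback()` in a `finally` block is the design choice that matters here: the injected fault is removed whether the experiment finishes, aborts, or fails unexpectedly.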

Quick Reference Card

When to Use

  • Distributed systems validation
  • Disaster recovery testing
  • Building confidence in fault tolerance
  • Pre-production resilience verification

Failure Types to Inject

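| Category       | Failures                          | Tools               |
|----------------|-----------------------------------|---------------------|
| Network        | Latency, packet loss, partition   | tc, toxiproxy       |
| Infrastructure | Instance kill, disk failure, CPU  | Chaos Monkey        |
| Application    | Exceptions, slow responses, leaks | Gremlin, LitmusChaos |
| Dependencies   | Service outage, timeout           | WireMock            |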
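
For the Application row, slow responses and exceptions can also be simulated in-process without any of the listed tools. The wrapper below is a rough sketch under that assumption: `withFaults` and its options (`latencyMs`, `failureRate`, `random`) are hypothetical names, not an API of Gremlin, LitmusChaos, or agentic-qe.

```javascript
// Hypothetical in-process fault injector for async functions.
function withFaults(fn, { latencyMs = 0, failureRate = 0, random = Math.random } = {}) {
  return async (...args) => {
    // Slow response: wait before doing any real work.
    if (latencyMs > 0) {
      await new Promise((resolve) => setTimeout(resolve, latencyMs));
    }
    // Exception: fail a configurable fraction of calls.
    if (random() < failureRate) {
      throw new Error('Injected failure');
    }
    return fn(...args);
  };
}

// Usage: roughly 10% of calls fail, and every call is delayed by 200 ms.
const flakyLookup = withFaults(async (id) => ({ id }), { latencyMs: 200, failureRate: 0.1 });
```

Passing `random` as a parameter keeps the injector deterministic in tests, where a fixed function can replace `Math.random`.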

Blast Radius Progression

Dev (safe) → Staging → 1% prod → 10% → 50% → 100%
     ↓           ↓         ↓        ↓
  Learn      Validate   Careful   Full confidence
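
One way to enforce this progression is to gate each stage on the result of the previous one. The sketch below assumes a hypothetical `passes(stage)` callback that runs the experiment at that stage and reports whether steady state held. The stage labels mirror the diagram above, but the names `STAGES` and `progress` are illustrative.

```javascript
// Stage labels follow the progression above; identifiers are illustrative.
const STAGES = ['dev', 'staging', 'prod-1%', 'prod-10%', 'prod-50%', 'prod-100%'];

// Runs stages in order and halts at the first one that fails,
// so the blast radius never grows past a stage that showed a weakness.
function progress(stages, passes) {
  const completed = [];
  for (const stage of stages) {
    if (!passes(stage)) {
      return { completed, haltedAt: stage };
    }
    completed.push(stage);
  }
  return { completed, haltedAt: null };
}
```

When `haltedAt` is not `null`, the weakness found at that stage should be fixed and the stage rerun before moving on.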

Steady State Metrics

| Metric      | Normal   | Alert Threshold |
|-------------|----------|-----------------|
| Error rate  | < 0.1%   | > 1%            |
| p99 latency | < 200ms  | > 500ms         |
| Throughput  | baseline | -20%            |
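
The alert thresholds in this table translate directly into a check. The sketch below makes two assumptions that the table does not state: rates are expressed as fractions (1% is `0.01`), and "-20%" means throughput has dropped 20% or more below the measured baseline. `ALERT_THRESHOLDS` and `findViolations` are hypothetical names.

```javascript
// Thresholds taken from the table; the representation is an assumption.
const ALERT_THRESHOLDS = {
  errorRate: 0.01,      // alert when error rate > 1%
  p99LatencyMs: 500,    // alert when p99 latency > 500ms
  throughputDrop: 0.2,  // alert when throughput is 20% or more below baseline
};

// Returns the names of metrics that breach their alert threshold.
function findViolations(current, baselineThroughput, thresholds = ALERT_THRESHOLDS) {
  const violations = [];
  if (current.errorRate > thresholds.errorRate) violations.push('errorRate');
  if (current.p99LatencyMs > thresholds.p99LatencyMs) violations.push('p99Latency');
  const drop = (baselineThroughput - current.throughput) / baselineThroughput;
  if (drop >= thresholds.throughputDrop) violations.push('throughput');
  return violations;
}
```

An empty result means no alert threshold is breached. It does not by itself mean the system is in the "Normal" range, which is stricter.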

Chaos Experiment Structure

// Chaos experiment definition
const experiment = {
  name: 'Database latency injection',
  hypothesis: 'System handles 500ms DB latency gracefully',
  steadyState: {
    errorRate: '< 0.1%',
    p99Latency: '< 300ms'
  },
  method: {
    type: 'network-latency',
    target: 'database',
    delay: '500ms',
    duration: '5m'
  },
  rollback: {
    automatic: true,
    trigger: 'errorRate > 5%'
  }
};
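
The definition above stores its rollback trigger as a string (`'errorRate > 5%'`), so something has to turn that string into a comparison at run time. The source does not show how this is evaluated. The following is one possible sketch, assuming a tiny `metric operator value` grammar in which a trailing `%` means a fraction. `parseTrigger` and `triggerFires` are hypothetical helpers.

```javascript
// Parses strings such as 'errorRate > 5%' into { metric, op, value }.
// The grammar is an assumption made for this sketch.
function parseTrigger(expr) {
  const match = /^\s*(\w+)\s*(<=|>=|<|>)\s*([\d.]+)\s*(%|ms)?\s*$/.exec(expr);
  if (!match) {
    throw new Error(`Unsupported trigger: ${expr}`);
  }
  const [, metric, op, raw, unit] = match;
  // '5%' becomes 0.05; 'ms' and unitless values are kept as plain numbers.
  const value = unit === '%' ? Number(raw) / 100 : Number(raw);
  return { metric, op, value };
}

// Evaluates a trigger string against a metrics object keyed by metric name.
function triggerFires(expr, metrics) {
  const { metric, op, value } = parseTrigger(expr);
  const actual = metrics[metric];
  switch (op) {
    case '>': return actual > value;
    case '>=': return actual >= value;
    case '<': return actual < value;
    default: return actual <= value; // the regex only admits '<=' here
  }
}
```

With this sketch, `triggerFires('errorRate > 5%', { errorRate: 0.08 })` evaluates to `true`, which is the condition under which the experiment's automatic rollback would start.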

Agent-Driven Chaos

// qe-chaos-engineer runs controlled experiments
await Task("Chaos Experiment", {
  target: 'payment-service',
  failure: 'terminate-random-instance',
  blastRadius: '10%',
  duration: '5m',
  steadyStateHypothesis: {
    metric: 'success-rate',
    threshold: 0.99
  },
  autoRollback: true
}, "qe-chaos-engineer");

// Validates:
// - System recovers automatically
// - Error rate stays within threshold
// - No data loss
// - Alerts triggered appropriately
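
How the agent maps `blastRadius: '10%'` to concrete instances is not shown in the source. As an illustration only, the hypothetical `pickTargets` helper below selects a random subset and rounds down, so the selection never exceeds the planned scope.

```javascript
// Hypothetical helper: picks at most `blastRadius` percent of instances at random.
function pickTargets(instances, blastRadius, random = Math.random) {
  const percent = parseFloat(blastRadius); // '10%' parses to 10
  if (!(percent >= 0 && percent <= 100)) {
    throw new RangeError(`Invalid blast radius: ${blastRadius}`);
  }
  // Round down so the selection stays within the planned scope.
  const count = Math.floor((instances.length * percent) / 100);

  // Partial Fisher-Yates shuffle: only the first `count` slots are randomized.
  const pool = [...instances];
  for (let i = 0; i < count; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, count);
}
```

Rounding down means that a fleet of fewer than ten instances yields no targets at 10%. A real tool might instead enforce a minimum of one target, which is a policy decision this sketch deliberately leaves out.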

Agent Coordination Hints

Memory Namespace

aqe/chaos-engineering/
├── experiments/*       - Experiment definitions & results
├── steady-states/*     - Baseline measurements
├── runbooks/*          - Generated recovery procedures
└── blast-radius/*      - Impact analysis

Fleet Coordination

const chaosFleet = await FleetManager.coordinate({
  strategy: 'chaos-engineering',
  agents: [
    'qe-chaos-engineer',          // Experiment execution
    'qe-performance-tester',      // Baseline metrics
    'qe-production-intelligence'  // Production monitoring
  ],
  topology: 'sequential'
});

Remember

Break things on purpose to prevent unplanned outages. Find weaknesses before users do. Define steady state, inject failures, measure impact, fix weaknesses, create runbooks. Start small, increase blast radius gradually.

With Agents: qe-chaos-engineer automates chaos experiments with blast radius control, automatic rollback, and comprehensive resilience validation. Generates runbooks from experiment results.

GitHub Repository

proffesor-for-testing/agentic-qe
Path: .claude/skills/chaos-engineering-resilience