MCP Hub

chaos-engineering-resilience

proffesor-for-testing
Updated: Today
110 views
Other, chaos, resilience, fault-injection, distributed-systems, recovery, netflix

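About

This skill applies chaos engineering principles to test distributed systems by injecting controlled failures such as network faults and instance termination. By measuring system behavior against defined steady-state metrics, it helps validate fault tolerance and disaster recovery. Use it when building confidence in system resilience or when running resilience tests.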

Quick Install

Claude Code

Plugin command (recommended)
/plugin add https://github.com/proffesor-for-testing/agentic-qe
Git clone (alternative)
git clone https://github.com/proffesor-for-testing/agentic-qe.git ~/.claude/skills/chaos-engineering-resilience

Copy and paste this command into Claude Code to install the skill.

Documentation

Chaos Engineering & Resilience Testing

<default_to_action> When testing system resilience or injecting failures:

  1. DEFINE steady state (normal metrics: error rate, latency, throughput)
  2. HYPOTHESIZE that the system stays in steady state during the failure
  3. INJECT real-world failures (network, instance, disk, CPU)
  4. OBSERVE and measure deviation from steady state
  5. FIX weaknesses discovered, document runbooks, repeat

Quick Chaos Steps:

  • Start small: Dev → Staging → 1% prod → gradual rollout
  • Define clear rollback triggers (e.g., error_rate > 5%)
  • Measure blast radius, never exceed planned scope
  • Document findings → runbooks → improved resilience

Critical Success Factors:

  • Controlled experiments with automatic rollback
  • Steady state must be measurable
  • Start in non-production, graduate to production </default_to_action>
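
The five steps above can be sketched as a small experiment runner. This is only an illustration under stated assumptions, not the skill's implementation: `runExperiment`, `measure`, `inject`, and `rollback` are hypothetical names, and the system under test is a stub whose error rate rises while the fault is active.

```javascript
// Minimal sketch of the define -> hypothesize -> inject -> observe loop.
// All names here (runExperiment, measure, inject, rollback) are hypothetical.
function runExperiment({ steadyState, measure, inject, rollback, samples = 3 }) {
  // DEFINE: confirm the system is in steady state before touching it.
  const baseline = measure();
  if (!steadyState(baseline)) {
    return { status: 'aborted', reason: 'not in steady state before injection', observations: [] };
  }

  // INJECT the failure, then OBSERVE a fixed number of samples.
  inject();
  const observations = [];
  try {
    for (let i = 0; i < samples; i++) {
      const metrics = measure();
      observations.push(metrics);
      // HYPOTHESIS: steady state holds while the failure is active.
      if (!steadyState(metrics)) {
        return { status: 'hypothesis-rejected', reason: 'steady state violated', observations };
      }
    }
    return { status: 'hypothesis-held', reason: null, observations };
  } finally {
    // Always remove the fault, whether or not the hypothesis held.
    rollback();
  }
}

// Stubbed system: the error rate jumps to 3% once the fault is active.
let faultActive = false;
const result = runExperiment({
  steadyState: (m) => m.errorRate < 0.01,
  measure: () => ({ errorRate: faultActive ? 0.03 : 0.001 }),
  inject: () => { faultActive = true; },
  rollback: () => { faultActive = false; },
});
console.log(result.status); // hypothesis-rejected
```

Putting the rollback in a `finally` block keeps the fault from outliving the experiment even when observation stops early. A rejected hypothesis is the input to step 5: fix the weakness, write the runbook, and run the experiment again.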

Quick Reference Card

When to Use

  • Distributed systems validation
  • Disaster recovery testing
  • Building confidence in fault tolerance
  • Pre-production resilience verification

Failure Types to Inject

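| Category       | Failures                          | Tools               |
|----------------|-----------------------------------|---------------------|
| Network        | Latency, packet loss, partition   | tc, toxiproxy       |
| Infrastructure | Instance kill, disk failure, CPU  | Chaos Monkey        |
| Application    | Exceptions, slow responses, leaks | Gremlin, LitmusChaos |
| Dependencies   | Service outage, timeout           | WireMock            |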
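
The tools listed above inject faults at the network or infrastructure level. For the Application row (exceptions and slow responses), the same idea can be sketched in-process. The wrapper below is a hypothetical helper, not part of any listed tool: `withFaults`, `delayMs`, `failureRate`, and the injectable `random` source are names introduced here for illustration.

```javascript
// Hypothetical application-level fault injector: wraps an async function and
// adds a fixed delay and/or a failure to a configurable fraction of calls.
function withFaults(fn, { delayMs = 0, failureRate = 0, random = Math.random } = {}) {
  return async function faulty(...args) {
    if (delayMs > 0) {
      // Simulate a slow response before doing any real work.
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
    if (random() < failureRate) {
      // Simulate an exception thrown by the wrapped dependency.
      throw new Error('Injected fault');
    }
    return fn(...args);
  };
}

// Example: a stand-in lookup that becomes 50 ms slower and fails ~20% of the time.
const getUser = async (id) => ({ id, name: 'demo' });
const flakyGetUser = withFaults(getUser, { delayMs: 50, failureRate: 0.2 });

flakyGetUser(7)
  .then((user) => console.log('ok', user))
  .catch((err) => console.log('failed:', err.message));
```

Passing `random` as a parameter makes the failure pattern deterministic in tests, for example `random: () => 0` to force every call to fail.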

Blast Radius Progression

Dev (safe) → Staging → 1% prod → 10% → 50% → 100%
     ↓           ↓         ↓        ↓
  Learn      Validate   Careful   Full confidence
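
One way to enforce this progression is to gate each widening step on the outcome of the previous experiment. The sketch below is an assumption about how such a gate could look. The stage names mirror the diagram above, but `nextStage` and its "stay put on failure" policy are hypothetical.

```javascript
// Stages taken from the progression above; the gating policy is an assumption.
const STAGES = ['dev', 'staging', 'prod-1%', 'prod-10%', 'prod-50%', 'prod-100%'];

// Hypothetical gate: widen the blast radius by one stage only when the last
// experiment held steady state; otherwise stay at the current stage.
function nextStage(current, steadyStateHeld) {
  const index = STAGES.indexOf(current);
  if (index === -1) {
    throw new Error(`Unknown stage: ${current}`);
  }
  if (!steadyStateHeld) {
    return current; // fix the weakness before widening the scope
  }
  return STAGES[Math.min(index + 1, STAGES.length - 1)];
}

console.log(nextStage('staging', true));  // prod-1%
console.log(nextStage('prod-1%', false)); // prod-1%
```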

Steady State Metrics

| Metric      | Normal   | Alert Threshold |
|-------------|----------|-----------------|
| Error rate  | < 0.1%   | > 1%            |
| p99 latency | < 200ms  | > 500ms         |
| Throughput  | baseline | -20%            |
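
The alert thresholds above translate directly into a check over sampled metrics. The following is a minimal sketch. `findViolations` and the metric field names are hypothetical, and reading "-20%" as "throughput more than 20% below baseline" is an assumption.

```javascript
// Alert thresholds copied from the table above. Interpreting "-20%" as
// "throughput more than 20% below baseline" is an assumption.
const ALERT_THRESHOLDS = { errorRate: 0.01, p99LatencyMs: 500, throughputDrop: 0.2 };

// Returns the names of all metrics that breach their alert threshold.
function findViolations(metrics, baselineThroughput, thresholds = ALERT_THRESHOLDS) {
  const violations = [];
  if (metrics.errorRate > thresholds.errorRate) violations.push('errorRate');
  if (metrics.p99LatencyMs > thresholds.p99LatencyMs) violations.push('p99LatencyMs');
  // Relative drop in throughput compared with the recorded baseline.
  const drop = (baselineThroughput - metrics.throughput) / baselineThroughput;
  if (drop > thresholds.throughputDrop) violations.push('throughput');
  return violations;
}

// Error rate is fine, but latency is over 500 ms and throughput is down 30%.
const sample = { errorRate: 0.002, p99LatencyMs: 620, throughput: 700 };
console.log(findViolations(sample, 1000)); // [ 'p99LatencyMs', 'throughput' ]
```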

Chaos Experiment Structure

// Chaos experiment definition
const experiment = {
  name: 'Database latency injection',
  hypothesis: 'System handles 500ms DB latency gracefully',
  steadyState: {
    errorRate: '< 0.1%',
    p99Latency: '< 300ms'
  },
  method: {
    type: 'network-latency',
    target: 'database',
    delay: '500ms',
    duration: '5m'
  },
  rollback: {
    automatic: true,
    trigger: 'errorRate > 5%'
  }
};
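
The `rollback.trigger` field above is a string, so something has to turn it into a decision at run time. The parser below is one possible sketch. `parseTrigger` is a hypothetical helper, and its grammar (metric name, comparison operator, number, optional `%` or `ms` unit) is inferred from the examples in this document rather than taken from a published specification.

```javascript
// Hypothetical parser for trigger strings such as 'errorRate > 5%'.
// The accepted grammar is an assumption based on the examples in this document.
function parseTrigger(expression) {
  const match = /^\s*(\w+)\s*(<=|>=|<|>)\s*(\d+(?:\.\d+)?)\s*(%|ms)?\s*$/.exec(expression);
  if (!match) {
    throw new Error(`Unsupported trigger: ${expression}`);
  }
  const [, metric, operator, rawValue, unit] = match;
  // Percentages are compared as fractions (5% -> 0.05); ms values stay as numbers.
  const value = unit === '%' ? Number(rawValue) / 100 : Number(rawValue);
  const compare = {
    '<': (a, b) => a < b,
    '<=': (a, b) => a <= b,
    '>': (a, b) => a > b,
    '>=': (a, b) => a >= b,
  }[operator];
  return { metric, operator, value, test: (metrics) => compare(metrics[metric], value) };
}

const rollbackTrigger = parseTrigger('errorRate > 5%');
console.log(rollbackTrigger.test({ errorRate: 0.08 })); // true
console.log(rollbackTrigger.test({ errorRate: 0.02 })); // false
```

This sketch expects metrics as fractions (an 8% error rate is `0.08`). A monitoring system that reports percentages directly would need the conversion removed.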

Agent-Driven Chaos

// qe-chaos-engineer runs controlled experiments
await Task("Chaos Experiment", {
  target: 'payment-service',
  failure: 'terminate-random-instance',
  blastRadius: '10%',
  duration: '5m',
  steadyStateHypothesis: {
    metric: 'success-rate',
    threshold: 0.99
  },
  autoRollback: true
}, "qe-chaos-engineer");

// Validates:
// - System recovers automatically
// - Error rate stays within threshold
// - No data loss
// - Alerts triggered appropriately

Agent Coordination Hints

Memory Namespace

aqe/chaos-engineering/
├── experiments/*       - Experiment definitions & results
├── steady-states/*     - Baseline measurements
├── runbooks/*          - Generated recovery procedures
└── blast-radius/*      - Impact analysis

Fleet Coordination

const chaosFleet = await FleetManager.coordinate({
  strategy: 'chaos-engineering',
  agents: [
    'qe-chaos-engineer',          // Experiment execution
    'qe-performance-tester',      // Baseline metrics
    'qe-production-intelligence'  // Production monitoring
  ],
  topology: 'sequential'
});

Remember

Break things on purpose to prevent unplanned outages. Find weaknesses before users do. Define steady state, inject failures, measure impact, fix weaknesses, create runbooks. Start small, increase blast radius gradually.

With Agents: qe-chaos-engineer automates chaos experiments with blast radius control, automatic rollback, and comprehensive resilience validation, and it generates runbooks from experiment results.

GitHub Repository

proffesor-for-testing/agentic-qe
Path: .claude/skills/chaos-engineering-resilience
agenticqe, agenticsfoundation, agents, quality-engineering
