run-chaos-experiment

Name: run-chaos-experiment
Author: pjt222

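About

This skill helps you design and run controlled chaos experiments with Litmus or Chaos Mesh, testing system resilience through fault injection. It validates hypotheses about failure modes and improves recovery capability, which suits pre-launch testing, architecture validation, and SRE maturity programs. Key features include hypothesis-driven testing, controlled fault injection, and Kubernetes integration for resilience validation.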

Documentation

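name: run-chaos-experiment
description: >
  Design and run chaos engineering experiments with Litmus or Chaos Mesh.
  Test system resilience through controlled fault injection, validate
  hypotheses about failure modes, and improve failure recovery. Use before
  major product launches, after architecture changes to verify resilience,
  during GameDays or disaster recovery drills, when validating assumptions
  about failure modes, or as part of an SRE maturity program.
locale: zh-CN
source_locale: en
source_commit: 6f65f316
translator: claude-opus-4-6
translation_date: 2026-03-16
license: MIT
allowed-tools: Read Write Edit Bash Grep Glob
metadata:
  author: Philipp Thoss
  version: "1.0"
  domain: observability
  complexity: advanced
  language: multi
  tags: chaos-engineering, litmus, chaos-mesh, resilience, fault-injection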

Run Chaos Experiment

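Inject controlled faults to test and improve system resilience.

When to Use

Before a major product launch (load testing)
After an architecture change (verify resilience)
During a GameDay or disaster recovery drill
When validating assumptions about failure modes
As part of an SRE maturity program

Inputs

Required: Kubernetes cluster (for Litmus or Chaos Mesh)
Required: Steady-state definition (what "normal" looks like)
Required: Hypothesis to test (e.g., "If one pod crashes, the API stays available")
Optional: Observability stack (Prometheus, Grafana) to measure impact
Optional: Rollback plan

Procedure

Step 1: Define Steady State and Hypothesis

Document normal system behavior: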

## Steady State Definition

### Service: API Gateway
- **Availability**: 99.9% (< 0.1% error rate)
- **Latency**: p95 < 200ms
- **Throughput**: 1000 req/s
- **Dependencies**: Database (Postgres), Cache (Redis), Auth Service

### Metrics
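- `sum(rate(http_requests_total{job="api"}[5m]))` (throughput, req/s)
- `histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[5m])))` (p95 latency, seconds)
- `sum(rate(http_requests_total{job="api", status=~"5.."}[5m])) / sum(rate(http_requests_total{job="api"}[5m]))` (error ratio)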

## Hypothesis
**"If one API pod is killed, the remaining pods will handle the load with <5s
disruption and no increase in error rate."**

### Validation Criteria
- Error rate remains <1%
- p95 latency stays <300ms (100ms grace over the 200ms steady-state target)
- Service recovers within 5 seconds
- No cascading failures to downstream services

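Expected: A clear, measurable definition of normal behavior and of the success criteria.

On failure: If you cannot define steady state, observability is insufficient. Add metrics first.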
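Once the three numbers have been read off the dashboards, the validation criteria above can be checked mechanically. The following is a minimal sketch, assuming the thresholds from the example criteria (error rate below 1%, p95 below 300 ms, recovery within 5 s). The `Observation` type and the `evaluate_hypothesis` function are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Peak values measured during one experiment run (hypothetical type)."""
    error_rate_pct: float   # peak error rate, in percent
    p95_latency_ms: float   # peak p95 latency, in milliseconds
    recovery_s: float       # seconds until metrics returned to steady state


# Thresholds copied from the example validation criteria.
MAX_ERROR_RATE_PCT = 1.0
MAX_P95_LATENCY_MS = 300.0
MAX_RECOVERY_S = 5.0


def evaluate_hypothesis(obs: Observation) -> list[str]:
    """Return the violated criteria; an empty list means the hypothesis held."""
    violations = []
    if obs.error_rate_pct >= MAX_ERROR_RATE_PCT:
        violations.append(f"error rate {obs.error_rate_pct}% is not below {MAX_ERROR_RATE_PCT}%")
    if obs.p95_latency_ms >= MAX_P95_LATENCY_MS:
        violations.append(f"p95 latency {obs.p95_latency_ms}ms is not below {MAX_P95_LATENCY_MS}ms")
    if obs.recovery_s > MAX_RECOVERY_S:
        violations.append(f"recovery took {obs.recovery_s}s, limit is {MAX_RECOVERY_S}s")
    return violations
```

Returning the list of violations rather than a bare boolean keeps the failure reasons available for the experiment report.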

Step 2: Set Blast Radius Limits

Scope the experiment to the smallest possible risk surface:

# chaos-config.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: chaos-testing

---
# Label pods participating in chaos experiments
# (fragment: add these labels to the pod template of the target workload)
apiVersion: v1
kind: Pod
metadata:
  labels:
    chaos-enabled: "true"
    environment: staging  # NEVER production for first run

Set up safeguards:

## Blast Radius Controls

### Environment
- **Scope**: Staging only (first 5 runs)
- **Production**: Only after 5 successful staging runs
- **Timing**: Business hours (09:00-17:00 local), never weekends/holidays

### Target Selection
- **Limit**: Max 1 pod per service
- **Percentage**: Max 25% of replicas
- **Exclusions**: Database, payment service, auth service (critical path)

### Auto-Abort Conditions
- Error rate >10% for >30 seconds
- Customer-facing alerts fire
- Manual abort signal from on-call engineer

### Rollback Plan
- Killed pods are recreated automatically by their Deployment
- Manual rollback: `kubectl rollout undo deployment/api`
- Incident declared if recovery takes >5 minutes

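Expected: The experiment has clear boundaries and cannot take down the whole system.

On failure: If the blast radius is too large, narrow the scope. Start with a single non-critical service.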
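The target-selection limits interact in a way that is easy to miss: a 25% cap rounds down to zero pods for any service with fewer than four replicas. The sketch below works through that arithmetic, assuming the limits from the controls above. The service identifiers in `CRITICAL_SERVICES` and both function names are hypothetical and stand in for whatever naming your cluster uses.

```python
import math

# Hypothetical identifiers for the excluded critical-path services.
CRITICAL_SERVICES = {"database", "payment", "auth"}


def max_pods_to_kill(replicas: int, max_fraction: float = 0.25, hard_cap: int = 1) -> int:
    """Largest pod count that satisfies both the percentage cap and the per-service cap."""
    if replicas < 1:
        return 0
    by_fraction = math.floor(replicas * max_fraction)
    return min(hard_cap, by_fraction)


def is_eligible(service: str, environment: str, staging_successes: int) -> bool:
    """Apply the exclusion list and the rule of five staging runs before production."""
    if service in CRITICAL_SERVICES:
        return False
    if environment == "staging":
        return True
    return environment == "production" and staging_successes >= 5
```

Under these assumptions `max_pods_to_kill(3)` is 0 and `max_pods_to_kill(4)` is 1, so a three-replica service needs either more replicas or an explicitly relaxed cap before a pod-kill experiment fits inside the stated blast radius.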

Step 3: Install Chaos Mesh

Deploy Chaos Mesh (Kubernetes-native):

# Add Chaos Mesh Helm repo
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

# Install Chaos Mesh in isolated namespace
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --create-namespace \
  --set dashboard.create=true \
  --set controllerManager.replicaCount=1

# Verify installation
kubectl get pods -n chaos-mesh

# Access dashboard
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333
# Open http://localhost:2333

Alternative: Litmus (vendor-neutral):

# Install Litmus
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v2.14.0.yaml

# Wait for Litmus pods
kubectl get pods -n litmus

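# Install the generic chaos experiments from ChaosHub
# (URL quoted so the shell does not treat "?" as a glob character)
kubectl apply -f "https://hub.litmuschaos.io/api/chaos/master?file=charts/generic/experiments.yaml"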

Expected: Chaos Mesh or Litmus is running and the dashboard is reachable.

On failure: Check RBAC permissions. Chaos tools need cluster-wide access.
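The "verify installation" step can be turned into a pass/fail check by parsing the JSON form of the same command, `kubectl get pods -n chaos-mesh -o json`. This is a minimal sketch that relies only on the standard pod fields `status.phase` and `status.containerStatuses[].ready`. The function name `not_ready_pods` is a hypothetical helper, not part of either tool.

```python
import json


def not_ready_pods(pod_list_json: str) -> list[str]:
    """Names of pods that are not Running with every container ready."""
    pods = json.loads(pod_list_json)["items"]
    bad = []
    for pod in pods:
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        # A pod with no reported container statuses is treated as not ready.
        all_ready = bool(containers) and all(c.get("ready", False) for c in containers)
        if status.get("phase") != "Running" or not all_ready:
            bad.append(pod["metadata"]["name"])
    return bad
```

An empty result means every pod in the namespace is up; anything else lists the pods to inspect with `kubectl describe pod`.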

Step 4: Create and Run the Experiment

Example: pod-kill experiment (Chaos Mesh):

# pod-kill-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: api-pod-kill-test
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one  # Kill one pod only
  selector:
    namespaces:
      - staging  # First runs target staging, never production
    labelSelectors:
      app: api-gateway
      chaos-enabled: "true"
  duration: "30s"
  # Inline scheduler is Chaos Mesh 1.x syntax; Chaos Mesh 2.x uses a separate Schedule resource
  scheduler:
    cron: "@every 5m"  # Repeat every 5 minutes (for sustained testing)

Apply the experiment:

# Apply experiment
kubectl apply -f pod-kill-experiment.yaml

# Watch experiment status
kubectl get podchaos -n chaos-testing -w

# View detailed status
kubectl describe podchaos api-pod-kill-test -n chaos-testing

# Check which pods were affected
kubectl get events -n staging --sort-by=.metadata.creationTimestamp | grep api-gateway

Monitor the impact in Grafana:

# Error rate during experiment (5xx requests per second)
rate(http_requests_total{status=~"5..", job="api"}[1m])

# Latency spike (p95 aggregated across pods)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[1m])))

# Pod restarts
rate(kube_pod_container_status_restarts_total{pod=~"api-.*"}[5m])

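Expected: The pod is killed, Kubernetes restarts it, and the service keeps running with only minor jitter.

On failure: If the error rate spikes or the service degrades significantly, abort the experiment and investigate.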
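The abort decision is easier to apply consistently when the "error rate above 10% for more than 30 seconds" condition from the blast radius controls is evaluated over sampled metrics instead of by eye. Below is a minimal sketch, assuming the samples arrive as time-ordered `(timestamp_seconds, error_rate_percent)` pairs. The function name `first_abort_time` and the sample format are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def first_abort_time(samples: Iterable[Tuple[float, float]],
                     threshold_pct: float = 10.0,
                     sustain_s: float = 30.0) -> Optional[float]:
    """Return the timestamp at which the abort condition is first met, else None.

    The condition is met once the error rate has stayed above threshold_pct
    for longer than sustain_s without dropping back in between.
    """
    breach_start = None
    for ts, rate in samples:
        if rate > threshold_pct:
            if breach_start is None:
                breach_start = ts  # start of a new continuous breach
            if ts - breach_start > sustain_s:
                return ts
        else:
            breach_start = None  # breach interrupted, reset the window
    return None
```

When the function returns a timestamp, deleting the chaos resource (for example `kubectl delete podchaos api-pod-kill-test -n chaos-testing`) ends the experiment. Wiring this into an automated watcher is left out of the sketch.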

Step 5: Analyze Results and Iterate

Create an experiment report:

# Chaos Experiment Report: API Pod Kill

**Date**: 2025-02-09
**Hypothesis**: API stays available if one pod crashes
**Tool**: Chaos Mesh
**Environment**: Staging
**Duration**: 30 seconds (pod kill + recovery)

## Results

### Metrics During Experiment
- **Error Rate**: Increased from 0.1% to 2.3% (spike lasted 8 seconds)
- **p95 Latency**: Increased from 180ms to 450ms (spike lasted 12 seconds)
- **Recovery Time**: 8 seconds (pod restart + load balancer update)

### Hypothesis Outcome
**FAILED**: Error rate exceeded the 1% threshold, p95 latency exceeded 300ms, and recovery took 8 seconds against a 5-second target

## Root Cause Analysis
- Load balancer continued routing to killed pod for 8 seconds (stale endpoint)
- Readiness probe set to 10s interval (too slow)
- No pre-stop hook to drain connections gracefully

## Improvements Made
1. **Reduced readiness probe interval**: 10s → 2s
2. **Added pre-stop hook**: 5-second sleep for connection draining
3. **Tuned load balancer**: Enabled faster endpoint updates

## Follow-Up Experiment
- Re-run with same parameters in 1 week
- Expected: Error rate <1%, recovery <5s

Track experiments in a log:

# chaos-experiment-log.csv
date,experiment,environment,status,error_rate_peak,recovery_time_s,outcome
2025-02-09,pod-kill-api,staging,complete,2.3%,8,failed
2025-02-16,pod-kill-api,staging,complete,0.8%,4,passed
2025-02-23,network-delay-db,staging,aborted,15%,N/A,failed

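Expected: Lessons are documented, fixes are implemented, and a follow-up is scheduled.

On failure: If nothing changes after an experiment, chaos engineering becomes a formality. Prioritize the fixes it surfaced.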
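Because the log is plain CSV, the question "how many times in a row has this experiment passed in staging?" can be answered with a few lines of code. The sketch below embeds the three example rows from the log above and assumes the same column names. The function name `consecutive_passes` is hypothetical.

```python
import csv
import io

# The example rows from the log above (without the leading filename comment).
LOG = """date,experiment,environment,status,error_rate_peak,recovery_time_s,outcome
2025-02-09,pod-kill-api,staging,complete,2.3%,8,failed
2025-02-16,pod-kill-api,staging,complete,0.8%,4,passed
2025-02-23,network-delay-db,staging,aborted,15%,N/A,failed
"""


def consecutive_passes(log_text: str, experiment: str, environment: str = "staging") -> int:
    """Count the most recent unbroken run of passed outcomes for one experiment."""
    rows = [r for r in csv.DictReader(io.StringIO(log_text))
            if r["experiment"] == experiment and r["environment"] == environment]
    rows.sort(key=lambda r: r["date"])  # ISO dates sort correctly as strings
    streak = 0
    for row in reversed(rows):
        if row["outcome"] != "passed":
            break
        streak += 1
    return streak
```

Counting only the trailing streak, not the total number of passes, means a recent regression resets progress toward a production run.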

Step 6: Advance to Production Cautiously

Once staging experiments pass consistently:

# Production pod-kill experiment (more conservative)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: api-pod-kill-prod
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-gateway
      chaos-enabled: "true"
  duration: "10s"  # Shorter than staging
  # Inline scheduler is Chaos Mesh 1.x syntax; Chaos Mesh 2.x uses a separate Schedule resource
  scheduler:
    cron: "0 10 * * 2"  # Tuesdays at 10 AM only (predictable, low-risk time)

Production safeguards:

# Create a kill switch for production chaos
kubectl create configmap chaos-killswitch \
  -n chaos-testing \
  --from-literal=enabled=true

# Update experiments to check kill switch
# (implementation depends on chaos tool)

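Expected: Production experiments run in a low-risk time window with the kill switch always available.

On failure: If a production experiment causes an incident, disable it immediately and hold a post-incident review.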
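The production preconditions scattered through this skill (five staging passes, business hours on weekdays, a kill switch) can be combined into a single gate. This is a minimal sketch under two loudly stated assumptions: the ConfigMap value `enabled=true` is read as "chaos is allowed to run", and holidays are not modeled. The function name `production_run_allowed` is hypothetical.

```python
from datetime import datetime


def production_run_allowed(staging_passes: int,
                           chaos_enabled: bool,
                           now: datetime,
                           required_passes: int = 5) -> bool:
    """Gate a production experiment on the safeguards described above.

    chaos_enabled mirrors the kill-switch ConfigMap (assumption: true = allowed).
    now is local time; holidays are NOT checked in this sketch.
    """
    is_weekday = now.weekday() < 5            # Monday=0 ... Friday=4
    in_business_hours = 9 <= now.hour < 17    # 09:00-17:00 local
    return (staging_passes >= required_passes
            and chaos_enabled
            and is_weekday
            and in_business_hours)
```

A gate like this belongs in whatever job applies the experiment manifest, so that flipping the kill switch blocks new runs without editing the manifest itself.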

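Validation Checklist

Steady state and hypothesis are clearly defined
Blast radius is limited (environment, scope, timing)
Chaos tool (Chaos Mesh or Litmus) is installed and tested
Experiment ran successfully in staging
Results are recorded with metrics and analysis
Improvements were implemented based on findings
Follow-up experiment verified the fixes
Production experiments run only after 5+ successful staging runs

Common Pitfalls

No hypothesis: Running chaos experiments to "see what happens" wastes time. Always start from a hypothesis.
Scope too large: Killing every pod at once tests disaster recovery, not resilience. Start small.
Production first: Never run your first experiment in production. Always start in staging.
Ignoring results: Chaos without action is a formality. Fix the problems you uncover.
Alert fatigue: Chaos experiments trigger alerts. Add annotations in Grafana or silence the expected alerts.
No abort plan: When an experiment goes wrong you need a kill switch. Prepare it in advance.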

GitHub Repository

pjt222/agent-almanac

Path: i18n/zh-CN/skills/run-chaos-experiment


FAQ

What is the run-chaos-experiment skill?

run-chaos-experiment is a Claude Skill by pjt222. Skills package instructions and resources that Claude loads on demand, so Claude can perform run-chaos-experiment-related tasks without extra prompting.

How do I install run-chaos-experiment?

Use the install commands on this page: add run-chaos-experiment to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does run-chaos-experiment belong to?

run-chaos-experiment is in the Other category, tagged general.

Is run-chaos-experiment free to use?

Yes. run-chaos-experiment is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.
