run-chaos-experiment

Name: run-chaos-experiment
Author: pjt222

Updated 1 month ago

Category: Testing. Tags: ai, testing, design

About

This skill enables developers to design and execute chaos engineering experiments using Litmus or Chaos Mesh in Kubernetes. It performs controlled fault injection to test system resilience, validate failure hypotheses, and improve recovery processes. Use it before major launches, after architectural changes, or during resilience drills to proactively strengthen your system's reliability.

Quick Install

Claude Code (recommended)

npx skills add pjt222/agent-almanac -a claude-code

Alternative: plugin command

/plugin add https://github.com/pjt222/agent-almanac

Alternative: git clone

git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/run-chaos-experiment

Documentation

Run Chaos Experiment

Inject faults in a controlled way to test and improve system resilience.

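When to Use

Before a major release (alongside load testing)
After architectural changes (to verify resilience)
During a GameDay or disaster recovery drill
To validate a failure-mode hypothesis
As part of an SRE maturity program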

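Inputs

Required: a Kubernetes cluster with Litmus or Chaos Mesh
Required: a steady-state definition (what "normal" looks like)
Required: a hypothesis (e.g. "If one pod dies, the API remains available")
Optional: an observability stack (Prometheus, Grafana) to measure impact
Optional: a rollback plan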

Procedure

Step 1: Define Steady State and Hypothesis

Document normal behavior:

## Steady State Definition

### Service: API Gateway
- **Availability**: 99.9% (< 0.1% error rate)
- **Latency**: p95 < 200ms
- **Throughput**: 1000 req/s

## Hypothesis
"If one API pod is killed, the remaining pods will handle the load with <5s
disruption and no increase in error rate."

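Expected: a clear, measurable definition of normal behavior and of the success criteria.

On failure: if the steady state cannot be defined, observability is insufficient. Add metrics first.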
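
The steady-state definition above can also be kept as data, so that the same thresholds are checked before, during, and after an experiment. The following is a minimal sketch, not part of the original skill: the `SteadyState` class and `violations` function are hypothetical names, and treating 1000 req/s as a minimum throughput is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Thresholds taken from the API Gateway steady-state definition."""
    max_error_rate: float = 0.001        # error rate must stay below 0.1%
    max_p95_latency_ms: float = 200.0    # p95 latency must stay below 200 ms
    min_throughput_rps: float = 1000.0   # assumption: 1000 req/s is a floor


def violations(state, error_rate, p95_latency_ms, throughput_rps):
    """Return the names of the steady-state conditions that are not met."""
    failed = []
    if error_rate >= state.max_error_rate:
        failed.append("availability")
    if p95_latency_ms >= state.max_p95_latency_ms:
        failed.append("latency")
    if throughput_rps < state.min_throughput_rps:
        failed.append("throughput")
    return failed
```

An empty list means the system is in its steady state. If the list is non-empty before the experiment starts, the experiment should not run.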

Step 2: Limit the Blast Radius

Narrow the experiment to reduce risk:

# chaos-config.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: chaos-testing

Set safeguards:

## Blast Radius Controls
### Environment
- **Scope**: Staging only (first 5 runs)
- **Production**: Only after 5 successful staging runs
- **Timing**: Business hours (09:00-17:00 local)
### Auto-Abort Conditions
- Error rate >10% for >30 seconds

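Expected: the experiment is bounded and cannot take down the whole system.

On failure: if the blast radius is too large, narrow the scope. Start with a single non-critical service.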
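
The timing and auto-abort rules in the Blast Radius Controls can be made mechanical. This is a sketch under stated assumptions: the function names are hypothetical, error-rate samples are assumed to arrive as `(timestamp_seconds, error_rate)` pairs sorted by time, and how those samples are collected (for example from Prometheus) is left out.

```python
from datetime import datetime, time


def within_business_hours(now):
    """True if `now` (local time) falls inside the 09:00-17:00 window."""
    return time(9, 0) <= now.time() < time(17, 0)


def should_abort(samples, threshold=0.10, sustained_s=30.0):
    """True once the error rate has stayed above `threshold` for more than
    `sustained_s` seconds without dropping back in between."""
    breach_start = None
    for ts, rate in samples:
        if rate > threshold:
            if breach_start is None:
                breach_start = ts          # first sample of this breach
            if ts - breach_start > sustained_s:
                return True
        else:
            breach_start = None            # breach ended, reset the timer
    return False
```

A short spike that recovers within 30 seconds does not trigger an abort. Only a sustained breach does, which matches the "Error rate >10% for >30 seconds" condition.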

Step 3: Install Chaos Mesh

# Add Chaos Mesh Helm repo
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

# Install Chaos Mesh
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --create-namespace \
  --set dashboard.create=true \
  --set controllerManager.replicaCount=1

# Verify
kubectl get pods -n chaos-mesh

# Dashboard
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333

Alternative: Litmus (vendor-neutral):

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v2.14.0.yaml
kubectl get pods -n litmus

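Expected: Chaos Mesh or Litmus is running and the dashboard is reachable.

On failure: check RBAC. Chaos tooling needs cluster-level permissions.

Step 4: Create and Run the Experiment

Example: Pod Kill (Chaos Mesh):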

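# pod-kill-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: api-pod-kill-test
  namespace: chaos-testing
spec:
  action: pod-kill          # kill the selected pod once
  mode: one                 # pick a single pod from the matches
  selector:
    namespaces:
      - staging             # staging first, per the Blast Radius Controls
    labelSelectors:
      app: api-gateway
      chaos-enabled: "true" # opt-in label: only labeled pods are eligible
  duration: "30s"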

Apply the experiment:

kubectl apply -f pod-kill-experiment.yaml
kubectl get podchaos -n chaos-testing -w
kubectl describe podchaos api-pod-kill-test -n chaos-testing

Observe the impact in Grafana:

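# 5xx request rate for the API
rate(http_requests_total{status=~"5..", job="api"}[1m])
# p95 latency, aggregated across pods per histogram bucket
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[1m])))
# container restart rate for API pods
rate(kube_pod_container_status_restarts_total{pod=~"api-.*"}[5m])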

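Expected: the pod is killed, Kubernetes restarts it, and the service continues with only minor disruption.

On failure: if the error rate spikes or the service degrades, stop the experiment and investigate.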
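
To compare what was observed against the hypothesis, the peak error rate and the length of the disruption can be computed from the sampled error rate. The sketch below is illustrative only: `summarize_disruption` is a hypothetical helper, the samples are assumed to be a non-empty list of `(timestamp_seconds, error_rate)` pairs sorted by time, and the threshold is whatever the hypothesis defines as acceptable.

```python
def summarize_disruption(samples, threshold):
    """Return (peak_error_rate, disruption_seconds).

    The disruption runs from the first sample above `threshold` to the first
    later sample at or below it. It is 0.0 if the threshold was never
    exceeded and None if the service had not recovered by the last sample.
    """
    peak = max(rate for _, rate in samples)
    start = next((ts for ts, rate in samples if rate > threshold), None)
    if start is None:
        return peak, 0.0
    end = next((ts for ts, rate in samples
                if ts > start and rate <= threshold), None)
    if end is None:
        return peak, None
    return peak, end - start
```

The resolution of the result is limited by the sampling interval, so a hypothesis such as "<5s disruption" needs samples taken more often than every 5 seconds.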

Step 5: Analyze Results and Iterate

Write the experiment report:

# Chaos Experiment Report
**Hypothesis**: API stays available if one pod crashes
**Tool**: Chaos Mesh
## Results
- **Error Rate**: 0.1% → 2.3% (8s)
- **Recovery Time**: 8 seconds
## Hypothesis Outcome
**FAILED**: Error rate exceeded 1% threshold
## Improvements Made
1. Reduced readiness probe interval: 10s → 2s
2. Added pre-stop hook: 5-second sleep

Record the experiment in a log:

date,experiment,environment,status,error_rate_peak,recovery_time_s,outcome
2025-02-09,pod-kill-api,staging,complete,2.3%,8,failed

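Expected: lessons are recorded, fixes are applied, and a follow-up experiment is scheduled.

On failure: if no action follows the experiment, chaos engineering becomes theater. Prioritize the fixes.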
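
The experiment log can drive the "5 successful staging runs" gate from the Blast Radius Controls. This is a rough sketch: the function names are hypothetical, the column names come from the CSV header above, and the outcome value `passed` is an assumption, since the log excerpt only shows `failed`.

```python
import csv
import io


def consecutive_staging_passes(log_text):
    """Count the most recent unbroken run of passed staging experiments."""
    rows = list(csv.DictReader(io.StringIO(log_text)))
    streak = 0
    for row in reversed(rows):            # newest entries are last in the log
        if row["environment"] != "staging":
            continue                      # ignore non-staging runs
        if row["outcome"] != "passed":    # assumption: success is "passed"
            break
        streak += 1
    return streak


def ready_for_production(log_text, required=5):
    """True once the latest `required` staging runs have all passed."""
    return consecutive_staging_passes(log_text) >= required
```

With only the single failed entry shown above, the streak is zero and the gate stays closed.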

Step 6: Graduate to Production (Carefully)

Once staging experiments pass consistently:

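# Chaos Mesh 2.x schedules experiments with the Schedule kind;
# 1.x used spec.scheduler.cron inside the experiment itself.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: api-pod-kill-prod
  namespace: chaos-testing
spec:
  schedule: "0 10 * * 2"   # Tuesdays at 10:00
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - production
      labelSelectors:
        app: api-gateway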

Production safeguard:

kubectl create configmap chaos-killswitch \
  -n chaos-testing \
  --from-literal=enabled=true

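Expected: production experiments run in a low-risk window with an emergency stop ready.

On failure: if a production experiment causes an incident, disable it immediately and run a post-mortem.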
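
The steps above create the `chaos-killswitch` ConfigMap but do not show what reads it. One possible arrangement, assumed here rather than taken from the skill, is a wrapper script that checks the switch before applying any experiment. The sketch below parses the JSON printed by `kubectl get configmap chaos-killswitch -n chaos-testing -o json`; `chaos_enabled` is a hypothetical helper.

```python
import json


def chaos_enabled(configmap_json):
    """True only if the ConfigMap's data.enabled is "true" (case-insensitive).

    Fails closed: a missing `data` section or `enabled` key counts as disabled.
    """
    data = json.loads(configmap_json).get("data") or {}
    return data.get("enabled", "").strip().lower() == "true"
```

Failing closed is the design choice that matters: if the ConfigMap is deleted or malformed, experiments stop rather than continue.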

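Validation

Common Pitfalls

No hypothesis: "let's see what happens" wastes time. Always state a hypothesis.
Scope too broad: killing every pod is a disaster recovery test, not a resilience test. Start small.
Production first: never run the first experiment in production. Always start in staging.
Ignoring results: chaos without follow-up action is theater. Fix what you learn.
Alert fatigue: chaos experiments trigger alerts. Annotate them in Grafana or silence alerts for the experiment window.
No abort plan: you need an emergency stop when an experiment gets out of control. Prepare one.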
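
For the alert-fatigue pitfall, marking the experiment window on dashboards helps on-call engineers recognize chaos-induced alerts. The sketch below builds a request body for Grafana's annotations HTTP API (`POST /api/annotations`), which takes epoch-millisecond `time` and `timeEnd` fields. The function name, tag values, and text are illustrative assumptions, and sending the request with authentication is left out.

```python
from datetime import datetime, timedelta, timezone


def chaos_annotation(name, start, duration_s):
    """Build a region-annotation body covering the experiment window."""
    end = start + timedelta(seconds=duration_s)
    return {
        "time": int(start.timestamp() * 1000),     # epoch milliseconds
        "timeEnd": int(end.timestamp() * 1000),
        "tags": ["chaos", name],                   # hypothetical tag scheme
        "text": f"Chaos experiment: {name}",
    }


# Example window for the pod-kill test, starting now in UTC.
body = chaos_annotation("api-pod-kill-test", datetime.now(timezone.utc), 30)
```

Filtering dashboards or alert silences on the `chaos` tag then makes experiment windows easy to tell apart from real incidents.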

See Also

setup-prometheus-monitoring
configure-alerting-rules
define-slo-sli-sla

GitHub Repository

pjt222/agent-almanac

Path: i18n/wenyan-ultra/skills/run-chaos-experiment

FAQ

What is the run-chaos-experiment skill?

run-chaos-experiment is a Claude Skill by pjt222. Skills package instructions and resources that Claude loads on demand, so Claude can perform run-chaos-experiment-related tasks without extra prompting.

How do I install run-chaos-experiment?

Use the install commands on this page: add run-chaos-experiment to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does run-chaos-experiment belong to?

run-chaos-experiment is in the Testing category, tagged ai, testing and design.

Is run-chaos-experiment free to use?

Yes. run-chaos-experiment is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.
