run-ab-test-models

Name: run-ab-test-models
Author: pjt222

Updated 1 month ago

9 views

Category: Testing
Tags: ai, testing, design, data

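About

This skill enables A/B testing of ML models in production using traffic splitting and statistical significance testing. It supports canary and shadow deployments so you can compare model versions and measure business impact. Use it to validate a new model before full rollout or to make data-driven deployment decisions.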

Documentation

Run A/B Test for Models

See Extended Examples for complete configuration files and templates.

Run controlled experiments that compare model versions using traffic splitting and statistical analysis.

When to Use

Deploying a new model version that needs validation before full rollout
Comparing multiple candidate models (different algorithms or features)
Testing the impact of hyperparameter changes on business metrics
Measuring model performance in production without exposing full traffic to risk
Meeting regulatory requirements for gradual rollout (e.g., medical ML)
Evaluating cost-performance tradeoffs between model sizes

Inputs

Required: Champion model (the current production model)
Required: Challenger model(s) (the new version(s) to test)
Required: Traffic allocation percentage (e.g., 5% to the challenger)
Required: Success metrics (business and ML)
Required: Minimum sample size or test duration
Optional: Guardrail metrics (latency and error-rate thresholds)
Optional: User segments for stratified testing

Steps

Step 1: Design Experiment

Define the test parameters, success criteria, and statistical requirements.

# ab_test/experiment_config.py
from dataclasses import dataclass
from typing import List, Dict
import numpy as np
from scipy.stats import norm


@dataclass
# ... (see EXAMPLES.md for complete implementation)

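Got: An experiment config with a statistically sound sample size calculation, typically 5-10k samples per variant for a 5-10% minimum detectable effect (MDE); the exact figure depends on the baseline rate.

If fail: If the required sample size is too large, increase the traffic allocation, extend the duration, or accept a larger MDE; verify the baseline metric estimate; consider sequential testing.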
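The sample size calculation described in this step can be sketched as follows. This is a minimal illustration, not the implementation in EXAMPLES.md: it assumes a binary success metric (e.g., conversion) analyzed with a two-sided two-proportion z-test, and it uses the standard library's `NormalDist` in place of `scipy.stats.norm`. The function names and parameters are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(
    baseline_rate: float,
    relative_mde: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Samples needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)  # smallest rate we want to detect
    p_bar = (p1 + p2) / 2

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power

    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def days_to_complete(n_per_variant: int, daily_requests: int, challenger_share: float) -> int:
    """Days until the smaller (challenger) arm reaches n_per_variant."""
    return math.ceil(n_per_variant / (daily_requests * challenger_share))


if __name__ == "__main__":
    n = sample_size_per_variant(baseline_rate=0.20, relative_mde=0.10)
    print(n, days_to_complete(n, daily_requests=10_000, challenger_share=0.05))
```

With a 20% baseline and a 10% relative MDE this lands at roughly 6.5k samples per variant. Halving the MDE roughly quadruples the requirement, which is why a small challenger allocation can stretch a test over many weeks.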

Step 2: Implement Traffic Splitting

Set up routing that randomly assigns requests to models.

# ab_test/traffic_router.py
import hashlib
import random
from typing import Dict, Optional
from dataclasses import dataclass
import logging

logger = logging.getLogger(__name__)
# ... (see EXAMPLES.md for complete implementation)

Got: Consistent user-to-variant assignment, a traffic split that matches the configured percentage, and all assignments logged.

If fail: Verify that the hash is uniform (test with 10k user IDs), check that user_id is stable across requests (do not use session_id), confirm that logs capture all predictions, and validate the split over the first 1,000 requests.
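One way to get consistent assignment is to hash a stable user ID into a bucket, as sketched below. The function names, the `experiment_id` salt, and the 10,000-bucket granularity are assumptions made for illustration; the actual router lives in EXAMPLES.md.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str, challenger_pct: float) -> str:
    """Deterministically map a user to 'champion' or 'challenger'.

    challenger_pct is a percentage, e.g. 5.0 for a 95/5 split.
    """
    # Salting with the experiment ID keeps assignments independent across experiments.
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # 0..9999, i.e. 0.01% granularity
    return "challenger" if bucket < challenger_pct * 100 else "champion"


def observed_challenger_share(n_users: int, experiment_id: str, challenger_pct: float) -> float:
    """Uniformity check: fraction of synthetic user IDs routed to the challenger."""
    hits = sum(
        assign_variant(f"user-{i}", experiment_id, challenger_pct) == "challenger"
        for i in range(n_users)
    )
    return hits / n_users


if __name__ == "__main__":
    print(assign_variant("user-42", "exp-001", 5.0))
    print(observed_challenger_share(10_000, "exp-001", 5.0))
```

Because the assignment depends only on the user ID and the experiment ID, no assignment table is needed, and the 10k-ID uniformity check mentioned above reduces to a single call to `observed_challenger_share`.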

Step 3: Implement Shadow Deployment (Optional)

Run the challenger in parallel without affecting users (shadow mode).

# ab_test/shadow_deployment.py
import asyncio
from typing import Dict, Any
import logging
from concurrent.futures import ThreadPoolExecutor
import time

logger = logging.getLogger(__name__)
# ... (see EXAMPLES.md for complete implementation)

Got: The champion is served at normal latency, the challenger is logged asynchronously without blocking, and prediction differences are captured.

If fail: Set the challenger timeout below the champion SLA, handle challenger errors gracefully, monitor memory (two models are loaded), and consider sampling (e.g., log 10% of shadow predictions).
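The shadow pattern can be sketched with a thread pool: the champion answers the request, and the challenger runs in the background with its errors swallowed. The `ShadowServer` class and its fields are hypothetical, the models are assumed to be plain callables, and timeouts and sampling are left out for brevity.

```python
import logging
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, List

logger = logging.getLogger("shadow")


class ShadowServer:
    """Serve the champion; run the challenger off the request path."""

    def __init__(self, champion: Callable[[Any], Any], challenger: Callable[[Any], Any],
                 max_workers: int = 2) -> None:
        self.champion = champion
        self.challenger = challenger
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self.comparisons: List[Dict[str, Any]] = []

    def predict(self, features: Any) -> Any:
        result = self.champion(features)  # the user only ever sees this value
        future = self.pool.submit(self.challenger, features)
        future.add_done_callback(lambda f: self._record(features, result, f))
        return result

    def _record(self, features: Any, champion_result: Any, future: Future) -> None:
        try:
            shadow_result = future.result()
        except Exception as exc:  # a challenger failure must never reach the user
            logger.warning("shadow challenger failed: %s", exc)
            return
        self.comparisons.append({
            "features": features,
            "champion": champion_result,
            "challenger": shadow_result,
            "diverged": champion_result != shadow_result,
        })

    def close(self) -> None:
        self.pool.shutdown(wait=True)  # wait for pending shadow predictions


if __name__ == "__main__":
    server = ShadowServer(lambda x: x * 2, lambda x: x * 2 + 1)
    print(server.predict(10))
    server.close()
    print(server.comparisons)
```

The divergence rate is the share of entries in `comparisons` with `diverged` set, which is the figure the <5% divergence check refers to.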

Step 4: Collect and Analyze Metrics

Gather the data and run statistical tests.

# ab_test/analysis.py
import pandas as pd
import numpy as np
from scipy import stats
from typing import Dict, Tuple
import logging

logger = logging.getLogger(__name__)
# ... (see EXAMPLES.md for complete implementation)

Got: Statistical test results with p-values and confidence intervals (CIs), plus a clear decision (rollout, keep, or inconclusive), typically after 7-14 days or once the target sample size is reached.

If fail: Verify that ground truth labels are available (analysis may need to be delayed), check for sample ratio mismatch (SRM, which indicates assignment bugs), confirm the sample size is sufficient, look for novelty or primacy effects in early data, and consider sequential testing if a fixed-horizon test is too slow.
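For a binary success metric, the statistical test in this step could look like the two-proportion z-test below. This is a sketch under that assumption, using only the standard library; the function name and return keys are hypothetical, and continuous metrics would need a different test (e.g., a t-test).

```python
import math
from statistics import NormalDist
from typing import Dict


def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int,
                         alpha: float = 0.05) -> Dict[str, float]:
    """Compare challenger (b) against champion (a) on a conversion-style metric."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # The pooled standard error is used for the hypothesis test (H0: p_a == p_b).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled if se_pooled > 0 else 0.0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # The unpooled standard error is used for the confidence interval on the difference.
    se_unpooled = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se_unpooled

    return {
        "diff": diff,
        "z": z,
        "p_value": p_value,
        "ci_low": diff - margin,
        "ci_high": diff + margin,
    }


if __name__ == "__main__":
    print(two_proportion_ztest(conv_a=1000, n_a=10_000, conv_b=1150, n_b=10_000))
```

Reporting the interval alongside the p-value shows how large the lift plausibly is, not only whether it differs from zero.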

Step 5: Monitor Guardrail Metrics

Continuously check that the challenger does not violate safety thresholds.

# ab_test/guardrails.py
import pandas as pd
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)


# ... (see EXAMPLES.md for complete implementation)

Got: Guardrail violations are detected within 5-15 min, the experiment stops automatically if critical thresholds (latency, errors) are breached, and alerts go to the team.

If fail: Verify that thresholds are realistic (not too tight), confirm the monitoring loop runs continuously, check that stop_experiment() updates routing, and test alert delivery.
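A single pass of the guardrail check might be structured as below. The `Guardrail` dataclass, the metric names, and the `should_stop` flag are hypothetical; in a real setup the monitoring loop would call this periodically and pass the flag on to whatever `stop_experiment()` does in EXAMPLES.md.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Guardrail:
    metric: str             # key in the challenger's metrics snapshot
    max_value: float        # violation when the observed value exceeds this
    critical: bool = False  # critical violations should stop the experiment


def check_guardrails(metrics: Dict[str, float], guardrails: List[Guardrail]) -> Dict[str, object]:
    """Return violated guardrails and whether the experiment should stop."""
    # A missing metric counts as a violation: a silent monitoring gap is unsafe.
    violations = [
        g for g in guardrails
        if metrics.get(g.metric, float("inf")) > g.max_value
    ]
    return {
        "violations": [g.metric for g in violations],
        "should_stop": any(g.critical for g in violations),
    }


if __name__ == "__main__":
    rails = [
        Guardrail("p95_latency_ms", 200.0, critical=True),
        Guardrail("error_rate", 0.01, critical=True),
        Guardrail("cpu_utilization", 0.80),
    ]
    snapshot = {"p95_latency_ms": 250.0, "error_rate": 0.004, "cpu_utilization": 0.6}
    print(check_guardrails(snapshot, rails))
```

Separating critical from non-critical guardrails lets minor breaches raise an alert without halting the test.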

Step 6: Make Rollout Decision

Based on the results, decide whether to roll out the challenger.

# ab_test/rollout_decision.py
import logging
from typing import Dict
from dataclasses import dataclass

logger = logging.getLogger(__name__)


# ... (see EXAMPLES.md for complete implementation)

Got: A clear decision (full or gradual rollout, keep champion, or extend test) with justification and action items.

If fail: If the decision is unclear, run subgroup analysis (segment, time, device), check for interaction effects, review the business context (is a 2% lift worth the engineering cost?), and consult stakeholders.
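The decision logic can be made explicit as a small rule set, sketched here. The rule ordering, the `Decision` dataclass, and the `min_practical_lift` threshold are assumptions for illustration; real decisions would also weigh segment results and business context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    action: str  # "rollout", "keep_champion", or "extend_test"
    reason: str


def decide_rollout(ci_low: float, ci_high: float, n_per_variant: int, required_n: int,
                   guardrails_ok: bool, min_practical_lift: float = 0.0) -> Decision:
    """Turn test results into a documented decision.

    ci_low and ci_high bound the lift (challenger minus champion) on the primary metric.
    """
    if not guardrails_ok:
        return Decision("keep_champion", "a guardrail metric was violated")
    if n_per_variant < required_n:
        # Deciding before the planned sample size is reached would be peeking.
        return Decision("extend_test", "planned sample size not yet reached")
    if ci_low > min_practical_lift:
        return Decision("rollout", "the whole interval is above the practical threshold")
    if ci_high < min_practical_lift:
        return Decision("keep_champion", "the whole interval is below the practical threshold")
    return Decision("extend_test", "inconclusive: the interval spans the practical threshold")


if __name__ == "__main__":
    print(decide_rollout(ci_low=0.006, ci_high=0.024, n_per_variant=10_000,
                         required_n=6_500, guardrails_ok=True, min_practical_lift=0.005))
```

Setting `min_practical_lift` above zero encodes the question of whether a small lift is worth the engineering cost directly into the rule.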

Checks

Traffic split matches the configured percentage (within 1%)
The same user is always assigned to the same variant
Sample size calculation is reasonable (5-50k per variant)
Statistical tests produce p-values consistent with manual calculation
Guardrail violations trigger alerts within 5 min
Shadow deployment shows <5% prediction divergence
Reports include confidence intervals
Rollout decision is documented

Pitfalls

Sample ratio mismatch (SRM): The observed split differs from the configured split (e.g., 95/5 becomes 92/8), which indicates an assignment bug; check hash uniformity.
Peeking: Checking results before the planned sample size is reached inflates Type I error; use sequential testing or wait for the pre-set end date.
Novelty effect: Users respond differently to a new model at first; run for 2+ weeks to reach steady state.
Carryover effects: Exposure to a previous variant affects current behavior; use new users or a washout period.
Multiple testing: Testing many metrics raises the false positive risk; apply a Bonferroni correction or commit to a single primary metric.
Insufficient power: A small allocation can take months to detect an effect; balance power against risk.
Ignoring segments: An aggregate lift can hide negative effects on important segments; run subgroup analysis.
Attribution errors: Make sure outcome metrics are attributable to the predictions and not to other system changes.
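The SRM pitfall above can be checked with a chi-square goodness-of-fit test on the observed arm counts. The sketch below assumes exactly two arms, so the test has one degree of freedom and the p-value can be computed from the normal distribution without SciPy. The function name is hypothetical, and the very strict threshold in the usage example (p < 0.001) is a common convention rather than something stated in the source.

```python
import math
from statistics import NormalDist


def srm_p_value(n_champion: int, n_challenger: int, expected_challenger_share: float) -> float:
    """P-value of a chi-square goodness-of-fit test for a two-arm split."""
    total = n_champion + n_challenger
    exp_challenger = total * expected_challenger_share
    exp_champion = total - exp_challenger

    chi2 = (
        (n_champion - exp_champion) ** 2 / exp_champion
        + (n_challenger - exp_challenger) ** 2 / exp_challenger
    )
    # With 1 degree of freedom, chi2 is the square of a standard normal variable,
    # so the upper-tail probability is the two-sided normal tail at sqrt(chi2).
    return 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))


if __name__ == "__main__":
    # Configured 95/5, observed 92/8 on 10,000 requests.
    p = srm_p_value(n_champion=9_200, n_challenger=800, expected_challenger_share=0.05)
    print("SRM suspected" if p < 0.001 else "split looks consistent", p)
```

A tiny p-value here means the split itself is broken, so metric comparisons from that experiment should not be trusted until the assignment bug is found.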

GitHub Repository

pjt222/agent-almanac

Path: i18n/caveman/skills/run-ab-test-models

Tags: agents, agentskills, ai-assisted-development, claude-code, skills, teams

Frequently asked questions

What is the run-ab-test-models skill?

run-ab-test-models is a Claude Skill by pjt222. Skills package instructions and resources that Claude loads on demand, so Claude can perform run-ab-test-models-related tasks without extra prompting.

How do I install run-ab-test-models?

Use the install commands on this page: add run-ab-test-models to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does run-ab-test-models belong to?

run-ab-test-models is in the Testing category, tagged ai, testing, design and data.

Is run-ab-test-models free to use?

Yes. run-ab-test-models is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.
