statistical-analysis
Info
This skill provides guided statistical analysis, including test selection, assumption checking, and APA-style reporting. It is well suited to academic research that needs help choosing an appropriate test (t-test, ANOVA, regression, and so on) and interpreting the results. For programmatic model implementation, developers should use statsmodels.
Quick install
Claude Code
Recommended: npx skills add K-Dense-AI/claude-scientific-skills -a claude-code
/plugin add https://github.com/K-Dense-AI/claude-scientific-skills
git clone https://github.com/K-Dense-AI/claude-scientific-skills.git ~/.claude/skills/statistical-analysis
Copy and paste this command into Claude Code to install the skill.
Documentation
Statistical Analysis
Overview
Statistical analysis is a systematic process for testing hypotheses and quantifying relationships. Conduct hypothesis tests (t-test, ANOVA, chi-square), regression, correlation, and Bayesian analyses with assumption checks and APA reporting. Apply this skill for academic research.
When to Use This Skill
This skill should be used when:
- Conducting statistical hypothesis tests (t-tests, ANOVA, chi-square)
- Performing regression or correlation analyses
- Running Bayesian statistical analyses
- Checking statistical assumptions and diagnostics
- Calculating effect sizes and conducting power analyses
- Reporting statistical results in APA format
- Analyzing experimental or observational data for research
Installation
Use uv to install the libraries used in this skill. Pin versions in production; unpinned installs are fine for exploration.
# Core frequentist stack (Python 3.10+; 3.12+ recommended for latest SciPy/ArviZ)
uv pip install "pingouin>=0.6" "scipy>=1.11" "statsmodels>=0.14.6" pandas matplotlib seaborn
# Bayesian modeling (PyMC 5 + ArviZ; ArviZ 0.23+ requires Python 3.12+)
uv pip install "pymc>=5.0" "arviz>=0.17"
Compatibility notes (2025–2026):
- Pingouin 0.5+ renamed output columns (p_val, cohen_d, CI95, p_unc); examples below use the current names.
- statsmodels + SciPy: use statsmodels>=0.14.6 with scipy>=1.11 to avoid _lazywhere import errors on SciPy 1.16+.
- Pingouin Bayes Factors: one-sided BF for t-tests was removed in 0.5+; use dedicated packages (e.g. JASP, BayesFactor via R) or PyMC for hypothesis testing.
For model-specific APIs (OLS, GLM, ARIMA), see the statsmodels skill. For PyMC workflows, see the pymc skill.
Core Capabilities
1. Test Selection and Planning
- Choose appropriate statistical tests based on research questions and data characteristics
- Conduct a priori power analyses to determine required sample sizes
- Plan analysis strategies including multiple comparison corrections
2. Assumption Checking
- Automatically verify all relevant assumptions before running tests
- Provide diagnostic visualizations (Q-Q plots, residual plots, box plots)
- Recommend remedial actions when assumptions are violated
3. Statistical Testing
- Hypothesis testing: t-tests, ANOVA, chi-square, non-parametric alternatives
- Regression: linear, multiple, logistic, with diagnostics
- Correlations: Pearson, Spearman, with confidence intervals
- Bayesian alternatives: Bayesian t-tests, ANOVA, regression with Bayes Factors
4. Effect Sizes and Interpretation
- Calculate and interpret appropriate effect sizes for all analyses
- Provide confidence intervals for effect estimates
- Distinguish statistical from practical significance
5. Professional Reporting
- Generate APA-style statistical reports
- Create publication-ready figures and tables
- Provide complete interpretation with all required statistics
Workflow Decision Tree
Use this decision tree to determine your analysis path:
START
│
├─ Need to SELECT a statistical test?
│ └─ YES → See "Test Selection Guide"
│ └─ NO → Continue
│
├─ Ready to check ASSUMPTIONS?
│ └─ YES → See "Assumption Checking"
│ └─ NO → Continue
│
├─ Ready to run ANALYSIS?
│ └─ YES → See "Running Statistical Tests"
│ └─ NO → Continue
│
└─ Need to REPORT results?
└─ YES → See "Reporting Results"
Test Selection Guide
Quick Reference: Choosing the Right Test
Use references/test_selection_guide.md for comprehensive guidance. Quick reference:
Comparing Two Groups:
- Independent, continuous, normal → Independent t-test
- Independent, continuous, non-normal → Mann-Whitney U test
- Paired, continuous, normal → Paired t-test
- Paired, continuous, non-normal → Wilcoxon signed-rank test
- Binary outcome → Chi-square or Fisher's exact test
Comparing 3+ Groups:
- Independent, continuous, normal → One-way ANOVA
- Independent, continuous, non-normal → Kruskal-Wallis test
- Paired, continuous, normal → Repeated measures ANOVA
- Paired, continuous, non-normal → Friedman test
Relationships:
- Two continuous variables → Pearson (normal) or Spearman correlation (non-normal)
- Continuous outcome with predictor(s) → Linear regression
- Binary outcome with predictor(s) → Logistic regression
Bayesian Alternatives: All tests have Bayesian versions that provide:
- Direct probability statements about hypotheses
- Bayes Factors quantifying evidence
- Ability to support null hypothesis
- See references/bayesian_statistics.md
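The two-group and 3+ group rules in the quick reference can be sketched as a small lookup table. This is only an illustration for continuous outcomes: the function name `choose_test` and its three arguments are hypothetical and are not part of the skill's bundled scripts.

```python
def choose_test(n_groups: int, paired: bool, normal: bool) -> str:
    """Return the quick-reference test for a continuous outcome.

    Hypothetical helper: it encodes only the group-comparison rules
    listed in the quick reference, nothing more.
    """
    if n_groups < 2:
        raise ValueError("n_groups must be 2 or more")
    two_groups = n_groups == 2
    table = {
        # (two_groups, paired, normal) -> recommended test
        (True, False, True): "Independent t-test",
        (True, False, False): "Mann-Whitney U test",
        (True, True, True): "Paired t-test",
        (True, True, False): "Wilcoxon signed-rank test",
        (False, False, True): "One-way ANOVA",
        (False, False, False): "Kruskal-Wallis test",
        (False, True, True): "Repeated measures ANOVA",
        (False, True, False): "Friedman test",
    }
    return table[(two_groups, paired, normal)]


print(choose_test(n_groups=2, paired=False, normal=True))
print(choose_test(n_groups=4, paired=True, normal=False))
```

A lookup like this is useful as a sanity check, but the full guide in references/test_selection_guide.md should drive the final choice.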
Assumption Checking
Systematic Assumption Verification
ALWAYS check assumptions before interpreting test results.
Use the bundled scripts/assumption_checks.py module for automated checking. Run Python from the skill directory (skills/statistical-analysis/) or add scripts/ to sys.path:
from assumption_checks import comprehensive_assumption_check

# Comprehensive check with visualizations
results = comprehensive_assumption_check(
    data=df,
    value_col='score',
    group_col='group',  # Optional: for group comparisons
    alpha=0.05
)
This performs:
- Outlier detection (IQR and z-score methods)
- Normality testing (Shapiro-Wilk test + Q-Q plots)
- Homogeneity of variance (Levene's test + box plots)
- Interpretation and recommendations
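To make the two outlier rules concrete, here is a minimal standard-library sketch of the IQR fence and the z-score cut-off. It is not the code in scripts/assumption_checks.py (whose internals are not shown here); the function names, the 1.5 multiplier, and the thresholds are assumptions chosen for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) / sd > threshold]


scores = [10, 12, 11, 13, 12, 11, 14, 13, 12, 40]  # made-up data
print(iqr_outliers(scores))
print(zscore_outliers(scores))                 # default threshold of 3
print(zscore_outliers(scores, threshold=2.5))  # looser threshold
```

In this made-up sample the IQR rule flags 40 while the default z-score rule flags nothing: with n = 10 the largest possible |z| based on the sample SD is (n - 1)/√n ≈ 2.85, so a cut-off of 3 can never trigger. This is one reason to run both methods on small samples.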
Individual Assumption Checks
For targeted checks, use individual functions:
from assumption_checks import (
    check_normality,
    check_normality_per_group,
    check_homogeneity_of_variance,
    check_linearity,
    detect_outliers
)

# Example: Check normality with visualization
result = check_normality(
    data=df['score'],
    name='Test Score',
    alpha=0.05,
    plot=True
)
print(result['interpretation'])
print(result['recommendation'])
What to Do When Assumptions Are Violated
Normality violated:
- Mild violation + n > 30 per group → Proceed with parametric test (robust)
- Moderate violation → Use non-parametric alternative
- Severe violation → Transform data or use non-parametric test
Homogeneity of variance violated:
- For t-test → Use Welch's t-test
- For ANOVA → Use Welch's ANOVA or Brown-Forsythe ANOVA
- For regression → Use robust standard errors or weighted least squares
Linearity violated (regression):
- Add polynomial terms
- Transform variables
- Use non-linear models or GAM
See references/assumptions_and_diagnostics.md for comprehensive guidance.
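As an illustration of the Welch remedy for unequal variances, the sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom by hand. The function name `welch_t` is hypothetical; in practice scipy.stats.ttest_ind(a, b, equal_var=False) returns the statistic together with a p-value.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    n1, n2 = len(a), len(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)
    se2 = v1 / n1 + v2 / n2  # squared standard error of the difference
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


# Equal variances and equal n: df equals the pooled value n1 + n2 - 2
print(welch_t([1, 2, 3, 4, 5], [3, 4, 5, 6, 7]))
# Very unequal variances: df shrinks toward the smaller group's n - 1
print(welch_t([1, 2, 3, 4, 5], [0, 5, 10, 15, 20]))
```

The shrinking degrees of freedom are what make the test more conservative when one group is much noisier than the other.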
Running Statistical Tests
Python Libraries
Primary libraries for statistical analysis:
- scipy.stats: Core statistical tests
- statsmodels: Advanced regression and diagnostics
- pingouin: User-friendly statistical testing with effect sizes
- pymc: Bayesian statistical modeling
- arviz: Bayesian visualization and diagnostics
Example Analyses
T-Test with Complete Reporting
import pingouin as pg

# Run independent t-test (group_a and group_b are array-likes of scores)
result = pg.ttest(group_a, group_b, correction='auto')

# Extract results (Pingouin 0.5+ column names)
t_stat = result['T'].values[0]
dof = result['dof'].values[0]  # named dof so it does not shadow a DataFrame called df
p_value = result['p_val'].values[0]
cohens_d = result['cohen_d'].values[0]
# Pingouin's CI95 column is the interval for the difference in means, not for d
ci_lower, ci_upper = result['CI95'].values[0]

# Report
print(f"t({dof:.0f}) = {t_stat:.2f}, p = {p_value:.3f}")
print(f"Cohen's d = {cohens_d:.2f}")
print(f"Mean difference 95% CI [{ci_lower:.2f}, {ci_upper:.2f}]")
ANOVA with Post-Hoc Tests
import pingouin as pg

# One-way ANOVA (df is a long-format DataFrame with 'score' and 'group' columns)
aov = pg.anova(dv='score', between='group', data=df, detailed=True)
print(aov)

# If significant, conduct post-hoc tests
if aov['p_unc'].values[0] < 0.05:
    posthoc = pg.pairwise_tukey(dv='score', between='group', data=df)
    print(posthoc)

# Effect size
eta_squared = aov['np2'].values[0]  # Partial eta-squared
print(f"Partial η² = {eta_squared:.3f}")
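To show where the F statistic and the effect size come from, the following sketch computes a one-way ANOVA from sums of squares using only plain Python. The function name `one_way_anova` and the toy data are illustrative assumptions; for a one-way design, partial η² equals η² = SS_between / SS_total.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within, eta_squared) for a one-way design."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of observations around their own group mean
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, eta_squared


f_stat, df_b, df_w, eta_sq = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}, η² = {eta_sq:.2f}")
```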
Linear Regression with Diagnostics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Fit model (X_predictors is a DataFrame of predictors, y is the outcome)
X = sm.add_constant(X_predictors)  # Add intercept
model = sm.OLS(y, X).fit()

# Summary
print(model.summary())

# Check multicollinearity (VIF); the value for the 'const' row is not meaningful
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)

# Check assumptions
residuals = model.resid
fitted = model.fittedvalues

# Residual plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Residuals vs fitted
axes[0, 0].scatter(fitted, residuals, alpha=0.6)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')

# Q-Q plot
stats.probplot(residuals, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q')

# Scale-Location (residuals scaled by their overall SD)
axes[1, 0].scatter(fitted, np.sqrt(np.abs(residuals / residuals.std())), alpha=0.6)
axes[1, 0].set_xlabel('Fitted values')
axes[1, 0].set_ylabel('√|Standardized residuals|')
axes[1, 0].set_title('Scale-Location')

# Residuals histogram
axes[1, 1].hist(residuals, bins=20, edgecolor='black', alpha=0.7)
axes[1, 1].set_xlabel('Residuals')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Histogram of Residuals')

plt.tight_layout()
plt.show()
Bayesian T-Test
import pymc as pm
import arviz as az
import numpy as np

# group_a and group_b are 1-D arrays of observed scores
with pm.Model() as model:
    # Priors
    mu1 = pm.Normal('mu_group1', mu=0, sigma=10)
    mu2 = pm.Normal('mu_group2', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=10)

    # Likelihood
    y1 = pm.Normal('y1', mu=mu1, sigma=sigma, observed=group_a)
    y2 = pm.Normal('y2', mu=mu2, sigma=sigma, observed=group_b)

    # Derived quantity
    diff = pm.Deterministic('difference', mu1 - mu2)

    # Sample
    trace = pm.sample(2000, tune=1000, return_inferencedata=True)

# Summarize
print(az.summary(trace, var_names=['difference']))

# Probability that group1 > group2
prob_greater = np.mean(trace.posterior['difference'].values > 0)
print(f"P(μ₁ > μ₂ | data) = {prob_greater:.3f}")

# Plot posterior
az.plot_posterior(trace, var_names=['difference'], ref_val=0)
Effect Sizes
Always Calculate Effect Sizes
Effect sizes quantify the magnitude of an effect, whereas a p-value only indicates how compatible the data are with the null hypothesis.
See references/effect_sizes_and_power.md for comprehensive guidance.
Quick Reference: Common Effect Sizes
| Test | Effect Size | Small | Medium | Large |
|---|---|---|---|---|
| T-test | Cohen's d | 0.20 | 0.50 | 0.80 |
| ANOVA | η²_p | 0.01 | 0.06 | 0.14 |
| Correlation | r | 0.10 | 0.30 | 0.50 |
| Regression | R² | 0.02 | 0.13 | 0.26 |
| Chi-square | Cramér's V | 0.07 | 0.21 | 0.35 |
Important: Benchmarks are guidelines. Context matters!
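As a worked example of the Cohen's d row, the sketch below computes d from summary statistics using the pooled standard deviation and attaches the benchmark label. The function names and the "negligible" label for |d| < 0.20 are assumptions added for illustration; the table itself only defines small, medium, and large.

```python
import math


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d for two independent groups, using the pooled SD."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)


def label_d(d):
    """Map |d| onto the benchmark labels from the table."""
    size = abs(d)
    if size < 0.20:
        return "negligible"  # assumed label; not in the table
    if size < 0.50:
        return "small"
    if size < 0.80:
        return "medium"
    return "large"


# Hypothetical summary statistics: means 55 vs 50, SD 10, n = 20 per group
d = cohens_d(55, 10, 20, 50, 10, 20)
print(f"d = {d:.2f} ({label_d(d)})")
```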
Calculating Effect Sizes
Most effect sizes are automatically calculated by pingouin:
# T-test returns Cohen's d
result = pg.ttest(x, y)
d = result['cohen_d'].values[0]
# ANOVA returns partial eta-squared
aov = pg.anova(dv='score', between='group', data=df)
eta_p2 = aov['np2'].values[0]
# Correlation: r is already an effect size
corr = pg.corr(x, y)
r = corr['r'].values[0]
Confidence Intervals for Effect Sizes
Always report CIs to show precision:
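import pingouin as pg

# For t-test: convert t to Cohen's d, then compute its confidence interval
# (t_statistic, group1, and group2 come from the preceding analysis)
d = pg.compute_effsize_from_t(
    t_statistic,
    nx=len(group1),
    ny=len(group2),
    eftype='cohen'
)
# compute_effsize_from_t returns only the point estimate;
# compute_esci returns the [lower, upper] bounds
ci = pg.compute_esci(stat=d, nx=len(group1), ny=len(group2), eftype='cohen', confidence=0.95)
print(f"d = {d:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")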
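For a correlation, one common way to obtain a confidence interval is the Fisher z-transformation. The sketch below uses only the standard library; the function name `pearson_r_ci` is hypothetical, and the method assumes approximately bivariate-normal data.

```python
import math
from statistics import NormalDist


def pearson_r_ci(r, n, confidence=0.95):
    """Approximate CI for Pearson's r via the Fisher z-transformation."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)            # Fisher z
    se = 1 / math.sqrt(n - 3)    # standard error on the z scale
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Back-transform the bounds to the r scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


lower, upper = pearson_r_ci(r=0.5, n=30)
print(f"r = 0.50, 95% CI [{lower:.2f}, {upper:.2f}]")
```

The interval is asymmetric around r because the back-transformation compresses values near ±1.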
Power Analysis
A Priori Power Analysis (Study Planning)
Determine required sample size before data collection:
from math import ceil
from statsmodels.stats.power import (
    tt_ind_solve_power,
    FTestAnovaPower
)

# T-test: What n is needed to detect d = 0.5?
n_required = tt_ind_solve_power(
    effect_size=0.5,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative='two-sided'
)
print(f"Required n per group: {ceil(n_required)}")

# ANOVA: What n is needed to detect f = 0.25?
k_groups = 3
anova_power = FTestAnovaPower()
# solve_power takes k_groups (not ngroups) and solves for the TOTAL sample size
n_total = anova_power.solve_power(
    effect_size=0.25,
    k_groups=k_groups,
    alpha=0.05,
    power=0.80
)
print(f"Required n per group: {ceil(n_total / k_groups)}")
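As a rough cross-check on a solver result for the two-group case, the normal-approximation formula n ≈ 2((z₁₋α/₂ + z_power) / d)² can be evaluated with the standard library alone. This is a sketch under the assumption of equal group sizes and a two-sided test; the function name is hypothetical, and because it ignores the t-distribution it tends to come out about one participant per group lower than the exact solver.

```python
import math
from statistics import NormalDist


def approx_n_per_group(d, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided t-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


for effect in (0.2, 0.5, 0.8):
    print(f"d = {effect}: about {approx_n_per_group(effect)} per group")
```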
Sensitivity Analysis (Post-Study)
Determine what effect size you could detect:
# With n=50 per group, what effect could we detect?
detectable_d = tt_ind_solve_power(
    effect_size=None,  # Solve for this
    nobs1=50,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative='two-sided'
)
print(f"Study could detect d ≥ {detectable_d:.2f}")
Note: Post-hoc power analysis (calculating power after study) is generally not recommended. Use sensitivity analysis instead.
See references/effect_sizes_and_power.md for detailed guidance.
Reporting Results
APA Style Statistical Reporting
Follow guidelines in references/reporting_standards.md.
Essential Reporting Elements
- Descriptive statistics: M, SD, n for all groups/variables
- Test statistics: Test name, statistic, df, exact p-value
- Effect sizes: With confidence intervals
- Assumption checks: Which tests were done, results, actions taken
- All planned analyses: Including non-significant findings
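The reporting elements above can be assembled with a small formatter. The sketch below follows the usual APA conventions of dropping the leading zero for p and reporting p < .001 for very small values; the helper names `format_p` and `format_t` are hypothetical and are not part of the skill.

```python
def format_p(p):
    """Format a p-value APA-style: no leading zero, floor at p < .001."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    if p < 0.001:
        return "p < .001"
    return "p = " + f"{p:.3f}".lstrip("0")


def format_t(t, df, p, d):
    """Format a t-test result with its effect size."""
    return f"t({df:g}) = {t:.2f}, {format_p(p)}, d = {d:.2f}"


print(format_t(t=3.82, df=98, p=0.0002, d=0.77))
print(format_p(0.0314))
```

A formatter like this keeps exact p-values in the report while avoiding output such as "p = 0.000".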
Example Report Templates
Independent T-Test
Group A (n = 48, M = 75.2, SD = 8.5) scored significantly higher than
Group B (n = 52, M = 68.3, SD = 9.2), t(98) = 3.82, p < .001, d = 0.77,
95% CI [0.36, 1.18], two-tailed. Assumptions of normality (Shapiro-Wilk:
Group A W = 0.97, p = .18; Group B W = 0.96, p = .12) and homogeneity
of variance (Levene's F(1, 98) = 1.23, p = .27) were satisfied.
One-Way ANOVA
A one-way ANOVA revealed a significant main effect of treatment condition
on test scores, F(2, 147) = 8.45, p < .001, η²_p = .10. Post hoc
comparisons using Tukey's HSD indicated that Condition A (M = 78.2,
SD = 7.3) scored significantly higher than Condition B (M = 71.5,
SD = 8.1, p = .002, d = 0.87) and Condition C (M = 70.1, SD = 7.9,
p < .001, d = 1.07). Conditions B and C did not differ significantly
(p = .52, d = 0.18).
Multiple Regression
Multiple linear regression was conducted to predict exam scores from
study hours, prior GPA, and attendance. The overall model was significant,
F(3, 146) = 45.2, p < .001, R² = .48, adjusted R² = .47. Study hours
(B = 1.80, SE = 0.31, β = .35, t = 5.78, p < .001, 95% CI [1.18, 2.42])
and prior GPA (B = 8.52, SE = 1.95, β = .28, t = 4.37, p < .001,
95% CI [4.66, 12.38]) were significant predictors, while attendance was
not (B = 0.15, SE = 0.12, β = .08, t = 1.25, p = .21, 95% CI [-0.09, 0.39]).
Multicollinearity was not a concern (all VIF < 1.5).
Bayesian Analysis
A Bayesian independent samples t-test was conducted using weakly
informative priors (Normal(0, 1) for mean difference). The posterior
distribution indicated that Group A scored higher than Group B
(M_diff = 6.8, 95% credible interval [3.2, 10.4]). The Bayes Factor
BF₁₀ = 45.3 provided very strong evidence for a difference between
groups, with a 99.8% posterior probability that Group A's mean exceeded
Group B's mean. Convergence diagnostics were satisfactory (all R̂ < 1.01,
ESS > 1000).
Bayesian Statistics
When to Use Bayesian Methods
Consider Bayesian approaches when:
- You have prior information to incorporate
- You want direct probability statements about hypotheses
- Sample size is small or planning sequential data collection
- You need to quantify evidence for the null hypothesis
- The model is complex (hierarchical, missing data)
See references/bayesian_statistics.md for comprehensive guidance on:
- Bayes' theorem and interpretation
- Prior specification (informative, weakly informative, non-informative)
- Bayesian hypothesis testing with Bayes Factors
- Credible intervals vs. confidence intervals
- Bayesian t-tests, ANOVA, regression, and hierarchical models
- Model convergence checking and posterior predictive checks
Key Advantages
- Intuitive interpretation: "Given the data, there is a 95% probability the parameter is in this interval"
- Evidence for null: Can quantify support for no effect
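- Flexible: Posteriors can be updated as data arrive, which makes sequential analysis natural; stopping rules and analysis choices should still be reported transparently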
- Uncertainty quantification: Full posterior distribution
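To illustrate what a direct probability statement looks like, here is a minimal conjugate example: a Normal prior on a mean, combined with data whose standard deviation is treated as known. The known-sigma assumption, the function name, and all numbers are illustrative simplifications; realistic models would use PyMC as shown earlier.

```python
from statistics import NormalDist


def posterior_mean_known_sigma(prior_mean, prior_sd, sample_mean, sigma, n):
    """Normal-Normal conjugate update for a mean (sigma assumed known)."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = n / sigma ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (
        prior_precision * prior_mean + data_precision * sample_mean
    )
    return NormalDist(post_mean, post_var ** 0.5)


# Hypothetical inputs: prior Normal(0, 10), observed mean 2.0 from n = 25, sigma = 5
posterior = posterior_mean_known_sigma(0, 10, 2.0, 5, 25)
prob_positive = 1 - posterior.cdf(0)
lower, upper = posterior.inv_cdf(0.025), posterior.inv_cdf(0.975)
print(f"Posterior mean = {posterior.mean:.2f}")
print(f"P(mean > 0 | data) = {prob_positive:.3f}")
print(f"95% credible interval [{lower:.2f}, {upper:.2f}]")
```

The posterior mean sits between the prior mean and the sample mean, weighted by their precisions, which is how prior information enters the estimate.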
Resources
This skill includes comprehensive reference materials:
References Directory
- test_selection_guide.md: Decision tree for choosing appropriate statistical tests
- assumptions_and_diagnostics.md: Detailed guidance on checking and handling assumption violations
- effect_sizes_and_power.md: Calculating, interpreting, and reporting effect sizes; conducting power analyses
- bayesian_statistics.md: Complete guide to Bayesian analysis methods
- reporting_standards.md: APA-style reporting guidelines with examples
Scripts Directory
- assumption_checks.py: Automated assumption checking with visualizations
  - comprehensive_assumption_check(): Complete workflow
  - check_normality(): Normality testing with Q-Q plots
  - check_homogeneity_of_variance(): Levene's test with box plots
  - check_linearity(): Regression linearity checks
  - detect_outliers(): IQR and z-score outlier detection
Best Practices
- Pre-register analyses when possible to distinguish confirmatory from exploratory
- Always check assumptions before interpreting results
- Report effect sizes with confidence intervals
- Report all planned analyses including non-significant results
- Distinguish statistical from practical significance
- Visualize data before and after analysis
- Check diagnostics for regression/ANOVA (residual plots, VIF, etc.)
- Conduct sensitivity analyses to assess robustness
- Share data and code for reproducibility
- Be transparent about violations, transformations, and decisions
Common Pitfalls to Avoid
- P-hacking: Don't test multiple ways until something is significant
- HARKing: Don't present exploratory findings as confirmatory
- Ignoring assumptions: Check them and report violations
- Confusing significance with importance: p < .05 ≠ meaningful effect
- Not reporting effect sizes: Essential for interpretation
- Cherry-picking results: Report all planned analyses
- Misinterpreting p-values: They're NOT probability that hypothesis is true
- Multiple comparisons: Correct for family-wise error when appropriate
- Ignoring missing data: Understand mechanism (MCAR, MAR, MNAR)
- Overinterpreting non-significant results: Absence of evidence ≠ evidence of absence
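For the multiple-comparisons pitfall, one widely used family-wise correction is the Holm-Bonferroni step-down procedure. The sketch below is a plain-Python illustration with a hypothetical function name; statsmodels offers the same adjustment through statsmodels.stats.multitest.multipletests with method='holm'.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the k-th smallest p by (m - k + 1), capped at 1
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]  # made-up p-values from three comparisons
print([round(p, 3) for p in holm_adjust(raw)])
```

Holm controls the same family-wise error rate as the plain Bonferroni correction but is never less powerful, which makes it a reasonable default when a handful of planned comparisons are tested.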
Getting Started Checklist
When beginning a statistical analysis:
- Define research question and hypotheses
- Determine appropriate statistical test (use test_selection_guide.md)
- Conduct power analysis to determine sample size
- Load and inspect data
- Check for missing data and outliers
- Verify assumptions using assumption_checks.py
- Run primary analysis
- Calculate effect sizes with confidence intervals
- Conduct post-hoc tests if needed (with corrections)
- Create visualizations
- Write results following reporting_standards.md
- Conduct sensitivity analyses
- Share data and code
Support and Further Reading
For questions about:
- Test selection: See references/test_selection_guide.md
- Assumptions: See references/assumptions_and_diagnostics.md
- Effect sizes: See references/effect_sizes_and_power.md
- Bayesian methods: See references/bayesian_statistics.md
- Reporting: See references/reporting_standards.md
Key textbooks:
- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences
- Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics
- Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models
- Kruschke, J. K. (2014). Doing Bayesian Data Analysis
Online resources:
- APA Style Guide: https://apastyle.apa.org/
- Statistical Consulting: Cross Validated (stats.stackexchange.com)
