
Statistical Hypothesis Testing

aj-geddes
Testing

About

This Claude Skill performs statistical hypothesis testing to validate data-driven decisions through significance testing. It supports common tests including t-tests, ANOVA, chi-square, and non-parametric alternatives for comparing groups and analyzing relationships. Developers can use it for A/B testing, hypothesis validation, and determining statistical significance in their data analysis workflows.

Quick Install

Claude Code

Plugin Command (Recommended)
/plugin add https://github.com/aj-geddes/useful-ai-prompts
Git Clone (Alternative)
git clone https://github.com/aj-geddes/useful-ai-prompts.git ~/.claude/skills/statistical-hypothesis-testing

Copy and paste either command into Claude Code to install this skill

Documentation

Statistical Hypothesis Testing

Hypothesis testing provides a framework for making data-driven decisions by testing whether observed differences are statistically significant or due to chance.

Testing Framework

  • Null Hypothesis (H0): No effect or difference exists
  • Alternative Hypothesis (H1): Effect or difference exists
  • Significance Level (α): Threshold for rejecting H0 (typically 0.05)
  • P-value: Probability of observing data at least as extreme as the sample, assuming H0 is true (see the worked example after this list)

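The worked example below ties these pieces together with a one-sample t-test. It is a minimal sketch that assumes SciPy is installed; the sample values and the hypothesized mean of 100 are made up purely for illustration.

import numpy as np
from scipy import stats

alpha = 0.05  # significance level
sample = np.array([101.2, 99.8, 102.5, 98.7, 103.1, 100.9, 97.6, 104.0])

# H0: population mean = 100, H1: population mean != 100
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")
if p_value < alpha:
    print("Reject H0: the mean differs from 100")
else:
    print("Fail to reject H0: no evidence the mean differs from 100")
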
Common Tests

  • T-test: Compare means between two groups
  • ANOVA: Compare means across multiple groups
  • Chi-square: Test independence of categorical variables
  • Mann-Whitney U: Non-parametric alternative to t-test
  • Kruskal-Wallis: Non-parametric alternative to ANOVA

Implementation with Python

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Sample data
group_a = np.random.normal(100, 15, 50)  # Mean=100, SD=15
group_b = np.random.normal(105, 15, 50)  # Mean=105, SD=15

# Test 1: Independent samples t-test
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"T-test: t={t_stat:.4f}, p-value={p_value:.4f}")
if p_value < 0.05:
    print("Reject null hypothesis: Groups are significantly different")
else:
    print("Fail to reject null hypothesis: No significant difference")

# Test 2: Paired t-test (same subjects, two conditions)
before = np.array([85, 90, 88, 92, 87, 89, 91, 86, 88, 90])
after = np.array([92, 95, 91, 98, 94, 96, 99, 93, 95, 97])

t_stat, p_value = stats.ttest_rel(before, after)
print(f"\nPaired t-test: t={t_stat:.4f}, p-value={p_value:.4f}")

# Test 3: One-way ANOVA (multiple groups)
group1 = np.random.normal(100, 10, 30)
group2 = np.random.normal(105, 10, 30)
group3 = np.random.normal(102, 10, 30)

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"\nANOVA: F={f_stat:.4f}, p-value={p_value:.4f}")

# Test 4: Chi-square test (categorical variables)
# Create contingency table
contingency = np.array([
    [50, 30],  # Control: success, failure
    [45, 35]   # Treatment: success, failure
])

chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
print(f"\nChi-square: χ²={chi2:.4f}, p-value={p_value:.4f}")

# Test 5: Mann-Whitney U test (non-parametric)
u_stat, p_value = stats.mannwhitneyu(group_a, group_b)
print(f"\nMann-Whitney U: U={u_stat:.4f}, p-value={p_value:.4f}")

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Distribution comparison
axes[0, 0].hist(group_a, alpha=0.5, label='Group A', bins=20)
axes[0, 0].hist(group_b, alpha=0.5, label='Group B', bins=20)
axes[0, 0].set_title('Group Distributions')
axes[0, 0].legend()

# Q-Q plot for normality
stats.probplot(group_a, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Q-Q Plot (Group A)')

# Before/After comparison
axes[1, 0].plot(before, 'o-', label='Before', alpha=0.7)
axes[1, 0].plot(after, 's-', label='After', alpha=0.7)
axes[1, 0].set_title('Paired Comparison')
axes[1, 0].legend()

# Effect size (Cohen's d)
cohens_d = (np.mean(group_a) - np.mean(group_b)) / np.sqrt(
    ((len(group_a)-1)*np.var(group_a, ddof=1) +
     (len(group_b)-1)*np.var(group_b, ddof=1)) /
    (len(group_a) + len(group_b) - 2)
)
axes[1, 1].text(0.5, 0.5, f"Cohen's d = {cohens_d:.4f}",
                ha='center', va='center', fontsize=14)
axes[1, 1].axis('off')

plt.tight_layout()
plt.show()

# Normality test (Shapiro-Wilk)
stat, p = stats.shapiro(group_a)
print(f"\nShapiro-Wilk normality test: W={stat:.4f}, p-value={p:.4f}")

# Effect size calculation
def calculate_effect_size(group1, group2):
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    pooled_std = np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2))
    cohens_d = (np.mean(group1) - np.mean(group2)) / pooled_std
    return cohens_d

effect_size = calculate_effect_size(group_a, group_b)
print(f"Effect size (Cohen's d): {effect_size:.4f}")

# Confidence intervals
from scipy.stats import t as t_dist

def calculate_ci(data, confidence=0.95):
    n = len(data)
    mean = np.mean(data)
    se = np.std(data, ddof=1) / np.sqrt(n)
    margin = t_dist.ppf((1 + confidence) / 2, n - 1) * se
    return mean - margin, mean + margin

ci = calculate_ci(group_a)
print(f"95% CI for Group A: ({ci[0]:.2f}, {ci[1]:.2f})")

# Additional tests and visualizations

# Test 6: Levene's test for equal variances
stat_levene, p_levene = stats.levene(group_a, group_b)
print(f"\nLevene's Test for Equal Variance:")
print(f"Statistic: {stat_levene:.4f}, P-value: {p_levene:.4f}")

# Test 7: Welch's t-test (doesn't assume equal variance)
t_stat_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"\nWelch's t-test (unequal variance):")
print(f"t-stat: {t_stat_welch:.4f}, p-value: {p_welch:.4f}")

# Power analysis (approximate post-hoc power for a two-sample t-test)
def calculate_power(effect_size, sample_size, alpha=0.05):
    df = 2 * sample_size - 2
    t_critical = stats.t.ppf(1 - alpha / 2, df)
    ncp = effect_size * np.sqrt(sample_size / 2)  # noncentrality parameter
    # Probability of exceeding the upper critical value under the noncentral t
    return 1 - stats.nct.cdf(t_critical, df, ncp)

power = calculate_power(abs(effect_size), len(group_a))
print(f"\nStatistical Power: {power:.2%}")

# Bootstrap confidence intervals
def bootstrap_ci(data, n_bootstrap=10000, ci=95):
    bootstrap_means = []
    for _ in range(n_bootstrap):
        sample = np.random.choice(data, size=len(data), replace=True)
        bootstrap_means.append(np.mean(sample))
    lower = np.percentile(bootstrap_means, (100-ci)/2)
    upper = np.percentile(bootstrap_means, ci + (100-ci)/2)
    return lower, upper

boot_ci = bootstrap_ci(group_a)
print(f"\nBootstrap 95% CI for Group A: ({boot_ci[0]:.2f}, {boot_ci[1]:.2f})")

# Multiple testing correction (Bonferroni)
num_tests = 4
bonferroni_alpha = 0.05 / num_tests
print(f"\nBonferroni Corrected Alpha: {bonferroni_alpha:.4f}")
print(f"Use this threshold for {num_tests} tests")

# Test 8: Kruskal-Wallis test (non-parametric ANOVA)
h_stat, p_kw = stats.kruskal(group1, group2, group3)
print(f"\nKruskal-Wallis Test (non-parametric ANOVA):")
print(f"H-statistic: {h_stat:.4f}, p-value: {p_kw:.4f}")

# Effect size for ANOVA
f_stat, p_anova = stats.f_oneway(group1, group2, group3)
# Calculate eta-squared (proportion of total variance explained by group membership)
groups = [group1, group2, group3]
all_values = np.concatenate(groups)
grand_mean = np.mean(all_values)
ss_between = sum(len(g) * (np.mean(g) - grand_mean)**2 for g in groups)
ss_total = np.sum((all_values - grand_mean)**2)
eta_squared = ss_between / ss_total
print(f"\nEffect Size (Eta-squared): {eta_squared:.4f}")

Interpretation Guidelines

  • p < 0.05: Statistically significant (reject H0)
  • p ≥ 0.05: Not statistically significant (fail to reject H0)
  • Effect size: Magnitude of the difference (small/medium/large; see the sketch after this list)
  • Confidence intervals: Range of plausible parameter values

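Cohen's conventional cut-offs label |d| of roughly 0.2 as small, 0.5 as medium, and 0.8 as large. A minimal helper for labeling the Cohen's d values computed in the implementation above could look like this (the thresholds are standard conventions; the function name is just for illustration):

def interpret_cohens_d(d):
    """Label the magnitude of Cohen's d using the conventional thresholds."""
    d = abs(d)
    if d < 0.2:
        return "negligible"
    elif d < 0.5:
        return "small"
    elif d < 0.8:
        return "medium"
    return "large"

print(f"Effect size is {interpret_cohens_d(effect_size)}")
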
Assumptions Checklist

  • Independence of observations
  • Normality of distributions (parametric tests; see the selection sketch after this checklist)
  • Homogeneity of variance
  • Appropriate sample size
  • Random sampling

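One way to operationalize the normality and equal-variance checks is to let them drive the choice of test. The sketch below reuses group_a and group_b from the implementation above and treats 0.05 as the threshold for the diagnostic tests, which is a convention rather than a rule:

def compare_two_groups(a, b, alpha=0.05):
    """Choose between Student's t, Welch's t, and Mann-Whitney U based on assumption checks."""
    _, p_norm_a = stats.shapiro(a)
    _, p_norm_b = stats.shapiro(b)
    if p_norm_a < alpha or p_norm_b < alpha:
        # Normality rejected for at least one group: fall back to a non-parametric test
        return "Mann-Whitney U", stats.mannwhitneyu(a, b)
    _, p_var = stats.levene(a, b)
    if p_var < alpha:
        # Equal-variance assumption rejected: use Welch's t-test
        return "Welch's t-test", stats.ttest_ind(a, b, equal_var=False)
    return "Student's t-test", stats.ttest_ind(a, b)

test_name, result = compare_two_groups(group_a, group_b)
print(f"Selected test: {test_name}")
print(result)
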
Common Pitfalls

  • Misinterpreting p-values
  • Multiple testing without correction (a correction sketch follows this list)
  • Ignoring effect sizes
  • Violating test assumptions
  • Confusing correlation with causation

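The multiple-testing pitfall is easiest to avoid with an adjustment routine rather than a hand-calculated threshold. Below is a short sketch using statsmodels; it assumes statsmodels is installed, and the p-values are invented purely for illustration:

from statsmodels.stats.multitest import multipletests

p_values = [0.01, 0.04, 0.03, 0.20]  # hypothetical p-values from four separate tests
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
print(reject)       # which null hypotheses can be rejected after correction
print(p_adjusted)   # Bonferroni-adjusted p-values
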
Deliverables

  • Test results with p-values and test statistics
  • Effect size calculations
  • Visualization of distributions
  • Confidence intervals
  • Interpretation and business implications

GitHub Repository

aj-geddes/useful-ai-prompts
Path: skills/statistical-hypothesis-testing

Related Skills

content-collections

Meta

This skill provides a production-tested setup for Content Collections, a TypeScript-first tool that transforms Markdown/MDX files into type-safe data collections with Zod validation. Use it when building blogs, documentation sites, or content-heavy Vite + React applications to ensure type safety and automatic content validation. It covers everything from Vite plugin configuration and MDX compilation to deployment optimization and schema validation.


evaluating-llms-harness

Testing

This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.


webapp-testing

Testing

This Claude Skill provides a Playwright-based toolkit for testing local web applications through Python scripts. It enables frontend verification, UI debugging, screenshot capture, and log viewing while managing server lifecycles. Use it for browser automation tasks but run scripts directly rather than reading their source code to avoid context pollution.


finishing-a-development-branch

Testing

This skill helps developers complete finished work by verifying tests pass and then presenting structured integration options. It guides the workflow for merging, creating PRs, or cleaning up branches after implementation is done. Use it when your code is ready and tested to systematically finalize the development process.
