Backtesting Analysis
About
This skill helps developers interpret backtesting results and detect overfitting in trading strategies. It provides guidance on key performance metrics such as the Sharpe ratio, along with strategy-specific performance expectations. Use it when evaluating backtest outcomes or comparing multiple strategies during Phase 3 analysis.
Quick Install
Claude Code
Recommended:
/plugin add https://github.com/derekcrosslu/CLAUDE_CODE_EXPLORE
Manual install:
git clone https://github.com/derekcrosslu/CLAUDE_CODE_EXPLORE.git ~/".claude/skills/Backtesting Analysis"
Copy and paste either command into Claude Code to install this skill.
Documentation
Backtesting Analysis Skill
Purpose: Interpret backtest results, understand performance metrics, and detect overfitting or unreliable strategies.
Progressive Disclosure: This primer contains essentials only. Full details available via docs command.
When to Use This Skill
Load when:
- Evaluating backtest results (Phase 3)
- Detecting potential overfitting
- Understanding strategy-specific performance expectations
- Comparing multiple strategies or explaining results
Quick Reference: Key Metrics
Sharpe Ratio (Primary Metric)
Formula: (Return - Risk-Free Rate) / Volatility
| Sharpe | Quality | Action |
|---|---|---|
| < 0.5 | Poor | Abandon |
| 0.5 - 0.7 | Marginal | Consider optimization |
| 0.7 - 1.0 | Acceptable | Optimize |
| 1.0 - 1.5 | Good | Production-ready |
| 1.5 - 2.0 | Very Good | Validate thoroughly |
| > 3.0 | SUSPICIOUS | Likely overfitting |
Key Insight: QuantConnect reports annual Sharpe. Sharpe > 1.0 is production-ready for most strategies.
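For intuition, here is a minimal sketch of annualizing a Sharpe ratio from daily returns. The 252-trading-day factor and a flat risk-free rate are assumptions, and the function name is illustrative; QuantConnect computes its own figure internally.

```python
import numpy as np

def annualized_sharpe(daily_returns, risk_free_annual=0.0, periods_per_year=252):
    """Annualized Sharpe: mean excess return / volatility, scaled by sqrt(252)."""
    r = np.asarray(daily_returns, dtype=float)
    excess = r - risk_free_annual / periods_per_year  # de-annualize the risk-free rate
    if excess.std(ddof=1) == 0:
        return float("nan")
    return excess.mean() / excess.std(ddof=1) * np.sqrt(periods_per_year)

# Example: ~0.05% mean daily return with 1% daily vol -> Sharpe near 0.8
rng = np.random.default_rng(42)
print(round(annualized_sharpe(rng.normal(0.0005, 0.01, 252 * 3)), 2))
```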
Maximum Drawdown
Formula: (Peak - Trough) / Peak, reported as a positive percentage
| Drawdown | Quality | Action |
|---|---|---|
| < 20% | Excellent | Low risk |
| 20% - 30% | Good | Acceptable for live trading |
| 30% - 40% | Concerning | Needs strong Sharpe to justify |
| > 40% | Too High | Unacceptable for most traders |
Key Insight: Drawdowns > 30% are hard to tolerate psychologically. Consider: "Could I stomach this loss in real money?"
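A minimal sketch of extracting maximum drawdown from an equity curve, assuming the curve is available as a simple sequence of portfolio values (the function name is illustrative):

```python
import numpy as np

def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, returned as a positive fraction."""
    equity = np.asarray(equity_curve, dtype=float)
    running_peak = np.maximum.accumulate(equity)  # highest value seen so far
    drawdowns = (running_peak - equity) / running_peak
    return drawdowns.max()

# Peak at 120, trough at 90 -> (120 - 90) / 120 = 25%
print(f"{max_drawdown([100, 110, 120, 95, 90, 105]):.0%}")  # 25%
```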
Total Trades (Statistical Significance)
| Trade Count | Reliability | Decision Impact |
|---|---|---|
| < 20 | Unreliable | Abandon or escalate |
| 20 - 30 | Low | Minimum viable |
| 30 - 50 | Moderate | Acceptable |
| 50 - 100 | Good | Strong confidence |
| 100+ | Excellent | Highly reliable |
Key Insight: Need 30+ trades for basic significance, 100+ for high confidence. Few trades = unreliable metrics.
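Why 30+? The uncertainty in an estimated win rate shrinks only with the square root of the trade count. A rough sketch using the normal approximation to the binomial (an assumption that also ignores dependence between trades):

```python
import math

def win_rate_confidence_interval(wins, total_trades, z=1.96):
    """Approximate 95% CI for the true win rate (normal approximation)."""
    p = wins / total_trades
    half_width = z * math.sqrt(p * (1 - p) / total_trades)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# 12 wins in 20 trades: the interval is so wide the edge could easily be zero
print([f"{x:.0%}" for x in win_rate_confidence_interval(12, 20)])   # ~['39%', '81%']
# 60 wins in 100 trades: same 60% win rate, much tighter interval
print([f"{x:.0%}" for x in win_rate_confidence_interval(60, 100)])  # ~['50%', '70%']
```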
Win Rate
| Win Rate | Quality | Interpretation |
|---|---|---|
| < 40% | Low | Needs large winners (trend following) |
| 40% - 55% | Average | Typical for most strategies |
| 55% - 65% | Good | Strong edge |
| > 75% | SUSPICIOUS | Likely overfitting |
Key Insight: Win rate alone is misleading. Must consider profit factor and average win/loss ratio.
Profit Factor
Formula: Gross Profit / Gross Loss
| Profit Factor | Quality | Interpretation |
|---|---|---|
| < 1.3 | Marginal | Transaction costs may kill it |
| 1.3 - 1.5 | Acceptable | Decent after costs |
| 1.5 - 2.0 | Good | Strong profitability |
| > 3.0 | Exceptional | Outstanding (verify no overfitting) |
Key Insight: Minimum 1.5 for live trading to cover slippage and commissions.
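Win rate, profit factor, and the average win/loss ratio all derive from the same trade list, so it helps to compute and read them together. A minimal sketch, assuming per-trade P&L values as plain numbers (the function name is illustrative):

```python
def trade_stats(pnls):
    """Win rate, profit factor, and avg win / avg loss from per-trade P&L."""
    wins = [p for p in pnls if p > 0]
    losses = [-p for p in pnls if p < 0]  # magnitudes of losing trades
    win_rate = len(wins) / len(pnls)
    profit_factor = sum(wins) / sum(losses) if losses else float("inf")
    avg_win_loss = ((sum(wins) / len(wins)) / (sum(losses) / len(losses))
                    if wins and losses else float("inf"))
    return {"win_rate": win_rate, "profit_factor": profit_factor,
            "avg_win_loss": avg_win_loss}

# A 40% win rate is fine when winners outsize losers (trend-following profile)
print(trade_stats([300, -100, -120, 250, -90]))
# -> win_rate 0.4, profit_factor ~1.77, avg_win_loss ~2.66
```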
Overfitting Detection (Red Flags)
- Too Perfect Sharpe (> 3.0) → ESCALATE_TO_HUMAN
- Too High Win Rate (> 75%) → Check for look-ahead bias
- Too Few Trades (< 20) → Unreliable metrics
- Excessive Optimization Improvement (> 30%) → Lucky parameters
- Severe Out-of-Sample Degradation (> 40%) → ABANDON_HYPOTHESIS
- Equity Curve Too Smooth → Check unrealistic assumptions
- Works Only in One Market Regime → Not robust
Remember: If it looks too good to be true, it probably is.
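These red flags reduce to simple threshold checks. A hedged sketch of such a screen, using the thresholds listed above (the function and argument names are illustrative, not part of any real API):

```python
def overfitting_red_flags(sharpe, win_rate, total_trades,
                          optimization_improvement=None, oos_degradation=None):
    """Return the list of red flags triggered by this primer's thresholds."""
    flags = []
    if sharpe > 3.0:
        flags.append("Sharpe > 3.0 -> ESCALATE_TO_HUMAN")
    if win_rate > 0.75:
        flags.append("Win rate > 75% -> check for look-ahead bias")
    if total_trades < 20:
        flags.append("Fewer than 20 trades -> unreliable metrics")
    if optimization_improvement is not None and optimization_improvement > 0.30:
        flags.append("Optimization improvement > 30% -> likely lucky parameters")
    if oos_degradation is not None and oos_degradation > 0.40:
        flags.append("Out-of-sample degradation > 40% -> ABANDON_HYPOTHESIS")
    return flags

# The "too perfect" example below: Sharpe and win-rate flags both fire
print(overfitting_red_flags(sharpe=4.2, win_rate=0.88, total_trades=25))
```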
Strategy-Type Expectations
Momentum
- Sharpe: 0.8 - 1.5 | Drawdown: 20-35% | Win Rate: 40-55%
Mean Reversion
- Sharpe: 0.7 - 1.3 | Drawdown: 15-30% | Win Rate: 55-70%
Trend Following
- Sharpe: 0.5 - 1.0 | Drawdown: 25-40% | Win Rate: 30-50%
Breakout
- Sharpe: 0.6 - 1.2 | Drawdown: 20-35% | Win Rate: 40-55%
Use these to calibrate expectations - different strategies have different profiles.
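One way to apply these profiles is to encode the ranges as lookup data and flag any metric that falls outside the expected band for its strategy type. A minimal sketch (the dictionary simply transcribes the ranges above; the names are illustrative):

```python
STRATEGY_PROFILES = {
    # strategy: {"sharpe": (lo, hi), "drawdown": (lo, hi), "win_rate": (lo, hi)}
    "momentum":        {"sharpe": (0.8, 1.5), "drawdown": (0.20, 0.35), "win_rate": (0.40, 0.55)},
    "mean_reversion":  {"sharpe": (0.7, 1.3), "drawdown": (0.15, 0.30), "win_rate": (0.55, 0.70)},
    "trend_following": {"sharpe": (0.5, 1.0), "drawdown": (0.25, 0.40), "win_rate": (0.30, 0.50)},
    "breakout":        {"sharpe": (0.6, 1.2), "drawdown": (0.20, 0.35), "win_rate": (0.40, 0.55)},
}

def out_of_profile(strategy, **metrics):
    """Return the metrics falling outside the expected range for the strategy type."""
    profile = STRATEGY_PROFILES[strategy]
    return {name: value for name, value in metrics.items()
            if not profile[name][0] <= value <= profile[name][1]}

# A 70% win rate on a momentum strategy is out of profile -> investigate
print(out_of_profile("momentum", sharpe=1.1, drawdown=0.25, win_rate=0.70))
```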
Example Decisions
GOOD (Optimization Worthy)
Sharpe: 0.85, Drawdown: 22%, Trades: 67, Win Rate: 42%, PF: 1.8
→ PROCEED_TO_OPTIMIZATION (decent baseline, worth improving)
EXCELLENT (Production Ready)
Sharpe: 1.35, Drawdown: 18%, Trades: 142, Win Rate: 53%, PF: 2.1
→ PROCEED_TO_VALIDATION (already strong, skip optimization)
SUSPICIOUS (Overfitting)
Sharpe: 4.2, Drawdown: 5%, Trades: 25, Win Rate: 88%, PF: 5.8
→ ESCALATE_TO_HUMAN (too perfect, likely look-ahead bias or bug)
POOR (Abandon)
Sharpe: 0.3, Drawdown: 38%, Trades: 89, Win Rate: 35%, PF: 1.1
→ ABANDON_HYPOTHESIS (poor risk-adjusted returns)
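These four examples follow a consistent pattern. As a sketch only (the authoritative logic lives in SCRIPTS/decision_logic.py and may differ), the mapping from headline metrics to a decision might look like:

```python
def backtest_decision(sharpe, drawdown, total_trades, win_rate):
    """Map headline metrics to a decision, following this primer's thresholds."""
    if sharpe > 3.0 or win_rate > 0.75 or total_trades < 20:
        return "ESCALATE_TO_HUMAN"      # too perfect or statistically weak
    if sharpe < 0.5 or drawdown > 0.40:
        return "ABANDON_HYPOTHESIS"     # poor risk-adjusted returns or too risky
    if sharpe >= 1.0 and drawdown <= 0.30:
        return "PROCEED_TO_VALIDATION"  # already production-grade
    return "PROCEED_TO_OPTIMIZATION"    # decent baseline, worth improving

# Reproduces the four example decisions above
for metrics in [(0.85, 0.22, 67, 0.42), (1.35, 0.18, 142, 0.53),
                (4.2, 0.05, 25, 0.88), (0.3, 0.38, 89, 0.35)]:
    print(backtest_decision(*metrics))
```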
Common Confusion Points
Q: "Strategy made 200% returns, but Sharpe is only 0.6 - is this good?" A: No. We prioritize risk-adjusted returns (Sharpe), not raw returns. High returns with high volatility = bad Sharpe = risky.
Q: "Sharpe 2.5 with 15 trades - should I proceed?" A: ESCALATE_TO_HUMAN. Too few trades (<20) for statistical significance. High Sharpe with few trades = luck, not skill.
Q: "Optimization improved Sharpe from 0.8 to 1.5 (87% improvement) - is this good?" A: ESCALATE_TO_HUMAN. 87% > 30% threshold = likely overfitting to in-sample period.
Q: "Win rate is 78%, Sharpe is 1.2 - why is this flagged?" A: Win rate > 75% is an overfitting signal. Real strategies rarely achieve such high win rates.
Key Principles
- Sharpe ratio is king - Primary metric for risk-adjusted returns
- Trade count matters - Need 30+ for reliability, 100+ for confidence
- Beware overfitting - Too perfect results are suspicious
- Context by strategy type - Different strategies have different expectations
- Risk-adjusted, not raw returns - High returns with high volatility = bad
Reference Documentation (Progressive Disclosure)
Need detailed analysis? All reference documentation is accessible via the --help flag:
python SCRIPTS/backtesting_analysis.py --help
This is the only entry point to the complete reference documentation.
Topics covered in --help:
- Sharpe Ratio Deep Dive
- Maximum Drawdown Analysis
- Trade Count Statistical Significance
- Win Rate Analysis
- Profit Factor Analysis
- Complete Overfitting Detection Guide
- Strategy-Type Profiles (Momentum, Mean Reversion, Trend Following, Breakout)
- 10+ Annotated Example Backtests
- Common Confusion Points
The primer above covers 90% of use cases. Use --help for edge cases and detailed analysis.
Integration with Decision Framework
This skill complements the decision-framework skill:
- decision-framework: Provides thresholds and decision logic
- backtesting-analysis: Provides interpretation and context
Workflow:
- Load decision-framework to apply thresholds
- Load backtesting-analysis to understand what metrics mean
- Combine insights to make informed decisions
Related Files
- .claude/skills/decision-framework/skill.md - Decision thresholds
- SCRIPTS/decision_logic.py - Decision implementation
- .claude/commands/qc-backtest.md - Backtest execution
Version: 2.0.0 (Progressive Disclosure)
Last Updated: November 13, 2025
Lines: ~200 (was 555)
Context Reduction: 64%
Related Skills
evaluating-llms-harness (Testing)
This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends, including HuggingFace and vLLM models.
langchain (Meta)
LangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers, offers 500+ integrations, and includes features like tool calling and memory management. Use it for rapid prototyping and for deploying production systems like chatbots, autonomous agents, and question-answering services.
Algorithmic Art Generation (Meta)
This skill helps developers create algorithmic art using p5.js, focusing on generative art, computational aesthetics, and interactive visualizations. It automatically activates for topics like "generative art" or "p5.js visualization" and guides you through creating unique algorithms with features like seeded randomness, flow fields, and particle systems. Use it when you need to build reproducible, code-driven artistic patterns.
webapp-testing (Testing)
This Claude Skill provides a Playwright-based toolkit for testing local web applications through Python scripts. It enables frontend verification, UI debugging, screenshot capture, and log viewing while managing server lifecycles. Use it for browser automation tasks, but run scripts directly rather than reading their source code to avoid context pollution.
