Backtesting Analysis
About
This skill helps developers interpret backtesting results and detect overfitting in trading strategies. It provides guidance on key performance metrics such as the Sharpe ratio, along with strategy-specific performance expectations. Use it when evaluating backtest outcomes or comparing multiple strategies during Phase 3 analysis.
Quick Install
Claude Code
Recommended:
/plugin add https://github.com/derekcrosslu/CLAUDE_CODE_EXPLORE
Manual install:
git clone https://github.com/derekcrosslu/CLAUDE_CODE_EXPLORE.git ~/".claude/skills/Backtesting Analysis"
Copy and paste either command into Claude Code to install this skill.
Documentation
Backtesting Analysis Skill
Purpose: Interpret backtest results, understand performance metrics, and detect overfitting or unreliable strategies.
Progressive Disclosure: This primer contains essentials only. Full details available via docs command.
When to Use This Skill
Load when:
- Evaluating backtest results (Phase 3)
- Detecting potential overfitting
- Understanding strategy-specific performance expectations
- Comparing multiple strategies or explaining results
Quick Reference: Key Metrics
Sharpe Ratio (Primary Metric)
Formula: (Return - Risk-Free Rate) / Volatility
| Sharpe | Quality | Action |
|---|---|---|
| < 0.5 | Poor | Abandon |
| 0.5 - 0.7 | Marginal | Consider optimization |
| 0.7 - 1.0 | Acceptable | Optimize |
| 1.0 - 1.5 | Good | Production-ready |
| 1.5 - 2.0 | Very Good | Validate thoroughly |
| > 3.0 | SUSPICIOUS | Likely overfitting |
Key Insight: QuantConnect reports annual Sharpe. Sharpe > 1.0 is production-ready for most strategies.
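For intuition, here is a minimal sketch of annualizing a Sharpe ratio from daily returns. The 252-trading-day factor and a flat risk-free rate are assumptions, and the function name is illustrative; QuantConnect computes its own figure internally.

```python
import numpy as np

def annualized_sharpe(daily_returns, risk_free_annual=0.0, periods_per_year=252):
    """Annualized Sharpe: mean excess return / volatility, scaled by sqrt(252)."""
    r = np.asarray(daily_returns, dtype=float)
    excess = r - risk_free_annual / periods_per_year  # de-annualize the risk-free rate
    if excess.std(ddof=1) == 0:
        return float("nan")
    return excess.mean() / excess.std(ddof=1) * np.sqrt(periods_per_year)

# Example: ~0.05% mean daily return with 1% daily vol -> Sharpe near 0.8
rng = np.random.default_rng(42)
print(round(annualized_sharpe(rng.normal(0.0005, 0.01, 252 * 3)), 2))
```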
Maximum Drawdown
Formula: (Peak - Trough) / Peak, reported as a positive percentage
| Drawdown | Quality | Action |
|---|---|---|
| < 20% | Excellent | Low risk |
| 20% - 30% | Good | Acceptable for live trading |
| 30% - 40% | Concerning | Needs strong Sharpe to justify |
| > 40% | Too High | Unacceptable for most traders |
Key Insight: Drawdowns > 30% are hard to tolerate psychologically. Consider: "Could I stomach this loss in real money?"
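A minimal sketch of extracting maximum drawdown from an equity curve, assuming the curve is available as a simple sequence of portfolio values (the function name is illustrative):

```python
import numpy as np

def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, returned as a positive fraction."""
    equity = np.asarray(equity_curve, dtype=float)
    running_peak = np.maximum.accumulate(equity)  # highest value seen so far
    drawdowns = (running_peak - equity) / running_peak
    return drawdowns.max()

# Peak at 120, trough at 90 -> (120 - 90) / 120 = 25%
print(f"{max_drawdown([100, 110, 120, 95, 90, 105]):.0%}")  # 25%
```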
Total Trades (Statistical Significance)
| Trade Count | Reliability | Decision Impact |
|---|---|---|
| < 20 | Unreliable | Abandon or escalate |
| 20 - 30 | Low | Minimum viable |
| 30 - 50 | Moderate | Acceptable |
| 50 - 100 | Good | Strong confidence |
| 100+ | Excellent | Highly reliable |
Key Insight: Need 30+ trades for basic significance, 100+ for high confidence. Few trades = unreliable metrics.
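Why 30+? The uncertainty in an estimated win rate shrinks only with the square root of the trade count. A rough sketch using the normal approximation to the binomial (an assumption that also ignores dependence between trades):

```python
import math

def win_rate_confidence_interval(wins, total_trades, z=1.96):
    """Approximate 95% CI for the true win rate (normal approximation)."""
    p = wins / total_trades
    half_width = z * math.sqrt(p * (1 - p) / total_trades)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# 12 wins in 20 trades: the interval is so wide the edge could easily be zero
print([f"{x:.0%}" for x in win_rate_confidence_interval(12, 20)])   # ~['39%', '81%']
# 60 wins in 100 trades: same 60% win rate, much tighter interval
print([f"{x:.0%}" for x in win_rate_confidence_interval(60, 100)])  # ~['50%', '70%']
```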
Win Rate
| Win Rate | Quality | Interpretation |
|---|---|---|
| < 40% | Low | Needs large winners (trend following) |
| 40% - 55% | Average | Typical for most strategies |
| 55% - 65% | Good | Strong edge |
| > 75% | SUSPICIOUS | Likely overfitting |
Key Insight: Win rate alone is misleading. Must consider profit factor and average win/loss ratio.
Profit Factor
Formula: Gross Profit / Gross Loss
| Profit Factor | Quality | Interpretation |
|---|---|---|
| < 1.3 | Marginal | Transaction costs may kill it |
| 1.3 - 1.5 | Acceptable | Decent after costs |
| 1.5 - 2.0 | Good | Strong profitability |
| > 3.0 | Exceptional | Outstanding (verify no overfitting) |
Key Insight: Minimum 1.5 for live trading to cover slippage and commissions.
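Win rate, profit factor, and the average win/loss ratio all derive from the same trade list, so it helps to compute and read them together. A minimal sketch, assuming per-trade P&L values as plain numbers (the function name is illustrative):

```python
def trade_stats(pnls):
    """Win rate, profit factor, and avg win / avg loss from per-trade P&L."""
    wins = [p for p in pnls if p > 0]
    losses = [-p for p in pnls if p < 0]  # magnitudes of losing trades
    win_rate = len(wins) / len(pnls)
    profit_factor = sum(wins) / sum(losses) if losses else float("inf")
    avg_win_loss = ((sum(wins) / len(wins)) / (sum(losses) / len(losses))
                    if wins and losses else float("inf"))
    return {"win_rate": win_rate, "profit_factor": profit_factor,
            "avg_win_loss": avg_win_loss}

# A 40% win rate is fine when winners outsize losers (trend-following profile)
print(trade_stats([300, -100, -120, 250, -90]))
# -> win_rate 0.4, profit_factor ~1.77, avg_win_loss ~2.66
```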
Overfitting Detection (Red Flags)
- Too Perfect Sharpe (> 3.0) → ESCALATE_TO_HUMAN
- Too High Win Rate (> 75%) → Check for look-ahead bias
- Too Few Trades (< 20) → Unreliable metrics
- Excessive Optimization Improvement (> 30%) → Lucky parameters
- Severe Out-of-Sample Degradation (> 40%) → ABANDON_HYPOTHESIS
- Equity Curve Too Smooth → Check unrealistic assumptions
- Works Only in One Market Regime → Not robust
Remember: If it looks too good to be true, it probably is.
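These red flags reduce to simple threshold checks. A hedged sketch of such a screen, using the thresholds listed above (the function and argument names are illustrative, not part of any real API):

```python
def overfitting_red_flags(sharpe, win_rate, total_trades,
                          optimization_improvement=None, oos_degradation=None):
    """Return the list of red flags triggered by this primer's thresholds."""
    flags = []
    if sharpe > 3.0:
        flags.append("Sharpe > 3.0 -> ESCALATE_TO_HUMAN")
    if win_rate > 0.75:
        flags.append("Win rate > 75% -> check for look-ahead bias")
    if total_trades < 20:
        flags.append("Fewer than 20 trades -> unreliable metrics")
    if optimization_improvement is not None and optimization_improvement > 0.30:
        flags.append("Optimization improvement > 30% -> likely lucky parameters")
    if oos_degradation is not None and oos_degradation > 0.40:
        flags.append("Out-of-sample degradation > 40% -> ABANDON_HYPOTHESIS")
    return flags

# The "too perfect" example below: Sharpe and win-rate flags both fire
print(overfitting_red_flags(sharpe=4.2, win_rate=0.88, total_trades=25))
```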
Strategy-Type Expectations
Momentum
- Sharpe: 0.8 - 1.5 | Drawdown: 20-35% | Win Rate: 40-55%
Mean Reversion
- Sharpe: 0.7 - 1.3 | Drawdown: 15-30% | Win Rate: 55-70%
Trend Following
- Sharpe: 0.5 - 1.0 | Drawdown: 25-40% | Win Rate: 30-50%
Breakout
- Sharpe: 0.6 - 1.2 | Drawdown: 20-35% | Win Rate: 40-55%
Use these to calibrate expectations - different strategies have different profiles.
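One way to apply these profiles is to encode the ranges as lookup data and flag any metric that falls outside the expected band for its strategy type. A minimal sketch (the dictionary simply transcribes the ranges above; the names are illustrative):

```python
STRATEGY_PROFILES = {
    # strategy: {"sharpe": (lo, hi), "drawdown": (lo, hi), "win_rate": (lo, hi)}
    "momentum":        {"sharpe": (0.8, 1.5), "drawdown": (0.20, 0.35), "win_rate": (0.40, 0.55)},
    "mean_reversion":  {"sharpe": (0.7, 1.3), "drawdown": (0.15, 0.30), "win_rate": (0.55, 0.70)},
    "trend_following": {"sharpe": (0.5, 1.0), "drawdown": (0.25, 0.40), "win_rate": (0.30, 0.50)},
    "breakout":        {"sharpe": (0.6, 1.2), "drawdown": (0.20, 0.35), "win_rate": (0.40, 0.55)},
}

def out_of_profile(strategy, **metrics):
    """Return the metrics falling outside the expected range for the strategy type."""
    profile = STRATEGY_PROFILES[strategy]
    return {name: value for name, value in metrics.items()
            if not profile[name][0] <= value <= profile[name][1]}

# A 70% win rate on a momentum strategy is out of profile -> investigate
print(out_of_profile("momentum", sharpe=1.1, drawdown=0.25, win_rate=0.70))
```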
Example Decisions
GOOD (Optimization Worthy)
Sharpe: 0.85, Drawdown: 22%, Trades: 67, Win Rate: 42%, PF: 1.8
→ PROCEED_TO_OPTIMIZATION (decent baseline, worth improving)
EXCELLENT (Production Ready)
Sharpe: 1.35, Drawdown: 18%, Trades: 142, Win Rate: 53%, PF: 2.1
→ PROCEED_TO_VALIDATION (already strong, skip optimization)
SUSPICIOUS (Overfitting)
Sharpe: 4.2, Drawdown: 5%, Trades: 25, Win Rate: 88%, PF: 5.8
→ ESCALATE_TO_HUMAN (too perfect, likely look-ahead bias or bug)
POOR (Abandon)
Sharpe: 0.3, Drawdown: 38%, Trades: 89, Win Rate: 35%, PF: 1.1
→ ABANDON_HYPOTHESIS (poor risk-adjusted returns)
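These four examples follow a consistent pattern. As a sketch only (the authoritative logic lives in SCRIPTS/decision_logic.py and may differ), the mapping from headline metrics to a decision might look like:

```python
def backtest_decision(sharpe, drawdown, total_trades, win_rate):
    """Map headline metrics to a decision, following this primer's thresholds."""
    if sharpe > 3.0 or win_rate > 0.75 or total_trades < 20:
        return "ESCALATE_TO_HUMAN"      # too perfect or statistically weak
    if sharpe < 0.5 or drawdown > 0.40:
        return "ABANDON_HYPOTHESIS"     # poor risk-adjusted returns or too risky
    if sharpe >= 1.0 and drawdown <= 0.30:
        return "PROCEED_TO_VALIDATION"  # already production-grade
    return "PROCEED_TO_OPTIMIZATION"    # decent baseline, worth improving

# Reproduces the four example decisions above
for metrics in [(0.85, 0.22, 67, 0.42), (1.35, 0.18, 142, 0.53),
                (4.2, 0.05, 25, 0.88), (0.3, 0.38, 89, 0.35)]:
    print(backtest_decision(*metrics))
```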
Common Confusion Points
Q: "Strategy made 200% returns, but Sharpe is only 0.6 - is this good?" A: No. We prioritize risk-adjusted returns (Sharpe), not raw returns. High returns with high volatility = bad Sharpe = risky.
Q: "Sharpe 2.5 with 15 trades - should I proceed?" A: ESCALATE_TO_HUMAN. Too few trades (<20) for statistical significance. High Sharpe with few trades = luck, not skill.
Q: "Optimization improved Sharpe from 0.8 to 1.5 (87% improvement) - is this good?" A: ESCALATE_TO_HUMAN. 87% > 30% threshold = likely overfitting to in-sample period.
Q: "Win rate is 78%, Sharpe is 1.2 - why is this flagged?" A: Win rate > 75% is an overfitting signal. Real strategies rarely achieve such high win rates.
Key Principles
- Sharpe ratio is king - Primary metric for risk-adjusted returns
- Trade count matters - Need 30+ for reliability, 100+ for confidence
- Beware overfitting - Too perfect results are suspicious
- Context by strategy type - Different strategies have different expectations
- Risk-adjusted, not raw returns - High returns with high volatility = bad
Reference Documentation (Progressive Disclosure)
Need detailed analysis? All reference documentation is accessible via the --help flag:
python SCRIPTS/backtesting_analysis.py --help
This is the only entry point to the complete reference documentation.
Topics covered in --help:
- Sharpe Ratio Deep Dive
- Maximum Drawdown Analysis
- Trade Count Statistical Significance
- Win Rate Analysis
- Profit Factor Analysis
- Complete Overfitting Detection Guide
- Strategy-Type Profiles (Momentum, Mean Reversion, Trend Following, Breakout)
- 10+ Annotated Example Backtests
- Common Confusion Points
The primer above covers 90% of use cases. Use --help for edge cases and detailed analysis.
Integration with Decision Framework
This skill complements the decision-framework skill:
- decision-framework: Provides thresholds and decision logic
- backtesting-analysis: Provides interpretation and context
Workflow:
- Load decision-framework to apply thresholds
- Load backtesting-analysis to understand what metrics mean
- Combine insights to make informed decisions
Related Files
- .claude/skills/decision-framework/skill.md - Decision thresholds
- SCRIPTS/decision_logic.py - Decision implementation
- .claude/commands/qc-backtest.md - Backtest execution
Version: 2.0.0 (Progressive Disclosure)
Last Updated: November 13, 2025
Lines: ~200 (was 555)
Context Reduction: 64%
Related Skills
evaluating-llms-harness (Testing)
This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends, including HuggingFace and vLLM models.
langchain (Meta)
LangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers, offers 500+ integrations, and includes features like tool calling and memory management. Use it for rapid prototyping and for deploying production systems like chatbots, autonomous agents, and question-answering services.
Algorithmic Art Generation (Meta)
This skill helps developers create algorithmic art using p5.js, focusing on generative art, computational aesthetics, and interactive visualizations. It automatically activates for topics like "generative art" or "p5.js visualization" and guides you through creating unique algorithms with features like seeded randomness, flow fields, and particle systems. Use it when you need to build reproducible, code-driven artistic patterns.
webapp-testing (Testing)
This Claude Skill provides a Playwright-based toolkit for testing local web applications through Python scripts. It enables frontend verification, UI debugging, screenshot capture, and log viewing while managing server lifecycles. Use it for browser automation tasks, but run scripts directly rather than reading their source code to avoid context pollution.
