
Backtesting Analysis


About

This skill helps developers interpret backtesting results and detect overfitting in trading strategies. It provides guidance on key performance metrics like Sharpe Ratio and strategy-specific expectations. Use it when evaluating backtest outcomes or comparing multiple strategies during Phase 3 analysis.

Quick Install

Claude Code

Plugin Command (Recommended)

/plugin add https://github.com/derekcrosslu/CLAUDE_CODE_EXPLORE

Git Clone (Alternative)

git clone https://github.com/derekcrosslu/CLAUDE_CODE_EXPLORE.git ~/.claude/skills/backtesting-analysis

Copy and paste one of these commands into Claude Code to install this skill.

Documentation

Backtesting Analysis Skill

Purpose: Interpret backtest results, understand performance metrics, and detect overfitting or unreliable strategies.

Progressive Disclosure: This primer contains essentials only. Full details are available via the --help command (see Reference Documentation below).


When to Use This Skill

Load when:

  • Evaluating backtest results (Phase 3)
  • Detecting potential overfitting
  • Understanding strategy-specific performance expectations
  • Comparing multiple strategies or explaining results

Quick Reference: Key Metrics

Sharpe Ratio (Primary Metric)

Formula: (Return - Risk-Free Rate) / Volatility

| Sharpe | Quality | Action |
| --- | --- | --- |
| < 0.5 | Poor | Abandon |
| 0.5 - 0.7 | Marginal | Consider optimization |
| 0.7 - 1.0 | Acceptable | Optimize |
| 1.0 - 1.5 | Good | Production-ready |
| 1.5 - 2.0 | Very Good | Validate thoroughly |
| > 3.0 | SUSPICIOUS | Likely overfitting |

Key Insight: QuantConnect reports an annualized Sharpe ratio. Sharpe > 1.0 is production-ready for most strategies.
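
For concreteness, here is a minimal sketch of the annualized Sharpe calculation from a series of daily returns. The 252-trading-day annualization factor and the risk_free_rate parameter are illustrative assumptions, not QuantConnect's exact implementation:

```python
import numpy as np

def annualized_sharpe(daily_returns, risk_free_rate=0.0, periods_per_year=252):
    """(Return - Risk-Free Rate) / Volatility, annualized."""
    excess = np.asarray(daily_returns) - risk_free_rate / periods_per_year
    vol = excess.std(ddof=1)
    if vol == 0:
        return float("nan")  # no volatility -> Sharpe undefined
    return excess.mean() / vol * np.sqrt(periods_per_year)
```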


Maximum Drawdown

Formula: (Trough - Peak) / Peak

| Drawdown | Quality | Action |
| --- | --- | --- |
| < 20% | Excellent | Low risk |
| 20% - 30% | Good | Acceptable for live trading |
| 30% - 40% | Concerning | Needs strong Sharpe to justify |
| > 40% | Too High | Unacceptable for most traders |

Key Insight: Drawdowns > 30% are hard to tolerate psychologically. Consider: "Could I stomach this loss in real money?"
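
A minimal sketch of the maximum drawdown computation over an equity curve (a series of portfolio values); the function name is illustrative:

```python
import numpy as np

def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, as a (negative) fraction of the peak."""
    equity = np.asarray(equity_curve, dtype=float)
    running_peak = np.maximum.accumulate(equity)
    drawdowns = (equity - running_peak) / running_peak  # (Trough - Peak) / Peak
    return drawdowns.min()  # e.g. -0.22 means a 22% drawdown

print(max_drawdown([100, 120, 90, 110, 140]))  # -0.25: the 120 -> 90 drop
```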


Total Trades (Statistical Significance)

| Trade Count | Reliability | Decision Impact |
| --- | --- | --- |
| < 20 | Unreliable | Abandon or escalate |
| 20 - 30 | Low | Minimum viable |
| 30 - 50 | Moderate | Acceptable |
| 50 - 100 | Good | Strong confidence |
| 100+ | Excellent | Highly reliable |

Key Insight: Need 30+ trades for basic significance, 100+ for high confidence. Few trades = unreliable metrics.
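
The statistical intuition: uncertainty in an estimated win rate shrinks roughly as 1/√n. A sketch using a normal-approximation 95% confidence interval (illustrative numbers) shows why small samples are untrustworthy:

```python
import math

def win_rate_ci(wins, total_trades, z=1.96):
    """Normal-approximation 95% confidence interval for the true win rate."""
    p = wins / total_trades
    half_width = z * math.sqrt(p * (1 - p) / total_trades)
    return p - half_width, p + half_width

print(win_rate_ci(9, 15))    # ~ (0.35, 0.85): 15 trades tell you almost nothing
print(win_rate_ci(60, 100))  # ~ (0.50, 0.70): 100 trades narrow it substantially
```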


Win Rate

| Win Rate | Quality | Interpretation |
| --- | --- | --- |
| < 40% | Low | Needs large winners (trend following) |
| 40% - 55% | Average | Typical for most strategies |
| 55% - 65% | Good | Strong edge |
| > 75% | SUSPICIOUS | Likely overfitting |

Key Insight: Win rate alone is misleading. Must consider profit factor and average win/loss ratio.
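
To see why, expected profit per trade (expectancy) combines win rate with average win and loss sizes. A minimal sketch with illustrative numbers:

```python
def expectancy(win_rate, avg_win, avg_loss):
    """Expected profit per trade; avg_loss is a positive magnitude."""
    return win_rate * avg_win - (1 - win_rate) * avg_loss

# A 35% win rate is profitable when winners dwarf losers (trend following)...
print(expectancy(0.35, avg_win=300, avg_loss=100))  # +40.0 per trade
# ...while a 75% win rate loses money when losers dwarf winners.
print(expectancy(0.75, avg_win=50, avg_loss=200))   # -12.5 per trade
```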


Profit Factor

Formula: Gross Profit / Gross Loss

| Profit Factor | Quality | Interpretation |
| --- | --- | --- |
| < 1.3 | Marginal | Transaction costs may kill it |
| 1.3 - 1.5 | Acceptable | Decent after costs |
| 1.5 - 2.0 | Good | Strong profitability |
| > 3.0 | Exceptional | Outstanding (verify no overfitting) |

Key Insight: Minimum 1.5 for live trading to cover slippage and commissions.
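
A minimal sketch of the profit factor computation from per-trade profit-and-loss values (the function name is illustrative):

```python
def profit_factor(trade_pnls):
    """Gross Profit / Gross Loss, with the loss taken as a positive magnitude."""
    gross_profit = sum(pnl for pnl in trade_pnls if pnl > 0)
    gross_loss = -sum(pnl for pnl in trade_pnls if pnl < 0)
    if gross_loss == 0:
        return float("inf")  # no losing trades -- itself a red flag
    return gross_profit / gross_loss

print(profit_factor([120, -50, 80, -40, 60]))  # 260 / 90 ~= 2.89
```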


Overfitting Detection (Red Flags)

  1. Too Perfect Sharpe (> 3.0) → ESCALATE_TO_HUMAN
  2. Too High Win Rate (> 75%) → Check for look-ahead bias
  3. Too Few Trades (< 20) → Unreliable metrics
  4. Excessive Optimization Improvement (> 30%) → Lucky parameters
  5. Severe Out-of-Sample Degradation (> 40%) → ABANDON_HYPOTHESIS
  6. Equity Curve Too Smooth → Check unrealistic assumptions
  7. Works Only in One Market Regime → Not robust

Remember: If it looks too good to be true, it probably is.
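
The quantitative flags above can be screened mechanically. A hypothetical checker, with thresholds mirroring the list (the two qualitative flags, equity-curve smoothness and regime dependence, still require human judgment):

```python
def overfitting_red_flags(sharpe, win_rate, total_trades,
                          optimization_gain=None, oos_degradation=None):
    """Return the quantitative red flags triggered by a backtest."""
    flags = []
    if sharpe > 3.0:
        flags.append("Sharpe > 3.0 -> ESCALATE_TO_HUMAN")
    if win_rate > 0.75:
        flags.append("Win rate > 75% -> check for look-ahead bias")
    if total_trades < 20:
        flags.append("Fewer than 20 trades -> unreliable metrics")
    if optimization_gain is not None and optimization_gain > 0.30:
        flags.append("Optimization improvement > 30% -> lucky parameters")
    if oos_degradation is not None and oos_degradation > 0.40:
        flags.append("Out-of-sample degradation > 40% -> ABANDON_HYPOTHESIS")
    return flags
```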


Strategy-Type Expectations

Momentum

  • Sharpe: 0.8 - 1.5 | Drawdown: 20-35% | Win Rate: 40-55%

Mean Reversion

  • Sharpe: 0.7 - 1.3 | Drawdown: 15-30% | Win Rate: 55-70%

Trend Following

  • Sharpe: 0.5 - 1.0 | Drawdown: 25-40% | Win Rate: 30-50%

Breakout

  • Sharpe: 0.6 - 1.2 | Drawdown: 20-35% | Win Rate: 40-55%

Use these to calibrate expectations - different strategies have different profiles.
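
These ranges can be encoded as a lookup for automated calibration. A sketch with hypothetical names; the values are taken from the lists above:

```python
# Expected (low, high) range per metric for each strategy type.
STRATEGY_PROFILES = {
    "momentum":        {"sharpe": (0.8, 1.5), "drawdown": (0.20, 0.35), "win_rate": (0.40, 0.55)},
    "mean_reversion":  {"sharpe": (0.7, 1.3), "drawdown": (0.15, 0.30), "win_rate": (0.55, 0.70)},
    "trend_following": {"sharpe": (0.5, 1.0), "drawdown": (0.25, 0.40), "win_rate": (0.30, 0.50)},
    "breakout":        {"sharpe": (0.6, 1.2), "drawdown": (0.20, 0.35), "win_rate": (0.40, 0.55)},
}

def outside_profile(strategy_type, metrics):
    """Return the metrics that fall outside the expected range for this style."""
    profile = STRATEGY_PROFILES[strategy_type]
    return {name: value for name, value in metrics.items()
            if name in profile and not profile[name][0] <= value <= profile[name][1]}

# A mean-reversion strategy with a 45% win rate is out of character:
print(outside_profile("mean_reversion", {"sharpe": 1.0, "win_rate": 0.45}))
```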


Example Decisions

GOOD (Optimization Worthy)

Sharpe: 0.85, Drawdown: 22%, Trades: 67, Win Rate: 42%, PF: 1.8
→ PROCEED_TO_OPTIMIZATION (decent baseline, worth improving)

EXCELLENT (Production Ready)

Sharpe: 1.35, Drawdown: 18%, Trades: 142, Win Rate: 53%, PF: 2.1
→ PROCEED_TO_VALIDATION (already strong, skip optimization)

SUSPICIOUS (Overfitting)

Sharpe: 4.2, Drawdown: 5%, Trades: 25, Win Rate: 88%, PF: 5.8
→ ESCALATE_TO_HUMAN (too perfect, likely look-ahead bias or bug)

POOR (Abandon)

Sharpe: 0.3, Drawdown: 38%, Trades: 89, Win Rate: 35%, PF: 1.1
→ ABANDON_HYPOTHESIS (poor risk-adjusted returns)
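
A hypothetical triage function that ties these thresholds together. It reproduces the four example decisions above, but the authoritative cutoffs live in SCRIPTS/decision_logic.py and may differ:

```python
def triage_backtest(sharpe, drawdown, trades, win_rate, profit_factor):
    """Map headline metrics to the decision labels used in the examples."""
    if sharpe > 3.0 or win_rate > 0.75 or trades < 20:
        return "ESCALATE_TO_HUMAN"       # too perfect, or too little data
    if sharpe < 0.5 or profit_factor < 1.3 or drawdown > 0.40:
        return "ABANDON_HYPOTHESIS"      # poor risk-adjusted returns
    if sharpe >= 1.0 and drawdown <= 0.20 and trades >= 100:
        return "PROCEED_TO_VALIDATION"   # already strong, skip optimization
    return "PROCEED_TO_OPTIMIZATION"     # decent baseline, worth improving

print(triage_backtest(0.85, 0.22, 67, 0.42, 1.8))  # PROCEED_TO_OPTIMIZATION
print(triage_backtest(4.2, 0.05, 25, 0.88, 5.8))   # ESCALATE_TO_HUMAN
```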

Common Confusion Points

Q: "Strategy made 200% returns, but Sharpe is only 0.6 - is this good?" A: No. We prioritize risk-adjusted returns (Sharpe), not raw returns. High returns with high volatility = bad Sharpe = risky.

Q: "Sharpe 2.5 with 15 trades - should I proceed?" A: ESCALATE_TO_HUMAN. Too few trades (<20) for statistical significance. High Sharpe with few trades = luck, not skill.

Q: "Optimization improved Sharpe from 0.8 to 1.5 (87% improvement) - is this good?" A: ESCALATE_TO_HUMAN. 87% > 30% threshold = likely overfitting to in-sample period.

Q: "Win rate is 78%, Sharpe is 1.2 - why is this flagged?" A: Win rate > 75% is an overfitting signal. Real strategies rarely achieve such high win rates.


Key Principles

  1. Sharpe ratio is king - Primary metric for risk-adjusted returns
  2. Trade count matters - Need 30+ for reliability, 100+ for confidence
  3. Beware overfitting - Too perfect results are suspicious
  4. Context by strategy type - Different strategies have different expectations
  5. Risk-adjusted, not raw returns - High returns with high volatility = bad

Reference Documentation (Progressive Disclosure)

Need detailed analysis? All reference documentation is accessible via --help:

python SCRIPTS/backtesting_analysis.py --help

That's the only way to access complete reference documentation.

Topics covered in --help:

  • Sharpe Ratio Deep Dive
  • Maximum Drawdown Analysis
  • Trade Count Statistical Significance
  • Win Rate Analysis
  • Profit Factor Analysis
  • Complete Overfitting Detection Guide
  • Strategy-Type Profiles (Momentum, Mean Reversion, Trend Following, Breakout)
  • 10+ Annotated Example Backtests
  • Common Confusion Points

The primer above covers 90% of use cases. Use --help for edge cases and detailed analysis.


Integration with Decision Framework

This skill complements the decision-framework skill:

  • decision-framework: Provides thresholds and decision logic
  • backtesting-analysis: Provides interpretation and context

Workflow:

  1. Load decision-framework to apply thresholds
  2. Load backtesting-analysis to understand what metrics mean
  3. Combine insights to make informed decisions

Related Files

  • .claude/skills/decision-framework/skill.md - Decision thresholds
  • SCRIPTS/decision_logic.py - Decision implementation
  • .claude/commands/qc-backtest.md - Backtest execution

Version: 2.0.0 (Progressive Disclosure)
Last Updated: November 13, 2025
Lines: ~200 (was 555) | Context Reduction: 64%

GitHub Repository

derekcrosslu/CLAUDE_CODE_EXPLORE
Path: .claude/skills/backtesting-analysis
