How to Backtest a Trading Strategy in Python
A backtest tells you how a strategy would have performed on historical data. Done right, it's a cheap way to kill bad ideas before they cost real money. Done wrong, it's a confidence booster for strategies that will lose every penny. This guide covers the full pipeline: data, vectorized vs event-driven backtests, the metrics that actually matter, and the traps that fool most beginners.
What we'll cover
- Why most beginner backtests lie
- Step 1: Get clean historical data
- Step 2: Write a vectorized backtest
- Step 3: Why event-driven is more honest
- Step 4: Model slippage, fees, and latency
- Step 5: The metrics that matter (and the ones that don't)
- Step 6: Walk-forward validation
- Recommended tools & frameworks
01 Why most beginner backtests lie
Before any code, the most important truth: a backtest that shows a 150% annual return is almost certainly broken. Real institutional strategies typically target something like 10–25% annual returns on a risk-adjusted basis. If your script claims far more, the most likely explanations are:
- Lookahead bias — you accidentally used future information (e.g., today's close to decide today's open trade)
- Survivorship bias — your data only includes assets that exist today, ignoring the ones that delisted/went to zero
- Overfitting — you tuned parameters until the curve looked perfect on this exact dataset (and only this one)
- No cost model — fills happen at the mid price, fees are zero, slippage is zero — none of which is real
- Implementation bug — off-by-one indexing, wrong sign on returns, etc.
Rule of thumb: When your backtest result looks amazing, your first question should be "what's broken?" — not "how soon can I deploy?"
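To make lookahead bias concrete, here is a minimal sketch on synthetic random-walk prices, where no real edge exists by construction. The data, the seed, and the variable names `leaky` and `honest` are illustrative assumptions, not part of any strategy in this guide.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk prices: by construction there is no edge to find.
rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 2000))))
returns = close.pct_change().fillna(0)

# Signal computed from today's close: 1 if the bar closed up, else 0.
signal = (close > close.shift(1)).astype(int)

# Leaky: today's signal is applied to today's return, so the "decision"
# already knows how the bar ended.
leaky = signal * returns
# Honest: yesterday's signal is applied to today's return.
honest = signal.shift(1).fillna(0) * returns

print(f"leaky total return:  {(1 + leaky).prod() - 1:.1%}")
print(f"honest total return: {(1 + honest).prod() - 1:.1%}")
```

The leaky version can never lose on any bar, because it is only long on bars that have already closed up. A missing `shift(1)` is enough to turn noise into a perfect-looking equity curve.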
02 Step 1: Get clean historical data
Data quality is the foundation. Bad data = backtest is fiction. For most retail strategies, you want:
- OHLCV bars (Open, High, Low, Close, Volume) at the resolution your strategy needs
- Adjusted for splits and dividends if trading equities
- Same exchange / timezone as where you'll trade live
Free sources that are good enough to start:
- Crypto: ccxt's fetch_ohlcv against any major exchange. Each call returns a limited batch of bars (the example below requests 1,000), so longer histories need repeated calls using the since parameter
- US equities: yfinance for daily data; Polygon.io for intraday (paid)
- Kalshi / prediction markets: their public API exposes historical orderbook snapshots
A 2-minute crypto data pull with ccxt:
import ccxt
import pandas as pd
exchange = ccxt.binance()
ohlcv = exchange.fetch_ohlcv("BTC/USDT", timeframe="1h", limit=1000)
df = pd.DataFrame(ohlcv, columns=["ts", "open", "high", "low", "close", "volume"])
df["ts"] = pd.to_datetime(df["ts"], unit="ms")
df = df.set_index("ts")
print(df.tail())
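Before trusting a pull like the one above, it can help to run a few sanity checks. The sketch below is one possible set, assuming a DataFrame indexed by timestamp with the same column names as above. The function name `check_ohlcv` and its `bar_size` parameter are illustrative, not part of ccxt or pandas.

```python
import pandas as pd

def check_ohlcv(df: pd.DataFrame, bar_size: str = "1h") -> dict:
    """Count common data problems in an OHLCV frame indexed by timestamp."""
    # Every timestamp we would expect if no bars were missing.
    expected = pd.date_range(df.index.min(), df.index.max(),
                             freq=pd.Timedelta(bar_size))
    prices = df[["open", "high", "low", "close"]]
    return {
        "duplicate_timestamps": int(df.index.duplicated().sum()),
        "missing_bars": int(len(expected.difference(df.index.unique()))),
        "high_below_low": int((df["high"] < df["low"]).sum()),
        "close_outside_range": int(
            ((df["close"] > df["high"]) | (df["close"] < df["low"])).sum()
        ),
        "non_positive_prices": int((prices <= 0).any(axis=1).sum()),
    }
```

Calling `check_ohlcv(df, "1h")` on the frame built above should ideally return zeros everywhere. Any non-zero count is worth investigating before running a backtest on that data.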
03 Step 2: Write a vectorized backtest
"Vectorized" means: express your entire strategy as pandas/numpy operations on a whole price series at once. Fast (seconds for years of data), great for prototyping. Here's a moving-average crossover on the data above:
import numpy as np
# Generate signals
df["short_ma"] = df["close"].rolling(5).mean()
df["long_ma"] = df["close"].rolling(20).mean()
df["signal"] = (df["short_ma"] > df["long_ma"]).astype(int)
# Position: hold the signal from the PREVIOUS bar
# (shift(1) avoids lookahead bias — you don't know today's close at today's open)
df["position"] = df["signal"].shift(1).fillna(0)
# Returns
df["returns"] = df["close"].pct_change()
df["strategy_returns"] = df["position"] * df["returns"]
# Equity curve
df["equity"] = (1 + df["strategy_returns"]).cumprod()
print(df[["close", "position", "strategy_returns", "equity"]].tail())
total_return = df["equity"].iloc[-1] - 1
print(f"Total return: {total_return:.2%}")
That's a complete vectorized backtest in ~10 lines. Fast and useful for first-pass sanity checks. But it has hidden assumptions worth calling out.
Hidden assumption: Vectorized backtests implicitly assume you can trade at every bar's close at exactly the close price, with no slippage. That's never true in practice. Vectorized is good for "does this signal even have edge?" — not for "should I deploy with $100k."
04 Step 3: Why event-driven is more honest
An event-driven backtest simulates time passing. It walks through bars one at a time, only sees information available up to that moment, decides whether to send orders, and matches those orders against the bars that follow. Closer to live trading because the data flow matches what happens in production.
A minimal event-driven structure:
import pandas as pd

class Backtest:
    def __init__(self, df, initial_cash=10000):
        # Work on a copy so the caller's DataFrame is not modified
        self.df = df.copy()
        self.cash = initial_cash
        self.position = 0  # in units of the asset
        self.history = []

    def on_bar(self, ts, bar, strategy):
        # 1. strategy decides, seeing only the current bar
        #    (next_open is hidden so the strategy cannot peek ahead)
        visible = bar.drop("next_open")
        decision = strategy.decide(ts, visible, self.position, self.cash)
        # 2. simulate fill (next bar's open, with cost model)
        if decision == "buy" and self.position == 0:
            fill_price = bar["next_open"] * 1.0005  # 5bps slippage
            self.position = self.cash / fill_price
            self.cash = 0
        elif decision == "sell" and self.position > 0:
            fill_price = bar["next_open"] * 0.9995
            self.cash = self.position * fill_price
            self.position = 0
        # 3. mark-to-market at this bar's close (approximate on trade bars,
        #    since the fill itself is priced at the next open)
        equity = self.cash + self.position * bar["close"]
        self.history.append({"ts": ts, "equity": equity})

    def run(self, strategy):
        # Pre-compute next-bar open for fill simulation
        self.df["next_open"] = self.df["open"].shift(-1)
        for ts, bar in self.df.iterrows():
            if pd.isna(bar["next_open"]):
                break  # the last bar has no next open to fill against
            self.on_bar(ts, bar, strategy)
This is more code but more truthful. Fills happen at the next bar's open (you can't fill on a candle you haven't seen yet), slippage is modeled, and the equity curve is closer to what an actual broker statement would show.
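The `strategy` object passed to `on_bar` only needs a `decide(ts, bar, position, cash)` method that returns "buy", "sell", or anything else to do nothing. As a sketch, here is the moving-average crossover from Step 2 in that shape. The class name `MovingAverageCross` and the "hold" return value are illustrative choices, not a fixed interface.

```python
from collections import deque

class MovingAverageCross:
    """Bar-by-bar moving-average crossover that only remembers past closes."""

    def __init__(self, short=5, long=20):
        self.short = short
        self.long = long
        # Keep just enough history to compute the long average.
        self.closes = deque(maxlen=long)

    def decide(self, ts, bar, position, cash):
        self.closes.append(bar["close"])
        if len(self.closes) < self.long:
            return "hold"  # not enough history yet
        recent = list(self.closes)
        short_ma = sum(recent[-self.short:]) / self.short
        long_ma = sum(recent) / self.long
        if short_ma > long_ma and position == 0:
            return "buy"
        if short_ma < long_ma and position > 0:
            return "sell"
        return "hold"
```

With the class above, a run would look like `bt = Backtest(df); bt.run(MovingAverageCross())`, after which `bt.history` holds the equity curve. Because the strategy builds its averages from bars it has already been handed, it cannot use data from bars it has not seen.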
05 Step 4: Model slippage, fees, and latency
Three cost components your backtest probably ignores:
| Cost | Typical value | How to model |
|---|---|---|
| Exchange fees | 0.05–0.10% per side (crypto), free–0.005% (US equities) | Subtract from every fill |
| Slippage | 1–10 bps per side, more on illiquid markets | Adjust fill price; or use a volume-impact model |
| Latency | 50–500 ms retail, 1–10 µs HFT | For low-freq strategies, ignore. For high-freq, fill at later bar. |
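For most retail strategies, a flat 10 basis points round-trip cost (fees + slippage on both sides) is a decent first approximation in liquid, low-fee markets. At the crypto fee levels in the table above, fees alone come to 10–20 bps round-trip, so use a higher figure there. If your strategy doesn't survive that, it doesn't survive live.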
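A flat cost is easy to bolt onto the vectorized backtest from Step 2: charge a fixed fraction every time the position changes. The sketch below assumes a 0/1 position series like the one built earlier. The function name `net_returns` and the 5 bps per side default (10 bps round-trip) are illustrative assumptions.

```python
import pandas as pd

def net_returns(position: pd.Series, asset_returns: pd.Series,
                cost_per_side: float = 0.0005) -> pd.Series:
    """Strategy returns after a flat cost on every change in position."""
    # Size of each trade: 1.0 when entering or exiting a full position.
    trades = position.diff().abs()
    # The first bar has no previous position; treat any holding as an entry.
    trades.iloc[0] = abs(position.iloc[0])
    gross = position * asset_returns
    return gross - trades * cost_per_side
```

With the columns from Step 2, this would be `df["net"] = net_returns(df["position"], df["returns"].fillna(0))`. Comparing the equity curve before and after costs shows quickly whether the edge is larger than the friction.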
06 Step 5: The metrics that matter
"Total return" is the least useful metric. Anyone can pick a leveraged crypto bull market and claim 800% returns. What separates real strategies from gambling is risk-adjusted performance.
The core five to always compute:
import numpy as np

def metrics(equity_curve, risk_free_rate=0.04, periods_per_year=252):
    # periods_per_year: 252 for daily equity bars; for intraday data use
    # bars per year (e.g. 24 * 365 for hourly crypto bars)
    returns = equity_curve.pct_change().dropna()
    n_years = len(returns) / periods_per_year
    # Annualized return
    cagr = (equity_curve.iloc[-1] / equity_curve.iloc[0]) ** (1 / n_years) - 1
    # Annualized volatility
    ann_vol = returns.std() * np.sqrt(periods_per_year)
    # Sharpe ratio (per year); uses CAGR as the return estimate,
    # which is a simplification of the mean-excess-return definition
    sharpe = (cagr - risk_free_rate) / ann_vol
    # Sortino ratio (only downside vol, approximated here by the
    # standard deviation of the negative returns)
    downside = returns[returns < 0].std() * np.sqrt(periods_per_year)
    sortino = (cagr - risk_free_rate) / downside
    # Max drawdown: worst peak-to-trough decline of the equity curve
    cummax = equity_curve.cummax()
    drawdown = (equity_curve / cummax - 1)
    max_dd = drawdown.min()
    return {
        "CAGR": f"{cagr:.1%}",
        "Vol": f"{ann_vol:.1%}",
        "Sharpe": f"{sharpe:.2f}",
        "Sortino": f"{sortino:.2f}",
        "Max DD": f"{max_dd:.1%}",
    }
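The function above is a practical shortcut. For comparison, here is a sketch of a different, more textbook-style variant: Sharpe from the mean per-period excess return, and Sortino using downside deviation (root mean square of the below-target returns) instead of the standard deviation of negative returns. The function name `sharpe_sortino` and its defaults are assumptions for illustration. The two approaches will generally give somewhat different numbers.

```python
import numpy as np
import pandas as pd

def sharpe_sortino(returns: pd.Series, risk_free_rate: float = 0.04,
                   periods_per_year: int = 252):
    """Annualized Sharpe and Sortino from per-period returns."""
    # Convert the annual risk-free rate to a simple per-period rate.
    excess = returns - risk_free_rate / periods_per_year
    annualize = np.sqrt(periods_per_year)
    sharpe = excess.mean() / returns.std() * annualize
    # Downside deviation: only below-target periods contribute,
    # but the average is taken over all periods.
    downside_dev = np.sqrt((np.minimum(excess, 0) ** 2).mean())
    sortino = excess.mean() / downside_dev * annualize
    return sharpe, sortino, downside_dev
```

This takes a returns series rather than an equity curve, so a call would look like `sharpe_sortino(equity_curve.pct_change().dropna())`. If no period falls below the target, the downside deviation is zero and the Sortino value is not meaningful.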
Targets to aim for:
- Sharpe ≥ 1.0 — anything below is barely worth the effort vs index investing
- Sharpe ≥ 2.0 — institutional-quality, very rare for retail
- Max drawdown ≤ 30% — losses bigger than this are emotionally impossible to ride out
- Sortino well above Sharpe: little of the volatility comes from losses (bigger upside than downside), which is good
What does NOT matter (or matters less than people think):
- Win rate: a 35% win rate strategy with 3:1 average win/loss has a much higher expected profit per trade than a 70% win rate with 0.5:1
- Number of trades — more trades just means more fees
- "Looks like a smooth equity curve" — usually the result of overfitting
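The win-rate point is simple arithmetic. Measured in units of the average loss, the expected profit per trade is win rate times the win/loss ratio, minus the loss rate. The helper name `expectancy` below is illustrative.

```python
def expectancy(win_rate: float, win_loss_ratio: float) -> float:
    """Expected profit per trade, in units of the average loss."""
    return win_rate * win_loss_ratio - (1 - win_rate)

low_win_rate = expectancy(0.35, 3.0)   # 0.35 * 3.0 - 0.65, about 0.40
high_win_rate = expectancy(0.70, 0.5)  # 0.70 * 0.5 - 0.30, about 0.05

print(f"35% win rate, 3:1 payoff:   {low_win_rate:.2f}")
print(f"70% win rate, 0.5:1 payoff: {high_win_rate:.2f}")
```

The strategy that loses almost two trades out of three earns roughly eight times as much per trade, before costs.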
07 Step 6: Walk-forward validation
The most common backtesting mistake: optimizing parameters on the same dataset you report results on. This guarantees overfitting.
The fix is walk-forward analysis:
- Split your data into N rolling windows (say, 12 windows of 1 year each)
- For each window: optimize parameters on the first 9 months, then test on the remaining 3 months. The test result is what counts — the optimization is throwaway.
- Concatenate the 3-month out-of-sample test segments. That's your honest performance.
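import pandas as pd

def walk_forward(df, train_months=9, test_months=3):
    # optimize_on and run_backtest are placeholders for your own functions
    results = []
    end = df.index[-1]
    current = df.index[0]
    while current + pd.DateOffset(months=train_months + test_months) <= end:
        train_end = current + pd.DateOffset(months=train_months)
        test_end = current + pd.DateOffset(months=train_months + test_months)
        # Half-open windows: label slicing is inclusive on both ends, which
        # would put the boundary bar in both the train and the test set
        train = df[(df.index >= current) & (df.index < train_end)]
        test = df[(df.index >= train_end) & (df.index < test_end)]
        best_params = optimize_on(train)  # your strategy's hyperparam search
        test_equity = run_backtest(test, best_params)
        results.append(test_equity)
        current += pd.DateOffset(months=test_months)
    # If each segment's equity restarts at its own base, concatenate
    # per-bar returns instead and compound them once at the end
    return pd.concat(results)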
If your strategy looks great on full-dataset optimization but collapses on walk-forward, you don't have a strategy — you have a hindsight model.
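The `walk_forward` function above calls `optimize_on` and `run_backtest` without defining them. One possible pair for the moving-average crossover is sketched below. The parameter grids, the cost default, and the scoring rule (mean return divided by its standard deviation) are all assumptions chosen for illustration. Here `run_backtest` returns per-bar net returns rather than an equity curve, so that out-of-sample segments can be concatenated and compounded once.

```python
import itertools
import numpy as np
import pandas as pd

def run_backtest(data: pd.DataFrame, params, cost_per_side: float = 0.0005) -> pd.Series:
    """Per-bar net returns of a long-only MA crossover on data['close']."""
    short, long = params
    close = data["close"]
    signal = (close.rolling(short).mean() > close.rolling(long).mean()).astype(int)
    position = signal.shift(1).fillna(0)       # act on the previous bar's signal
    trades = position.diff().abs().fillna(0)   # 1.0 on each entry or exit
    return position * close.pct_change().fillna(0) - trades * cost_per_side

def optimize_on(train: pd.DataFrame, shorts=(5, 10, 20), longs=(50, 100, 200)):
    """Pick the (short, long) pair with the best in-sample return/volatility ratio."""
    def score(params):
        r = run_backtest(train, params)
        return r.mean() / r.std() if r.std() > 0 else -np.inf
    candidates = [(s, l) for s, l in itertools.product(shorts, longs) if s < l]
    return max(candidates, key=score)
```

With these in place, `oos = walk_forward(df)` yields stitched out-of-sample returns and `(1 + oos).cumprod()` is the corresponding equity curve. One simplification to be aware of: the averages are computed inside each test segment only, so the strategy stays flat during the warm-up bars at the start of every segment.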
Recommended tools & frameworks
| Tool | When to use it |
|---|---|
| vectorbt | Vectorized backtests, fast parameter sweeps, gorgeous plots |
| backtrader | Event-driven, broker simulation, multi-asset, more code |
| Freqtrade | Crypto-only, integrated hyperopt, best-in-class for parameter optimization |
| QuantConnect Lean | Cloud-based, multi-asset, used by institutional clients |
| Pure pandas | Anything you want to fully understand from scratch |
For early prototyping, pure pandas (like the examples above) is hard to beat — you see exactly what's happening at every line, no framework magic.
A good backtest doesn't prove your strategy will work. It proves your strategy isn't obviously broken.