Tutorial
How to Backtest a Strategy — And What the Numbers Actually Mean
A backtest that shows 40% annual returns is not evidence that a strategy works. It is evidence that the strategy worked on that specific dataset, under those specific conditions, as implemented by that specific code. The gap between those two claims is where most systematic traders lose money. This article covers how to run a backtest correctly, and — more importantly — how to interpret the results without mistaking past performance for future prediction.
What Backtesting Actually Is
A backtest applies a strategy function to historical data and measures the output. It answers one question: "If this strategy had been running during this period, what would the outcomes have been?" It does not answer: "Will this strategy produce these outcomes in the future?"
That gap — between past and future — is the most important gap in systematic trading. Historical data is fixed. Markets are not. A regime shift, a change in market microstructure, or an increase in the number of participants running the same strategy can erode an edge that looked robust in backtest. Most backtesting tutorials do not address this directly because it is uncomfortable: it means no backtest result, however clean, is proof that a strategy will work.
What a backtest does provide is a structured way to test a hypothesis. If a strategy cannot outperform the benchmark on historical data, it has no case to be run live. If it does outperform on historical data, the next question is whether that outperformance survives out-of-sample validation, parameter sensitivity analysis, and an honest accounting of transaction costs. This article works through each of those checks in sequence. The goal is not to produce a good-looking backtest — it is to produce a backtest that fails fast when the hypothesis is weak.
Step 1 — Fetch Historical Data
Use Alpaca's free historical data tier for daily bar data. Stock bar requests must be authenticated, so create a free account and pass its API key and secret to the client. The environment variable names below are this example's choice:
import os
from datetime import datetime
import pandas as pd
from alpaca.data.historical import StockHistoricalDataClient
from alpaca.data.requests import StockBarsRequest
from alpaca.data.timeframe import TimeFrame
# Credentials come from a free Alpaca account; never hard-code them
client = StockHistoricalDataClient(
    api_key=os.environ["APCA_API_KEY_ID"],
    secret_key=os.environ["APCA_API_SECRET_KEY"],
)
request = StockBarsRequest(
    symbol_or_symbols=["SPY"],
    timeframe=TimeFrame.Day,
    start=datetime(2020, 1, 1),
    end=datetime(2023, 12, 31),
)
bars = client.get_stock_bars(request).df
bars = bars.droplevel("symbol")  # flatten the (symbol, timestamp) multi-index
bars.index = pd.to_datetime(bars.index)
print(bars[["open", "high", "low", "close", "volume"]].tail())
SPY — the S&P 500 ETF — is the standard backtesting vehicle for market-relative strategies because it has deep liquidity, minimal slippage in live trading, and a broad enough history to cover multiple distinct market regimes. The four-year window from 2020 to 2023 captures the pandemic crash, the recovery rally, the 2022 rate-hike selloff, and the 2023 recovery — four meaningfully different market environments in a single dataset.
Two data quality checks before proceeding: verify that the DataFrame contains no gaps in the daily index beyond expected market holidays, and verify that the closing prices align with known historical levels for SPY (approximately $320 at the start of 2020, approximately $470 at the end of 2023). If the data looks wrong, it probably is — missing or corrupted data produces backtest artifacts that can resemble genuinely good performance.
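The two checks above can be sketched as follows. This is a minimal illustration, not part of any Alpaca API: the helper names, the 4-calendar-day gap threshold (roughly a holiday weekend), and the 10% price tolerance are all assumptions, and the demo runs on synthetic data rather than real SPY bars.

```python
import pandas as pd


def find_index_gaps(index: pd.DatetimeIndex, max_gap_days: int = 4) -> pd.Series:
    """Return the calendar-day gaps between consecutive bars that exceed max_gap_days.

    A normal weekend is a 3-day gap and a holiday weekend a 4-day gap,
    so the default threshold (an assumption) flags anything longer.
    """
    gaps = index.to_series().diff().dt.days
    return gaps[gaps > max_gap_days]


def close_near(actual: float, expected: float, tolerance: float = 0.10) -> bool:
    """True if actual is within +/- tolerance (as a fraction) of expected."""
    return abs(actual / expected - 1) <= tolerance


# Synthetic demo: business days with ten consecutive bars deliberately removed
full_index = pd.bdate_range("2020-01-01", "2020-03-31")
broken_index = full_index.delete(list(range(10, 20)))

gaps = find_index_gaps(broken_index)
print(gaps)  # one flagged row, indexed by the first bar after the hole

# Price sanity check against a known level (values here are illustrative)
print(close_near(actual=324.9, expected=320.0))  # within 10% of the expected level
print(close_near(actual=32.49, expected=320.0))  # wrong scale, so the check fails
```

With real data, the same helpers would be called as find_index_gaps(bars.index) and close_near(bars["close"].iloc[0], 320.0). An empty result from the first and True from the second are necessary conditions for clean data, not proof of it.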
Step 2 — Implement the Strategy
The strategy used here is a moving average crossover: when a short-period moving average crosses above a long-period moving average, go long; when it crosses below, exit to cash. This is not a good trading strategy by modern standards — it is too simple, too widely known, and the edge has been largely arbitraged away. It is useful here precisely because the mechanism is transparent enough to illustrate interpretation without obscuring it:
def ma_crossover(prices: pd.Series, short_window: int = 10, long_window: int = 30) -> pd.Series:
    short_ma = prices.rolling(short_window).mean()
    long_ma = prices.rolling(long_window).mean()
    signal = (short_ma > long_ma).astype(int)
    return signal
A signal value of 1 means hold the long position; 0 means flat (not invested). The strategy does not short-sell. During the first long_window - 1 trading days the long moving average is NaN (the short one fills in sooner, after short_window - 1 days), and any comparison against NaN evaluates to False, so the signal is 0 and the strategy is flat during the warmup period. This is correct behavior: making decisions before the indicator has enough data to be meaningful produces trades that are effectively noise.
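The warmup behavior can be checked directly. The sketch below repeats the strategy function so that it runs on its own, and uses a synthetic, steadily rising price series (an assumption chosen so the expected signal is obvious: once both averages exist, the short one sits above the long one).

```python
import pandas as pd


def ma_crossover(prices: pd.Series, short_window: int = 10, long_window: int = 30) -> pd.Series:
    short_ma = prices.rolling(short_window).mean()
    long_ma = prices.rolling(long_window).mean()
    # Comparisons against NaN are False, so warmup bars come out as 0
    return (short_ma > long_ma).astype(int)


# Synthetic, steadily rising prices: 60 bars from 100.0 to 159.0
prices = pd.Series([100.0 + i for i in range(60)])

signal = ma_crossover(prices)
long_ma = prices.rolling(30).mean()

warmup_bars = int(long_ma.isna().sum())
print(warmup_bars)                       # 29: bars before the long MA exists
print(signal.iloc[:warmup_bars].sum())   # 0: flat throughout the warmup
print(signal.iloc[warmup_bars])          # 1: long on the first fully defined bar
```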
The choice of short_window=10 and long_window=30 is arbitrary for demonstration purposes. In a real evaluation, these parameters would be derived from a hypothesis about the time horizon of the market behavior you are trying to capture — not selected because they produce the best result on a particular historical window.
Step 3 — Calculate Returns
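Look-ahead bias is the most common backtesting error. It occurs when the strategy uses information that would not have been available at the time the trade decision was made. The most frequent form: using today's closing price to generate a signal, then crediting the strategy with today's return, as if the position had been opened before the close that produced the signal.
In reality, a signal generated at the 4:00pm close cannot earn that day's move; the earliest realistic execution is the next open or the next close. The .shift(1) on the position series prevents look-ahead bias by ensuring that today's signal sets tomorrow's position, not today's. One simplification remains: with close-to-close returns, tomorrow's position is treated as entered at today's closing price, which approximates an order filled at or just after the close. A stricter test would fill at the next day's open. The close-to-close version looks like this: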
prices = bars["close"]
signal = ma_crossover(prices)
# Shift signal by 1 to avoid look-ahead bias:
# today's signal becomes tomorrow's position
position = signal.shift(1)
daily_returns = prices.pct_change()
strategy_returns = daily_returns * position
# Cumulative performance
cumulative_market = (1 + daily_returns).cumprod()
cumulative_strategy = (1 + strategy_returns).cumprod()
print(f"Market return: {(cumulative_market.iloc[-1] - 1) * 100:.1f}%")
print(f"Strategy return: {(cumulative_strategy.iloc[-1] - 1) * 100:.1f}%")
Without the .shift(1), the strategy would collect each day's return using a signal that already depends on that day's close, a physical impossibility in live trading. The result is artificially inflated returns that cannot be replicated. This single error accounts for a substantial fraction of backtest results that fail to translate to live performance. The fix is one line of code, but understanding why it is necessary is what prevents the error from reappearing in more subtle forms.
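To see how much the missing shift can inflate results, here is a small sketch on synthetic data. It does not use the moving average crossover; instead it uses a deliberately extreme signal (long whenever today's close is above yesterday's) so that the effect is unmistakable. The random-walk prices and the seed are assumptions made for repeatability.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk prices (seeded so the run is repeatable)
rng = np.random.default_rng(0)
daily_moves = rng.normal(loc=0.0, scale=0.01, size=500)
prices = pd.Series(100.0 * np.cumprod(1.0 + daily_moves))

# Deliberately extreme signal: long whenever today's close beats yesterday's
signal = (prices > prices.shift(1)).astype(int)
daily_returns = prices.pct_change()

# Biased: today's signal is applied to today's return (look-ahead)
biased_returns = (daily_returns * signal).fillna(0.0)
# Honest: today's signal only affects tomorrow's return
honest_returns = (daily_returns * signal.shift(1)).fillna(0.0)

biased_total = (1 + biased_returns).prod() - 1
honest_total = (1 + honest_returns).prod() - 1

print(f"Biased total return: {biased_total * 100:.1f}%")
print(f"Honest total return: {honest_total * 100:.1f}%")
print(f"Biased losing days:  {(biased_returns < 0).sum()}")
```

The biased version never has a losing day, because it is only "invested" on days it already knows were up. On a random walk there is no real edge to find, so whatever the honest version earns is luck, while the biased version compounds every up day. Real look-ahead bugs are usually less blatant, but they inflate results through the same mechanism.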
Step 4 — Interpret the Metrics
Three metrics provide the most useful signal about strategy quality. None of them should be evaluated in isolation, and none of them should be compared to a single historical figure without asking whether the figure is stable across different time windows.
Sharpe ratio — return per unit of risk:
import numpy as np
sharpe = (strategy_returns.mean() / strategy_returns.std()) * np.sqrt(252)
print(f"Sharpe ratio: {sharpe:.2f}")
A Sharpe ratio above 1.0 is generally considered acceptable for a live strategy; above 2.0 is strong. The annualization factor np.sqrt(252) converts the daily Sharpe to an annualized figure (252 trading days per year). This simplified formula assumes a risk-free rate of zero; a stricter version subtracts the daily risk-free rate from the returns before dividing. The number itself matters less than its stability. A strategy with a Sharpe of 1.8 on the full four-year dataset but a Sharpe of 0.3 during 2022 alone has a much weaker case than a strategy with a Sharpe of 1.2 that holds across all four individual years. Stability across subperiods is evidence of a real edge; instability is evidence of parameter fitting to a particular market environment.
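One way to check stability is to compute the Sharpe ratio separately for each calendar year. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the risk-free rate is taken as zero (as in the formula above), and the demo uses synthetic returns with a strong year followed by a weak one.

```python
import numpy as np
import pandas as pd

TRADING_DAYS = 252


def annualized_sharpe(returns: pd.Series) -> float:
    """Annualized Sharpe ratio of daily returns, assuming a zero risk-free rate."""
    returns = returns.dropna()
    if len(returns) < 2:
        return float("nan")  # not enough data for a standard deviation
    std = returns.std()
    if std == 0:
        return float("nan")  # undefined when returns never vary
    return float(returns.mean() / std * np.sqrt(TRADING_DAYS))


def sharpe_by_year(returns: pd.Series) -> pd.Series:
    """Sharpe ratio computed separately for each calendar year in the index."""
    return returns.groupby(returns.index.year).apply(annualized_sharpe)


# Synthetic demo: positive drift in 2021, negative drift in 2022
rng = np.random.default_rng(1)
index = pd.bdate_range("2021-01-01", "2022-12-31")
drift = np.where(index.year == 2021, 0.002, -0.001)
returns = pd.Series(drift + rng.normal(0.0, 0.01, len(index)), index=index)

print(f"Full period: {annualized_sharpe(returns):.2f}")
print(sharpe_by_year(returns).round(2))
```

Applied to the backtest, the call would be sharpe_by_year(strategy_returns). A full-period figure that hides one or two very different yearly figures is the pattern to look for.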
Maximum drawdown — the worst peak-to-trough decline experienced during the backtest period:
rolling_max = cumulative_strategy.cummax()
drawdown = (cumulative_strategy - rolling_max) / rolling_max
max_drawdown = drawdown.min()
print(f"Max drawdown: {max_drawdown * 100:.1f}%")
Maximum drawdown is the risk metric most directly connected to whether you will sustain a strategy during live deployment. A strategy with a 30% historical drawdown should be expected to produce a drawdown of at least that size at some point during live deployment, and likely worse, because live conditions are not identical to backtest conditions. When setting position sizing and stop-loss parameters, assume the live maximum drawdown will be 1.5x to 2x the backtest maximum drawdown. That multiplier is not conservative pessimism; it is a rule of thumb that reflects how often live results fall short of their backtests.
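As a worked example of the 1.5x to 2x guideline, the sketch below turns a backtest drawdown into a planning range and then into a maximum allocation. The function names and both input figures (an 18% backtest drawdown and a 10% tolerable loss on total capital) are hypothetical, and the sizing rule is a simple proportional one, not a recommendation.

```python
def stressed_drawdown(backtest_drawdown: float, multiplier: float) -> float:
    """Scale a backtest max drawdown (a positive fraction) by a stress multiplier."""
    # A long-only, unleveraged position cannot lose more than 100%
    return min(backtest_drawdown * multiplier, 1.0)


def max_allocation(tolerable_loss: float, expected_drawdown: float) -> float:
    """Largest fraction of capital to allocate so that the expected drawdown
    costs no more than tolerable_loss of total capital."""
    if expected_drawdown <= 0:
        raise ValueError("expected_drawdown must be positive")
    return min(tolerable_loss / expected_drawdown, 1.0)


backtest_dd = 0.18      # hypothetical: backtest showed an 18% max drawdown
tolerable_loss = 0.10   # hypothetical: willing to lose at most 10% of capital

for multiplier in (1.5, 2.0):
    live_dd = stressed_drawdown(backtest_dd, multiplier)
    allocation = max_allocation(tolerable_loss, live_dd)
    print(f"x{multiplier}: plan for a {live_dd:.0%} drawdown, "
          f"allocate up to {allocation:.0%} of capital")
```

With these inputs the planning range is a 27% to 36% drawdown, which caps the allocation at roughly 37% and 28% of capital respectively.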
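Win rate, the fraction of days in the market with a positive return (this counts days, not round-trip trades):
# Drop the NaN warmup row, then keep only days with a non-zero strategy return
active_days = strategy_returns.dropna()
active_days = active_days[active_days != 0]
win_rate = (active_days > 0).mean()
print(f"Win rate: {win_rate * 100:.1f}%")
print(f"Active days: {len(active_days)}")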
Win rate in isolation is meaningless. A strategy with a 30% win rate can be highly profitable if the average win is 3x the average loss. A strategy with a 70% win rate can be a net loser if the average loss is 5x the average win. The relevant metric is the combination of win rate and average win/loss ratio — not either figure alone. When reporting win rate, always report the average win, average loss, and total trade count alongside it.
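The interaction between win rate and win/loss size can be made concrete with an expectancy calculation, and the statistics can be computed per round-trip trade rather than per day. The sketch below is one possible approach: the helper names are hypothetical, a trade is assumed to be a run of consecutive days with a position greater than zero, and the demo data is synthetic.

```python
import pandas as pd


def expectancy(win_rate: float, avg_win: float, avg_loss: float) -> float:
    """Expected result per trade. avg_loss is given as a positive magnitude."""
    return win_rate * avg_win - (1 - win_rate) * avg_loss


def trade_returns(strategy_returns: pd.Series, position: pd.Series) -> pd.Series:
    """Compound daily strategy returns into one return per round-trip trade.

    A trade is a run of consecutive days with position > 0.
    """
    held = position.fillna(0) > 0
    # A trade starts on a held day whose previous day was not held
    starts = held & ~held.shift(1, fill_value=False)
    trade_id = starts.cumsum()[held]
    return (1 + strategy_returns[held]).groupby(trade_id).prod() - 1


# The two cases from the text, measured in units of the smaller figure
print(expectancy(win_rate=0.30, avg_win=3.0, avg_loss=1.0))  # positive
print(expectancy(win_rate=0.70, avg_win=1.0, avg_loss=5.0))  # negative

# Tiny synthetic example: two trades, one winner and one loser
position = pd.Series([0, 1, 1, 0, 1, 0])
daily = pd.Series([0.00, 0.01, 0.02, 0.00, -0.03, 0.01])
per_trade = trade_returns(daily * position, position)

wins, losses = per_trade[per_trade > 0], per_trade[per_trade < 0]
print(f"Trades: {len(per_trade)}, win rate: {(per_trade > 0).mean():.0%}")
print(f"Avg win: {wins.mean():.4f}, avg loss: {losses.mean():.4f}")
```

In the backtest, position would be the shifted signal from Step 3, so the call becomes trade_returns(strategy_returns, position).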
Step 5 — Out-of-Sample Validation
The critical test is whether the strategy's performance on data it has never been evaluated on is consistent with its in-sample performance. Split the dataset and evaluate each window separately:
train = bars.loc["2020":"2022", "close"]
test = bars.loc["2023":, "close"]
# Evaluate on training window
signal_train = ma_crossover(train)
pos_train = signal_train.shift(1)
ret_train = train.pct_change() * pos_train
sharpe_train = (ret_train.mean() / ret_train.std()) * np.sqrt(252)
# Evaluate on test window (held out)
signal_test = ma_crossover(test)
pos_test = signal_test.shift(1)
ret_test = test.pct_change() * pos_test
sharpe_test = (ret_test.mean() / ret_test.std()) * np.sqrt(252)
print(f"In-sample Sharpe (2020-2022): {sharpe_train:.2f}")
print(f"Out-of-sample Sharpe (2023): {sharpe_test:.2f}")
A strategy whose Sharpe drops from 1.8 on the training window to 0.4 on the test window is probably overfit, although a single year is a small sample and one weak year can also be noise. The parameters, short_window=10 and long_window=30, were likely selected, consciously or not, because they performed well on the training data. When presented with new data from a different market environment, the apparent edge disappears.
When out-of-sample performance is significantly weaker than in-sample performance, the choices are: adjust the hypothesis and re-derive parameters from first principles (not re-optimize them on the same data to get a better test result), change the universe or time frame, or discard the hypothesis entirely. Re-running parameter optimization until the out-of-sample result improves is not validation — it is simply moving the overfitting problem forward by one step. The test set becomes contaminated the moment you start making parameter decisions based on how the strategy performs on it.
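One detail of the split above is worth checking: calling ma_crossover on the test window alone restarts the moving average warmup, so the strategy is forced flat for the first long_window - 1 bars of the test year. A possible variant computes the signal on the full history and slices afterwards. This does not leak future information, because each rolling mean only uses prices up to its own date. The sketch below uses synthetic, steadily rising closes (an assumption, chosen so the correct signal is always 1) to show the difference.

```python
import pandas as pd


def ma_crossover(prices: pd.Series, short_window: int = 10, long_window: int = 30) -> pd.Series:
    short_ma = prices.rolling(short_window).mean()
    long_ma = prices.rolling(long_window).mean()
    return (short_ma > long_ma).astype(int)


# Synthetic, steadily rising closes over the same calendar span as the article
index = pd.bdate_range("2020-01-01", "2023-12-31")
prices = pd.Series([100.0 + 0.1 * i for i in range(len(index))], index=index)

# Variant A (as in the text): run the strategy on the test window alone
test = prices.loc["2023":]
signal_split = ma_crossover(test)

# Variant B: compute on the full history, then slice out the test window
signal_full = ma_crossover(prices).loc["2023":]

print(f"Test-window bars:        {len(test)}")
print(f"Long days, split signal: {signal_split.sum()}")
print(f"Long days, full signal:  {signal_full.sum()}")
```

The split variant loses 29 bars of a roughly 250-bar test year to warmup. If the full-history variant is used, the .shift(1) should also be applied before slicing, so that the first test day inherits the signal from the last training day.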
What the Numbers Cannot Tell You
There are limits to what any backtest can measure, regardless of how carefully it is constructed. Acknowledging these limits before deploying is not optional.
Market impact: a backtest assumes your orders execute at the modeled price without affecting the price. For large positions in illiquid names, the act of buying moves the price against you. This cost does not appear in a backtest using daily closing prices. At small position sizes in highly liquid instruments like SPY, market impact is negligible. At larger sizes or in smaller-cap names, it can be the difference between a profitable strategy and a losing one.
Fill quality under stress: in volatile conditions, bid/ask spreads widen and market orders fill at prices further from the last trade than the model assumes. Backtests using daily OHLCV data do not capture intraday spread behavior or the degraded fill quality that accompanies fast-moving markets — precisely the conditions when many strategies are most active.
Regime change: a strategy that worked in a low-volatility, trending market performs differently in a high-volatility, mean-reverting market. Historical data covers both regimes but does not predict when the next regime shift will occur or how the strategy will behave during the transition period. Evaluating a strategy across multiple distinct historical regimes is the best available tool for understanding regime sensitivity.
Behavioral execution: a backtest does not model the psychological difficulty of holding a strategy through a 20% drawdown when it has been losing for six consecutive weeks. The live experience of running a strategy differs from the analytical experience of reviewing its backtest in a spreadsheet. Strategies that look fine on paper are often abandoned at exactly the wrong moment in practice. Realistic drawdown projections and pre-defined rules for when to stop a strategy are part of the system design, not afterthoughts.
These are not gaps in the methodology. They are fundamental limits of evaluating a forward-looking system on backward-looking data. Acknowledging them is what separates systematic traders who sustain long-term performance from those who do not.
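A daily-bar backtest cannot measure spreads or market impact, but it can show how sensitive the result is to an assumed cost. The sketch below charges a flat cost in basis points on every change in position. The function name, the cost levels, and the flat-cost model itself are assumptions for illustration; real costs vary with volatility, order size, and liquidity.

```python
import pandas as pd


def net_of_costs(daily_returns: pd.Series, position: pd.Series, cost_bps: float) -> pd.Series:
    """Strategy returns after charging cost_bps (basis points) per unit of turnover."""
    position = position.fillna(0)
    # Turnover is the absolute change in position; the first bar counts as an entry from flat
    turnover = position.diff().abs().fillna(position.abs())
    gross = daily_returns.fillna(0) * position
    return gross - turnover * cost_bps / 10_000


# Tiny synthetic example: enter, exit, and re-enter (three position changes)
position = pd.Series([0, 1, 1, 0, 1, 1])
daily_returns = pd.Series([0.0, 0.01, 0.01, -0.02, 0.005, 0.005])

for cost_bps in (0, 5, 20):  # hypothetical cost levels
    net = net_of_costs(daily_returns, position, cost_bps)
    total = (1 + net).prod() - 1
    print(f"{cost_bps:>2} bps per position change: total return {total * 100:.2f}%")
```

If a strategy's edge disappears at a cost level that is plausible for its instrument and trade frequency, the backtest result should be read as unproven rather than as marginally profitable.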
The Oyamori Approach
Oyamori's edge catalog includes out-of-sample validation for each entry — not just in-sample backtest results. The platform surfaces the key metrics (Sharpe, drawdown, win rate, regime sensitivity) for each edge before deployment, along with the assumptions built into each backtest and the historical conditions under which each edge has degraded.
The trader's job is to evaluate whether a strategy's risk profile matches their capital constraints and tolerance for drawdown — not to build the testing infrastructure from scratch on every evaluation. The catalog does the hypothesis testing; the trader brings the risk judgment.