Backtest performance difference between 2.9.2 and 2.10.0

Hi Brian,

I upgraded our environment to the latest 2.10 release. However, after running a long-term backtest and checking the return profile, I found the following differences between the 2.9.2 annual returns and the new version:

12/31/2012 -1.61%
12/31/2013 -4.56%
12/31/2014 -2.38%
12/31/2015 -1.50%
12/31/2016 -5.26%
12/31/2017 -1.12%
12/31/2018 -1.23%
12/31/2019 -1.73%
12/31/2020 -2.05%
12/31/2021 -2.40%
12/31/2022 -14.50%
12/31/2023 -1.57%
12/31/2024 -0.37%

Do you have any insight into what could be causing this negative impact? No warnings or incompatibility issues popped up when running the algo in the new version. Are there any specific changes to the pricing or trade execution models that could explain it? This is rather confusing.

Thanks for your help.

Kamran

There's nothing I'm aware of, and the Zipline test suite didn't reveal anything in the 2.9 -> 2.10 update. Can you dig into the results CSVs to try to pinpoint where the difference is coming from? For example, different leverage, different positions, etc.?
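Something along these lines might help (the file names are hypothetical, and it assumes both runs exported a results CSV with a date column and numeric fields); it flags the first sessions where the two runs diverge:

import pandas as pd

# Hypothetical file names; point these at the results CSVs from each run
old = pd.read_csv("results_2.9.2.csv", parse_dates=["date"], index_col="date")
new = pd.read_csv("results_2.10.0.csv", parse_dates=["date"], index_col="date")

# Compare only the numeric columns the two runs share
common = old.select_dtypes("number").columns.intersection(
    new.select_dtypes("number").columns)
diffs = (new[common] - old[common]).abs()

# First sessions where any field diverges beyond rounding noise
print(diffs[(diffs > 1e-9).any(axis=1)].head())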

It's certainly possible that changes in the underlying libraries Zipline uses could affect results, but I would usually expect any such effects to be smaller and less one-directional than what you show. One underlying change I'm aware of is that scipy.stats.rankdata, which is used by Pipeline ranking functions, changed how it handles NaNs. In theory this could affect rankings, although it would most likely affect the bottom of the rankings and thus seems unlikely to affect the stocks you would select from them. I happen to be aware of that change because it required adding a new parameter to Zipline's calls to scipy.stats.rankdata, but there could be other underlying package changes that I'm not aware of because they didn't bubble up into the test suite.
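As a standalone sketch of that behavior change (pure SciPy/NumPy, not Zipline's actual call site):

import numpy as np
from scipy.stats import rankdata

values = np.array([0.5, np.nan, 0.1, 0.9])

# SciPy < 1.10 sorted NaNs to the end, so they received the highest
# ranks: rankdata(values) -> [2., 4., 1., 3.]

# SciPy >= 1.10 defaults to nan_policy='propagate', where ranks
# relative to NaN are undefined, so the whole output becomes NaN:
print(rankdata(values))                     # [nan nan nan nan]

# nan_policy='omit' ranks the non-NaN values and leaves NaNs as NaN,
# which is closer to what a ranking factor typically wants:
print(rankdata(values, nan_policy="omit"))  # [ 2. nan  1.  3.]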

With version updates, it can be hard to avoid some amount of wiggle in the backtests, but I would dig into the results CSVs a bit more because those differences seem rather big.

Hi Brian,

I went through my algo, disabling different parts of the code to see if the performance delta would disappear, but the difference persisted and I could not pinpoint the cause. So I decided to test the two QR versions on the simple algo below: a buy-and-hold strategy with any cash balances reinvested.

As you can see, there is still a significant difference in performance between the two QR versions. I have looked through the backtest files and cannot find anything; I can send them to you if you like. The pricing database seems to be fine too!

I'd like to upgrade the environment to 2.10, but this is very confusing.

Thanks for your help.



import zipline.api as algo


def initialize(context):
    """
    Called once at the start of a backtest, and once per day at
    the start of live trading.
    """
    # Set SPY as benchmark
    algo.set_benchmark(algo.symbol("SPY"))


def handle_data(context, data):
    # Buy and hold VT, re-targeting 100% of portfolio value each
    # session so that any accumulated cash is reinvested
    algo.order_target_percent(algo.symbol("VT"), 1.0)


def before_trading_start(context, data):
    """
    Called every day before market open.
    """
    pass

Metric                 QR 2.10.0     QR 2.9.2
Start date             2023-01-04    2023-01-04
End date               2024-04-25    2024-04-25
Total months           15            15
Annual return          16.65%        18.98%
Cumulative returns     22.28%        25.47%
Annual volatility      12.30%        12.23%
Sharpe ratio           1.31          1.48
Calmar ratio           1.44          1.70
Stability              0.74          0.79
Max drawdown           -11.55%       -11.18%
Omega ratio            1.23          1.27
Sortino ratio          1.98          2.26
Skew                   -0.02         -0.01
Kurtosis               -0.16         -0.15
Tail ratio             0.97          0.99
Daily value at risk    -1.49%        -1.47%
Gross leverage         1.00          1.00
Daily turnover         0.61%         0.62%
Alpha                  -0.05         -0.04
Beta                   0.92          0.92

BTW, as a test, I created a Zipline container using the 2.9 version while keeping everything else at 2.10, and the discrepancy disappeared. Whatever is causing this issue is contained within the Zipline 2.10 update!

Thank you for narrowing this down. Investigating...

This was a regression introduced in 2.10.0. Dividend payouts were not getting credited to the portfolio, which is why you saw performance being worse across the board.

The root cause was the shift from trading_calendars to exchange_calendars, which, as mentioned in the release notes, involved switching from tz-aware session labels (2024-04-29T00:00:00-00:00) to tz-naive session labels (2024-04-29). Unfortunately, the Zipline code that looks up dividends to be credited to the portfolio was still using a tz-aware session label for the lookup, so it never found the dividends, which were stored under tz-naive Timestamps. The Zipline test suite has almost 3,000 tests, but remarkably, none of them caught this regression.
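To illustrate the mismatch with a minimal pandas sketch (the series and amounts here are made up, not the actual Zipline code):

import pandas as pd

# Dividends stored under tz-naive session labels, as with exchange_calendars
dividends = pd.Series(
    [0.25],
    index=pd.DatetimeIndex([pd.Timestamp("2024-04-29")]),
)

# A tz-aware session label, as with the old trading_calendars,
# silently fails to match the tz-naive index
print(pd.Timestamp("2024-04-29", tz="UTC") in dividends.index)  # False
print(pd.Timestamp("2024-04-29") in dividends.index)            # True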

The issue has been fixed and a test has been added to prevent future regressions of this sort. Please update to 2.10.1 to get the fix.

Brian,

Thanks very much for the quick fix of this issue. I just ran my algos and the versions now match up. That was a sneaky one!

All the new features in the 2.10 update are very useful, in addition to Brain's latest sentiment dataset, which I look forward to testing.

Great work!
