Runaway memory usage by Zipline backtest during data pulls

Hi Brian,

Since re-downloading the new zipline minute bundle, I've been experiencing out-of-memory failures during my longer running backtests. After profiling, I isolated the growth in memory usage to data.history() calls.

Whenever a history call is made for a new batch of stocks, the memory used by the backtest process grows, and that memory is not released for the duration of the backtest. You can replicate with the following backtest code:

import pandas as pd
import psutil
from zipline.api import sid

def initialize(context):

    context.dayCount = 0

    return


def before_trading_start(context, data):

    context.dayCount += 1
    context.minCount = 0

    # Get 50 random asset sids from securities master

    securities = pd.read_csv('/codeload/ib_master.csv')
    context.sids = securities[(securities['Delisted']==0) & (securities['SecType']=='STK')]['Sid'].sample(n=50)

    context.assets = []

    for s in context.sids:

        try:
            context.assets.append(sid(s))
        except Exception as e:
            print('Exception at sid {}: {}'.format(s, e))

    return


def handle_data(context, data):

    context.minCount += 1

    if context.minCount == 1:
        print('MEM USAGE BEFORE PULL (DAY {}):'.format(context.dayCount), psutil.Process().memory_percent())

    daily_panel = data.history(context.assets, ["price", "open", "high", "low", "close", "volume"], 150, "1d")

    if context.minCount == 1:
        print('MEM USAGE AFTER PULL (DAY {}):'.format(context.dayCount), psutil.Process().memory_percent())
        print('-------')

The output looks something like this:

    quantrocket_zipline_1|MEM USAGE BEFORE PULL (DAY 1): 1.7565882120200094
    quantrocket_zipline_1|MEM USAGE AFTER PULL (DAY 1): 2.53061302150199
    quantrocket_zipline_1|-------
    quantrocket_zipline_1|MEM USAGE BEFORE PULL (DAY 2): 2.5621292061117003
    quantrocket_zipline_1|MEM USAGE AFTER PULL (DAY 2): 3.2926607050990113
    quantrocket_zipline_1|-------
    quantrocket_zipline_1|MEM USAGE BEFORE PULL (DAY 3): 3.321810736529819
    quantrocket_zipline_1|MEM USAGE AFTER PULL (DAY 3): 3.9942861059729267
    quantrocket_zipline_1|-------
    quantrocket_zipline_1|MEM USAGE BEFORE PULL (DAY 4): 4.022216470816672
    quantrocket_zipline_1|MEM USAGE AFTER PULL (DAY 4): 4.680665674508553
    quantrocket_zipline_1|-------
    quantrocket_zipline_1|MEM USAGE BEFORE PULL (DAY 5): 4.725159111604615
    quantrocket_zipline_1|MEM USAGE AFTER PULL (DAY 5): 5.452958557436905
    quantrocket_zipline_1|-------
    quantrocket_zipline_1|MEM USAGE BEFORE PULL (DAY 6): 5.487938595153874
    quantrocket_zipline_1|MEM USAGE AFTER PULL (DAY 6): 6.2009800752827005
    quantrocket_zipline_1|-------
    quantrocket_zipline_1|MEM USAGE BEFORE PULL (DAY 7): 6.2403509127130965
    quantrocket_zipline_1|MEM USAGE AFTER PULL (DAY 7): 6.932218980890507

This continues to grow until an error like this is thrown:

quantrocket_flightlog_1|2021-04-22 23:53:33 quantrocket.zipline: ERROR the system killed the worker handling the request, likely an Out Of Memory error; please add more memory or try a smaller request

It's worth mentioning that even though the code above pulls data every minute of the trading day, mem usage grows almost entirely in the first minute, when the new assets are pulled for the first time.
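The split between growth during the pull and growth over the rest of the day can be read straight off the output above. A minimal sketch, assuming only the seven days of readings shown (container prefix stripped); the `parse` helper and variable names are illustrative:

```python
import re

# Readings copied from the backtest output above (container prefix removed).
LOG = """\
MEM USAGE BEFORE PULL (DAY 1): 1.7565882120200094
MEM USAGE AFTER PULL (DAY 1): 2.53061302150199
MEM USAGE BEFORE PULL (DAY 2): 2.5621292061117003
MEM USAGE AFTER PULL (DAY 2): 3.2926607050990113
MEM USAGE BEFORE PULL (DAY 3): 3.321810736529819
MEM USAGE AFTER PULL (DAY 3): 3.9942861059729267
MEM USAGE BEFORE PULL (DAY 4): 4.022216470816672
MEM USAGE AFTER PULL (DAY 4): 4.680665674508553
MEM USAGE BEFORE PULL (DAY 5): 4.725159111604615
MEM USAGE AFTER PULL (DAY 5): 5.452958557436905
MEM USAGE BEFORE PULL (DAY 6): 5.487938595153874
MEM USAGE AFTER PULL (DAY 6): 6.2009800752827005
MEM USAGE BEFORE PULL (DAY 7): 6.2403509127130965
MEM USAGE AFTER PULL (DAY 7): 6.932218980890507
"""

PATTERN = re.compile(r"MEM USAGE (BEFORE|AFTER) PULL \(DAY (\d+)\): ([\d.]+)")


def parse(log):
    """Return {day: {"BEFORE": pct, "AFTER": pct}} from the printed lines."""
    readings = {}
    for phase, day, pct in PATTERN.findall(log):
        readings.setdefault(int(day), {})[phase] = float(pct)
    return readings


readings = parse(LOG)
days = sorted(readings)

# Growth during the first-minute pull of each day (percentage points).
pull_growth = [readings[d]["AFTER"] - readings[d]["BEFORE"] for d in days]

# Growth between one day's first pull and the next day's first pull.
rest_growth = [
    readings[nxt]["BEFORE"] - readings[prev]["AFTER"]
    for prev, nxt in zip(days, days[1:])
]

for d, g in zip(days, pull_growth):
    print("day {}: +{:.3f} points during first pull".format(d, g))
print("mean growth during pull:  {:.3f}".format(sum(pull_growth) / len(pull_growth)))
print("mean growth rest of day:  {:.3f}".format(sum(rest_growth) / len(rest_growth)))
```

On these numbers, each first-minute pull adds several times more memory than the remainder of the trading day combined.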

Obviously this puts an untenable cap on the duration of any backtest and destabilizes the system. Any thoughts on why this is happening and how to work around it?

Thanks,
Paul.

Paul, thanks for digging into this. In general I've been finding the usage of usstock-1min to be having this issue. I think my ticket's last response and @peterfabakker's ticket are all stemming from this base issue. Just raising my hand in support, although I'm not much help.

After reading your post, I can only guess that the data pulling method (get_prices in your case, or data.history in mine) has to load something substantial into memory before retrieving even a single minute-level data point for a new asset. So it has to pre-load this 'something' multiplied by 11k assets, even though you only want one price for each.

That 'something' is not being released from memory after the retrieval. The benefit of this is that future data retrieval on the same asset won't require the pre-load and will be faster (esp. when pulling every minute in a backtest). The (more substantial) problem is that this builds up in memory, and with enough assets, will produce a memory crash.

Just a guess. Does this sound like a likely explanation, Brian?

Btw I have 16GB allocated to Docker, so it's not a small memory problem.

When you request data for a particular sid through data.history, Zipline stores the (compressed) data in an in-memory cache, so it will be quicker to access if you request the same sid again later in the backtest.

In your example code, memory usage is expected to grow for a time as you request more and more randomly chosen sids and they are added to the cache. Eventually the cache gets full, older items are evicted, and the memory usage will plateau. This is what I see on my machine, with a plateau in the 5 GB range.
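The grow-then-plateau behavior can be illustrated with a toy cache. This is a minimal sketch, not Zipline's actual cache code: the `LRUCache` class, the `simulate` helper, and every size used here are made up for illustration, and least-recently-used eviction is an assumption.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Toy bounded cache: evicts the least recently used key when full."""

    def __init__(self, max_items):
        self.max_items = max_items
        self._items = OrderedDict()

    def get(self, key, loader):
        if key in self._items:
            # Cache hit: mark as most recently used, skip the expensive load.
            self._items.move_to_end(key)
            return self._items[key]
        # Cache miss: load the data and remember it.
        value = loader(key)
        self._items[key] = value
        if len(self._items) > self.max_items:
            # Over capacity: drop the oldest entry.
            self._items.popitem(last=False)
        return value

    def __len__(self):
        return len(self._items)


def simulate(max_items, days=200, sids_per_day=50, universe=5000, seed=0):
    """Request a fresh random batch of sids each day; record the cache size."""
    rng = random.Random(seed)
    cache = LRUCache(max_items)
    sizes = []
    for _ in range(days):
        for s in rng.sample(range(universe), sids_per_day):
            cache.get(s, loader=lambda k: k)  # stand-in for loading bar data
        sizes.append(len(cache))
    return sizes


sizes = simulate(max_items=1000)
print("after day 1:  ", sizes[0])
print("after day 10: ", sizes[9])
print("after day 200:", sizes[-1])
```

The number of cached entries rises day by day while new sids keep arriving, then stays at `max_items` once eviction begins. A smaller `max_items` lowers the plateau at the cost of more cache misses.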

The maximum number of sids to cache is defined in a block of code that we have never customized. I imagine Quantopian tuned the numbers based on the sizes of their built-in universes like the Q1500.

You could fork QuantRocket’s Zipline and modify the cache sizes, then install it in the zipline container with

pip install git+https://github.com/<your-username>/zipline@<your-branch>

to see if that helps. At this point I don’t have enough evidence to convince me that the numbers are too high and should be changed, though I’m open to further evidence suggesting otherwise.

In this test of a real algo that used to perform well, you can see that each successive step of the backtest takes longer than the last. Time between progress updates, in minutes: 1, 2, 5, 7, 8, 13, 14, 18, 21, 22.
This makes QuantRocket just unusable. I bought this 64GB Mac with an i9 processor and I have dedicated 32GB to Docker....

@brian you are allowed to see the exact code, but too many people have problems with this to just ignore it, IMHO

2021-05-02 09:50:15 quantrocket.zipline: INFO [QualUptrend.test]  █---------   6%  2009-09-30                    3%            0.29            -16%            $29845
2021-05-02 09:51:00 quantrocket.zipline: INFO [QualUptrend.test]  █---------   7%  2009-11-02                    2%            0.25            -16%            $24922
2021-05-02 09:53:30 quantrocket.zipline: INFO [QualUptrend.test]  █---------   7%  2009-11-30                    5%            0.36            -16%            $48180
2021-05-02 09:58:36 quantrocket.zipline: INFO [QualUptrend.test]  █---------   8%  2009-12-31                    8%            0.49            -16%            $78432
2021-05-02 10:06:18 quantrocket.zipline: INFO [QualUptrend.test]  █---------   9%  2010-02-01                    8%            0.46            -16%            $76095
2021-05-02 10:14:57 quantrocket.zipline: INFO [QualUptrend.test]  █---------   9%  2010-03-01                    8%            0.48            -16%            $84971
2021-05-02 10:27:44 quantrocket.zipline: INFO [QualUptrend.test]  █---------  10%  2010-03-31                   11%            0.58            -16%           $113476
2021-05-02 10:41:55 quantrocket.zipline: INFO [QualUptrend.test]  █---------  11%  2010-04-30                   13%            0.61            -16%           $127136
2021-05-02 10:59:12 quantrocket.zipline: INFO [QualUptrend.test]  █---------  12%  2010-06-01                    7%            0.36            -16%            $69105
2021-05-02 11:20:29 quantrocket.zipline: INFO [QualUptrend.test]  █---------  12%  2010-06-30                    7%            0.34            -16%            $66978
2021-05-02 11:42:58 quantrocket.zipline: INFO [QualUptrend.test]  █---------  13%  2010-08-02                   13%            0.56            -16%           $131648
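The slowdown can be measured from the log timestamps themselves. A minimal sketch using only the timestamps shown above; the variable names are illustrative:

```python
from datetime import datetime

# Timestamps copied from the flightlog lines above.
STAMPS = [
    "2021-05-02 09:50:15",
    "2021-05-02 09:51:00",
    "2021-05-02 09:53:30",
    "2021-05-02 09:58:36",
    "2021-05-02 10:06:18",
    "2021-05-02 10:14:57",
    "2021-05-02 10:27:44",
    "2021-05-02 10:41:55",
    "2021-05-02 10:59:12",
    "2021-05-02 11:20:29",
    "2021-05-02 11:42:58",
]

times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in STAMPS]

# Wall-clock seconds between consecutive progress lines
# (each line covers roughly one simulated month).
step_seconds = [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]

for start, secs in zip(STAMPS, step_seconds):
    print("{} -> next update after {:.1f} min".format(start, secs / 60))
```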

I have moved all logic to the pipeline, the only thing that simply needs to be in handle_data is the stoploss check that uses data.current.

Running example code and digging into the cache sizes is definitely not ignoring it. :slight_smile:

Although I haven’t been able to reproduce runaway memory growth, I did end up getting better performance with a smaller cache size and have that in a master branch which you could try installing:

pip install git+https://github.com/quantrocket-llc/[email protected]

The cache sizes have always been the same, though, so I’m not sure how plausible it is that they are suddenly causing problems.

So I found something on my system that seemed to be causing the backtest issues. When I was not actively using quantrocket or docker at all, I noticed that my CPU usage was strangely high. On docker stats, the jupyter container was using 100-200% CPU (i.e. 1-2 cores), even though I wasn't running any kind of code or analysis.

I shut down docker completely and even restarted the system, but the CPU usage bizarrely continued in a process called 'vmmem' (which is the Windows virtual memory manager for any virtual machine, e.g. docker). I was not allowed to shut this process down from task manager.

After googling for a solution, I opened the Hyper-V Manager (another Windows platform util that lets you actively manage virtual machines). There I was able to kill the vmmem process. I restarted again, and started a backtest. It seemed to be running faster, and memory usage in docker stats wasn't growing as fast. Left it on overnight, and this time, it completed without error like it used to (pulled data for 1,523 unique assets, using 16% of memory, or roughly 2.6GB).

While I haven't seen this on my Mac installation, I would suggest a complete restart of your system and checking if there are any uninitiated processes related to docker or virtual machines that seem to be using resources. I couldn't believe that this process persisted even after a full system restart.

I'll be running a longer, full backtest that usually takes a couple days to complete this week. Will report back if errors appear again.

Memory crashes returned, so I tested the git branch you posted. Got significantly better memory footprint at an acceptable cost to time. More specifically:

Tested with ~6,400 unique assets, each pulled for 150 days of daily data, over 62 backtest days:

  • old cache settings: mem usage plateaus ~35% (5.6 GB), run time 19.13 minutes
  • new settings: mem usage plateaus ~14.6% (2.3 GB), run time 21.25 minutes
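The trade-off in the two results above works out as follows. A small arithmetic sketch using only the figures reported in the list; the dictionary layout is illustrative:

```python
# Figures reported above for the old and new cache settings.
old = {"mem_gb": 5.6, "minutes": 19.13}
new = {"mem_gb": 2.3, "minutes": 21.25}

# Relative reduction in plateau memory, and relative increase in run time.
mem_saving_pct = (old["mem_gb"] - new["mem_gb"]) / old["mem_gb"] * 100
time_cost_pct = (new["minutes"] - old["minutes"]) / old["minutes"] * 100

print("memory saved: {:.1f}%".format(mem_saving_pct))
print("extra run time: {:.1f}%".format(time_cost_pct))
```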

Do you plan to incorporate the new settings into the next release?

If so, the optimal setting might actually vary per system depending on memory constraints (the roughly 11% time cost seen above is still worth fine-tuning). Is there a way to make it easy for the user to adjust the settings for optimization?
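One common pattern for this kind of tuning knob is an environment variable with a safe default. The sketch below is purely hypothetical: `EXAMPLE_CACHE_MAX_SIDS`, `DEFAULT_MAX_SIDS`, and `cache_size_from_env` are invented names, not QuantRocket or Zipline settings.

```python
import os

# Illustrative default; not the value used by Zipline or QuantRocket.
DEFAULT_MAX_SIDS = 1000


def cache_size_from_env(var="EXAMPLE_CACHE_MAX_SIDS", default=DEFAULT_MAX_SIDS):
    """Read a cache size from the environment, falling back to the default.

    The variable name is hypothetical. Missing, non-numeric, or
    non-positive values all fall back to `default`.
    """
    raw = os.environ.get(var)
    if raw is None:
        return default
    try:
        size = int(raw)
    except ValueError:
        return default
    return size if size > 0 else default


print("cache size:", cache_size_from_env())
```

A user with tight memory could then lower the value for their own deployment without forking the code, while everyone else keeps the default.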


The adjusted cache sizes are included in QuantRocket 2.6.0: