Docker CPU stats use vs Amazon CPU use

peterfabakker · April 11, 2021, 11:43am

I have created an installation on EC2 with the commands provided. Because my backtests are slow, I decided I needed more CPU and memory, so I changed from a t2.large to a t2.xlarge (and also made the IP static). Now I am considering going to a t3.xlarge.

When I look at the CPU usage in EC2 during a backtest it is 30%

When I run docker stats I see zipline is running 107%.....

When I inspect the docker-machine, however, I see that the IP is wrong and the instance type still says t2.large.

So my question is: how do I make sure the zipline container uses more CPU? If I give it a bigger instance, should it start using more after it has restarted, or doesn't that matter?

Or does that not work, and do I need to set up the whole thing again?

Any pointer is appreciated.

$ docker-machine  inspect quantrocket
{
    "ConfigVersion": 3,
    "Driver": {
        "IPAddress": "XXXXXXXXXX",
        "MachineName": "quantrocket",
        "SSHUser": "ubuntu",
        "SSHPort": 22,
        "SSHKeyPath": "/patht/id_rsa",
        "StorePath": "/pathr/.docker/machine",
        "SwarmMaster": false,
        "SwarmHost": "tcp://0.0.0.0:3376",
        "SwarmDiscovery": "",
        "Id": "312d2cc0d883793608104c6b309411da",
        "AccessKey": "XXXXXXXXXXX",
        "SecretKey": "XXxxxxXXXXX",
        "SessionToken": "",
        "Region": "us-east-1",
        "AMI": "ami-927185ef",
        "SSHKeyID": 0,
        "ExistingKey": false,
        "KeyName": "quantrocket",
        "InstanceId": "i-0aXXXXXXX08088",
        "InstanceType": "t2.large",
        "PrivateIPAddress": "1xxxxxxx",
        "SecurityGroupId": "",
        "SecurityGroupIds": [
            "sg-0XXXXXXXXe124"
        ],
        "SecurityGroupName": "",
        "SecurityGroupNames": [
            "docker-machine"
        ],
        "SecurityGroupReadOnly": false,
        "OpenPorts": [
            "80",
            "443"
        ],
        "Tags": "",
        "ReservationId": "",
        "DeviceName": "/dev/sda1",
        "RootSize": 200,
        "VolumeType": "gp2",
        "IamInstanceProfile": "",
        "VpcId": "vpc-XXXXXfe",
        "SubnetId": "subnXXXXX3",
        "Zone": "a",
        "RequestSpotInstance": false,
        "SpotPrice": "0.50",
        "BlockDurationMinutes": 0,
        "PrivateIPOnly": false,
        "UsePrivateIP": false,
        "UseEbsOptimizedInstance": false,
        "Monitoring": false,
        "SSHPrivateKeyPath": "",
        "RetryCount": 5,
        "Endpoint": "",
        "DisableSSL": false,
        "UserDataFile": ""
    },
    "DriverName": "amazonec2",
    "HostOptions": {
        "Driver": "",
        "Memory": 0,
        "Disk": 0,
        "EngineOptions": {
            "ArbitraryFlags": [],
            "Dns": null,
            "GraphDir": "",
            "Env": [],
            "Ipv6": false,
            "InsecureRegistry": [],
            "Labels": [],
            "LogLevel": "",
            "StorageDriver": "",
            "SelinuxEnabled": false,
            "TlsVerify": true,
            "RegistryMirror": [],
            "InstallURL": "https://get.docker.com"
        },
        "SwarmOptions": {
            "IsSwarm": false,
            "Address": "",
            "Discovery": "",
            "Agent": false,
            "Master": false,
            "Host": "tcp://0.0.0.0:3376",
            "Image": "swarm:latest",
            "Strategy": "spread",
            "Heartbeat": 0,
            "Overcommit": 0,
            "ArbitraryFlags": [],
            "ArbitraryJoinFlags": [],
            "Env": null,
            "IsExperimental": false
        },
        "AuthOptions": {
            "CertDir": "/path/.docker/machine/certs",
            "CaCertPath": "/pathr/.docker/machine/certs/ca.pem",
            "CaPrivateKeyPath": "/path/.docker/machine/certs/ca-key.pem",
            "CaCertRemotePath": "",
            "ServerCertPath": "/path/.docker/machine/machines/quantrocket/server.pem",
            "ServerKeyPath": "path/.docker/machine/machines/quantrocket/server-key.pem",
            "ClientKeyPath": "/path/.docker/machine/certs/key.pem",
            "ServerCertRemotePath": "",
            "ServerKeyRemotePath": "",
            "ClientCertPath": "/path/.docker/machine/certs/cert.pem",
            "ServerCertSANs": [],
            "StorePath": "/path/.docker/machine/machines/quantrocket"
        }
    },
    "Name": "quantrocket"
}

my stats:

CONTAINER ID   NAME                            CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
d88b64fb5a97   quantrocket_logspout_1          0.17%     8.938MiB / 15.67GiB   0.06%     414kB / 3.1MB     0B / 0B           11
2a3a91f00176   quantrocket_houston_1           0.50%     3.27MiB / 15.67GiB    0.02%     64.1MB / 65.1MB   4.1kB / 4.1kB     3
df36996552d0   quantrocket_blotter_1           0.00%     125.7MiB / 15.67GiB   0.78%     3.53MB / 164kB    0B / 7.96MB       15
b163d3cb6440   quantrocket_realtime_1          0.02%     97.88MiB / 15.67GiB   0.61%     14.1kB / 9.89kB   0B / 0B           29
8f38d4a0c90a   quantrocket_history_1           0.00%     75.59MiB / 15.67GiB   0.47%     1.73kB / 0B       0B / 0B           17
22a04fc1d3f2   quantrocket_fundamental_1       0.00%     137.6MiB / 15.67GiB   0.86%     1.96MB / 11.4MB   4.1kB / 65.5kB    13
139d54d48b16   quantrocket_master_1            0.00%     131.5MiB / 15.67GiB   0.82%     1.75MB / 5.18MB   0B / 125MB        13
ae9989b068a8   quantrocket_satellite_1         0.00%     28.06MiB / 15.67GiB   0.17%     9.84kB / 5.83kB   0B / 65.5kB       12
3cfac16831af   quantrocket_account_1           0.00%     100.2MiB / 15.67GiB   0.62%     119kB / 111kB     0B / 0B           9
cfbfcc70b4b8   quantrocket_jupyter_1           0.57%     327MiB / 15.67GiB     2.04%     7.27MB / 36.5MB   1.58MB / 20.3MB   36
ac635f62dc0e   quantrocket_flightlog_1         0.01%     89.59MiB / 15.67GiB   0.56%     1.6MB / 23.5kB    0B / 0B           25
2d96b151f937   quantrocket_countdown_1         0.02%     60.57MiB / 15.67GiB   0.38%     1.96kB / 169B     27MB / 8.19kB     9
8b65596d02e5   quantrocket_theia_1             0.02%     126.2MiB / 750MiB     16.82%    2.08kB / 0B       0B / 0B           30
52add1ceb52a   quantrocket_db_1                0.00%     57.58MiB / 15.67GiB   0.36%     2.08kB / 0B       0B / 0B           6
642758f1da86   quantrocket_zipline_1           101.21%   679.1MiB / 15.67GiB   4.23%     12MB / 2.14MB     0B / 2.2MB        30
adf039b3dbeb   quantrocket_moonshot_1          0.00%     106.3MiB / 15.67GiB   0.66%     1.81kB / 0B       0B / 0B           11
e25c320cfcff   quantrocket_codeload_1          3.09%     148.1MiB / 15.67GiB   0.92%     2.71MB / 13.9kB   4.1kB / 10.8MB    87
74b6ab58ea19   quantrocket_license-service_1   0.00%     94.78MiB / 15.67GiB   0.59%     1.3MB / 575kB     0B / 0B           5
230e9c48e199   quantrocket_ibgrouter_1         0.00%     31.16MiB / 15.67GiB   0.19%     388kB / 306kB     0B / 0B           5
4474ea1495ed   quantrocket_ibg1_1              0.01%     79MiB / 15.67GiB      0.49%     269kB / 173kB     0B / 61.4kB       24
476d182adc2e   quantrocket_postgres_1          0.02%     79.98MiB / 15.67GiB   0.50%     11.6kB / 12.2kB   0B / 53.1MB       13
27f629e55671   cloud_moonshot_1                0.01%     106.5MiB / 15.67GiB   0.66%     828B / 0B         1.27MB / 0B       11

kevinkurek · April 11, 2021, 7:41pm

This is a great question. I've noticed similar statistics and my zipline backtest hangs or doesn't finish even though I upgraded to a t3.xlarge recently.
Here I'm using the t3.xlarge at ~33% utilization. About 1.5 hours gets me from 01/2020 to 04/2020, but after 04/30 the backtest essentially runs indefinitely and never finishes. I have even tried running by quarters, but during May it seems to spin.

As an FYI, the t3 is the more cost-efficient version of the t2 series, so any t3 is at least as cost-efficient as any t2, assuming you're using Linux in some form (I'm on Ubuntu 20.04).

peterfabakker · April 12, 2021, 1:49am

Thanks. I have at least moved to t3 and recreated the containers, and it's the same: 100%+ in docker stats and 26% in Amazon. Backtests that used to run in an hour now take 10+ hours. Something must have happened between 2.4 and 2.5 that causes this.

... now we need @Brian to answer

kevinkurek · April 12, 2021, 7:54pm

Brian, I'm actually seeing something similar for simply ingesting the usstock-1min on my EC2. I thought running the ingestion, after already having usstock-1min, simply updated the current DB. It looks like it's trying to recollect the entire DB from scratch.
ingest_bundle("usstock-1min")

bjsun · April 13, 2021, 6:13am

Not sure if this is helpful. I'm no expert, but I've spent a lot of time trying to make my zipline backtests run faster. From what I can tell, the core backtest algorithm runs on a single thread, like most Python applications. This means that expanding the number of CPUs/cores available (from t2.large to t2.xlarge, for example) will not make the backtest faster, because it will still only be running on a single core.

In docker stats, the CPU usage represents how much of a single core the container is using. If you have n CPUs allocated to docker, the combined CPU usage of all containers can go up to (n x 100%).

I'm not as familiar with EC2 tools, but if the usage is showing 30%, perhaps it means that you're fully using one of the four vCPUs in an xlarge, while the others remain mostly unused and available for other threads.
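
To make that arithmetic concrete, here is a minimal sketch of how the two numbers could line up. It assumes the EC2 figure is an average across all vCPUs and that an xlarge has 4 vCPUs; the helper name host_cpu_percent is made up for illustration.

```python
def host_cpu_percent(container_percents, n_vcpus):
    """Convert docker-stats style percentages (100% = one full core)
    into a whole-machine percentage averaged over all vCPUs."""
    return sum(container_percents) / n_vcpus

# zipline's reading from the docker stats output posted earlier in the thread
zipline_only = host_cpu_percent([101.21], n_vcpus=4)
print(f"{zipline_only:.1f}% of the instance")
```

Under those assumptions, one saturated core comes out at roughly 25% of the instance, which is in the same range as the 26-30% figures reported from the EC2 side.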

I don't know of any way to make the zipline backtest package run on more than one core - I think this would take some serious rewiring under the hood and require multiple instances of the zipline Algorithm object to be created on separate processes and synced. Python has a multiprocessing library, but the zipline objects do not pickle well and so the parallel processes end up failing.
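
As a rough illustration of the pickling problem (not zipline's actual failure, which I haven't traced): multiprocessing passes objects between processes by pickling them, so any object holding state that cannot be pickled will fail. The FakeAlgorithm class below is a hypothetical stand-in, and the lock is just one example of unpicklable state.

```python
import pickle
import threading

class FakeAlgorithm:
    """Hypothetical stand-in for an object carrying process-local state."""
    def __init__(self):
        self.lock = threading.Lock()  # locks cannot be pickled

def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False

print(can_pickle({"sid": 1, "price": 10.0}))  # plain data
print(can_pickle(FakeAlgorithm()))            # object holding a lock
```

A check like can_pickle() is a quick way to find out whether an object can be handed to a worker process before building anything around it.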

The only way I've found to make the backtest faster (without buying a faster core) is to offload non-essential computations to a separate container that can run concurrently with the main zipline container but on a separate core. This is possible if you have a lot of custom modeling that doesn't require access to zipline data objects to execute.

If you have other techniques, would love to learn about them, as some of my backtests take multiple days to complete one year!

peterfabakker · April 14, 2021, 4:29am

I did some measuring with the same algorithm.

On my local mac (and docker on mac is usually slow): 2.5 hrs

On cloud (AWS t3.xlarge): after 15 hrs it has done 34%, so this will be about 45 hrs by the time it's done.

My conclusion: something must be off with the cloud-config, it's 18 times as slow!
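
The estimate above can be reproduced with a few lines. This is only a sketch and assumes progress is linear over the backtest period; the variable names are made up for illustration.

```python
elapsed_hours = 15    # time spent on the cloud instance so far
fraction_done = 0.34  # share of the backtest completed in that time
local_hours = 2.5     # full run time on the local Mac

# Linear extrapolation of the total run time on the cloud instance
projected_total = elapsed_hours / fraction_done
slowdown = projected_total / local_hours
print(f"projected: {projected_total:.0f} h, slowdown: {slowdown:.1f}x")
```

That gives a projection in the mid-40s of hours, which is where the roughly 18x figure comes from.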

Happy to let somebody poke around on the instance, but this is becoming unworkable. 2.4 was twice as fast on the cloud as on my Mac; now it's 18 times slower.

Brian · April 15, 2021, 3:53pm

I don't think any change in 2.5 would offer a likely explanation.

charles · January 16, 2022, 11:02pm

I accidentally stumbled upon this thread while working on making my Moonshot backtests faster, and was shocked to see you guys are waiting hours to days for a backtest to complete.

Currently, it takes me around 20 minutes to run a full year's backtest using Moonshot w/ around 2-5 candidates per day; roughly 1,000 iterations of my strategy logic on intraday minute data.

Personally, I love how easy it is to think about Zipline backtests/strategies, but it's way too slow for me. And while I initially struggled to wrap my head around Moonshot and how to implement all my strategy logic within the context of a dataframe, the results have been amazing. I'd highly recommend looking into Moonshot and considering migrating your strategies.

Also, here are some tips I've figured out to make Moonshot run faster and work with larger-than-RAM price data:

1. Break out candidate scanning into a separate process instead of including it within your strategy. I've added a scan() method to each of my strategies that generates a list of candidates to be backtested. Once I have this list, I can reuse it over and over again to run backtests.
2. For live trading, I have a separate scheduled job that handles scanning and generates the universe that my strategies then use. My scanning works with the entire US stock market universe, but my actual strategy only operates on a handful of names from the universe generated by the scanning job. My scan jobs run every 15 minutes, while my trading jobs run every minute.
3. For really large backtests, I've found that backtesting 1 day at a time (building on step 1) lets me scale past memory limitations. At the end of each day's backtest, I pickle the results, one file for each day. At the end of the backtest, I aggregate all the results into a single file that I then run my analysis on.
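
The day-at-a-time approach in the last tip could look roughly like the sketch below. It is not the poster's actual code: run_one_day() is a hypothetical placeholder for whatever runs a single-day backtest, and the file naming and result format are assumptions.

```python
import pickle
import tempfile
from datetime import date, timedelta
from pathlib import Path

def run_one_day(day):
    """Hypothetical placeholder for a single-day backtest; returns result rows."""
    return [{"date": day.isoformat(), "pnl": 0.0}]

def backtest_by_day(start, end, out_dir):
    """Run one day at a time and pickle each day's results to its own file."""
    day = start
    while day <= end:
        with open(out_dir / f"{day.isoformat()}.pkl", "wb") as f:
            pickle.dump(run_one_day(day), f)
        day += timedelta(days=1)

def aggregate(out_dir):
    """Load the per-day files in date order and combine them into one list."""
    rows = []
    for path in sorted(out_dir.glob("*.pkl")):
        with open(path, "rb") as f:
            rows.extend(pickle.load(f))
    return rows

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    backtest_by_day(date(2021, 1, 4), date(2021, 1, 8), out_dir)
    results = aggregate(out_dir)
```

Only one day's results are held in memory during the run; the full set is loaded once at the end for analysis. ISO-formatted dates in the file names keep the sorted order chronological.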

I'm now currently looking into using Vaex (alternative dataframe library) to replace parts of Moonshot to make things even faster!

bjsun · January 17, 2022, 11:29pm

Thanks for sharing your experience here! We need more of this.

Your scan() method sounds like the zipline pipeline, which generally takes only a few seconds for each backtest day to run calculations on the entire universe of ~10k securities and return likelies.

The reason my backtests run slowly is that some days there are hundreds of likelies, and the signal is only found in minute movements. Based on the profiling I've done, a large chunk of the backtest time is spent pulling the historical minute data for each of these assets 390x per day. The rest of the time is spent on (lots of) slicing and calculations.

Since this post, I've found a couple tricks to speed things up a bit:

(1) If the pipeline + before_trading_start calculations are slowing the backtest down, they can be run in parallel in a separate container (e.g. satellite) independent of the backtest instance, thanks to run_pipeline() in zipline.research. Set it up so that all the pre-calcs for the next day are running simultaneous to handle_data() for the current day in the backtest instance, and save the results to a csv. Then for the next day, simply read the csv instead of waiting for pipeline.

(2) If you are tracking the same asset throughout all minutes of the day, the zipline engine spends a lot of time re-pulling the same minute data, with only the last minute changing. This adds up if you need 1000s of past minutes and can be avoided. Using get_data(), you can create a zipline data object in before_trading_start() (again independent of the backtest instance, in a separate container) and pull the entire day's minute data for the asset just once into a df (let's call it EOD_minute_hist). Then for each minute, instead of calling data.history(asset, [fields], 1000, '1m'), simply slice EOD_minute_hist to the current minute of the backtest (e.g. EOD_minute_hist.loc[:data.current_dt].iloc[-1000:]). This saves a TON of time.

@Brian It might be worth considering incorporating this approach into a future zipline engine behind the scenes, as most of the time we are pulling 99% of the same minute data from the bundle 390x per backtest day.
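
Here is a minimal sketch of the slicing idea in trick (2), using synthetic data in place of a real bundle. The frame contents, the history_window() helper, and the timestamps are all assumptions for illustration; only the .loc[...].iloc[...] pattern comes from the description above.

```python
import pandas as pd

# Synthetic stand-in for the minute bars pulled once before the open
index = pd.date_range("2021-04-12 09:31", periods=390, freq="min")
EOD_minute_hist = pd.DataFrame({"close": range(390)}, index=index, dtype="float64")

def history_window(full_history, current_dt, bar_count):
    """Return the last bar_count bars up to and including current_dt,
    sliced from a frame that was loaded only once."""
    return full_history.loc[:current_dt].iloc[-bar_count:]

# Called once per simulated minute instead of re-pulling from the bundle
window = history_window(EOD_minute_hist, pd.Timestamp("2021-04-12 10:00"), 5)
```

Label slicing with .loc on a sorted DatetimeIndex includes the end timestamp, so the window never looks past the current minute. For a 1000-bar lookback, the cached frame would also need to contain enough bars from earlier sessions.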