Long Time taken for pipeline to get Custom Data with text field

kamran.sokhanvari · May 3, 2021, 6:20am

I have successfully loaded about a year of CustomData using the documented procedures and can load them using the zipline pipeline interface.

However, there seems to be a major performance issue if any of the defined fields is of type "str" and, in the CustomFundamentals class, the field is declared with the 'object' type as required. The performance hit is so significant that I am not sure I can use CustomFundamentals for any data type besides 'float'.

Any insight into this would be greatly appreciated.

Example timing:

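from zipline.pipeline import Pipeline
from zipline.pipeline.data.db import Database, Column
from zipline.research import run_pipeline

# dollar_volume_decile is defined elsewhere (not shown in this post)
top_decile = dollar_volume_decile.eq(9)

class CustomFundamentals(Database):

    CODE = "testset-fundamentals"
    LOOKBACK_WINDOW = 180

    market_cap_cmpt = Column(float)
    entval_cmpt = Column(float)
    RBICSFocus_l2_name = Column(object)  # text field

pipe = Pipeline(
    columns={
        'market_cap_cmpt': CustomFundamentals.market_cap_cmpt.latest,
        'entval_cmpt': CustomFundamentals.entval_cmpt.latest,
        # text column left out of this run:
        # 'RBICSFocus_l2_name': CustomFundamentals.RBICSFocus_l2_name.latest
    },
    screen=top_decile,
)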

from datetime import datetime
#import timeit
start = datetime.now()
data = run_pipeline(pipe, start_date='2021-04-26', end_date='2021-04-26')
end = datetime.now()
print("Time taken:", end-start)

Time taken: 0:00:03.371350

Now the same pipe with the text column uncommented:

pipe = Pipeline(
    columns={
        'market_cap_cmpt': CustomFundamentals.market_cap_cmpt.latest,
        'entval_cmpt': CustomFundamentals.entval_cmpt.latest,
        'RBICSFocus_l2_name': CustomFundamentals.RBICSFocus_l2_name.latest
    },
    screen=top_decile,
)

from datetime import datetime
#import timeit
start = datetime.now()
data = run_pipeline(pipe, start_date='2021-04-26', end_date='2021-04-26')
end = datetime.now()
print("Time taken:", end-start)

Time taken: 0:01:50.254969

This version is unusable, as the call takes so long for a single day's worth of data.
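The two timings above were taken by hand with datetime.now(). As a minimal sketch (not part of the original post), a small reusable timer built on time.perf_counter could wrap both runs and keep the results side by side. The names timed and timings are hypothetical, and the workload below is only a placeholder so that the snippet runs anywhere.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block, print the result, and optionally store it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.3f} s")


timings = {}

# Placeholder workload; in the notebook this would be the run_pipeline(...) call.
with timed("placeholder workload", timings):
    sum(range(100_000))
```

In the notebook, the placeholder would be replaced by the run_pipeline call for each version of the pipeline, using one label per version, so that the float-only and text-column timings end up in the same dictionary for comparison.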

Brian · May 3, 2021, 4:50pm

Thank you for reporting this. I'm able to reproduce the slowness affecting text columns and we are working on a fix.

Brian · May 3, 2021, 8:00pm

This was a pandas performance issue in the quantrocket-client package and is fixed in client version 2.5.0.3. The fix will be included in the next release, but you can install it now by updating to the latest client on any containers where you are running into this issue.

To update both the main Python 3 kernel and the Zipline kernel in JupyterLab, run:

pip install quantrocket-client==2.5.0.3 && conda activate zipline && pip install quantrocket-client==2.5.0.3 && conda deactivate

To update the zipline container (for running backtests), enter the container and run:

pip install quantrocket-client==2.5.0.3
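After updating, it may be worth confirming which client version a given kernel or container actually sees. The following is a rough sketch, assuming the installed distribution is named quantrocket-client (as in the pip commands above) and that its version string is purely numeric and dot-separated. The helper names parse_version and has_text_column_fix are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# First client version containing the fix, per the post above.
MIN_FIXED = (2, 5, 0, 3)


def parse_version(text):
    """Turn a dotted numeric version such as '2.5.0.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def has_text_column_fix(installed):
    """Return True if the given version string is at least 2.5.0.3."""
    return parse_version(installed) >= MIN_FIXED


try:
    installed = version("quantrocket-client")
except PackageNotFoundError:
    print("quantrocket-client is not installed in this environment")
else:
    status = "includes" if has_text_column_fix(installed) else "predates"
    print(f"quantrocket-client {installed} {status} the fix")
```

Run this once in each environment that was updated (the main Python 3 kernel, the Zipline kernel, and the zipline container), since each has its own copy of the package.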

kamran.sokhanvari · May 4, 2021, 5:56pm

Brian, Thanks very much for your quick response. I updated the libs and it's working on my side now.

Best,

Brian · June 21, 2021, 8:33pm

This fix is included in version 2.6.0.

Brian · June 26, 2021, 10:00pm

This topic was automatically closed after 5 days. New replies are no longer allowed.