Ml_walkforward fails to write results to file path on longer runs

quinn · February 11, 2026, 10:23am

from quantrocket.moonshot import ml_walkforward
When I run this from a remote client on long-running jobs, the result file almost never arrives on the client device. I assume there is some kind of HTTP disconnect going on. Is this a known issue? Any suggestions on how to fix it?

Brian · February 11, 2026, 2:17pm

The default timeout on the client side for ml_walkforward is 1 day. (The timeout on the server side is 7 days.) If your backtest runs longer than 1 day, increase the timeout by setting QUANTROCKET_TIMEOUT on the client side:

import os
from quantrocket.moonshot import ml_walkforward

timeout = 60*60*24*7 # 7 days 
os.environ["QUANTROCKET_TIMEOUT"] = str(timeout) # env vars must be strings

ml_walkforward(...)

quinn · February 11, 2026, 6:13pm

My runs are normally in the couple-of-hours range. I don't get a timeout; the process just hangs forever.

Brian · February 12, 2026, 1:49pm

Monitor the detailed logs for Moonshot and see if the backtest appears to finish on the server.

If the backtest finishes but the client has been disconnected by then, you would see a 499 in the logs for houston.

Another possibility is that you're out of Moonshot workers due to other backtests running. This would manifest as the process hanging while waiting for other backtests to finish. The moonshot service has 5 workers for handling requests by default (can be tuned with UWSGI_WORKERS env var).

quinn · February 26, 2026, 3:50am

I've looked into this a lot more, increased my workers, etc. The logs do not show 499s. They show successful 200s on the server side, and there are no timeouts on the client side. At this point I'm capturing packets to see if that tells me anything. I don't know if this has to do with running on macOS, or with some OS-level timeout on open connections. I've ruled out firewall issues, since some requests succeed and short requests always work.

moonshot.py:653 (/Users/jsai23/Workspace/quant/quantbridge/.venv/lib/python3.11/site-packages/quantrocket/moonshot.py:653) calls write_response_to_filepath_or_buffer(zipfilepath, response).
files.py:36 (/Users/jsai23/Workspace/quant/quantbridge/.venv/lib/python3.11/site-packages/quantrocket/_cli/utils/files.py:36) is the loop:
for chunk in response.iter_content(chunk_size=1024):
The client is stuck in this loop, waiting indefinitely for the next chunk.

Any help would be great.

Brian · February 26, 2026, 3:58pm

Keeping a long-open connection between a local client device and a cloud server is inherently more susceptible to network issues, and I suspect that's what's going on. The most reliable solution might be to run long-running requests entirely on the cloud server (the typical QuantRocket use case for a user working inside JupyterLab) and limit client-to-server communication to shorter-lived requests. You could achieve this with custom scripts on the cloud server. For example:

From the client, run quantrocket satellite exec codeload.scripts.ml_walkforward.queue, passing the backtest parameters. The script would create a file on the server containing the backtest parameters and immediately return without running the backtest.
On the cloud server, on the crontab, regularly run quantrocket satellite exec codeload.scripts.ml_walkforward.run_backtests. This script looks for queued files and runs the requested backtest. Network interruptions/hanging shouldn't be an issue since the request originates from the same host that is running the backtest.
From the client, run quantrocket satellite exec codeload.scripts.ml_walkforward.get_results. This script checks for completed backtest files and returns them to the client.

Depending on why you need to trigger the backtests locally instead of from the cloud server, you might be able to simplify some of these steps.
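
The three steps above could be sketched roughly as follows. This is only an illustration of the file-based queue idea, not an official QuantRocket script: the directory paths, the function bodies, and the run_one callable (a stand-in for whatever actually invokes ml_walkforward on the server) are all assumptions, and a real script would need to accept its parameters in whatever form satellite exec passes them.

```python
import json
import uuid
from pathlib import Path

# Hypothetical server-side directories; adjust to your deployment.
QUEUE_DIR = Path("/tmp/ml_walkforward/queue")
DONE_DIR = Path("/tmp/ml_walkforward/done")


def queue(queue_dir=QUEUE_DIR, **params):
    """Step 1: record the backtest parameters and return immediately."""
    queue_dir = Path(queue_dir)
    queue_dir.mkdir(parents=True, exist_ok=True)
    job_id = uuid.uuid4().hex
    tmp = queue_dir / f"{job_id}.tmp"
    tmp.write_text(json.dumps(params))
    # Rename last so the runner never picks up a half-written file.
    tmp.rename(queue_dir / f"{job_id}.json")
    return job_id


def run_backtests(run_one, queue_dir=QUEUE_DIR, done_dir=DONE_DIR):
    """Step 2: run every queued job (intended to be called from cron).

    run_one(params, output_path) is a placeholder for the code that
    actually runs the backtest and writes its results to output_path.
    """
    queue_dir, done_dir = Path(queue_dir), Path(done_dir)
    queue_dir.mkdir(parents=True, exist_ok=True)
    done_dir.mkdir(parents=True, exist_ok=True)
    finished = []
    for job_file in sorted(queue_dir.glob("*.json")):
        params = json.loads(job_file.read_text())
        partial = done_dir / f"{job_file.stem}.part"
        run_one(params, partial)
        # Publish the result only once it is complete, then dequeue the job.
        partial.rename(done_dir / f"{job_file.stem}.zip")
        job_file.unlink()
        finished.append(job_file.stem)
    return finished


def get_results(done_dir=DONE_DIR):
    """Step 3: list completed result files that can be returned to the client."""
    done_dir = Path(done_dir)
    if not done_dir.exists():
        return []
    return sorted(done_dir.glob("*.zip"))
```

With this layout, the only long-lived work happens in run_backtests on the server, while queue and get_results are short requests that should finish quickly from the client.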

quinn · February 27, 2026, 12:02am

Yup, this solution would work for me. For future reference, it seems I was able to work around the client hangs by enabling TCP keepalive, which is great. Thanks!
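
The post does not say how keepalive was enabled. For readers who want to try it, here is a minimal sketch of one common approach, assuming the client's HTTP calls go through requests/urllib3 (the iter_content call in the traceback suggests they do). The helper names and the idle/interval/count values are illustrative assumptions, and the urllib3 hook must run before the first connection is opened.

```python
import socket


def keepalive_socket_options(idle=60, interval=30, count=5):
    """Return (level, optname, value) tuples that turn on TCP keepalive probes."""
    options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
    # Idle seconds before the first probe: TCP_KEEPIDLE on Linux,
    # TCP_KEEPALIVE on macOS. Skip it if the platform exposes neither.
    idle_opt = getattr(socket, "TCP_KEEPIDLE", None) or getattr(socket, "TCP_KEEPALIVE", None)
    if idle_opt is not None:
        options.append((socket.IPPROTO_TCP, idle_opt, idle))
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between successive probes.
        options.append((socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval))
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is considered dead.
        options.append((socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count))
    return options


def enable_keepalive_for_urllib3(options):
    """Have urllib3 (and so requests) apply the options to new connections."""
    from urllib3.connection import HTTPConnection

    HTTPConnection.default_socket_options = (
        list(HTTPConnection.default_socket_options) + list(options)
    )
```

Calling enable_keepalive_for_urllib3(keepalive_socket_options()) once at the top of the client script, before ml_walkforward(...), would be the intended usage. Keepalive probes keep an otherwise silent connection from looking idle to intermediate network devices, and they let the client notice a dead peer instead of waiting indefinitely for the next chunk.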