502 satellite stall issues

Hi Brian,

Getting 502s while running satellite. It's aperiodic and not tied to one specific file. The whole countdown schedule runs fine for 2-3 days, then the problem reappears.
It also seems like if an error happens within satellite, it hangs indefinitely and doesn't clean up without a manual restart. Sometimes it runs clean all day, then the first cron kickoff the next day stalls. I don't think it's an HTTP concurrency issue, but I'm not sure.

Cron

0 14 * * quantrocket history collect 'myprices' --priority && quantrocket history wait 'myprices' --timeout '2min' && quantrocket satellite exec 'codeload.project.pipeline.main'

The logs show the history collection completed, but quantrocket satellite exec neither runs nor fails.

2025-09-17 14:00:03 quantrocket.history: INFO [myprices] Collecting history from IBKR for 1 securities in myprices
2025-09-17 14:00:06 quantrocket.history: INFO [myprices] Saved 2220 total records for 1 total securities to quantrocket.v2.history.myprices.sqlite

When I restart it manually, I get logs like this:

docker --context cloud compose restart satellite
>>
2025-09-17 15:24:42 quantrocket.countdown: ERROR Error running /opt/conda/bin/quantrocket satellite exec codeload.project.endofday.main
2025-09-17 15:24:42 quantrocket.countdown: ERROR msg: 'HTTPError(''502 Server Error: Bad Gateway for url: http://houston/satellite/commands?cmd=codeload.project.endofday.main'',
2025-09-17 15:24:42 quantrocket.countdown: ERROR   ''please check the logs for more details'')'
2025-09-17 15:24:42 quantrocket.countdown: ERROR status: error
2025-09-17 15:24:42 quantrocket.countdown: ERROR Error running /opt/conda/bin/quantrocket satellite exec codeload.project.pipeline.main
2025-09-17 15:24:42 quantrocket.countdown: ERROR msg: 'HTTPError(''502 Server Error: Bad Gateway for url: http://houston/satellite/commands?cmd=codeload.project.pipeline.main'',
2025-09-17 15:24:42 quantrocket.countdown: ERROR   ''please check the logs for more details'')'
2025-09-17 15:24:42 quantrocket.countdown: ERROR status: error
2025-09-17 15:24:42 quantrocket.countdown: ERROR Error running /opt/conda/bin/quantrocket satellite exec codeload.project.pipeline.main
2025-09-17 15:24:42 quantrocket.countdown: ERROR msg: 'HTTPError(''502 Server Error: Bad Gateway for url: http://houston/satellite/commands?cmd=codeload.project.pipeline.main'',
2025-09-17 15:24:42 quantrocket.countdown: ERROR   ''please check the logs for more details'')'
2025-09-17 15:24:42 quantrocket.countdown: ERROR status: error

Checking Docker while the service is frozen:

docker --context cloud logs --tail 500 quantrocket_cloud-satellite-1

...The work of process 40 is done. Seeya!
corrupted double-linked list
worker 2 killed successfully (pid: 40)
Respawned uWSGI worker 2 (new pid: 267)
...The work of process 267 is done. Seeya!
worker 2 killed successfully (pid: 267)
Respawned uWSGI worker 2 (new pid: 269)
...The work of process 269 is done. Seeya!
worker 2 killed successfully (pid: 269)
Respawned uWSGI worker 2 (new pid: 298)
...The work of process 298 is done. Seeya!
!!! uWSGI process 298 got Segmentation Fault !!!
*** SIGNAL QUEUE IS FULL: buffer size 212992 bytes (you can tune it with --signal-bufsize) ***
could not deliver signal 0 to workers pool
*** SIGNAL QUEUE IS FULL: buffer size 212992 bytes (you can tune it with --signal-bufsize) ***
could not deliver signal 0 to workers pool
*** SIGNAL QUEUE IS FULL: buffer size 212992 bytes (you can tune it with --signal-bufsize) ***
could not deliver signal 0 to workers pool
*** SIGNAL QUEUE IS FULL: buffer size 212992 bytes (you can tune it with --signal-bufsize) ***
could not deliver signal 0 to workers pool
*** SIGNAL QUEUE IS FULL: buffer size 212992 bytes (you can tune it with --signal-bufsize) ***
...

Here I restarted and then re-ran the pipeline. Even during successful runs I see segfaults.

Run1:
quantrocket satellite exec 'codeload.project.pipeline.main'
[TRADE DETAILS] completed.                  <---- END OF MY CODE
...The work of process 28 is done. Seeya!
!!! uWSGI process 28 got Segmentation Fault !!!

Run2:
[TRADE DETAILS] completed.                  <---- END OF MY CODE
...The work of process 29 is done. Seeya!
!!! uWSGI process 29 got Segmentation Fault !!!

Inside quantrocket flightlog stream --details after a successful run.

quantrocket_cloud-satellite-1|corrupted double-linked list
quantrocket_cloud-satellite-1|worker 1 killed successfully (pid: 28)
quantrocket_cloud-satellite-1|Respawned uWSGI worker 1 (new pid: 150)
  quantrocket_cloud-houston-1|172.18.0.21 - - [17/Sep/2025:20:20:22 +0000] "GET /ibg1/gateway HTTP/1.1" 200 22 "-" "python-requests/2.31.0"
  quantrocket_cloud-houston-1|172.18.0.10 - - [17/Sep/2025:20:20:22 +0000] "GET /ibgrouter/gateways?status=running HTTP/1.1" 200 9 "-" "python-urllib3/1.26.18"
quantrocket_cloud-countdown-1|From root@localhost Wed Sep 17 16:20:20 2025
quantrocket_cloud-countdown-1|Return-path: <root@localhost>
quantrocket_cloud-countdown-1|Envelope-to: root@localhost
quantrocket_cloud-countdown-1|Delivery-date: Wed, 17 Sep 2025 16:20:20 -0400
quantrocket_cloud-countdown-1|Received: from root by f3685c56c416 with local (Exim 4.94.2)
quantrocket_cloud-countdown-1|  (envelope-from <root@localhost>)
quantrocket_cloud-countdown-1|  id 1uyydY-0006ZW-80
quantrocket_cloud-countdown-1|  for root@localhost; Wed, 17 Sep 2025 16:20:20 -0400
quantrocket_cloud-countdown-1|From: root@localhost (Cron Daemon)
quantrocket_cloud-countdown-1|To: root@localhost
quantrocket_cloud-countdown-1|Subject: Cron <root@f3685c56c416> quantrocket master isopen 'NYSE' --ago '35m' && quantrocket satellite exec 'codeload.project.pipeline.main' && quantrocket satellite exec 'codeload.project.merge.main'
quantrocket_cloud-countdown-1|MIME-Version: 1.0
quantrocket_cloud-countdown-1|Content-Type: text/plain; charset=UTF-8
quantrocket_cloud-countdown-1|Content-Transfer-Encoding: 8bit
quantrocket_cloud-countdown-1|X-Cron-Env: <SHELL=/bin/sh>
quantrocket_cloud-countdown-1|X-Cron-Env: <HOME=/root>
quantrocket_cloud-countdown-1|X-Cron-Env: <PATH=/usr/bin:/bin>
quantrocket_cloud-countdown-1|X-Cron-Env: <LOGNAME=root>
quantrocket_cloud-countdown-1|Message-Id: <E1uyydY-0006ZW-80@f3685c56c416>
quantrocket_cloud-countdown-1|Date: Wed, 17 Sep 2025 16:20:20 -0400
quantrocket_cloud-countdown-1|
quantrocket_cloud-countdown-1|status: success
quantrocket_cloud-countdown-1|status: success
quantrocket_cloud-countdown-1|
quantrocket_cloud-countdown-1|

I do async fetches of options pricing within the pipeline, but they're not raising errors. I also wrap them in asyncio.wait_for with a 15-minute timeout to confirm they're not timing out. They've never been the issue.
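
For reference, this is roughly the guard pattern I use (a minimal sketch; fetch_option_prices and the symbols are placeholders for my actual pipeline code):

# sketch of the timeout guard around the async pricing calls (names are placeholders)
import asyncio

async def fetch_option_prices(symbols):
    ...  # the pipeline's async options pricing requests go here

def main():
    # raise asyncio.TimeoutError instead of hanging if the fetch exceeds 15 minutes
    asyncio.run(asyncio.wait_for(fetch_option_prices(["SPY", "QQQ"]), timeout=15 * 60))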

The 502 happens because of the segmentation fault. When the segfault occurs, the worker doesn't get respawned: in the detailed logs you will see "...The work of process XXX is done. Seeya!" but it won't be followed by "Respawned uWSGI worker X (new pid: XXX)". Eventually, if this happens to all the satellite workers, there are no workers left to handle new requests, resulting in 502 Bad Gateway. The immediate fix is to restart the satellite service so the workers get re-created.
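
A quick way to gauge how close you are to running out of workers (a sketch using the log strings above; adjust the container name and tail length to your deployment):

docker --context cloud logs --tail 1000 quantrocket_cloud-satellite-1 2>&1 | grep -c "got Segmentation Fault"
docker --context cloud logs --tail 1000 quantrocket_cloud-satellite-1 2>&1 | grep -c "Respawned uWSGI worker"

If "got Segmentation Fault" lines aren't being matched by "Respawned uWSGI worker" lines, workers are being lost; once all of them are gone, satellite starts returning 502s.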

As for why the segfaults are occurring, I saw this happen in a satellite script that was importing moonchart but have not yet tracked down the root cause. If you're importing moonchart, that could be the cause. If not, try successively eliminating imports in your script until the segfault goes away; that will at least reveal the source of it.

OK, after a lot of digging (originally I thought it was the greenlet installation), I'm now kind of certain it's from sklearn.metrics import roc_curve, similar to this issue and this issue. They recommend changing package versions, but I couldn't always get a clean run.

Reproduce:

Inside the QR base env:

conda list
>>
libopenblas               0.3.21               hc2e42e2_0
numpy                     1.23.5          py311h1ee0e17_0
scikit-learn              1.3.2           py311hb93614b_2    conda-forge
scipy                     1.11.3          py311h82f920c_0  

Run inside codeload. NOTE: Sometimes I have to run this 5-10 times to see the segfault appear.

# test_segfault.py
from sklearn.metrics import roc_curve

def main():
    pass

if __name__ == "__main__":
    main()

quantrocket satellite exec 'codeload.project.test_segfault.main'
>> 
quantrocket_cloud-satellite-1|...The work of process 52 is done. Seeya!
quantrocket_cloud-satellite-1|!!! uWSGI process 52 got Segmentation Fault !!!

If I comment out from sklearn.metrics import roc_curve, I get:

# from sklearn.metrics import roc_curve

def main():
    pass

if __name__ == "__main__":
    main()

quantrocket satellite exec 'codeload.project.test_segfault.main'
>>
quantrocket_cloud-satellite-1|...The work of process 91 is done. Seeya!
quantrocket_cloud-satellite-1|worker 5 killed successfully (pid: 91)
quantrocket_cloud-satellite-1|Respawned uWSGI worker 5 (new pid: 93)

The reason I say "kind of" certain is that sometimes I can run it successfully even with the import uncommented. If I run it 5-10 times, it will throw a segfault again; comment the import back out, and it resolves.
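
Since it's intermittent, a quick way to hammer it is to run the exec in a loop (a sketch; adjust the count as needed), then watch the satellite logs for the Segmentation Fault line:

for i in $(seq 1 10); do quantrocket satellite exec 'codeload.project.test_segfault.main'; done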

I also tried this, but the segfault didn't go away.

# docker-compose.override.yml

services:
  satellite:
    environment:
      - SATELLITE_WORKERS=1
      - UWSGI_LAZY_APPS=1
      - UWSGI_SINGLE_INTERPRETER=1
      - OMP_NUM_THREADS=1
      - OPENBLAS_NUM_THREADS=1
      - MKL_NUM_THREADS=1
      - NUMEXPR_NUM_THREADS=1
      - MALLOC_ARENA_MAX=2
      - PYTHONFAULTHANDLER=1

I'm getting the same segfault with scipy.

# a different script that imports scipy but not sklearn, run in satellite
from scipy.stats import norm  # throws a segfault

Makes sense, since scikit-learn requires it:

pip show scikit-learn | grep Requires
>>
Requires: joblib, numpy, scipy, threadpoolctl
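
For a standalone reproduce in the same style as test_segfault.py (the filename and module path here are hypothetical):

# test_scipy_segfault.py
from scipy.stats import norm  # same intermittent segfault on worker respawn

def main():
    pass

if __name__ == "__main__":
    main()

quantrocket satellite exec 'codeload.project.test_scipy_segfault.main'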

I agree that it seems to trace back to scipy. This will probably have to be resolved with package version updates in the next release. For now, I would suggest one of the following workarounds:

  1. Run quantrocket/satellite:2.9.0, which has earlier package versions and doesn't trigger the segfault, or
  2. Stay on 2.11.0 but restart the satellite service after segfaults occur. Since the segfaults occur during uWSGI worker respawning after the script completes, this is a viable option, as the script itself is unaffected. You can either schedule docker compose restart satellite in the host crontab or restart manually on occasion (see the sketch after this list). You can set the UWSGI_WORKERS environment variable on the satellite service as high as you like (the default is 6) to allow more segfaults to occur before you run out of workers and need to restart.
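
For example, a sketch of the scheduled-restart approach (the restart time, project path, and worker count here are assumptions; adjust to your setup):

# host crontab: restart satellite nightly, before the first scheduled job
0 3 * * * cd /path/to/quantrocket && docker --context cloud compose restart satellite

# docker-compose.override.yml: more worker headroom before 502s start
services:
  satellite:
    environment:
      - UWSGI_WORKERS=12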

Thanks Brian, I went with option 1 for now.

Help for my future self or others:

  1. You could run a debug session successfully inside quantrocket_cloud-jupyter-1, yet quantrocket satellite exec 'codeload.project.pipeline.main' would fail. The reason: satellite:2.9.0 was missing a compatible pyarrow, which the satellite:2.11.0 stack had. You had to add pyarrow==11.0.0 to quantrocket.satellite.pip.txt and then restart the container.
  2. Because Python, pandas, NumPy, etc. are older versions in the 2.9.0 stack, you had to refactor away code that isn't compatible with Python 3.9, like the pipe (|) union syntax in type hints (use typing.Union instead), and calling idxmax() on columns that the older pandas can't reduce over but the satellite:2.11.0 stack could (see the sketch below).
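
As an illustration of the type-hint change (a minimal sketch; the function and its signature are hypothetical):

from typing import Optional, Union

# Python 3.10+ only; raises TypeError at import time on satellite:2.9.0's Python 3.9:
# def load_prices(timeout: int | float, path: str | None = None) -> dict: ...

# Python 3.9-compatible equivalent:
def load_prices(timeout: Union[int, float], path: Optional[str] = None) -> dict:
    return {}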