load_state_dict not finishing in Celery worker task

Heyo,

I am running an EasyOCR Docker container on a Kubernetes cluster, which by itself works completely fine.
The issue I am facing is that when the easyocr.Reader function runs, the Celery worker never seems to finish and eventually times out after X minutes.
Through a lot of debugging I was able to narrow the issue down to the load_state_dict function in module.py of the torch package.

I do not get a single error or anything; the worker simply times out at some point.
Locally without Docker it works; locally and on the Kubernetes cluster inside Docker it doesn't.
(It does work when running the code through an API function. The issue only occurs with the Celery worker.)
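
For reference, here is roughly what the setup looks like, stripped down to a minimal sketch (the task name, broker URL, and time limit are placeholders, not my actual config):

```python
# tasks.py -- minimal sketch of the failing setup (names/values are placeholders)
import easyocr
from celery import Celery

app = Celery("ocr", broker="redis://redis:6379/0")

@app.task(time_limit=600)
def run_ocr(image_path):
    # The worker hangs here, inside torch's load_state_dict,
    # while easyocr.Reader loads its model weights
    reader = easyocr.Reader(["en"], gpu=False)
    return reader.readtext(image_path, detail=0)
```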

As I am not able to get any more detail out of it, I was wondering if anyone here could help me.
I have tried torch versions 1.10.1 through 1.12.1 with torchvision 0.11.2 through 0.13.1 and am not sure what to do next.

Cheers,
Bennse

This is a long shot… but were you able to figure this out? I’m dealing with the exact same thing: while running load_state_dict, it just straight up stops.

I’m using FastAPI and I am preloading my model. What’s strange is that when I log which keys have been loaded, it stops about 75% of the way through, and when I restart my server, the other 25% are finally logged to the screen. So it’s like I’m hitting a max memory buffer issue. Absolutely no errors though.
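
In case it helps, this is roughly how I log the load progress (simplified; the model and checkpoint path are placeholders for my actual ones):

```python
# Simplified sketch of the key-by-key loading/logging I use to see where it stops
import torch

def load_with_progress(model, checkpoint_path):
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    for key, value in state_dict.items():
        # Load one entry at a time so the last printed key shows where it hangs
        model.load_state_dict({key: value}, strict=False)
        print(f"loaded {key}", flush=True)
    return model
```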

Video demo:

The final key that should be loaded is audio_projection.2.bias, which you can see being printed to the screen when the server restarts :thinking: