Came across a problem: in PyTorch 1.1.0 with uWSGI + Flask on CPU, even torch.cat does not work (everything just freezes, without errors). I determined that the problem is in uWSGI/Flask, because in the same environment, without them, I was able to run the same operations without any issues. There was no such problem in previous versions of PyTorch. As far as I understand, I am dealing with multiprocessing artifacts…
Nevertheless, I found the solution, and it was simple:
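(The code itself is not quoted in this excerpt; judging from the replies below, the fix involved deferring model construction and load_state_dict into Flask's before_first_request hook, so they run inside each uWSGI worker after the fork rather than at import time. Below is a minimal sketch of that pattern; Segmentor and model.pth are placeholder names, not taken from the original post, and note that before_first_request was removed in Flask 2.3.)

```python
# Sketch only: defer model construction and load_state_dict until the first
# request, so they run inside each uWSGI worker (after fork), not at import time.
# Segmentor and "model.pth" are placeholders, not names from the original post.
import torch
from flask import Flask, jsonify

app = Flask(__name__)
model = None  # filled in lazily, once per worker process


class Segmentor(torch.nn.Module):  # stand-in for the real model class
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 1, kernel_size=1)

    def forward(self, x):
        return self.conv(x)


@app.before_first_request  # available in the Flask versions of this thread; removed in Flask 2.3
def load_model():
    global model
    model = Segmentor()
    state = torch.load("model.pth", map_location="cpu")  # placeholder path
    model.load_state_dict(state)
    model.eval()


@app.route("/predict")
def predict():
    with torch.no_grad():
        out = model(torch.zeros(1, 3, 224, 224))  # dummy input for illustration
    return jsonify(shape=list(out.shape))
```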
@lebionick thanks for the info. I actually have the exact same project structure as your code; my Segmentor is just named differently. I had another solution for another app, but I changed it to be exactly like yours, with before_first_request, still with no success. I will report back if I find out more about this.
OK, it seems that load_state_dict and other operations in __init__, when run after flask.Flask(__name__), cause PyTorch operations done in requests to hang forever. I'm not sure about the cause, but maybe this info could help someone.
Aren't you running the Flask app with uWSGI's preforking worker mode, which is the default config?
Try lazy-apps mode, so that each worker loads the trained model for itself and does not share it with the others.
It works in my environment even when I load the model at the global (module) level.
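For reference, a minimal uwsgi.ini sketch with lazy-apps enabled; the module, port, and process count are placeholders for your own setup, and only the lazy-apps line is the relevant change:

```ini
[uwsgi]
# load the application (and therefore the model) in each worker after fork,
# instead of loading it once in the master and forking afterwards
lazy-apps = true
# placeholder: Flask instance "app" inside wsgi.py
module = wsgi:app
master = true
processes = 4
enable-threads = true
http = :8000
```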
I found this thread really interesting because I am working on scaling my inference server (which uses uWSGI + Flask + PyTorch on AWS with Elastic Inference), and when I increased the number of processes recently I came across some intermittent issues:
terminate called after throwing an instance of 'c10::Error'
what(): [enforce fail at inline_container.cc:316] . PytorchStreamWriter failed writing central directory: file write failed
I couldn't find any other references to this sort of error, but I am guessing it could be some kind of concurrency issue, since it appeared as I increased the number of processes.
I found a different solution that works in my case and still maintains the default fork model of uWSGI (i.e. without lazy-apps). Enabling lazy-apps may not be desirable because worker processes will not share resources with their parent (e.g. read-only resources such as models used for inference).
The idea is to share only the state_dict between processes; each worker process then calls model.load_state_dict(global_state) independently, but only once, during the first request. This is opposed to doing load_state_dict in the parent process and sharing the resulting model with the worker processes.
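A sketch of that idea under uWSGI's default preforking model, reusing the same placeholder names as above (Segmentor, model.pth); here the per-worker lazy initialization is done with a small helper rather than any particular Flask hook:

```python
# Sketch only: torch.load runs once at import time (in the uWSGI master, before
# fork), so the raw state_dict tensors are shared with the workers via
# copy-on-write; load_state_dict then runs independently in each worker,
# once, on its first request.
import torch
from flask import Flask, jsonify


class Segmentor(torch.nn.Module):  # stand-in for the real model class
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 1, kernel_size=1)

    def forward(self, x):
        return self.conv(x)


app = Flask(__name__)

# loaded in the parent process; only ever read, so the pages stay shared
global_state = torch.load("model.pth", map_location="cpu")  # placeholder path

model = None  # constructed lazily, once per worker


def get_model():
    global model
    if model is None:
        model = Segmentor()
        model.load_state_dict(global_state)  # per worker, on first use
        model.eval()
    return model


@app.route("/predict")
def predict():
    with torch.no_grad():
        out = get_model()(torch.zeros(1, 3, 224, 224))  # dummy input
    return jsonify(shape=list(out.shape))
```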
I just ran into a similar issue running PyTorch with uWSGI in a container. This issue (Launching two processes causes hanging · Issue #50669 · pytorch/pytorch · GitHub) indicates that using LD_PRELOAD to load Intel's OMP instead of libgomp can avoid the problem, which is in libgomp. On Debian Bullseye I was able to install libomp-dev-11 and run LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libiomp5.so /usr/bin/uwsgi --ini /uwsgi.ini --uid www-data --enable-threads to work around the issue.