PytorchStreamWriter failed writing central directory

af_luther · March 6, 2021, 6:59pm

I have deployed an inference server using uWSGI + Flask + PyTorch on AWS Amazon Linux with Elastic Inference.

I have uWSGI configured with 2 threads and 4 processes.

In most circumstances, clients send three requests in a row, which are processed in parallel. The first usually returns in less than half a second, and the next two in less than a second.

After months of stable processing, I recently started doing some scalability testing and decided to try increasing the number of processes to 6 on a single server. Now I have started getting some intermittent issues which I have not seen before:

terminate called after throwing an instance of 'c10::Error'
what(): [enforce fail at inline_container.cc:316] . PytorchStreamWriter failed writing central directory: file write failed

I cannot find any references to this error message. My best guess is that it is some kind of concurrency issue: perhaps the increased number of processes now makes it possible, or at least more likely, for two processes to try to write to the same file at the same time (whatever that file may be in the “central directory”, whatever that is).

One possible solution might be to set lazy or lazy-apps to true in uWSGI, but first I’d like to better understand the root cause of the error message.

What is the “central directory” and what could cause this error? Is it indeed concurrency?
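
For context on the term: TorchScript model files are ZIP archives, and in the ZIP format the "central directory" is the index of entries that the writer appends at the end of the archive when it is closed. The sketch below uses only Python's standard zipfile module, not PyTorch's own writer, to show that an archive missing that final index cannot be opened. The entry names and the helpers build_archive and is_readable are made up for illustration and are not the real layout of a saved model.

```python
import io
import zipfile

# Signature of the end-of-central-directory record in the ZIP format.
EOCD_SIGNATURE = b"PK\x05\x06"


def build_archive(entries):
    """Write entries (name -> bytes) to an in-memory ZIP and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w") as archive:
        for name, payload in entries.items():
            archive.writestr(name, payload)
    # Leaving the with-block closes the archive; that is when zipfile
    # appends the central directory and the end-of-central-directory record.
    return buffer.getvalue()


def is_readable(raw):
    """Return True if zipfile can locate the central directory in raw."""
    try:
        with zipfile.ZipFile(io.BytesIO(raw)) as archive:
            archive.namelist()
        return True
    except zipfile.BadZipFile:
        return False


# Hypothetical entry names, chosen only for the demonstration.
raw = build_archive({"model/data.pkl": b"placeholder", "model/version": b"3"})

# The complete archive ends with the 22-byte end-of-central-directory record.
print(raw[-22:-18] == EOCD_SIGNATURE)

# Dropping the tail simulates a write that failed before the index was stored.
print(is_readable(raw), is_readable(raw[:-30]))
```

If this carries over to PyTorch's writer, the message would point at the last step of saving an archive rather than at loading one, so it may be worth checking what in the service writes a model or tensor file at all.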

I load the model once in my Python Flask script:
model = torch.jit.load(model_path).eval()
and execute the model in the service method with the given input data.