The problem
I’ve got a fairly basic multi-task CNN I’ve built in PyTorch (the efficientnet_pytorch package for the body, pytorch-lightning during training).
I’m now deploying the model in a simple Tornado web app. The app has a test suite that runs multiple forward passes through the model, and the app runs on the CPU only.
All these tests pass on my local machine (Macbook Pro) and my development machine (a beefy Google Cloud box).
On the deployment machine (an AWS t3.medium), the tests fail and the app often crashes with the error below.
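For reference, the failing call boils down to a plain CPU forward pass through a padded conv layer; here is a minimal torch-only sketch of that call path (shapes and channel counts are my guesses from the traceback below, not the real model):

```python
import torch
import torch.nn.functional as F

# Hypothetical reproduction of the failing call path: the crash happens
# inside F.conv2d on the first forward pass, after a ZeroPad2d(0, 1, 0, 1).
x = torch.randn(1, 3, 224, 224)   # one RGB image (size is an assumption)
w = torch.randn(40, 3, 3, 3)      # 3 -> 40 channels, 3x3 kernel, as in the trace

x = F.pad(x, (0, 1, 0, 1))        # static ZeroPad2d padding from the trace
out = F.conv2d(x, w, bias=None, stride=(2, 2))
print(out.shape)                   # torch.Size([1, 40, 112, 112])
```

This runs fine on my other machines; on the t3.medium the F.conv2d line is where std::bad_alloc is raised.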
Fixes I’ve tried
- I first noticed this in Docker, but running directly on the box gives the same issue.
- I tried both an AWS t3 and an m5 instance in case it was a memory problem (4 GB -> 8 GB), but that didn’t fix it, and watching memory usage it never gets near the limit.
- I’ve also upgraded to PyTorch 1.10, which had no effect.
- Wrapping the forward pass in a dumb try/except retry loop does fix the problem: so far it always succeeds on the second attempt. That’s not an ideal solution, though.
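For concreteness, the retry workaround in the last bullet looks roughly like this (the function name and attempt count are mine, not part of the app):

```python
import torch

def forward_with_retry(model, x, attempts=3):
    """Workaround, not a fix: retry the forward pass, since on the
    t3.medium it has so far always succeeded on the second attempt."""
    last_err = None
    for _ in range(attempts):
        try:
            with torch.no_grad():
                return model(x)
        except RuntimeError as err:  # std::bad_alloc surfaces as RuntimeError
            last_err = err
    raise last_err
```

I’d much rather understand why the first attempt fails than ship this.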
Does anyone have any thoughts on what the issue might be, solutions, or any suggestions for getting a more informative stack trace?
Thanks very much!
The error
self = Conv2dStaticSamePadding(
3, 40, kernel_size=(3, 3), stride=(2, 2), bias=False
(static_padding): ZeroPad2d(padding=(0, 1, 0, 1), value=0.0)
)
x = tensor([[[[-2.1008, -2.1008, -2.1008, ..., -2.1008, -2.1008, 0.0000],
[-2.1008, -2.1008, -2.1008, ..., -2..., -1.7870, ..., -1.7870, -1.7696, 0.0000],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]]]])
def forward(self, x):
x = self.static_padding(x)
> x = F.conv2d(x, self.weight, self.bias, self.stride, self.padding, self.dilation, self.groups)
E RuntimeError: std::bad_alloc