Sporadic "RuntimeError: std::bad_alloc" doing forward pass on CPU during deployment on certain hardware

The problem

I’ve got a fairly basic multi-task CNN built in PyTorch (the efficientnet_pytorch package is used for the body, and pytorch-lightning was used during training).

I’m now deploying the model in a simple Tornado web app, which runs on CPU only. For the app I have a test suite that involves running multiple forward passes through the model.

All these tests pass on my local machine (Macbook Pro) and my development machine (a beefy Google Cloud box).

The tests fail, and the app often crashes when I run them on the deployment machine (an AWS t3.medium), with the error below.

Fixes I’ve tried

  • I noticed this in Docker, but running directly on the box gives the same issue.
  • I tried both an AWS t3 and an m5 box in case it was a memory problem (4GB -> 8GB), but that didn’t fix it, and watching the memory usage it never seems to approach the limits.
  • I’ve also upgraded to PyTorch 1.10, which had no effect.
  • Putting the forward pass in a dumb try/except loop that retries it multiple times does fix the problem - so far it succeeds every time on the second attempt (see the sketch below). That’s not an ideal solution though.
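
For reference, the retry workaround is roughly the following sketch (the model/image names here are stand-ins for my app code, not the real thing):

import torch

def forward_with_retry(model, image, attempts=3):
    # Retry the forward pass if it dies with std::bad_alloc; so far the
    # second attempt has succeeded every time.
    for attempt in range(attempts):
        try:
            with torch.no_grad():
                return model(image)
        except RuntimeError as exc:
            if "bad_alloc" not in str(exc) or attempt == attempts - 1:
                raise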

Does anyone have any thoughts on what the issue might be, solutions, or any suggestions for getting a more informative stack trace?

Thanks very much!

The error

self = Conv2dStaticSamePadding(
  3, 40, kernel_size=(3, 3), stride=(2, 2), bias=False
  (static_padding): ZeroPad2d(padding=(0, 1, 0, 1), value=0.0)
)
x = tensor([[[[-2.1008, -2.1008, -2.1008,  ..., -2.1008, -2.1008,  0.0000],
          [-2.1008, -2.1008, -2.1008,  ..., -2..., -1.7870,  ..., -1.7870, -1.7696,  0.0000],
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]]]])

    def forward(self, x):
        x = self.static_padding(x)
>       x = F.conv2d(x, self.weight, self.bias, self.stride, self.padding, self.dilation, self.groups)
E       RuntimeError: std::bad_alloc

Are you seeing the issue by simply executing the forward pass of your model or is it only raised in your deployment setup using the web server etc.?

Thanks for the reply!

It is caused by the forward pass; however, I have only seen it occur in the context of the deployment setup using the web server. That said, the same setup and test suite works fine on different hardware using the same Docker image.

Thanks for the update!
Could you try to get the backtrace via:

gdb --args python script.py args
...
run
...
bt

and post it here?

When I do this I just get

(gdb) bt
No stack.

The full output and the command I ran are below, in case I’m doing something silly that’s preventing the backtrace from appearing.

However, to make sure the lack of backtrace wasn’t just pytest/Tornado swallowing the error on a different thread, I pulled all the tests out and ran them manually. In doing so I realised I only see the error when I’ve used a model from the Python lifelines package before calling PyTorch. If I don’t run the tests on the lifelines model, all the PyTorch tests succeed.

So maybe this is a lifelines issue and not a PyTorch problem? It’s super strange, though, that the error occurs when running PyTorch.

$ PYTHONPATH=./ gdb --args python research/app/tests/test_model.py 
                                                                                                                                                                           
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...
(gdb) run
Starting program: /home/ubuntu/miniconda3/envs/reti/bin/python research/app/tests/test_model.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffe783d700 (LWP 39184)]
[New Thread 0x7fffe703c700 (LWP 39185)]
Traceback (most recent call last):
  File "/home/ubuntu/reti/research/app/tests/test_model.py", line 109, in <module>
    test.test_retina_model()
  File "/home/ubuntu/reti/research/app/tests/test_model.py", line 9, in test_retina_model
    result = process_image(env.GEORGE_LEFT)
  File "/home/ubuntu/miniconda3/envs/reti/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/reti/research/app/models/models.py", line 73, in process_image
    raise exc
  File "/home/ubuntu/reti/research/app/models/models.py", line 55, in process_image
    embed = MODEL(image)
  File "/home/ubuntu/miniconda3/envs/reti/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/reti/research/models/biobank.py", line 141, in forward
    embed = self.encoder(x)
  File "/home/ubuntu/miniconda3/envs/reti/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/reti/research/models/efficientnets.py", line 16, in forward
    x = self.extract_features(inputs)
  File "/home/ubuntu/miniconda3/envs/reti/lib/python3.9/site-packages/efficientnet_pytorch/model.py", line 289, in extract_features
    x = self._swish(self._bn0(self._conv_stem(inputs)))
  File "/home/ubuntu/miniconda3/envs/reti/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/reti/lib/python3.9/site-packages/efficientnet_pytorch/utils.py", line 275, in forward
    x = F.conv2d(x, self.weight, self.bias, self.stride, self.padding, self.dilation, self.groups)
RuntimeError: std::bad_alloc
[Thread 0x7fffe703c700 (LWP 39185) exited]
[Thread 0x7fffe783d700 (LWP 39184) exited]
[Inferior 1 (process 39155) exited with code 01]
(gdb) bt
No stack.

You could use thread apply all bt to get the backtrace of all threads.

That’s indeed strange and there might be some (undesired) interaction between these packages.

Sorry for the extremely slow response! I went on vacation.

  1. thread apply all bt also returns nothing at all :confused:. It just drops back to the gdb prompt and does nothing.
  2. I tried setting a breakpoint with catch throw std::bad_alloc. This does give a stack trace, although I’m not sure whether it’s informative? Sorry it isn’t easily legible; I couldn’t make syntax highlighting work on Pastebin or GitHub.
  3. I have managed to make a minimal example (sketched below), in case there’s anything obvious going wrong there. As before, it runs fine on my MacBook Pro but crashes on the deployment machine.
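
Roughly, the minimal example has this shape (the dataset, layer sizes, and input shape here are placeholders rather than my real code):

import torch
import torch.nn as nn
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

# 1) Use a lifelines model first (this seems to be the trigger).
df = load_rossi()
cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")

# 2) Then run a plain CPU convolution, similar to the EfficientNet stem conv.
conv = nn.Conv2d(3, 40, kernel_size=3, stride=2, bias=False)
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    out = conv(x)  # raises RuntimeError: std::bad_alloc on the deployment box
print(out.shape)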

Yes, the stacktrace seems to point to MKL/ideep:

(gdb) bt
#0  __cxxabiv1::__cxa_throw (obj=0x555558e7ed40, tinfo=0x7fffe3ae6278 <typeinfo for std::bad_alloc>, dest=0x7fffe39f910e <std::bad_alloc::~bad_alloc()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:77
#1  0x00007fffe39f6f8c in std::__throw_bad_alloc () at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1634095553113/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/exception.h:63
#2  0x00007fffe4cd1fa4 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<int const, dnnl::memory>, false> > >::_M_allocate_buckets(unsigned long) [clone .isra.585] ()
   from /home/ubuntu/miniconda3/envs/reti/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#3  0x00007fffe4cde4f5 in std::_Hashtable<int, std::pair<int const, dnnl::memory>, std::allocator<std::pair<int const, dnnl::memory> >, std::__detail::_Select1st, std::equal_to<int>, std::hash<int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_Hashtable<std::pair<int const, dnnl::memory> const*>(std::pair<int const, dnnl::memory> const*, std::pair<int const, dnnl::memory> const*, unsigned long, std::hash<int> const&, std::__detail::_Mod_range_hashing const&, std::__detail::_Default_ranged_hash const&, std::equal_to<int> const&, std::__detail::_Select1st const&, std::allocator<std::pair<int const, dnnl::memory> > const&) ()
   from /home/ubuntu/miniconda3/envs/reti/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#4  0x00007fffe5207db0 in ideep::tensor::reorder_if_differ_in(ideep::tensor::desc const&, ideep::attr_t const&) const () from /home/ubuntu/miniconda3/envs/reti/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#5  0x00007fffe5208243 in void ideep::convolution_forward::do_compute<true>(ideep::convolution_forward_params const&, ideep::tensor const&, ideep::tensor const&, ideep::tensor const&, ideep::tensor&) ()
   from /home/ubuntu/miniconda3/envs/reti/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#6  0x00007fffe5203ab6 in at::native::_mkldnn_convolution(ideep::tensor const&, ideep::tensor const&, c10::optional<ideep::tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long) ()
   from /home/ubuntu/miniconda3/envs/reti/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#7  0x00007fffe5204557 in at::native::mkldnn_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long) ()
   from /home/ubuntu/miniconda3/envs/reti/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so

Are you using this lifelines repository for your code?

Yep, that’s the one! On the latest version (0.26.4).

Which CPU are you using?
I wasn’t able to reproduce the issue on an Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz using a source build, the 1.10.1 pip wheels, or the nightly pip wheel from today.

EDIT: In any case, would you mind creating an issue on GitHub so that the MKL devs are aware of this issue and might be able to reproduce it, please?

Sure, have done here. It occurs for me on Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz (AWS t3.medium).

Thanks for your help! :slight_smile:


Exactly the same problem happened to me and killed my productive morning. The solution: just upgrade your numpy version. It fixes the problem.
It seems to appear on Intel Xeon CPUs and in cloud environments.
What I tried:
upgrading MKL with conda, which did not help;
upgrading and downgrading PyTorch versions, with no luck; and all the previous recommendations, which did not help either.
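
For example, to check which numpy version and which MKL/oneDNN build PyTorch is actually using before and after the upgrade:

# Quick sanity check of the environment: numpy version plus the MKL/oneDNN
# information baked into the PyTorch build.
import numpy as np
import torch

print("numpy:", np.__version__)
print("torch:", torch.__version__)
print(torch.__config__.show())  # prints MKL / MKL-DNN (oneDNN) build details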