Segmentation Fault when using checkpoint and DataParallel

Hey everyone,

I have a feeling I'm going to be straight out of luck on this one, but I thought I'd throw it out there and see.

System Info:
CUDA version: 9.0.176
cuDNN version: 7
OS: CentOS Linux 7
PyTorch: 0.4.1
Python: 3.6

I am encountering segmentation faults when I try to use torch.utils.checkpoint.checkpoint inside a DataParallel module across multiple GPUs. I didn't find anything on this specific problem, so I thought I'd create a new thread; apologies if I've missed something.

Posted a demo script here:
dp_segfault.py

The diamond pattern comes from the model I'm basing this on, which follows the paper Convolutional Neural Fabrics, so I can't just use a plain nn.Sequential.

There seem to be a couple of weird things:

  • Running the script on 1 GPU works fine.
  • Running the script without any checkpointing works fine.
  • Running the script without the diamond pattern (i.e. conv0(x) -> checkpoint(conv1, x) -> conv3(x)) works fine; it only crashes with the full diamond, sketched just below.
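
For reference, here's roughly what the script boils down to (layer names and sizes here are just illustrative, not the exact contents of dp_segfault.py); it assumes at least two visible GPUs:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Illustrative diamond-pattern module: two branches off conv0, one of them
# checkpointed, merged again before conv3.
class Diamond(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.conv0 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        x = self.conv0(x)
        a = checkpoint(self.conv1, x)   # checkpointed branch
        b = self.conv2(x)               # plain branch
        return self.conv3(a + b)        # branches merge again -> the "diamond"

model = nn.DataParallel(Diamond().cuda())   # assumes >= 2 visible GPUs
inp = torch.randn(8, 16, 32, 32, device='cuda', requires_grad=True)
model(inp).sum().backward()                 # segfaults here on multi-GPU, fine on 1 GPU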

Faulthandler output:

Current thread 0x00007f7bc2dff700 (most recent call first):
  File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90 in backward
  File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90 in backward
  File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply

Thread 0x00007f7bc3ffe700 (most recent call first):

Thread 0x00007f7bc47ff700 (most recent call first):

Thread 0x00007f7c57b1a740 (most recent call first):
  File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90 in backward
  File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 93 in backward
  File "dp.py", line 62 in <module>
Segmentation fault
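
(For anyone trying to capture the same kind of trace: the output above comes from Python's standard-library faulthandler module; enabling it at the top of the script is enough to get the Python-level tracebacks of all threads when the process hits SIGSEGV.)

import faulthandler
faulthandler.enable()  # dump every thread's Python traceback on a fatal signal
# equivalently: run the script as `python -X faulthandler dp.py`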

gdb output:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff649ff700 (LWP 39737)]
std::__push_heap<__gnu_cxx::__normal_iterator<torch::autograd::FunctionTask*, std::vector<torch::autograd::FunctionTask> >, long, torch::autograd::FunctionTask, __gnu_cxx::__ops::_Iter_comp_val<torch::autograd::CompareFunctionTaskTime> > (__first=..., __holeIndex=1, __topIndex=__topIndex@entry=0, __value=..., __comp=...) at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/stl_heap.h:129
129     /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/stl_heap.h: No such file or directory.

Obviously the latter part gives some pointers, although I suspect the "No such file or directory" line just means gdb can't find the libstdc++ sources PyTorch was built against, rather than that I need to go and install stl_heap.h; the crash itself is happening inside the autograd engine's task queue. Either way, I don't have admin access (university cluster), so I'd like to really understand what needs to be done, and why this is happening, before going and pestering the sysadmins.

Opened an issue at https://github.com/pytorch/pytorch/issues/11732.

Glad to know it wasn’t just me being dumb.

For the benefit of anyone who reads this later (https://xkcd.com/979/): my hacky solution was to checkpoint all layers in the model. Not ideal, but it seems to get the job done, and the run time is not *too* bad.
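
In case it's useful, this is roughly the shape of that workaround (a sketch using the illustrative layer names from earlier in the thread, not my actual model):

from torch.utils.checkpoint import checkpoint

# Workaround sketch: run *every* layer through checkpoint(), not just one branch.
# Note: the model input needs requires_grad=True, otherwise the first
# checkpointed call has nothing to backprop into.
def forward(self, x):
    x = checkpoint(self.conv0, x)
    a = checkpoint(self.conv1, x)
    b = checkpoint(self.conv2, x)
    return checkpoint(self.conv3, a + b)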

Hi, I have the same problem. Have you found a solution?