Segmentation Fault when using checkpoint and DataParallel


#1

Hey everyone,

I have a feeling I am going to be straight out of luck on this one, but thought I’d throw it out there and see.

System Info:
Cuda Version: 9.0.176
Cudnn Version: 7
OS: CentOS Linux 7
Pytorch: 0.4.1
Python: 3.6

I am encountering segmentation faults when I try to use torch.utils.checkpoint.checkpoint inside a DataParallel module across multiple GPUs. I didn't find anything on this specific problem, so I thought I'd create a new thread - apologies if I've missed something.

Posted a demo script here:
dp_segfault.py

The diamond pattern comes from the model I'm basing this on (the paper Convolutional Neural Fabrics), so I can't just use a plain nn.Sequential. There's a rough sketch of the pattern after the list below.

There seem to be a couple of weird things:

  • Running the script on 1 GPU works fine.
  • Running the script without any checkpointing works fine.
  • Running the script without the diamond pattern (i.e. conv0(x) -> checkpoint(conv1, x) -> conv3(x)) works fine.
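
For reference, here's a minimal sketch of the kind of module I mean (not the actual demo script - the layer names conv0..conv3 and the tensor shapes are made up for illustration):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Diamond(nn.Module):
    # Toy diamond: conv0 fans out into two checkpointed branches,
    # which are merged and fed to conv3.
    def __init__(self, channels=16):
        super().__init__()
        self.conv0 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        x = self.conv0(x)
        # Checkpointed branches: their activations are recomputed during
        # backward instead of being stored in the forward pass.
        a = checkpoint(self.conv1, x)
        b = checkpoint(self.conv2, x)
        return self.conv3(a + b)

model = nn.DataParallel(Diamond().cuda())
out = model(torch.randn(8, 16, 32, 32).cuda())
out.sum().backward()  # the segfault shows up here when more than one GPU is visible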

Faulthandler output:

Current thread 0x00007f7bc2dff700 (most recent call first):
  File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90 in backward
  File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90 in backward
  File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply

Thread 0x00007f7bc3ffe700 (most recent call first):

Thread 0x00007f7bc47ff700 (most recent call first):

Thread 0x00007f7c57b1a740 (most recent call first):
  File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90 in backward
  File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 93 in backward
  File "dp.py", line 62 in <module>
Segmentation fault

gdb output:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff649ff700 (LWP 39737)]
std::__push_heap<__gnu_cxx::__normal_iterator<torch::autograd::FunctionTask*, std::vector<torch::autograd::FunctionTask> >, long, torch::autograd::FunctionTask, __gnu_cxx::__ops::_Iter_comp_val<torch::autograd::CompareFunctionTaskTime> > (__first=..., __holeIndex=1, __topIndex=__topIndex@entry=0, __value=..., __comp=...) at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/stl_heap.h:129
129     /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/stl_heap.h: No such file or directory.

Obviously the latter part gives some pointers - I probably need to find and install stl_heap.h - but I don't have admin access (university cluster), so I'd like to really understand what needs to be done, and why this is happening, before going and pestering the sysadmins.


(Simon Wang) #2

opened an issue at https://github.com/pytorch/pytorch/issues/11732


#3

Glad to know it wasn’t just me being dumb.

For the benefit of anyone who reads this later (https://xkcd.com/979/), my hacky workaround was to checkpoint every layer in the model. Not ideal, but it seems to get the job done, and the runtime isn't too bad.
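
Roughly, using the toy layer names from my first post (illustrative only, not the real model):

def forward(self, x):
    # Every stage goes through checkpoint, not just the two middle branches.
    # Caveat: with the very first layer checkpointed, the network input has
    # to require grad (e.g. x.requires_grad_()), otherwise the checkpointed
    # output won't require grad and backward breaks.
    x = checkpoint(self.conv0, x)
    a = checkpoint(self.conv1, x)
    b = checkpoint(self.conv2, x)
    return checkpoint(self.conv3, a + b)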


(Victor Tan) #4

Hi, I have the same problem. Have you found a solution?