Hey everyone,
I have a feeling I am going to be straight out of luck on this one, but thought I’d throw it out there and see.
System Info:
- CUDA version: 9.0.176
- cuDNN version: 7
- OS: CentOS Linux 7
- PyTorch: 0.4.1
- Python: 3.6
I am encountering segmentation faults when I use torch.utils.checkpoint.checkpoint inside a DataParallel module across multiple GPUs. I didn't find anything on this specific problem, so I thought I'd start a new thread; apologies if I've missed something.
Posted a demo script here:
dp_segfault.py
The diamond pattern comes from the model I'm basing this on, from the paper Convolutional Neural Fabrics, so I can't just use a plain nn.Sequential.
There seem to be a couple of weird things:
- Running the script on 1 GPU works fine.
- Running the script without any checkpointing works fine.
- Running the script without the diamond pattern (i.e. a linear chain, conv0(x) -> checkpoint(conv1, x) -> conv3(x)) works fine.
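For anyone who doesn't want to open the gist, this is roughly the shape of the failing setup (a minimal sketch, not the exact demo script; the Diamond class, channel sizes, and layer names here are illustrative):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Diamond(nn.Module):
    """Illustrative diamond: the input splits into two checkpointed
    branches that re-join (names made up, see dp_segfault.py)."""
    def __init__(self, channels=8):
        super().__init__()
        self.conv0 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        x = self.conv0(x)
        a = checkpoint(self.conv1, x)  # branch 1 (checkpointed)
        b = checkpoint(self.conv2, x)  # branch 2 (checkpointed)
        return self.conv3(a + b)       # diamond re-joins here

model = Diamond()
if torch.cuda.device_count() > 1:
    # This is the combination that segfaults for me on 0.4.1;
    # on CPU or a single GPU the same code runs fine.
    model = nn.DataParallel(model).cuda()

x = torch.randn(4, 8, 16, 16, requires_grad=True)
if next(model.parameters()).is_cuda:
    x = x.cuda()
out = model(x)
out.sum().backward()  # the crash happens during backward on >1 GPU
```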
Faulthandler output:
Current thread 0x00007f7bc2dff700 (most recent call first):
File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90 in backward
File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90 in backward
File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
Thread 0x00007f7bc3ffe700 (most recent call first):
Thread 0x00007f7bc47ff700 (most recent call first):
Thread 0x00007f7c57b1a740 (most recent call first):
File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90 in backward
File "/home/{user}/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 93 in backward
File "dp.py", line 62 in <module>
Segmentation fault
gdb output:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff649ff700 (LWP 39737)]
std::__push_heap<__gnu_cxx::__normal_iterator<torch::autograd::FunctionTask*, std::vector<torch::autograd::FunctionTask> >, long, torch::autograd::FunctionTask, __gnu_cxx::__ops::_Iter_comp_val<torch::autograd::CompareFunctionTaskTime> > (__first=..., __holeIndex=1, __topIndex=__topIndex@entry=0, __value=..., __comp=...) at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/stl_heap.h:129
129 /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/stl_heap.h: No such file or directory.
Obviously the latter part gives some pointers (I probably need to find and install stl_heap.h), but I don't have admin access on the university cluster, so I'd like to really understand what needs to be done, and why this is happening, before going and pestering the sysadmins.