I am using torch.nn.DataParallel to parallelize my training over two GPUs.
When I run my training script on a single GPU it works fine:
CUDA_VISIBLE_DEVICES=0 python train.py
When I switch to the multi-GPU case (setting CUDA_VISIBLE_DEVICES=0,1), training fails at some point (after ~3.5 epochs) with a segmentation fault.
This failure is very similar to the one reported here, but in my case I couldn't find a suitable workaround.
This is the stack trace I got from gdb:
Thread 3997 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff67dff700 (LWP 30326)]
0x00007fffcbfc7e8b in THRandom_random () from /home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/lib/libATen.so.1
(gdb) bt
#0 0x00007fffcbfc7e8b in THRandom_random () from /home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/lib/libATen.so.1
#1 0x00007fffcbf5df28 in THLongTensor_randperm () from /home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/lib/libATen.so.1
#2 0x00007fffede35ead in THPLongTensor_stateless_randperm (self=<optimized out>, args=<optimized out>, kwargs=<optimized out>)
at /home/ubuntu/dev/shared/pytorch/torch/csrc/generic/TensorMethods.cpp:54289
#3 0x00007ffff7a42773 in PyObject_Call () from /home/ubuntu/anaconda2/bin/../lib/libpython2.7.so.1.0
#4 0x00007fffed89b352 in THPUtils_dispatchStateless (tensor=0x55555630d960, name=0x7fffeea15f84 "randperm", args=0x7fff86fe8c10, kwargs=0x0) at torch/csrc/utils.cpp:160
#5 0x00007ffff7adc615 in PyEval_EvalFrameEx () from /home/ubuntu/anaconda2/bin/../lib/libpython2.7.so.1.0
#6 0x00007ffff7ade4e9 in PyEval_EvalCodeEx () from /home/ubuntu/anaconda2/bin/../lib
I’m using randperm in my loss function in the following way:
pos_order = torch.randperm(len(pos_non_zeros)).cuda()
Is there a way to generate a random permutation directly on the GPU (without having to create a CPU tensor and then move it to CUDA)?
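One workaround I've been considering (a sketch, not verified against my PyTorch build — `torch.argsort` and the `device=` keyword may not exist in older versions) is to draw random keys directly on the target device and argsort them, which yields a uniform random permutation without the CPU round-trip:

```python
import torch

def randperm_on_device(n, device="cpu"):
    # Draw n uniform random keys on the target device; the indices that
    # would sort them are a uniform random permutation of range(n).
    keys = torch.rand(n, device=device)
    return torch.argsort(keys)

# Usage (replace "cpu" with "cuda" to keep everything on the GPU):
perm = randperm_on_device(10, device="cpu")
```

I'm not sure whether this sidesteps the crash, though, since the segfault is inside THRandom_random rather than in the CPU-to-GPU copy.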
Thanks in advance.