Segmentation fault in multi-GPU training (randperm call)

I am using torch.nn.DataParallel to parallelize my training over 2 GPUs.
When I run my training script on a single GPU it works fine:
CUDA_VISIBLE_DEVICES=0 python train.py

When I switch to the multi-GPU case (setting CUDA_VISIBLE_DEVICES=0,1), training fails at some point (after ~3.5 epochs) with a segmentation fault.
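For reference, the multi-GPU setup is just the standard DataParallel wrapping, roughly like the sketch below (the model, data, and loss are toy stand-ins, not my actual training code):

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

# Toy stand-ins for my real model and data, just to show how the wrapping looks.
model = nn.DataParallel(nn.Linear(128, 10)).cuda()   # replicates the module over all visible GPUs
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = Variable(torch.rand(64, 128).cuda())        # the batch gets scattered across the GPUs
targets = Variable(torch.rand(64, 10).cuda())

output = model(inputs)                                # one replica per GPU, run in parallel threads
loss = (output - targets).pow(2).mean()               # my real loss calls torch.randperm in here
optimizer.zero_grad()
loss.backward()
optimizer.step()
```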

The failure is very similar to the one reported here, but in my case I couldn't find a suitable workaround.
This is the stack trace I got from gdb:

Thread 3997 "python" received signal SIGSEGV, Segmentation fault.                                                                                                                                   
[Switching to Thread 0x7fff67dff700 (LWP 30326)]                                                                                                                                                    
0x00007fffcbfc7e8b in THRandom_random () from /home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/lib/libATen.so.1                                                                             
(gdb) bt                                                                                                                                                                                            
#0  0x00007fffcbfc7e8b in THRandom_random () from /home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/lib/libATen.so.1                                                                         
#1  0x00007fffcbf5df28 in THLongTensor_randperm () from /home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/lib/libATen.so.1                                                                   
#2  0x00007fffede35ead in THPLongTensor_stateless_randperm (self=<optimized out>, args=<optimized out>, kwargs=<optimized out>)                                                                     
    at /home/ubuntu/dev/shared/pytorch/torch/csrc/generic/TensorMethods.cpp:54289                                                                                                                   
#3  0x00007ffff7a42773 in PyObject_Call () from /home/ubuntu/anaconda2/bin/../lib/libpython2.7.so.1.0                                                                                               
#4  0x00007fffed89b352 in THPUtils_dispatchStateless (tensor=0x55555630d960, name=0x7fffeea15f84 "randperm", args=0x7fff86fe8c10, kwargs=0x0) at torch/csrc/utils.cpp:160                           
#5  0x00007ffff7adc615 in PyEval_EvalFrameEx () from /home/ubuntu/anaconda2/bin/../lib/libpython2.7.so.1.0                                                                                          
#6  0x00007ffff7ade4e9 in PyEval_EvalCodeEx () from /home/ubuntu/anaconda2/bin/../lib

I’m using randperm in my loss function in the following way:
pos_order = torch.randperm(len(pos_non_zeros)).cuda()

Is there a way to generate a random permutation directly on the GPU (without having to create a CPU tensor and then move it to CUDA)?
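For example, one workaround I was considering (not sure whether it is equivalent or safe) is to build the permutation from random keys drawn directly on the GPU and argsorted there, so the CPU generator that shows up in the stack trace is never touched. A minimal sketch, with `n` as a placeholder for `len(pos_non_zeros)`:

```python
import torch

n = 1000  # placeholder for len(pos_non_zeros)

# Draw uniform random keys on the GPU (CUDA RNG) and argsort them;
# the index tensor returned by sort() is a random permutation of 0..n-1
# that lives on the GPU from the start.
keys = torch.cuda.FloatTensor(n).uniform_()
pos_order = keys.sort()[1]   # torch.cuda.LongTensor with the permutation
```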

Thanks in advance.

Forgot to mention: I am using PyTorch version 0.4.0a0+017893e (built from source).