Custom module working only on GPU 0

I wrote a custom module in C and CUDA. It works fine on GPU 0, but when I switch to GPU 1 (I do have 2 GPUs on my machine), the following error occurs:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THC/THCTensorCopy.cu line=100 error=77 : an illegal memory access was encountered
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THC/THCTensorCopy.cu line=100 error=77 : an illegal memory access was encountered
terminate called after throwing an instance of 'std::runtime_error'
  what():  terminate called recursively
cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THC/THCTensorCopy.cu:100
Aborted (core dumped)

I selected GPU using the following code:

torch.cuda.set_device(args.gpu_id)

A sample of my customized module is as follows:

void updateOutput_cuda(THCudaTensor *input, THCudaTensor *output)
{
    // Make the input contiguous (may return a new tensor reference).
    input = THCudaTensor_newContiguous(state, input);

    // Resize the output and zero it.
    THCudaTensor_resize4d(state, output, batchSize, nInputPlane, outputHeight, outputWidth);
    output = THCudaTensor_newContiguous(state, output);
    THCudaTensor_zero(state, output);

    // ... kernel launch elided ...

    // Release the references taken by newContiguous.
    THCudaTensor_free(state, input);
    THCudaTensor_free(state, output);
}
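One thing I am unsure about: the module never selects a device itself, while THC runs work on whatever the current CUDA device is. Below is a minimal sketch of an explicit device guard, purely hypothetical and not part of my module (the function name updateOutput_cuda_guarded and the variables prevDev/tensorDev are mine):

#include <cuda_runtime.h>

/* Hypothetical guard: switch to the device that owns the tensor's
 * memory before doing any work on it, then restore the old device. */
void updateOutput_cuda_guarded(THCudaTensor *input, THCudaTensor *output)
{
    int prevDev;
    cudaGetDevice(&prevDev);

    /* THCudaTensor_getDevice reports which GPU the storage lives on. */
    int tensorDev = THCudaTensor_getDevice(state, input);
    if (tensorDev != prevDev)
        cudaSetDevice(tensorDev);

    /* ... resize, zero, and kernel launch as in updateOutput_cuda ... */

    cudaSetDevice(prevDev);  /* restore the caller's device */
}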

Thanks for your help!

I can only guess that there is a problem with the copying of your tensor. Are you sure the tensor dimensions are correct?

I think the dimensions are correct, because it works well on GPU 0.
I suspect this problem is related to this issue: https://github.com/pytorch/pytorch/issues/689
But I have no idea where the problem is.

Can you run this:

import pycuda
from pycuda import compiler
import pycuda.driver as drv
import torch
import sys
print('__Python VERSION:', sys.version)
print('__pyTorch VERSION:', torch.__version__)
print('__CUDA VERSION')
from subprocess import call
# call(["nvcc", "--version"]) does not work
! nvcc --version  # shell escape; works in IPython/Jupyter only
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('__Number CUDA Devices:', torch.cuda.device_count())
print('__Devices')
call(["nvidia-smi", "--format=csv", "--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free"])
print('Active CUDA Device: GPU', torch.cuda.current_device())

print('Available devices ', torch.cuda.device_count())
print('Current cuda device ', torch.cuda.current_device())

drv.init()
print("%d device(s) found." % drv.Device.count())
           
for ordinal in range(drv.Device.count()):
    dev = drv.Device(ordinal)
    print(ordinal, dev.name())
Output:

('__Python VERSION:', '2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:09:15) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]')
('__pyTorch VERSION:', '0.2.0_2')
__CUDA VERSION
('__CUDNN VERSION:', 6021)
('__Number CUDA Devices:', 2L)
__Devices
index, name, driver_version, memory.total [MiB], memory.used [MiB], memory.free [MiB]
0, GeForce GTX TITAN X, 375.66, 12204 MiB, 8355 MiB, 3849 MiB
1, GeForce GTX TITAN X, 375.66, 12207 MiB, 9174 MiB, 3033 MiB
('Active CUDA Device: GPU', 0L)
('Available devices ', 2L)
('Current cuda device ', 0L)
2 device(s) found.
(0, 'GeForce GTX TITAN X')
(1, 'GeForce GTX TITAN X')

I commented out this line:

! nvcc --version

The memory usage is consistent with nvidia-smi; I have two processes running on the GPUs.

Is it the first or the second call to THCudaTensor_newContiguous that crashes? Can you debug it?

input = THCudaTensor_newContiguous(state, input);

THCudaTensor_resize4d(state, output, batchSize, nInputPlane, outputHeight, outputWidth);
output = THCudaTensor_newContiguous(state, output);
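If it is hard to tell, note that CUDA errors surface asynchronously, so the crash may be reported far from the call that caused it. Here is a generic bisection sketch (not your code; checkpoint is a name I made up): synchronize and check the error state after each suspect line, and the first checkpoint that fires points at the offending call.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Flush all pending GPU work and abort on the first recorded error.
 * Drop a call to this after each suspect line to localize the crash. */
static void checkpoint(const char *where)
{
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error at %s: %s\n", where, cudaGetErrorString(err));
        abort();
    }
}

For example, checkpoint("after first newContiguous"); right after the first call.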

The error is gone. The problem was possibly that I should not have used OpenMP; removing this pragma fixed it:

#pragma omp parallel for private(elt)
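That would be consistent with the error: in the CUDA runtime the current device is per host thread, so worker threads spawned by an OpenMP parallel region start on the default device (GPU 0) and can launch work there even though the tensors live on GPU 1. If I ever bring the pragma back, something like the sketch below would be needed (untested; n and tensorDev are hypothetical parameters supplied by the caller):

#include <omp.h>
#include <cuda_runtime.h>

void process_elements(int n, int tensorDev)
{
    #pragma omp parallel
    {
        /* The current device is thread-local, so every OpenMP worker
         * must select the tensor's device before touching its memory. */
        cudaSetDevice(tensorDev);

        #pragma omp for
        for (int elt = 0; elt < n; ++elt) {
            /* ... per-element CUDA work ... */
        }
    }
}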