Illegal memory access with custom CUDA module when using nn.DataParallel

I wrote a custom CUDA module, following the provided example: https://github.com/pytorch/extension-ffi

So far, my modules have worked flawlessly. Note that unlike the example, I am using custom CUDA kernels. However, when I started training my model on multiple GPUs, I ran into ‘illegal memory access’ exceptions in the backward call. Since the custom CUDA module works perfectly fine on a single GPU, and since the exact same CUDA code works perfectly fine with Torch on multiple GPUs, I suspect that something in my PyTorch-related Python code is wrong:

class CustomFunction(torch.autograd.Function):
	def __init__(self):
		super(CustomFunction, self).__init__()
	# end

	def forward(self, input1, input2, input3):
		self.save_for_backward(input1, input2, input3)

		assert(input1.is_contiguous() == True)
		assert(input2.is_contiguous() == True)
		assert(input3.is_contiguous() == True)

		output = input1.new(...).zero_()

		# call the C portion, which calls the CUDA portion, which calls the forward kernel

		return output
	# end

	def backward(self, gradOutput):
		input1, input2, input3 = self.saved_tensors

		assert(gradOutput.is_contiguous() == True)

		gradInput1 = input1.new(...).zero_()
		gradInput2 = input1.new(...).zero_()
		gradInput3 = input1.new(...).zero_()

		# call the C portion, which calls the CUDA portion, which calls the backward kernel

		return gradInput1, gradInput2, gradInput3
	# end
# end
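
For reference, this is roughly how the function gets used: it is wrapped in a small nn.Module and the whole model is then handed to nn.DataParallel. The names in this sketch are placeholders rather than my actual model, and it assumes the old-style autograd interface from the snippet above:

import torch
import torch.nn as nn

class CustomModule(nn.Module):
	def forward(self, input1, input2, input3):
		# create a new Function object per call since it stores the saved tensors on itself
		return CustomFunction()(input1, input2, input3)
	# end
# end

# nn.DataParallel scatters the inputs along the batch dimension, replicates the module
# onto every visible GPU, runs the replicas in parallel, and gathers the outputs back
moduleParallel = nn.DataParallel(CustomModule()).cuda()
output = moduleParallel(input1, input2, input3)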

When the call to the CUDA kernel in the backward function is removed, the ‘illegal memory access’ error does not appear, but the model of course does not train correctly due to the missing gradients. I thus suspect some issue with the memory allocation and have already tried using gradOutput.new instead of input1.new in the backward function. I likewise used .get_device() to make sure that all the tensors within a single backward call are on the same GPU, and I can confirm that they are.
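
Concretely, the device check in the backward function looks roughly like this (debugging code only, using the names from the snippet above):

		# confirm that every tensor involved in this backward call lives on the same GPU
		deviceList = [ tensor.get_device() for tensor in [ gradOutput, input1, input2, input3, gradInput1, gradInput2, gradInput3 ] ]
		assert len(set(deviceList)) == 1, 'tensors are spread across devices: ' + str(deviceList)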

I am normally very reluctant to ask for help, but I am afraid that I might be missing something fundamental here. Is anyone able to provide any insight? Thank you very much for making PyTorch happen, by the way; it has been a joy to work with so far!

After tinkering with my code for a while, I noticed that changing from CUDA_VISIBLE_DEVICES="1,2" to CUDA_VISIBLE_DEVICES="0,1" makes everything work without any issues. Why does GPU 0 have to be included in the visible devices when using a custom CUDA module? What am I missing? Thanks!
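
One guess as to why GPU 0 has to be visible: the kernel might be getting launched in the default device context instead of on the GPU that actually owns the input tensors. I have not verified this yet, but a guard around the kernel calls, roughly like the sketch below, should rule it out:

		# make the GPU that owns the inputs the current device for the duration of the
		# kernel launch, instead of relying on whatever device happens to be current
		with torch.cuda.device_of(input1):
			pass # call the C portion, which calls the CUDA portion, as before
		# end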

This is kinda weird and shouldn’t happen. Is there any way I can debug this further? Can you open an issue on github.com/pytorch/pytorch with code that reproduces the problem?

Thank you for your message, Soumith! I just created a repository that can potentially also serve as a reference for other people who are trying to create a CUDA extension: https://github.com/sniklaus/pytorch-extension

This extension simply computes the Hadamard product using a custom CUDA kernel. Everything works well when it is executed on a single graphics card, but once more than one GPU is used, the device configuration seems to affect the execution. Please see the test.py file for more details.
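
In essence, test.py computes the Hadamard product once with the custom kernel and once with PyTorch’s native element-wise multiplication and prints the largest absolute difference, which should be 0.0. A rough sketch of the idea (the class names and sizes here are placeholders, the actual test.py differs in the details):

import torch
import torch.nn as nn
from torch.autograd import Variable

# HadamardFunction / HadamardModule stand in for the classes from the repository
input1 = Variable(torch.randn(2, 3, 8, 8).cuda())
input2 = Variable(torch.randn(2, 3, 8, 8).cuda())

# the custom kernel and the native multiplication should agree exactly
output = HadamardFunction()(input1, input2)
print((output - input1 * input2).abs().max().data[0])

print('switching to DataParallel mode')

# the same comparison, but with the module replicated across all visible GPUs
moduleParallel = nn.DataParallel(HadamardModule()).cuda()
output = moduleParallel(input1, input2)
print((output - input1 * input2).abs().max().data[0])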

Hey Simon,

I’ve run your code, and it’s not failing for me. Also, it always computes 0.0; is this unexpected? This is what I get:

CUDA_VISIBLE_DEVICES="1,3" python test.py
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
switching to DataParallel mode
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0

for me, this works with
    export CUDA_VISIBLE_DEVICES="0"
    export CUDA_VISIBLE_DEVICES="1"
    export CUDA_VISIBLE_DEVICES="2"
    export CUDA_VISIBLE_DEVICES="3"
    export CUDA_VISIBLE_DEVICES="0,1"
    export CUDA_VISIBLE_DEVICES="2,3"
and fails with many others like
    export CUDA_VISIBLE_DEVICES="0,2"
    export CUDA_VISIBLE_DEVICES="0,3"
    export CUDA_VISIBLE_DEVICES="1,2"
    export CUDA_VISIBLE_DEVICES="1,3"
    export CUDA_VISIBLE_DEVICES="0,1,2,3"

Thank you for spending your Sunday trying this out! I have also noticed that you have looked into many other topics, thank you, Soumith! I am honestly amazed by how often I come across your work: papers, posts, repositories, talks. I would love to talk to you in person at some point. I will be in Hawaii for CVPR at the end of the month, so let me know if you are around and want to grab shaved ice.

I just tried running it on a machine different from my main server and noticed that it works there without any issues as well. This is quite surprising since they only differ in the graphics cards used (the remaining hardware is identical, they even use the same motherboard). My main server has 4x Titan X (Maxwell), while the other machine I just tried has 4x Titan X (Pascal). They also both run the same versions of PyTorch, CUDA, and the NVIDIA drivers. I will investigate this further and report back once I have a better idea of what’s going on. Thanks again!
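
Since the failing configurations are exactly the cross pairs like 0,2 and 1,3 while 0,1 and 2,3 work, one thing I still want to compare between the two machines is the peer-to-peer topology. Assuming a PyTorch build that exposes torch.cuda.can_device_access_peer, a quick sketch:

import torch

# print which GPU pairs report direct peer-to-peer access; comparing this
# matrix between the two machines might explain why some pairs fail
for intFirst in range(torch.cuda.device_count()):
	for intSecond in range(torch.cuda.device_count()):
		if intFirst != intSecond:
			print(intFirst, intSecond, torch.cuda.can_device_access_peer(intFirst, intSecond))
		# end
	# end
# end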

I will not be at CVPR this year, but hope to meet you elsewhere.

I just had the chance to use a machine with the same hardware as my main server, and it likewise does not cause the error. I doubt that the issue is related to PyTorch, and I will post an update should I ever find out what is causing this behavior. If somebody else is able to provide more input, I would be happy to hear it though. 🙂
