Segfault when indexing into a CUDA pointer

I’m extracting a pointer to the underlying GPU memory of a uint8 tensor like so:

auto opt = torch::TensorOptions()
               .dtype(torch::kUInt8)
               .device(torch::DeviceType::CUDA)
               .memory_format(c10::MemoryFormat::Contiguous);
auto buffer = torch::zeros(size, opt);
auto *array = buffer.data_ptr<uint8_t>();

However, if I try to index into that pointer or do anything else with it, it segfaults. Why? The memory should be contiguous. I imagine it has something to do with memory formatting and/or locks/access, but I can’t seem to find good documentation on it. I would also be grateful if you happen to know of any resources that could help in this direction.

array[0]; // segfaults

It seems you are trying to access device data from the host, which is undefined behavior and can segfault.

Thanks for the reply! I am indeed accessing it from the host. So, you would say this kind of access has to happen from a CUDA block?

The weird thing is that this worked in a previous installation. Did PyTorch add memory protection recently?

Yes, you need to either access the device array in a kernel or copy it back to the CPU.
I don’t think this should ever have worked, as this is the expected behavior in CUDA; PyTorch did not add any memory protection.
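
For the libtorch snippet from the first post, the copy-back route could look roughly like this (a minimal sketch, assuming the buffer tensor from above; .cpu() performs the device-to-host copy, so the resulting pointer is safe to dereference on the host):

auto host_buffer = buffer.cpu();                    // device -> host copy
auto *host_array = host_buffer.data_ptr<uint8_t>(); // now a plain host pointer
printf("%d\n", host_array[0]);                      // safe to index on the host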

E.g. look at this simple example which:

  • allocates host and device memory
  • fills the host array with values
  • copies the host array to the device array via cudaMemcpy
  • launches the compute kernel, which indexes the device array
  • copies the device array back to the host
  • prints the host array via indexing it
  • frees the allocations
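
A condensed sketch of those steps might look like this (not the full example referenced above; the array size and kernel are made up for illustration, and error checking is omitted):

#include <cstdio>
#include <cuda_runtime.h>

// Kernel indexes the device array; device memory may only be touched here.
__global__ void add_one(int *da, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) da[i] += 1;
}

int main() {
    const int n = 16;
    int ha[n];                                    // host array
    int *da = nullptr;
    cudaMalloc(&da, n * sizeof(int));             // device allocation

    for (int i = 0; i < n; ++i) ha[i] = i;        // fill host array with values

    cudaMemcpy(da, ha, n * sizeof(int), cudaMemcpyHostToDevice); // host -> device

    add_one<<<1, n>>>(da, n);                     // kernel indexes the device array

    // printf("%d\n", da[0]);                     // invalid: host dereference of a device pointer

    cudaMemcpy(ha, da, n * sizeof(int), cudaMemcpyDeviceToHost); // device -> host

    for (int i = 0; i < n; ++i) printf("%d ", ha[i]); // print host array via indexing
    printf("\n");

    cudaFree(da);                                 // free the device allocation
    return 0;
}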

Now, add an invalid access, e.g. via printf("%d\n", da[0]); in line 69, and you will get a Segmentation fault.

Thank you for your replies! I can see now why this shouldn’t work. I resolved it by writing CUDA kernels that operate on the memory allocated on the GPU.
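
For completeness, the kernel route with the original buffer tensor could look something like this rough sketch (the kernel name, fill value, and launch configuration are made up; error checking is omitted):

// Device side: the kernel receives the tensor's device pointer and indexes it there.
__global__ void fill_bytes(uint8_t *data, int64_t n) {
    int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] = 42;
}

// Host side: pass the CUDA tensor's data_ptr into the kernel instead of dereferencing it.
auto *array = buffer.data_ptr<uint8_t>();
int64_t n = buffer.numel();
int threads = 256;
int blocks = static_cast<int>((n + threads - 1) / threads);
fill_bytes<<<blocks, threads>>>(array, n);
cudaDeviceSynchronize(); // wait for the kernel before using the results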