Thanks for the answer!
I’m pretty sure that all the tensors going into the C++ functions are on the same GPU:
The data going in looks something like this:

import torch

a: torch.Tensor = ...  # a tensor that already lives on GPU x ("cuda:x")
b = a.where(a[:, :, :, 0:1] > 0, torch.zeros(1, device=a.device))  # 0:1 instead of 0 so the mask stays broadcastable against a's shape
result = cpp_extension_function(a, b)
I’m also not moving data between GPUs. I launch the Python scripts with the GPU they should use, and that’s it.
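For context, the launch looks roughly like this (the script structure and the --gpu flag are made up for illustration):

# hypothetical launcher: the target GPU index is passed as an argument,
# e.g.  python my_script.py --gpu 2
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--gpu", type=int, default=0)
args = parser.parse_args()
device = torch.device(f"cuda:{args.gpu}")  # all tensors are created on this device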
Now I had another idea: instead of putting the tensors on the right GPU via PyTorch’s cuda:x device strings, I switched to NVIDIA’s environment variable CUDA_VISIBLE_DEVICES=x. That works as expected. But I’d prefer the C++/CUDA functions to behave correctly even when the visible devices are not restricted.
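A minimal sketch of the environment-variable approach, assuming it is set before torch initializes CUDA (the index "2" is just a placeholder for x):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"  # placeholder for GPU x; must be set before CUDA is initialized
import torch

print(torch.cuda.device_count())  # 1: the masked GPU is the only visible device, exposed as cuda:0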
If I really don’t have to do anything special to launch the kernel on the correct GPU (it figures that out by itself), then I’m puzzled, and I’ll put together the minimal example on Monday.
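One thing I could still try before that: a minimal sketch, assuming the extension launches its kernel on the current CUDA device rather than on the inputs’ device, and making the two match before the call:

import torch

# a and b as in the snippet above
with torch.cuda.device(a.device):          # make the current device match the inputs
    result = cpp_extension_function(a, b)  # any kernel launched inside now targets a's GPU

If this changes the behavior, it would confirm that the kernel is launched on the current device (which defaults to GPU 0) and not on the device of the input tensors.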