Torch.index_put on GPU

Hello everyone,

First of all, enjoy the Christmas holidays. Secondly, I have a question about the function torch.index_put. It receives indices (a tuple of LongTensors) as input, and I am wondering whether this function is supported on the GPU, or whether it moves all the involved data to RAM and runs on the CPU. A tuple is a Python object and therefore not available on the GPU, isn't it? It is also a little confusing, at least to me, why indices is a tuple and not just a single torch.Tensor with dtype torch.long. One possible explanation would be that with accumulate=True something like an atomic add is required, so the function itself might not be suitable for the GPU, but I don't know whether this is true.
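For reference, my current understanding is that the tuple holds one LongTensor per indexed dimension, i.e. the same semantics as advanced indexing (just a small illustration of what I mean, not taken from the docs):

import torch

a = torch.zeros(3, 3)
rows = torch.tensor([0, 1, 2])  # index tensor for dimension 0
cols = torch.tensor([2, 1, 0])  # index tensor for dimension 1
vals = torch.ones(3)

# Equivalent to a[rows, cols] = vals, but out of place
out = torch.index_put(a, (rows, cols), vals)
print(out)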

Minimal example:

import torch as tr
import time

device=tr.device('cuda:0')
#device=tr.device('cpu')

N      = 10000
repeat = 10
x      = tr.ones(N,  1, device=device)
y      = tr.zeros(x.shape, device=device)
idx    = tr.zeros(1, N, dtype=tr.long, device=device)
z      = tr.zeros(1,1, device=device)
start  = tr.cuda.Event(enable_timing=True)
end    = tr.cuda.Event(enable_timing=True)

start.record()
#start = time.time()
for i in range(repeat):
    z = tr.index_put(z, tuple(idx), x, accumulate=True)

#end = time.time()
end.record()

# Waits for everything to finish running
tr.cuda.synchronize()

print(start.elapsed_time(end)/repeat)
#print((end-start)/repeat)

print(z)

The GPU is much slower than the CPU for this example.

Hopefully someone can help.
Greetings

Yes, index_put is available on the GPU; the data is extracted from the passed Python objects (such as the tuple of index tensors), so it does not fall back to the CPU.
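A quick way to convince yourself is to keep every tensor on the GPU and check where the result lives (a minimal sketch, assuming a CUDA device is available; the tuple is only a Python container, the index tensors inside it stay on the device):

import torch

if torch.cuda.is_available():
    dev  = torch.device('cuda:0')
    z    = torch.zeros(4, device=dev)
    idx  = (torch.tensor([0, 1, 3], device=dev),)  # tuple of LongTensors, all on the GPU
    vals = torch.ones(3, device=dev)

    out = torch.index_put(z, idx, vals)
    print(out, out.device)  # tensor([1., 1., 0., 1.], device='cuda:0')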
The CUDA implementation calls index_put_impl_ and dispatches to index_put_with_sort_stub if accumulate=True or deterministic algorithms are selected, which then calls index_put_with_sort_kernel.
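The sort-based kernel is what makes accumulate=True work correctly with repeated indices on the GPU (every contribution is summed instead of relying on which write lands last), and the same path is taken when torch.use_deterministic_algorithms(True) is set. A small check of the accumulate semantics (my own sketch, not taken from the dispatch code):

import torch

dev  = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
t    = torch.zeros(5, device=dev)
idx  = (torch.tensor([0, 2, 2, 2], device=dev),)  # index 2 repeats three times
vals = torch.tensor([1., 1., 1., 1.], device=dev)

# accumulate=True sums every contribution, so the repeated index receives 3.0
print(torch.index_put(t, idx, vals, accumulate=True))
# tensor([1., 0., 3., 0., 0.])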