How to improve performance of tensor plus constant array

I am moving from CUDA C to PyTorch to achieve high-performance parallel computing.
If I need to add a constant to each slice of a tensor, the following approach certainly works.

t = torch.ones(10, 10000, 1000)
k_a = range(10)
for i in range(10):
   t[i] = t[i] + k_a[i]

But to achieve better performance:
Do I need to copy k_a to GPU memory first?
Can I copy k_a to GPU constant memory?
Or is there anything else I can do to improve it?


I would suggest creating k_a using torch.arange(10).float() instead of the Python range.

Loops are generally slower than vectorized code, so you can unsqueeze k_a in dim1 and dim2 and add it in a single call:

device = 'cuda'
t = torch.ones(10, 10000, 1000, device=device)
k_a = torch.arange(10, device=device).float()
ret = t + k_a.view(-1, 1, 1)

If you set device='cuda', this operation will automatically be executed on the GPU.
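To convince yourself the broadcasted addition matches the original loop, here is a small sketch (shapes shrunk from the original 10x10000x1000 just so the check runs quickly; the same code works with device='cuda' when a GPU is available):

```python
import torch

# Smaller shapes than in the question, just for a quick equivalence check
t = torch.ones(10, 100, 50)
k_a = torch.arange(10).float()

# Loop version from the question: add k_a[i] to slice i
expected = t.clone()
for i in range(10):
    expected[i] = expected[i] + k_a[i]

# Single vectorized call via broadcasting:
# k_a.view(-1, 1, 1) has shape (10, 1, 1) and broadcasts over dims 1 and 2
ret = t + k_a.view(-1, 1, 1)

print(torch.allclose(ret, expected))  # → True
```

Note that view(-1, 1, 1) and two unsqueeze calls (k_a.unsqueeze(1).unsqueeze(2)) produce the same (10, 1, 1) shape here; either form triggers the same broadcasting.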