I am moving from CUDA C to PyTorch to achieve high-performance parallel computing.
If I need to add a constant to each slice of a tensor, the following certainly works:
t = torch.ones(10, 10000, 1000)
k_a = range(10)
for i in range(10):
    t[i] = t[i] + k_a[i]
But to achieve better performance:
Do I need to copy k_a to GPU memory first?
Can I copy k_a to GPU constant memory?
Or is there anything else I can do to improve it?
I would suggest creating k_a as
torch.arange(10).float() instead of a Python range.
Loops are generally slower than vectorized code, so you could unsqueeze
k_a in dim1 and dim2 and add it in a single call:
device = 'cuda'
t = torch.ones(10, 10000, 1000, device=device)
k_a = torch.arange(10, device=device).float()
ret = t + k_a.view(-1, 1, 1)
If you set
device='cuda', this operation will automatically be executed on the GPU.
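To see why this works, here is a small sketch (run on CPU with a smaller shape so it checks quickly; the device choice is an assumption for illustration) comparing the loop version with the broadcast version. Unsqueezing k_a in dim1 and dim2 gives it shape (10, 1, 1), the same shape view(-1, 1, 1) produces, so it broadcasts across the last two dimensions:

```python
import torch

# CPU and a smaller tensor here just to demonstrate broadcasting quickly;
# replace device with 'cuda' to run the same code on the GPU.
device = 'cpu'

t = torch.ones(10, 100, 50, device=device)
k_a = torch.arange(10, device=device).float()

# Loop version from the question.
looped = t.clone()
for i in range(10):
    looped[i] = looped[i] + k_a[i]

# Vectorized version: unsqueeze k_a to shape (10, 1, 1) so it broadcasts
# across dim1 and dim2 in a single kernel launch.
vectorized = t + k_a.unsqueeze(1).unsqueeze(2)

print(torch.equal(looped, vectorized))  # → True
```

On the GPU the vectorized form avoids launching one small kernel per iteration, which is where most of the speedup comes from.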