I’m dealing with concatenation of tensors on CPU and GPU, and I’m trying to optimize the performance of the process.
Which option should be faster/preferred (if any)?
torch.cat([x.view(1,3) for x in array],dim=0).to(device)
torch.cat([x.view(1,3).to(device) for x in array],dim=0)
In other words, is it faster to call
torch.cat with the tensors already on CUDA, or should I rather concatenate on the CPU and move the result afterwards? Does it depend on the specific case?
Many thanks in advance!
Yes, as usual.
Triggering a single copy kernel should be faster than calling one in a loop. However, in your use case it would also depend on the available GPU memory, as the
torch.cat call would increase GPU memory usage assuming
x resides on the GPU and
device is set to
cuda.
In the end it would also depend on the tensor shapes, the number of tensors, etc., so you might want to profile your actual workload.
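As a starting point for such profiling, here is a minimal sketch that times both variants from the question. The helper names (`cat_then_move`, `move_then_cat`) and the tensor count are arbitrary choices for illustration; note the `torch.cuda.synchronize()` calls, which are needed for meaningful timings because CUDA kernels launch asynchronously.

```python
import time

import torch

def cat_then_move(array, device):
    # Option 1: concatenate on the CPU, then copy the result once.
    return torch.cat([x.view(1, 3) for x in array], dim=0).to(device)

def move_then_cat(array, device):
    # Option 2: copy each tensor to the device, then concatenate there.
    return torch.cat([x.view(1, 3).to(device) for x in array], dim=0)

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    array = [torch.randn(3) for _ in range(1000)]

    for fn in (cat_then_move, move_then_cat):
        fn(array, device)  # warm-up run (CUDA context init, caching allocator)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(10):
            out = fn(array, device)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed_ms = (time.perf_counter() - start) / 10 * 1e3
        print(f"{fn.__name__}: {elapsed_ms:.3f} ms")
```

Which variant wins will depend on the tensor shapes, the number of tensors, and where `x` already resides, so the numbers from your actual workload are what matter.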
Thanks for the reply; I will perform detailed profiling to check which option suits my use case better.