Optimizing concatenation of tensors

I’m dealing with concatenation of tensors on the CPU and GPU, and I’m trying to optimize the performance of the process.

Which option should be faster/preferred (if any)?

torch.cat([x.view(1, 3) for x in array], dim=0).to(device)

torch.cat([x.view(1, 3).to(device) for x in array], dim=0)

where device="cuda".

In other words, is it faster to call torch.cat on tensors that are already on the CUDA device, or should I rather concatenate on the CPU and move the result to the GPU afterwards? Does it depend on the specific case?

Many thanks in advance!

Yes, as usual.

Triggering a single copy kernel should be faster than calling one per tensor in a loop. However, in your use case it would also depend on the available GPU memory, as the torch.cat call would increase the GPU memory usage (assuming x resides on the GPU and device is set to "cpu").

In the end it would also depend on the tensor shapes, number of tensors, etc., so you might want to profile your actual workload.
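
A rough sketch of such a profiling run could look like the following; the helper names and the synthetic array of small tensors are my own for illustration, and torch.utils.benchmark takes care of warmup and CUDA synchronization:

```python
import torch
from torch.utils import benchmark

# fall back to CPU so the sketch also runs without a GPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# synthetic stand-in for the question's `array`: many small CPU tensors
array = [torch.randn(3) for _ in range(1000)]

def cat_then_move(array, device):
    # option 1: concatenate on the CPU, then a single host-to-device copy
    return torch.cat([x.view(1, 3) for x in array], dim=0).to(device)

def move_then_cat(array, device):
    # option 2: one host-to-device copy per tensor, then concatenate on the GPU
    return torch.cat([x.view(1, 3).to(device) for x in array], dim=0)

for stmt in ("cat_then_move(array, device)", "move_then_cat(array, device)"):
    timer = benchmark.Timer(
        stmt=stmt,
        globals={
            "array": array,
            "device": device,
            "cat_then_move": cat_then_move,
            "move_then_cat": move_then_cat,
        },
    )
    print(timer.timeit(100))
```

Rerunning this with your real tensor shapes and list lengths should show which variant wins for your workload.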

Thanks for the reply; I will perform detailed profiling to check which option suits my use case better.
Many thanks!