Move tensors to GPU and concatenate them at the same time?

Given multiple tensors of the same shape on the CPU (e.g. A = [x, y, z]), is it possible to move them to the GPU and concatenate them at the same time? I know I can first move each individual tensor to the GPU (A = [a.to(device) for a in A]) and then concatenate them afterwards (A = torch.cat(A, dim=cat_dim)), but that incurs an extra copy of the data on the GPU during the concatenation.
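For reference, here is a minimal version of that two-step approach (device, cat_dim, and the example tensors are just placeholders):

```python
import torch

device = torch.device("cuda")
cat_dim = 0

# Example CPU tensors of the same shape (placeholders for x, y, z)
A = [torch.randn(4, 8) for _ in range(3)]

# Step 1: copy each tensor to the GPU individually
A_gpu = [a.to(device) for a in A]

# Step 2: concatenate on the GPU -- this allocates a new tensor and
# copies every element again, which is the extra copy I want to avoid
A_cat = torch.cat(A_gpu, dim=cat_dim)
```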

It feels like it would be more efficient to first allocate the space needed for the concatenated tensor on the GPU and then move each CPU tensor directly into its target position, assuming it is possible to specify a target memory location when moving a tensor to the GPU. Is this possible somehow?
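To make the idea concrete, here is roughly what I have in mind, assuming concatenation along dim 0 and using slice views plus copy_ into a preallocated buffer (I am not sure whether this actually avoids the intermediate copy, or whether there is a more idiomatic way to do it):

```python
import torch

device = torch.device("cuda")

# Example CPU tensors of the same shape (placeholders); pinned memory
# so the host-to-device copies can be asynchronous
A = [torch.randn(4, 8).pin_memory() for _ in range(3)]

# Preallocate the concatenated tensor on the GPU (concatenating along dim 0)
total_rows = sum(a.shape[0] for a in A)
out = torch.empty(total_rows, *A[0].shape[1:], device=device, dtype=A[0].dtype)

# Copy each CPU tensor straight into its slice of the preallocated buffer
offset = 0
for a in A:
    out[offset:offset + a.shape[0]].copy_(a, non_blocking=True)
    offset += a.shape[0]
```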