Many small device transfers vs one large one

Hi,
I tested whether moving one large tensor from CPU to CUDA takes less time than copying the same amount of data split into chunks, as follows:

import torch
from time import time

arr = [torch.randn(1000) for _ in range(1000)]   # 1000 small tensors
large = torch.randn(1000 * 1000)                 # one tensor with the same total number of elements

start = time()
large.cuda()                     # single host-to-device copy
torch.cuda.synchronize()
print(time() - start)  # 0.0015168190002441406

start = time()
for x in arr:
    x.cuda()                     # one host-to-device copy per chunk
torch.cuda.synchronize()
print(time() - start)  # 0.023027658462524414

The numbers vary from run to run, but the difference in scale remains.
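
For what it's worth, here is a sketch of a slightly more careful version of the measurement using torch.cuda.Event timers, with a warm-up transfer so CUDA context creation isn't counted (this assumes a single visible CUDA device):

import torch

x = torch.randn(1000 * 1000)

torch.randn(1).cuda()           # warm-up: triggers CUDA context creation
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
x.cuda()
end.record()
torch.cuda.synchronize()
print(start.elapsed_time(end))  # elapsed time in milliseconds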
What motivated me to check was noticing that calling model.cuda() effectively boils down to a recursive call to param.cuda() for each parameter.
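
For context, what I mean is that model.cuda() roughly amounts to a per-parameter loop like the sketch below (naive_module_cuda is just an illustrative name; the real code goes through Module._apply and also handles buffers and gradients):

import torch
from torch import nn

def naive_module_cuda(module: nn.Module) -> nn.Module:
    # Simplified stand-in for model.cuda(): one separate
    # host-to-device copy per parameter (buffers and grads omitted).
    for p in module.parameters():
        p.data = p.data.cuda()
    return module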

My question: would it be more efficient to replace the recursive per-parameter call with something that first packs all the parameters into a single tensor on the CPU, moves it to the GPU once, and then unpacks it back into the individual parameters? I would think so, and I assume the reason it isn't done is that the code would be ugly (though it would only have to be written once?). My benchmark also didn't simulate the CPU-to-CPU copies that the packing step requires, but I assume those take less time than the CPU-to-GPU copy. A minimal sketch of what I have in mind is below.
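
Here is the pack-then-move idea sketched out (packed_cuda is a hypothetical helper, not an existing API); note that afterwards every parameter is a view into one flat GPU buffer:

import torch
from torch import nn

def packed_cuda(module: nn.Module) -> nn.Module:
    # Hypothetical sketch: flatten all parameters into one contiguous
    # CPU tensor, move it to the GPU in a single copy, then carve
    # views back out for each parameter.
    params = list(module.parameters())
    flat = torch.cat([p.data.reshape(-1) for p in params])  # CPU-to-CPU copies
    flat = flat.cuda()                                       # one CPU-to-GPU copy
    offset = 0
    for p in params:
        n = p.numel()
        p.data = flat[offset:offset + n].view_as(p)          # view into the flat buffer
        offset += n
    return module

One consequence is that the parameters no longer each own their own storage, which might interact oddly with optimizers or later in-place resizing, so this is only meant to illustrate the transfer pattern.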

Any thoughts?