I tested whether moving one large tensor from CPU to CUDA takes less time than copying the same amount of data split into chunks, as follows:
```python
import torch
from time import time

arr = [torch.randn(1000) for _ in range(1000)]
large = torch.randn(1000 * 1000)

start = time()
large.cuda()
torch.cuda.synchronize()
print(time() - start)  # 0.0015168190002441406

start = time()
for x in arr:
    x.cuda()
torch.cuda.synchronize()
print(time() - start)  # 0.023027658462524414
```
The numbers vary from run to run, but the difference in scale remains.
What motivated my checking was seeing that
model.cuda() is actually a recursive call (via Module._apply) that moves each parameter and buffer tensor to the GPU one at a time.
My question: would it be more efficient to replace the recursive call with something that first packs all the parameters into a single tensor on the CPU, moves it once, and then unpacks it back? I would think so, and that the reason it's not done is that the code would be ugly (though it would only have to be written once?). My benchmark also didn't simulate the CPU-to-CPU copies, but I assume those take less time than the CPU-to-GPU copy.
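To make the idea concrete, here is a minimal sketch of the pack-move-unpack approach (move_flat is a hypothetical helper, not an existing PyTorch API). Note that the torch.cat is exactly the extra CPU-to-CPU copy mentioned above:

```python
import torch

def move_flat(tensors, device):
    """Pack tensors into one flat buffer, transfer once, unpack as views."""
    # CPU-to-CPU copy: concatenate all data into a single contiguous tensor.
    flat = torch.cat([t.reshape(-1) for t in tensors])
    # A single host-to-device transfer instead of one per tensor.
    flat = flat.to(device)
    # Unpack: each output tensor is a view into the flat device buffer.
    out, offset = [], 0
    for t in tensors:
        n = t.numel()
        out.append(flat[offset:offset + n].view(t.shape))
        offset += n
    return out

# moved = move_flat(arr, "cuda")
```

One side effect of this sketch is that the unpacked tensors share storage with the flat buffer, which differs from how Module parameters normally each own their storage.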