I’m trying to improve the performance of my PyTorch code by copying all my results from the GPU into a preallocated page-locked (pinned) tensor. The results from the GPU vary in size along the first dimension, so I created a pinned buffer large enough to hold the maximum-sized result.
What is the most performant way to copy the undersized results into the pinned memory and then into numpy arrays of their original size?
import torch
channels = 8
maxsize = 10
# create a page-locked buffer on the cpu of maxsize in first dimension
buf = torch.zeros([maxsize, channels], pin_memory=True)
# create smaller tensors on the gpu
data = [
    torch.empty((i, channels), device="cuda").fill_(i)
    for i in range(1, 6)
]
model = lambda x: x
results = []
for res in data:
    # run dummy model
    res = model(res)
    # best way to async copy the smaller res tensor from the gpu?
    buf[:res.shape[0], :] = res
    # check buffer looks correct
    print(buf[:, 0], buf.data_ptr())
    # do something with the correct sized results
    results.append(buf[:res.shape[0], :].clone().numpy())
# check results look correct
for res in results:
    print(res[:, 0])
tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]) 140124862021632
tensor([2., 2., 0., 0., 0., 0., 0., 0., 0., 0.]) 140124862021632
tensor([3., 3., 3., 0., 0., 0., 0., 0., 0., 0.]) 140124862021632
tensor([4., 4., 4., 4., 0., 0., 0., 0., 0., 0.]) 140124862021632
tensor([5., 5., 5., 5., 5., 0., 0., 0., 0., 0.]) 140124862021632
[1.]
[2. 2.]
[3. 3. 3.]
[4. 4. 4. 4.]
[5. 5. 5. 5. 5.]
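For reference, a minimal sketch of the approach I'm considering: use `Tensor.copy_` with `non_blocking=True` into a narrowed view of the pinned buffer, then synchronize before reading the buffer on the host. (The CUDA-availability guard is my addition so the snippet also runs on a CPU-only machine, where `pin_memory` and `non_blocking` have no effect.)

```python
import torch

channels = 8
maxsize = 10

# pinned memory needs a CUDA-capable setup; fall back to plain CPU tensors otherwise
use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

buf = torch.zeros([maxsize, channels], pin_memory=use_cuda)
data = [
    torch.empty((i, channels), device=device).fill_(i)
    for i in range(1, 6)
]

results = []
for res in data:
    n = res.shape[0]
    # copy_ into a view of the pinned buffer; with a GPU source and a pinned
    # destination, non_blocking=True lets the transfer run asynchronously
    buf[:n].copy_(res, non_blocking=True)
    if use_cuda:
        # the async copy must finish before the host reads the buffer
        torch.cuda.synchronize()
    # clone before converting, since the buffer is reused on the next iteration
    results.append(buf[:n].clone().numpy())

for r in results:
    print(r[:, 0])
```

My understanding is that the slice assignment `buf[:n] = res` and `buf[:n].copy_(res)` issue the same device-to-host transfer, but only `copy_` exposes the `non_blocking` flag, and without an explicit synchronization point the subsequent `clone()` may read the buffer before the copy completes.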