Best way to reuse an oversized pinned buffer

I’m trying to improve the performance of my PyTorch code by copying all of my results from the GPU into a preallocated page-locked tensor. The results vary in size along the first dimension, so I created a pinned buffer large enough to hold the largest result.

What is the most performant way to copy the undersized results into the pinned buffer, and then out to NumPy arrays of their original sizes?

import torch

channels = 8
maxsize = 10

# create a page locked buffer on the cpu of maxsize in first dimension
buf = torch.zeros([maxsize, channels], pin_memory=True)

# create smaller tensors on the gpu
data = [
    torch.empty((i, channels), device="cuda").fill_(i)
    for i in range(1, 6)
]

model = lambda x: x
results = []

for res in data:

    # run dummy model
    res = model(res)

    # best way to async copy the smaller res tensor from the gpu?
    buf[:res.shape[0], :] = res

    # check buffer looks correct
    print(buf[:, 0], buf.data_ptr())

    # do something with the correct sized results
    results.append(buf[:res.shape[0], :].clone().numpy())


# check results look correct
for res in results:
    print(res[:, 0])

Output:
tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]) 140124862021632
tensor([2., 2., 0., 0., 0., 0., 0., 0., 0., 0.]) 140124862021632
tensor([3., 3., 3., 0., 0., 0., 0., 0., 0., 0.]) 140124862021632
tensor([4., 4., 4., 4., 0., 0., 0., 0., 0., 0.]) 140124862021632
tensor([5., 5., 5., 5., 5., 0., 0., 0., 0., 0.]) 140124862021632
[1.]
[2. 2.]
[3. 3. 3.]
[4. 4. 4. 4.]
[5. 5. 5. 5. 5.]

In theory, the asynchronous copy would be:

buf.narrow(0, 0, res.shape[0]).copy_(res, non_blocking=True)
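Putting that together, here is a minimal sketch of the full pattern I have in mind: an async device-to-host copy into the front of the pinned buffer, a stream synchronize before touching the data on the CPU, and an explicit copy out of the buffer so the NumPy result is not aliased to memory that gets overwritten on the next iteration. It falls back to the CPU when CUDA is absent (in which case pinning and `non_blocking` are effectively no-ops), so the shapes and values can be checked anywhere:

```python
import numpy as np
import torch

channels = 8
maxsize = 10

# Fall back to CPU so the sketch runs without a GPU; page-locking
# only matters when a CUDA device is actually present.
use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

# Oversized page-locked staging buffer on the host.
buf = torch.zeros((maxsize, channels), pin_memory=use_cuda)

# Variable-length results, as in the original example.
data = [torch.full((i, channels), float(i), device=device) for i in range(1, 6)]

results = []
for res in data:
    n = res.shape[0]
    # Async copy of the undersized result into the front of the buffer.
    buf.narrow(0, 0, n).copy_(res, non_blocking=True)
    if use_cuda:
        # The copy is only guaranteed complete after synchronizing the
        # stream it was enqueued on; reading buf before this races.
        torch.cuda.current_stream().synchronize()
    # np.copy detaches the array from the reused buffer so the next
    # iteration's copy cannot clobber it.
    results.append(np.copy(buf[:n].numpy()))

for i, r in enumerate(results, start=1):
    assert r.shape == (i, channels)
    assert (r == i).all()
```

The `np.copy` here plays the same role as `.clone().numpy()` in the script above; either way one host-side copy per result seems unavoidable if the buffer is reused.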