Minimizing time moving tensors between CPU and GPU

If the workload in your model is small, the overall runtime can be limited by the memory bandwidth between the host and device rather than by compute.
I would recommend manually profiling the code to check whether the data transfer is really slower than the model execution, e.g. via:

import time
import torch

nb_iters = 100

# `buffer` (a CPU tensor) and `batch_size` come from your own code
torch.cuda.synchronize()  # finish pending GPU work before starting the timer
t0 = time.time()
for _ in range(nb_iters):
    images = buffer[:batch_size].to('cuda')
torch.cuda.synchronize()  # wait for the copies to finish before stopping the timer
t1 = time.time()
print((t1 - t0) / nb_iters)

and time the model's forward pass in the same way.
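For completeness, a minimal sketch of the model-side timing. The `nn.Linear` model and the input shapes are placeholders for your own model and batch, and the snippet falls back to the CPU only so it runs on any machine; on the GPU, the `torch.cuda.synchronize()` calls around the timed region are what make the measurement valid, since CUDA kernels launch asynchronously.

```python
import time

import torch
import torch.nn as nn

# Fall back to CPU so the snippet runs anywhere; use 'cuda' in practice.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Placeholder model and batch; substitute your own.
model = nn.Linear(1024, 1024).to(device)
images = torch.randn(64, 1024, device=device)

nb_iters = 100
if device == 'cuda':
    torch.cuda.synchronize()  # start from an idle GPU
t0 = time.time()
for _ in range(nb_iters):
    out = model(images)
if device == 'cuda':
    torch.cuda.synchronize()  # wait for all kernels before stopping the timer
t1 = time.time()
print('avg. forward time: {:.6f}s'.format((t1 - t0) / nb_iters))
```

If the transfer time turns out to dominate, a common next step is to keep the host tensor in page-locked memory via `tensor.pin_memory()` and copy it with `to('cuda', non_blocking=True)`, which allows the copy to overlap with computation.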