If the workload in your model is small, you may be limited by the memory bandwidth of the host-to-device transfers rather than by the compute itself.
I would recommend manually profiling the code to check whether the data transfer is really slower than the model execution, via:
import time
import torch

nb_iters = 100

# make sure all previously queued GPU work is finished before starting the timer
torch.cuda.synchronize()
t0 = time.time()
for _ in range(nb_iters):
    # buffer and batch_size come from your own code
    images = buffer[batch_size].to('cuda')
# wait for the copies to finish before reading the clock
torch.cuda.synchronize()
t1 = time.time()
print((t1 - t0) / nb_iters)
and then run the same measurement for the model's forward pass.
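For reference, a minimal sketch of the same timing pattern applied to the forward pass (assuming model is your network and images is a batch already on the GPU; both names are placeholders for your own objects):

# time only the forward pass, reusing nb_iters from above
torch.cuda.synchronize()
t0 = time.time()
for _ in range(nb_iters):
    with torch.no_grad():
        output = model(images)
torch.cuda.synchronize()
t1 = time.time()
print((t1 - t0) / nb_iters)

Comparing the two averages should show whether the data transfer or the model execution is the actual bottleneck.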