@mattinjersey, it seems to me that the difference between your code and @ptrblck’s code is that the latter only measures the time for computation on the GPU; it does not account for the data transfer time (data transfer happens only once in @ptrblck’s code).
On the other hand, you transfer data to the GPU at every iteration, and hence you are observing the additional time required for that.
Would it be possible, as @ptrblck suggested, to wrap your `ComputeResults()` function in a `Dataset`, so that a `DataLoader` can generate data batches in pinned (page-locked) memory (by passing `pin_memory=True`)? Once a batch is in pinned memory, you can also pass `non_blocking=True` to the `cuda()` calls so that the host-to-device transfer overlaps with computation.
These two steps should help hide the data transfer cost that is slowing you down.
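To make that concrete, here is a minimal sketch of the pattern. `ResultsDataset` is a hypothetical stand-in for your `ComputeResults()` logic (I'm assuming it produces one sample per index; the sample size of 128 is arbitrary):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ResultsDataset(Dataset):
    """Hypothetical stand-in for ComputeResults(): one sample per index."""
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # Produce the sample on the CPU; the DataLoader assembles the batch.
        return torch.randn(128)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# pin_memory=True copies each assembled batch into page-locked (pinned)
# host memory, a prerequisite for truly asynchronous host-to-device copies.
loader = DataLoader(ResultsDataset(64), batch_size=16, pin_memory=True)

for batch in loader:
    # non_blocking=True lets this copy overlap with GPU computation;
    # it only has an effect when the source tensor is pinned.
    batch = batch.to(device, non_blocking=True)
    out = batch * 2  # placeholder for the real GPU computation
```

The overlap comes from CUDA's asynchronous copy engine: while the GPU is busy with the previous batch's kernels, the next batch's pinned-memory copy can proceed in parallel.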