I have some questions regarding the perf measurements.
Have you profiled which step is the bottleneck in each iteration? Is it data loading, the forward pass, the backward pass, or the optimizer step?
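For example, `torch.profiler` can break an iteration down per stage. A minimal sketch (the model, optimizer, and data here are placeholders, not from your setup):

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Placeholder model/data; substitute your actual training objects.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        with record_function("forward"):
            loss = torch.nn.functional.mse_loss(model(data), target)
        with record_function("backward"):
            loss.backward()
        with record_function("optimizer"):
            optimizer.step()
            optimizer.zero_grad()

# Per-stage breakdown, sorted by total CUDA time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```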
For GPU training, time.time() might not give an accurate measure, since CUDA ops return immediately after they are added to the stream and execute asynchronously. To get more accurate numbers, timing with CUDA events via elapsed_time is a better option. The following post can serve as an example:
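For illustration, a minimal sketch along the same lines using `torch.cuda.Event` (the matmul workload is just a placeholder; wrap your forward/backward/optimizer steps instead):

```python
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

# Placeholder workload; replace with the step you want to time.
x = torch.randn(4096, 4096, device="cuda")

start.record()
y = x @ x
end.record()

# Wait for the recorded events to finish before reading the timing.
torch.cuda.synchronize()
print(f"elapsed: {start.elapsed_time(end):.3f} ms")  # elapsed_time is in milliseconds
```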