Is it the loading itself or augmentation? I usually try to move all augmentation to the GPU except gathering statistics which I do beforehand and store in a tensor.
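To make the first point concrete, here is a minimal sketch of GPU-side augmentation with statistics gathered beforehand and stored in a tensor. The `gpu_augment` name and the mean/std values are just illustrative (these happen to be the common ImageNet ones), not anything from your setup:

```python
import torch

# Per-channel statistics computed beforehand, stored as tensors
# (shaped for broadcasting over an NCHW batch).
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def gpu_augment(batch, device="cuda" if torch.cuda.is_available() else "cpu"):
    # Ship the raw uint8 batch to the GPU first, then do the float
    # conversion, normalization, and a random flip there instead of
    # in the DataLoader workers.
    x = batch.to(device, non_blocking=True).float().div_(255)
    x = (x - mean.to(device)) / std.to(device)
    if torch.rand(()) < 0.5:
        x = torch.flip(x, dims=[3])  # random horizontal flip
    return x

batch = torch.randint(0, 256, (4, 3, 32, 32), dtype=torch.uint8)
out = gpu_augment(batch)
```

The point is that the workers only ever handle cheap uint8 data; all floating-point work happens on the GPU.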
You could also use OpenCV or so instead of TorchVision.
Finally, you ask about the bottleneck, i.e. where your training actually spends its time. This is the important question, and answering it is what leads to reducing the time per epoch.
You probably are aware, but it is worth reminding ourselves every now and then that other things (most prominently the learning rate) impact the overall runtime to a given target metric by changing the number of epochs needed. (And I say this as someone who does quite a bit of client work of “benchmark-and-optimize-time-per-epoch”.)
Is TorchVision with pillow-simd faster or slower than with the other backend?
Is it IO or CPU bound?
You could try whether you can save time by writing tensors of 9 images to disk.
However, in my experience, this can make disk access a problem. One thing to do is to keep the images as byte tensors (if they were 0…255 to begin with) to cut disk use.
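As a sketch of the byte-tensor idea (the shard filename and shapes are made up for illustration): bundle several images into one uint8 tensor per file, and only convert to float after loading.

```python
import os
import tempfile
import torch

# Nine images bundled into a single uint8 tensor: one file on disk,
# one byte per pixel value instead of four for float32.
imgs = torch.randint(0, 256, (9, 3, 64, 64), dtype=torch.uint8)
path = os.path.join(tempfile.gettempdir(), "shard0.pt")  # hypothetical shard file
torch.save(imgs, path)

# At load time, read the bytes back and convert to 0...1 floats.
loaded = torch.load(path)
batch = loaded.float().div_(255)
```

Compared to storing float32 tensors, this is a 4x reduction in disk traffic, which is often enough to keep the disk from becoming the new bottleneck.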
(Also, you don’t measure just the first batch, right? It is known to be slower.)
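A small sketch of what I mean by not measuring the first batch: time the gap between batch deliveries and drop the first measurement, since it pays for worker startup. The toy dataset here is just for illustration:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 64 samples, so batch_size=16 gives 4 batches.
data = TensorDataset(torch.randn(64, 3, 8, 8))
loader = DataLoader(data, batch_size=16)

times = []
last = time.perf_counter()
for (x,) in loader:
    now = time.perf_counter()
    times.append(now - last)  # how long we waited for this batch
    # ... training step would go here ...
    last = time.perf_counter()

# The first batch includes worker startup, so exclude it from the stats.
steady = times[1:]
```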
Of course, you can also look at the hardware (for IO: M.2 being significantly faster than SATA for SSDs is one obvious thing; for CPU, more cores; I always want to get one of these fancy CPUs…).