Large image loading is very slow. Is there any way to speed things up?

I have a very specific problem: I have to use large images (resolution 2k x 3k) and relatively large batches (16-32), and I’m completely CPU bottlenecked (the cause is either the transfer to the GPU or the image loading).

Images are stored in uint8 PNG format. I’m thinking about converting them to JPEG and using the NVIDIA DALI library to load and decode them on the GPU. It gave me good speed-ups in the past, but the library is clumsy. Is there an easier solution? I want to fully utilize my GPU, which is mostly idle now.

Machine specs:
Ryzen 9 5950X (16 cores, multithreading)
128 GB RAM
NVMe M.2 SSD
RTX 4090 GPU

It would be useful to understand if it is truly data transfer or loading/decoding. Have you tried comparing the performance of:
(1) loading + transfer + inference
(2) load data once, transfer the same data multiple times + run inference multiple times
(3) load data once, transfer once, run inference multiple times on the same data
to isolate the cause?
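
A rough way to time the three cases above is sketched below. It is only a sketch: `load`, `transfer`, `infer` and `sync` are hypothetical callables you would supply yourself, and the sleeping stand-ins at the bottom exist only so the script runs without a GPU. With PyTorch on CUDA you would probably pass `torch.cuda.synchronize` as `sync`, because GPU work is queued asynchronously and the clock would otherwise stop too early. During training, `infer` would be one full training step.

```python
import time


def time_scenarios(load, transfer, infer, n_iters=10, sync=lambda: None):
    """Return average seconds per iteration for the three cases.

    load()      -> batch on the CPU             (hypothetical, user-supplied)
    transfer(b) -> the same batch on the device (hypothetical, user-supplied)
    infer(b)    -> forward pass or train step   (hypothetical, user-supplied)
    sync()      -> blocks until queued device work has finished
    """
    def timed(step):
        sync()
        start = time.perf_counter()
        for _ in range(n_iters):
            step()
        sync()  # wait for pending device work before stopping the clock
        return (time.perf_counter() - start) / n_iters

    # (1) load + transfer + inference on every iteration
    full = timed(lambda: infer(transfer(load())))

    # (2) load once, then transfer + inference on every iteration
    cpu_batch = load()
    no_load = timed(lambda: infer(transfer(cpu_batch)))

    # (3) load once, transfer once, then inference only
    device_batch = transfer(cpu_batch)
    compute_only = timed(lambda: infer(device_batch))

    return {"full": full, "no_load": no_load, "compute_only": compute_only}


if __name__ == "__main__":
    # Stand-ins that only sleep; replace them with the real pipeline pieces.
    def fake_load():
        time.sleep(0.02)
        return [0] * 1000

    def fake_transfer(batch):
        time.sleep(0.01)
        return batch

    def fake_infer(batch):
        time.sleep(0.002)
        return sum(batch)

    for name, seconds in time_scenarios(fake_load, fake_transfer, fake_infer, n_iters=5).items():
        print(f"{name}: {seconds * 1000:.1f} ms/iter")
```

The difference between `full` and `no_load` approximates the loading/decoding cost, and the difference between `no_load` and `compute_only` approximates the transfer cost. Note that these are un-overlapped costs; background loader workers can hide part of the loading time in a real run.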

P.S. This is during training.
I’m very sure the 4090 can handle the third step relatively easily,
since the neural network is very small (ResNet-18-ish).

I’ll check the timings of batch generation and come back.
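
For the batch-generation timing, something along these lines should do as a first pass. It is a sketch under assumptions: `batches` is any iterable that yields batches (for example a data loader), and `step` is a hypothetical callable that runs one training step and blocks until it has finished. It reports how much of the wall-clock time is spent waiting for the next batch.

```python
import time


def measure_wait_fraction(batches, step, max_batches=50):
    """Split wall-clock time into 'waiting for a batch' and 'running the step'."""
    wait_s = 0.0
    step_s = 0.0
    n = 0
    it = iter(batches)
    while n < max_batches:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is the loader's fault
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)           # assumed to block until the step is done
        t2 = time.perf_counter()
        wait_s += t1 - t0
        step_s += t2 - t1
        n += 1
    total = wait_s + step_s
    return {
        "batches": n,
        "wait_s": wait_s,
        "step_s": step_s,
        "wait_fraction": wait_s / total if total else 0.0,
    }


if __name__ == "__main__":
    # Stand-in loader and step that only sleep; replace with the real ones.
    def slow_batches(count):
        for i in range(count):
            time.sleep(0.02)
            yield i

    stats = measure_wait_fraction(slow_batches(5), lambda b: time.sleep(0.002))
    print(stats)
```

A `wait_fraction` close to 1 would point at loading/decoding as the bottleneck; a value close to 0 would mean the loader keeps up and the time goes elsewhere.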