What parallel options do I have to handle data-loading overhead?

I am currently training a network and reporting status updates every 600 images with a batch size of 6 (i.e. every 100 batches). My total time per 600 images is about 85 seconds, of which load time accounts for roughly 60 seconds; the forward and backward passes make up the remaining 25. I already have two separate processes loading the data and the ground truth (GT) into a memory buffer on the CPU, so the load time appears to be dominated by communication overhead with the GPU (and perhaps some blocking on the buffer, though that seems unlikely).
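A minimal sketch of how these phases can be timed separately, given that CUDA calls return asynchronously; loader, model, criterion, and optimizer here are stand-ins for the actual training objects:

```python
import time
import torch

# Assumes: `loader` yields (images, targets) CPU batches, and `model`,
# `criterion`, `optimizer` are the usual training objects.
fetch_time = copy_time = compute_time = 0.0
end = time.perf_counter()

for images, targets in loader:
    fetch_time += time.perf_counter() - end       # blocked waiting for data

    t0 = time.perf_counter()
    images, targets = images.cuda(), targets.cuda()
    torch.cuda.synchronize()                      # wait for the host-to-device copy to finish
    copy_time += time.perf_counter() - t0

    t1 = time.perf_counter()
    loss = criterion(model(images), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()                      # CUDA ops are async; sync before timing
    compute_time += time.perf_counter() - t1

    end = time.perf_counter()

print('fetch %.1fs  copy %.1fs  compute %.1fs' % (fetch_time, copy_time, compute_time))
```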

Is there a solution using the data parallelism modules (e.g. DataParallel) that can address this issue? I'm currently running on only 1 GPU, but I have 4, each with 11 GB of memory.

Thanks!

DataLoader loads batches in parallel worker processes when you set num_workers to something higher than 0: http://pytorch.org/docs/master/data.html?highlight=dataloader. I'm not sure if that helps, though.
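
If it does apply, here is a minimal sketch of the usage (the TensorDataset is a stand-in for your own dataset; pin_memory and the non_blocking copies are extra assumptions on my part that can also cut the CPU-to-GPU copy cost):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; substitute your own torch.utils.data.Dataset.
dataset = TensorDataset(torch.randn(600, 3, 224, 224),
                        torch.randint(0, 10, (600,)))

loader = DataLoader(
    dataset,
    batch_size=6,
    shuffle=True,
    num_workers=4,    # load/collate batches in 4 background worker processes
    pin_memory=True,  # page-locked host memory speeds up CPU-to-GPU copies
)

for images, targets in loader:
    # With pin_memory=True, non_blocking=True lets the host-to-device copy
    # overlap with GPU computation (in recent PyTorch versions).
    images = images.cuda(non_blocking=True)
    targets = targets.cuda(non_blocking=True)
    # ... forward/backward as before ...
```

num_workers=4 is just a starting point; it usually needs tuning against the number of CPU cores you have free for loading.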