We already have prefetching (see the ImageNet or DCGAN examples), but we don't prefetch directly onto the GPU. We prefetch onto the CPU, do data augmentation there, and then put the mini-batch in CUDA pinned memory (on the CPU) so that the transfer to the GPU is very fast. Then we hand the batch over for training, at which point it is transferred to the GPU.
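For anyone who wants to see that pipeline spelled out, here is a minimal sketch, not the code from the ImageNet or DCGAN examples. It assumes the standard `torch.utils.data.DataLoader` with `num_workers` for CPU-side prefetching and `pin_memory` for the pinned staging buffer. `RandomImages`, `make_loader`, and `train_one_epoch` are hypothetical names made up for illustration, and the random flip stands in for real augmentation.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset


class RandomImages(Dataset):
    """Hypothetical stand-in dataset: random 3x32x32 images, 10 classes."""

    def __init__(self, size=64):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        g = torch.Generator().manual_seed(idx)
        image = torch.rand(3, 32, 32, generator=g)
        # Augmentation runs on the CPU, inside the loader (worker) process.
        if torch.rand(1, generator=g).item() < 0.5:
            image = image.flip(-1)  # horizontal flip
        return image, idx % 10


def make_loader(num_workers=2, batch_size=16):
    return DataLoader(
        RandomImages(),
        batch_size=batch_size,
        num_workers=num_workers,  # workers prepare batches ahead of time on the CPU
        pin_memory=torch.cuda.is_available(),  # stage batches in pinned host memory
    )


def train_one_epoch(loader, device):
    # Tiny placeholder model, only here to make the loop complete.
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    steps = 0
    for images, labels in loader:
        # The host-to-device copy happens here, in the training loop.
        # non_blocking=True can make the copy asynchronous when the source is pinned.
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        steps += 1
    return steps


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(train_one_epoch(make_loader(), device))
```

On a machine without CUDA the sketch falls back to the CPU and skips pinning, so it still runs but shows nothing about transfer speed.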