DataLoader I/O Bottleneck

I have tried my best to create two equivalent examples of training DenseNet121 on a large collection of images here.

PyTorch is 3 minutes faster than Keras overall; however, what is interesting is that the DataLoader seems to bottleneck training a lot. Comparing Keras with a data generator vs. an in-memory NumPy array, the difference is only 3 seconds per epoch; with PyTorch it is 60 seconds!

Could this be because Keras gives you the option of multiprocessing or threading for the data generator (and multiprocessing is much faster in my example), whereas with PyTorch I could only find a threading-style option?


train_loader = DataLoader(dataset=train_dataset, batch_size=BATCHSIZE,
                          shuffle=True, num_workers=4*CPU_COUNT, pin_memory=True)
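To check whether the DataLoader really is the bottleneck, one way is to measure how long the training loop actually waits on each batch. Here is a minimal, framework-agnostic sketch (`TimedLoader` and `toy_loader` are made-up names, not part of torch); the same wrapper can be put around a real `DataLoader`:

```python
import time

class TimedLoader:
    """Wrap any batch iterable and record how long the consumer waits on it.

    Works with a PyTorch DataLoader or a plain generator; demoed here with a
    toy generator since the real dataset isn't shown.
    """
    def __init__(self, loader):
        self.loader = loader
        self.wait_times = []  # seconds spent blocked per batch

    def __iter__(self):
        it = iter(self.loader)
        while True:
            start = time.perf_counter()
            try:
                batch = next(it)
            except StopIteration:
                return
            self.wait_times.append(time.perf_counter() - start)
            yield batch

def toy_loader(n_batches=5, delay=0.01):
    # Stand-in for a slow loader: each batch takes `delay` seconds to produce.
    for i in range(n_batches):
        time.sleep(delay)  # simulate disk read + decode
        yield i

timed = TimedLoader(toy_loader())
batches = list(timed)                 # "training loop"
data_wait = sum(timed.wait_times)     # total time spent waiting on data
```

If `data_wait` is a large fraction of the epoch time even with `num_workers` set high, the loader (disk reads, JPEG decode, augmentation) is the limiting factor rather than the GPU.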

Curious whether I'm doing something wrong, or whether others have also found that DataLoader bottlenecks training on fast GPUs like the P100 or V100?