Speed of FSRCNN

Hi everyone, I have built a simple FSRCNN model, which is just a 3-layer CNN with no pooling. I have 100,000 images: 80,000 for training and 20,000 for development. At the moment I use a patch size of 64 and a batch size of 32. My code takes about 30 minutes for the forward + backward pass over one epoch, i.e. one pass over all the batches. Is this speed acceptable, or is it slower than usual? The network runs on an EC2 instance with four 16 GB GPUs. I have enabled cudnn.benchmark; is there anything else I can do to speed up training?
I have also noticed that GPU-Util shows 0%. Could my data loader be the bottleneck?
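
To make the data loader question concrete, here is a rough sketch of how the time per epoch could be split into "waiting for the next batch" and "running the training step". It is plain Python and not taken from my actual code: `profile_epoch`, `loader`, `train_step` and `max_batches` are hypothetical names, and `train_step` stands in for whatever does the forward + backward pass on one batch.

```python
import time


def profile_epoch(loader, train_step, max_batches=50):
    """Measure time spent fetching batches vs. time spent in the training step.

    loader:      any iterable that yields batches (assumed, e.g. a data loader)
    train_step:  callable that runs forward + backward on one batch (assumed)
    max_batches: stop after this many batches so the check stays short
    """
    data_time = 0.0
    step_time = 0.0
    batches = 0
    it = iter(loader)
    while batches < max_batches:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is the wait for data
        except StopIteration:
            break
        t1 = time.perf_counter()
        # On a GPU, kernels are launched asynchronously, so train_step should
        # synchronize the device before returning for this timing to be fair.
        train_step(batch)
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
        batches += 1
    return {"batches": batches, "data_s": data_time, "step_s": step_time}
```

If `data_s` comes out much larger than `step_s`, the GPUs are mostly idle waiting for input, which would be consistent with the 0% GPU-Util reading; if it is the other way round, the loader is probably not the main problem.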