Why is training with 4 GPUs not 4 times faster than with one GPU?

When training a neural network with 4 GPUs in PyTorch, the performance is not even 2 times that of a single GPU (somewhere between 1x and 2x). From nvidia-smi we see that the GPUs are busy for only a few milliseconds, and then for the next 5-10 seconds it looks like data is being offloaded and loaded for the next execution (GPU usage is mostly 0%). Is there any way in PyTorch to improve the data loading and transfer for GPU execution, or is this normal?

I'm running a DCNN on the COCO dataset.

Using an LMDB database for the dataset would give a good increase in performance, since it reduces the per-sample file-reading overhead in the input pipeline (a minimal sketch is below).
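As an illustration only, here is a minimal sketch of an LMDB-backed `Dataset`. It assumes the samples were written as pickled `(image, target)` pairs under zero-padded integer keys; that key scheme and serialization format are assumptions for the sketch, not something from the original post.

```python
import pickle

import lmdb
import torch
from torch.utils.data import Dataset


class LMDBDataset(Dataset):
    """Reads samples from an LMDB file.

    Assumed layout: pickled (image, target) pairs stored under
    zero-padded integer keys like b"00000042".
    """

    def __init__(self, lmdb_path):
        # readonly=True, lock=False lets multiple DataLoader workers
        # read the same environment safely.
        self.env = lmdb.open(lmdb_path, readonly=True, lock=False, readahead=False)
        with self.env.begin(write=False) as txn:
            self.length = txn.stat()["entries"]

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        with self.env.begin(write=False) as txn:
            data = txn.get(f"{index:08d}".encode("ascii"))
        image, target = pickle.loads(data)
        return torch.as_tensor(image), torch.as_tensor(target)
```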
Using 4 GPUs will not give a 4x speedup (Amdahl's law): only the parallelizable part of each step scales with the number of GPUs. It is common for 4-GPU performance to end up close to single-GPU performance, because in distributed training there is a lot of communication between the GPUs and the CPU, which eats into the gains. The symptom you describe (GPUs mostly idle while data is prepared) also suggests the input pipeline is the bottleneck, so tuning the DataLoader is worth trying; see the sketch below.
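Here is a minimal, self-contained sketch (not from the original answer) of the usual DataLoader settings that keep the GPU fed: multiple worker processes, pinned host memory, and non-blocking host-to-device copies. The dataset and model here are synthetic stand-ins just to make the example runnable; substitute your COCO pipeline and DCNN.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic stand-ins for the real COCO dataset and DCNN (shapes illustrative only).
dataset = TensorDataset(torch.randn(1000, 3, 224, 224), torch.randint(0, 80, (1000,)))
model = nn.Sequential(
    nn.Conv2d(3, 16, 3), nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 80)
).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,    # decode/prepare batches in background processes
    pin_memory=True,  # page-locked host memory enables asynchronous copies
)

for images, targets in loader:
    # non_blocking=True lets the host-to-device copy overlap with GPU compute
    images = images.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
```

If the GPUs are still mostly idle after this, increasing `num_workers` (or moving heavy decoding/augmentation into the dataset workers) is usually the next thing to profile.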