Long training time, bottleneck output shows high IO

I have a dataset of around 450,000 images, each of size 128x128x3, and the target for each image is 13 float values. My model (a ResNet34) seems to take a long time to train: each epoch requires around 50 minutes on two V100 GPUs. I have tried num_workers of 8, 16, and 32, but there is no visible change. I ran torch.utils.bottleneck and found that the largest cumulative CPU time is spent in the 'read' method of '_io.BufferedReader' objects. Is there anything I can do to fix this? For context, a minimal sketch of how my data pipeline is set up follows (the class and file names are placeholders, not my exact code):
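
import os
import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms


class ImageRegressionDataset(Dataset):
    """Placeholder dataset: loads an image from disk and returns it with its 13-float target."""

    def __init__(self, root, targets, transform=None):
        self.root = root
        self.files = sorted(targets.keys())
        self.targets = targets          # filename -> list of 13 floats
        self.transform = transform

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        name = self.files[idx]
        # The per-image open/decode here is presumably what shows up as
        # '_io.BufferedReader.read' in the cProfile output below.
        img = Image.open(os.path.join(self.root, name)).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        target = torch.tensor(self.targets[name], dtype=torch.float32)
        return img, target


transform = transforms.Compose([transforms.ToTensor()])   # 128x128x3 image -> 3x128x128 float tensor
targets = {}                                               # placeholder: filename -> 13 floats
dataset = ImageRegressionDataset("train_images/", targets, transform=transform)

loader = DataLoader(
    dataset,
    batch_size=256,      # placeholder batch size
    shuffle=True,
    num_workers=16,      # tried 8, 16, 32 with no visible change
    pin_memory=True,
)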

--------------------------------------------------------------------------------
  cProfile output
--------------------------------------------------------------------------------
         12200282 function calls (11871813 primitive calls) in 593.240 seconds

   Ordered by: internal time
   List reduced from 1389 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   110538  274.929    0.002  274.929    0.002 {method 'read' of '_io.BufferedReader' objects}
    12303   96.898    0.008   96.898    0.008 {built-in method posix.lstat}
    10533   42.315    0.004   42.315    0.004 {method 'item' of 'torch._C._TensorBase' objects}
   141570   35.918    0.000   35.918    0.000 {method 'add_' of 'torch._C._TensorBase' objects}
    94380   26.973    0.000   26.973    0.000 {method 'mul_' of 'torch._C._TensorBase' objects}
    47190   14.973    0.000   14.973    0.000 {method 'sqrt' of 'torch._C._TensorBase' objects}
      429   14.923    0.035  120.987    0.282 /project/6007383/farhan/envs/unet_exp/lib/python3.6/site-packages/torch/optim/adam.py:49(step)
    47190   14.280    0.000   14.280    0.000 {method 'addcmul_' of 'torch._C._TensorBase' objects}
    47190   13.784    0.000   13.784    0.000 {method 'addcdiv_' of 'torch._C._TensorBase' objects}
     1760    6.544    0.004    6.544    0.004 {method 'to' of 'torch._C._TensorBase' objects}
    12282    5.987    0.000    5.987    0.000 {method 'decode' of 'ImagingDecoder' objects}
      429    5.142    0.012    5.142    0.012 {method 'run_backward' of 'torch._C._EngineBase' objects}
    12291    3.329    0.000    3.329    0.000 {built-in method io.open}
    12282    2.949    0.000    2.949    0.000 {method 'close' of '_io.BufferedReader' objects}
    27756    2.362    0.000    2.362    0.000 {built-in method conv2d}


--------------------------------------------------------------------------------
  autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

        Because the autograd profiler uses the CUDA event API,
        the CUDA time column reports approximately max(cuda_time, cpu_time).
        Please ignore this output if your code does not use CUDA.

-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Name                     Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes                         
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
item                     6.73%            82.322ms         6.73%            82.322ms         82.322ms         7.84%            32.000us         32.000us         1                []                                   
_local_scalar_dense      6.73%            82.312ms         6.73%            82.312ms         82.312ms         7.84%            32.000us         32.000us         1                []                                   
item                     6.66%            81.402ms         6.66%            81.402ms         81.402ms         7.84%            32.000us         32.000us         1                []                                   
_local_scalar_dense      6.66%            81.393ms         6.66%            81.393ms         81.393ms         7.84%            32.000us         32.000us         1                []                                   
item                     6.66%            81.380ms         6.66%            81.380ms         81.380ms         3.92%            16.000us         16.000us         1                []                                   
_local_scalar_dense      6.66%            81.370ms         6.66%            81.370ms         81.370ms         3.92%            16.000us         16.000us         1                []                                   
item                     6.66%            81.358ms         6.66%            81.358ms         81.358ms         7.84%            32.000us         32.000us         1                []                                   
item                     6.66%            81.356ms         6.66%            81.356ms         81.356ms         5.88%            24.000us         24.000us         1                []                                   
item                     6.66%            81.354ms         6.66%            81.354ms         81.354ms         7.84%            32.000us         32.000us         1                []                                   
item                     6.66%            81.353ms         6.66%            81.353ms         81.353ms         7.84%            32.000us         32.000us         1                []                                   
item                     6.66%            81.351ms         6.66%            81.351ms         81.351ms         7.84%            32.000us         32.000us         1                []                                   
item                     6.66%            81.349ms         6.66%            81.349ms         81.349ms         7.84%            32.000us         32.000us         1                []                                   
item                     6.66%            81.349ms         6.66%            81.349ms         81.349ms         7.84%            32.000us         32.000us         1                []                                   
_local_scalar_dense      6.66%            81.348ms         6.66%            81.348ms         81.348ms         7.84%            32.000us         32.000us         1                []                                   
item                     6.66%            81.347ms         6.66%            81.347ms         81.347ms         0.00%            0.000us          0.000us          1                []                                   
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 1.222s
CUDA time total: 408.000us

Please note that this was run on a very small subset of the data.

If the data loading is the bottleneck, have a look at this post by @rwightman, which explains how these bottlenecks might be avoided.
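
As a rough illustration of the general idea (a sketch, not the specific technique from that post): 450k images at 128x128x3 are only about 22 GB as uint8, so one option, assuming the node has enough host RAM, is to decode everything once and keep it in memory so later epochs never touch the filesystem.

import os
import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


class CachedImageDataset(Dataset):
    """Sketch: decode all images up front into one uint8 tensor to avoid per-item disk reads."""

    def __init__(self, root, targets):
        self.files = sorted(targets.keys())                 # targets: filename -> 13 floats
        self.targets = torch.tensor([targets[f] for f in self.files], dtype=torch.float32)
        # Decode once; the per-item open/read calls that dominated the cProfile
        # output then only happen here, not in every epoch.
        self.images = torch.stack([
            torch.from_numpy(np.array(Image.open(os.path.join(root, f)).convert("RGB"), dtype=np.uint8))
            for f in self.files
        ])                                                  # shape (N, 128, 128, 3), uint8

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        img = self.images[idx].permute(2, 0, 1).float().div_(255.0)  # -> (3, 128, 128) in [0, 1]
        return img, self.targets[idx]

With fork-based DataLoader workers the cached tensor is shared copy-on-write rather than duplicated per worker; preprocessing the images into a single memory-mapped file (or a sharded format as described in the linked post) achieves a similar effect without holding everything in RAM.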

I am not sure if my interpretation is correct. Judging by the log, the bottleneck is surely just the data loading, right? Sorry, I am having a hard time interpreting this output.