I have a dataset of around 450,000 images, each of size 128x128x3. The target for each image is 13 float values. My model (a ResNet34) seems to take a long time to train: each epoch takes around 50 minutes on two V100 GPUs. I have tried num_workers values of 8, 16, and 32, but there is no visible change. I ran torch.utils.bottleneck and found that the largest cumulative CPU time is spent in the 'read' method of '_io.BufferedReader' objects. Is there anything I can do to fix this?
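For context, my loading pipeline is essentially the following sketch (the real code reads the ~450k files from disk; here a few synthetic images stand in, and the class and variable names are placeholders, not my exact code):

```python
import os
import tempfile

import numpy as np
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset


class ImageRegressionDataset(Dataset):
    """Loads one 128x128x3 image from disk and pairs it with 13 float targets."""

    def __init__(self, image_paths, targets):
        self.image_paths = image_paths  # ~450k file paths in my real setup
        self.targets = targets          # (N, 13) float array

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # Every call opens and decodes a file from disk; this is where the
        # cumulative time in _io.BufferedReader.read comes from.
        img = Image.open(self.image_paths[idx]).convert("RGB")
        x = torch.from_numpy(np.asarray(img).copy()).permute(2, 0, 1).float() / 255.0
        y = torch.as_tensor(self.targets[idx], dtype=torch.float32)
        return x, y


# Tiny synthetic stand-in for the real dataset, so the sketch is runnable.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(4):
    p = os.path.join(tmpdir, f"{i}.png")
    Image.fromarray(
        np.random.randint(0, 255, (128, 128, 3), dtype=np.uint8)
    ).save(p)
    paths.append(p)
targets = np.random.rand(4, 13).astype(np.float32)

loader = DataLoader(
    ImageRegressionDataset(paths, targets),
    batch_size=2,
    num_workers=0,   # 8/16/32 in my real runs, with no visible change
    pin_memory=False,
)
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)  # torch.Size([2, 3, 128, 128]) torch.Size([2, 13])
```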
--------------------------------------------------------------------------------
cProfile output
--------------------------------------------------------------------------------
12200282 function calls (11871813 primitive calls) in 593.240 seconds
Ordered by: internal time
List reduced from 1389 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
110538 274.929 0.002 274.929 0.002 {method 'read' of '_io.BufferedReader' objects}
12303 96.898 0.008 96.898 0.008 {built-in method posix.lstat}
10533 42.315 0.004 42.315 0.004 {method 'item' of 'torch._C._TensorBase' objects}
141570 35.918 0.000 35.918 0.000 {method 'add_' of 'torch._C._TensorBase' objects}
94380 26.973 0.000 26.973 0.000 {method 'mul_' of 'torch._C._TensorBase' objects}
47190 14.973 0.000 14.973 0.000 {method 'sqrt' of 'torch._C._TensorBase' objects}
429 14.923 0.035 120.987 0.282 /project/6007383/farhan/envs/unet_exp/lib/python3.6/site-packages/torch/optim/adam.py:49(step)
47190 14.280 0.000 14.280 0.000 {method 'addcmul_' of 'torch._C._TensorBase' objects}
47190 13.784 0.000 13.784 0.000 {method 'addcdiv_' of 'torch._C._TensorBase' objects}
1760 6.544 0.004 6.544 0.004 {method 'to' of 'torch._C._TensorBase' objects}
12282 5.987 0.000 5.987 0.000 {method 'decode' of 'ImagingDecoder' objects}
429 5.142 0.012 5.142 0.012 {method 'run_backward' of 'torch._C._EngineBase' objects}
12291 3.329 0.000 3.329 0.000 {built-in method io.open}
12282 2.949 0.000 2.949 0.000 {method 'close' of '_io.BufferedReader' objects}
27756 2.362 0.000 2.362 0.000 {built-in method conv2d}
--------------------------------------------------------------------------------
autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
Because the autograd profiler uses the CUDA event API,
the CUDA time column reports approximately max(cuda_time, cpu_time).
Please ignore this output if your code does not use CUDA.
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls Input Shapes
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
item 6.73% 82.322ms 6.73% 82.322ms 82.322ms 7.84% 32.000us 32.000us 1 []
_local_scalar_dense 6.73% 82.312ms 6.73% 82.312ms 82.312ms 7.84% 32.000us 32.000us 1 []
item 6.66% 81.402ms 6.66% 81.402ms 81.402ms 7.84% 32.000us 32.000us 1 []
_local_scalar_dense 6.66% 81.393ms 6.66% 81.393ms 81.393ms 7.84% 32.000us 32.000us 1 []
item 6.66% 81.380ms 6.66% 81.380ms 81.380ms 3.92% 16.000us 16.000us 1 []
_local_scalar_dense 6.66% 81.370ms 6.66% 81.370ms 81.370ms 3.92% 16.000us 16.000us 1 []
item 6.66% 81.358ms 6.66% 81.358ms 81.358ms 7.84% 32.000us 32.000us 1 []
item 6.66% 81.356ms 6.66% 81.356ms 81.356ms 5.88% 24.000us 24.000us 1 []
item 6.66% 81.354ms 6.66% 81.354ms 81.354ms 7.84% 32.000us 32.000us 1 []
item 6.66% 81.353ms 6.66% 81.353ms 81.353ms 7.84% 32.000us 32.000us 1 []
item 6.66% 81.351ms 6.66% 81.351ms 81.351ms 7.84% 32.000us 32.000us 1 []
item 6.66% 81.349ms 6.66% 81.349ms 81.349ms 7.84% 32.000us 32.000us 1 []
item 6.66% 81.349ms 6.66% 81.349ms 81.349ms 7.84% 32.000us 32.000us 1 []
_local_scalar_dense 6.66% 81.348ms 6.66% 81.348ms 81.348ms 7.84% 32.000us 32.000us 1 []
item 6.66% 81.347ms 6.66% 81.347ms 81.347ms 0.00% 0.000us 0.000us 1 []
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Self CPU time total: 1.222s
CUDA time total: 408.000us
Please note that this was run on a very small subset of the data.