@ptrblck I tried torch.utils.bottleneck, and the following is the output.
Note: as you suggested here, I ran the script for only 1 epoch; everything below was printed before training started (i.e., before the problem occurred).
--------------------------------------------------------------------------------
Environment Summary
--------------------------------------------------------------------------------
PyTorch 1.9.0+cu111 DEBUG compiled w/ CUDA 11.1
Running with Python 3.6 and
`pip3 list` truncated output:
numpy==1.18.1
torch==1.9.0+cu111
torchaudio==0.9.0
torchvision==0.10.0+cu111
--------------------------------------------------------------------------------
cProfile output
--------------------------------------------------------------------------------
15127188 function calls (15111211 primitive calls) in 8.870 seconds
Ordered by: internal time
List reduced from 2747 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
289 3.666 0.013 3.666 0.013 {method 'cuda' of 'torch._C._TensorBase' objects}
1263775 1.174 0.000 1.905 0.000 /usr/lib/python3.6/posixpath.py:75(join)
1 1.069 1.069 4.304 4.304 /DatasetLoader.py:114(__init__)
2442625 0.395 0.000 0.395 0.000 {method 'split' of 'str' objects}
1264426 0.223 0.000 0.365 0.000 /usr/lib/python3.6/posixpath.py:41(_get_sep)
2560595 0.171 0.000 0.171 0.000 {method 'append' of 'list' objects}
1304392/1303795 0.149 0.000 0.155 0.000 {built-in method builtins.isinstance}
173823 0.144 0.000 0.151 0.000 /usr/lib/python3.6/glob.py:114(_iterdir)
1265133 0.128 0.000 0.128 0.000 {method 'endswith' of 'str' objects}
1270598 0.127 0.000 0.127 0.000 {method 'startswith' of 'str' objects}
1 0.117 0.117 0.288 0.288 /DatasetLoader.py:129(<listcomp>)
1268919 0.113 0.000 0.113 0.000 {built-in method posix.fspath}
2 0.110 0.055 0.115 0.057 {method 'readlines' of '_io._IOBase' objects}
1092599 0.091 0.000 0.091 0.000 {method 'strip' of 'str' objects}
1 0.066 0.066 0.775 0.775 /DatasetLoader.py:62(__init__)
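As an aside, the cProfile table above (dominated by posixpath.join and str.split calls inside DatasetLoader.__init__) can be reproduced outside of bottleneck with the stdlib alone. This is a minimal sketch, assuming a hypothetical build_file_index stand-in for the real dataset-indexing work; it applies the same "Ordered by: internal time" sort and 15-row restriction shown in the output:

```python
import cProfile
import io
import os.path
import pstats

def build_file_index(n=10000):
    """Hypothetical stand-in for DatasetLoader.__init__'s path indexing."""
    files = []
    for i in range(n):
        path = os.path.join("/data", "speaker", f"{i}.wav")
        files.append(path.split("/")[-1])
    return files

pr = cProfile.Profile()
pr.enable()
build_file_index()
pr.disable()

buf = io.StringIO()
# Same sort key ("internal time") and <15> restriction as the table above.
pstats.Stats(pr, stream=buf).sort_stats("tottime").print_stats(15)
print(buf.getvalue())
```

Profiling the loader in isolation like this makes it easier to tell whether the startup cost is in path handling or in the .cuda() transfers that top the real table.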
--------------------------------------------------------------------------------
autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
------------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::normal_ 9.89% 1.233ms 9.89% 1.233ms 1.233ms 1
aten::normal_ 9.61% 1.198ms 9.61% 1.198ms 1.198ms 1
aten::normal_ 9.54% 1.190ms 9.54% 1.190ms 1.190ms 1
aten::normal_ 9.48% 1.182ms 9.48% 1.182ms 1.182ms 1
aten::normal_ 9.42% 1.174ms 9.42% 1.174ms 1.174ms 1
aten::uniform_ 6.81% 848.913us 6.81% 848.913us 848.913us 1
aten::uniform_ 6.35% 791.197us 6.35% 791.197us 791.197us 1
aten::uniform_ 6.26% 780.781us 6.26% 780.781us 780.781us 1
aten::uniform_ 5.96% 743.103us 5.96% 743.103us 743.103us 1
aten::uniform_ 5.94% 740.487us 5.94% 740.487us 740.487us 1
aten::uniform_ 5.64% 703.088us 5.64% 703.088us 703.088us 1
aten::normal_ 5.29% 660.099us 5.29% 660.099us 660.099us 1
aten::max 3.30% 411.601us 3.43% 427.419us 427.419us 1
aten::normal_ 3.29% 410.359us 3.29% 410.359us 410.359us 1
aten::normal_ 3.21% 400.519us 3.21% 400.519us 400.519us 1
------------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 12.467ms
--------------------------------------------------------------------------------
autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
Because the autograd profiler uses the CUDA event API,
the CUDA time column reports approximately max(cuda_time, cpu_time).
Please ignore this output if your code does not use CUDA.
------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::normal_ 9.13% 1.198ms 9.13% 1.198ms 1.198ms 1.198ms 9.13% 1.198ms 1.198ms 1
aten::normal_ 8.93% 1.172ms 8.93% 1.172ms 1.172ms 1.173ms 8.94% 1.173ms 1.173ms 1
aten::normal_ 8.93% 1.172ms 8.93% 1.172ms 1.172ms 1.171ms 8.93% 1.171ms 1.171ms 1
aten::normal_ 8.86% 1.163ms 8.86% 1.163ms 1.163ms 1.163ms 8.86% 1.163ms 1.163ms 1
aten::normal_ 8.85% 1.161ms 8.85% 1.161ms 1.161ms 1.162ms 8.85% 1.162ms 1.162ms 1
aten::uniform_ 7.56% 992.395us 7.56% 992.395us 992.395us 991.776us 7.56% 991.776us 991.776us 1
aten::uniform_ 7.36% 966.001us 7.36% 966.001us 966.001us 966.240us 7.36% 966.240us 966.240us 1
aten::uniform_ 7.30% 958.179us 7.30% 958.179us 958.179us 957.120us 7.29% 957.120us 957.120us 1
aten::uniform_ 7.29% 956.692us 7.29% 956.692us 956.692us 957.760us 7.30% 957.760us 957.760us 1
aten::uniform_ 7.10% 932.259us 7.10% 932.259us 932.259us 929.216us 7.08% 929.216us 929.216us 1
aten::uniform_ 5.43% 711.949us 5.43% 711.949us 711.949us 712.384us 5.43% 712.384us 712.384us 1
aten::normal_ 4.34% 569.383us 4.34% 569.383us 569.383us 569.248us 4.34% 569.248us 569.248us 1
aten::uniform_ 3.49% 457.421us 3.49% 457.421us 457.421us 457.216us 3.48% 457.216us 457.216us 1
aten::max 2.48% 324.992us 2.95% 387.751us 387.751us 328.032us 2.50% 387.328us 387.328us 1
aten::normal_ 2.95% 386.638us 2.95% 386.638us 386.638us 387.072us 2.95% 387.072us 387.072us 1
------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 13.123ms
Self CUDA time total: 13.123ms
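For what it's worth, the header note in the CUDA-mode section explains why the CPU and CUDA columns track each other so closely here: per-event CUDA time is reported as roughly the larger of the two measurements. A minimal sketch of that relationship (times in microseconds, using the top aten::normal_ row from the table above):

```python
def reported_cuda_time(cpu_time_us, cuda_time_us):
    # Per the bottleneck header, the CUDA column reports
    # approximately max(cuda_time, cpu_time) for each event.
    return max(cpu_time_us, cuda_time_us)

# Top aten::normal_ row above: CPU 1198us, CUDA 1198us
print(reported_cuda_time(1198.0, 1198.0))  # 1198.0
```

So identical "Self CPU time total" and "Self CUDA time total" values do not necessarily mean the GPU did 13ms of work; they can simply reflect CPU-bound launches of these init ops.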