Strange behavior in Pytorch

@ptrblck I tried with torch.utils.bottleneck and following is the output:
Note: I ran the script for 1 epoch only, as you suggested here, and this is printed without starting the training (i.e., before the problem occurred).

--------------------------------------------------------------------------------
  Environment Summary
--------------------------------------------------------------------------------
PyTorch 1.9.0+cu111 DEBUG compiled w/ CUDA 11.1
Running with Python 3.6 and 

`pip3 list` truncated output:
numpy==1.18.1
torch==1.9.0+cu111
torchaudio==0.9.0
torchvision==0.10.0+cu111
--------------------------------------------------------------------------------
  cProfile output
--------------------------------------------------------------------------------
         15127188 function calls (15111211 primitive calls) in 8.870 seconds

   Ordered by: internal time
   List reduced from 2747 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      289    3.666    0.013    3.666    0.013 {method 'cuda' of 'torch._C._TensorBase' objects}
  1263775    1.174    0.000    1.905    0.000 /usr/lib/python3.6/posixpath.py:75(join)
        1    1.069    1.069    4.304    4.304 /DatasetLoader.py:114(__init__)
  2442625    0.395    0.000    0.395    0.000 {method 'split' of 'str' objects}
  1264426    0.223    0.000    0.365    0.000 /usr/lib/python3.6/posixpath.py:41(_get_sep)
  2560595    0.171    0.000    0.171    0.000 {method 'append' of 'list' objects}
1304392/1303795    0.149    0.000    0.155    0.000 {built-in method builtins.isinstance}
   173823    0.144    0.000    0.151    0.000 /usr/lib/python3.6/glob.py:114(_iterdir)
  1265133    0.128    0.000    0.128    0.000 {method 'endswith' of 'str' objects}
  1270598    0.127    0.000    0.127    0.000 {method 'startswith' of 'str' objects}
        1    0.117    0.117    0.288    0.288 /DatasetLoader.py:129(<listcomp>)
  1268919    0.113    0.000    0.113    0.000 {built-in method posix.fspath}
        2    0.110    0.055    0.115    0.057 {method 'readlines' of '_io._IOBase' objects}
  1092599    0.091    0.000    0.091    0.000 {method 'strip' of 'str' objects}
        1    0.066    0.066    0.775    0.775 /DatasetLoader.py:62(__init__)


--------------------------------------------------------------------------------
  autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

------------------  ------------  ------------  ------------  ------------  ------------  ------------  
              Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
------------------  ------------  ------------  ------------  ------------  ------------  ------------  
     aten::normal_         9.89%       1.233ms         9.89%       1.233ms       1.233ms             1  
     aten::normal_         9.61%       1.198ms         9.61%       1.198ms       1.198ms             1  
     aten::normal_         9.54%       1.190ms         9.54%       1.190ms       1.190ms             1  
     aten::normal_         9.48%       1.182ms         9.48%       1.182ms       1.182ms             1  
     aten::normal_         9.42%       1.174ms         9.42%       1.174ms       1.174ms             1  
    aten::uniform_         6.81%     848.913us         6.81%     848.913us     848.913us             1  
    aten::uniform_         6.35%     791.197us         6.35%     791.197us     791.197us             1  
    aten::uniform_         6.26%     780.781us         6.26%     780.781us     780.781us             1  
    aten::uniform_         5.96%     743.103us         5.96%     743.103us     743.103us             1  
    aten::uniform_         5.94%     740.487us         5.94%     740.487us     740.487us             1  
    aten::uniform_         5.64%     703.088us         5.64%     703.088us     703.088us             1  
     aten::normal_         5.29%     660.099us         5.29%     660.099us     660.099us             1  
         aten::max         3.30%     411.601us         3.43%     427.419us     427.419us             1  
     aten::normal_         3.29%     410.359us         3.29%     410.359us     410.359us             1  
     aten::normal_         3.21%     400.519us         3.21%     400.519us     400.519us             1  
------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 12.467ms

--------------------------------------------------------------------------------
  autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

	Because the autograd profiler uses the CUDA event API,
	the CUDA time column reports approximately max(cuda_time, cpu_time).
	Please ignore this output if your code does not use CUDA.

------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
              Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
     aten::normal_         9.13%       1.198ms         9.13%       1.198ms       1.198ms       1.198ms         9.13%       1.198ms       1.198ms             1  
     aten::normal_         8.93%       1.172ms         8.93%       1.172ms       1.172ms       1.173ms         8.94%       1.173ms       1.173ms             1  
     aten::normal_         8.93%       1.172ms         8.93%       1.172ms       1.172ms       1.171ms         8.93%       1.171ms       1.171ms             1  
     aten::normal_         8.86%       1.163ms         8.86%       1.163ms       1.163ms       1.163ms         8.86%       1.163ms       1.163ms             1  
     aten::normal_         8.85%       1.161ms         8.85%       1.161ms       1.161ms       1.162ms         8.85%       1.162ms       1.162ms             1  
    aten::uniform_         7.56%     992.395us         7.56%     992.395us     992.395us     991.776us         7.56%     991.776us     991.776us             1  
    aten::uniform_         7.36%     966.001us         7.36%     966.001us     966.001us     966.240us         7.36%     966.240us     966.240us             1  
    aten::uniform_         7.30%     958.179us         7.30%     958.179us     958.179us     957.120us         7.29%     957.120us     957.120us             1  
    aten::uniform_         7.29%     956.692us         7.29%     956.692us     956.692us     957.760us         7.30%     957.760us     957.760us             1  
    aten::uniform_         7.10%     932.259us         7.10%     932.259us     932.259us     929.216us         7.08%     929.216us     929.216us             1  
    aten::uniform_         5.43%     711.949us         5.43%     711.949us     711.949us     712.384us         5.43%     712.384us     712.384us             1  
     aten::normal_         4.34%     569.383us         4.34%     569.383us     569.383us     569.248us         4.34%     569.248us     569.248us             1  
    aten::uniform_         3.49%     457.421us         3.49%     457.421us     457.421us     457.216us         3.48%     457.216us     457.216us             1  
         aten::max         2.48%     324.992us         2.95%     387.751us     387.751us     328.032us         2.50%     387.328us     387.328us             1  
     aten::normal_         2.95%     386.638us         2.95%     386.638us     386.638us     387.072us         2.95%     387.072us     387.072us             1  
------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 13.123ms
Self CUDA time total: 13.123ms