nn.Conv is slower than before

As the title suggests, I noticed that whenever one of my model definitions uses convolutions, training becomes extremely slow, yet I'm sure the same model with the same training script used to take far less time…
I understand that 2D convolution can be an expensive operation, but as I said, I didn't have this problem before.

Just to give you an example, I have a model that is basically composed of convolution blocks like this one:

nn.Sequential(
  nn.Conv2d(layer_sizes[i-1], layer_sizes[i], kernel_size=1, bias=False),
  nn.BatchNorm2d(layer_sizes[i]),
  nn.LeakyReLU(negative_slope=0.2)
)
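
In case it's useful, the blocks are stacked roughly like this (the layer_sizes values here are just placeholders, not my real ones):

import torch.nn as nn

layer_sizes = [3, 64, 128, 256]  # placeholder channel counts

blocks = nn.Sequential(*[
    nn.Sequential(
        nn.Conv2d(layer_sizes[i - 1], layer_sizes[i], kernel_size=1, bias=False),
        nn.BatchNorm2d(layer_sizes[i]),
        nn.LeakyReLU(negative_slope=0.2),
    )
    for i in range(1, len(layer_sizes))
])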

followed by some nn.Conv1d layers (each with a norm and activation like the block above).
I used to train this model in 1 minute and 45 seconds per epoch, including the batch loop with the forward pass, the backward pass, and some other small computations related to training, but now the same piece of code takes 18 minutes… (obviously I'm training the model on the GPU)

I'm having this problem with PyTorch 2.4.1+cu124 and NVIDIA driver 551.78. I'm not sure whether it's related, but I also have the CUDA toolkit 12.4 installed, and I'm working on a Windows machine.

Can anyone tell me what is causing such slowness, or what I should check?
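
For completeness, the version info above can be double-checked from Python like this (standard torch attributes only):

import torch

print(torch.__version__)                # e.g. 2.4.1+cu124
print(torch.version.cuda)               # CUDA runtime PyTorch was built against
print(torch.backends.cudnn.version())   # cuDNN version actually in use
print(torch.cuda.is_available())        # whether the GPU is visible at all
print(torch.cuda.get_device_name(0))    # GPU model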

Could you profile the code in the fast and slow environments and check the kernel execution times, e.g. via the native profiler or Nsight Systems?

I'm not entirely sure what you are asking for; could you please give me some reference for profiling?

code in the fast and slow environments

The problem is happening in a single environment; that is in fact the main point of my question. Why am I encountering this behavior? I'm no longer able to train the model as fast as before.

@ptrblck If this is what you were asking for, here is the profile for a single batch iteration:

with torch.autograd.profiler.profile(use_cuda=True, record_shapes=True) as prof:
    out = model(batch)  # single forward pass; model/batch are my usual objects
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                aten::batch_norm         0.03%     221.300us         2.07%      15.332ms       1.022ms     219.000us         0.03%     300.222ms      20.015ms            15  
    aten::_batch_norm_impl_index         0.05%     334.000us         2.04%      15.111ms       1.007ms     181.000us         0.02%     300.003ms      20.000ms            15  
          aten::cudnn_batch_norm         0.96%       7.124ms         1.34%       9.897ms     761.331us     285.315ms        38.43%     294.883ms      22.683ms            13  
                      aten::topk        22.86%     169.234ms        22.86%     169.234ms      42.309ms     100.277ms        13.51%     100.277ms      25.069ms             4  
                    aten::matmul         0.05%     362.800us         6.77%      50.110ms       8.352ms     222.000us         0.03%      83.584ms      13.931ms             6  
                       aten::bmm         6.17%      45.652ms         6.17%      45.652ms       9.130ms      80.494ms        10.84%      80.494ms      16.099ms             5
               aten::convolution         0.04%     271.500us         7.26%      53.763ms       3.840ms      79.000us         0.01%      60.809ms       4.343ms            14
              aten::_convolution         0.08%     586.500us         7.23%      53.491ms       3.821ms     208.000us         0.03%      60.730ms       4.338ms            14
         aten::cudnn_convolution         7.10%      52.578ms         7.10%      52.578ms       3.756ms      60.095ms         8.09%      60.095ms       4.293ms            14
                    aten::conv2d         0.01%      81.300us         5.98%      44.294ms       6.328ms      17.000us         0.00%      56.539ms       8.077ms             7
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 740.250ms
Self CUDA time total: 742.386ms
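
In case it helps, the same prof object can also export a timeline viewable in Chrome's tracing tool (a standard profiler method):

prof.export_chrome_trace("batch_trace.json")  # open in chrome://tracing or Perfetto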

While this one is for a single epoch:

with torch.autograd.profiler.profile(use_cuda=True, record_shapes=True) as prof:
    ...  # the full epoch loop of forward passes runs here (tqdm wraps the loader, see below)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1517/1517 [09:27<00:00,  2.67it/s] 
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                       aten::batch_norm         0.13%     559.534ms         3.79%       15.902s     698.815us     264.701ms         0.05%      429.567s      18.878ms         22755
                           aten::_batch_norm_impl_index         0.12%     516.742ms         3.66%       15.342s     674.225us     382.006ms         0.07%      429.302s      18.866ms         22755
                                 aten::cudnn_batch_norm         2.37%        9.968s         3.48%       14.618s     741.245us      425.741s        76.61%      428.051s      21.705ms         19721
                                             aten::topk         0.16%     682.746ms         0.16%     682.746ms     112.516us       20.922s         3.76%       20.922s       3.448ms          6068
                                      aten::convolution         0.27%        1.115s         3.41%       14.329s     674.702us      59.594ms         0.01%       18.548s     873.326us         21238
                                     aten::_convolution         2.57%       10.804s         3.15%       13.215s     622.223us     253.331ms         0.05%       18.488s     870.520us         21238
                                       aten::leaky_relu         0.12%     491.424ms         0.12%     491.424ms      21.596us       17.673s         3.18%       17.673s     776.669us         22755
                                              aten::max         0.08%     355.966ms         0.09%     368.209ms      40.454us       17.145s         3.09%       17.165s       1.886ms          9102
                                aten::cudnn_convolution         0.27%        1.141s         0.27%        1.141s      53.743us       17.073s         3.07%       17.073s     803.882us         21238
                                           aten::conv2d         0.03%     116.006ms         0.18%     748.824ms      70.517us      28.624ms         0.01%       15.054s       1.418ms         10619
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 419.744s
Self CUDA time total: 555.749s

I obtained those profiles on the forward pass only; I removed the backward pass and all the other computations I would normally perform in my training loop.
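
For reference, this is roughly how I would extend the same profiling to also cover the backward pass (the model and loss here are toy stand-ins, not my real ones):

import torch
import torch.nn as nn

# Toy stand-ins so the snippet is self-contained; profile your own model/data instead.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=1, bias=False),
    nn.BatchNorm2d(8),
    nn.LeakyReLU(negative_slope=0.2),
).cuda()
inputs = torch.randn(16, 3, 64, 64, device="cuda")

with torch.autograd.profiler.profile(use_cuda=True, record_shapes=True) as prof:
    out = model(inputs)   # forward pass
    out.sum().backward()  # dummy loss so backward kernels show up in the table too
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))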