As the title may already suggest, I noticed that whenever any of my model definitions uses convolution, the training process becomes extremely slow, but I’m sure that the same model with the same training script used to take far less time…
I understand that 2D convolution can be an expensive operation, but as I said, I didn’t have this problem before.
Just to give you an example, I have a model that is basically composed of convolution blocks like this one:
nn.Sequential(
    nn.Conv2d(layer_sizes[i-1], layer_sizes[i], kernel_size=1, bias=False),
    nn.BatchNorm2d(layer_sizes[i]),
    nn.LeakyReLU(negative_slope=0.2)
)
followed by some other nn.Conv1d layers (with normalization and activation as in the block above).
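Roughly, such a Conv1d block looks like this (simplified sketch; conv1d_block and the channel arguments are just illustrative names, not my exact code):

import torch.nn as nn

def conv1d_block(in_channels, out_channels):
    # illustrative sketch: same 1x1 conv + norm + activation pattern as above
    return nn.Sequential(
        nn.Conv1d(in_channels, out_channels, kernel_size=1, bias=False),
        nn.BatchNorm1d(out_channels),
        nn.LeakyReLU(negative_slope=0.2)
    )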
I was able to train this model in 1 minute and 45 seconds per epoch, including the batch loop with forward pass, backward pass, and some other small computations related to the training, but now the same piece of code takes 18 minutes… (obviously I’m training the model on the GPU)
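For what it’s worth, I measure the epoch time roughly like this (train_loader and train_step are placeholders for my actual loop):

import time
import torch

torch.cuda.synchronize()               # flush pending GPU work before timing
start = time.perf_counter()
for batch in train_loader:             # placeholder DataLoader
    train_step(batch)                  # forward + backward + optimizer step
torch.cuda.synchronize()               # wait for the last kernels to finish
print(f"epoch time: {time.perf_counter() - start:.1f}s")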
I’m having this problem with PyTorch 2.4.1+cu124 and NVIDIA driver 551.78. I’m not sure if this is related or not, but I also have the CUDA toolkit 12.4 installed, and I’m working on a Windows machine.
Can anyone tell me what is causing such slowness, or what I should check?
Could you profile the code in the fast and slower environments and check the kernel execution times, e.g. via the native profiler or Nsight Systems?
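Something along these lines with the native profiler would work (model and inputs stand in for your actual setup):

import torch
from torch.profiler import profile, ProfilerActivity

# minimal sketch: profile one forward pass on CPU and CUDA
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    out = model(inputs)
    torch.cuda.synchronize()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))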
I’m not entirely sure what you are asking for; could you please point me to some reference on profiling?
“code in the fast and slower environment”
The problem is happening in a single environment; that’s in fact the main point of my question. Why am I encountering this behavior? I’m no longer able to train the model as fast as before.
@ptrblck If this is what you are asking for, this is the profile for a single batch iteration:
with torch.autograd.profiler.profile(use_cuda=True, record_shapes=True) as prof:
    output = model(batch)  # simplified: forward pass for a single batch
-------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
-------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::batch_norm 0.03% 221.300us 2.07% 15.332ms 1.022ms 219.000us 0.03% 300.222ms 20.015ms 15
aten::_batch_norm_impl_index 0.05% 334.000us 2.04% 15.111ms 1.007ms 181.000us 0.02% 300.003ms 20.000ms 15
aten::cudnn_batch_norm 0.96% 7.124ms 1.34% 9.897ms 761.331us 285.315ms 38.43% 294.883ms 22.683ms 13
aten::topk 22.86% 169.234ms 22.86% 169.234ms 42.309ms 100.277ms 13.51% 100.277ms 25.069ms 4
aten::matmul 0.05% 362.800us 6.77% 50.110ms 8.352ms 222.000us 0.03% 83.584ms 13.931ms 6
aten::bmm 6.17% 45.652ms 6.17% 45.652ms 9.130ms 80.494ms 10.84% 80.494ms 16.099ms 5
aten::convolution 0.04% 271.500us 7.26% 53.763ms 3.840ms 79.000us 0.01% 60.809ms 4.343ms 14
aten::_convolution 0.08% 586.500us 7.23% 53.491ms 3.821ms 208.000us 0.03% 60.730ms 4.338ms 14
aten::cudnn_convolution 7.10% 52.578ms 7.10% 52.578ms 3.756ms 60.095ms 8.09% 60.095ms 4.293ms 14
aten::conv2d 0.01% 81.300us 5.98% 44.294ms 6.328ms 17.000us 0.00% 56.539ms 8.077ms 7
-------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 740.250ms
Self CUDA time total: 742.386ms
And this one is for a full epoch:
with torch.autograd.profiler.profile(use_cuda=True, record_shapes=True) as prof:
    ...  # forward-only loop over the whole epoch (shown below)
100%|██████████| 1517/1517 [09:27<00:00, 2.67it/s]
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::batch_norm 0.13% 559.534ms 3.79% 15.902s 698.815us 264.701ms 0.05% 429.567s 18.878ms 22755
aten::_batch_norm_impl_index 0.12% 516.742ms 3.66% 15.342s 674.225us 382.006ms 0.07% 429.302s 18.866ms 22755
aten::cudnn_batch_norm 2.37% 9.968s 3.48% 14.618s 741.245us 425.741s 76.61% 428.051s 21.705ms 19721
aten::topk 0.16% 682.746ms 0.16% 682.746ms 112.516us 20.922s 3.76% 20.922s 3.448ms 6068
aten::convolution 0.27% 1.115s 3.41% 14.329s 674.702us 59.594ms 0.01% 18.548s 873.326us 21238
aten::_convolution 2.57% 10.804s 3.15% 13.215s 622.223us 253.331ms 0.05% 18.488s 870.520us 21238
aten::leaky_relu 0.12% 491.424ms 0.12% 491.424ms 21.596us 17.673s 3.18% 17.673s 776.669us 22755
aten::max 0.08% 355.966ms 0.09% 368.209ms 40.454us 17.145s 3.09% 17.165s 1.886ms 9102
aten::cudnn_convolution 0.27% 1.141s 0.27% 1.141s 53.743us 17.073s 3.07% 17.073s 803.882us 21238
aten::conv2d 0.03% 116.006ms 0.18% 748.824ms 70.517us 28.624ms 0.01% 15.054s 1.418ms 10619
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 419.744s
Self CUDA time total: 555.749s
I obtained those profiles on the forward pass only; I removed the backward pass and all the other computations I would normally perform in my training loop.
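For completeness, the loop I profiled looks roughly like this (train_loader and the batch handling are placeholders for my actual setup):

import torch

with torch.autograd.profiler.profile(use_cuda=True, record_shapes=True) as prof:
    for batch in train_loader:             # placeholder DataLoader
        output = model(batch.to("cuda"))   # forward pass only, no backward
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))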