This is a blocking issue for my research, since it increases the training time by an unacceptable amount.
(former training time: 15h-18h, current: 45h-52h)
I used perf_counter to measure the time needed for each line of the above code and realized that backpropagation takes ~10x the time of the forward pass. I know that backprop is expected to take longer than the forward pass, but I believe this difference is too large.
The tests
Thinking that it might be an infrastructure problem, I also ran my code in a Colab environment. The result was more or less the same. Here are the timings of the training steps for the first 7 images from the first epoch of both trainings.
I would love to. I do have a repro on Colab, but it depends on my training set, which I access over my own Google Drive, so I cannot share it yet. I will try to find a way soon, and I am certainly open to advice on that.
Here is some good news though: I was able to eliminate the slowdown, and my training is now as fast as before. I managed to fix the problem using one of my old environment.yml files. I had already checked the hardware, the data, Ubuntu, and conda itself (by reinstalling all of them), but it was my environment after all.
What this means to me is that some library (or combination of libraries) causes the training to be roughly 3x slower (15-18h vs. 45-52h). I might be able to pin down the culprit by gradually upgrading my environment. I will investigate further once I am done with my current project.
By the way, I want to emphasize that the problem persisted when I reproduced it on Colab. This might mean something interesting is going on here: something in one of the up-to-date libraries may be slowing down the training of CSRNet.
Skimming through both environments, I think the main differences are:
Old environment:
pytorch=0.4.0=py36hdf912b8_0
cuda90=1.0=h6433d27_0 and cudatoolkit=9.0=h13b8566_0
cudnn=7.1.2=cuda9.0_0
New environment:
pytorch=1.4.0=py3.8_cuda10.1.243_cudnn7.6.3_0
cudatoolkit=10.1.243=h6bb024c_0
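Rather than eyeballing the two files, the environments above can be diffed programmatically. A minimal sketch (not from the original post) that assumes the standard conda `name=version=build` dependency format in an environment.yml:

```python
# Sketch: compare two conda environment.yml files to spot version drift.
# Assumes dependencies are listed as "- name=version=build" lines.

def parse_deps(path):
    """Return {package: version} from an environment.yml dependencies list."""
    deps = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("- ") and "=" in line:
                name, version = line[2:].split("=")[:2]
                deps[name] = version
    return deps

def diff_envs(old_path, new_path):
    """Return {package: (old_version, new_version)} for every package
    that is missing from one file or has a different version."""
    old, new = parse_deps(old_path), parse_deps(new_path)
    return {
        pkg: (old.get(pkg), new.get(pkg))
        for pkg in sorted(set(old) | set(new))
        if old.get(pkg) != new.get(pkg)
    }
```

Usage would be something like `diff_envs("old_env.yml", "new_env.yml")`; bisecting the resulting list of changed packages (upgrading half at a time) should narrow down the culprit faster than upgrading one package per run.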
Besides the older PyTorch, CUDA, and cudnn versions, there were also two CUDA packages installed in the old environment, and I'm not sure how they interact with each other.
Also, did you add synchronizations to your code when profiling manually?
Since CUDA operations are asynchronous, you would have to synchronize via torch.cuda.synchronize() before starting and stopping the timer.
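To illustrate why unsynchronized timing is misleading, here is a pure-Python analogy (no GPU or PyTorch required): a background thread stands in for the asynchronous CUDA stream, and joining it plays the role of `torch.cuda.synchronize()`.

```python
import time
import threading

def launch(work):
    """Start work asynchronously, like an async CUDA kernel launch:
    this call returns immediately, before the work is done."""
    t = threading.Thread(target=work)
    t.start()
    return t

def gpu_like_work():
    time.sleep(0.2)  # stands in for a ~200 ms kernel

# Naive timing: stop the clock without waiting -> the work looks nearly free.
a = time.perf_counter()
handle = launch(gpu_like_work)
naive = time.perf_counter() - a

# Correct timing: "synchronize" (here: join) before stopping the clock.
handle.join()
synced = time.perf_counter() - a

print(f"naive: {naive:.3f}s, synchronized: {synced:.3f}s")
```

The naive measurement only captures the launch overhead; the time the work actually takes gets silently attributed to whatever operation synchronizes next, which is often `loss.backward()` or a `.item()` call.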
Thank you for your attention. I agree with you on the main differences. I am also surprised that older versions of those core libraries make my code run faster.
Here is my measurement code. When reporting the results, I summarized them by discarding insignificant timings (< 0.01 s) for the sake of simplicity.
torch.cuda.synchronize()  # flush pending CUDA work so the first timer starts clean
a = time.perf_counter()
img = img.cuda()
img = Variable(img)
target = target.type(torch.FloatTensor).unsqueeze(0).cuda()
target = Variable(target)
torch.cuda.synchronize()
b = time.perf_counter()
print('data conversion {:.3f}s '.format(b - a), end='')
a = time.perf_counter()
output = model(img)
torch.cuda.synchronize()
b = time.perf_counter()
print('model forward pass {:.3f}s '.format(b - a), end='')
a = time.perf_counter()
loss = criterion(output, target)
losses.update(loss.item(), img.size(0))
torch.cuda.synchronize()
b = time.perf_counter()
print('loss {:.3f}s '.format(b - a), end='')
a = time.perf_counter()
optimizer.zero_grad()
torch.cuda.synchronize()
b = time.perf_counter()
print('zerograd {:.3f}s '.format(b - a), end='')
a = time.perf_counter()
loss.backward()  # optimizer.step() was duplicated here; it is timed separately below
torch.cuda.synchronize()
b = time.perf_counter()
print('backprop {:.3f}s '.format(b - a), end='')
a = time.perf_counter()
optimizer.step()
torch.cuda.synchronize()
b = time.perf_counter()
print('step {:.3f}s'.format(b - a))
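The repeated `a = perf_counter()` / `synchronize()` / `print()` pattern above could be wrapped in a small context manager. A hypothetical helper (not from the original post) with an optional sync callback, so `torch.cuda.synchronize` can be passed in when timing CUDA code:

```python
import time

class Timed:
    """Context manager that times a block. If sync is given (e.g.
    torch.cuda.synchronize), it is called before starting and before
    stopping the clock so pending async work is included."""

    def __init__(self, label, sync=None):
        self.label, self.sync, self.elapsed = label, sync, None

    def __enter__(self):
        if self.sync is not None:
            self.sync()
        self.start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        if self.sync is not None:
            self.sync()
        self.elapsed = time.perf_counter() - self.start
        print(f"{self.label} {self.elapsed:.3f}s ", end="")
        return False
```

With that in place, each section becomes e.g. `with Timed("backprop", sync=torch.cuda.synchronize): loss.backward()`.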
Here is what the timing looked like originally, on my local machine:
data conversion 0.001s model forward pass 0.045s loss 0.001s zerograd 0.000s backprop 0.496s step 0.002s
data conversion 0.002s model forward pass 0.067s loss 0.000s zerograd 0.000s backprop 0.762s step 0.003s
data conversion 0.000s model forward pass 0.015s loss 0.000s zerograd 0.000s backprop 0.115s step 0.002s
data conversion 0.001s model forward pass 0.071s loss 0.000s zerograd 0.000s backprop 0.759s step 0.003s
data conversion 0.000s model forward pass 0.015s loss 0.000s zerograd 0.000s backprop 0.133s step 0.003s
data conversion 0.001s model forward pass 0.058s loss 0.000s zerograd 0.000s backprop 0.589s step 0.003s