Backward computation 10x slower than Forward

Hi,

I have been trying to debug why training a CNN regressor with the following training code takes so long:

img = img.cuda()
img = Variable(img)

output = model(img)
target = target.type(torch.FloatTensor).unsqueeze(0).cuda()
target = Variable(target)

loss = criterion(output, target)

losses.update(loss.item(), img.size(0))

optimizer.zero_grad()
loss.backward()
optimizer.step()

The problem

This is a blocking issue for my research, since it increases the training time to an unacceptable degree
(former training time: 15h-18h, current: 45h-52h).

I used perf_counter to measure the time needed for each line of the above code and realized that backpropagation takes ~10x the time of the forward pass. I know backprop is expected to take longer than the forward pass, but I believe this difference is too large.

The tests

Thinking that it might be an infrastructure problem, I also ran my code in a Colab environment. The result was more or less the same. Here are the timings of the training steps for the first 7 images of the first epoch of both runs.

colab              local                   
forward  backward  forward  backward                                      
0.077s   0.818s    0.071s   0.699s         
0.025s   0.238s    0.095s   1.043s         
0.071s   0.751s    0.066s   0.751s         
0.046s   0.486s    0.063s   0.685s         
0.073s   0.778s    0.060s   0.619s         
0.039s   0.385s    0.015s   0.139s         
0.077s   0.818s    0.019s   0.176s         

I appreciate any help in finding or rectifying the flaw.

Hi,

It is hard to say without more information.
Can you share a small script that reproduces this on Colab so that we can inspect it in more detail?

Hi Alban,

Thank you very much for reaching out.

I would love to. I do have a repro on Colab, but it cannot run without my training set, which I access through my own Google Drive. I will try to find a way to share it soon; I am open to advice on that.

Here is some good news though: I was able to eliminate the slowdown, and my training is now as fast as before. I managed to fix the problem using one of my old environment.yml files. I had already checked the hardware, data, Ubuntu, and conda itself (by reinstalling all of them), but it was my environment after all.

What this means to me is that there is a library (or several libraries) that makes the training 5x slower. I might be able to pinpoint the culprit by gradually upgrading my environment; I will investigate further once I am done with my current project.

Here are links to my old and new environments.

Btw, I want to emphasize that the problem persisted when I tried to get a repro in Colab. This might mean something interesting is going on here: something in one of the up-to-date libraries might be slowing down the training of CSRNet.

Skimming through both environments, I think the main differences are:

Old environment:

  • pytorch=0.4.0=py36hdf912b8_0
  • cuda90=1.0=h6433d27_0 and cudatoolkit=9.0=h13b8566_0
  • cudnn=7.1.2=cuda9.0_0

New environment:

  • pytorch=1.4.0=py3.8_cuda10.1.243_cudnn7.6.3_0
  • cudatoolkit=10.1.243=h6bb024c_0

Besides the older PyTorch, CUDA, and cudnn versions, there were also two CUDA packages installed in the old environment, and I'm not sure how they interact with each other.
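As a quick sanity check, a minimal sketch like the one below (only standard PyTorch attributes, run once in each environment) would show which PyTorch, CUDA, and cuDNN versions are actually picked up at runtime:

import torch

# versions this environment actually uses at runtime
print(torch.__version__)                  # PyTorch version, e.g. 0.4.0 vs 1.4.0
print(torch.version.cuda)                 # CUDA toolkit PyTorch was built against
print(torch.backends.cudnn.version())     # cuDNN version in use
print(torch.backends.cudnn.enabled)       # whether cuDNN is enabled at all
print(torch.cuda.get_device_name(0))      # the GPU being used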

Also, did you add synchronizations to your code when profiling manually?
Since CUDA operations are asynchronous, you would have to synchronize via torch.cuda.synchronize() before starting and stopping the timer.
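For reference, a minimal sketch of that pattern (reusing the model/img names from your snippet; the surrounding training loop is assumed) would look like this:

import time
import torch

torch.cuda.synchronize()       # make sure previously queued kernels have finished
start = time.perf_counter()
output = model(img)            # operation being timed
torch.cuda.synchronize()       # wait for the GPU to finish before stopping the timer
print('forward {:.3f}s'.format(time.perf_counter() - start))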

Hi @ptrblck,

thank you for your attention. I agree with you on main differences. I am also surprised about how old version of those core libraries makes my code run faster.

Here is my measurement code. For the sake of simplicity, I summarized the results by discarding the insignificant timings (< 0.01s) when reporting them.

a = time.perf_counter()
img = img.cuda()
img = Variable(img)
target = target.type(torch.FloatTensor).unsqueeze(0).cuda()
target = Variable(target)
torch.cuda.synchronize()
b = time.perf_counter()

print('data conversion {:.3f}s '.format(b - a), end='')

a = time.perf_counter()
output = model(img)
torch.cuda.synchronize()
b = time.perf_counter()

print('model forward pass {:.3f}s '.format(b - a), end='')

a = time.perf_counter()
loss = criterion(output, target)
losses.update(loss.item(), img.size(0))
torch.cuda.synchronize()
b = time.perf_counter()

print('loss {:.3f}s '.format(b - a), end='')

a = time.perf_counter()
optimizer.zero_grad()
torch.cuda.synchronize() 
b = time.perf_counter()

print('zerograd {:.3f}s '.format(b - a), end='')

a = time.perf_counter()
loss.backward()
torch.cuda.synchronize()
b = time.perf_counter()

print('backprop {:.3f}s '.format(b - a), end='')

a = time.perf_counter()
optimizer.step()
torch.cuda.synchronize()
b = time.perf_counter()

print('step {:.3f}s'.format(b - a))

Here is what the timing looked like originally:

local

data conversion 0.001s model forward pass 0.045s loss 0.001s zerograd 0.000s backprop 0.496s step 0.002s
data conversion 0.002s model forward pass 0.067s loss 0.000s zerograd 0.000s backprop 0.762s step 0.003s
data conversion 0.000s model forward pass 0.015s loss 0.000s zerograd 0.000s backprop 0.115s step 0.002s
data conversion 0.001s model forward pass 0.071s loss 0.000s zerograd 0.000s backprop 0.759s step 0.003s
data conversion 0.000s model forward pass 0.015s loss 0.000s zerograd 0.000s backprop 0.133s step 0.003s
data conversion 0.001s model forward pass 0.058s loss 0.000s zerograd 0.000s backprop 0.589s step 0.003s