loss.backward() is slower than expected when there are multiple models on the GPU

Hi,

I have several small models, all on the same GPUs (wrapped in nn.DataParallel), and I am training them in two ways:

  • Forward one model and backward the loss:
output = models[0](x)
loss = mse_loss(output, target)
optimizer[0].zero_grad()
tic = time.time()
loss.backward()
print(time.time()-tic)
optimizer[0].step()
  • Forward one model, forward the other models under torch.no_grad(), and backward the loss:
output = models[0](x)
with torch.no_grad():
    tmp_output = [models[i](x) for i in range(100)] 
loss = mse_loss(output, target)
optimizer[0].zero_grad()
tic = time.time()
loss.backward()
print(time.time()-tic)
optimizer[0].step()

Considering that the loss is independent of tmp_output, I believe the execution time of loss.backward() should be the same in both cases; however, the second one is significantly slower. Am I missing something here? I have also checked the generated graphs using torchviz, and the graphs of the loss are identical in both cases.
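For reference, a minimal self-contained sketch of what I am comparing could look like this (the real models are replaced by small nn.Linear stand-ins, the nn.DataParallel wrapper is omitted, and the sizes, optimizer, and the helper name case are arbitrary placeholders):

import time
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device("cuda")
# Stand-ins for the real models (DataParallel omitted for brevity); sizes are arbitrary.
models = [nn.Linear(64, 64).to(device) for _ in range(100)]
optimizers = [torch.optim.SGD(m.parameters(), lr=0.01) for m in models]

x = torch.randn(256, 64, device=device)
target = torch.randn(256, 64, device=device)

def case(extra_no_grad_forwards):
    output = models[0](x)
    if extra_no_grad_forwards:
        with torch.no_grad():
            tmp_output = [models[i](x) for i in range(100)]
    loss = F.mse_loss(output, target)
    optimizers[0].zero_grad()
    tic = time.time()
    loss.backward()
    elapsed = time.time() - tic
    optimizers[0].step()
    return elapsed

print("case 1 (no extra forwards):   ", case(False))
print("case 2 (100 no_grad forwards):", case(True))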

Hi, how did you measure the execution time of loss.backward()? You have done a considerable amount of computation in [models[i](x) for i in range(100)], which I guess is the source of the longer execution time.

Thank you for your reply. Yes, this part needs to be clearer: I only measure the time for the loss.backward() line itself. I will edit the post.

I tried similar code on both the CPU and a single GPU, and the results were as expected: both cases take the same amount of time. However, I should point out that the difference you see might be due to the first-time allocation of tensors. In addition, your method of measuring execution time is unreliable on the GPU, because CUDA kernels are launched asynchronously; try something like timeit together with torch.cuda.synchronize(). If the issue persists, please post a detailed demo of your code (you can replace the model with a simple network and use arbitrary input and target), plus the output.
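For example, a rough sketch of GPU-safe timing (assuming a single GPU and a toy nn.Linear model standing in for your networks) could look like this; the key points are the torch.cuda.synchronize() calls around the timed region and an untimed warm-up run so one-time allocations are excluded:

import timeit
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device("cuda")
model = nn.Linear(64, 64).to(device)          # toy stand-in for the real model
x = torch.randn(256, 64, device=device)
target = torch.randn(256, 64, device=device)

def measure_backward():
    loss = F.mse_loss(model(x), target)       # untimed forward pass
    torch.cuda.synchronize()                  # make sure the forward has finished
    tic = timeit.default_timer()
    loss.backward()
    torch.cuda.synchronize()                  # wait for the backward kernels to finish
    return timeit.default_timer() - tic

measure_backward()                            # warm-up run, discarded
times = [measure_backward() for _ in range(10)]
print(sum(times) / len(times))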