Loss backward is slower than expected when there are multiple models on the GPU

aminshabani · August 21, 2021, 6:39pm

Hi,

I have several small models, all on the same GPUs (via nn.DataParallel), and I am training them in two ways:

Forward one model and backward the loss:

output = models[0](x)
loss = mse_loss(output, target)
optimizer[0].zero_grad()
tic = time.time()
loss.backward()
print(time.time()-tic)
optimizer[0].step()

Forward one model, forward other models with torch.no_grad(), and backward the loss:

output = models[0](x)
with torch.no_grad():
    tmp_output = [models[i](x) for i in range(100)] 
loss = mse_loss(output, target)
tic = time.time()
loss.backward()
print(time.time()-tic)
optimizer[0].step()

Since the loss is independent of tmp_output, I would expect loss.backward() to take the same time in both cases. However, the second one is significantly slower. Am I missing something here? I have also checked the generated graphs with torchviz, and the graphs of the loss are the same in both cases.

arman-yekkehkhani · August 21, 2021, 6:57pm

Hi, how did you measure the execution time of loss.backward()? You do a considerable amount of computation in [models[i](x) for i in range(100)], which I guess is the source of the longer execution time.

aminshabani · August 21, 2021, 7:00pm

Thank you for your reply. Yes, this part needs to be clearer: I am only measuring the time for the loss.backward() line. I will edit the post.

arman-yekkehkhani · August 21, 2021, 8:17pm

I tried some code on both the CPU and a single GPU, and the results were as expected: both cases take the same amount of time. However, I should point out that the difference you see might be due to the first-time allocation of tensors. Also, wrapping a single call in time.time() is not a reliable way to measure execution time; try something like timeit. If the issue persists, please post a detailed demo of your code (you can replace the model with a simple network and use arbitrary input and target), along with the output.
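
A note on the measurement discussed above: CUDA kernels are launched asynchronously, so on a GPU a time.time() pair around loss.backward() can also count work that was queued earlier, such as the 100 no_grad forward passes. Whether that explains the gap reported here is not confirmed in this thread. Below is a minimal sketch of a timer that synchronizes before and after the timed region. The helper name `timed` and its parameters are hypothetical, and `sync` is assumed to be a blocking call such as torch.cuda.synchronize.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results, sync=None):
    """Append the wall-clock seconds of the enclosed block to results[label].

    sync is an optional zero-argument callable that blocks until pending
    device work has finished (assumption: e.g. torch.cuda.synchronize).
    """
    if sync is not None:
        sync()  # drain work queued before the timed region starts
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()  # wait for work launched inside the timed region
        results.setdefault(label, []).append(time.perf_counter() - start)


# Hypothetical usage with the snippets from the question (requires torch):
#
#   results = {}
#   with timed("backward", results, sync=torch.cuda.synchronize):
#       loss.backward()
#   print(results["backward"])
```

Running each variant for several iterations and discarding the first few measurements also helps separate one-time costs, such as the first-time allocation mentioned above, from the steady-state time.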