Hi,

I have several small models, all on the same GPUs (wrapped in nn.DataParallel), and I am training them in two ways:

- Forward pass through one model, then call backward on the loss:

```python
output = models[0](x)
loss = mse_loss(output, target)
optimizer[0].zero_grad()
tic = time.time()
loss.backward()
print(time.time() - tic)
optimizer[0].step()
```

- Forward pass through one model, forward the other models under torch.no_grad(), then call backward on the loss:

```python
output = models[0](x)
with torch.no_grad():
    tmp_output = [models[i](x) for i in range(100)]
loss = mse_loss(output, target)
optimizer[0].zero_grad()
tic = time.time()
loss.backward()
print(time.time() - tic)
optimizer[0].step()
```
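To double-check that the extra forward passes really are detached from autograd, here is a minimal sketch (the `nn.Linear` model is just a stand-in for one of my models) confirming that outputs produced under torch.no_grad() carry no graph:

```python
import torch
import torch.nn as nn

# Stand-in for one of the small models
model = nn.Linear(4, 4)
x = torch.randn(2, 4)

out = model(x)            # tracked by autograd
with torch.no_grad():
    tmp = model(x)        # not tracked: no grad_fn is recorded

print(out.requires_grad)  # True
print(tmp.requires_grad)  # False
print(tmp.grad_fn)        # None: nothing attached to the graph
```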

Considering that the loss is independent of tmp_output, I would expect the execution time of loss.backward() to be the same in both cases; however, the second one is significantly slower. Am I missing something here? I have also inspected the graphs generated by torchviz, and the graph of the loss is identical in both cases.
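One thing I considered ruling out is asynchronous CUDA execution: kernel launches are queued, so timing loss.backward() with time.time() alone can also absorb the wait for earlier queued work (e.g. the no_grad forward passes). A sketch of how the backward pass could be timed in isolation, assuming CUDA (the Linear model, sizes, and names here are placeholders, not my actual setup):

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Linear(64, 64).to(device)
x = torch.randn(32, 64, device=device)
target = torch.randn(32, 64, device=device)

loss = torch.nn.functional.mse_loss(model(x), target)

if device == "cuda":
    torch.cuda.synchronize()  # drain previously queued kernels first
tic = time.time()
loss.backward()
if device == "cuda":
    torch.cuda.synchronize()  # wait until backward actually finishes
print(time.time() - tic)
```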