Are they equivalent?

```
a = ...
loss1 = net(a)
b = ...
loss2 = net(b)
c = loss1+loss2
c.backward()
optimizer.step()
```

```
a = ...
loss1 = net(a)
loss1.backward()
b = ...
loss2 = net(b)
loss2.backward()
optimizer.step()
```

Hi,

I think you made a mistake in your first snippet; I assume you meant `c = loss1 + loss2`. If so, the two versions have the same behaviour. Indeed, `.backward()` accumulates (sums) gradients as long as you don't call `optimizer.zero_grad()` (or `net.zero_grad()`) in between.

So to be clear, in the first snippet you compute: grad(loss1 + loss2)

and in the second: grad(loss1) + grad(loss2)

but the gradient is a linear operator, so the two are equal. You can check this with the code below (it only needs `import torch`); I modified a PyTorch example. Both backward passes print the same outputs.
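Here is a minimal sketch of that linearity, using a single made-up scalar parameter `w` (the variable names are just for illustration): summing the losses and calling `.backward()` once leaves the same gradient in `w.grad` as calling `.backward()` on each loss separately and letting the gradients accumulate.

```
import torch

# A single scalar parameter for the demonstration.
w = torch.tensor(2.0, requires_grad=True)

# Option 1: sum the losses, then backward once.
loss1 = 3 * w
loss2 = 5 * w
(loss1 + loss2).backward()
grad_summed = w.grad.clone()

# Option 2: backward each loss separately; gradients accumulate in w.grad
# because we do not zero it between the two calls.
w.grad.zero_()
loss1 = 3 * w
loss2 = 5 * w
loss1.backward()
loss2.backward()
grad_separate = w.grad.clone()

print(grad_summed.item(), grad_separate.item())  # both 8.0
```

The equivalence only holds because nothing zeroes `w.grad` between the two `.backward()` calls in option 2; an `optimizer.zero_grad()` there would break it.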

Cheers.

```
# Code in file autograd/two_layer_net_autograd.py
import torch

device = torch.device('cpu')

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 3, 6, 6, 2

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in, device=device)
a = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)
b = torch.randn(N, D_out, device=device)

# Create random Tensors for weights; setting requires_grad=True means that we
# want to compute gradients for these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)

learning_rate = 1e-6

for t in range(3):
    # Forward pass: compute predicted y using operations on Tensors. Since w1
    # and w2 have requires_grad=True, operations involving these Tensors will
    # cause PyTorch to build a computational graph, allowing automatic
    # computation of gradients. Since we are no longer implementing the
    # backward pass by hand we don't need to keep references to intermediate
    # values.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    b_pred = a.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss. Loss is a Tensor of shape (), and loss.item()
    # is a Python number giving its value.
    loss1 = (y_pred - y).pow(2).sum()
    loss2 = (b_pred - b).pow(2).sum()
    print(t, loss1.item())
    print(t, loss2.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively. retain_graph=True
    # keeps the graph alive so we can backward through it again below.
    loss1.backward(retain_graph=True)
    loss2.backward(retain_graph=True)
    print(w1.grad)
    print(w2.grad)

    # Update weights using gradient descent. For this step we just want to
    # mutate the values of w1 and w2 in-place; we don't want to build up a
    # computational graph for the update steps, so we would use the
    # torch.no_grad() context manager to prevent PyTorch from building a
    # graph for the updates (disabled here for the comparison).
    #with torch.no_grad():
    #    w1 -= learning_rate * w1.grad
    #    w2 -= learning_rate * w2.grad

    # Manually zero the gradients after running the backward pass
    w1.grad.zero_()
    w2.grad.zero_()

    # Now compute the same gradients via the summed loss.
    loss = loss1 + loss2
    loss.backward()
    print(w1.grad)
    print(w2.grad)
    w1.grad.zero_()
    w2.grad.zero_()
```