Are they equivalent?

# Version 1: backward on the summed loss
a = ...
loss1 = net(a)
b = ...
loss2 = net(b)
c = loss1 + loss2
c.backward()
optimizer.step()

# Version 2: backward on each loss separately
a = ...
loss1 = net(a)
loss1.backward()
b = ...
loss2 = net(b)
loss2.backward()
optimizer.step()
Hi,
I think you made a mistake in your first part; I assume you wanted c = loss1 + loss2. If so, the behaviour is the same: .backward() accumulates (sums) gradients into .grad as long as you don't call optimizer.zero_grad() (or net.zero_grad()) in between.
So to be clear, in the first snippet you compute grad(loss1 + loss2),
and in the second grad(loss1) + grad(loss2),
but the gradient is a linear operator, so the two are equal. You can check with the code below, which needs only import torch; I modified a PyTorch example. Both backward passes print the same gradients.
Cheers.
Code in file autograd/two_layer_net_autograd.py

import torch

device = torch.device('cpu')

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 3, 6, 6, 2

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in, device=device)
a = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)
b = torch.randn(N, D_out, device=device)

# Create random Tensors for weights; setting requires_grad=True means that we
# want to compute gradients for these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)

learning_rate = 1e-6
for t in range(3):
    # Forward pass: compute predicted y using operations on Tensors. Since w1 and
    # w2 have requires_grad=True, operations involving these Tensors will cause
    # PyTorch to build a computational graph, allowing automatic computation of
    # gradients. Since we are no longer implementing the backward pass by hand we
    # don't need to keep references to intermediate values.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    b_pred = a.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss. Loss is a Tensor of shape (), and loss.item()
    # is a Python number giving its value.
    loss1 = (y_pred - y).pow(2).sum()
    loss2 = (b_pred - b).pow(2).sum()
    print(t, loss1.item())
    print(t, loss2.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively. retain_graph=True keeps
    # the graph alive so we can backprop through it again below.
    loss1.backward(retain_graph=True)
    loss2.backward(retain_graph=True)
    print(w1.grad)
    print(w2.grad)

    # Update weights using gradient descent. For this step we just want to mutate
    # the values of w1 and w2 in-place; we don't want to build up a computational
    # graph for the update steps, so we would use the torch.no_grad() context manager
    # to prevent PyTorch from building a computational graph for the updates.
    # (Disabled here, since we only want to compare gradients.)
    # with torch.no_grad():
    #     w1 -= learning_rate * w1.grad
    #     w2 -= learning_rate * w2.grad

    # Manually zero the gradients, then redo the backward pass on the summed loss
    w1.grad.zero_()
    w2.grad.zero_()

    loss = loss1 + loss2
    loss.backward()
    print(w1.grad)
    print(w2.grad)

    w1.grad.zero_()
    w2.grad.zero_()
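The same linearity argument can be checked with an even smaller sketch. This is a hypothetical minimal example (the single-weight "net" here is illustrative, not the original poster's model): it computes the gradients once by calling .backward() on each loss and letting them accumulate, and once by calling .backward() on the summed loss, then compares the two.

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)  # stand-in for the network's parameters
a = torch.randn(3)
b = torch.randn(3)

# Version 2: backward on each loss separately; gradients accumulate in w.grad
loss1 = (w * a).sum()
loss2 = (w * b).sum()
loss1.backward()
loss2.backward()
grad_separate = w.grad.clone()

# Version 1: zero the gradient, then backward on the summed loss
w.grad.zero_()
loss1 = (w * a).sum()
loss2 = (w * b).sum()
(loss1 + loss2).backward()
grad_summed = w.grad.clone()

print(torch.allclose(grad_separate, grad_summed))  # True
```

Because .backward() adds into .grad, the separate calls end up computing grad(loss1) + grad(loss2), which matches grad(loss1 + loss2) by linearity, so optimizer.step() sees identical gradients either way.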