A question about autodiff

Are these two snippets equivalent?

Snippet 1:

a = ...
loss1 = net(a)

b = ...
loss2 = net(b)

c = loss1 + loss2
c.backward()
optimizer.step()

Snippet 2:

a = ...
loss1 = net(a)
loss1.backward()

b = ...
loss2 = net(b)
loss2.backward()

optimizer.step()

Hi,

I think there is a mistake in your first snippet; I assume you meant c = loss1 + loss2. If so, yes, both snippets behave the same. Indeed, .backward() accumulates (sums) gradients into the .grad attributes as long as you don't call optimizer.zero_grad() (or net.zero_grad()) in between.

So to be clear, in the first snippet you compute grad(loss1 + loss2),

and in the second, grad(loss1) + grad(loss2),

but the gradient is a linear operator, so the two are equal. Below is first a quick scalar sketch of this, then a fuller check that only needs import torch; for the latter I modified a PyTorch example, and both backward passes print the same gradients.
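The quick sketch (a toy example of mine with a single scalar parameter, not from the tutorial):

import torch

w = torch.tensor(2.0, requires_grad=True)

# Two "losses" that depend on the same parameter
loss1 = 3 * w
loss2 = 5 * w

# Separate backward calls accumulate into w.grad: 3 + 5 = 8
loss1.backward()
loss2.backward()
print(w.grad)  # tensor(8.)

# One backward on the sum gives the same gradient, by linearity
w.grad.zero_()
(3 * w + 5 * w).backward()
print(w.grad)  # tensor(8.)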

Cheers.

Code (adapted from autograd/two_layer_net_autograd.py in the PyTorch examples):

import torch

device = torch.device('cpu')

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.

N, D_in, H, D_out = 3, 6, 6, 2

# Create random Tensors to hold inputs and outputs

x = torch.randn(N, D_in, device=device)
a = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)
b = torch.randn(N, D_out, device=device)

# Create random Tensors for weights; setting requires_grad=True means that we
# want to compute gradients for these Tensors during the backward pass.

w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)

learning_rate = 1e-6
for t in range(3):
    # Forward pass: compute predicted outputs using operations on Tensors. Since
    # w1 and w2 have requires_grad=True, operations involving these Tensors will
    # cause PyTorch to build a computational graph, allowing automatic computation
    # of gradients. Since we are no longer implementing the backward pass by hand
    # we don't need to keep references to intermediate values.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    b_pred = a.mm(w1).clamp(min=0).mm(w2)

    # Compute and print the losses. Each loss is a Tensor of shape (), and
    # loss.item() is a Python number giving its value.
    loss1 = (y_pred - y).pow(2).sum()
    loss2 = (b_pred - b).pow(2).sum()
    print(t, loss1.item())
    print(t, loss2.item())

    # Use autograd to compute the backward pass. This call computes the gradient
    # of the loss with respect to all Tensors with requires_grad=True. After
    # these two calls, w1.grad and w2.grad hold grad(loss1) + grad(loss2),
    # because gradients accumulate. retain_graph=True keeps both graphs alive
    # so we can backpropagate through them again below.
    loss1.backward(retain_graph=True)
    loss2.backward(retain_graph=True)
    print(w1.grad)
    print(w2.grad)

    # Update weights using gradient descent. For this step we just want to mutate
    # the values of w1 and w2 in-place; we don't want to build up a computational
    # graph for the update steps, so we would wrap them in the torch.no_grad()
    # context manager. The update is left commented out here so the weights stay
    # fixed between the two backward passes we are comparing.
    # with torch.no_grad():
    #     w1 -= learning_rate * w1.grad
    #     w2 -= learning_rate * w2.grad

    # Manually zero the gradients after running the backward pass
    w1.grad.zero_()
    w2.grad.zero_()

    # Now compute grad(loss1 + loss2) with a single backward call; the printed
    # gradients match the accumulated ones printed above.
    loss = loss1 + loss2
    loss.backward()
    print(w1.grad)
    print(w2.grad)
    w1.grad.zero_()
    w2.grad.zero_()
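For completeness, here is the same equivalence written with an optimizer, as in your original snippets. Everything here (the Linear stand-in for net, the data, the squared-sum losses, the learning rate) is a placeholder of mine:

import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(4, 1)  # toy stand-in for your net
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

a = torch.randn(2, 4)
b = torch.randn(2, 4)

# Snippet 1: single backward on the summed loss
optimizer.zero_grad()
c = net(a).pow(2).sum() + net(b).pow(2).sum()
c.backward()
print(net.weight.grad)

# Snippet 2: two backward calls; the gradients accumulate in .grad
optimizer.zero_grad()
net(a).pow(2).sum().backward()
net(b).pow(2).sum().backward()
print(net.weight.grad)

# The printed gradients match, so optimizer.step() applies the same
# update in both cases.
optimizer.step()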