Basics of Loss.backward()

What is the difference between

loss.backward()

and

loss2 = Variable(loss.data)
loss2.backward()

where loss is generated like this

fwdPass = net.forward(myData) # net is VGG16, deep convolutional neural network in PyTorch
loss = F.binary_cross_entropy(fwdPass, target)

Motivation for such a simple question: using Variable(loss.data) is the only way I have found to combine two loss functions on my GPU with 12 GB of RAM without running out of memory. In my experience, when I want to combine multiple loss functions (e.g. fwdPass is a 7-channel image, and I want BCE loss on channel 0 and L2 loss on channels 1-6, then backprop the combined loss), naively summing them makes my 12 GB GPU run out of memory.

Basically

BCE_loss0 = F.binary_cross_entropy(fwdPass[:,0,:], target[:,0,:]) # axis 0 is minibatch dimension
L2_loss1 = torch.norm(fwdPass[:,1,:] - target[:,1,:])
L2_loss2 = torch.norm(fwdPass[:,2,:] - target[:,2,:])
... # code omitted for brevity; you get the idea though
L2_loss6 = torch.norm(fwdPass[:,6,:] - target[:,6,:])

comboLoss = BCE_loss0 + L2_loss1 + L2_loss2 + ... + L2_loss6
comboLoss.backward() # GPU runs out of RAM

comboLoss = Variable(BCE_loss0.data + L2_loss1.data + ... + L2_loss6.data) # no memory problem
comboLoss.backward()

This won’t actually compute any gradients. Variable(loss.data) detaches loss from the computation graph: loss2 is a brand-new node with no connection to the rest of the graph, so there is nothing to backpropagate through.
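
Here is a minimal sketch of what I mean, using a toy nn.Linear model instead of your VGG16 and loss.detach() (which behaves like Variable(loss.data) here, in recent PyTorch): the detached copy has no grad_fn, so nothing can flow back to the network parameters.

import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Linear(4, 1)                 # toy stand-in for the real network
x = torch.randn(8, 4)
target = torch.rand(8, 1)

loss = F.binary_cross_entropy(torch.sigmoid(net(x)), target)
detached = loss.detach()              # analogous to Variable(loss.data)

print(loss.grad_fn)                   # a BinaryCrossEntropyBackward node: connected to the graph
print(detached.grad_fn)               # None: the copy carries no history

try:
    detached.backward()               # nothing to differentiate; recent PyTorch raises RuntimeError
except RuntimeError as err:
    print(err)

print(net.weight.grad)                # None: no gradient ever reached the parameters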


Thanks Richard. So what does Variable(loss.data).backward() do then? It seems to do something for my network, in that the calculated loss at the end is, in fact, decreasing. That’s the part I really can’t figure out.

Variable(loss.data).backward() should do nothing. Something else must be causing gradients to be computed, which, together with an optimizer step, would make your loss decrease.

Thanks so much! I found the bug in the code that was giving funny gradients.

Sorry, one final question about how the computation graphs and gradient updates are implemented under the hood. Suppose I have several losses as above, and I call .backward() on each loss like this:

# axis 0 is minibatch dimension, axis 1 is channel
BCE_loss0 = F.binary_cross_entropy(fwdPass[:,0,:], target[:,0,:])

L2_loss1 = torch.norm(fwdPass[:,1,:] - target[:,1,:])
L2_loss2 = torch.norm(fwdPass[:,2,:] - target[:,2,:])

# optimG is an Adam optimizer, defined elsewhere in the code
optimG.zero_grad()

BCE_loss0.backward()
L2_loss1.backward()
L2_loss2.backward()
optimG.step()

Will this cause the backpropagated gradients to be summed over each of the loss functions? In other words, is this effectively equivalent to

totalLoss = BCE_loss0 + L2_loss1 + L2_loss2
totalLoss.backward()
optimG.step()
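
Here is the toy sketch I would use to check this empirically (nn.Linear standing in for the real network; retain_graph=True is my addition, needed because both losses share one forward graph). Since autograd accumulates gradients into each parameter's .grad, the per-loss backward calls should end up with the same gradients as a single backward on the summed loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
net = nn.Linear(4, 2)                               # toy stand-in for the real network
x = torch.randn(8, 4)
target = torch.rand(8, 2)

# separate backward calls: gradients accumulate in .grad
out = net(x)
loss_a = F.binary_cross_entropy(torch.sigmoid(out[:, 0]), target[:, 0])
loss_b = torch.norm(out[:, 1] - target[:, 1])
net.zero_grad()
loss_a.backward(retain_graph=True)                  # retain_graph: loss_b shares the same graph
loss_b.backward()
grads_separate = [p.grad.clone() for p in net.parameters()]

# single backward on the summed loss
out = net(x)
loss_a = F.binary_cross_entropy(torch.sigmoid(out[:, 0]), target[:, 0])
loss_b = torch.norm(out[:, 1] - target[:, 1])
net.zero_grad()
(loss_a + loss_b).backward()
grads_summed = [p.grad.clone() for p in net.parameters()]

print(all(torch.allclose(a, b) for a, b in zip(grads_separate, grads_summed)))   # True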