Backprop in PyTorch?

Here is a snippet:

import torch
import torch.nn as nn
from torch.autograd import Variable

dtype = torch.FloatTensor
x = Variable(torch.randn(1, 25).type(dtype), requires_grad = True)
t = Variable(torch.randn(1, 25).type(dtype), requires_grad = False)

criterion = nn.MSELoss()
loss = criterion(x, t)
optimizer = torch.optim.Adam([x])

for i in range(5):
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

This seems simple enough that we could even compute the result by hand. However, the code doesn’t work and throws an error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-cace42d6c54e> in <module>()
     13 for i in range(5):
     14     optimizer.zero_grad()
---> 15     loss.backward()
     16     optimizer.step()

RuntimeError: Trying to backward through the graph second time, but the buffers have already been freed. Please specify retain_variables=True when calling backward for the first time.

Following that suggestion, I modified the code above as follows:

import torch
import torch.nn as nn
from torch.autograd import Variable

dtype = torch.FloatTensor
x = Variable(torch.randn(1, 25).type(dtype), requires_grad = True)
t = Variable(torch.randn(1, 25).type(dtype), requires_grad = False)

criterion = nn.MSELoss()
loss = criterion(x, t)
optimizer = torch.optim.Adam([x])

for i in range(5):
    optimizer.zero_grad()
    loss.backward(retain_variables=(i==0))
    optimizer.step()

The same error is thrown again.

However, if we write the equivalent numpy snippet:

import numpy as np

N = 5
x = np.random.randn(N)
y = np.random.randn(N)

learning_rate = 1e-2
for t in range(500):
    loss = np.square(x - y).sum()
    print(t, loss)
    # Back-propagate
    grad_y = 2.0 * (y - x)
    # Update
    y -= learning_rate * grad_y

We can see that the loss converges toward zero within the 500 iterations.

Is this a bug? Or is there anything wrong in my code?

Thanks.

In PyTorch, if you call backward on the same variable twice or more, you need to set retain_variables=True. As for your problem, I think you should recompute the loss on every iteration, since optimizer.step() updates the parameters:

import torch
import torch.nn as nn
from torch.autograd import Variable

dtype = torch.FloatTensor
x = Variable(torch.randn(1, 25).type(dtype), requires_grad = True)
t = Variable(torch.randn(1, 25).type(dtype), requires_grad = False)

criterion = nn.MSELoss()
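# The optimizer needs the list of variables it should update
optimizer = torch.optim.Adam([x])

for i in range(5):
    optimizer.zero_grad()
    # Recompute the loss so that a fresh graph is built on every iteration
    loss = criterion(x, t)
    loss.backward()
    optimizer.step()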
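
For readers on a current PyTorch release, here is a minimal sketch of the same point. It assumes the modern API, where retain_variables has been renamed retain_graph and plain tensors replace Variable; the names stale_loss and failed_at are only for illustration. It reuses one loss while retaining the graph only on the first call, then switches to recomputing the loss inside the loop.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 25, requires_grad=True)
t = torch.randn(1, 25)

# Reusing one loss: the graph is kept only by the first call (i == 0),
# so the second call frees its buffers and the third call fails.
stale_loss = ((x - t) ** 2).mean()
failed_at = None
for i in range(5):
    try:
        stale_loss.backward(retain_graph=(i == 0))
    except RuntimeError:
        failed_at = i
        break
print("backward failed at iteration", failed_at)

# Recomputing the loss builds a fresh graph on every iteration.
x.grad = None  # discard the gradients accumulated above
optimizer = torch.optim.Adam([x], lr=0.1)
losses = []
for i in range(5):
    optimizer.zero_grad()
    loss = ((x - t) ** 2).mean()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
print(losses)
```

The first loop should fail at iteration 2, which is why passing retain_variables=(i==0) in the question only postponed the error rather than removing it.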

Thanks, it seems that I forgot to paste the whole code here :sweat_smile:

Here is another case:

import torch
import torch.nn as nn
from torch.autograd import Variable

dtype = torch.FloatTensor
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        
        self.x_dim = 50
        self.hidden_size = 25
        self.w = Variable(torch.randn(self.x_dim, self.hidden_size).type(dtype), requires_grad=True)
        
        self.freeze_weights()
        
    def forward(self, x):
        return torch.mm(x, self.w)
    
    def freeze_weights(self):
        for p in self.modules():
            p.requires_grad = False

net = SimpleNet()
x = Variable(torch.randn(1, 25).type(dtype), requires_grad = True)
t = net(Variable(torch.randn(1, 50).type(dtype), requires_grad = False))

criterion = nn.MSELoss()
optimizer = torch.optim.Adam([x])

for i in range(5):
    loss = criterion(x, t)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

This code is similar to the one above, except that the target comes from SimpleNet.

But it throws an error:

AssertionError: nn criterions don't compute the gradient w.r.t. targets - please mark these variables as volatile or not requiring gradients

And here is the equivalent numpy code:

import numpy as np

M = 10
N = 5
x = np.random.randn(M)
y = np.random.randn(N)

w = np.random.randn(M, N)

learning_rate = 1e-2
for t in range(500):
    loss = np.square(x.dot(w) - y).sum()
    print(t, loss)
    # Back-propagate
    grad_y = 2.0 * (y - x.dot(w))
    # Update
    y -= learning_rate * grad_y

Is there anything wrong in my code?
Thanks.

Hi,

PyTorch is very different from TensorFlow in the sense that you don’t create the graph once and then just backprop through it.
The whole framework is built so that the graph is super cheap to build, and so it is built on the fly.
That means that for every input for which you want to compute backprop, you should run your computation on that input, and then call backward on the output.
In your case, you should have:

import torch
import torch.nn as nn
from torch.autograd import Variable

dtype = torch.FloatTensor
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        
        self.x_dim = 50
        self.hidden_size = 25
        # Parameters in modules have type Parameter
        self.w = nn.Parameter(torch.randn(self.x_dim, self.hidden_size).type(dtype))
        
        self.freeze_weights()
        
    def forward(self, x):
        return torch.mm(x, self.w)
    
    def freeze_weights(self):
        # There are no submodules to iterate through
        self.w.requires_grad = False

net = SimpleNet()
x = Variable(torch.randn(1, 25).type(dtype), requires_grad = True)

criterion = nn.MSELoss()
# The optimizer is given only x, so it will update only x, not the weights in net
# Set the learning rate to 1 so that the loss evolves quickly
optimizer = torch.optim.Adam([x], lr=1)

input = torch.randn(1, 50).type(dtype)
for i in range(50):
    # You should only package your inputs inside the loop
    input_var = Variable(input)
    # Compute the forward pass here
    t = net(input_var)
    loss = criterion(x, t)
    # loss.item() replaces loss.data[0], which no longer works on recent PyTorch releases
    print("Loss after {}\t steps is {}".format(i, loss.item()))
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

That code won’t give an error, but it’s still wrong. @albanD has explained why: you need to run your model again for every iteration.


Do we need to call zero_grad on the network too, since net isn’t wrapped by the optimizer?

optimizer = torch.optim.Adam([x], lr=1)
net.eval()
.
.
optimizer.zero_grad()
net.zero_grad()
loss.backward()
optimizer.step()

according to below link:

Note that this answer is really old. Variables don’t exist anymore; tensors now track gradients directly.
Also, if you don’t care about the gradients in the network, you indeed don’t need to zero them out.
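
As a rough sketch of how the example above might look on a current PyTorch release: it assumes tensors created with requires_grad=True in place of Variable, and loss.item() in place of loss.data[0]. The class name FrozenNet is hypothetical. It also checks that the frozen weight never receives a gradient, which is why net.zero_grad() is not needed in this setup.

```python
import torch
import torch.nn as nn


class FrozenNet(nn.Module):
    """Hypothetical stand-in for SimpleNet with a frozen weight matrix."""

    def __init__(self, x_dim=50, hidden_size=25):
        super().__init__()
        self.w = nn.Parameter(torch.randn(x_dim, hidden_size))
        # Freeze parameters, not modules: iterate over parameters().
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, x):
        return torch.mm(x, self.w)


torch.manual_seed(0)
net = FrozenNet()
x = torch.randn(1, 25, requires_grad=True)
inp = torch.randn(1, 50)

criterion = nn.MSELoss()
# Only x is handed to the optimizer, so only x is updated.
optimizer = torch.optim.Adam([x], lr=1)

losses = []
for i in range(50):
    # Run the forward pass again on every iteration.
    t = net(inp)
    loss = criterion(x, t)
    losses.append(loss.item())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(losses[0], losses[-1])
# The frozen weight does not require gradients, so nothing accumulates in it.
print(net.w.grad)
```

If the network did have trainable parameters that you were not optimizing, their gradients would accumulate across iterations, and zeroing them would only matter if you later read or used those gradients.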