Model.zero_grad only sets the grad of the parameters to 0.

Do we need to zero the grads of the other Variables declared with requires_grad=True inside the Module as well?


You can call Variable.grad.data.zero_().
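For example (a minimal sketch, assuming v is an extra Variable you created yourself with requires_grad=True):

    import torch
    from torch.autograd import Variable

    v = Variable(torch.ones(3), requires_grad=True)
    (v * 2).sum().backward()  # v.grad now holds 2s
    v.grad.data.zero_()       # reset the accumulated gradient in place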


It is expected that it only affects parameters - things you optimize are considered model parameters. Not sure what your exact use case is, but as @smth pointed out, you can just iterate over the other Variables and zero their gradients yourself.

Thanks for the explanations, but if I don’t fill other Variables’ .grad.data with 0, will the grads of parameters that depend on those Variables be wrongly estimated? Keeping track of all the Variables and setting their grads to 0 properly seems quite error-prone.

No, why would it matter? The gradient of the parameters is not a function of the gradient w.r.t. some other Variable.

Do you mean that, whatever the module is, as long as it inherits from the base nn.Module class, simply calling the default model.zero_grad() is enough?

I mean that the error can propagate from other Variables to the parameters: if some Variable’s grad is unintentionally accumulated from the last run, the gradients of the parameters may also be wrong.

I really can’t help you much without knowing what “other Variables” you are referring to. Are you optimizing them too? If not, just calling zero_grad() should be enough. There’s no way the content of .grad of one Variable can affect what gets accumulated into another Variable’s .grad. That’s not how derivatives work (of course I’m talking in a context where we don’t have multi backward).
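To illustrate (a minimal sketch of my own, not from the thread): corrupting one leaf’s .grad buffer only changes that leaf’s accumulated value; what gets written into another leaf’s .grad stays correct:

    import torch
    from torch.autograd import Variable

    a = Variable(torch.ones(2), requires_grad=True)
    b = Variable(torch.ones(2), requires_grad=True)
    (a * b).sum().backward()
    a.grad.data.fill_(100)    # deliberately corrupt a's grad buffer
    (a * b).sum().backward()  # build and backprop the graph again
    print(b.grad)             # 2s: two correct backwards accumulated, unaffected by a.grad
    print(a.grad)             # 101s: only a's own buffer carries the garbage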

Thanks for your reply.
For example, in the following case, do I need to add self.out.grad.data.fill_(0) to the model.zero_grad() function?

    def __init__(self):
        # timestep and batchsize are assumed to be defined at module scope
        self.out = torch.autograd.Variable(torch.zeros(timestep, batchsize,
                                                       self.W_decode.size()[1]), requires_grad=True)

    def forward(self, input, state=None):
        # input is timestep * N
        batchsize, timestep = input.size()[1], input.size()[0]
        vec_input = input.view(-1)
        emb = torch.index_select(self.W_emb, 0, vec_input).view(timestep, batchsize, -1)  # emb = N * ninp
        inp = torch.matmul(emb, self.W_rnn)
        state = torch.autograd.Variable(torch.zeros(inp.size()[1:])) if state is None else state
        for step in range(inp.size()[0]):
            this_input = inp[step]  # N * nhid
            this_input = torch.addmm(this_input, state, self.U_rnn)
            state = F.tanh(this_input + self.b_rnn.expand_as(this_input))
            self.out[step] = torch.addmm(self.out[step], state, self.W_decode)
            self.out[step] = F.softmax(self.out[step] + self.b_decode.expand_as(self.out[step]))
        return self.out

Oh, with this approach you’re likely to end up with an expanding history, because self.out will contain a pointer to part of the graph from each iteration. As far as I can see, you’re overwriting it completely at every forward, so you should recreate it every time, like this:

def __init__(self):
    self.out = get_new_out()

def forward(self, input):
    ...
    next_out = get_new_out()
    for step in range(inp.size(0)):
        next_out[step] = ... # an expression containing self.out
    self.out = next_out
    return next_out

You don’t need to worry about zeroing any gradients then. Also, keep in mind that this self.out doesn’t require grad, so its gradient will always be 0.
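(get_new_out() above is a placeholder. A minimal version, assuming the shapes from your __init__ and passing them in explicitly, might look like this; note there is no requires_grad, since the buffer is only written into, never optimized:)

    def get_new_out(timestep, batchsize, ndecode):
        # fresh, history-free output buffer on every call;
        # ndecode would be self.W_decode.size()[1] in your snippet
        return torch.autograd.Variable(torch.zeros(timestep, batchsize, ndecode))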


Thanks for the explanation.
In the first approach I don’t need to allocate memory every time, so will it be a little bit faster? And it was written that way to check whether Model.zero_grad is always safe.

And later self.out will be used to compute the loss, and I need to backpropagate from the loss to the model’s parameters through self.out. In that case, doesn’t self.out need a meaningful gradient for the chain rule to work?

No, you’re not going to feel any difference in speed. CPU allocations are fast, and we have a custom CUDA allocator that caches memory, so it’s also very fast. I don’t know how you define the safety of Module.zero_grad - it does what it’s meant to do, i.e. zeroes the grads of the parameters.

No, self.out.grad will never be used in the process of computing gradients w.r.t. the parameters. It’s just a buffer where the gradient gets accumulated; it is not used in any way during backpropagation.

Also, as I said, if you don’t reallocate the output, your graphs will never be freed, and that will blow up memory. Don’t cache intermediate things for too long; the PyTorch philosophy is quite different from Lua Torch’s.
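If you do need to carry a value across iterations (e.g. a recurrent state), the usual pattern is to repackage it so the old graph can be freed; a sketch, assuming state is the hidden state left over from the previous forward:

    # keep the values but drop the autograd history, so backward
    # never reaches into previous iterations
    state = torch.autograd.Variable(state.data)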

Thanks for your explanation. I think I see what you mean.
Sorry, I am quite new to torch. In Theano or TensorFlow, I can get the gradient of any node in the computational graph. But it seems that PyTorch only saves the gradients of the leaf nodes?

import torch
from torch.autograd import Variable

x = Variable(torch.ones(2, 2), requires_grad=True)
y = x + 2
z = y * y * 3
out = z.mean()
out.backward(retain_variables=True)
print(x.grad)

I get

Variable containing:
 4.5000  4.5000
 4.5000  4.5000
[torch.FloatTensor of size 2x2]

But print(y.grad) gives me

Variable containing:
 0  0
 0  0
[torch.FloatTensor of size 2x2]

How do I get the gradient of the inner node (y)?

@ypxie autograd by default frees the intermediate gradients that are not needed anymore, so that memory usage stays minimal.
If you want to inspect intermediate gradients, you can use hooks, as explained in this post.
But if you don’t want them to be freed, you can pass retain_variables=True to backward, as explained in the docs.

Thanks for your reply.
But even if I pass retain_variables=True to backward, y.grad is still all zeros.
Hooks look interesting, but is there any way I can return the gradient rather than modify or print it?

You can return the gradient into a separate variable using a closure. Look at this post for sample code: Why cant I see .grad of an intermediate variable?
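Something along these lines (a sketch of the closure pattern, reusing x, y and out from the example above):

    grads = {}

    def save_grad(name):
        def hook(grad):
            grads[name] = grad  # stash the gradient instead of printing it
        return hook

    y.register_hook(save_grad('y'))
    out.backward()
    print(grads['y'])  # d(out)/dy, available after backward has run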

Many thanks, very helpful!

retain_variables will only prevent autograd from freeing some buffers needed for backward (e.g. when you want to backprop multiple times through a graph). Use hooks to access intermediate gradients.
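For example, with the snippet from above (a minimal sketch): retain_variables lets you reuse the graph, but y.grad stays untouched either way:

    out.backward(retain_variables=True)  # buffers kept, so the graph can be reused
    out.backward()                       # second pass works; x.grad accumulates
    # y.grad is still not populated - use y.register_hook(...) for that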
