None gradients with nn.Parameter

Hi,
I have the following component that would need to do some operations:

  • Store some tensors (var1)
  • Store some tensors that can be updated with autograd (var2)
  • Store something that keeps track of which tensors have been added (var3)
  • Count how many times every var2 was used (var4)

The forward pass then computes similarities (according to some metric) between the input and var1, and returns the corresponding top k var2. I then do some operations on this result.

When I check with the code below I have two problems:

  1. I get the warning from checkpoint:

UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")

  2. myThing.var2.grad is None (both before and after loss.backward())

class MyThing(nn.Module):
    def __init__(self):
        super(MyThing, self).__init__()

        self.var1 = torch.zeros(0, 10, requires_grad=False)
        self.var2 = nn.Parameter(torch.rand(0, 5, requires_grad=True))
        self.var3 = defaultdict(bool)
        self.var4 = torch.zeros(0, 1, requires_grad=False)

    def add(self, elements, sorter):
        with torch.no_grad(): # I don't want to store gradients for adding to variables
            c = 0
            highest_sorted = sorter.argsort(dim=0, descending=True)
            elements = elements[highest_sorted]

            for element in elements:
                if not self.var3[element]:
                    self.var1 = torch.cat((self.var1, element.unsqueeze(0)), dim=0)
                    
                    to_add = torch.rand(1, 5, requires_grad=True)
                    self.var2 = nn.Parameter(torch.cat((self.var2, to_add), dim=0))
                    
                    self.var4 = torch.cat((self.var4, torch.zeros(1,1)), dim=0)
                    c += 1
                self.var3[element] = True
                
                if c >= SOME_MAXIMUM_VALUE:
                    break

    def forward(self, x):
        a = FC_LAYER_1(x)
        b = FC_LAYER_2(self.var1)
        
        sims = torch.matmul(a, b.t())

        idxs = sims.sort(dim=1, descending=True).indices
        k_highest_sims = smart_sort(sims, idxs)[:,:K]
        c = self.var2[idxs[:,:K]]

        self.var4[idxs[:,:K]] += 1
        return k_highest_sims, c

Code used for the forward pass outside the component:

x = SOME_TENSORS  # these have requires_grad set to True above
y = SOME_OTHER_TENSOR  # these have requires_grad set to True above
myThing = MyThing()

sims, outs = checkpoint(myThing, x) # need it for memory reasons. Warning here

z = FC_LAYER_3(sims)

result1 = FC_LAYER_4(y)
softmaxed_sims = F.softmax(sims, dim=1)
result2 = FC_LAYER_5(outs)

final_result = (result1 * (1-z) + (result2 * softmaxed_sims).sum(dim=1) * z)

The main problem is that the values in var2 always stay the same (confirming the None gradients).
Am I doing something wrong?

The warning seems to indicate the issue: if nothing requires gradients, then nothing will be computed.
You need to find where the Tensors stopped requiring gradients. This will point to the op that is non-differentiable.
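
To make that concrete, here is a minimal sketch of what the checkpoint warning implies (made-up layer and tensor names, using the reentrant checkpoint that emits this warning):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

lin = nn.Linear(10, 10)
x = torch.rand(2, 10)            # requires_grad is False

out = checkpoint(lin, x)         # emits the warning above
print(out.requires_grad)         # False: backward will never reach lin's parameters

x.requires_grad_()
out = checkpoint(lin, x)         # no warning
out.sum().backward()
print(lin.weight.grad is None)   # False: the parameter now gets a gradient

So with this (reentrant) checkpoint, gradients only flow back into the checkpointed module if at least one of its tensor inputs requires grad.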

I would agree, but the problem is that I do x.requires_grad = True shortly before, with no operations in between.

Can you share that code? I don’t see any call to backward/grad here.

Also, you should only set that flag for Tensors for which you want gradients to be computed. If you have to set it on an intermediary result, the backprop will stop there and not go any further.
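
A tiny self-contained illustration of that last point (toy tensors, not the code from the post):

import torch

a = torch.rand(3, requires_grad=True)
with torch.no_grad():
    b = a * 2            # no history is recorded for this op
b.requires_grad_()       # b becomes a new autograd leaf

(b * 3).sum().backward()
print(b.grad)            # tensor([3., 3., 3.]): gradients reach b
print(a.grad)            # None: backprop stops at b and never reaches a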

Sure, the only thing that happens before the given code is

if flag:
    with torch.no_grad():
        outputs = main_model(tensor)
else:
    outputs = main_model(tensor)

SOME_TENSORS = outputs
if flag:
    SOME_TENSORS.requires_grad_() # also tried SOME_TENSORS.requires_grad = True

and after the given code:

# checking here gives None gradients
loss = metric(final_result, labels)
loss.backward()
# checking here gives None gradients
optimiser.step()

Even if x was set without requires_grad, shouldn’t var2 get updated? If not, could you explain why not?

If flag is set, you run the model in no_grad mode. This means that you won’t be able to compute any gradient for it.
As I mentioned above, here you set the requires_grad flag on an intermediary Tensor. But that won’t make you able to backprop before it.
If you set flag to False, it should work fine, no?
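
To illustrate the two previous points with made-up layers (frozen_part plays the role of main_model, head the role of the later FC layers):

import torch
import torch.nn as nn

frozen_part = nn.Linear(10, 10)
head = nn.Linear(10, 1)
inp = torch.rand(2, 10)

with torch.no_grad():
    feats = frozen_part(inp)       # no graph is recorded for frozen_part
feats.requires_grad_()             # feats becomes a new autograd leaf

head(feats).sum().backward()
print(head.weight.grad is None)    # False: layers after feats still get gradients
print(frozen_part.weight.grad)     # None: cannot backprop past feats

So layers that come after the intermediary Tensor can still be trained; only the part that ran under no_grad is cut off from the backward pass.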

At some point in the training I want to stop training the main model but only train the other FC layers. Is what I’m doing not the right way to do it? Shouldn’t var2 be updated even when the flag is set to True?

Oh, the model above is not the same as the one defined in the nn.Module MyThing above. Ok.

Do the outputs of the checkpoint require gradients properly?

main_model and myThing are two separate things, that’s why I’m confused.
I will check the requires_grad of the output of the checkpoint too.

Hi Alban, sorry to bother you again. I have checked the requires_grad of all the variables involved and they are all fine except c, although all the subsequent variables that make use of c have requires_grad set to True.

Do you know why?

If self.var2.requires_grad is True, then self.var2[idxs[:,:K]].requires_grad should also be True.
The only reason why it would be false is if you run in no_grad() mode.
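
A quick way to check that claim in isolation (toy shapes):

import torch
import torch.nn as nn

p = nn.Parameter(torch.rand(4, 5))
idxs = torch.tensor([[0, 2]])

print(p[idxs].requires_grad)       # True: indexing keeps the graph
with torch.no_grad():
    print(p[idxs].requires_grad)   # False: no_grad drops it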