Unsure how to figure out where backprop fails

So, maybe a silly question, but I’ve stumbled upon a problem where backprop seems to “fail” for part of my forward pass, i.e. some of my nn.Module’s .parameters() never get updated when I call .step()

The exact loop I have here is something like:

    def forward(self, input_arr):
        # self.encoders is an nn.ModuleList

        # Transpose input_arr so that stacked_inputs[k] collects the k-th
        # feature from every sample in the batch.
        stacked_inputs = []
        for k in range(len(input_arr[0])):
            stacked_inputs.append([])
            for i in range(len(input_arr)):
                stacked_inputs[k].append(input_arr[i][k])

        # Turn each per-feature list into a single tensor.
        for i in range(len(stacked_inputs)):
            stacked_inputs[i] = torch.stack(stacked_inputs[i])

        # I assume the actual issue starts here
        k = 0
        X = torch.zeros(len(input_arr), self.input_size)
        for i in range(len(stacked_inputs)):
            input = stacked_inputs[i]
            if i in self.encoder_indexes:
                # Pass this feature through its dedicated encoder.
                input = self.encoders[k](input)
                k += 1
            # Write the (possibly encoded) feature into its slice of X.
            X[:, self.input_indexes[i][0]:self.input_indexes[i][1]] = input

        X = X.to(self.device)

        output = self._foward_net(X)
        return output

And I’m curious where the problem could be arising here. More generally, though, I’m also curious: what would be the way to go about debugging something like this to begin with?

So in general, if we compute a loss on the output of a model, we would expect every module (and its parameters) that the input passed through to receive a gradient. If some of your parameters are not receiving a gradient (and thus not changing after a step), then they must not be affecting the output.

It’s possible that some parameters are being used but happen to sit at a point where their gradient is exactly 0. But I’m going to assume that this isn’t the case.
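
To illustrate the first point (a minimal toy example, not the model from the post): any parameter that never touches the output will have .grad equal to None after backward():

    import torch
    import torch.nn as nn

    class Toy(nn.Module):
        def __init__(self):
            super().__init__()
            self.used = nn.Linear(4, 4)
            self.unused = nn.Linear(4, 4)  # never called in forward()

        def forward(self, x):
            return self.used(x)

    model = Toy()
    loss = model(torch.randn(2, 4)).sum()
    loss.backward()

    for name, p in model.named_parameters():
        print(name, "no grad" if p.grad is None else "got grad")
    # used.weight / used.bias -> "got grad"; unused.weight / unused.bias -> "no grad"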

Is len(input) supposed to be at least k? I notice that if input isn’t long enough, some encoders in your ModuleList might never be reached. I would print out k here and make sure that all of your encoders are being touched. Otherwise it appears to me that the data is flowing correctly.
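
For example, a quick sanity check (a sketch referencing the loop above, using the k and self.encoders from your code) would be to assert after the loop that every encoder was used:

    # After the encoder loop in forward(): every encoder should have been hit.
    assert k == len(self.encoders), (
        f"only {k} of {len(self.encoders)} encoders were used this forward pass"
    )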

Something else you can do is iterate over the encoders and check that their gradients are not None and contain non-zero values. It could also be that some gradients are so small that the parameters appear not to be changing (when they actually are).
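
For example, something like this (a hedged sketch: model stands in for your module, and loss is assumed to already be computed from its output):

    loss.backward()
    for idx, encoder in enumerate(model.encoders):
        for name, p in encoder.named_parameters():
            if p.grad is None:
                print(f"encoder {idx} / {name}: no gradient at all")
            else:
                print(f"encoder {idx} / {name}: grad norm = {p.grad.norm().item():.3e}")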

I altered the function to be more readable and to support batching (hence the edits above), but the problem stays the same.

The input is always long enough, and I did check: the forward method of the encoder modules is being run.

> Something else you can do is iterate over the encoders and check that their gradients are not None and contain non-zero values. It could also be that some gradients are so small that the parameters appear not to be changing (when they actually are).

This is what I might try next, good point, I will come back with updates either way.

Thanks a lot for the help :)

Edit: I am pretty sure I found the issue, and it’s related to my using a register_forward_hook to extract the outputs from the encoder nn.Module instead of just taking a slice of the model and running backprop as usual. Will change that and see if it works. Preemptive apologies in case this was a red herring.
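
For reference, a minimal sketch of the pitfall being described (toy modules, not the code from this thread): the tensor a forward hook receives is still attached to the autograd graph, so gradients flow through it as long as it isn’t detached before being used downstream:

    import torch
    import torch.nn as nn

    encoder = nn.Linear(4, 8)   # stand-in for one encoder
    head = nn.Linear(8, 1)      # stand-in for the rest of the network
    captured = {}

    def save_output(module, inputs, output):
        # Storing `output` as-is keeps it attached to the autograd graph;
        # storing output.detach() here would cut the encoder out of backprop.
        captured["out"] = output

    handle = encoder.register_forward_hook(save_output)
    encoder(torch.randn(2, 4))              # forward pass triggers the hook
    loss = head(captured["out"]).sum()      # build the rest of the computation on the hooked output
    loss.backward()

    print(encoder.weight.grad is not None)  # True: gradients reach the encoder
    handle.remove()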

Alright, so the issue was indeed that the forward method of one of the underlying nn.Module objects was not working, my bad.

Thanks for the advice, ayalaa2

If anyone has input on the other part of the question (is there an easy way to debug this kind of thing, or at least a hard but systematic way that can point me to the exact line where the gradient flow stops?), I’d still love to hear it, as I’ll be facing these kinds of issues a lot in the future.