Loss.backward when learning both representations and prediction

Hi folks, I’m quite new to PyTorch and am trying to learn a model on top of graph representations. I have a model that takes in my data, uses a graph neural network to learn node representations, then concatenates the representations of node pairs and passes them through a linear layer. However, it looks like when I call loss.backward(), the layers aren’t being updated at all. Am I doing anything wrong here? Here’s a short snippet from my model’s forward function for clarification.

        x = self.gcn1(graph_feats, edge_index=edge_index, edge_weight=edge_weight)
        x = F.relu(x)
        x = self.gcn2(x, edge_index=edge_index, edge_weight=edge_weight)
        x = F.relu(x)
        x = self.gcn3(x, edge_index=edge_index, edge_weight=edge_weight)
        
        # Take the d and p vectors, concatenate them, and put the classifier on top
        stop_num = edge_index.shape[1] // 2
        
        node_idx = edge_index.T[0:stop_num]
        
        d_rep = x[node_idx[:,0]]
        p_rep = x[node_idx[:,1]]
        edge_weight_half = edge_weight[0:stop_num].type(torch.FloatTensor).view(-1,1).cuda()
        
        d_p_pair = torch.cat((d_rep, p_rep, edge_weight_half), dim=1)
        
        x = self.lin(d_p_pair)
        y = self.act_lin(x)

Hi,

This looks OK; you don’t use any non-differentiable operations.

You want to make sure that you give all the right parameters to your optimizer.
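For example, if the GCN layers and the final linear layer live in the same nn.Module, building the optimizer from the whole model covers all of them. A rough sketch (the names here are assumptions, not your code):

    import torch

    # "model" is assumed to be the module whose forward() you posted
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # quick sanity check: the optimizer should know about every trainable parameter
    n_model = sum(p.numel() for p in model.parameters() if p.requires_grad)
    n_optim = sum(p.numel() for g in optimizer.param_groups for p in g["params"])
    print(n_model, n_optim)  # the two counts should match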

Hey, thanks for responding! I thought so too, but it seems like the gradients aren’t flowing backwards. When I check model.parameters(), I see all the layers that should be present. Is there a way to retrieve the weight matrices from each layer and observe them as they are trained?

It also looks like subsetting the features in x is what causes the problem: if I don’t do that, the gradients appear to flow. I have no idea why this step would break anything, though.

        # Take the d and p vectors, concatenate them, and put the classifier on top
        stop_num = edge_index.shape[1] // 2
        
        node_idx = edge_index.T[0:stop_num]
        
        d_rep = x[node_idx[:,0]]
        p_rep = x[node_idx[:,1]]
        edge_weight_half = edge_weight[0:stop_num].type(torch.FloatTensor).view(-1,1).cuda()
        
        d_p_pair = torch.cat((d_rep, p_rep, edge_weight_half), dim=1)
        

No, that should not be a problem, as long as node_idx contains some indices.
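You can convince yourself with a tiny standalone example (just a sketch): indexing a tensor with integer indices is differentiable, and the gradient simply lands on the selected rows.

    import torch

    x = torch.randn(5, 3, requires_grad=True)
    idx = torch.tensor([0, 2, 4])

    out = x[idx].sum()   # same kind of indexing as x[node_idx[:, 0]]
    out.backward()

    print(x.grad)        # rows 0, 2 and 4 get gradient 1, the other rows get 0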

“If I don’t do that, the gradients appear to be able to flow.”

How do you measure that?
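One way to check directly is to look at the .grad fields and at the weights themselves around a single optimizer step; roughly something like this (a sketch, where model, criterion, batch and labels are made-up stand-ins for your own objects):

    import torch

    # snapshot the parameters before the step
    before = {name: p.detach().clone() for name, p in model.named_parameters()}

    optimizer.zero_grad()
    loss = criterion(model(batch), labels)   # stand-in forward pass and loss
    loss.backward()

    # did a gradient arrive at every layer?
    for name, p in model.named_parameters():
        print(name, None if p.grad is None else p.grad.norm().item())

    optimizer.step()

    # did the weights actually move?
    for name, p in model.named_parameters():
        print(name, "changed:", not torch.equal(before[name], p.detach()))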

Yeah, they are. What’s weird is that if I replace it with the snippet below, I still get the issue of gradients not flowing.

I measure it by comparing against labels with BCELoss. I’ve also tried using all 1s as the labels, but I still face this issue.

    d_rep = x[0:100, :]
    p_rep = x[300:400, :]

I think I found the issue. After tinkering around, I realized the nn.Sigmoid layer, combined with BCELoss, was the problem. If I swap to BCEWithLogitsLoss, the loss behaves and the weight matrices actually change now! I’m not really sure why BCELoss causes this issue, though, and whether it’s the numerical instability that the PyTorch docs mention. Nevertheless, I think that solves the mystery? Any idea how to tell whether it’s due to numerical instability, or when we need to use BCEWithLogitsLoss over BCELoss?
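For reference, the swap amounts to something like this (a sketch; the shapes and names are just illustrative):

    import torch
    from torch import nn

    logits = torch.randn(8, 1)   # stand-in for self.lin(d_p_pair)
    labels = torch.ones(8, 1)    # stand-in for my edge labels

    # before: Sigmoid inside the model, BCELoss outside
    loss_old = nn.BCELoss()(nn.Sigmoid()(logits), labels)

    # now: raw logits straight into BCEWithLogitsLoss (no Sigmoid in forward)
    loss_new = nn.BCEWithLogitsLoss()(logits, labels)

    print(loss_old.item(), loss_new.item())   # numerically the same for moderate logits
    probs = torch.sigmoid(logits)             # sigmoid only when I need probabilities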

As mentioned in the docs, the difference between these two losses is that BCEWithLogitsLoss applies the sigmoid (and then the log) to its input internally, while BCELoss expects inputs that are already probabilities in [0, 1].
So it will lead to surprising values if you give BCELoss inputs it does not expect (the values will be way too large or too small). And it can put you in a region of the loss function where the gradient is 0, for example when the sigmoid in front of it has saturated.
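You can see that effect in isolation with a small example (a sketch): with a large logit, the sigmoid output rounds to exactly 1.0 in float32, so the gradient that flows back through Sigmoid + BCELoss degenerates, while BCEWithLogitsLoss (which fuses the sigmoid and the loss in a numerically stable way) still gives a useful gradient.

    import torch
    from torch import nn

    target = torch.zeros(1)   # the prediction below is confidently *wrong*

    # Sigmoid followed by BCELoss: sigmoid(30) is exactly 1.0 in float32
    z1 = torch.tensor([30.0], requires_grad=True)
    nn.BCELoss()(torch.sigmoid(z1), target).backward()

    # BCEWithLogitsLoss on the raw logit
    z2 = torch.tensor([30.0], requires_grad=True)
    nn.BCEWithLogitsLoss()(z2, target).backward()

    # expect a useless gradient for z1 (0 or non-finite), and ~1.0 for z2
    print(z1.grad, z2.grad)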