The gradients of the weights of my model approach zero

Hi, I am new to PyTorch. The gradients of my NN's weights become close to zero right from the start, and my loss function does not change much. Any help or suggestions will be much appreciated.

Problem description: My dataset contains two variables, namely ‘posterior’ and ‘theta’. The variable ‘posterior’ is the input of the NN and is basically a probability distribution. For example, for a particular observation the initial ‘posterior’ is a uniform distribution over 1801 grid points. The final output should also look like a probability distribution, where 9 of the grid points are highly probable (close to 1) and the remaining ones are close to zero.

The NN takes the posterior as input and gives a matrix A at the output. Then, based on the input ‘theta’, a variable X is computed, followed by another variable y = AX; after some further calculation, an updated posterior is returned. The posterior is updated over N samples of X. BCELoss is used to compute the loss between the updated posterior and the label.

    
```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

import ap  # my own helper module containing arr_received_tensor and op_angle_tensor


def createModel():

    class model(nn.Module):
        def __init__(self):
            super().__init__()

            self.input = nn.Linear(1801, 2048)
            self.fc1 = nn.Linear(2048, 2048)
            self.fc2 = nn.Linear(2048, 2048)
            self.out = nn.Linear(2048, 10*100)

        def forward(self, x):
            x = F.relu(self.input(x))
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.out(x)
            return x

    net = model()
    # lossfun = nn.BCEWithLogitsLoss()
    lossfun = nn.BCELoss()
    optimizer = torch.optim.Adam(net.parameters(), lr=0.001)

    return net, lossfun, optimizer


def trainModel(train_loader, numepochs, samples, device):

    net, lossfun, optimizer = createModel()
    net.to(device)
    losses = np.zeros(numepochs)

    for epochi in range(numepochs):

        batchLoss = []

        for theta, posterior, label in train_loader:
            theta = theta.to(device)
            posterior = posterior.to(device)
            label = label.to(device)

            for samp in range(samples):  # Iterate through the samples of X

                # Compute the A matrix as the NN output and do some postprocessing
                A_tmp = net(posterior)
                A_tmp = A_tmp.cpu()
                A_shaped = A_tmp.view(-1, 10, 100)

                A_r = A_shaped[:, :, :50]
                A_i = A_shaped[:, :, 50:]
                A = A_r + 1j*A_i

                # Initialize a sample for X
                X = torch.zeros((theta.shape[0], 50, 1), dtype=torch.cfloat)

                # Compute X from theta for every entry in the batch
                for i in range(theta.shape[0]):
                    theta_loop = theta[i, :]
                    received_loop = ap.arr_received_tensor(50, theta_loop, 1, 20)
                    X[i, :, :] = received_loop  # Array received signal

                # Compute y
                y = A @ X

                posterior = torch.zeros((theta.shape[0], 1801))

                # Update the posterior
                for i in range(y.shape[0]):
                    y_loop = y[i, :, :]  # Grab the i-th y from a batch of 32
                    Ry_loop = y_loop @ y_loop.conj().T
                    posterior[i, :] = ap.op_angle_tensor(-90, 90, .1, Ry_loop, 9)[2]

                print(f'Sample No: {samp}\n\n')

            loss = lossfun(posterior, label)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            batchLoss.append(loss.item())

        losses[epochi] = np.mean(batchLoss)
        print(f'Loss in epoch {epochi} is {losses[epochi]}')

    return losses, net
```

Could you try using a learning rate scheduler and see if it helps?

Also, please try to post the code by enclosing it between 3 backticks ```.
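
For the scheduler, one way to wire it into `trainModel` would be something like this. This is a rough sketch only; `ReduceLROnPlateau` is used purely for illustration, and the names (`net`, `optimizer`, `losses`, `batchLoss`) follow the snippet in your question.

```python
# Sketch only: create the scheduler right after createModel() inside trainModel,
# then step it once per epoch with the mean epoch loss.
net, lossfun, optimizer = createModel()
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=3)

# ... the existing epoch / batch loops go here ...

# at the end of each epoch, after `losses[epochi] = np.mean(batchLoss)`:
scheduler.step(losses[epochi])  # lowers the lr once the epoch loss stops improving
```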

It could be a dying ReLU problem. Can you check using leaky_relu or any other ReLU variant?
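
For example, a minimal variant of the forward pass from your snippet with leaky_relu swapped in (the negative_slope of 0.01 is just the default, shown here for illustration):

```python
def forward(self, x):
    # leaky_relu keeps a small non-zero gradient for negative inputs,
    # so units cannot get permanently stuck at zero ("die")
    x = F.leaky_relu(self.input(x), negative_slope=0.01)
    x = F.leaky_relu(self.fc1(x), negative_slope=0.01)
    x = F.leaky_relu(self.fc2(x), negative_slope=0.01)
    return self.out(x)
```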

Hi @Saidur_Pavel,

Surely the issue is the fact that your posterior tensor seems to be the phase of an imaginary number, which will hold a degeneracy and hence has an undefined gradient.

Can you share what ap is? As well as ap.op_angle_tensor?

Also, as @srishti-git1110 has already said, copy and paste your code and wrap it with three backticks ```, so your code snippet is more readable!

Thanks for your suggestion. I tried a learning rate scheduler, but there still seems to be no improvement.

Thanks for your suggestion. I tried using leaky_relu instead of relu, but the issue still remains. I will also try other variants to see if that solves the issue.

Hi, thanks for your comment. I did not fully get your point that the "posterior tensor seems to be the phase of an imaginary number". To me, the ‘posterior’ tensor seems to be real-valued.

Also, ‘ap’ is another Python file where I wrote the function op_angle_tensor. I have included the function here for convenience.

```python
import math

import numpy as np
import torch


def steer_tensor(n, d, phi):

    # Provides the array manifold
    # n   = number of antennas
    # d   = antenna separation
    # phi = DoAs with size n x 1

    pi = torch.tensor(math.pi)
    fac = pi/180
    out = torch.exp(-1j*2*pi*torch.arange(n).reshape(n, 1)*d*torch.sin(fac*phi))
    return out


def op_angle_tensor(low, high, interval, R, n):

    # This function generates the spectrum and extracts the estimated angles

    # low      = lower end of the theta range
    # high     = upper end of the theta range
    # interval = interval for discretizing the grid
    # R        = covariance matrix
    # n        = number of sources

    theta_r = torch.round(torch.arange(low, high+interval, .1), decimals=1)
    theta_r = theta_r.reshape(1, theta_r.shape[0])
    p = torch.zeros(theta_r.shape)
    n_antenna = R.shape[0]

    for index, ii in enumerate(theta_r[0]):
        aa = steer_tensor(n_antenna, .5, ii)
        p[:, index] = torch.abs(1/(aa.conj().T@torch.inverse(R)@aa))

    p = p/torch.sum(p)  # This will act as the updated posterior
    sorted_ind = torch.argsort(p)
    theta_o = theta_r[:, sorted_ind[:, -n:]]

    return theta_r[0], np.sort(theta_o)[0, 0], p[0]
```

I see.
What @AlphaBetaGamma96 has suggested makes sense to me.
Could you please share what they asked for?
The error could probably be solved then.

Hi @Saidur_Pavel,

Yeah, I should’ve written complex instead of imaginary, but my point still stands. The posterior tensor seems to be a real component of a complex value. The initial point I made about degeneracy is that it seems you’re trying to differentiate the phase of a complex number, which isn’t uniquely defined.

However, after looking at your ap.op_angle_tensor, why are you calling np.sort(theta_o) instead of torch.sort(theta_o)? That’ll surely break your graph, which may also explain your gradients being zero.

Could you also try reducing your initial_lr? If you are using, say, 1e-3, try 1e-4 maybe.

@Saidur_Pavel in fact, given the gradient is near zero, you could try starting with a larger learning rate to see if you’re suffering from the vanishing gradient problem. But do check why you’re using np.sort instead of the torch equivalent operation, as that’ll break your graph and hence your gradient.
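
For reference, the torch-only version of the last lines of op_angle_tensor would look roughly like this (a sketch; note that torch.sort returns a (values, indices) pair, and whether theta_o actually needs a gradient depends on how it is used downstream):

```python
# Inside op_angle_tensor: keep the sorting as torch operations instead of np.sort
sorted_ind = torch.argsort(p)
theta_o = theta_r[:, sorted_ind[:, -n:]]
theta_o_sorted, _ = torch.sort(theta_o)   # torch equivalent of np.sort(theta_o)

return theta_r[0], theta_o_sorted[0, 0], p[0]
```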