# The gradients of my model's weights approach zero

Hi, I am new to PyTorch. The gradients of my NN's weights become close to zero right at the beginning of training, and my loss function does not change much. Any help or suggestions would be much appreciated.

Problem description: My dataset contains two variables, `posterior` and `theta`. The variable `posterior` is the input to the NN and is essentially a probability distribution. For example, for a particular observation, the initial `posterior` is a uniform distribution over 1801 grid points. The final output should also look like a probability distribution, in which 9 of the grid points are highly probable (close to 1) and the remaining ones are close to zero.

The NN takes the posterior as input and outputs a matrix A. Then a variable X is computed from the input `theta`, followed by y = AX, and after some further calculation an updated posterior is returned. The posterior is updated for N samples of X. `BCELoss` is used to compute the loss between the updated posterior and the label.

```
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def createModel():

    class model(nn.Module):
        def __init__(self):
            super().__init__()

            self.input = nn.Linear(1801,2048)
            self.fc1 = nn.Linear(2048,2048)
            self.fc2 = nn.Linear(2048,2048)
            self.out = nn.Linear(2048,10*100)

        def forward(self,x):
            x = F.relu(self.input(x))

            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))

            x = self.out(x)

            return x

    net = model()
    #lossfun = nn.BCEWithLogitsLoss()
    lossfun = nn.BCELoss()
    # (the optimizer definition was not included in the post)

    return net,lossfun,optimizer

def trainModel(): # the def line was not included in the post; this name is assumed

    net,lossfun,optimizer = createModel()
    net.to(device)
    losses = np.zeros(numepochs)

    for epochi in range(numepochs):

        batchLoss = []

        for theta, posterior, label in train_loader:
            theta = theta.to(device)
            posterior = posterior.to(device)
            label = label.to(device)

            for samp in range(samples): # Iterate through the samples in X

                # Computing A matrix as the NN output and do some postprocessing
                A_tmp = net(posterior)
                A_tmp = A_tmp.cpu()
                A_shaped = A_tmp.view(-1,10,100)

                A_r = A_shaped[:,:,:50]
                A_i = A_shaped[:,:,50:]
                A = A_r+1j*A_i

                # Initializing a sample for X
                X = torch.zeros((theta.shape[0],50,1),dtype = torch.cfloat)

                # Computing X from theta for every data point in a batch
                for i in range(theta.shape[0]):
                    theta_loop = theta[i,:]
                    # (the computation of X from theta_loop was not included in the post)

                # Computing y
                y = A@X

                posterior = torch.zeros((theta.shape[0],1801))

                # Updating posterior
                for i in range(y.shape[0]):
                    y_loop = y[i,:,:] # Grab the i-th y from a batch of 32
                    Ry_loop = y_loop@y_loop.conj().T
                    posterior[i,:] = ap.op_angle_tensor(-90, 90, .1, Ry_loop, 9)[2]

                print(f'Sample No: {samp}\n\n')

            loss = lossfun(posterior,label)

            loss.backward()
            optimizer.step()

            batchLoss.append(loss.item())

        losses[epochi] = np.mean(batchLoss)
        print(f'Loss in epoch {epochi} is {losses[epochi]}')

    return losses,net
```

Could you try using this learning rate scheduler & see if it helps?

Also, please try to post the code by enclosing it between 3 backticks ```.

This could be the dying ReLU problem. Can you check by using `leaky_relu` or another ReLU variant?
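A minimal sketch of what that check could look like, assuming the same layer sizes as in the question. The class name `LeakyModel` and the helper `dead_fraction` are hypothetical names introduced here for illustration, and the slope of 0.01 is just the PyTorch default, not a tuned value.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeakyModel(nn.Module):  # hypothetical name, same layer sizes as the question
    def __init__(self, n_in=1801, n_hidden=2048, n_out=10 * 100):
        super().__init__()
        self.input = nn.Linear(n_in, n_hidden)
        self.fc1 = nn.Linear(n_hidden, n_hidden)
        self.fc2 = nn.Linear(n_hidden, n_hidden)
        self.out = nn.Linear(n_hidden, n_out)

    def forward(self, x):
        # leaky_relu keeps a small slope for negative inputs, so units cannot die completely
        x = F.leaky_relu(self.input(x), negative_slope=0.01)
        x = F.leaky_relu(self.fc1(x), negative_slope=0.01)
        x = F.leaky_relu(self.fc2(x), negative_slope=0.01)
        return self.out(x)

def dead_fraction(layer, x):
    """Fraction of units whose pre-activation is <= 0 for every row of x (hypothetical helper)."""
    with torch.no_grad():
        pre = layer(x)
        return (pre <= 0).all(dim=0).float().mean().item()

torch.manual_seed(0)
net = LeakyModel()
x = torch.full((32, 1801), 1.0 / 1801)  # uniform posterior over 1801 grid points
frac = dead_fraction(net.input, x)
out_shape = tuple(net(x).shape)
print("fraction of first-layer units that a plain ReLU would zero out:", frac)
print("output shape:", out_shape)
```

If the reported fraction is large with plain ReLU and the gradients still vanish after the swap, the activation is probably not the main cause.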

Surely the issue is the fact that your `posterior` tensor seems to be the phase of an imaginary number, which holds a degeneracy and hence has an undefined gradient.

Can you share what `ap` is, as well as `ap.op_angle_tensor`?

Also, as @srishti-git1110 has already said, copy and paste your code and wrap it in three backticks ```, so your code snippet is more readable!

Thanks for your suggestion. I tried the learning rate scheduler, but there still seems to be no improvement.

Thanks for your suggestion. I tried using `leaky_relu` instead of `relu`, but the issue remains. I will also try other variants to see if that solves the issue.
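To narrow down where the gradient goes to zero, one option is to print the gradient norm of every parameter right after `loss.backward()`. The sketch below uses a small stand-in network and random data, both of which are assumptions for illustration and not the model from the question. It also calls `optimizer.zero_grad()` before the backward pass so that gradients from earlier batches are not accumulated into the reading.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in network and random data (assumptions, not the original model)
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4), nn.Sigmoid())
lossfun = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

x = torch.rand(32, 8)
label = torch.randint(0, 2, (32, 4)).float()

optimizer.zero_grad()            # clear gradients left over from the previous step
loss = lossfun(net(x), label)
loss.backward()

# Collect one gradient norm per parameter tensor
grad_norms = {name: p.grad.norm().item() for name, p in net.named_parameters()}
for name, g in grad_norms.items():
    print(f"{name:12s} grad norm = {g:.3e}")

optimizer.step()
```

If the norms are tiny in every layer, the problem is more likely in the computation between the network output and the loss than in the network itself.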

Hi, thanks for your comment. I did not fully understand the part where you say " `posterior` tensor seems to be the phase of an imaginary number". To me, the `posterior` tensor seems to be real-valued.

Also, `ap` is another Python file, where I wrote the function `op_angle_tensor`. I have included the function here for convenience.

```
import math

import numpy as np
import torch

def steer_tensor(n,d,phi):

    # Provide array manifold
    # n = number of antennas
    # d = antenna separation
    # phi = DOAs with size n x 1

    pi = torch.tensor(math.pi)
    fac = pi/180
    out = torch.exp(-1j*2*pi*torch.arange(n).reshape(n,1)*d*torch.sin(fac*phi))
    return out

def op_angle_tensor(low,high,interval,R,n):

    # This function generates the spectrum and extracts the estimated angles

    # low = lower range of theta
    # high = higher range of theta
    # interval = interval for discretizing grid
    # R = covariance matrix
    # n = number of sources

    theta_r = torch.round(torch.arange(low,high+interval,.1),decimals = 1)
    theta_r = theta_r.reshape(1,theta_r.shape[0])
    p = torch.zeros(theta_r.shape)
    n_antenna = R.shape[0]

    for index,ii in enumerate(theta_r[0]):
        aa = steer_tensor(n_antenna,.5,ii)
        p[:,index] = torch.abs(1/(aa.conj().T@torch.inverse(R)@aa))

    p = p/torch.sum(p) # This will act as updated posterior
    sorted_ind = torch.argsort(p)
    theta_o = theta_r[:,sorted_ind[:,-n:]]

    return theta_r[0],np.sort(theta_o)[0,0],p[0]
```
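Read off the code above, the quantity returned as the updated posterior can be written as follows. This is a transcription of the posted functions, with $n$ antennas, spacing $d$, and grid angles $\theta_m$ in degrees; the symbol names are chosen here for illustration.

$$
a(\theta) = \left[\, e^{-j 2\pi k d \sin(\pi\theta/180)} \,\right]_{k=0}^{n-1},
\qquad
R = y\,y^{H},
$$

$$
P(\theta_m) = \left| \frac{1}{a(\theta_m)^{H} R^{-1} a(\theta_m)} \right|,
\qquad
p(\theta_m) = \frac{P(\theta_m)}{\sum_{l} P(\theta_l)} .
$$

The unnormalized form $P(\theta)$ appears to be what is usually called the Capon (MVDR) spectrum. The loss depends on the network output only through $y = AX$, then $R$, then $R^{-1}$, so the gradient has to pass through the matrix inverse and the absolute value.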

I see.
What @AlphaBetaGamma96 has suggested makes sense to me.
However, after looking at your `ap.op_angle_tensor`, why are you calling `np.sort(theta_o)` instead of `torch.sort(theta_o)`? That'll surely break your graph, which may also explain your gradients being zero.
Could you also try reducing your `initial_lr`? If you are using, say, `1e-3`, maybe try `1e-4`.

@Saidur_Pavel in fact, given that the gradient is near zero, you could try starting with a larger learning rate to see if you're suffering from the vanishing gradient problem. But do check why you're using `np.sort` instead of an equivalent torch operation, as that'll break your graph and hence your gradient.
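A small sketch of how one could check whether an operation keeps a tensor attached to the autograd graph. It uses a toy tensor `w` that is an assumption for illustration, not the poster's data. Note that in the posted training loop only the third return value of `op_angle_tensor` (the normalized `p[0]`) is used, so it is worth running the same `grad_fn` check on `posterior` just before the loss is computed.

```python
import numpy as np
import torch

# Toy stand-in for a quantity that depends on the network output (assumption)
w = torch.tensor([3.0, 1.0, 2.0], requires_grad=True)
p = w / w.sum()

# torch.sort stays inside the graph: the result has a grad_fn
sorted_torch, _ = torch.sort(p)
print("torch.sort keeps graph:", sorted_torch.grad_fn is not None)

# Going through NumPy needs an explicit detach, and the result is cut off from the graph
sorted_np = torch.from_numpy(np.sort(p.detach().numpy()))
print("via NumPy requires_grad:", sorted_np.requires_grad)

# Gradients flow back to w only through the torch path
sorted_torch[-1].backward()
print("w.grad:", w.grad)
```

If `posterior.grad_fn` is `None` at the point where the loss is computed, the graph was cut somewhere upstream; if it is not `None`, the near-zero gradients come from the values themselves and not from a detached tensor.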