Grad is very small

At first I thought there was no update after the loss calculation and the optimizer step, but the following code prints False, which proves that an update does happen:

before = list(model.parameters())[0].clone()   # copy of the first parameter tensor
loss.backward()
optimizer.step()
after = list(model.parameters())[0].clone()    # same parameter after the update
logging.info(torch.equal(before.data, after.data))  # False -> the parameter changed

Then I wanted to see the magnitude of the gradients with the following code:
list(model.parameters())[0].grad
which gives me very small values:

tensor([[ 3.2294e-07,  7.3983e-06,  6.5637e-06,  ..., -1.3529e-06,
         -4.9979e-06,  4.9799e-06],
        [ 5.5463e-08,  3.0771e-06, -9.5087e-07,  ...,  1.9265e-06,
         -4.5251e-06, -7.1564e-07], ...]

This is part of the output, which shows how small the gradients are.
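To see whether the small values are limited to that one tensor or a general pattern, a quick diagnostic (a sketch, not code from the question) is to log the gradient norm of every parameter right after loss.backward(); if the norms shrink layer by layer, that points at vanishing gradients:

for name, param in model.named_parameters():
    if param.grad is not None:
        logging.info("%s: grad norm = %.3e", name, param.grad.norm().item())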

Here is my model:

self.lstm_hidden_dim = lstm_hidden_dim
self.lstm = nn.LSTM(word_embedding_dim, lstm_hidden_dim, num_layers=1, batch_first=False, bidirectional=True)
self.firstHidden = nn.Linear(lstm_hidden_dim * 2, 300)
self.relu = nn.ReLU()
self.secondlinear = nn.Linear(300, score_space_size)
self.softmax = nn.LogSoftmax(dim=1)
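
For context, here is a minimal sketch of how these layers could be wired into a full nn.Module; the class name, the forward pass, and the flattening before LogSoftmax are assumptions, since only the layer definitions are shown in the question:

import torch.nn as nn

class TaggerSketch(nn.Module):  # hypothetical class name
    def __init__(self, word_embedding_dim, lstm_hidden_dim, score_space_size):
        super().__init__()
        self.lstm_hidden_dim = lstm_hidden_dim
        self.lstm = nn.LSTM(word_embedding_dim, lstm_hidden_dim, num_layers=1,
                            batch_first=False, bidirectional=True)
        self.firstHidden = nn.Linear(lstm_hidden_dim * 2, 300)
        self.relu = nn.ReLU()
        self.secondlinear = nn.Linear(300, score_space_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, embeds):
        # embeds: (seq_len, batch, word_embedding_dim) because batch_first=False
        lstm_out, _ = self.lstm(embeds)            # (seq_len, batch, 2 * lstm_hidden_dim)
        x = self.relu(self.firstHidden(lstm_out))  # (seq_len, batch, 300)
        scores = self.secondlinear(x)              # (seq_len, batch, score_space_size)
        scores = scores.view(-1, scores.size(-1))  # flatten so LogSoftmax(dim=1) acts over classes
        return self.softmax(scores)                # log-probabilities for NLLLoss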

Loss function and optimizer:

loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
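
A minimal training step with this loss and optimizer might look like the sketch below; sentence_embeds and targets are placeholder names, and the main thing to check is that optimizer.zero_grad() is called before each backward pass so gradients from earlier steps do not accumulate:

model.train()
optimizer.zero_grad()                      # clear gradients from the previous step
log_probs = model(sentence_embeds)         # (N, score_space_size) log-probabilities
loss = loss_function(log_probs, targets)   # targets: (N,) class indices for NLLLoss
loss.backward()
optimizer.step()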

Any idea why the gradients are so small?

Try applying an activation function such as softsign, tanh, or relu to the output of the LSTM layer.
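
For example, a minimal sketch of applying tanh inside the forward pass (following the hypothetical forward shown above, not code from the question):

lstm_out, _ = self.lstm(embeds)
lstm_out = torch.tanh(lstm_out)             # squash LSTM activations to [-1, 1]
x = self.relu(self.firstHidden(lstm_out))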