LSTM - Ones vs Randn Initialization - NaN loss

Hello, my question is about the output of the loss function (cross entropy) for models initialized with ones versus randn. If I initialize with ones, the loss is a valid float (e.g. -107), but if I go with randn, the loss always comes out as NaN.

-> I have checked the gradients; they flow in both cases (the model gets updated as well). The only difference is that the ones initialization shows an actual loss while randn shows a NaN loss.
-> For extra info: my sequences are up to 40 timesteps long, but as I said, I do not suspect vanishing gradients since I have checked them manually at every step.

So what could be the problem? I'm out of ideas. Thank you…

Edit: the code for the loss function is below.

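import torch

def custom_entropy(output_seq, label_seq):
    loss_all = []  # per-step losses, one entry per timestep
    for t in range(len(label_seq)):
        lbl = label_seq[t]
        pred = output_seq[t]
        # element-wise cross entropy for step t, averaged over entries
        loss = (-torch.log(pred) * lbl).mean()
        loss_all.append(loss)  # collect this step's loss
    return loss_all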
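
One thing that may be worth checking (a sketch, not a confirmed diagnosis): `torch.log` returns NaN for negative inputs and `-inf` for zero, and both turn the mean into NaN even where the label is 0, because `nan * 0` and `inf * 0` are NaN. A negative loss such as -107 would also suggest that `pred` holds values above 1, assuming the labels are non-negative, so the outputs may not be normalized probabilities. The tensors below are made up for illustration, and using `torch.log_softmax` on raw scores is an assumed fix, not something taken from the original model.

```python
import torch

# Hypothetical raw model outputs: 3 timesteps, 4 classes (values are made up).
raw = torch.tensor([[ 1.5, -0.3,  0.2,  0.6],
                    [-1.1,  0.4,  0.8, -0.7],
                    [ 0.3, -0.6,  0.2,  1.0]])

# Hypothetical one-hot labels, one row per timestep.
labels = torch.tensor([[1., 0., 0., 0.],
                       [0., 0., 1., 0.],
                       [0., 0., 0., 1.]])

# Same formula as in custom_entropy, applied to unnormalized outputs.
# log of a negative entry is nan, and nan * 0 is still nan, so the mean is nan.
naive = (-torch.log(raw) * labels).mean()

# log_softmax turns raw scores into valid log-probabilities (all <= 0),
# so the same loss expression stays finite and non-negative.
log_probs = torch.log_softmax(raw, dim=-1)
stable = (-log_probs * labels).mean()

print("naive is nan:", torch.isnan(naive).item())
print("stable is finite:", torch.isfinite(stable).item())
```

A quick check on the real model would be to print `output_seq[t].min()` and `output_seq[t].max()` before the `torch.log` call; any value at or below zero would explain the NaN.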