Apologies for the long post. I am very new to PyTorch, and really confused about the difference between cross-entropy loss and NLL loss.

I am trying to use softmax on a sequence-to-sequence problem. I have targets of two types: the first has 10 possible tokens and the second has 100. The model has two outputs, and I compute a loss for each type and sum them. The model output has shape `[seq_length, number of tokens, embedding dim]` and the target has shape `[seq_length, number of tokens]`.

For `nll_loss` I apply log softmax over dimension 1 of the output, then pass the output to the loss function after transposing its last two dimensions, i.e. `[seq_length, embedding dim, number of tokens]`. For cross-entropy I pass the raw logits of shape `[seq_length, number of tokens]`. But in each case the loss just fluctuates and does not decrease. What am I doing wrong here?
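To make the two paths concrete, here is a minimal runnable sketch of the loss computation as I understand it, with made-up shapes and random tensors (note that `F.nll_loss` and `F.cross_entropy` both expect class-index targets, not one-hot vectors, so the targets below are indices):

```python
import torch
import torch.nn.functional as F

seq_length, n_tokens1, n_tokens2 = 3, 10, 100

# Hypothetical raw logits, one head per target type
logits1 = torch.randn(seq_length, n_tokens1)
logits2 = torch.randn(seq_length, n_tokens2)

# Class-index targets (what nll_loss / cross_entropy expect)
target1 = torch.randint(0, n_tokens1, (seq_length,))
target2 = torch.randint(0, n_tokens2, (seq_length,))

# NLL path: log-softmax over the class dimension, then nll_loss
loss1 = F.nll_loss(F.log_softmax(logits1, dim=1), target1)
# Cross-entropy path: raw logits go straight in
# (cross_entropy applies log-softmax internally)
loss2 = F.cross_entropy(logits2, target2)

total_loss = loss1 + loss2
```

With 2-D logits `[N, C]` and index targets `[N]` the two paths are equivalent: `F.cross_entropy(x, t)` equals `F.nll_loss(F.log_softmax(x, dim=1), t)`.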

A short description of the full process follows.

The inputs are a sequence of tuples like `[(1,15), (2,27), (13,10)]`. First I one-hot encode each token in each tuple, then embed them so that the first element of each tuple has shape `[10, embed dim]`. I then concatenate the two tensors of each tuple to create one embedding per tuple, and finally stack these to create an input of shape `[3, 110, embed dim]`.
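Roughly, the input construction looks like this. This is only a sketch under my assumptions: `embed_dim` and the embedding tables `E1`/`E2` are made up, the example indices are kept in range, and I combine the two per-token tensors by concatenation, since that is the only way the sizes work out (10 + 100 = 110):

```python
import torch
import torch.nn.functional as F

embed_dim = 16            # hypothetical embedding size
V1, V2 = 10, 100          # vocab sizes of the two tuple positions
E1 = torch.randn(V1, embed_dim)   # hypothetical embedding tables
E2 = torch.randn(V2, embed_dim)

pairs = [(1, 15), (2, 27), (3, 10)]   # in-range example sequence
rows = []
for t1, t2 in pairs:
    oh1 = F.one_hot(torch.tensor(t1), V1).float()   # [10]
    oh2 = F.one_hot(torch.tensor(t2), V2).float()   # [100]
    m1 = oh1.unsqueeze(-1) * E1                     # [10, embed_dim]
    m2 = oh2.unsqueeze(-1) * E2                     # [100, embed_dim]
    rows.append(torch.cat([m1, m2], dim=0))         # [110, embed_dim]

x = torch.stack(rows)   # [3, 110, embed_dim]
```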

The two output logits have shape `[3, 10, embedding dim]` and `[3, 100, embedding dim]`. If the target is `[(2,27), (12,10), (5,90)]`, the elements are one-hot encoded separately, so target 1 has shape `[3, 10]` and target 2 has shape `[3, 100]`.
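The target encoding can be sketched as follows (with hypothetical in-range indices standing in for my real data). Since `F.nll_loss` and `F.cross_entropy` take class indices rather than one-hot rows, the index form is recoverable with `argmax`:

```python
import torch
import torch.nn.functional as F

# Hypothetical in-range targets with the same structure as mine
targets = [(2, 27), (5, 9), (5, 90)]
t1 = torch.tensor([a for a, _ in targets])   # first elements, vocab of 10
t2 = torch.tensor([b for _, b in targets])   # second elements, vocab of 100

onehot1 = F.one_hot(t1, num_classes=10).float()    # [3, 10]
onehot2 = F.one_hot(t2, num_classes=100).float()   # [3, 100]

# Index form (what the loss functions expect) from the one-hot form
idx1 = onehot1.argmax(dim=1)   # equals t1
idx2 = onehot2.argmax(dim=1)   # equals t2
```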