Text generator LSTM without softmax

Hi, I am trying to get the next predicted word in a sequence. Is it possible to take the maximum value from the dense layer's output without a softmax, since I am not doing any classification?

Well, you do classification. The vocabulary represents your classes, and the predicted class is the next word.

I see… But I do not use a softmax in the output dense layer. I am picking the maximum value from the output, whose size equals the vocabulary size.

Yes, that would be possible, if you return the class logits directly.
Since softmax is monotonic, the argmin and argmax of the logits would be the same as after applying a softmax.
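For example, with some made-up logits:

```python
import torch
import torch.nn.functional as F

# hypothetical logits for a batch of 4 samples and a vocabulary of 10 words
logits = torch.randn(4, 10)
probs = F.softmax(logits, dim=1)

# softmax is monotonic per sample, so the predicted index stays the same
print(torch.equal(logits.argmax(dim=1), probs.argmax(dim=1)))  # True
```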

So can we say "a dense layer without softmax can return the argmax of the logits, and this works just like softmax"?

"yhat = np.argmax(yhat.to(“cpu”).detach().numpy()) "
i am using argmax for converting probability to integer and i think this is what you said right?
since this integer value maps a word in my vocabulary i am able to predict the next word this way without a sofmax i guess.

It depends on what your use case is. The actual values will of course be different, but the maximal index will be the same, so you can use it to get the predicted class (most likely class).

However, you need to be careful if you want to calculate a loss with these values: you should pass the logits to e.g. nn.CrossEntropyLoss.

This should work if you specify the right dimension for the argmax operation. Usually the class dimension would be dim1.
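As a small sketch (shapes made up, 7579 just mirrors your vocabulary size), the raw logits go directly to nn.CrossEntropyLoss, while the argmax over dim1 gives the predicted word:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# hypothetical dense-layer output: 8 samples, 7579-word vocabulary (no softmax)
logits = torch.randn(8, 7579)
targets = torch.randint(0, 7579, (8,))  # ground-truth word indices

loss = criterion(logits, targets)  # expects the raw logits
preds = logits.argmax(dim=1)       # predicted word index per sample (dim1 = classes)
```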

Yes, I use CrossEntropyLoss. I didn't understand what you mean by the class dimension, but my dense layer takes as input a vector of size 256, which is the output of an LSTM, and its output is a one-dimensional, vocabulary-sized vector.

Your output should have the shape [batch_size, nb_classes], right?
If that’s the case, you should use np.argmax(yhat.to('cpu').detach().numpy(), axis=1) to get the predicted class for each sample in the batch.
Otherwise np.argmax will only return a single value for the entire array.
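E.g., with a made-up [3, 5] output:

```python
import numpy as np
import torch

# hypothetical output for a batch of 3 samples and 5 classes
yhat = torch.randn(3, 5)
arr = yhat.to('cpu').detach().numpy()

print(np.argmax(arr))          # single index into the flattened array
print(np.argmax(arr, axis=1))  # one predicted class per sample, shape (3,)
```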

The output shape is torch.Size([47, 7579]); 7579 is my vocabulary size and 47 is the total number of predicted next words. For a given image and the first word of a description, the model predicts the second word, then uses the second word as input for the third, and so on. (The model generates captions for images, and there are 5 descriptions for each image in the training text split.) For example, these integers are encoded words from my vocabulary, 47 in total; 2 means end of sentence:
[42, 3, 87, 169, 6, 117, 55, 393, 11, 394, 3, 27, 4474, 640, 2, 18, 313, 64, 195, 118, 2, 39, 18, 117, 64, 195, 2057, 2, 39, 18, 117, 4, 394, 19, 60, 2057, 2, 39, 18, 3, 87, 169, 313, 64, 195, 2915, 2]
Sorry if it's not well explained, but I tried to explain it as best I can :smiley:

I assume you are using a batch size of 1, since the batch dimension is missing in the output?
If so, you could still use axis=1 to get the predicted word from the vocabulary.

I see. Thank you! Is there a special name for "using a dense layer like this without a softmax"?
Maybe I can read more about it to make it clearer. I saw something similar called "multi-parameter regression", but I'm not sure.

I’ve probably phrased it wrong.
The output of a linear layer, which can be used for a multi-class classification use case, will return the logits, which are raw prediction values in the range [-Inf, +Inf].
Internally nn.CrossEntropyLoss, which is usually used for this classification use case, will apply F.log_softmax and nn.NLLLoss, so the raw prediction values will be transformed to log probabilities.
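A quick sketch with arbitrary shapes to show that equivalence:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# arbitrary logits and targets just to compare the two formulations
logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))

loss_ce = nn.CrossEntropyLoss()(logits, targets)
loss_manual = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(loss_ce, loss_manual))  # True
```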


If I understood correctly, with CrossEntropyLoss there won't be a need for an extra softmax, because it already applies it during the loss calculation.

That is correct. It will apply F.log_softmax internally, so you shouldn’t add a softmax layer in your model.


Thank you very much @ptrblck, you helped me a lot!