Text generator LSTM without softmax

cevvalu · May 23, 2020, 7:32pm

Hi, I am trying to get the next predicted word in a sequence. Is it possible to take the maximum value from the dense layer’s output without a softmax, since I am not doing any classification?

vdw · May 24, 2020, 4:25am

Well, you do classification. The vocabulary represents your classes, and the predicted class is the next word.

cevvalu · May 24, 2020, 9:02am

I see… But I do not use a softmax in the output dense layer. I am picking the maximum value from the output, whose size equals the vocabulary size.

ptrblck · May 24, 2020, 9:10am

Yes, that would be possible, if you return the class logits directly.
The argmin and argmax values of the logits would be the same as after applying a softmax.
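
To illustrate the point above: softmax is strictly increasing in each input, so it preserves the ordering of the logits. The following is a minimal, dependency-free sketch; the logit values and the helper names `softmax`, `argmax` and `argmin` are made up for illustration and are not PyTorch functions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def argmin(values):
    """Index of the smallest value."""
    return min(range(len(values)), key=lambda i: values[i])

# Made-up logits for a 5-word vocabulary (illustrative values only).
logits = [1.2, -0.7, 3.4, 0.0, 2.9]
probs = softmax(logits)

# The values differ, but the positions of the max and min do not.
print(argmax(logits), argmax(probs))
print(argmin(logits), argmin(probs))
```

The probabilities sum to 1 while the logits do not, yet both lists pick out the same indices.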

cevvalu · May 24, 2020, 9:18am

So we can say “a dense layer without softmax returns logits, and taking their argmax works just like taking it after a softmax”?

cevvalu · May 24, 2020, 9:22am

yhat = np.argmax(yhat.to("cpu").detach().numpy())
I am using argmax to convert the output to an integer, and I think this is what you meant, right?
Since this integer maps to a word in my vocabulary, I can predict the next word this way without a softmax, I guess.

ptrblck · May 24, 2020, 9:23am

It depends on what your use case is. The actual values will of course be different, but the index of the maximum will be the same, so you can use it to get the predicted (most likely) class.

However, you need to be careful if you want to calculate a loss with these values: nn.CrossEntropyLoss, for example, expects raw logits, not softmax outputs.

This should work if you specify the right dimension for the argmax operation. Usually the class dimension would be dim1.

cevvalu · May 24, 2020, 9:30am

Yes, I use CrossEntropyLoss. I didn’t understand what you mean by class dimension, but my dense layer takes as input a vector of size 256, which is the output of an LSTM, and its output is a 1-dimensional vector of vocabulary size.

ptrblck · May 24, 2020, 9:38am

Your output should have the shape [batch_size, nb_classes], right?
If that’s the case, you should use np.argmax(yhat.to('cpu').detach().numpy(), axis=1) to get the predicted class for each sample in the batch.
Otherwise np.argmax will only return a single value for the entire array.
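
The difference between the two calls can be seen on a small array. This is a sketch with made-up scores; the array `yhat` below stands in for the result of `yhat.to('cpu').detach().numpy()` and is assumed to have the shape [batch_size, nb_classes].

```python
import numpy as np

# Stand-in for the model output: 3 samples, 4 classes (made-up values).
yhat = np.array([
    [0.1, 2.0, -1.0, 0.3],
    [1.5, 0.2, 0.4, 3.0],
    [4.0, 0.0, 0.1, 0.2],
])

# Without axis, argmax runs over the flattened array and returns one index.
flat_index = np.argmax(yhat)

# With axis=1, argmax runs over the class dimension, once per sample.
per_sample = np.argmax(yhat, axis=1)

print(flat_index)   # a single position in the flattened array
print(per_sample)   # one class index per row
```

Here the single flat index points at the largest entry of the whole array, which says nothing about the first two samples, while `axis=1` yields a prediction for every row.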

cevvalu · May 24, 2020, 10:07am

The output shape is torch.Size([47, 7579]); 7579 is my vocabulary size and 47 is the total number of predicted next words. For a given image and the first word of a description, the model predicts the second word, then uses the second word as input to predict the third, and so on. (The model generates captions for images, and there are 5 descriptions for each image in the training text split.) For example, these integers are encoded words from my vocabulary, 47 in total; 2 marks the end of a sentence.
[42, 3, 87, 169, 6, 117, 55, 393, 11, 394, 3, 27, 4474, 640, 2, 18, 313, 64, 195, 118, 2, 39, 18, 117, 64, 195, 2057, 2, 39, 18, 117, 4, 394, 19, 60, 2057, 2, 39, 18, 3, 87, 169, 313, 64, 195, 2915, 2]
torch.Size([47, 7579])
Sorry if it’s not well explained, but I tried to explain it as best I can.
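
The word-by-word generation described in this post can be sketched as a greedy decoding loop. This is only an illustration under stated assumptions: `toy_scores` is a hypothetical stand-in for the real model (which would also condition on the image), and the vocabulary size of 8 is made up. Only the end-of-sentence token 2 comes from the post.

```python
END_TOKEN = 2  # the post uses 2 to mark the end of a sentence

def toy_scores(prev_token, vocab_size):
    """Hypothetical stand-in for the model: one raw score per vocabulary entry."""
    scores = [0.0] * vocab_size
    # Toy rule: favour prev_token + 1, or END_TOKEN once the last id is reached.
    nxt = prev_token + 1 if prev_token + 1 < vocab_size else END_TOKEN
    scores[nxt] = 1.0
    return scores

def greedy_decode(first_token, vocab_size, max_len=20):
    """Repeatedly pick the highest-scoring word and feed it back in."""
    tokens = [first_token]
    while len(tokens) < max_len:
        scores = toy_scores(tokens[-1], vocab_size)
        # argmax over the raw scores; no softmax is needed to pick the word
        next_token = max(range(vocab_size), key=lambda i: scores[i])
        tokens.append(next_token)
        if next_token == END_TOKEN:
            break
    return tokens

caption = greedy_decode(first_token=5, vocab_size=8)
print(caption)
```

In a real captioning model the scores would come from the dense layer on top of the LSTM, and the loop would stop at the end token or at a maximum length, as here.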

ptrblck · May 24, 2020, 10:23pm

I assume you are using a batch size of 1, since the batch dimension is missing in the output?
If so, you could still use axis=1 to get the predicted word from the vocabulary.

cevvalu · May 24, 2020, 11:05pm

I see. Thank you! Is there a special name for “using a dense layer like this without a softmax”?
Maybe I can read more about it to make it clear. I saw something similar called “multi parameter regression”, but I’m not sure.

ptrblck · May 24, 2020, 11:08pm

I’ve probably phrased it wrong.
The output of a linear layer, which can be used for a multi-class classification use case, will return the logits, which are raw prediction values in the range [-Inf, +Inf].
Internally nn.CrossEntropyLoss, which is usually used for this classification use case, will apply F.log_softmax and nn.NLLLoss, so the raw prediction values will be transformed to log probabilities.
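
The two-step computation described above can be written out by hand. This is a NumPy sketch of the idea, not PyTorch's actual implementation: `log_softmax` and `nll_loss` here are illustrative re-implementations, the logits and targets are made up, and the loss is averaged over the batch on the assumption that the default `reduction='mean'` is used.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def nll_loss(log_probs, targets):
    """Mean negative log-probability assigned to the target class."""
    rows = np.arange(len(targets))
    return -log_probs[rows, targets].mean()

# Made-up raw outputs of a linear layer: 2 samples, 3 classes.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
targets = np.array([0, 2])

log_probs = log_softmax(logits)      # logits -> log probabilities
loss = nll_loss(log_probs, targets)  # log probabilities -> scalar loss
print(loss)
```

Because the softmax step already happens inside the loss, feeding it probabilities instead of logits would apply the normalisation twice.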

cevvalu · May 25, 2020, 5:13pm

If I understood correctly, with CrossEntropyLoss there is no need for an extra softmax, because it is already applied during the loss calculation.

ptrblck · May 26, 2020, 2:32am

That is correct. It will apply F.log_softmax internally, so you shouldn’t add a softmax layer in your model.

cevvalu · May 26, 2020, 10:39am

Thank you very much, @ptrblck, that helped me a lot.