Understanding AdaptiveLogSoftmaxWithLoss


I couldn’t work out from the documentation how to use nn.AdaptiveLogSoftmaxWithLoss.

Could someone please explain, or give an example of how to use this layer instead of a regular LogSoftmax layer?



I tried the following model:

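import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class BiLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, num_layers):
        super(BiLSTM, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, num_layers=self.num_layers, bidirectional=True)
#         self.fc = nn.Linear(2 * hidden_size, vocab_size)
#         self.softmax = nn.LogSoftmax(dim=2)
        # Stands in for both the Linear projection and the LogSoftmax above.
        # vocab_size must be larger than the largest cutoff (1000 here).
        self.softmax = nn.AdaptiveLogSoftmaxWithLoss(2 * hidden_size, vocab_size, cutoffs=[10, 100, 1000])

    def forward(self, input, target):
        batch_size = input.size(0)
        out = self.embedding(input)           # (batch, seq_len, embedding_dim)
        hidden = self._init_hidden(batch_size)
        out = out.permute(1, 0, 2)            # (seq_len, batch, embedding_dim)
        out, hidden = self.lstm(out, hidden)  # (seq_len, batch, 2 * hidden_size)
#         out = self.fc(out)
        # The adaptive softmax takes 2-D input (N, features) and 1-D targets (N,).
        # These squeezes only give that when seq_len == 1 and target is (batch, 1).
        out = out.squeeze(dim=0)
        target = target.squeeze(dim=1)
        # out: log-probability of each target, loss: mean negative log-likelihood
        out, loss = self.softmax(out, target)
        return out, loss

    def _init_hidden(self, batch_size):
        # (h_0, c_0), each (num_layers * num_directions, batch, hidden_size)
        return (torch.zeros(2 * self.num_layers, batch_size, self.hidden_size).to(device),
                torch.zeros(2 * self.num_layers, batch_size, self.hidden_size).to(device))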
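
For anyone comparing this against a regular LogSoftmax, here is a minimal sketch of the layer on its own. The sizes and the random tensors are made up for illustration and are not from my model. The main differences, as far as I can tell, are that the layer wants flattened 2-D input with 1-D targets, and that calling it returns the log-probability of each target plus the loss rather than the full distribution. The full distribution is available through `log_prob`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only (assumptions, not taken from the model above).
in_features, n_classes = 64, 2000
asm = nn.AdaptiveLogSoftmaxWithLoss(in_features, n_classes, cutoffs=[10, 100, 1000])

# Pretend this is an LSTM output of shape (seq_len, batch, features)
# with one target word index per position.
seq_len, batch = 5, 8
lstm_out = torch.randn(seq_len, batch, in_features)
targets = torch.randint(0, n_classes, (seq_len, batch))

# Flatten to (N, in_features) and (N,) before calling the layer.
flat_out = lstm_out.reshape(-1, in_features)
flat_targets = targets.reshape(-1)

result = asm(flat_out, flat_targets)
target_log_probs = result.output  # (N,) log-probability of each target
loss = result.loss                # scalar, mean negative log-likelihood

# What a regular LogSoftmax would have given: one row of log-probs per input.
full_log_probs = asm.log_prob(flat_out)  # (N, n_classes)
predictions = asm.predict(flat_out)      # (N,) most likely class per input
```

So the separate NLLLoss step disappears, because the loss already comes back from the forward call.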

I chose the cutoffs after checking the density of the word counts, not just because they appear in the tutorial.
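
A rough sketch of the kind of check I mean, on a toy corpus rather than my real data. The corpus, the `coverage` helper and the candidate cutoffs are all hypothetical. The documentation asks for labels to be sorted by frequency, with index 0 as the most frequent word, so the vocabulary is indexed that way first.

```python
from collections import Counter

# Hypothetical toy corpus; a real one would come from the training data.
tokens = "the cat sat on the mat the dog sat on the log a cat and a dog".split()
counts = Counter(tokens)

# Most frequent word gets index 0, the next one index 1, and so on.
vocab = [word for word, _ in counts.most_common()]
word_to_idx = {word: idx for idx, word in enumerate(vocab)}

def coverage(cutoff):
    """Fraction of all token occurrences whose word index is below `cutoff`."""
    total = sum(counts.values())
    covered = sum(counts[word] for word in vocab[:cutoff])
    return covered / total

# Candidate cutoffs for the toy vocabulary; real ones would be much larger.
for cutoff in (1, 3, 5):
    print(f"cutoff={cutoff}: covers {coverage(cutoff):.1%} of tokens")
```

The idea is to put a cutoff where the coverage curve flattens, so that the small head cluster handles most of the tokens.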

I’m not seeing a significant change in performance. Has anyone had a different experience?
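
If performance here means speed, one way to check outside the full model is to time a forward and backward pass of the output layer alone. This is only a sketch under assumed sizes: the dimensions, the skewed random targets and the `time_steps` helper are made up, and the result will depend on hardware, vocabulary size and how skewed the real word frequencies are.

```python
import time

import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes; swap in the real hidden size and vocabulary size.
in_features, n_classes, n_samples = 128, 20000, 256
x = torch.randn(n_samples, in_features)
# Targets drawn only from the 100 most frequent indices to mimic a skewed corpus.
y = torch.randint(0, 100, (n_samples,))

# Baseline: full Linear projection + LogSoftmax + NLLLoss.
full = nn.Sequential(nn.Linear(in_features, n_classes), nn.LogSoftmax(dim=1))
nll = nn.NLLLoss()
adaptive = nn.AdaptiveLogSoftmaxWithLoss(in_features, n_classes, cutoffs=[10, 100, 1000])

def time_steps(loss_fn, params, steps=5):
    """Average seconds per forward + backward pass over `steps` runs."""
    start = time.perf_counter()
    for _ in range(steps):
        for p in params:
            p.grad = None  # clear gradients from the previous step
        loss_fn().backward()
    return (time.perf_counter() - start) / steps

full_time = time_steps(lambda: nll(full(x), y), list(full.parameters()))
adaptive_time = time_steps(lambda: adaptive(x, y).loss, list(adaptive.parameters()))
print(f"full softmax: {full_time * 1e3:.2f} ms/step, adaptive: {adaptive_time * 1e3:.2f} ms/step")
```

On a GPU the timings would also need `torch.cuda.synchronize()` before reading the clock, otherwise the numbers mostly measure kernel launches.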