Hi,
I couldn’t understand from the documentation how I can go about using nn.AdaptiveLogSoftmaxWithLoss.
Could someone please explain, or give an example of how to use this layer instead of a regular LogSoftmax layer?
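From the docs, my best guess at the basic call pattern is something like the snippet below (the sizes and cutoffs are just made-up placeholders), but I'm not sure whether this is the intended usage:

    import torch
    import torch.nn as nn

    # in_features = size of the final hidden representation, n_classes = vocabulary size
    adaptive = nn.AdaptiveLogSoftmaxWithLoss(in_features=64, n_classes=5000, cutoffs=[10, 100, 1000])

    features = torch.randn(32, 64)            # (N, in_features) activations from the model
    target = torch.randint(0, 5000, (32,))    # (N,) class indices

    out, loss = adaptive(features, target)    # out: (N,) log-prob of each sample's target, loss: scalar mean NLL
    log_probs = adaptive.log_prob(features)   # (N, n_classes) full log-softmax, like LogSoftmax would give
    preds = adaptive.predict(features)        # (N,) most likely class per sample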
Thanks!
I tried the following model:
    import torch
    import torch.nn as nn

    # device is defined elsewhere in my script; shown here for completeness
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


    class BiLSTM(nn.Module):
        def __init__(self, vocab_size, embedding_dim, hidden_size, num_layers):
            super(BiLSTM, self).__init__()
            self.embedding_dim = embedding_dim
            self.hidden_size = hidden_size
            self.num_layers = num_layers
            self.embedding = nn.Embedding(vocab_size, embedding_dim)
            self.lstm = nn.LSTM(embedding_dim, hidden_size, num_layers=self.num_layers, bidirectional=True)
            # self.fc = nn.Linear(2 * hidden_size, vocab_size)
            # self.softmax = nn.LogSoftmax(dim=2)
            self.softmax = nn.AdaptiveLogSoftmaxWithLoss(2 * hidden_size, vocab_size, cutoffs=[10, 100, 1000])

        def forward(self, input, target):
            batch_size = input.size(0)
            out = self.embedding(input)            # (batch, seq, embedding_dim)
            hidden = self._init_hidden(batch_size)
            out = out.permute(1, 0, 2)             # (seq, batch, embedding_dim)
            out, hidden = self.lstm(out, hidden)   # (seq, batch, 2 * hidden_size)
            # out = self.fc(out)
            out = out.squeeze(dim=0)               # (batch, 2 * hidden_size), since seq_len == 1
            target = target.squeeze(dim=1)         # (batch,)
            out, loss = self.softmax(out, target)
            return out, loss

        def _init_hidden(self, batch_size):
            return (torch.zeros(2 * self.num_layers, batch_size, self.hidden_size).to(device),
                    torch.zeros(2 * self.num_layers, batch_size, self.hidden_size).to(device))
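To make the shapes concrete, here is a minimal smoke test of the model above (the sizes are made-up, and it assumes one token per step so the squeeze(dim=0) works):

    model = BiLSTM(vocab_size=5000, embedding_dim=128, hidden_size=64, num_layers=1).to(device)

    inp = torch.randint(0, 5000, (32, 1), device=device)   # (batch, seq_len=1) token ids
    tgt = torch.randint(0, 5000, (32, 1), device=device)   # (batch, 1) next-token ids
    out, loss = model(inp, tgt)   # out: per-sample log-prob of its target, loss: mean NLL
    loss.backward()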
The cutoffs were chosen after checking the density of the word counts; it's not just because these values appear in the tutorial.
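As far as I understand it, the layer assumes index 0 is the most frequent word, and the cutoffs split the (frequency-sorted) vocabulary into a dense head plus progressively smaller tail clusters; for a hypothetical 5000-word vocabulary that would look like:

    asm = nn.AdaptiveLogSoftmaxWithLoss(64, 5000, cutoffs=[10, 100, 1000])
    print(asm.head)   # Linear(64 -> 13): the 10 most frequent words plus one entry per tail cluster
    print(asm.tail)   # three shrinking projections covering index ranges 10-99, 100-999, and 1000-4999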
I’m not seeing a significant change in performance. Has anyone had different experiences?