Loss not decreasing for part-of-speech tagger

Hi all!

I’m trying to train a part-of-speech tagger on the Brown corpus from NLTK, using the universal tagset.

Here is the dataset implementation:

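from collections import Counter

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from nltk.corpus import brown
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm

# VOCAB_SIZE, EMB_DIM, HIDDEN_SIZE, TAGSET_SIZE and EPOCHS are constants defined elsewhere.

class BrownDataset(Dataset):
    def __init__(self):
        self.sents = []
        self.tags = []
        for tagged_sent in brown.tagged_sents(tagset="universal"):
            words, tags = zip(*tagged_sent)
            self.sents.append(words)  # without these appends the dataset stays empty
            self.tags.append(tags)
        c = Counter(word.lower() for sent in self.sents for word in sent)
        self.w2i = {"<PAD>": 0, "<UNK>": 1}
        for i, (w, _) in enumerate(c.most_common(VOCAB_SIZE - 2), 2):
            self.w2i[w] = i
        self.i2w = {i: w for w, i in self.w2i.items()}
        self.t2i = {"<PAD>": 0}
        # sorted() keeps the tag-to-index mapping identical between runs
        for i, t in enumerate(sorted({tag for tags in self.tags for tag in tags}), 1):
            self.t2i[t] = i
        self.i2t = {i: t for t, i in self.t2i.items()}

    def __getitem__(self, index):
        words = [self.w2i.get(w.lower(), self.w2i["<UNK>"]) for w in self.sents[index]]
        tags = [self.t2i[t] for t in self.tags[index]]
        return torch.tensor(words), torch.tensor(tags)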

    def __len__(self):
        return len(self.sents)

My model is fairly simple, similar to the model in the PyTorch tutorial:

class PosTagger(nn.Module):
    def __init__(self):
        super(PosTagger, self).__init__()
        self.embedding = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.lstm = nn.LSTM(EMB_DIM, HIDDEN_SIZE, batch_first=True)
        self.fc = nn.Linear(HIDDEN_SIZE, TAGSET_SIZE + 1)  # +1 for PAD

    def forward(self, x):
        x = self.embedding(x)
        x, _ = self.lstm(x)
        return F.log_softmax(self.fc(x), dim=1)
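
One detail in forward that may be worth checking: with batch_first=True, the output of self.fc has shape (batch, seq_len, tags), so dim=1 normalizes over sequence positions rather than over tags. The following is a minimal sketch with made-up sizes (the tensor shape and variable names are illustrative assumptions, not part of the script above) that shows which dimension ends up summing to 1:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for the output of self.fc: (batch, seq_len, tags).
# The sizes 4, 7 and 13 are arbitrary values chosen for illustration.
logits = torch.randn(4, 7, 13)

over_positions = F.log_softmax(logits, dim=1).exp()  # what dim=1 does
over_tags = F.log_softmax(logits, dim=-1).exp()      # normalizes per token

# Sum each variant along the tag axis; a per-token distribution sums to 1 there.
sums_dim1_over_tags = over_positions.sum(dim=-1)
sums_last_over_tags = over_tags.sum(dim=-1)

tag_axis_ok_dim1 = torch.allclose(sums_dim1_over_tags, torch.ones(4, 7))
tag_axis_ok_last = torch.allclose(sums_last_over_tags, torch.ones(4, 7))
print(tag_axis_ok_dim1, tag_axis_ok_last)
```

If only the second check passes, dim=-1 is probably the setting that gives NLLLoss a proper distribution over tags for each token.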

This is the setup code:

def collate_fn(batch):
    sents, tags = list(zip(*batch))
    return pad_sequence(sents, batch_first=True), pad_sequence(tags, batch_first=True)

dataset = BrownDataset()
dataloader = DataLoader(dataset, batch_size=64, shuffle=True, collate_fn=collate_fn)
model = PosTagger()
loss_function = nn.NLLLoss()
optimizer = optim.Adam(model.parameters())
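
A side note on the loss: pad_sequence fills both sentences and tags with 0, and <PAD> has index 0, so nn.NLLLoss() as configured above also scores the padded positions. Here is a hedged sketch, using toy tensors that are assumptions for illustration, of how ignore_index could exclude them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

PAD_IDX = 0  # assumed to match t2i["<PAD>"] in the dataset above

# Toy log-probabilities for 6 flattened tokens over 5 classes (illustrative sizes).
log_probs = F.log_softmax(torch.randn(6, 5), dim=-1)
targets = torch.tensor([3, 1, 4, PAD_IDX, PAD_IDX, PAD_IDX])

loss_all = nn.NLLLoss()(log_probs, targets)                         # pads included
loss_masked = nn.NLLLoss(ignore_index=PAD_IDX)(log_probs, targets)  # pads skipped

# The masked loss equals the mean over the three real tokens only.
manual = -log_probs[torch.arange(3), targets[:3]].mean()
print(loss_all.item(), loss_masked.item(), manual.item())
```

With long padded batches the unmasked loss is dominated by pad positions, which can make the reported value hard to interpret.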

And this is the training loop:

for epoch in tqdm(range(EPOCHS)):
    for sents, targets in dataloader:
        optimizer.zero_grad()  # clear gradients left over from the previous batch
        predictions = model(sents)
        batch_size, seq_len, _ = predictions.shape
        loss = loss_function(predictions.view(batch_size * seq_len, -1), targets.view(-1))
        loss.backward()
        optimizer.step()

The loss per batch is constantly fluctuating between 2 and 4. Am I doing something wrong?


The default learning rate for Adam is 0.001, which might be too high for this case. Try setting it explicitly to a smaller value. This is, of course, barring any issue with your data.
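
A minimal sketch of setting the learning rate explicitly. The placeholder model and the value 1e-4 are assumptions for illustration, not values suggested in this thread:

```python
import torch.nn as nn
import torch.optim as optim

# Placeholder model; in the thread this would be PosTagger().
model = nn.Linear(8, 3)

# 1e-4 is an arbitrary smaller value to try.
optimizer = optim.Adam(model.parameters(), lr=1e-4)

# The learning rate in use can be read back from the parameter groups.
current_lr = optimizer.param_groups[0]["lr"]
print(current_lr)
```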

Changing Adam’s learning rate didn’t do it. I think Adam is too “strong” for this problem. I got better results with SGD(lr=0.1) and a larger batch size, but the model converges very slowly. Maybe it is a problem with the data…

Perhaps try to overfit on a really tiny dataset, just to ensure that everything is working as expected.
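
Something along these lines could serve as the overfitting check. It is only a sketch under stated assumptions: the data is random, TinyTagger is a hypothetical stand-in for PosTagger, and it uses nn.CrossEntropyLoss on raw logits instead of log_softmax plus NLLLoss:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny synthetic stand-in for a handful of tagged sentences (all sizes are made up).
vocab_size, n_tags, emb_dim, hidden = 20, 5, 16, 32
sents = torch.randint(1, vocab_size, (8, 6))  # 8 sentences, 6 tokens each
tags = torch.randint(1, n_tags, (8, 6))       # one tag per token

class TinyTagger(nn.Module):
    """Hypothetical minimal tagger, not the PosTagger from the thread."""

    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_tags)

    def forward(self, x):
        out, _ = self.lstm(self.embedding(x))
        return self.fc(out)  # raw logits, shape (batch, seq_len, n_tags)

model = TinyTagger()
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Train repeatedly on the same small batch and record the loss at each step.
losses = []
for step in range(300):
    optimizer.zero_grad()
    logits = model(sents)
    loss = loss_function(logits.reshape(-1, n_tags), tags.reshape(-1))
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

If the loss does not fall clearly on a batch this small, the problem is more likely in the model or loss wiring than in the optimizer settings.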

I split my dataset into 800 for train and 200 for test, and the best I can get after 300 epochs is 92% accuracy for train and 85% for test.
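
For reference, a split like this can be made reproducible with random_split. The sketch below assumes the 800 and 200 are item counts and uses a stand-in TensorDataset; neither detail is stated in the thread:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset with 1000 items; in the thread this would be a subset of BrownDataset.
data = TensorDataset(torch.arange(1000))

# A fixed generator seed keeps the split identical between runs.
generator = torch.Generator().manual_seed(0)
train_set, test_set = random_split(data, [800, 200], generator=generator)

print(len(train_set), len(test_set))
```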

300 epochs is a lot, but at least the loss is going down and you are getting a good test score. How are you measuring your loss? Did you check whether it decreases gradually? Are you sure it is not bouncing around a minimum for a long time? You would observe this as a periodic increase and decrease in the loss.
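
One way to tell a gradual decrease from bouncing is to smooth the per-batch loss before looking at it. A minimal sketch, with a hypothetical helper moving_average and made-up loss values:

```python
from collections import deque

def moving_average(values, window):
    """Return the running mean of the last `window` values at each step."""
    recent = deque(maxlen=window)
    averaged = []
    for v in values:
        recent.append(v)
        averaged.append(sum(recent) / len(recent))
    return averaged

# Made-up per-batch losses that bounce around while drifting downward.
batch_losses = [4.0, 2.0, 3.8, 1.9, 3.5, 1.7, 3.2, 1.5]
smoothed = moving_average(batch_losses, window=4)
print([round(s, 3) for s in smoothed])
```

A smoothed curve that trends down suggests slow progress; one that stays flat while the raw values swing suggests the optimizer is oscillating.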

I was calculating the loss wrong. Now I’m doing epoch_loss / len(dataloader) and can actually see that the loss is decreasing, albeit slowly. I also added a categorical accuracy function I found somewhere to check my accuracy, and the accuracy is increasing, so I guess everything is OK :) Thanks for all your help!
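
For anyone landing here later, a sketch of what such bookkeeping might look like. The categorical_accuracy below is a hypothetical version that skips padding, not the function referred to above, and all numbers are made up:

```python
import torch

PAD_IDX = 0  # assumed to match t2i["<PAD>"]

def categorical_accuracy(log_probs, targets, pad_idx=PAD_IDX):
    """Fraction of non-padding tokens whose arg-max tag matches the target.

    log_probs: (batch, seq_len, n_tags); targets: (batch, seq_len).
    """
    predicted = log_probs.argmax(dim=-1)
    mask = targets != pad_idx
    correct = (predicted == targets) & mask
    return correct.sum().item() / mask.sum().item()

# Toy batch: 2 sentences, 3 positions, 4 tags; the second sentence ends in a pad.
targets = torch.tensor([[1, 2, 3], [2, 1, PAD_IDX]])
predicted_tags = torch.tensor([[1, 2, 1], [2, 1, 3]])  # one real mistake
scores = torch.full((2, 3, 4), -5.0)
scores.scatter_(2, predicted_tags.unsqueeze(-1), 0.0)  # make those tags the arg-max
accuracy = categorical_accuracy(scores, targets)

# Averaging per-batch losses over the epoch, i.e. epoch_loss / len(dataloader).
batch_losses = [2.0, 1.5, 1.0]
epoch_loss = sum(batch_losses) / len(batch_losses)
print(accuracy, epoch_loss)
```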