Model always predicts sos and eos labels

I am working on an NLP project where I have 7k images, so my dataset is pretty good. However, when I train my CNN-BiLSTM network the error decreases to 0.002, but in testing the model always predicts ['</s></s>'], whereas during training my captions look like

tensor([[   2,    2,    2,    2],
        [ 668,  668,  668,  668],
        [ 293,  293,  293,  293],
        [5627,   76,   30, 5627],
        [  76,    4,  426,  370],
        [   4,    1,  535,    4],
        [   1,    1,    7,    1],
        [   1,    1,  630,    1],
        [   1,    1,   30,    1],
        [   1,    1, 4382,    1],
        [   1,    1,   76,    1],
        [   1,    1,    4,    1],
        [   2,    2,    2,    2]])

where 2 represents both sos and eos
and 1 represents the pad label. I used padding to be able to work with batches.
Here's how the second vector looks during training:
'</s>No car present. <pad><pad><pad><pad><pad><pad><pad></s>'

There are hundreds of things that can be off :). Since you’re not showing any code, just a few comments:

  • Why are <SOS> and <EOS> the same token index? It might not matter in your concrete setting, but generally they serve different purposes.

  • Most of the time <PAD> is represented by 0. While it does not matter in principle, utilities such as pad_sequence and pad_packed_sequence use 0 as their default padding value. In your case, you need to make it explicit that 1 means padding.

  • I assume you use the LSTM to generate sentences. In this case, bidirectionality doesn’t really make sense: the backward direction reads tokens that come after the current position, and those are not available when generating.

  • Do you have any view() or reshape() calls that might mangle your data?
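The point about bidirectionality can be checked on a toy example. The following is a minimal sketch with made-up sizes and random inputs (none of it comes from the code in this thread): it changes only the later time steps of a sequence and compares the outputs for the earlier steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen for illustration only.
seq_len, batch, embed_size, hidden_size = 6, 1, 8, 16
x = torch.randn(seq_len, batch, embed_size)

# Same sequence, but with the last three ("future") steps replaced.
x_changed = x.clone()
x_changed[3:] = torch.randn(3, batch, embed_size)

uni = nn.LSTM(embed_size, hidden_size)
bi = nn.LSTM(embed_size, hidden_size, bidirectional=True)

with torch.no_grad():
    uni_a, _ = uni(x)
    uni_b, _ = uni(x_changed)
    bi_a, _ = bi(x)
    bi_b, _ = bi(x_changed)

# Compare the outputs for steps 0..2, whose inputs are identical in both runs.
uni_same = torch.allclose(uni_a[:3], uni_b[:3])
bi_same = torch.allclose(bi_a[:3], bi_b[:3])
print("unidirectional outputs unchanged:", uni_same)
print("bidirectional outputs unchanged:", bi_same)
```

The unidirectional outputs for the early steps do not depend on later inputs, while the bidirectional ones do. With teacher forcing, that dependence could let a model read the tokens it is supposed to predict, which may be one reason for a very low training loss combined with poor generation.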

Without looking at your code, it appears as though tokens 1 and 2 have a higher frequency than other tokens, and so your model is overfitting to those.

To address this, you can obtain the frequency of each token in your training dataset, then create a weight vector which you can pass into your loss function (assuming you’re using CrossEntropyLoss).

The weight for each token should be the total token count divided by that token's own count:

weights = sum(frequency_of_each_token)/frequency_of_each_token

You can get token frequency with:

all_targets = all_targets.view(-1) #flatten first
frequency_of_each_token = torch.bincount(all_targets)
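
Putting the two steps together, a minimal sketch could look like the following. The toy targets, `vocab_size`, and the clamp for tokens that never occur are assumptions added for illustration. Without some guard, a token with zero count would produce an infinite weight.

```python
import torch
import torch.nn as nn

vocab_size = 8   # toy vocabulary size (assumption for this sketch)
pad_idx = 1      # padding id, as in the question

# Toy targets standing in for all training captions.
all_targets = torch.tensor([[2, 5, 4, 1, 1],
                            [2, 6, 5, 4, 1]])

# Count every token id; minlength keeps one entry per vocabulary item.
counts = torch.bincount(all_targets.view(-1), minlength=vocab_size).float()

# Inverse-frequency weights; clamp avoids division by zero for unseen tokens.
weights = counts.sum() / counts.clamp(min=1)

criterion = nn.CrossEntropyLoss(weight=weights, ignore_index=pad_idx)

# Random logits just to show that the weighted loss can be evaluated.
logits = torch.randn(all_targets.numel(), vocab_size)
loss = criterion(logits, all_targets.view(-1))
print(weights, loss.item())
```

Rare tokens receive larger weights than frequent ones, so errors on them contribute more to the loss.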

import os  # when loading file paths
import pandas as pd  # for lookup in annotation file
import spacy  # for tokenizer
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence  # pad batch
from torch.utils.data import DataLoader, Dataset
from transformers import BioGptTokenizer  # pretrained tokenizer used in MyCollate
from PIL import Image  # Load img
import torchvision.transforms as transforms

class FlickrDataset(Dataset):
    def __init__(self, root_dir, captions_file, transform=None):
        self.root_dir = root_dir
        self.df = pd.read_csv(captions_file)
        self.transform = transform

        # Get img, caption columns
        self.imgs = self.df["filename"] #image
        self.captions = self.df["impression"] #caption
    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        caption = self.captions[index]
        img_id = self.imgs[index]
        img = Image.open(os.path.join(self.root_dir, img_id)).convert("RGB")

        if self.transform is not None:
            img = self.transform(img)
        return img, caption

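def get_loader(root_folder, annotation_file, transform, batch_size=4, num_workers=1, shuffle=True, pin_memory=True):
    dataset = FlickrDataset(root_folder, annotation_file, transform=transform)
    loader = DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        shuffle=shuffle,
        pin_memory=pin_memory,
        collate_fn=MyCollate(),
    )

    return loader, dataset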

class MyCollate:
    def __init__(self):
        self.tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")

    def __call__(self, batch):
        imgs = [item[0].unsqueeze(0) for item in batch]
        imgs = torch.cat(imgs, dim=0)
        targets = [item[1] for item in batch]
            targets = self.tokenizer.batch_encode_plus(targets, padding=True)['input_ids']
            for row in targets:
        except Exception as e :
        return imgs, torch.transpose(torch.tensor(targets), 0, 1)

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers):
        super(DecoderRNN, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.bilstm = nn.LSTM(embed_size, hidden_size, num_layers, bidirectional=True)
        self.linear = nn.Linear(2 * hidden_size, vocab_size)
        self.dropout = nn.Dropout(0.5)

    def forward(self, features, captions): #(25, 16)
        embeddings = self.dropout(self.embed(captions))  #(25, 16, 50427)
        embeddings = torch.cat((features.unsqueeze(0), embeddings), dim=0)
        hiddens, _ = self.bilstm(embeddings)
        outputs = self.linear(hiddens)
        return outputs

class CNNtoRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers): #(16,3,224,224) / (25,16)
        super(CNNtoRNN, self).__init__()
        self.encoderCNN = CheXNet(embed_size)  #(16, 512)
        # argument order matches DecoderRNN.__init__(embed_size, hidden_size, vocab_size, num_layers)
        self.decoderRNN = DecoderRNN(embed_size, hidden_size, vocab_size, num_layers)
        self.tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")

    def forward(self, images, captions):
        features = self.encoderCNN(images)
        outputs = self.decoderRNN(features, captions)
        return outputs

    def caption_image(self, image, max_length=30):
        result_caption = []

        with torch.no_grad():
            x = self.encoderCNN(image).unsqueeze(0)
            states = None
            flag = False
            for _ in range(max_length):
                hiddens, states = self.decoderRNN.bilstm(x, states)
                output = self.decoderRNN.linear(hiddens.squeeze(0))
                predicted = output.argmax(1)
                x = self.decoderRNN.embed(predicted).unsqueeze(0)
                if self.tokenizer.decode(predicted.item()) == "</s>":  #eos: </s>
                    if flag:
                      flag = True

        return [self.tokenizer.decode(result_caption)]

import torch
from tqdm import tqdm
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
from torch.utils.tensorboard import SummaryWriter
from import bleu_score

def train():
    transform = transforms.Compose(
            transforms.Resize((299, 299)),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),

    train_loader, dataset = get_loader(
        batch_size = 8,

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    load_model = False
    save_model = True
    train_CNN = True

    # Hyperparameters
    embed_size = 784
    hidden_size = 784
    vocab_size = 50257
    num_layers = 3
    learning_rate = 0.0001
    num_epochs = 20

    step = 0

    # initialize model, loss etc
    model = CNNtoRNN(embed_size, hidden_size, vocab_size, num_layers).to(device)
    criterion = nn.CrossEntropyLoss(ignore_index= 1) #ignore the index of padding <PAD>
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    if load_model:
        step = load_checkpoint(torch.load("my_checkpoint.pth.tar"), model, optimizer)

    for epoch in range(num_epochs):
        # Uncomment the line below to see a couple of test cases
        # print_examples(model, device, dataset)

        if save_model:
            checkpoint = {
                "state_dict": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step,
        #(imgs, captions) = next(iter(train_loader))
        for idx, (imgs, captions) in tqdm(enumerate(train_loader), total=len(train_loader), leave=True):
            imgs = imgs.to(device)
            captions = captions.to(device)
            outputs = model(imgs, captions[:-1]) #captions[:-1]
            loss = criterion(outputs.reshape(-1, outputs.shape[2]), captions.reshape(-1))
            step += 1
        print('Epoch number {} ----> {}.'.format(epoch,loss))

I am using the BioGPT tokenizer, so it’s prebuilt: <PAD> is 1 and <EOS> is not defined, so I am using <SOS> instead.
Also, there is outputs = model(imgs, captions[:-1]), which means I’m not passing the eos; I’m leaving the model to predict the eos by itself.
Please notice criterion = nn.CrossEntropyLoss(ignore_index=1)  # ignore the index of padding <PAD>, so it’s not an issue.
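
The statement about ignore_index can be verified on a small example. This is a sketch with toy targets and random logits (the values are assumptions, only the padding id 1 comes from the post): the loss with ignore_index=1 should equal the loss computed by hand over the non-pad positions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
pad_idx = 1        # padding id, as in the post
vocab_size = 10    # toy vocabulary size (assumption)

# (seq_len, batch) targets, laid out like the tensor in the question.
targets = torch.tensor([[2, 2],
                        [7, 5],
                        [4, 4],
                        [1, 2],
                        [1, 1]])
logits = torch.randn(targets.shape[0], targets.shape[1], vocab_size)

flat_logits = logits.reshape(-1, vocab_size)
flat_targets = targets.reshape(-1)

# Loss with padding positions excluded by the criterion itself.
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
loss = criterion(flat_logits, flat_targets)

# The same loss computed manually on non-pad positions only.
mask = flat_targets != pad_idx
manual = F.cross_entropy(flat_logits[mask], flat_targets[mask])

print(loss.item(), manual.item())
```

Both values agree, so padding positions do not contribute to the loss. Note that this says nothing about how often the eos id 2 appears in the targets.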

UPDATE: I tried to overfit my model on one sample by running 500 epochs on it and printing the prediction for the same sample each epoch. The first epoch was random, I got “[‘zymosan zymosan hr restitution p50 p50 p50 p50 p50 tubation Đzymosan zymosan unpublished’]”, and after this epoch all predictions were ['</s></s>']. So I think my model is not learning. Does anyone have an idea why?

I can’t see anything obviously wrong. I’m still always a bit skeptical when using reshape() or view() like in

loss = criterion(outputs.reshape(-1, outputs.shape[2]), captions.reshape(-1))

In my code, I use things like

outputs, _ = rnn_lm_model(inputs, hidden)
loss = criterion(outputs.permute(0,2,1), targets)

to get the shapes right. This minimizes the risk of scrambling up the tensors. That being said, your code might very well be correct.
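
To make the concern concrete, here is a small sketch with random tensors (sizes are arbitrary assumptions). It compares the flattened loss, the permute-based loss, and a deliberately mismatched flattening in which the targets are transposed but the outputs are not.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, vocab_size = 5, 3, 11   # toy sizes (assumption)

outputs = torch.randn(seq_len, batch, vocab_size)          # sequence-first logits
targets = torch.randint(0, vocab_size, (seq_len, batch))   # matching layout

criterion = nn.CrossEntropyLoss()

# Flatten both tensors in the same (seq_len, batch) order.
loss_flat = criterion(outputs.reshape(-1, vocab_size), targets.reshape(-1))

# Move the class dimension to position 1: (seq_len, vocab, batch) vs (seq_len, batch).
loss_perm = criterion(outputs.permute(0, 2, 1), targets)

# Mismatched layout: targets flattened batch-first, outputs flattened sequence-first.
loss_scrambled = criterion(outputs.reshape(-1, vocab_size), targets.t().reshape(-1))

print(loss_flat.item(), loss_perm.item(), loss_scrambled.item())
```

The first two losses agree. The third pairs logits with the wrong targets, yet it runs without any error, which is why such bugs are easy to miss.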

Thanks everyone for your suggestions. I found the issues based on your ideas!

  1. I wasn’t loading my CheXNet weights correctly.
  2. I used a BiLSTM instead of an LSTM.
  3. My tokenizer was applying the same token for both sos and eos.
  4. My function for generating predictions had some logic issues.

After fixing these issues, I am now able to overfit my sample!
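
For readers with a similar problem, points 2 to 4 can be sketched roughly as follows. This is not the code used in this thread: `TinyDecoder`, `greedy_caption`, the token ids, and all sizes are hypothetical names and values chosen for illustration, and the decoder is untrained.

```python
import torch
import torch.nn as nn

SOS_IDX, EOS_IDX = 2, 3   # hypothetical, distinct ids chosen for this sketch


class TinyDecoder(nn.Module):
    """Hypothetical stand-in for a caption decoder with a unidirectional LSTM."""

    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers)  # not bidirectional
        self.linear = nn.Linear(hidden_size, vocab_size)


def greedy_caption(decoder, features, max_length=30):
    """Greedy decoding; features is a (1, embed_size) image embedding."""
    result = []
    with torch.no_grad():
        x = features.unsqueeze(0)   # (1, 1, embed_size): first LSTM input
        states = None
        for _ in range(max_length):
            hiddens, states = decoder.lstm(x, states)
            logits = decoder.linear(hiddens.squeeze(0))   # (1, vocab_size)
            predicted = logits.argmax(1)                  # (1,)
            token = predicted.item()
            if token == EOS_IDX:
                break                                     # stop at end of sentence
            if token != SOS_IDX:
                result.append(token)                      # keep only content tokens
            x = decoder.embed(predicted).unsqueeze(0)     # feed prediction back in
    return result


torch.manual_seed(0)
decoder = TinyDecoder(embed_size=16, hidden_size=32, vocab_size=20).eval()
caption_ids = greedy_caption(decoder, torch.randn(1, 16), max_length=10)
print(caption_ids)
```

With separate sos and eos ids, the stopping condition is unambiguous, and the loop collects each predicted token before feeding it back as the next input.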