Trained Seq2Seq model outputs only EOS tokens during testing

Hi everyone, I am currently working on a Seq2Seq model for a Question-Answering task. After training, I tested the model using one of the training questions, but to my surprise the model returned only EOS tokens :slight_smile:

You can check the full notebook on my GitHub here

Simply put, here is how I preprocessed and trained the model:

1. First I loaded the training and validation datasets from torchdata's SQuAD2 dataset, then shortened the training data to 30000 QA pairs and the validation data to 1000 QA pairs (roughly as in the sketch below).
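
A minimal loading sketch, assuming the SQuAD2 datapipe exposed through torchtext (which builds on torchdata) and that each sample is a (context, question, answers, answer_starts) tuple; the notebook's actual loading code may differ:

from torchtext.datasets import SQuAD2

# Each sample is (context, question, answers, answer_starts)
train_iter, valid_iter = SQuAD2(split=("train", "dev"))

# Keep (question, first answer) pairs, truncated to the sizes above
train_pairs = [(q, a[0]) for _, q, a, _ in train_iter if a][:30000]
valid_pairs = [(q, a[0]) for _, q, a, _ in valid_iter if a][:1000]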
2. For each sequence (questions and answers), I applied text preprocessing and tokenization:

import re
import unicodedata
import spacy

spacy_en = spacy.load("en_core_web_sm")  # loaded once, earlier in the notebook

def prepare_text(sentence):
    # Lowercase, strip, and remove accents (drop combining marks after NFD normalization)
    sentence = ''.join(
        char for char in unicodedata.normalize('NFD', sentence.lower().strip())
        if unicodedata.category(char) != 'Mn'
    )
    # Collapse repeated whitespace, then tokenize with spaCy
    sentence = re.sub(r"\s+", " ", sentence).strip()
    tokens = [token.text for token in spacy_en.tokenizer(sentence)]

    return tokens

3. Created the source and target vocabularies, each built from both the training and validation datasets. For both vocabularies I defined the SOS token as 0, the EOS token as 1, UNK as 2, and PAD as 3; the indices after that are assigned to the words in the dataset (a minimal sketch of this vocabulary interface follows the counts below). The resulting vocabularies are:

  • Total vocabulary for source: 20357 words (tokens)
  • Total vocabulary for target: 19103 words (tokens)
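
The sketch below shows the vocabulary interface my later code relies on (token_to_index, index_to_token, get_vocab_length); the method names match my calls further down, but the actual class in the notebook may be implemented differently:

class Vocab:
    def __init__(self, tokenized_texts):
        # Special tokens first: <SOS>=0, <EOS>=1, <UNK>=2, <PAD>=3
        self.itos = ["<SOS>", "<EOS>", "<UNK>", "<PAD>"]
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}
        for tokens in tokenized_texts:
            for tok in tokens:
                if tok not in self.stoi:
                    self.stoi[tok] = len(self.itos)
                    self.itos.append(tok)

    def token_to_index(self, token):
        # Unknown words fall back to <UNK>
        return self.stoi.get(token, self.stoi["<UNK>"])

    def index_to_token(self, index):
        return self.itos[index]

    def get_vocab_length(self):
        return len(self.itos)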

4. Max-padded each sequence to create fixed-length inputs:
I found that the max source (question) sequence length was 60 tokens and the max target (answer) sequence length was 46 tokens, so I padded all sequences in the training and validation datasets to their respective max lengths using the PAD (3) token defined in the vocabulary above (see the snippet below).
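
A minimal padding sketch; pad_to_max is a hypothetical helper name, and question / max_src_len are assumed from the earlier steps:

def pad_to_max(token_ids, max_len, pad_index=3):
    # Right-pad with <PAD> (index 3) so every sequence has the same length
    return token_ids + [pad_index] * (max_len - len(token_ids))

src_ids = [source_vocab.token_to_index(t) for t in prepare_text(question)]
src_ids = pad_to_max(src_ids, max_src_len)  # max_src_len == 60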
5. Created the Dataset using the Dataset class from torch.utils.data, along the lines of the sketch below.
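
A minimal sketch of such a dataset over the padded index lists from step 4 (QADataset is a hypothetical name):

import torch
from torch.utils.data import Dataset

class QADataset(Dataset):
    def __init__(self, src_sequences, trg_sequences):
        # Both arguments are lists of padded token-index lists
        self.src = src_sequences
        self.trg = trg_sequences

    def __len__(self):
        return len(self.src)

    def __getitem__(self, idx):
        return torch.tensor(self.src[idx]), torch.tensor(self.trg[idx])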
6. Defined the Seq2Seq model:
I defined the LSTM-based Encoder-Decoder as below (module definitions consistent with this printout are sketched after it):

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(20357, 512)
    (lstm): LSTM(512, 1024, batch_first=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(19103, 512)
    (lstm): LSTM(512, 1024, batch_first=True)
    (fc): Linear(in_features=1024, out_features=19103, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)
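
For completeness, here is a reconstruction of Encoder/Decoder/Seq2Seq modules that would produce the printout above, written under standard Seq2Seq conventions (the encoder's final states seed the decoder, <SOS> index 0, optional teacher forcing); it matches the forward signature I call later, but it is not necessarily the exact notebook code:

import random

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=20357, emb_dim=512, hid_dim=1024, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        embedded = self.dropout(self.embedding(src))
        _, (hidden, cell) = self.lstm(embedded)
        return hidden, cell  # final states seed the decoder

class Decoder(nn.Module):
    def __init__(self, vocab_size=19103, emb_dim=512, hid_dim=1024, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.fc = nn.Linear(hid_dim, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, token, hidden, cell):
        # token: (batch, 1) -> logits over the target vocabulary
        embedded = self.dropout(self.embedding(token))
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        return self.fc(output.squeeze(1)), hidden, cell

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, sos_index=0):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.sos_index = sos_index

    def forward(self, src, trg, batch_size, max_trg_len, teacher_forcing_ratio=0.5):
        hidden, cell = self.encoder(src)
        token = torch.full((batch_size, 1), self.sos_index, device=src.device)
        outputs = []
        for t in range(max_trg_len):
            logits, hidden, cell = self.decoder(token, hidden, cell)
            outputs.append(logits)
            # Feed the gold token with probability teacher_forcing_ratio, else the prediction
            use_teacher = trg is not None and random.random() < teacher_forcing_ratio
            token = trg[:, t:t + 1] if use_teacher else logits.argmax(1, keepdim=True)
        return torch.stack(outputs, dim=1)  # (batch, max_trg_len, vocab_size)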

7. Training:
First I initialized the weights, using:

# Initialize all model weights uniformly in [-0.08, 0.08]
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

s2s_model.apply(init_weights)

I chose 50 epochs, a batch size of 128, and a learning rate of 0.01; the optimizer is Adam with a weight decay of 0.1. I also used CrossEntropyLoss with the PAD token as ignore_index (a sketch of one training epoch follows the criterion line):

criterion = nn.CrossEntropyLoss(ignore_index=target_vocab.token_to_index("<PAD>"))
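
A minimal sketch of one training epoch under those settings; train_loader is assumed to be a DataLoader over the QADataset above, and the teacher_forcing_ratio value is my assumption:

optimizer = torch.optim.Adam(s2s_model.parameters(), lr=0.01, weight_decay=0.1)

for src, trg in train_loader:
    src, trg = src.to(device), trg.to(device)
    optimizer.zero_grad()
    # output: (batch, max_trg_len, target_vocab_size)
    output = s2s_model(src, trg, src.size(0), max_trg_len, teacher_forcing_ratio=0.5)
    loss = criterion(output.reshape(-1, output.size(-1)), trg.reshape(-1))
    loss.backward()
    optimizer.step()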
Here is the training result:

5/50 Epoch  -  Training Loss = 1.4962  -  Validation Loss = 1.7524
10/50 Epoch  -  Training Loss = 1.5117  -  Validation Loss = 1.3630
15/50 Epoch  -  Training Loss = 1.5072  -  Validation Loss = 1.3645
20/50 Epoch  -  Training Loss = 1.4991  -  Validation Loss = 1.3679
25/50 Epoch  -  Training Loss = 1.5165  -  Validation Loss = 1.3641
30/50 Epoch  -  Training Loss = 1.5202  -  Validation Loss = 1.3675
35/50 Epoch  -  Training Loss = 1.5124  -  Validation Loss = 1.5445
40/50 Epoch  -  Training Loss = 1.5130  -  Validation Loss = 1.3653
45/50 Epoch  -  Training Loss = 1.5046  -  Validation Loss = 1.3663
50/50 Epoch  -  Training Loss = 1.5140  -  Validation Loss = 1.9076

8. Saved the model and tested:
I saved the model using:

model_save_name = 'Seq2Seq.pt'
torch.save(s2s_model, model_save_name)

Then I loaded the model and created a simple input-sequence handler:

# Load the model
s2s_model = torch.load(model_save_name, map_location=torch.device("cuda"))
s2s_model.eval()

# User input handler
def eveluate(src, source_vocab, target_vocab, model, max_src_len, max_trg_len):
    token_src = []
    answer = []

    # Tokenize the question and map tokens to source-vocabulary indices
    for token in prepare_text(src):
        token_src.append(source_vocab.token_to_index(token))

    # Right-pad to the fixed source length
    if len(token_src) < max_src_len:
        token_src = token_src + [source_vocab.token_to_index("<PAD>")] * (max_src_len - len(token_src))

    tensor_src = torch.tensor(token_src).unsqueeze(0)  # add batch dimension

    # Greedy decoding: no target, batch size 1, no teacher forcing
    with torch.no_grad():
        output = model(tensor_src.to(device), None, 1, max_trg_len, teacher_forcing_ratio=0)
    output = output.view(-1, target_vocab.get_vocab_length()).argmax(1)

    # output = output.remove(source_vocab.index_to_token("<EOS>"))

    for token in output:
        answer.append(target_vocab.index_to_token(token.item()))

    print(f">> Question: {token_src}")
    print(f"<< Answer: {output}")
    print(" ".join(answer))

When I tested the model with the same input as in the training dataset, it predicted only EOS tokens:

eveluate(
    src="In what R&B group was she the lead singer?",
    source_vocab=source_vocab,
    target_vocab=target_vocab,
    model=s2s_model,
    max_src_len=max_src_len,
    max_trg_len=max_trg_len
)

>> Question: [14, 11, 34, 35, 16, 15, 36, 37, 27, 10, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
<< Answer: tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       device='cuda:0')
<EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS> <EOS>

Did I do anything wrong in the preprocessing or training steps? I learned how to code the Seq2Seq model from many sources, so I am afraid I did something wrong here.

Any help is appreciated.

Thanks