My multi-dimensional Transformer doesn't seem to learn anything

shamoons · November 17, 2020, 2:00pm

I’m trying to do seq2seq with a Transformer model. My input and output have the same shape, torch.Size([499, 128]), where 499 is the sequence length and 128 is the number of features.

My input looks like:

My output looks like:

My training loop is:

    for batch in tqdm(dataset):
        optimizer.zero_grad()
        x, y = batch

        x = x.to(DEVICE)
        y = y.to(DEVICE)

        pred = model(x, torch.zeros_like(x))

        loss = loss_fn(pred, y)
        loss.backward()
        optimizer.step()

My model is:

import math
import torch
import torch.nn as nn

class Reconstructor(nn.Module):
    def __init__(self, input_dim, output_dim, dim_embedding, num_layers=4, nhead=8, dim_feedforward=2048, dropout=0.5):
        super(Reconstructor, self).__init__()

        self.model_type = 'Transformer'
        self.src_mask = None
        self.pos_encoder = PositionalEncoding(d_model=dim_embedding, dropout=dropout)
        self.transformer = nn.Transformer(d_model=dim_embedding, nhead=nhead, dim_feedforward=dim_feedforward, num_encoder_layers=num_layers, num_decoder_layers=num_layers)
        self.decoder = nn.Linear(dim_embedding, output_dim)
        self.decoder_act_fn = nn.PReLU()

        self.init_weights()

    def init_weights(self):
        initrange = 0.1
        # Zero the bias, then draw the weights uniformly from [-initrange, initrange]
        nn.init.zeros_(self.decoder.bias)
        nn.init.uniform_(self.decoder.weight, -initrange, initrange)

    def forward(self, src, tgt):
        # src: (batch, seq, features); features must equal dim_embedding (no input projection)
        # tgt is accepted but never used: only the encoder stack runs
        pe_src = self.pos_encoder(src.permute(1, 0, 2))  # (seq, batch, features)
        transformer_output = self.transformer.encoder(pe_src)
        decoder_output = self.decoder(transformer_output.permute(1, 0, 2)).squeeze(2)
        decoder_output = self.decoder_act_fn(decoder_output)
        return decoder_output
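
The model above relies on a PositionalEncoding class that is not shown in the post. A minimal sketch of what it might look like, assuming the sinusoidal version from the PyTorch language-modeling tutorial; the max_len default and the requirement that d_model is even are assumptions, not details from the post:

```python
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding for (seq_len, batch, d_model) inputs.

    Assumed implementation: d_model must be even and seq_len <= max_len.
    """

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )  # (d_model / 2,)

        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # even feature indices
        pe[:, 1::2] = torch.cos(position * div_term)  # odd feature indices

        # Buffer: moves with .to(device) but is not a trainable parameter
        self.register_buffer("pe", pe.unsqueeze(1))  # (max_len, 1, d_model)

    def forward(self, x):
        # x: (seq_len, batch, d_model); the encoding broadcasts over the batch
        x = x + self.pe[: x.size(0)]
        return self.dropout(x)


# Shapes taken from the post: sequence length 499, batch 32, 128 features
encoder = PositionalEncoding(d_model=128, dropout=0.0)
encoded = encoder(torch.zeros(499, 32, 128))
print(encoded.shape)
```

With dropout=0.5 as in the constructor above, half of the encoded features are zeroed on every training step, so it may be worth comparing against a smaller dropout value.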

My output has a shape of torch.Size([32, 499, 128]) where 32 is batch, 499 is my sequence length and 128 is the number of features. But the output has the same values:

tensor([[[0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017],
         [0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017],
         [0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017],
         ...,
         [0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017],
         [0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017],
         [0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017]]],
       grad_fn=<PreluBackward>)

What am I doing wrong? Thank you so much for any help.

utkuumetin · November 18, 2020, 7:24am

Hi, I’m facing a similar issue. My model gives equal probability to every output.

shamoons · November 18, 2020, 3:40pm

Can you share your model? Perhaps we can try to solve this issue together?

utkuumetin · November 19, 2020, 5:52am

Hello again. I’ve added my model below. When I decreased the learning rate it started giving correct outputs, but now a new problem has arisen: whatever input I give, it predicts the same value. Example:

Input:
[90, 91, 26, 62, 92, 93, 26, 94, 95, 96]
incumbering soil and washed into immediate and glittering popularity possibly
Masked Input:
[90, 91, 26, 62, 92, 93, 26, 1, 95, 96]
incumbering soil and washed into immediate and unnk popularity possibly
Output:
[90, 91, 26, 62, 92, 93, 26, 33, 95, 96]
incumbering soil and washed into immediate and the popularity possibly

As you can see, it always predicts the “the” token.

Model:

import torch
import torch.nn as nn

class Kemal(nn.Module):
    def __init__(self, src_vocab_size, embedding_size, num_heads, dim_forward, num_encoder_layers, max_len, src_pad_idx, dropout, device):
        super(Kemal, self).__init__()
        
        self.src_word_embedding = nn.Embedding(src_vocab_size, embedding_size)
        self.src_position_embedding = nn.Embedding(max_len, embedding_size)
        
        self.device = device
        
        self.encoder_norm = nn.LayerNorm(embedding_size) 
 
        self.encoder_layer = nn.TransformerEncoderLayer(embedding_size, num_heads, dim_feedforward=dim_forward, dropout=dropout, activation='gelu')
        self.encoder = nn.TransformerEncoder(self.encoder_layer, num_encoder_layers, self.encoder_norm)
        
        self.fc = nn.Linear(embedding_size, src_vocab_size)
        
        self.src_pad_idx = src_pad_idx
        
    def make_src_pad_mask(self, src):
        # src: (src_len, N) -> mask: (N, src_len), True where the token is padding
        src_mask = src.transpose(0, 1) == self.src_pad_idx
        return src_mask
        
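    def forward(self, src):
        # src: (src_len, N) token indices
        src_seq_length, N = src.shape

        # Causal mask: -inf above the diagonal, 0 elsewhere
        # (same values as nn.Transformer.generate_square_subsequent_mask)
        src_mask = torch.triu(
            torch.full((src_seq_length, src_seq_length), float('-inf'), device=self.device),
            diagonal=1,
        )

        # (src_len, 1), broadcast over the batch dimension
        src_positions = torch.arange(src_seq_length, device=self.device).unsqueeze(1)

        embed_src = self.src_word_embedding(src) + self.src_position_embedding(src_positions)
        src_padding_mask = self.make_src_pad_mask(src)  # (N, src_len)
        out = self.encoder(embed_src, mask=src_mask, src_key_padding_mask=src_padding_mask)
        out = self.fc(out)  # (src_len, N, src_vocab_size)

        return out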

I train it with CrossEntropyLoss.

shamoons · November 19, 2020, 8:04pm

What is the purpose of make_src_pad_mask?

utkuumetin · November 20, 2020, 5:50am

key_padding_mask – if provided, specified padding elements in the key will be ignored by the attention. When given a binary mask and a value is True, the corresponding value on the attention layer will be ignored. When given a byte mask and a value is non-zero, the corresponding value on the attention layer will be ignored

from https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html
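
To make the quoted behaviour concrete, here is a small sketch with nn.MultiheadAttention. The tensor sizes and the mask are made up for illustration; the point is only that key positions marked True receive zero attention weight:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, batch, embed_dim = 4, 2, 8
attn = nn.MultiheadAttention(embed_dim, num_heads=2)

x = torch.randn(seq_len, batch, embed_dim)  # (seq_len, N, E)

# Shape (N, seq_len); True marks a padding position that should be ignored.
# Sample 0 has one padded token at the end, sample 1 has two.
key_padding_mask = torch.tensor(
    [
        [False, False, False, True],
        [False, False, True, True],
    ]
)

out, weights = attn(x, x, x, key_padding_mask=key_padding_mask)

# weights: (N, query_len, key_len), averaged over the heads
print(weights[0])  # last column belongs to the padded key of sample 0
print(weights[1])  # last two columns belong to the padded keys of sample 1
```

Note that the mask is (N, seq_len) even though the input is (seq_len, N, E), which is why make_src_pad_mask transposes src before comparing it with the padding index.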

shamoons · November 20, 2020, 1:55pm

I understand that - but why are you using it?

utkuumetin · November 20, 2020, 2:04pm

I’m not using it now. I’m new to PyTorch, so I followed tutorials, and the pad mask is left over from those examples; it isn’t in my current version.

Do you have any idea about my problem?

shamoons · November 20, 2020, 2:24pm

How about we do a 30 min Zoom to hash it out?

utkuumetin · November 21, 2020, 12:27pm

Sorry for the late update. I solved my problem.

I was passing the input to the model in (N, seq_len) shape, but it needs to be (seq_len, N).
Because of that, I was generating the wrong src_mask and positional embeddings the whole time.
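
A rough sketch of why the layout matters, using made-up sizes and stand-alone embedding layers rather than the actual Kemal model. The forward pass reads the sequence length from the first dimension, so a batch-first tensor silently produces a mask and position indices sized by the batch instead:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration
N, seq_len, vocab_size, emb_size, max_len = 3, 10, 100, 16, 50

word_emb = nn.Embedding(vocab_size, emb_size)
pos_emb = nn.Embedding(max_len, emb_size)

batch_first = torch.randint(0, vocab_size, (N, seq_len))  # (N, seq_len)

# Unpacking the batch-first tensor the way forward() does treats the
# batch size as the sequence length
wrong_len, _ = batch_first.shape
print(wrong_len)  # the batch size N, not the sequence length

# Transposing first gives the (seq_len, N) layout the encoder expects
src = batch_first.t()
cur_len, cur_batch = src.shape

positions = torch.arange(cur_len).unsqueeze(1)  # (seq_len, 1)
embedded = word_emb(src) + pos_emb(positions)   # broadcasts to (seq_len, N, emb_size)
print(embedded.shape)
```

Nothing raises an error in the batch-first case as long as the batch size stays below max_len, which may be why this kind of bug is easy to miss.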
And the problem with predicting the wrong tokens is because of word frequencies.
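
One way to check the word-frequency explanation is to count how often each token id appears in the training data. The helper name and the toy sequences below are hypothetical; id 33 is used for “the” only because that is its id in the example earlier in the thread:

```python
from collections import Counter


def token_frequencies(sequences):
    """Return {token_id: share of all tokens}, most frequent first."""
    counts = Counter(tok for seq in sequences for tok in seq)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.most_common()}


# Toy data, not from the thread's real corpus
toy_sequences = [
    [33, 90, 33, 26],
    [33, 62, 33, 95],
]

freqs = token_frequencies(toy_sequences)
most_common_id, share = next(iter(freqs.items()))
print(most_common_id, share)
```

If a single token accounts for a large share of the targets, a model that always predicts it can already reach a fairly low cross-entropy, which would be consistent with the behaviour described above.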