My multi-dimensional Transformer doesn't seem to learn anything

I’m trying to do seq2seq with a Transformer model. My input and output have the same shape, torch.Size([499, 128]), where 499 is the sequence length and 128 is the number of features.

My input looks like:

My output looks like:

My training loop is:

    for batch in tqdm(dataset):
        x, y = batch

        x =
        y =

        pred = model(x, torch.zeros(x.size()).to(DEVICE))

        loss = loss_fn(pred, y)

My model is:

import math
from typing import final
import torch
import torch.nn as nn

class Reconstructor(nn.Module):
    def __init__(self, input_dim, output_dim, dim_embedding, num_layers=4, nhead=8, dim_feedforward=2048, dropout=0.5):
        super(Reconstructor, self).__init__()

        self.model_type = 'Transformer'
        self.src_mask = None
        self.pos_encoder = PositionalEncoding(d_model=dim_embedding, dropout=dropout)
        self.transformer = nn.Transformer(d_model=dim_embedding, nhead=nhead, dim_feedforward=dim_feedforward, num_encoder_layers=num_layers, num_decoder_layers=num_layers)
        self.decoder = nn.Linear(dim_embedding, output_dim)
        self.decoder_act_fn = nn.PReLU()


    def init_weights(self):
        initrange = 0.1
        nn.init.uniform_(self.decoder.weight, -initrange, initrange)

    def forward(self, src, tgt):

        pe_src = self.pos_encoder(src.permute(1, 0, 2))  # (seq, batch, features)
        transformer_output = self.transformer.encoder(pe_src)  # encoder stack of nn.Transformer; tgt is not used
        decoder_output = self.decoder(transformer_output.permute(1, 0, 2)).squeeze(2)
        decoder_output = self.decoder_act_fn(decoder_output)
        return decoder_output
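
The PositionalEncoding class used above is not shown in the post. A minimal sketch of what it might look like, assuming the usual sinusoidal encoding for (seq, batch, features) inputs; the class body, max_len, and the dropout default are assumptions, not the poster's code:

```python
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding for (seq_len, batch, d_model) inputs.

    Assumes d_model is even (128 in the post satisfies this).
    """

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )  # (d_model / 2,)

        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)  # even feature indices
        pe[:, 0, 1::2] = torch.cos(position * div_term)  # odd feature indices

        # Buffer: moves with .to(device) but is not a trainable parameter.
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (seq_len, batch, d_model); the encoding broadcasts over the batch.
        x = x + self.pe[: x.size(0)]
        return self.dropout(x)
```

Note that with dropout=0.5, as in the constructor call above, half of the position-encoded features are zeroed on every training step.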

My output has a shape of torch.Size([32, 499, 128]), where 32 is the batch size, 499 is my sequence length and 128 is the number of features. But every position in the sequence has the same values:

tensor([[[0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017],
         [0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017],
         [0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017],
         [0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017],
         [0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017],
         [0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017]]],

What am I doing wrong? Thank you so much for any help.

Hi, I’m facing a similar issue. Mine gives me an equal probability for every output.

Can you share your model? Perhaps we can try to solve this issue together?

Hello again, I’ve added my model below. When I decreased the learning rate it started giving the right outputs, but now a new problem has arisen: whatever input I give, it predicts the same value. Example:

[90, 91, 26, 62, 92, 93, 26, 94, 95, 96]
incumbering soil and washed into immediate and glittering popularity possibly
Masked Input:
[90, 91, 26, 62, 92, 93, 26, 1, 95, 96]
incumbering soil and washed into immediate and unnk popularity possibly
[90, 91, 26, 62, 92, 93, 26, 33, 95, 96]
incumbering soil and washed into immediate and the popularity possibly

As you can see, it always predicts the “the” token.


class Kemal(nn.Module):
    def __init__(self, src_vocab_size, embedding_size, num_heads, dim_forward, num_encoder_layers, max_len, src_pad_idx, dropout, device):
        super(Kemal, self).__init__()
        self.src_word_embedding = nn.Embedding(src_vocab_size, embedding_size)
        self.src_position_embedding = nn.Embedding(max_len, embedding_size)
        self.device = device
        self.encoder_norm = nn.LayerNorm(embedding_size) 
        self.encoder_layer = nn.TransformerEncoderLayer(embedding_size, num_heads, dim_feedforward=dim_forward, dropout=dropout, activation='gelu')
        self.encoder = nn.TransformerEncoder(self.encoder_layer, num_encoder_layers, self.encoder_norm)
        self.fc = nn.Linear(embedding_size, src_vocab_size)
        self.src_pad_idx = src_pad_idx
    def make_src_pad_mask(self, src):
        src_mask = src.transpose(0, 1) == self.src_pad_idx
        return src_mask
        # (N, src_len)
    def forward(self, src):
        src_seq_lenght, N = src.shape
        src_mask = nn.Transformer.generate_square_subsequent_mask(None, src_seq_lenght).to(self.device)
        src_positions = (
            torch.arange(0, src_seq_lenght).unsqueeze(1).to(self.device)
        )
        embed_src = (self.src_word_embedding(src) + self.src_position_embedding(src_positions))
        src_padding_mask = self.make_src_pad_mask(src)
        out = self.encoder(embed_src, mask=src_mask, src_key_padding_mask=src_padding_mask)
        out = self.fc(out)
        return out

The loss is CrossEntropyLoss.

What is the purpose of the make_src_pad_mask?

key_padding_mask – if provided, specified padding elements in the key will be ignored by the attention. When given a binary mask and a value is True, the corresponding value on the attention layer will be ignored. When given a byte mask and a value is non-zero, the corresponding value on the attention layer will be ignored
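
To make the quoted behaviour concrete, here is a small sketch, assuming a hypothetical padding index of 0 and tiny made-up sizes. It builds the mask the same way make_src_pad_mask does and checks that changing what sits at padded positions does not affect the outputs at the real positions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

PAD_IDX = 0   # hypothetical padding index
D_MODEL = 16  # hypothetical embedding size

# Token ids in (seq_len, N) layout; the second sequence ends in two pad tokens.
src = torch.tensor([[4, 7],
                    [5, 8],
                    [6, 9],
                    [3, PAD_IDX],
                    [2, PAD_IDX]])

# The mask is (N, seq_len), with True marking key positions to ignore.
src_key_padding_mask = src.transpose(0, 1) == PAD_IDX

embedding = nn.Embedding(10, D_MODEL)
layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, dim_feedforward=32, dropout=0.0)
encoder = nn.TransformerEncoder(layer, num_layers=1, enable_nested_tensor=False)

with torch.no_grad():
    x = embedding(src)  # (seq_len, N, D_MODEL)
    x_changed = x.clone()
    x_changed[3:, 1] = torch.randn(2, D_MODEL)  # overwrite only the padded positions

    out = encoder(x, src_key_padding_mask=src_key_padding_mask)
    out_changed = encoder(x_changed, src_key_padding_mask=src_key_padding_mask)

# Compare the non-padded positions of the second sequence across both runs.
print(torch.allclose(out[:3, 1], out_changed[:3, 1], atol=1e-6))
```

The outputs at the padded positions themselves still differ between the two runs; the mask only stops other positions from attending to them.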


I understand that - but why are you using it?

I’m not using it now :smiley:. I’m new to PyTorch, so I followed tutorials, and the pad mask is left over from those examples; it’s not in my current version.

Do you have any idea about my problem?

How about we do a 30 min Zoom to hash it out?

Sorry for the late update, I solved my problem.

I was giving the input to the model in (N, seq_len) shape, but I needed to give it as (seq_len, N). :smiley:
So all along I was generating the wrong src_mask and positional embeddings.
And the problem with predicting the wrong tokens is because of word frequencies.
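
For anyone hitting the same thing, the shape mix-up can be sketched like this. The sizes and variable names are made up for illustration, and the causal mask is built by hand with torch.triu rather than through the model above:

```python
import torch
import torch.nn as nn

N, SEQ_LEN, VOCAB_SIZE, D_MODEL = 32, 10, 100, 16  # hypothetical sizes

# A batch as a DataLoader typically yields it: (N, seq_len).
batch = torch.randint(0, VOCAB_SIZE, (N, SEQ_LEN))

# Modules built without batch_first=True expect (seq_len, N), so transpose first.
src = batch.transpose(0, 1)
src_seq_len, n = src.shape

# The mask size comes from the first dimension. Without the transpose this
# would be (N, N) instead of (seq_len, seq_len).
src_mask = torch.triu(
    torch.full((src_seq_len, src_seq_len), float("-inf")), diagonal=1
)

# Positions are derived from the first dimension too, so they are also wrong
# when the batch dimension comes first.
positions = torch.arange(src_seq_len).unsqueeze(1)  # (seq_len, 1)

word_embedding = nn.Embedding(VOCAB_SIZE, D_MODEL)
position_embedding = nn.Embedding(SEQ_LEN, D_MODEL)
embed_src = word_embedding(src) + position_embedding(positions)

print(src.shape, src_mask.shape, embed_src.shape)
```

An alternative is to construct the encoder layers with batch_first=True and keep the (N, seq_len) layout throughout, in which case the sequence length has to be read from the second dimension instead.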