Feedback on my Transformer for LM

Hi everyone!

I have a project where I have to build, train and analyze a simple transformer for language modeling.

I built my model following this tutorial from the PyTorch website:
https://pytorch.org/tutorials/beginner/transformer_tutorial.html

Here is my jupyter notebook: https://github.com/andcarnivorous/TestingTransformers/blob/master/transformertest.ipynb

I have tried to implement the input and target masks, and then trained different versions on the WikiText dataset with more/fewer encoder layers, decoder layers, and attention heads. What I have noticed is that even when I use only two heads and two encoder/decoder layers, the model manages to reach quite a low validation loss.
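For context, this is roughly the kind of setup I mean (just a sketch with placeholder sizes; the real code is in the notebook):

import torch.nn as nn

class TransformerLM(nn.Module):
    def __init__(self, vocab_size, d_model=200, nhead=2, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # nn.Transformer bundles the encoder and decoder stacks
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers)
        self.out = nn.Linear(d_model, vocab_size)  # project back to the vocabulary

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        # positional encoding omitted here for brevity
        h = self.transformer(self.embed(src), self.embed(tgt),
                             src_mask=src_mask, tgt_mask=tgt_mask)
        return self.out(h)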

Any feedback on how I set this up would be really appreciated.

You need an attention mask.

Thanks! That would be the memory_mask variable to pass to the transformer block, right? I guess this is why the model was reaching such a low loss!

Actually src_mask. Without a mask, the model will be able to see the very tokens it is supposed to predict.

Oh but I am already applying a mask in the forward pass:

x = self.transformer(x, tgt,
                     tgt_mask=generate_square_subsequent_mask(x.shape[0]),
                     src_mask=generate_square_subsequent_mask(x.shape[0]))

From my understanding, that generate_square_subsequent_mask() function hides all the tokens that come after each position, not just the next one.
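For reference, this is the helper from the tutorial; printing a small one shows what it hides:

import torch

def generate_square_subsequent_mask(sz):
    # -inf above the diagonal: position i may only attend to positions <= i
    # (the -inf entries are added to the attention scores before the softmax)
    mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask

print(generate_square_subsequent_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])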

You can use your model to generate some text and see whether the generated tokens are the same as the targets. If they are, it means the src/tgt are not masked properly.
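Something along these lines (a rough sketch; adjust to your model's actual forward signature):

import torch

@torch.no_grad()
def greedy_generate(model, src, start_token, steps=20):
    # Grow the target one token at a time, always taking the most
    # likely next token (greedy decoding).
    tgt = torch.full((1, src.shape[1]), start_token, dtype=torch.long, device=src.device)
    for _ in range(steps):
        out = model(src, tgt)               # expected shape: (tgt_len, batch, vocab)
        next_tok = out[-1].argmax(dim=-1)   # most likely token at the last position
        tgt = torch.cat([tgt, next_tok.unsqueeze(0)], dim=0)
    return tgt

If the generated continuation just copies the target tokens, the masking is leaking.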

Yeah, I already tested that: it produces gibberish even on training data, where the loss is low, so I think the mask is working.

If it returns some random tokens, it means the mask is working.

Could you just use TransformerEncoder for a test? An encoder/decoder is not necessary for a word-level language modeling task, IMO. And we have a tutorial on our website that sets up a TransformerEncoder for word-level LM: https://pytorch.org/tutorials/beginner/transformer_tutorial.html
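Roughly something like this (a sketch; the tutorial has the full word-level LM version):

import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=200, nhead=2)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
# pipeline: embedding -> positional encoding -> encoder (with src_mask) -> linear layer to vocab size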

That is the tutorial I am currently following! 🙂 Because of the project requirements, I have to use the decoder part as well, since I will probably also have to train it on translation tasks.
What I would like to do now is plot heatmaps of the attention activations, like this: [attention heatmap image]

I am accessing the self-attention (multi-head) module like so:

attn_activation = model.transformer.encoder.layers[1].self_attn.out_proj(emb)
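I guess the alternative would be to call the attention module directly and ask for the weights; a rough, untested sketch (assuming emb is that layer's input, with shape (seq_len, batch, d_model)):

layer = model.transformer.encoder.layers[1]
# need_weights=True makes MultiheadAttention also return the attention
# weights, averaged over the heads, with shape (batch, seq_len, seq_len)
attn_out, attn_weights = layer.self_attn(emb, emb, emb, need_weights=True)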

But I am a little lost on how that heatmap is mapped… what is the value of every point (i, j) in the map?