Hi everyone!
I have a project where I have to build, train and analyze a simple transformer for language modeling.
I built my model following the official PyTorch transformer tutorial:
https://pytorch.org/tutorials/beginner/transformer_tutorial.html
Here is my jupyter notebook: https://github.com/andcarnivorous/TestingTransformers/blob/master/transformertest.ipynb
I have tried to implement the input and target masks, and then trained different versions on the WikiText dataset with more/fewer encoder layers, decoder layers and heads. What I have noticed is that even with only two heads and two encoder/decoder layers the model manages to reach quite a low validation loss.
Any feedback on how I set this up would be really appreciated.
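For context, the core of the setup is roughly the following (a simplified sketch of the kind of model described, not the exact code in the notebook; hyperparameters are placeholders and the positional encoding is omitted for brevity):

import math
import torch.nn as nn

class SmallTransformerLM(nn.Module):
    # Sketch: token embedding -> nn.Transformer (encoder + decoder) -> vocab projection.
    def __init__(self, vocab_size, d_model=128, nhead=2, num_layers=2):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        # src/tgt: (seq_len, batch) token indices
        src = self.embed(src) * math.sqrt(self.d_model)
        tgt = self.embed(tgt) * math.sqrt(self.d_model)
        x = self.transformer(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
        return self.out(x)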
You need an attention mask.
Thanks! That would be the memory_mask variable to pass to the transformer block, right? I guess this is why the model was reaching such low loss!
Actually src_mask. Without a mask, the model will be able to see the tokens it's going to predict.
Oh but I am already applying a mask in the forward pass:
x = self.transformer(x, tgt, tgt_mask=generate_square_subsequent_mask(x.shape[0]), src_mask=generate_square_subsequent_mask(x.shape[0]))
From my understanding, that generate_square_subsequent_mask() function hides all future tokens, so each position can only attend to itself and earlier positions.
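For reference, the helper (as I understand it from the tutorial) is roughly this; it builds the additive mask convention nn.Transformer expects, with 0.0 on allowed positions and -inf on future ones:

import torch

def generate_square_subsequent_mask(sz: int) -> torch.Tensor:
    # Position i may attend to positions j <= i; future positions get -inf,
    # which zeroes them out after the softmax inside the attention.
    mask = torch.triu(torch.ones(sz, sz), diagonal=1)
    return mask.masked_fill(mask == 1, float("-inf"))

# For sz = 4:
# tensor([[0., -inf, -inf, -inf],
#         [0.,   0., -inf, -inf],
#         [0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.]])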
You can use your model to generate some text and see if the generated tokens are the same as the targets. If they are, it means the src/tgt are not masked properly.
Yeah, I already tested it and it produces gibberish even on training data where the loss is low, so I think the mask is working.
If it returns some random tokens, it means the mask is working.
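For a more quantitative check, something like this sketch should work (src, tgt and targets are placeholders for however your batches are named; the idea is that near-perfect token accuracy on a training batch after only a few steps usually means the target is leaking through a missing mask):

import torch

model.eval()
with torch.no_grad():
    logits = model(src, tgt)                  # (seq_len, batch, vocab_size)
    preds = logits.argmax(dim=-1)             # (seq_len, batch)
    accuracy = (preds == targets).float().mean()
print(f"token-level accuracy on one training batch: {accuracy:.2%}")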
Could you just use TransformerEncoder for a test? An encoder/decoder is not necessary for a WLM task, IMO. And we have a tutorial on our website on setting up a TransformerEncoder for WLM (linked above).
That is the tutorial I am currently following!
Because of the project requirements, I have to use the decoder part as well, since I will probably also have to train it on translation tasks.
What I would like to do now is plot heatmaps of the attention matrix activations.
I am accessing the self-attention / multi-head module like so:
attn_activation = model.transformer.encoder.layers[1].self_attn.out_proj(emb)
But I am a little lost on how that heatmap is constructed… what is the value of each point (i, j) in the map?
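From looking at the module, out_proj seems to be just the final Linear layer of the attention block, so calling it on emb gives projected activations rather than attention weights. I suspect what I actually want is the weight matrix that self_attn returns when need_weights=True, something like this sketch (emb here would be the input that reaches that encoder layer, not the attention output):

import torch

layer = model.transformer.encoder.layers[1]
with torch.no_grad():
    # MultiheadAttention returns (output, attention_weights); the weights
    # are averaged over the heads by default and have shape
    # (batch, query_len, key_len).
    _, attn_weights = layer.self_attn(emb, emb, emb, need_weights=True)

# attn_weights[b, i, j] is how much position i attends to position j
# in sequence b; each row sums to 1 because of the softmax.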