I have a small transformer model (~11M parameters) that I train on some data. The model achieved very good accuracy on the test data (~99%). After a commit, the model would only reach about 3% accuracy or diverge completely, giving NaN for the loss (presumably because the weights themselves become NaN). After searching for the cause of this drastic change in behaviour, I realized that it depends entirely on the norm_first parameter: setting it to False (on both the encoder and decoder layers) leads to divergence during training.
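For reference, I use the standard PyTorch modules; a minimal sketch of the configuration in question (the dimensions, layer counts and batch_first are placeholders/assumptions, not my actual values):

import torch.nn as nn

# Minimal sketch of the setup (sizes are placeholders, not my real values)
d_model, nhead, dim_feedforward, num_layers = 256, 8, 1024, 4

encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=nhead, dim_feedforward=dim_feedforward,
    batch_first=True,
    norm_first=True,   # True trains fine; False diverges
)
decoder_layer = nn.TransformerDecoderLayer(
    d_model=d_model, nhead=nhead, dim_feedforward=dim_feedforward,
    batch_first=True,
    norm_first=True,   # the same flag is flipped here as well
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)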
It seems very odd to me that this alone would make or break a model. I also use masks of bool type like this:
causal_mask = torch.triu(torch.ones(tgt.shape[1], tgt.shape[1], dtype=torch.bool), diagonal=1).to(src.device)
which gives warnings in eval() mode:
UserWarning: Converting mask without torch.bool dtype to bool; this will negatively affect performance
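In case it matters, the mask ends up in the forward pass roughly like this (a sketch; encoder/decoder are the modules from the configuration above and the variable names are assumptions):

# Sketch of the forward pass (variable names are assumptions, not my exact code)
memory = encoder(src_emb)                                # src_emb: (batch, src_len, d_model)
output = decoder(tgt_emb, memory, tgt_mask=causal_mask)  # causal_mask: bool, (tgt_len, tgt_len)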
I don’t know whether that warning is related. Is this behaviour “normal” for a transformer trained from scratch, or might there be something wrong with my setup?