I’m new to transformers, so maybe this is a dumb question, but I’ve been trying to get a standard Seq2Seq Transformer to fit a handful of examples. The examples I’m using are English-to-English conversations, but I don’t think that matters much. Obviously this will be hopelessly overfit, but for some reason I can’t get larger models (>30 million parameters) to fit even a few examples.
I’d assume that a larger model would have a greater capacity to fit (overfit, in this scenario), so I’m not sure what’s going on. Maybe I’m thinking about this wrong; please correct me if so.
The expected behavior is that the model trains down to a very low loss and reproduces the training examples near-perfectly. Instead, the loss gets stuck around 0.2 and the model ends up producing the same output for any input.
I implemented the Transformer in my own code, which is about 350 lines, so a bit much to post in full here, but I believe the same effect would show up with the standard PyTorch TransformerEncoder and TransformerDecoder. Here is a link to the entire file I’ve used: https://drive.google.com/open?id=1ie-WqsOqYJ-bL6BrI88eAkh-KRsDPpvb
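For reference, here is a minimal sketch of the kind of overfitting sanity check I mean, using the stock `nn.Transformer` rather than my own code. All of the sizes and hyperparameters here are illustrative, not the ones from my actual model; the toy source/target pair just stands in for one of my training examples.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

VOCAB, D_MODEL, MAX_LEN = 10, 32, 6

class TinySeq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        # Learned positional embeddings (illustrative choice).
        self.pos = nn.Parameter(torch.zeros(MAX_LEN, D_MODEL))
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            dim_feedforward=64, dropout=0.0, batch_first=True)
        self.out = nn.Linear(D_MODEL, VOCAB)

    def forward(self, src, tgt):
        src_emb = self.embed(src) + self.pos[: src.size(1)]
        tgt_emb = self.embed(tgt) + self.pos[: tgt.size(1)]
        # Causal mask so the decoder can't peek at future target tokens.
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.transformer(src_emb, tgt_emb, tgt_mask=mask)
        return self.out(h)

# A single toy "conversation" pair that the model should memorize.
src = torch.tensor([[1, 2, 3, 4]])
tgt = torch.tensor([[5, 6, 7, 8]])
# Teacher forcing: decoder input is the target shifted right by one.
tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]

model = TinySeq2Seq()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

losses = []
for step in range(200):
    opt.zero_grad()
    logits = model(src, tgt_in)
    loss = loss_fn(logits.reshape(-1, VOCAB), tgt_out.reshape(-1))
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f} -> last loss {losses[-1]:.3f}")
```

On a single memorized example like this, the loss should collapse toward zero within a few hundred steps; in my setup it plateaus around 0.2 instead, which is what I'm trying to understand.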