Help CUDA error: out of memory

@ptrblck, I am waiting for your thoughts and would really appreciate them. Do you think the code has an issue? The model is indeed not trained well. I changed the way I do decoding, but it just sends back whatever I feed it as input, with no proper text generation. :( I compared the results with a single GPU: there is no pad token and it generates the text for me. There is a big difference between 1 GPU and multiple GPUs.