I am implementing the RCTM I model from the paper “Recurrent Continuous Translation Models” (Kalchbrenner & Blunsom, 2013), a precursor to transformers. I have successfully built and trained it on very small datasets and get meaningful results. Now I would like to train it on a large dataset and make use of GPUs (ideally multiple GPUs). It runs on the 'cuda' device on my local machine, but my GPU is old, so that setup is only useful for small experiments. When I run it on Kaggle with 2x T4 GPUs, I only see about 20% GPU usage, and when I try to train with relatively large batches (e.g., 1000) it basically fails to perform well. After about 5 minutes, Kaggle's status bar shows no GPU usage at all.
I move all of my tensors as well as my model to the GPU device, and I wrap the model in nn.DataParallel to tell torch to make use of both GPUs. A simplified sketch of this setup is below.
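To make the setup concrete, here is a minimal sketch of how I handle devices and multi-GPU wrapping. This is not my actual notebook code: the stand-in model and the dummy batch are placeholders for my RCTM class and my real DataLoader batches.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in model just to illustrate the device handling; my real RCTM class goes here.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))
model = nn.DataParallel(model)   # splits each batch across the visible GPUs
model = model.to(device)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())

# Dummy batch standing in for one batch from my DataLoader.
x = torch.randn(1000, 128, device=device)   # inputs created directly on the GPU
y = torch.randn(1000, 128, device=device)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```

In my real training loop the batches come from a DataLoader and are moved with `.to(device)` before the forward pass.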
You can access my notebook here: rctm-5 | Kaggle
The model takes a sentence as a sequence of one-hot encoded vectors. It convolves this sequence a number of times until a final q x 1 vector remains (q = representation size); this part is called the Convolutional Sentence Model (CSM). That vector is then used in an RNN layer, together with the previously generated token in the target language, to generate the output sentence; this part is called the Recurrent Language Model (RLM). A rough sketch of the two parts is shown below.
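For reference, here is a loose, simplified sketch of the shape of the two parts. This is not my actual implementation, and the conditioning details differ from the paper (e.g., here the CSM vector is only used as the RNN's initial hidden state); it is just meant to show the structure and tensor shapes involved.

```python
import torch
import torch.nn as nn

class CSM(nn.Module):
    """Convolutional Sentence Model sketch: repeatedly convolve the
    one-hot sequence and collapse it into a single q-dimensional vector."""
    def __init__(self, vocab_size, q, kernel_size=2, n_layers=4):
        super().__init__()
        self.proj = nn.Conv1d(vocab_size, q, kernel_size=1)   # one-hot columns -> q channels
        self.convs = nn.ModuleList(
            nn.Conv1d(q, q, kernel_size) for _ in range(n_layers)
        )

    def forward(self, x):            # x: (batch, vocab_size, seq_len), one-hot columns
        h = torch.tanh(self.proj(x))
        for conv in self.convs:
            if h.size(-1) >= conv.kernel_size[0]:
                h = torch.tanh(conv(h))
        return h.mean(dim=-1)        # (batch, q): collapse whatever length remains

class RLM(nn.Module):
    """Recurrent Language Model sketch: an RNN conditioned on the CSM
    vector that is fed the previously generated target token at each step."""
    def __init__(self, vocab_size, q):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, q)
        self.rnn = nn.GRU(q, q, batch_first=True)
        self.out = nn.Linear(q, vocab_size)

    def forward(self, csm_vec, prev_tokens):      # prev_tokens: (batch, tgt_len)
        h0 = csm_vec.unsqueeze(0)                 # sentence vector as initial hidden state
        out, _ = self.rnn(self.embed(prev_tokens), h0)
        return self.out(out)                      # (batch, tgt_len, vocab_size) logits
```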
Any suggestions for improving GPU utilization when training this model are welcome.