I have recently been training a model built from nn.Embedding, nn.LSTM, and nn.Linear modules.
I also use DataParallel to speed up training. However, when multiple GPUs are used (with the same batch size as in the single-GPU run), the model is hard to train: the loss decreases very slowly, and the final model does not work.
For the optimizer, we directly use code like that in the ImageNet example:
Thank you for helping me.
I finally found the cause of the training error:
Because I only used pack_padded_sequence() in my model, without pad_packed_sequence(), the arrangement of the elements after gathering the results may differ from the original one.
Therefore, pack and unpack should both be used inside the model. In addition, the dimensions of the unpacked variables should be in the same order as the input, so that the sizes of the different sub-mini-batches are the same.
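To make the point above concrete, here is a minimal sketch of a model that packs and unpacks inside forward(). The class name Tagger, the layer sizes, and the use of batch_first=True and enforce_sorted=False are my own assumptions for illustration, not the original model.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class Tagger(nn.Module):
    """Hypothetical Embedding -> LSTM -> Linear model (sizes are illustrative)."""

    def __init__(self, vocab_size=100, embed_dim=16, hidden_dim=32, num_classes=5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens, lengths):
        # tokens: (B, T) padded token ids, lengths: (B,) true sequence lengths
        embedded = self.embedding(tokens)
        # Pack inside forward(); lengths must live on the CPU for packing.
        packed = pack_padded_sequence(
            embedded, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        packed_out, _ = self.lstm(packed)
        # Unpack inside forward() too, so the module returns an ordinary
        # batch-first tensor whose rows are in the same order as the input.
        out, _ = pad_packed_sequence(packed_out, batch_first=True)
        return self.linear(out)


tokens = torch.tensor([[1, 2, 3, 0], [4, 5, 0, 0], [6, 7, 8, 9]])
lengths = torch.tensor([3, 2, 4])
model = Tagger()
logits = model(tokens, lengths)
print(logits.shape)  # torch.Size([3, 4, 5])
```

Since both tokens and lengths have the batch as their first dimension, wrapping this module in nn.DataParallel should split them consistently across GPUs, and the per-GPU outputs are plain tensors that can be concatenated along the batch dimension.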
However, there is another question:
If the max length of the sequences is smaller than the corresponding dimension (dim T), unpack(pack(sequence)) may change the size of dimension T. Is there any efficient way to solve the problem?
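One option that may help, assuming a PyTorch version whose pad_packed_sequence() accepts the total_length argument, is to pass the original padded length when unpacking. The sketch below uses made-up tensor sizes to show the shrinking of dimension T and how total_length restores it.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

T = 6                                 # padded length of the full batch
padded = torch.zeros(2, T, 3)         # (B, T, features), illustrative sizes
lengths = torch.tensor([4, 2])        # longest sequence here is 4 < T

packed = pack_padded_sequence(padded, lengths, batch_first=True)

# Without total_length, the output is only as long as the longest sequence.
shrunk, _ = pad_packed_sequence(packed, batch_first=True)

# With total_length, the output is padded back to the original T.
restored, _ = pad_packed_sequence(packed, batch_first=True, total_length=T)

print(shrunk.shape)    # torch.Size([2, 4, 3])
print(restored.shape)  # torch.Size([2, 6, 3])
```

Inside a model's forward(), T could be read from the padded input itself (for example, its size along the time dimension), so that every sub-mini-batch returns an output with the same T and the gather step can concatenate them.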