DataParallel LSTM/GRU wrong hidden batch size (8 GPUs)

It seems that the code works if you only have 1 GPU.

However, if you have multiple GPUs and wrap an LSTM in DataParallel, each LSTM replica expects a hidden state whose batch dimension has been divided across the GPUs during the forward pass, but it is instead given the full-size hidden variable h0.
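A minimal sketch of how this happens (module name and sizes are hypothetical, and it assumes batch_first=True so DataParallel scatters the input along the batch dimension): the hidden state is built for the full batch, while each replica only sees its slice of x.

```python
import torch
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, batch_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.num_layers = num_layers
        self.hidden_size = hidden_size
        self.batch_size = batch_size  # full batch size, fixed at construction

    def forward(self, x):
        # h0/c0 are sized for the FULL batch, but under DataParallel each replica
        # only receives batch_size / num_gpus rows of x, so the LSTM raises a
        # RuntimeError about the expected hidden size.
        h0 = x.new_zeros(self.num_layers, self.batch_size, self.hidden_size)
        c0 = x.new_zeros(self.num_layers, self.batch_size, self.hidden_size)
        return self.lstm(x, (h0, c0))

if __name__ == "__main__":
    batch, seq_len = 64, 10
    model = nn.DataParallel(LSTMModel(16, 32, 2, batch)).cuda()
    x = torch.randn(batch, seq_len, 16).cuda()
    out, _ = model(x)  # fine on 1 GPU, RuntimeError on a multi-GPU machine
```

One workaround is to size the hidden state from the replica's actual batch inside forward (e.g. x.size(0) with batch_first=True) instead of the global batch size.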

Try your code on a machine with more than 1 GPU and you will see the error.