LSTM in CUDA has different semantics

It seems that nn.LSTM has different semantics with and without CUDA (even at initialization).

Consider the following case:
I have the following model, to which I am feeding 75-dimensional features extracted from video data, with the final goal of classifying the video.
**Note that num_layers=2 in nn.LSTM. My understanding is that this is the number of stacked LSTM layers and that it has nothing to do with the sequence length or batch size. PLEASE CORRECT ME IF I’M WRONG.**
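To make my reading of num_layers concrete, here is a minimal CPU-only sketch of what I expect. Only the 75-D input size and num_layers=2 come from my setup; the hidden size, sequence length and batch size are made-up values for illustration. With the default batch_first=False, the returned hidden state should have num_layers as its first dimension, regardless of the sequence length.

```python
import torch
import torch.nn as nn

# hidden_size=128, seq_len=10 and batch=4 are hypothetical illustration values;
# only input_size=75 and num_layers=2 come from the post.
lstm = nn.LSTM(input_size=75, hidden_size=128, num_layers=2)

# Default layout (batch_first=False): (seq_len, batch, input_size)
x = torch.randn(10, 4, 75)

# When no initial state is passed, h_0 and c_0 default to zeros.
output, (h_n, c_n) = lstm(x)

output_shape = tuple(output.shape)  # (seq_len, batch, hidden_size)
h_n_shape = tuple(h_n.shape)        # (num_layers, batch, hidden_size)
c_n_shape = tuple(c_n.shape)        # same shape as h_n

print(output_shape, h_n_shape, c_n_shape)
```

If this is right, changing the sequence length from 10 to any other value should change only the first dimension of output and leave h_n and c_n unchanged.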

Now I am training the model this way:

I’m getting the following error:

Now, if I remove .cuda() everywhere and run on the CPU, the error does not appear and I am able to train the model.
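For comparison, this is a minimal sketch of how I would expect device handling to work so that the same code runs on both CPU and GPU. It is not my actual model: the class name VideoClassifier, the hidden size and the number of classes are hypothetical. The assumption is that the model, the input, and any explicitly created initial hidden states all need to live on the same device.

```python
import torch
import torch.nn as nn

# Falls back to the CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class VideoClassifier(nn.Module):
    """Hypothetical stand-in model: LSTM over 75-D features, then a linear classifier."""

    def __init__(self, input_size=75, hidden_size=128, num_layers=2, num_classes=5):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (seq_len, batch, input_size)
        batch = x.size(1)
        # Create the initial states on the same device as the input,
        # rather than calling .cuda() unconditionally.
        h0 = torch.zeros(self.lstm.num_layers, batch, self.lstm.hidden_size,
                         device=x.device)
        c0 = torch.zeros_like(h0)
        output, _ = self.lstm(x, (h0, c0))
        # Classify from the output at the last time step.
        return self.fc(output[-1])


model = VideoClassifier().to(device)
x = torch.randn(10, 4, 75, device=device)  # hypothetical seq_len=10, batch=4
logits = model(x)
logits_shape = tuple(logits.shape)  # (batch, num_classes)
print(logits_shape, logits.device)
```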
Please let me know if I am missing something or if my interpretation is wrong. Also, please point me to any reference material on the PyTorch LSTM, if there is some.

Thank you.