Thanks for the code! 
I got a device mismatch error, since hidden_state and cell_state are both created on the CPU even after you push the model to the GPU.
Could you try creating them on the same device as the input instead:
    def forward(self, X):
        X = self.embedding(X)
        trans_X = X.transpose(0, 1)  # Transpose to [sequence length, batch size, input_size]
        # Create the initial states on the same device as the input
        hidden_state = torch.zeros(1, len(X), self.hidden_size).to(X.device)
        cell_state = torch.zeros(1, len(X), self.hidden_size).to(X.device)
        ...
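In case it helps, here is a minimal self-contained sketch of the same idea. The class name TextLSTM, the layer sizes, and the final linear layer are my own assumptions for illustration, since I don't know the rest of your model. It passes device=X.device to torch.zeros, which should have the same effect as the .to(X.device) call but creates the tensors directly on the target device:

```python
import torch
import torch.nn as nn


class TextLSTM(nn.Module):  # hypothetical name and sizes, for illustration only
    def __init__(self, vocab_size=10, embedding_dim=4, hidden_size=8, n_class=3):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size)
        self.fc = nn.Linear(hidden_size, n_class)  # assumed output layer

    def forward(self, X):
        X = self.embedding(X)            # [batch size, sequence length, embedding_dim]
        trans_X = X.transpose(0, 1)      # [sequence length, batch size, embedding_dim]
        batch_size = X.size(0)
        # Create the initial states directly on the device of the input
        hidden_state = torch.zeros(1, batch_size, self.hidden_size, device=X.device)
        cell_state = torch.zeros(1, batch_size, self.hidden_size, device=X.device)
        outputs, _ = self.lstm(trans_X, (hidden_state, cell_state))
        return self.fc(outputs[-1])      # use the output of the last time step


# Runs on the GPU if one is available, otherwise on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TextLSTM().to(device)
X = torch.randint(0, 10, (5, 7), device=device)  # dummy batch of token ids
y_pred = model(X)
```

Since the states follow X.device, the same forward should work unchanged on CPU and GPU.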
Also, the right_count calculation will raise another device mismatch, because Y_batch is on the CPU while the result of torch.argmax is still on the GPU when the two are compared.
You would have to call .cpu() on the torch.argmax result before the comparison, while in your code it's called on the sum:
    right_count = torch.sum(Y_batch.cpu() == torch.argmax(y_pred, 1).long().cpu()).item()
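As a possible alternative (just a sketch, not necessarily what you want in your training loop), you could move the labels to the device of the predictions instead of moving the predictions to the CPU. The helper name count_correct and the dummy tensors below are made up for the example:

```python
import torch


def count_correct(y_pred, Y_batch):
    """Count correct predictions; count_correct is a hypothetical helper name."""
    predicted = torch.argmax(y_pred, dim=1)
    # Move the labels to the device of the predictions before comparing
    return (predicted == Y_batch.to(predicted.device)).sum().item()


# Dummy logits for 3 samples and 2 classes; predicted classes are [1, 0, 1]
y_pred = torch.tensor([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
Y_batch = torch.tensor([1, 0, 0])
right_count = count_correct(y_pred, Y_batch)  # 2 of the 3 predictions match
```

Either way, both operands of == have to live on the same device, and .item() then returns a plain Python number.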
After fixing these two issues, the code runs fine for me.
Let me know if that helps.