Implementation of the decoder in seq2seq

Hi,
I want to check that my implementation is correct. I am not using attention yet, so I unroll the decoder over the whole sequence in one call.
Everywhere I use batch_first=True.
Consider a simple case: batch_size=2, hidden_size=4, len(vocab) = 10.
I pad every sequence (sentence) to max_length, so in this simple case the decoder input is:

1    6    9    9    4
1    9    9    9    0
[torch.LongTensor of size 2x5]
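Roughly, the decoder itself looks like this (a simplified sketch just for illustration; names may differ from my real code, and the embedding size equals hidden_size only in this toy example):

import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, vocab_size=10, hidden_size=4, pad_idx=0):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_idx)
        # batch_first=True everywhere, so tensors are (batch, seq_len, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, decoder_input, encoder_state):
        embedded = self.embedding(decoder_input)                   # (2, 5, 4)
        lstm_outputs, state = self.lstm(embedded, encoder_state)   # (2, 5, 4)
        return lstm_outputs, state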

The outputs from nn.LSTM in the decoder have the following format:

Variable containing:
(0 ,.,.) = 
  0.1351 -0.0738 -0.3071  0.1253
  0.2045  0.0473 -0.4745  0.0952
  0.1976  0.1333 -0.1086  0.0051
  0.1840  0.1820 -0.1250  0.0794
  0.2153  0.1870 -0.0804  0.0017

(1 ,.,.) = 
  0.1388 -0.0739 -0.3141  0.4524
  0.2480 -0.0281 -0.3296  0.3183
  0.2284  0.0410 -0.1947  0.1689
  0.2259  0.0712 -0.1931  0.1656
  0.2268  0.0772 -0.0793  0.3217
[torch.cuda.FloatTensor of size 2x5x4 (GPU 0)]

Then I create output_mask because I don't want the padded positions to contribute to the loss:

Variable containing:
(0 ,.,.) = 
  1  1  1  1
  1  1  1  1
  1  1  1  1
  1  1  1  1
  1  1  1  1

(1 ,.,.) = 
  1  1  1  1
  1  1  1  1
  1  1  1  1
  1  1  1  1
  0  0  0  0
[torch.cuda.ByteTensor of size 2x5x4 (GPU 0)]
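I build this mask from the decoder input itself (assuming the pad index is 0), expanded over the hidden dimension, roughly like this:

pad_idx = 0
decoder_input = torch.LongTensor([[1, 6, 9, 9, 4],
                                  [1, 9, 9, 9, 0]])
# (2, 5) mask of non-pad positions, expanded to (2, 5, 4) to match the LSTM outputs
output_mask = (decoder_input != pad_idx).unsqueeze(2).expand(2, 5, 4).cuda()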

After that I apply torch.masked_select to the outputs, which gives masked_outputs:

Variable containing:
 0.1351
-0.0738
-0.3071
 0.1253
 0.2045
 0.0473
-0.4745
 0.0952
 0.1976
 0.1333
-0.1086
 0.0051
 0.1840
 0.1820
-0.1250
 0.0794
 0.2153
 0.1870
-0.0804
 0.0017
 0.1388
-0.0739
-0.3141
 0.4524
 0.2480
-0.0281
-0.3296
 0.3183
 0.2284
 0.0410
-0.1947
 0.1689
 0.2259
 0.0712
-0.1931
 0.1656
[torch.cuda.FloatTensor of size 36 (GPU 0)]

Then I use masked_outputs.view((-1, hidden_size)) and get:

Variable containing:
 0.1351 -0.0738 -0.3071  0.1253
 0.2045  0.0473 -0.4745  0.0952
 0.1976  0.1333 -0.1086  0.0051
 0.1840  0.1820 -0.1250  0.0794
 0.2153  0.1870 -0.0804  0.0017
 0.1388 -0.0739 -0.3141  0.4524
 0.2480 -0.0281 -0.3296  0.3183
 0.2284  0.0410 -0.1947  0.1689
 0.2259  0.0712 -0.1931  0.1656
[torch.cuda.FloatTensor of size 9x4 (GPU 0)]
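These two steps amount to something like this (continuing with lstm_outputs and output_mask from the sketches above):

hidden_size = 4
masked_outputs = torch.masked_select(lstm_outputs, output_mask)  # flat, size 36
masked_outputs = masked_outputs.view(-1, hidden_size)            # size 9 x 4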

Finally, I apply nn.Linear(hidden_size, len(vocab)) and get the decoder outputs:

Variable containing:
 0.4705 -0.0552  0.4348 -0.0798  0.0775  0.3475  0.2021  0.6573 -0.0601  0.2252
 0.4778  0.0825  0.5056 -0.0496  0.1685  0.3090  0.2437  0.7343 -0.0047  0.1911
 0.3115 -0.0023  0.3454 -0.2025  0.1078  0.3741  0.2473  0.5391  0.0743  0.2276
 0.2770 -0.0017  0.3527 -0.2080  0.1067  0.3662  0.2493  0.5297  0.0664  0.2271
 0.2695  0.0146  0.3323 -0.2272  0.1143  0.3801  0.2603  0.5193  0.1007  0.2303
 0.3459 -0.1172  0.4360 -0.0959 -0.0210  0.4020  0.2096  0.6514 -0.1421  0.2980
 0.3347 -0.0184  0.4390 -0.1099  0.0262  0.4214  0.2549  0.6870 -0.0614  0.3002
 0.3110 -0.0216  0.3812 -0.1647  0.0529  0.4106  0.2519  0.6053  0.0044  0.2746
 0.3005 -0.0126  0.3806 -0.1708  0.0629  0.4036  0.2545  0.5973  0.0147  0.2681
[torch.cuda.FloatTensor of size 9x10 (GPU 0)]
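In code this is just the linear layer applied row by row (in the decoder sketch above this is self.out; here I instantiate it inline for the example):

projection = nn.Linear(4, 10).cuda()          # nn.Linear(hidden_size, len(vocab))
decoder_outputs = projection(masked_outputs)  # size 9 x 10, one row of logits per non-pad position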

To compute the loss I need the decoder targets, so decoder_targets is:

6  9  9  4  2
9  9  9  2  0
[torch.LongTensor of size 2x5]

Then I also apply torch.masked_select to decoder_targets (with the corresponding 2x5 mask) and get masked_targets:

6
9
9
4
2
9
9
9
2
[torch.cuda.LongTensor of size 9 (GPU 0)]
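The target mask is the 2D analogue of output_mask (again assuming the pad index is 0), for example:

decoder_targets = torch.LongTensor([[6, 9, 9, 4, 2],
                                    [9, 9, 9, 2, 0]]).cuda()
target_mask = (decoder_targets != pad_idx)                          # (2, 5), zero only at the padded position
masked_targets = torch.masked_select(decoder_targets, target_mask)  # size 9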

In the end I compute the loss:
loss = criterion(decoder_outputs, masked_targets)
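where criterion is plain cross-entropy (I assume nn.CrossEntropyLoss here, since the linear layer outputs raw logits):

criterion = nn.CrossEntropyLoss()
loss = criterion(decoder_outputs, masked_targets)  # logits (9, 10) vs class indices (9,)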

Is this a correct implementation of the decoder? Can I use this approach?

Thanks!