Hi,
I want to check that my implementation is correct. I am not using attention yet, so I unroll the decoder in one call.
Everywhere I use batch_first=True.
Consider a simple case: batch_size=2, hidden_size=4, len(vocab)=10.
I pad every sequence (sentence) to max_length, so in this simple case the decoder input is:
1 6 9 9 4
1 9 9 9 0
[torch.LongTensor of size 2x5]
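A minimal sketch of building such a padded batch (using torch.nn.utils.rnn.pad_sequence; the variable names are mine, and 0 is assumed to be the padding index):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two sentences of different lengths; 0 is assumed to be the padding index
sentences = [torch.LongTensor([1, 6, 9, 9, 4]),
             torch.LongTensor([1, 9, 9, 9])]
# pad_sequence pads the shorter sentence with trailing zeros to max_length
decoder_input = pad_sequence(sentences, batch_first=True, padding_value=0)
print(decoder_input)  # 2x5 LongTensor, second row ends with a 0
```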
The outputs from nn.LSTM in the decoder have the following format:
Variable containing:
(0 ,.,.) =
0.1351 -0.0738 -0.3071 0.1253
0.2045 0.0473 -0.4745 0.0952
0.1976 0.1333 -0.1086 0.0051
0.1840 0.1820 -0.1250 0.0794
0.2153 0.1870 -0.0804 0.0017
(1 ,.,.) =
0.1388 -0.0739 -0.3141 0.4524
0.2480 -0.0281 -0.3296 0.3183
0.2284 0.0410 -0.1947 0.1689
0.2259 0.0712 -0.1931 0.1656
0.2268 0.0772 -0.0793 0.3217
[torch.cuda.FloatTensor of size 2x5x4 (GPU 0)]
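For reference, one way this step might look (embedding_dim=3 is an arbitrary assumption; the post does not state the embedding size, and I run on CPU rather than GPU):

```python
import torch
import torch.nn as nn

batch_size, max_length, hidden_size = 2, 5, 4
vocab_size, embedding_dim = 10, 3  # embedding_dim is an assumption

embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)

decoder_input = torch.LongTensor([[1, 6, 9, 9, 4],
                                  [1, 9, 9, 9, 0]])
# With batch_first=True the outputs are (batch, seq_len, hidden_size)
outputs, (h_n, c_n) = lstm(embedding(decoder_input))
print(outputs.size())  # torch.Size([2, 5, 4])
```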
Then I create an output_mask because I don’t want to compute loss on the pad elements:
Variable containing:
(0 ,.,.) =
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1
(1 ,.,.) =
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1
0 0 0 0
[torch.cuda.ByteTensor of size 2x5x4 (GPU 0)]
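One way to build such a mask from the true sequence lengths (lengths=[5, 4] here; the post shows a ByteTensor, which older PyTorch versions used where current versions use bool):

```python
import torch

hidden_size, max_length = 4, 5
lengths = torch.LongTensor([5, 4])  # true (unpadded) lengths of the two sequences

# A position is valid if its index is smaller than the sequence length
seq_mask = torch.arange(max_length).unsqueeze(0) < lengths.unsqueeze(1)  # (2, 5)
# Expand over the hidden dimension to match the LSTM outputs
output_mask = seq_mask.unsqueeze(2).expand(-1, -1, hidden_size)          # (2, 5, 4)
print(int(output_mask.sum()))  # 36 valid elements: 9 real time steps x 4
```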
After that I apply torch.masked_select to the outputs and get masked_outputs:
Variable containing:
0.1351
-0.0738
-0.3071
0.1253
0.2045
0.0473
-0.4745
0.0952
0.1976
0.1333
-0.1086
0.0051
0.1840
0.1820
-0.1250
0.0794
0.2153
0.1870
-0.0804
0.0017
0.1388
-0.0739
-0.3141
0.4524
0.2480
-0.0281
-0.3296
0.3183
0.2284
0.0410
-0.1947
0.1689
0.2259
0.0712
-0.1931
0.1656
[torch.cuda.FloatTensor of size 36 (GPU 0)]
Then I use masked_outputs.view(-1, hidden_size)
and get:
Variable containing:
0.1351 -0.0738 -0.3071 0.1253
0.2045 0.0473 -0.4745 0.0952
0.1976 0.1333 -0.1086 0.0051
0.1840 0.1820 -0.1250 0.0794
0.2153 0.1870 -0.0804 0.0017
0.1388 -0.0739 -0.3141 0.4524
0.2480 -0.0281 -0.3296 0.3183
0.2284 0.0410 -0.1947 0.1689
0.2259 0.0712 -0.1931 0.1656
[torch.cuda.FloatTensor of size 9x4 (GPU 0)]
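These two steps together, as a sketch (random values stand in for the real LSTM outputs):

```python
import torch

hidden_size, max_length = 4, 5
outputs = torch.randn(2, max_length, hidden_size)  # stand-in for the LSTM outputs
lengths = torch.LongTensor([5, 4])

seq_mask = torch.arange(max_length).unsqueeze(0) < lengths.unsqueeze(1)
output_mask = seq_mask.unsqueeze(2).expand_as(outputs)

# masked_select flattens, so reshape back to (num_valid_steps, hidden_size)
masked_outputs = torch.masked_select(outputs, output_mask).view(-1, hidden_size)
print(masked_outputs.size())  # torch.Size([9, 4])
```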
Finally, I apply nn.Linear(hidden_size, len(vocab)) and get the decoder outputs:
Variable containing:
0.4705 -0.0552 0.4348 -0.0798 0.0775 0.3475 0.2021 0.6573 -0.0601 0.2252
0.4778 0.0825 0.5056 -0.0496 0.1685 0.3090 0.2437 0.7343 -0.0047 0.1911
0.3115 -0.0023 0.3454 -0.2025 0.1078 0.3741 0.2473 0.5391 0.0743 0.2276
0.2770 -0.0017 0.3527 -0.2080 0.1067 0.3662 0.2493 0.5297 0.0664 0.2271
0.2695 0.0146 0.3323 -0.2272 0.1143 0.3801 0.2603 0.5193 0.1007 0.2303
0.3459 -0.1172 0.4360 -0.0959 -0.0210 0.4020 0.2096 0.6514 -0.1421 0.2980
0.3347 -0.0184 0.4390 -0.1099 0.0262 0.4214 0.2549 0.6870 -0.0614 0.3002
0.3110 -0.0216 0.3812 -0.1647 0.0529 0.4106 0.2519 0.6053 0.0044 0.2746
0.3005 -0.0126 0.3806 -0.1708 0.0629 0.4036 0.2545 0.5973 0.0147 0.2681
[torch.cuda.FloatTensor of size 9x10 (GPU 0)]
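As a sketch of the projection step (the Linear layer maps each valid hidden state to vocabulary logits):

```python
import torch
import torch.nn as nn

hidden_size, vocab_size = 4, 10
out = nn.Linear(hidden_size, vocab_size)

masked_outputs = torch.randn(9, hidden_size)  # stand-in for the previous step
decoder_outputs = out(masked_outputs)         # one row of logits per valid step
print(decoder_outputs.size())  # torch.Size([9, 10])
```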
I need decoder targets to compute the loss, so decoder_targets is:
6 9 9 4 2
9 9 9 2 0
[torch.LongTensor of size 2x5]
Then I also apply torch.masked_select to decoder_targets (with a 2D mask of size 2x5, since the targets are 2D) and get masked_targets:
6
9
9
4
2
9
9
9
2
[torch.cuda.LongTensor of size 9 (GPU 0)]
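Note that decoder_targets is 2x5, so the mask applied here must be the 2D (2x5) sequence mask, not the 3D output_mask; a sketch:

```python
import torch

max_length = 5
decoder_targets = torch.LongTensor([[6, 9, 9, 4, 2],
                                    [9, 9, 9, 2, 0]])
lengths = torch.LongTensor([5, 4])

# Same validity test as for the outputs, but without the hidden dimension
seq_mask = torch.arange(max_length).unsqueeze(0) < lengths.unsqueeze(1)  # (2, 5)
masked_targets = torch.masked_select(decoder_targets, seq_mask)
print(masked_targets)  # tensor([6, 9, 9, 4, 2, 9, 9, 9, 2])
```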
In the end I compute the loss:
loss = criterion(decoder_outputs, masked_targets)
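Putting the steps above together, a minimal end-to-end sketch on CPU (embedding_dim and nn.CrossEntropyLoss as the criterion are my assumptions; the post does not name either):

```python
import torch
import torch.nn as nn

batch_size, max_length, hidden_size, vocab_size = 2, 5, 4, 10
embedding_dim = 3  # assumption: not stated in the post

embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
out = nn.Linear(hidden_size, vocab_size)
criterion = nn.CrossEntropyLoss()  # assumption: criterion not named in the post

decoder_input = torch.LongTensor([[1, 6, 9, 9, 4],
                                  [1, 9, 9, 9, 0]])
decoder_targets = torch.LongTensor([[6, 9, 9, 4, 2],
                                    [9, 9, 9, 2, 0]])
lengths = torch.LongTensor([5, 4])

outputs, _ = lstm(embedding(decoder_input))                       # (2, 5, 4)
seq_mask = torch.arange(max_length).unsqueeze(0) < lengths.unsqueeze(1)
output_mask = seq_mask.unsqueeze(2).expand_as(outputs)

# Keep only the 9 valid time steps, project to logits, mask the targets too
masked_outputs = torch.masked_select(outputs, output_mask).view(-1, hidden_size)
decoder_outputs = out(masked_outputs)                             # (9, 10)
masked_targets = torch.masked_select(decoder_targets, seq_mask)   # (9,)

loss = criterion(decoder_outputs, masked_targets)
```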
Is this a correct implementation of the decoder? Can I use this approach?
Thanks!