Hello everyone.

My question is about implementing reinforcement learning [policy gradient] with a Transformer sequence-to-sequence model.

To give some detail: I am using a Transformer seq2seq model for abstractive summarization. To apply policy gradient, I need to sample a token from the Transformer during inference [no gold target] and get the (log-)probability of that sample.

The way I am trying to do this is:

```
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

def get_distribution(model, batch):
    src, (shift_tgt, lbl_tgt), segs, clss, mask_src, mask_tgt, mask_cls = batch
    # the mock tgt are just torch.zeros tensors that have the same shape as the tgt
    mock_tgt = get_mock_tgt(shift_tgt)
    mock_return = get_mock_tgt(shift_tgt)
    max_length = shift_tgt.shape[1]
    log_probs = []
    for i in range(0, max_length - 1):
        # full forward pass at every step; only position i is used
        prediction = model(src, mock_tgt, segs, clss, mask_src, mask_tgt, mask_cls)
        prediction = F.softmax(prediction, dim=2)
        multi_dist = Categorical(prediction[:, i])
        x_t = multi_dist.sample()
        # feed the sampled token back in as input for the next step
        mock_tgt[:, i + 1] = x_t
        mock_return[:, i] = x_t
        # log-probability of the sampled token, kept for the policy-gradient loss
        log_prob = multi_dist.log_prob(x_t)
        log_probs.append(log_prob)
    return mock_return, log_probs
```

However, I doubt this is the correct way. More precisely, the Transformer outputs logits for every position at once rather than one token at a time, so to get the distribution over a single position I use **multi_dist = Categorical(prediction[:, i])** and then feed the sampled token back in as input for the next inference step.
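For what it's worth, the per-step pattern itself (build a `Categorical` at the current position, sample, keep the log-prob, feed the token back in) is the standard REINFORCE setup. Below is a minimal self-contained sketch of that pattern, with a toy decoder standing in for the real Transformer; the `ToyDecoder`, the random reward, and all shapes are illustrative assumptions, not part of the original code. Note that it grows the token tensor with `torch.cat` instead of writing in place, so the graphs saved by each forward pass are never mutated before `backward()`:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

torch.manual_seed(0)

class ToyDecoder(nn.Module):
    # illustrative stand-in for the seq2seq model:
    # maps a token sequence to per-position vocabulary logits
    def __init__(self, vocab_size=10, hidden=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tgt):
        return self.out(self.embed(tgt))  # (batch, seq_len, vocab_size)

def sample_with_log_probs(model, batch_size=2, max_length=5):
    # start from a single "BOS" token (index 0) per sequence
    tokens = torch.zeros(batch_size, 1, dtype=torch.long)
    log_probs = []
    for _ in range(max_length - 1):
        logits = model(tokens)                   # full forward pass each step
        dist = Categorical(logits=logits[:, -1]) # distribution at the last position
        x_t = dist.sample()
        # append the sample instead of writing in place
        tokens = torch.cat([tokens, x_t.unsqueeze(1)], dim=1)
        log_probs.append(dist.log_prob(x_t))
    return tokens, torch.stack(log_probs, dim=1)  # (batch, max_length - 1)

model = ToyDecoder()
tokens, log_probs = sample_with_log_probs(model)

# REINFORCE: weight the summed log-probs by a (made-up) per-sequence reward,
# e.g. a ROUGE score in the summarization setting
reward = torch.rand(tokens.shape[0])
loss = -(reward * log_probs.sum(dim=1)).mean()
loss.backward()  # gradients flow back into the model parameters
```

Two details worth noting: `Categorical(logits=...)` avoids the explicit softmax (and is more numerically stable), and stacking the log-probs lets the whole sampled sequence contribute to one scalar loss.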

What is the right way to do this? I have searched everywhere but cannot seem to find an answer.

Help