Reinforcement learning with Transformer for NLP

Hello everyone.

My question is related to implementing reinforcement learning [policy gradient] with a Transformer sequence-to-sequence model.

To be more specific, I am using a Transformer seq2seq model for abstractive summarization. To apply policy gradient, I need to sample a token from the Transformer during inference [i.e., with no gold target] and get the probability of that sample.
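In isolation, sampling one token and getting its log-probability with torch.distributions looks something like this (a minimal sketch, where the logits tensor is just a stand-in for one decoder step's output; these are also the imports my function below relies on):

import torch
import torch.nn.functional as F
from torch.distributions import Categorical

# stand-in for the decoder output at one step: [batch_size, vocab_size]
step_logits = torch.randn(8, 30000)

probs = F.softmax(step_logits, dim=-1)   # token probabilities
dist = Categorical(probs)                # categorical distribution over the vocabulary
token = dist.sample()                    # sampled token ids, shape [batch_size]
log_prob = dist.log_prob(token)          # log-probability of the sampled tokens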

The way I am trying to do this for the whole sequence is:

def get_distribution(model, batch):

    src, (shift_tgt, lbl_tgt), segs, clss, mask_src, mask_tgt, mask_cls = batch

    # the mock tgt are just torch.zeros tensors that have the same shape as the tgt
    mock_tgt = get_mock_tgt(shift_tgt)
    mock_return = get_mock_tgt(shift_tgt)

    max_length = shift_tgt.shape[1]

    log_probs = []

    for i in range(0, max_length - 1):
        # the model outputs logits for every target position at once
        prediction = model(src, mock_tgt, segs, clss, mask_src, mask_tgt, mask_cls)
        prediction = F.softmax(prediction, dim=2)

        # keep only the distribution at position i and sample one token per batch element
        multi_dist = Categorical(prediction[:, i])
        x_t = multi_dist.sample()

        # feed the sample back in as the decoder input for the next step
        # and record it as part of the sampled summary
        mock_tgt[:, i + 1] = x_t
        mock_return[:, i] = x_t

        # log-probability of the sampled token, needed for the policy gradient loss
        log_prob = multi_dist.log_prob(x_t)
        log_probs.append(log_prob)

    return mock_return, log_probs

However, I doubt this is the correct way. More precisely, the Transformer outputs logits for the entire sequence at once rather than for a single token, so to get the distribution of just one token I use multi_dist = Categorical(prediction[:, i]) and store the sampled token as the decoder input for the next inference step.
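For context, this is roughly how I plan to plug the returned samples and log-probabilities into a REINFORCE-style policy gradient loss; compute_reward here is just a placeholder for whatever sequence-level reward I end up using (e.g. ROUGE against the reference summary), and the imports are the same as above:

def policy_gradient_loss(log_probs, sampled_summary, reference_summary):
    # sequence-level reward per batch element, e.g. ROUGE between the sampled
    # and reference summaries (compute_reward is a placeholder, not shown here)
    reward = compute_reward(sampled_summary, reference_summary)  # [batch_size]

    # total log-probability of the sampled sequence
    seq_log_prob = torch.stack(log_probs, dim=1).sum(dim=1)      # [batch_size]

    # REINFORCE: maximizing expected reward == minimizing -reward * log-prob
    return -(reward * seq_log_prob).mean()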

What is the right way to do this? I have searched everywhere but cannot seem to find an answer.

Help :frowning: