Policy Gradient for NLP

As a beginner in RL, I am totally at a loss on how to implement a policy gradient for NLP tasks (such as NMT).

More specifically, I am trying to fine-tune a pre-trained Seq2Seq model via a policy gradient that gets rewards dependent on BLEU scores.

However, I didn’t manage to find any resources (at least not fit for my experience?) on how to do this?

From what I understand I need to create a loss based on the vocabulary distributions of each time step (of the decoder) and the reward received by a prediction weighted down by a base line. However I am confused how would this look since I am only generating a reward for the complete translation. Also what is the baseline supposed to be (a reward from a non-RL trained model ? ) ?

loss = - torch.sum(torch.log( policy(state) * (reward - baseline)))

I know this is probably incredibly simple for some of you, but it goes way over my head :frowning: Would appreciate some help for a noob.