The right side of your red line looks rather like the gradient of your loss, doesn't it?
Your loss would be something like:
loss = -torch.sum(torch.log(policy(state)) * (reward - baseline))
Note that the log wraps only policy(state) (the probability your network assigns to the action actually taken); the advantage (reward - baseline) multiplies the log-probability outside of it, otherwise a negative advantage would make the log undefined.
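By the way, an often cleaner way to build that log-probability in PyTorch is through torch.distributions.Categorical, which avoids taking the log of a probability yourself. A minimal single-step sketch, assuming a discrete action space, that policy(state) returns the action probabilities, and that action is the index of the action you took (state, reward and baseline are the same placeholders as above):

from torch.distributions import Categorical

dist = Categorical(probs=policy(state))   # distribution over discrete actions
log_prob = dist.log_prob(action)          # log pi(action | state), differentiable
loss = -log_prob * (reward - baseline)    # single-step REINFORCE loss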
Then you compute the gradient of this loss with respect to all the parameters/variables that require a gradient in your code by calling:
loss.backward()
And if you created, before the training loop, an optimizer associated with your policy, like this:
optim_policy = optim.Adam(policy.parameters(), lr=1e-3)  # or SGD, RMSprop, ... with whatever learning rate you like
you can update the parameters of your policy simply like this, after each backward call on your loss:
optim_policy.step()
(Remember to call optim_policy.zero_grad() before each backward pass, so gradients from the previous update don't accumulate.)
The parameters of your policy (theta) will then be updated with the gradient of your loss, in the direction that minimizes it.
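Putting it all together, here is a minimal end-to-end sketch of that training loop. Everything environment-related is a stand-in (random states and rewards instead of a real env.reset()/env.step()), and obs_dim, n_actions, the network architecture and the learning rate are arbitrary choices for the sketch, not anything from your code:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical

obs_dim, n_actions = 4, 2                     # arbitrary sizes for the sketch
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                       nn.Linear(64, n_actions), nn.Softmax(dim=-1))
optim_policy = optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(100):
    log_probs, rewards = [], []
    state = torch.randn(obs_dim)              # stand-in for env.reset()
    for t in range(20):                       # stand-in for one episode
        dist = Categorical(probs=policy(state))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        rewards.append(torch.randn(()))       # stand-in for env.step(action)
        state = torch.randn(obs_dim)
    rewards = torch.stack(rewards)
    baseline = rewards.mean()                 # simple constant baseline
    loss = -torch.sum(torch.stack(log_probs) * (rewards - baseline))
    optim_policy.zero_grad()                  # clear gradients from the last update
    loss.backward()                           # compute d(loss)/d(theta)
    optim_policy.step()                       # update theta to reduce the loss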