I’m trying to implement a policy gradient method in RL, and the output of my model needs some more calculations before computing the loss. What should I do with my output and the loss function in such a case?
If the calculations are “trainable”, that is, there’s some learning involved, perhaps you could use a simple multi-layer perceptron (MLP) which takes the output of your model as input. The output of the MLP could then feed into the loss criterion.
Maybe posting some code of the calculations you want to do would help us understand what you are trying to achieve?
Thank you for your answer!
The calculations are not trainable because they need to deal with a ‘discounted reward’, which is not available until an episode ends.
As I read other topics, I found out that if I pass the Variable containing the output of my model through my calculation, the model is then trainable, but the outcome of the optimization is not normal.
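For readers unfamiliar with the ‘discounted reward’ mentioned above, here is a minimal sketch in plain Python of how it is commonly computed once an episode has finished. The thread does not show the poster's `discount_rewards`, so this body, the `gamma` parameter, and the `normalize` helper are assumptions for illustration only.

```python
def discount_rewards(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each step t.

    Hypothetical implementation: the original function is not shown in the thread.
    """
    discounted = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        discounted[t] = running
    return discounted


def normalize(values, eps=1e-8):
    """Shift to zero mean and scale by the population standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / (std + eps) for v in values]


# A three-step episode where only the final step is rewarded.
returns = discount_rewards([0.0, 0.0, 1.0], gamma=0.5)
normalized = normalize(returns)
```

Because the loop needs the rewards of all later steps, the returns can only be computed after the episode ends. They are plain numbers, so nothing in this step needs gradients.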
Here is part of my code:
Definition of the network:
self.conv1 = nn.Conv2d(7, 128, kernel_size=5, stride=1, padding=2)
self.conv2 = nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1)
self.conv3 = nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1)
self.conv4 = nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1)
self.conv5 = nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1)
self.conv6 = nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1)
self.conv7 = nn.Conv2d(128, 1, kernel_size=5, stride=1, padding=2)
self.steps_done = 0
self.matches_done = 0
self.win_count = 0

def forward(self, x):
    x = F.relu(self.conv1(x))
    x = F.relu(self.conv2(x))
    x = F.relu(self.conv3(x))
    x = F.relu(self.conv4(x))
    x = F.relu(self.conv5(x))
    x = F.relu(self.conv6(x))
    x = F.relu(self.conv7(x))
    x = x.view(x.size(0), -1)
    x = F.softmax(x)
    return x
Main code in the optimizer:
output = model(Variable(epstate.type(dtype)))
discounted_epr = discount_rewards(epreward)
discounted_epr -= torch.mean(discounted_epr)
discounted_epr /= torch.std(discounted_epr)
discounted_epr.resize_(discounted_epr.size(), 1)
discounted_epr = discounted_epr.expand(discounted_epr.size(), 81)
epy = Variable(epy, requires_grad=False)
discounted_epr = Variable(discounted_epr, requires_grad=False)
loss = (epy - output).mul(discounted_epr).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
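One thing worth noting about the loss above: squaring `(epy - output) * discounted_epr` discards the sign of the normalized reward, so good and bad episodes push the policy in the same direction. A more common formulation for plain policy gradient (REINFORCE) weights the log-probability of the action actually taken by its return. The sketch below is only an illustration under some assumptions: it uses the current PyTorch tensor API rather than `Variable`, it assumes the network exposes its scores before the softmax, and it assumes the chosen actions are available as integer indices rather than the one-hot `epy`. The names `policy_gradient_loss`, `logits`, `actions`, and `returns` are hypothetical.

```python
import torch
import torch.nn.functional as F


def policy_gradient_loss(logits, actions, returns):
    """REINFORCE-style loss: -mean(log pi(a_t | s_t) * G_t).

    logits:  (T, n_actions) raw scores, before any softmax (assumption)
    actions: (T,) integer index of the action taken at each step (assumption)
    returns: (T,) discounted, normalized returns, treated as constants
    """
    log_probs = F.log_softmax(logits, dim=1)
    # Pick out the log-probability of the action that was actually taken.
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # detach() keeps gradients from flowing into the returns.
    return -(chosen * returns.detach()).mean()


torch.manual_seed(0)
T, n_actions = 5, 81  # 81 matches the action count in the code above
logits = torch.randn(T, n_actions, requires_grad=True)  # stand-in for the model output
actions = torch.randint(0, n_actions, (T,))
returns = torch.tensor([1.2, 0.4, -0.3, -0.5, -0.8])

loss = policy_gradient_loss(logits, actions, returns)
loss.backward()  # in real training: optimizer.zero_grad() first, optimizer.step() after
```

Here the returns enter only as fixed weights, so the extra calculations do not need to be trainable: gradients flow through `log_probs` back into the network, and a positive return raises the probability of the chosen action while a negative one lowers it.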
@AjayTalati I’m not familiar with this community, so the layout is not so good. Sorry!
No problem @pointW. If you want to understand A3C, there’s already a very good implementation in PyTorch.
This might be more complicated than you need if you only want plain policy gradient, but it works very well.