Gradient flow for auto-regressive RL model

Hi, I have an auto-regressive RL model whose network has two heads: the first head predicts action a_1, and the second head predicts action a_2 conditioned on a_1. Both a_1 and a_2 are sampled with PyTorch's Categorical distribution. Will the gradients be able to backpropagate through the sample operation? See the code snippet below.

# Categorical is torch.distributions.Categorical; th is torch.
shared_latent = self.test_shared_latent(obs.float())
values_latent = self.test_value_net(shared_latent)

# First head: predict a_1 from the shared latent.
a_1_logits = self.test_protocol_port_target_net(shared_latent)
a_1_dist = Categorical(logits=a_1_logits)
a_1 = a_1_dist.sample()

# Second head: predict a_2 conditioned on the sampled a_1.
a_2_logits = self.test_source_net(a_1.float())
a_2_dist = Categorical(logits=a_2_logits)
a_2 = a_2_dist.sample()

# log_prob casts its argument to long internally, so no .float() is needed.
a_1_log_prob = a_1_dist.log_prob(a_1)
a_2_log_prob = a_2_dist.log_prob(a_2)
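
For reference, what sample() returns with respect to autograd can be checked in isolation; this is a minimal standalone sketch with throwaway tensors, not code from my model:

import torch as th
from torch.distributions import Categorical

logits = th.randn(4, requires_grad=True)
dist = Categorical(logits=logits)
sample = dist.sample()

print(sample.requires_grad)           # False: the sample tensor is detached
print(sample.grad_fn)                 # None: no autograd edge back to logits
print(dist.log_prob(sample).grad_fn)  # not None: log_prob is differentiable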

The a_1_log_prob and a_2_log_prob are used later to compute the clipped surrogate loss, as follows:

a_1_ratio = th.exp(a_1_log_prob - a_1_old_log_prob)
a_2_ratio = th.exp(a_2_log_prob - rollout_data.a_2_old_log_prob)

# clipped surrogate loss for target
a_1_loss_1 = advantages * a_1_ratio
a_1_loss_2 = advantages * th.clamp(a_1_ratio, 1 - clip_range, 1 + clip_range)
a_1_loss = -th.min(a_1_loss_1, a_1_loss_2).mean()

# clipped surrogate loss for source
a_2_loss_1 = advantages * a_2_ratio
a_2_loss_2 = advantages * th.clamp(a_2_ratio, 1 - clip_range, 1 + clip_range)
a_2_loss = -th.min(a_2_loss_1, a_2_loss_2).mean()

policy_loss = a_1_loss + a_2_loss
pg_losses.append(policy_loss.item())
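
To make the question concrete, this is the kind of gradient-flow check I have in mind; a sketch assuming it runs inside the model class, using the module names from the snippet above:

# After backward(), inspect which head's parameters actually received
# gradients from policy_loss.
policy_loss.backward()
for name, module in [("a_1 head", self.test_protocol_port_target_net),
                     ("a_2 head", self.test_source_net)]:
    has_grad = any(p.grad is not None and p.grad.abs().sum() > 0
                   for p in module.parameters())
    print(f"{name} receives gradients: {has_grad}")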