Inverting Gradients - Gradient of critic network output wrt action

Hello there,
I want to implement inverting gradients as described in the paper "Deep Reinforcement Learning in Parameterized Action Space". But first I need the gradient of the critic network's output Q(s,a) with respect to the action, dQ(s,a)/da, which is equation (6) in the paper.

I can write the loss as,
loss = value_net(state, policy_net(state))

But how can I differentiate this network with respect to the action only, rather than the state?


@frknayk I am trying to implement this paper too and am having trouble implementing its gradient inversion. Have you figured it out?

I have found a TensorFlow implementation.

But later I decided to use loss = value_net(state, policy_net(state)) for various reasons.
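
If you only need the raw dQ(s,a)/da from equation (6), you can also ask autograd for the gradient with respect to the action tensor directly via torch.autograd.grad. A rough sketch with illustrative stand-in networks (not my actual ones):

import torch
import torch.nn as nn

# Illustrative stand-ins for the actor and critic.
state_dim, action_dim = 4, 2
policy_net = nn.Sequential(nn.Linear(state_dim, 16), nn.ReLU(),
                           nn.Linear(16, action_dim), nn.Tanh())

class ValueNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 16),
                                 nn.ReLU(), nn.Linear(16, 1))
    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))

value_net = ValueNet()

state = torch.randn(32, state_dim)      # batch of states
action = policy_net(state)              # a = policy(s)
q_values = value_net(state, action)     # Q(s, a)

# Differentiate with respect to the action tensor only; the state and the
# network parameters are left out, so this is dQ(s,a)/da for each sample.
dq_da = torch.autograd.grad(q_values.sum(), action)[0]
print(dq_da.shape)   # torch.Size([32, 2])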

@frknayk Were you able to implement Inverting Gradients? If yes, can you please share the snippet?

Thanks

Is this correct?

Invert gradient function for actions defined in the interval [-1,1]:

import torch

def invert_gradient(grads, action):
    # Fraction of the [-1, 1] range left before hitting the upper/lower bound.
    pdiff_max = torch.div(-action + 1.0, 2.0)
    pdiff_min = torch.div(action + 1.0, 2.0)

    zeros_grad = torch.zeros_like(grads)

    # Scale positive gradients by the distance to the upper bound,
    # negative gradients by the distance to the lower bound.
    grad_inverter = torch.where(torch.gt(grads, zeros_grad),
                                torch.mul(grads, pdiff_max),
                                torch.mul(grads, pdiff_min))

    return grad_inverter
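
As a quick sanity check with made-up numbers, an action already close to the upper bound should barely be pushed further up but can still move down freely:

grads  = torch.tensor([0.5, -0.5])    # made-up gradient values
action = torch.tensor([0.9,  0.9])    # actions close to the upper bound +1

print(invert_gradient(grads, action))
# tensor([ 0.0250, -0.4750]): 0.5 is scaled by (1 - 0.9)/2 = 0.05,
# -0.5 is scaled by (0.9 + 1)/2 = 0.95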

Outside the training loop:

params_grads = {}   # accumulators for the inverted gradients, one per actor parameter
for name, param in actor.named_parameters():
    params_grads[name] = torch.zeros_like(param.data)

Inside the training loop, to update the actor network parameters:

actor_optimizer.zero_grad()
action_Q = actor(state_batch)
policy_output = -critic(state_batch, action_Q)

for i in range(policy_output.size()[0]):
    policy_output[i].backward(retain_graph=True)
    for name, param in actor.named_parameters():
        params_grads[name] += invert_gradient(param.grad.data, action_Q.detach()[i])

for name, param in actor.named_parameters():
    param.grad.data = params_grads[name]/batch_size
    params_grads[name] = torch.zeros_like(param.data)

actor_optimizer.step()

Hi jomavera, thanks so much for the example. However, I have a couple of follow-up questions.

  1. Will this affect the discrete actions as well? Because in the original paper, if I'm not mistaken, the bounded action space applies only to the continuous parameters.
  2. What if each parameter has different bounds or limits? How do we keep track of that?

I think both of these questions stem from my limited knowledge of how gradient storage and backpropagation work in PyTorch.

However, I know that we can treat all actions as bounded by -1 to 1, even the discrete ones, and then map the parameters to their respective limits (see the quick sketch below). But would this approach be valid?
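
For the mapping I mean something like this affine rescaling (illustrative bounds):

import torch

def scale_action(a, min_b, max_b):
    # Map a in [-1, 1] to [min_b, max_b] component-wise.
    return min_b + (a + 1.0) * (max_b - min_b) / 2.0

print(scale_action(torch.tensor([-1.0, 0.0, 1.0]), min_b=0.0, max_b=10.0))
# tensor([ 0.,  5., 10.])
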
Thank you so much beforehand

Another question, if I may: it turns out this produces a dimension mismatch error.
pdiff_max's dimension is n_actions,
while grads has the dimension of n_states.

Is it maybe because n_actions > 1?

And if n_actions > 1, do we just sum pdiff_max and pdiff_min?

pdiff_max = torch.sum(torch.div(-action+1.0,2.0))
pdiff_min = torch.sum(torch.div(action+1.0,2.0))

because we are doing it for every action and adding them all up.
Thank you once again

Hi @gemsanyou,

I think my code is not correct: I forgot to zero the gradients for each action. This should be correct:

actor_optimizer.zero_grad()
action_Q = actor(state_batch)
policy_output = -critic(state_batch, action_Q)

for i in range(policy_output.size()[0]):
    policy_output[i].backward(retain_graph=True)
    for name, param in actor.named_parameters():
        params_grads[name] += invert_gradient(param.grad.data, action_Q.detach()[i])
    actor_optimizer.zero_grad()   # reset the per-sample gradients before the next backward

for name, param in actor.named_parameters():
    param.grad.data = params_grads[name]/batch_size   # average over the batch
    params_grads[name] = torch.zeros_like(param.data)

actor_optimizer.step()

Regarding your questions:

  1. I see no problem applying it to discrete actions as long as you know the action bounds.
  2. If each action has different bounds, you can change the invert_gradient function to take the bounds as arguments, as follows:
def invert_gradient(grads, action, max_b, min_b):
    # Fraction of the [min_b, max_b] range left before hitting the upper/lower bound.
    pdiff_max = torch.div(-action + max_b, max_b - min_b)
    pdiff_min = torch.div(action - min_b, max_b - min_b)

    zeros_grad = torch.zeros_like(grads)

    grad_inverter = torch.where(torch.gt(grads, zeros_grad),
                                torch.mul(grads, pdiff_max),
                                torch.mul(grads, pdiff_min))

    return grad_inverter

Then, somewhere outside the training loop, you should store the action bounds, for example in a dict called actions_bounds (see the small sketch after the loop below). The training loop should then be:

actor_optimizer.zero_grad()
action_Q = actor(state_batch)
policy_output = -critic(state_batch, action_Q)

for i in range(policy_output.size()[0]):
    policy_output[i].backward(retain_graph=True)
    max_b, min_b = actions_bounds[i]
    for name, param in actor.named_parameters():
        params_grads[name] += invert_gradient(param.grad.data, action_Q.detach()[i], max_b, min_b)
    actor_optimizer.zero_grad()

for name, param in actor.named_parameters():
    param.grad.data = params_grads[name]/batch_size
    params_grads[name] = torch.zeros_like(param.data)

actor_optimizer.step()
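
where actions_bounds can just be a dict of (max_b, min_b) pairs, for example (illustrative values):

actions_bounds = {
    0: (1.0, -1.0),   # index -> (max_b, min_b)
    1: (5.0,  0.0),
}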

Regarding your final remarks: the inverting gradients formula downscales the gradient when the action gets close to its upper or lower limit, in order to keep the policy from saturating at actions on the bounds.
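
For reference, equation (6) in the paper scales the gradient of each bounded parameter p roughly as (writing it out in LaTeX, from memory):

\nabla_p \leftarrow \nabla_p \cdot
\begin{cases}
(p_{\max} - p) \,/\, (p_{\max} - p_{\min}) & \text{if } \nabla_p \text{ suggests increasing } p \\
(p - p_{\min}) \,/\, (p_{\max} - p_{\min}) & \text{otherwise}
\end{cases}

so a positive gradient is multiplied by how much room is left towards the upper bound, and a negative one by how much room is left towards the lower bound, which is exactly what pdiff_max and pdiff_min compute.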


Hi jomavera,
thank you so much for the super fast reply, especially after I necroed this post :smiley:

I will try this right now and follow up with the results. In particular, about this line of code: I think the i iteration is over each sample in the batch, not over each action. But I will confirm it right now.

Ah yes, the i iterates over samples, and the previous error I mentioned is this:

   grad_inverter = T.where(T.gt(grads, zeros_grad), T.mul(grads,pdiff_max), T.mul(grads,pdiff_min))
RuntimeError: The size of tensor a (99) must match the size of tensor b (8) at non-singleton dimension 1

When I print:

print(policy_output.size()) = torch.Size([64, 1])

I assumed in the code that the action has only one component (a scalar). The invert_gradient function should be called for each action component independently, so I don't think the sum is appropriate. If we "compute" an inverted gradient for each action component at each batch sample and add it to params_grads[name], then I think the gradient should be the mean over components and batch:

param.grad.data = params_grads[name]/(batch_size*action_size)

So currently here is my code, knowing that i is for each sample.
The part in the learn method:

self.actor.optimizer.zero_grad()
action_q = self.actor.forward(states)
policy_output = -self.critic.forward(states, action_q)
        
for i in range(policy_output.size()[0]):
    policy_output[i].backward(retain_graph=True)
    for name, param in self.actor.named_parameters():
        self.params_grads[name] += invert_gradient(param.grad.data, action_q.detach()[i])
    ??? self.actor.optimizer.zero_grad() ???

for name, param in self.actor.named_parameters():
    param.grad.data = self.params_grads[name]/(self.batch_size*self.n_actions)
    self.params_grads[name] = T.zeros_like(param.data)

The invert_gradient function:

def invert_gradient(grads, action, lower_bound=-1, upper_bound=1):
    bound_range = upper_bound-lower_bound
    pdiff_max = T.sum(T.div(upper_bound-action, bound_range))
    pdiff_min = T.sum(T.div(action-lower_bound, bound_range))
    zeros_grad = T.zeros_like(grads)
    grad_inverter = T.where(T.gt(grads, zeros_grad), T.mul(grads,pdiff_max), T.mul(grads,pdiff_min))
    return grad_inverter

Do you think this is correct?
And what will zeroing the gradient for each i do, now knowing that i iterates over the samples? Will it zero the gradients for each action?

Or should I have another iteration inside, like this:

for i in range(policy_output.size()[0]):
    policy_output[i].backward(retain_graph=True)
    for a in action_q.detach()[i]:
        for name, param in self.actor.named_parameters():
            self.params_grads[name] += invert_gradient(param.grad.data, a)
        self.actor.optimizer.zero_grad()

without summing inside the invert_gradient method.
Thank you so much beforehand.

EDIT: I think I found an example repo of this, for anyone else who needs it; it turns out it also includes other parameterised RL agents in PyTorch: GitHub - cycraig/MP-DQN: Source code for the dissertation: "Multi-Pass Deep Q-Networks for Reinforcement Learning with Parameterised Action Spaces"

I’m going to read it and try it out
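
From a first look, the idea there seems to be to compute dQ/da directly, invert it per action component, and then push the negated result back through the actor in a single backward pass, instead of inverting the parameter gradients sample by sample. A rough, self-contained sketch of that idea (not the repo's actual code; the stand-in networks, bounds and sizes are made up):

import torch
import torch.nn as nn

# Illustrative stand-ins for the actor and critic networks.
state_dim, n_actions, batch_size = 4, 2, 64
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                      nn.Linear(32, n_actions), nn.Tanh())

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + n_actions, 32),
                                 nn.ReLU(), nn.Linear(32, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=1))

critic = Critic()
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)

# Per-component action bounds, broadcast against the [batch, n_actions] actions.
max_b = torch.tensor([1.0, 1.0])
min_b = torch.tensor([-1.0, -1.0])

state_batch = torch.randn(batch_size, state_dim)

actor_optimizer.zero_grad()
actions = actor(state_batch)                    # [batch, n_actions]
q_values = critic(state_batch, actions)         # [batch, 1]

# dQ/da for every sample and every action component in one pass.
dq_da = torch.autograd.grad(q_values.sum(), actions)[0]

# Inverting gradients: scale components that push towards the upper bound by the
# room left to that bound, and components that push towards the lower bound by
# the room left to the lower bound.
with torch.no_grad():
    inverted = torch.where(dq_da > 0,
                           dq_da * (max_b - actions) / (max_b - min_b),
                           dq_da * (actions - min_b) / (max_b - min_b))

# Gradient ascent on Q: feed -inverted back through the actor and take a step.
actions.backward(-inverted)
actor_optimizer.step()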