Hi @gemsanyou,
I think my code is not correct: I forgot to zero the gradients for each action. This should be correct:

```python
actor_optimizer.zero_grad()
action_Q = actor(state_batch)
policy_output = -critic(state_batch, action_Q)
for i in range(policy_output.size()[0]):
    # one backward pass per sample; retain_graph keeps the graph alive for the next iteration
    policy_output[i].backward(retain_graph=True)
    for name, param in actor.named_parameters():
        params_grads[name] += invert_gradient(param.grad.data, action_Q.detach()[i])
    # zero the gradients before the next sample's backward pass
    # (set_to_none=False keeps .grad as a tensor so it can be overwritten below)
    actor_optimizer.zero_grad(set_to_none=False)
for name, param in actor.named_parameters():
    # average the accumulated inverted gradients over the batch
    param.grad.data = params_grads[name] / batch_size
    params_grads[name] = torch.zeros_like(param.data)
actor_optimizer.step()
```
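For completeness, this snippet assumes `params_grads` and `batch_size` already exist; a minimal setup sketch (the names follow the code above):

```python
import torch

# accumulator for the inverted per-sample gradients, one entry per actor parameter
params_grads = {name: torch.zeros_like(param.data)
                for name, param in actor.named_parameters()}
batch_size = state_batch.size()[0]
```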
Regarding your questions:
- I see no problem applying it to discrete actions as long as you know the action bounds (e.g. relax the discrete action to a continuous value in `[min_b, max_b]` during training and round it when acting).
- If each action has different bounds, you can change the `invert_gradient` function to take the bounds as arguments, as follows:
```python
def invert_gradient(grads, action, max_b, min_b):
    # fraction of the range left before hitting the upper / lower bound
    pdiff_max = torch.div(max_b - action, max_b - min_b)
    pdiff_min = torch.div(action - min_b, max_b - min_b)
    zeros_grad = torch.zeros_like(grads)
    # positive gradients (pushing the action up) are scaled by the distance to the
    # upper bound, negative ones by the distance to the lower bound
    return torch.where(torch.gt(grads, zeros_grad),
                       torch.mul(grads, pdiff_max),
                       torch.mul(grads, pdiff_min))
```
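As a quick sanity check of the scaling (made-up numbers, assuming a 2-dimensional action in `[-1, 1]`):

```python
import torch

grads = torch.tensor([0.5, -0.5])   # pushing action 0 up, action 1 down
action = torch.tensor([0.9, 0.1])   # action 0 is already close to its upper bound
max_b = torch.tensor([1.0, 1.0])
min_b = torch.tensor([-1.0, -1.0])
print(invert_gradient(grads, action, max_b, min_b))
# tensor([ 0.0250, -0.2750]): the positive gradient near the upper bound is
# strongly downscaled; the negative one is scaled by the distance to the lower bound
```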
Then, somewhere outside the training loop you should store the action bounds, for example in a dict called `actions_bounds`. One possible layout (the bounds here are made up; one tensor per side lets `invert_gradient` broadcast over the action dimensions):
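```python
# hypothetical bounds for a 2-dimensional action space
actions_bounds = {
    "max": torch.tensor([1.0, 2.0]),
    "min": torch.tensor([-1.0, 0.0]),
}
```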
The training loop then becomes:
```python
actor_optimizer.zero_grad()
action_Q = actor(state_batch)
policy_output = -critic(state_batch, action_Q)
# the bounds belong to the action dimensions, not to the batch samples,
# so fetch them once, outside the per-sample loop
max_b, min_b = actions_bounds["max"], actions_bounds["min"]
for i in range(policy_output.size()[0]):
    policy_output[i].backward(retain_graph=True)
    for name, param in actor.named_parameters():
        params_grads[name] += invert_gradient(param.grad.data, action_Q.detach()[i], max_b, min_b)
    actor_optimizer.zero_grad(set_to_none=False)
for name, param in actor.named_parameters():
    param.grad.data = params_grads[name] / batch_size
    params_grads[name] = torch.zeros_like(param.data)
actor_optimizer.step()
```
Regarding the final remarks: the inverting-gradients formula downscales the gradient when the action is close to its upper or lower limit, to keep the policy from saturating at the bounds.
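In symbols (this is the inverting-gradients rule from Hausknecht & Stone, which the code above follows; $a$ is one action component and $\nabla_a$ the gradient flowing into it):

$$
\nabla_a \leftarrow \nabla_a \cdot
\begin{cases}
\dfrac{a_{\max} - a}{a_{\max} - a_{\min}} & \text{if } \nabla_a > 0,\\[4pt]
\dfrac{a - a_{\min}}{a_{\max} - a_{\min}} & \text{otherwise,}
\end{cases}
$$

so a positive gradient on an action already sitting at $a_{\max}$ is multiplied by zero, while an action in the middle of its range keeps about half of the gradient magnitude.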