I want to implement the inverting gradients technique from the paper "Deep Reinforcement Learning in Parameterized Action Space". First, I need the gradient of the critic network's output Q(s, a) with respect to the action, dQ(s, a)/da, which is Equation (6) in the paper.
I can write the loss as,
loss = value_net(state, policy_net(state))
But how can I differentiate this network with respect to the action only, rather than the state?
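One way to do this in PyTorch is to detach the action from the policy's graph and mark it as a leaf tensor that requires grad, then call `torch.autograd.grad` on the critic's output. Below is a minimal sketch; the network shapes and the concatenated `(state, action)` critic input are assumptions for illustration, not taken from the paper:

```python
import torch
import torch.nn as nn

# Hypothetical small networks for illustration (sizes are assumptions).
state_dim, action_dim = 4, 2
policy_net = nn.Sequential(nn.Linear(state_dim, 16), nn.ReLU(),
                           nn.Linear(16, action_dim))
value_net = nn.Sequential(nn.Linear(state_dim + action_dim, 16), nn.ReLU(),
                          nn.Linear(16, 1))

state = torch.randn(1, state_dim)

# Detach the action from the policy graph and mark it as requiring grad,
# so the backward pass sees the action as an independent input.
action = policy_net(state).detach().requires_grad_(True)

q = value_net(torch.cat([state, action], dim=-1))

# dQ/da: gradient of the critic output w.r.t. the action only.
# (state does not require grad, so no gradient flows to it.)
dq_da, = torch.autograd.grad(q.sum(), action)
```

`dq_da` has the same shape as `action` and can then be modified per Equation (6) before being fed back to the policy update (e.g. via `action.backward(modified_grad)` on a non-detached action).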