Getting Gradients from Gym Environment Reward

I am trying to train a neural network based on the gradient of a loss computed inside a Gym environment. However, the environment only supports NumPy, so it requires me to detach the action tensor before stepping to get the reward. How can I use Autograd to do gradient-based training in this case? I have looked at other policy gradient examples but didn't understand how they obtain a gradient to train their models. I would appreciate it if you could help me understand the trick here.

import torch

# `env` (a Gym/Gymnasium environment) and `agent` (the policy network) are
# defined elsewhere.
def reward_func():
    observation, info = env.reset(seed=42)

    # Episode-return accumulator; requires_grad_() alone does not connect it
    # to the agent's computation graph.
    total_reward = torch.zeros(1).cuda().requires_grad_()

    terminated, truncated = False, False
    while not (terminated or truncated):
        action = agent(torch.from_numpy(observation).float().cuda())
        # env.step() only accepts NumPy, so the action has to be detached here,
        # which cuts the gradient flow from the reward back to the agent.
        observation, reward, terminated, truncated, info = env.step(action.detach().cpu().numpy())
        total_reward = total_reward + reward

    return total_reward

Unfortunately, as of today you can't do that in the general case. Some libraries, such as MuJoCo, offer finite-difference derivatives, and others, such as Brax, are built on differentiable operations, but in the general Gym case you can't backprop through the environment's ops unless it clearly states otherwise.
Sorry if this is disappointing.
One workaround could be to use finite differences, but that requires an environment where you can easily set the state, which is not always that easy.
Hope that helps
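
To make the finite-difference idea a bit more concrete, here is a minimal sketch that estimates d(reward)/d(action) for a single step by central differences. It is only an illustration: it assumes a continuous-action environment whose internal state can be overwritten through `env.unwrapped.state` (true for the classic-control envs such as Pendulum-v1, an assumption for anything else).

import numpy as np

def fd_reward_grad(env, state, action, eps=1e-4):
    # Central-difference estimate of d(reward)/d(action) for one env.step().
    # Assumes `env.unwrapped.state` can be overwritten to restore the state
    # before each perturbed step.
    action = np.asarray(action, dtype=np.float64)
    grad = np.zeros_like(action)
    for i in range(action.size):
        rewards = []
        for sign in (+eps, -eps):
            env.unwrapped.state = np.array(state, dtype=np.float64)  # restore state
            perturbed = action.copy()
            perturbed[i] += sign
            _, reward, _, _, _ = env.step(perturbed)
            rewards.append(reward)
        grad[i] = (rewards[0] - rewards[1]) / (2 * eps)
    return grad

Such estimates could be fed back into PyTorch (e.g. via a custom torch.autograd.Function), but the approach scales poorly with the action dimension and is noisy if the environment is stochastic.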

Thanks. I learned that the log_prob trick is also usable for continuous action spaces.
I previously used Brax with PyTorch, but this was the first time I used MuJoCo.
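
For reference, a minimal sketch of what that looks like with a Gaussian policy: it assumes `agent` outputs the mean of the action distribution and uses a fixed standard deviation of 0.1 purely for illustration.

import torch
from torch.distributions import Normal

def reinforce_loss(env, agent, std=0.1):
    # One rollout; rewards stay plain Python floats, and gradients flow only
    # through the log-probabilities of the sampled actions.
    observation, info = env.reset(seed=42)
    log_probs, rewards = [], []
    terminated, truncated = False, False
    while not (terminated or truncated):
        mean = agent(torch.from_numpy(observation).float().cuda())
        dist = Normal(mean, std)
        action = dist.sample()                       # sampling carries no gradient
        log_probs.append(dist.log_prob(action).sum())
        observation, reward, terminated, truncated, info = env.step(action.cpu().numpy())
        rewards.append(reward)
    episode_return = sum(rewards)                    # a plain float, no graph needed
    # REINFORCE: minimizing -return * sum(log_prob) gives an unbiased estimate
    # of the gradient of the expected return.
    return -episode_return * torch.stack(log_probs).sum()

Calling .backward() on this loss and stepping any optimizer updates the agent; the key point is that no gradient ever has to flow through env.step().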

The log-prob trick (REINFORCE, the log-derivative trick, etc.) gives you an estimator of the gradient of an integral taken over a probability space: it won't give you the Jacobian of the transform, nor will it let you backprop through the environment dynamics in general.
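
Concretely, the standard score-function identity behind the estimator (with trajectory return R(\tau) and policy \pi_\theta) is

\nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \right]

so the gradient only touches the log-probability of the sampled actions; the environment dynamics enter through samples, never through a Jacobian.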