Hi,
I am running into a strange problem where backpropagation does not update all parameters as expected. The relevant code snippet is shown below. All actor/critic/env nets are simple MLPs. I checked the weights by printing the sum of certain layers: the critic_1 and critic_2 networks get updated properly, but the weights of the env net stay the same the whole time. Am I doing something completely wrong? I always thought that setting the requires_grad flag to True is sufficient to enable backpropagation.
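For reference, this is roughly how I check the weights after each update step (just a sketch, using named_parameters to print the parameter sums of the three nets that appear in the snippet below):

for net_name, net in [("critic_1", self.critic_1), ("critic_2", self.critic_2), ("env", env)]:
    for param_name, param in net.named_parameters():
        # the sums of the critic layers change after every update step,
        # the sums of the env layers never change
        print(net_name, param_name, param.data.sum().item())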
Any help is highly appreciated.
P.S. I also don’t understand why I have to run the state = state.to(device) line. Right before returning the state inside the env.step(last_action) method, the state is still on the GPU, but after it is returned it is on the CPU.
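This is the kind of check that shows it (just a sketch of the print statements I use):

# inside env.step(), right before the state is returned:
print(state.device)  # still reports the GPU device here
# back in the training code, right after the call:
state, _, _ = env.step(last_action)
print(state.device)  # now reports cpu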
# Sample replay buffer, all elements are detached by using the sample function
last_state, last_action, next_state, reward, done = replay_buffer.sample(self.batch_size)
# enable gradient computation
last_state.requires_grad_(True)
last_action.requires_grad_(True)
# run the env net forward on the sampled batch
input_seed = env.get_input_seed().repeat(len(last_state)).unsqueeze(1)
env.set_state(last_state)
env.set_input_seed(input_seed)
state, _, _ = env.step(last_action)
state = state.to(device)  # why do I have to move this to the device again?
action = self.actor(state)
with torch.no_grad():
    # Select action according to policy
    next_action = self.actor_target(next_state)
    reward = reward
    # Compute the target Q value
    target_Q1 = self.critic_target_1(next_state, next_action)
    target_Q2 = self.critic_target_2(next_state, next_action)
    target_Q = torch.min(target_Q1, target_Q2)
    target_Q = reward + (1 - done) * self.gamma * target_Q
# Get current Q estimates
current_Q1 = self.critic_1(state, action)
current_Q2 = self.critic_2(state, action)
# Compute critic loss
critic_loss = F.mse_loss(current_Q1, target_Q) + F.mse_loss(current_Q2, target_Q)
# Optimize the critic and environment
self.critic_optimizer.zero_grad()
critic_loss.backward()
self.critic_optimizer.step()