Hi,
I am running into a strange problem where backpropagation does not update all parameters as expected. The relevant code snippet is shown below. All actor/critic/env nets are simple MLPs. I checked the weights by printing the sum of certain layers: the critic_1 and critic_2 networks get updated properly, but the weights of the env net stay the same the whole time. Am I doing something completely wrong? I always thought that setting the requires_grad flag to True is sufficient to enable backpropagation.
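For reference, this is roughly how I check the weights after each update step (just a sketch, using named_parameters to print the parameter sums of the three nets that appear in the snippet below):

for net_name, net in [("critic_1", self.critic_1), ("critic_2", self.critic_2), ("env", env)]:
    for param_name, param in net.named_parameters():
        # the sums of the critic layers change after every update step,
        # the sums of the env layers never change
        print(net_name, param_name, param.data.sum().item())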
Any help is highly appreciated.
P.S. I also don’t understand why I have to run the state = state.to(device) line. Right before returning the state inside the env.step(last_action) method, the state is still on the GPU, but after it is returned it is on the CPU.
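This is the kind of check that shows it (just a sketch of the print statements I use):

# inside env.step(), right before the state is returned:
print(state.device)  # still reports the GPU device here
# back in the training code, right after the call:
state, _, _ = env.step(last_action)
print(state.device)  # now reports cpu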
# Sample replay buffer, all elements are detached by using the sample function
last_state, last_action, next_state, reward, done = replay_buffer.sample(self.batch_size)
# enable gradient computation
last_state.requires_grad_(True)
last_action.requires_grad_(True)
# run the env net forward on the sampled batch
input_seed = env.get_input_seed().repeat(len(last_state)).unsqueeze(1)
env.set_state(last_state)
env.set_input_seed(input_seed)
state, _, _ = env.step(last_action)
state = state.to(device)  # why do I have to move this to the device again?
action = self.actor(state)
with torch.no_grad():
    # Select action according to policy
    next_action = self.actor_target(next_state)
    reward = reward
    # Compute the target Q value
    target_Q1 = self.critic_target_1(next_state, next_action)
    target_Q2 = self.critic_target_2(next_state, next_action)
    target_Q = torch.min(target_Q1, target_Q2)
    target_Q = reward + (1 - done) * self.gamma * target_Q
# Get current Q estimates
current_Q1 = self.critic_1(state, action)
current_Q2 = self.critic_2(state, action)
# Compute critic loss
critic_loss = F.mse_loss(current_Q1, target_Q) + F.mse_loss(current_Q2, target_Q)
# Optimize the critic and environment
self.critic_optimizer.zero_grad()
critic_loss.backward()
self.critic_optimizer.step()