[SOLVED] Loss is backprop-ing correctly, but optimizer.step() updates nothing

The title pretty much describes my problem. I’m running into this in an implementation of DQN inspired by the one in the official tutorial (http://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html). However, I suspect the issue isn’t specific to that context and is probably a more fundamental misunderstanding on my part. I’ve pasted a minimal script that reproduces the behavior.

import torch
from torch import Tensor, LongTensor
from torch.autograd import Variable
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F

GAMMA = 0.99

class SimpleNet(nn.Module):
  def __init__(self):
    super(SimpleNet, self).__init__()
    self.head = nn.Linear(10, 3)
  def forward(self, x):
    return self.head(x)

current_Q = SimpleNet()
target_Q = SimpleNet()

optimizer = optim.Adam(current_Q.parameters())

initial_current_params = list(current_Q.parameters())
initial_target_params = list(target_Q.parameters())
initial_current_grad = list(current_Q.parameters())[0].grad
initial_target_grad = list(target_Q.parameters())[0].grad

state_batch = Variable(torch.randn(1,10))
action_batch = Variable(LongTensor([[0]]))
reward_batch = Variable(Tensor([0.0]))

current_state_values = current_Q(state_batch)
state_action_values = current_state_values.gather(1, action_batch)

non_final_next_states = Variable(torch.randn(1,10))
next_state_values = target_Q(non_final_next_states).gather(1, current_Q(non_final_next_states).max(1)[1])
expected_state_action_values = reward_batch + (GAMMA * next_state_values)

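huber_loss = F.smooth_l1_loss(state_action_values, expected_state_action_values)

# Backpropagate the loss and apply one optimizer update
optimizer.zero_grad()
huber_loss.backward()
optimizer.step()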

final_current_params = list(current_Q.parameters())
final_target_params = list(target_Q.parameters())
final_current_grad = list(current_Q.parameters())[0].grad
final_target_grad = list(target_Q.parameters())[0].grad

initial_current_params, initial_target_params, initial_current_grad, initial_target_grad, final_target_params, final_current_grad and final_target_grad are all what I would expect them to be, but I don’t understand why initial_current_params is equal to final_current_params.

Any help with understanding this would be greatly appreciated.

At what point are you comparing initial_current_params with final_current_params?

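At the end of your program, initial_current_params and final_current_params will have the same values because both lists hold references to the same Parameter objects. list(current_Q.parameters()) builds a new list but does not copy the underlying tensors, and optimizer.step() updates the parameters in place, so initial_current_params reflects the updated values too.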

To get a copy of the parameter values:

initial_current_params = [p.data.clone() for p in current_Q.parameters()]
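
For anyone who wants to see the difference side by side, here is a minimal sketch. It assumes a recent PyTorch version (no Variable wrapper, and p.detach().clone() in place of p.data.clone()); the single nn.Linear layer, the squared-output loss, and the names aliased and snapshot are made up for illustration and are not from the script above.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical stand-in for the network in the question.
net = nn.Linear(10, 3)
optimizer = optim.Adam(net.parameters())

# A new list, but it holds the very same Parameter objects the optimizer updates.
aliased = list(net.parameters())
# Independent copies of the parameter values taken before the update.
snapshot = [p.detach().clone() for p in net.parameters()]

# One update step on an arbitrary illustrative loss.
loss = net(torch.randn(1, 10)).pow(2).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

current = list(net.parameters())

# The aliased list always matches the live parameters, so it looks like "nothing changed".
aliased_unchanged = all(torch.equal(a, c) for a, c in zip(aliased, current))
# The cloned snapshot still holds the old values, so it differs after the step.
snapshot_changed = any(not torch.equal(s, c) for s, c in zip(snapshot, current))

print(aliased_unchanged)  # True
print(snapshot_changed)   # True
```

Comparing against the snapshot rather than the aliased list is what actually shows whether optimizer.step() changed anything.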

That makes a TON of sense. That was it. I figured it was something stupidly simple. Thanks!