The title describes my problem fairly well. I’m running into it in a DQN implementation inspired by the one in the official tutorial (http://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html). However, I suspect the issue isn’t specific to that context and comes from a more fundamental misunderstanding on my part. Below is a minimal script that reproduces the behavior.
import torch
from torch import Tensor, LongTensor
from torch.autograd import Variable
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
GAMMA = 0.99
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.head = nn.Linear(10, 3)

    def forward(self, x):
        return self.head(x)
current_Q = SimpleNet()
target_Q = SimpleNet()
optimizer = optim.Adam(current_Q.parameters())
initial_current_params = list(current_Q.parameters())
initial_target_params = list(target_Q.parameters())
initial_current_grad = list(current_Q.parameters())[0].grad
initial_target_grad = list(target_Q.parameters())[0].grad
state_batch = Variable(torch.randn(1,10))
action_batch = Variable(LongTensor([[0]]))
reward_batch = Variable(Tensor([0.0]))
current_state_values = current_Q(state_batch)
state_action_values = current_state_values.gather(1, action_batch)
non_final_next_states = Variable(torch.randn(1,10))
next_state_values = target_Q(non_final_next_states).gather(1, current_Q(non_final_next_states).max(1, keepdim=True)[1])
expected_state_action_values = reward_batch + (GAMMA * next_state_values)
huber_loss = F.smooth_l1_loss(state_action_values, expected_state_action_values)
optimizer.zero_grad()
huber_loss.backward()
optimizer.step()
final_current_params = list(current_Q.parameters())
final_target_params = list(target_Q.parameters())
final_current_grad = list(current_Q.parameters())[0].grad
final_target_grad = list(target_Q.parameters())[0].grad
initial_current_params, initial_target_params, initial_current_grad, initial_target_grad, final_target_params, final_current_grad and final_target_grad are all what I would expect them to be, but I don’t understand why initial_current_params is equal to final_current_params after optimizer.step() has run.
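One thing worth checking, offered as a sketch rather than a confirmed diagnosis of the script above: list(model.parameters()) returns references to the live Parameter objects, not copies, and optimizers update those tensors in place. If that is what is happening here, the "initial" and "final" lists would hold the very same objects and always compare equal. The snippet below uses a hypothetical stand-in network (a single nn.Linear named net) and a dummy loss, neither of which comes from the script, and compares the live references against a cloned snapshot taken before the step.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical stand-in for current_Q: a single linear layer.
net = nn.Linear(10, 3)
optimizer = optim.Adam(net.parameters())

# References to the live Parameter objects.
param_refs = list(net.parameters())
# Independent copies of the parameter values at this point in time.
param_snapshot = [p.detach().clone() for p in net.parameters()]

# One optimization step on a dummy loss (an assumption, not the Huber loss above).
loss = net(torch.randn(1, 10)).pow(2).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

after = list(net.parameters())

# True if both lists hold the very same Parameter objects.
refs_identical = all(a is b for a, b in zip(param_refs, after))
# True if at least one parameter differs from its cloned snapshot.
snapshot_changed = any(
    not torch.equal(a, b) for a, b in zip(param_snapshot, after)
)

print(refs_identical, snapshot_changed)
```

If both values print as True, the weights were updated and only the comparison was misleading. In that case, taking the initial snapshot with p.detach().clone() in the original script should make the change visible.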
Any help with understanding this would be greatly appreciated.