I am trying to implement the meta-learning paper “Learning to learn by gradient descent by gradient descent” by Andrychowicz et al. (2016). In this paper, the hand-designed update step of stochastic gradient descent is replaced by the output of an LSTM. Specifically, for every scalar parameter theta of an optimizee network n, there is a separate LSTM m(theta) that takes the derivative of a loss function f with respect to theta as input and proposes an update step for theta. All LSTM networks share the same weights but keep their own hidden and cell states. Each LSTM network consists of two stacked LSTM cells with a hidden size of 20, followed by a final linear layer that maps the 20-dimensional output of the second LSTM cell to a scalar value.
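As a minimal sketch of the architecture described above (not code from the paper), the per-parameter LSTMs can be expressed as one module whose batch dimension indexes the scalar parameters. The class name `CoordinatewiseLSTMOptimizer` and the helper `init_state` are hypothetical names chosen for illustration.

```python
import torch
import torch.nn as nn


class CoordinatewiseLSTMOptimizer(nn.Module):
    """Hypothetical name: one shared two-layer LSTM applied to each scalar gradient."""

    def __init__(self, hidden_size=20):
        super().__init__()
        self.hidden_size = hidden_size
        # Two stacked LSTM cells; the first one sees a single scalar gradient.
        self.cell1 = nn.LSTMCell(1, hidden_size)
        self.cell2 = nn.LSTMCell(hidden_size, hidden_size)
        # Final linear layer mapping the hidden output to a scalar update.
        self.out = nn.Linear(hidden_size, 1)

    def init_state(self, num_coords):
        # One (h, c) pair per layer, with one row per scalar parameter.
        def zeros():
            return torch.zeros(num_coords, self.hidden_size)
        return [(zeros(), zeros()), (zeros(), zeros())]

    def forward(self, grads, state):
        # grads has shape (num_coords, 1). Rows are processed independently,
        # so the weights are shared while hidden and cell states stay separate.
        h1, c1 = self.cell1(grads, state[0])
        h2, c2 = self.cell2(h1, state[1])
        update = self.out(h2)
        return update, [(h1, c1), (h2, c2)]


opt_net = CoordinatewiseLSTMOptimizer()
grads = torch.randn(5, 1)            # gradients of 5 scalar parameters
state = opt_net.init_state(5)
update, state = opt_net(grads, state)
```

Treating each scalar parameter as a batch element gives weight sharing for free and avoids a Python loop over individual parameters.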
After predicting a sequence of update steps for the parameters of the optimizee network n, the shared LSTM weights are updated using the gradient, with respect to those weights, of the loss function f accumulated over the different time steps. Since each scalar parameter of the optimizee network is associated with a different LSTM, it seemed to me that the best way to implement the algorithm proposed in the paper is to loop over every single scalar parameter of the optimizee network and update each one based on the output of its associated LSTM, as in the code excerpt below:

    def meta_step(self, num_unrolls=20, meta_optimizer=None, data=None):
        total_loss = 0.0
        prediction = None
        loss = None
        for unroll in range(num_unrolls):
            self.optimizee.zero_grad()
            if data is not None:
                pass
            else:
                prediction = self.optimizee()
            loss = self.optimizee_loss_fn(prediction)
            loss.backward()
            total_loss += loss.item()
            for param in self.optimizee.parameters():
                new_param = torch.zeros_like(param).view(-1)
                grad_vec = param.grad.view(-1)
                for index, value in enumerate(param.view(-1)):
                    grad = grad_vec[index].view(1)
                    new_param[index] = param[index] + self.models[index](grad)[0]
                with torch.no_grad():
                    param.copy_(new_param)
        for layer in self.lstm_parameters:
            for param in layer:
                print(param.grad)
        print(self.linear_paramters[0].grad)
        print(self.linear_paramters[1].grad)
        input()

Without the torch.no_grad() block, PyTorch prevents me from changing the values in param in place, because it considers param to be a leaf tensor. In either case, however, the .grad attribute of every tensor in lstm_parameters and linear_parameters turns out to be None. Furthermore, PyTorch considers some of the lstm_parameters and linear_parameters to be non-leaf tensors. I cannot quite understand why these problems occur, although I have read the basic documentation on the autograd mechanics. I would appreciate an explanation of the reasons behind these problems and a possible solution to them. To clarify, I implemented my own custom LSTM class, and this class references the ‘global’ lstm_parameters and linear_parameters, which is how the weights are shared between the LSTM networks.
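The symptoms can be reproduced in isolation. The sketch below is one possible reading, not a diagnosis of the actual classes: it assumes a single scalar `meta_w` standing in for the shared LSTM weights and a toy loss `theta ** 2`, both hypothetical. It contrasts an in-place `copy_` under `torch.no_grad()` with an update that keeps the new parameter as a non-leaf tensor, and shows one common way a tensor ends up as a non-leaf.

```python
import torch

# Stand-in for a shared optimizer weight (hypothetical, not the LSTM itself).
meta_w = torch.nn.Parameter(torch.tensor(0.1))

# Case 1: in-place update under no_grad, mirroring the excerpt above.
theta = torch.nn.Parameter(torch.tensor(1.0))
loss = theta ** 2
loss.backward()
new_theta = theta + meta_w * theta.grad     # this tensor does depend on meta_w
with torch.no_grad():
    theta.copy_(new_theta)                  # copies values only; nothing is recorded
loss = theta ** 2                           # graph contains theta, but not meta_w
loss.backward()
grad_after_copy = meta_w.grad               # still None

# Case 2: keep the updated parameter as a non-leaf tensor.
theta0 = torch.tensor(1.0, requires_grad=True)
loss0 = theta0 ** 2
(g,) = torch.autograd.grad(loss0, theta0)   # gradient used as a plain input
theta1 = theta0 + meta_w * g                # non-leaf, still attached to meta_w
loss1 = theta1 ** 2
loss1.backward()
grad_with_graph = meta_w.grad               # now populated

# A tensor produced by an operation is a non-leaf, even if it "looks like" a weight.
scaled = torch.randn(3, requires_grad=True) * 0.1
direct = torch.nn.Parameter(torch.randn(3) * 0.1)
print(grad_after_copy, grad_with_graph, scaled.is_leaf, direct.is_leaf)
```

If the globally shared weights are built by applying an operation to a tensor that already requires gradients, they would be non-leaf tensors in the same way as `scaled`, and their `.grad` would not be populated by `backward()` by default.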
Edit: On second thought, I probably do not need a list of models. I can use a single instance of the official PyTorch LSTM implementation and give it different hidden and cell states depending on which parameter of the optimizee I want to update. However, the problems remain even when I use a single dummy variable of type nn.Parameter as the proposed update step.
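A minimal sketch of that single-LSTM idea follows, under stated assumptions: the optimizee is a hypothetical quadratic `f(theta)` with 10 scalar parameters held in a plain tensor rather than an `nn.Module`, and the meta-loss is taken to be the sum of the optimizee losses over the unroll. Each scalar parameter is a batch element of one `nn.LSTM`, and the updated `theta` is kept as a non-leaf tensor so the graph back to the LSTM weights survives.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Shared optimizer network: a single two-layer LSTM plus a linear head.
lstm = nn.LSTM(input_size=1, hidden_size=20, num_layers=2)
head = nn.Linear(20, 1)
meta_optimizer = torch.optim.Adam(
    list(lstm.parameters()) + list(head.parameters()), lr=1e-3
)

# Hypothetical optimizee: f(theta) = ||W theta - y||^2 with 10 scalar parameters.
W, y = torch.randn(10, 10), torch.randn(10)


def f(theta):
    return ((W @ theta - y) ** 2).sum()


theta = torch.zeros(10, requires_grad=True)
state = None                          # nn.LSTM uses zero states when given None
meta_loss = 0.0
for _ in range(20):                   # number of unrolled steps
    loss = f(theta)
    meta_loss = meta_loss + loss
    # Gradient w.r.t. theta, used as a plain input (no second-order terms).
    (grad,) = torch.autograd.grad(loss, theta, retain_graph=True)
    # Shape (seq_len=1, batch=10, input=1): one batch element per scalar parameter,
    # so each parameter gets its own hidden and cell state.
    out, state = lstm(grad.view(1, -1, 1), state)
    update = head(out).view(-1)
    theta = theta + update            # non-leaf: stays connected to the LSTM weights

meta_optimizer.zero_grad()
meta_loss.backward()                  # populates .grad on the LSTM and head weights
meta_optimizer.step()
```

To continue past one unroll, `theta` and `state` would presumably need to be detached before the next meta-step so that the graph does not grow without bound.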