Dear Community, I’m trying to understand why the following meta-learning pseudo-code works. Could you please give me some guidance?
import torch
from torch import autograd
from torch.optim import Adam

param: dict[str, torch.Tensor]  # tensors created with requires_grad=True
optimizer = Adam(params=param.values())
inner_lr = 0.01  # inner-loop step size

def inner_loop(parameters, data):
    # clone every tensor; clone() keeps the clones attached to the autograd graph
    cloned_param = {name: p.clone() for name, p in parameters.items()}
    # calculate something with cloned_param (using data) and get the loss
    loss = compute_loss(cloned_param, data)  # compute_loss is a placeholder
    gradients = autograd.grad(outputs=loss, inputs=list(cloned_param.values()),
                              create_graph=True)  # keep the inner step differentiable
    # use the gradients to update cloned_param (one gradient-descent step)
    return {name: p - inner_lr * g
            for (name, p), g in zip(cloned_param.items(), gradients)}

def outer_loop(data):
    support, query = data  # "part 1" and "part 2" of the data
    adapted_parameters = inner_loop(param, support)
    loss = compute_loss(adapted_parameters, query)
    optimizer.zero_grad()
    loss.backward()  # [Q]
    optimizer.step()
[Q] Why would this (and the optimizer step) work? The loss comes from adapted_parameters, which are cloned and then updated inside inner_loop, not from the original parameters (param).
My understanding was that param itself has to participate in the computation that produces the loss if we want .backward() to populate gradients on param. It's confusing to me that we use a loss computed from what is (virtually) a 'future' version of param to update param, and that it works.
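To make the confusion in [Q] concrete, here is the smallest self-contained example I could construct of the behavior (my own toy code, not from any library): the outer loss never touches p directly, yet p.grad gets filled in.

```python
import torch

p = torch.tensor([1.0, 2.0], requires_grad=True)  # the "original" param

# one inner step on a clone; clone() is itself a differentiable op
cloned = p.clone()
inner_loss = (cloned ** 2).sum()
grad, = torch.autograd.grad(inner_loss, cloned, create_graph=True)
adapted = cloned - 0.1 * grad  # the "future" version of p

# the outer loss touches only `adapted`, never `p` directly...
outer_loss = (adapted ** 2).sum()
outer_loss.backward()

print(p.grad)  # ...yet p.grad is populated: tensor([1.2800, 2.5600])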
[Q2] What would be a nice way to organize parameters besides a dictionary keyed by string names? For models with many layers/steps, typos in the parameter-tensor names have been a frustrating source of errors.
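To make [Q2] concrete: the only alternative I have found so far is to generate the names from an nn.Module tree and run the forward pass functionally (sketch below; MyNet is just a stand-in for my actual model, and I'm assuming torch.func.functional_call from recent PyTorch versions). Is this the idiomatic approach, or is there something better?

```python
import torch
import torch.nn as nn
from torch.func import functional_call  # available in recent PyTorch versions

class MyNet(nn.Module):  # stand-in for my actual model
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 2)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = MyNet()
# names like "fc1.weight" come from the module tree, not typed by hand
param = dict(model.named_parameters())

# run a forward pass with an arbitrary parameter dict (e.g. the adapted
# clones returned by inner_loop) instead of the module's own parameters
x = torch.randn(5, 4)
out = functional_call(model, param, (x,))
```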
Thanks a lot!