I want to get the optimizer state tensors for each parameter in my network.
I can do something like this to extract the optimizer statistics:
# param_groups is a list of dicts, so iterate over the groups first
parameters = [prm for group in optimizer.param_groups for prm in group['params']]
param_ = parameters[0]
state = optimizer.state
statistic1 = state[param_]['exp_avg']
statistic2 = state[param_]['exp_avg_sq']
Now which parameter of the network does param_ actually correspond to? I have a transformer network where many of the encoder layers have the same shape, so I won't be able to rely on shape information alone.
I know that torch.optim and torch.nn.Module are independent, so this might be tricky.
If I look at the optimizer state dict,
optimizer_state_dict['param_groups'][0]['params']
is just a list of integers [0, 1, 2, …].
For example, how would I get the exp_avg for layer1.attention.key.dense.weight?
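To make the question concrete, here is a minimal sketch of the kind of lookup I am after. It assumes the optimizer was built from model.parameters(), so that optimizer.state is keyed by the very same tensor objects that model.named_parameters() yields. The two-layer nn.Sequential model and the named_state dict are hypothetical stand-ins (so the names are "0.weight" etc. rather than layer1.attention.key.dense.weight), and I am not sure this is the recommended way to do it.

```python
import torch
from torch import nn

# Hypothetical stand-in model: two layers with identical shapes,
# so shape alone cannot tell the parameters apart.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Adam creates its state lazily, so take one step to populate it.
loss = model(torch.randn(2, 4)).sum()
loss.backward()
optimizer.step()

# optimizer.state is keyed by the parameter tensors themselves, so the
# names from the module can be used to look up the matching entries.
named_state = {
    name: optimizer.state[param]
    for name, param in model.named_parameters()
    if param in optimizer.state
}

# Look up the statistics by parameter name instead of by position.
exp_avg = named_state["0.weight"]["exp_avg"]
exp_avg_sq = named_state["0.weight"]["exp_avg_sq"]
```

With the real network, the key would presumably be the full dotted name as reported by model.named_parameters().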