Passing per-parameter options within an FSDP-wrapped model

Hi, I’m training a model using FSDP and wanted to use different learning rates for different parameters.

To get the full names of the parameters, I used the summon_full_params() context manager and then filtered the parameters by name into two buckets, param_group_1 and param_group_2 (sketched below).
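
Roughly, the grouping step looks like this (a minimal sketch; the "decoder" name check is just a placeholder for my actual filter, and fsdp_model is my FSDP-wrapped model):

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Gather the full (unsharded) parameters so that named_parameters() yields
# the original parameter names, then split the parameters by name.
param_group_1, param_group_2 = [], []
with FSDP.summon_full_params(fsdp_model):
    for name, param in fsdp_model.named_parameters():
        if "decoder" in name:  # placeholder for my real name filter
            param_group_1.append(param)
        else:
            param_group_2.append(param)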

Then I pass these groups into the optimizer as:

torch.optim.SGD(
    [{"params": param_group_1, "lr": 1e-3}, {"params": param_group_2, "lr": 1e-4}]
)

However, this fails with:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

If I pass the full set of model parameters as a single group, it works, but I would like to fine-tune the learning rates per group:

torch.optim.SGD(
    fsdp_model.parameters(), lr=1e-3
)

Is there a correct way of doing this? Am I finding the parameter groups correctly?

Different learning rates for different parameters are not well supported with FSDP at the moment, but we are working on it. This support may be available in the release after 1.13 (where 1.13 is the upcoming release), or earlier in the nightlies. Once the support lands and stabilizes, the documentation will describe how to apply different hyperparameters to different parameters with FSDP.