Layer-wise learning rate in FSDP

Hi, I’m trying to set different learning rates for different layers in FSDP models.

With DDP, model.named_parameters() gives me the names and corresponding weights of every layer, so I can simply filter out the norm layers and apply no weight decay to them.
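
For reference, this is roughly what I do in the DDP case (SmallNet and the name-based "norm"/"bias" filter are just illustrative; adapt them to your own model's parameter names):

```python
import torch
import torch.nn as nn

# Illustrative model with a named norm layer so it can be filtered by name.
class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 128)
        self.norm = nn.LayerNorm(128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        return self.fc2(self.norm(self.fc1(x)))

model = SmallNet()

decay, no_decay = [], []
for name, param in model.named_parameters():
    # Put norm-layer parameters (and biases) in the no-weight-decay group.
    if "norm" in name or name.endswith("bias"):
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)
```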

With FSDP, however, named_parameters() only returns entries like ._fsdp_wrapped_module.flat_param, each of which is a single flattened tensor. I’m wondering if it is possible to apply no weight decay to the norm layers inside FSDP as well.

You can try passing use_orig_params=True to the FSDP constructor 🙂
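
For example, something like this (a sketch only; it assumes the default process group has already been initialized, e.g. via torchrun, and reuses the model from the question):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Sketch: assumes torch.distributed is already initialized
# (e.g. launched with torchrun) and `model` is an ordinary nn.Module.
fsdp_model = FSDP(
    model,
    use_orig_params=True,  # expose the original parameters via named_parameters()
)
```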

Then, named_parameters() will return the original parameters. Note, though, that each original parameter’s data will be a 1D tensor containing only the local rank’s shard. Still, you should be able to apply separate learning rates (or other hyperparameters) to each original parameter.
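
So the same name-based grouping from the DDP case should carry over, something like this (again a sketch, with made-up hyperparameter values):

```python
import torch

# Group the original parameters exactly as in the DDP case; each `param`
# here is a 1D shard, but it can still be placed in its own param group.
decay, no_decay = [], []
for name, param in fsdp_model.named_parameters():
    if "norm" in name or name.endswith("bias"):
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01, "lr": 1e-3},
        {"params": no_decay, "weight_decay": 0.0, "lr": 1e-4},
    ]
)
```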