Layer-wise learning rate in FSDP

Hi, I’m trying to set different learning rates for different layers in FSDP models.

With DDP, model.named_parameters() gives me the names and corresponding weights of every layer, so I can simply filter out the norm layers and apply no weight decay to them.
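
For reference, this is roughly what I do in the DDP case (SmallNet and the name-based "norm"/"bias" filter are just illustrative; adapt them to your own model's parameter names):

```python
import torch
import torch.nn as nn

# Illustrative model with a named norm layer so it can be filtered by name.
class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 128)
        self.norm = nn.LayerNorm(128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        return self.fc2(self.norm(self.fc1(x)))

model = SmallNet()

decay, no_decay = [], []
for name, param in model.named_parameters():
    # Put norm-layer parameters (and biases) in the no-weight-decay group.
    if "norm" in name or name.endswith("bias"):
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)
```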

With FSDP, however, named_parameters() only returns entries like ._fsdp_wrapped_module.flat_param, each of which is a single flattened tensor. I’m wondering if it is possible to apply no weight decay to the norm layers inside FSDP as well.

You can try passing use_orig_params=True to the FSDP constructor 🙂
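
For example, something like this (a sketch only; it assumes the default process group has already been initialized, e.g. via torchrun, and reuses the model from the question):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Sketch: assumes torch.distributed is already initialized
# (e.g. launched with torchrun) and `model` is an ordinary nn.Module.
fsdp_model = FSDP(
    model,
    use_orig_params=True,  # expose the original parameters via named_parameters()
)
```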

Then, named_parameters() will return the original parameters. Note, though, that each original parameter’s data will be a 1D tensor containing only the local rank’s shard. Still, you should be able to apply separate learning rates (or other hyperparameters) to each original parameter.
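
So the same name-based grouping from the DDP case should carry over, something like this (again a sketch, with made-up hyperparameter values):

```python
import torch

# Group the original parameters exactly as in the DDP case; each `param`
# here is a 1D shard, but it can still be placed in its own param group.
decay, no_decay = [], []
for name, param in fsdp_model.named_parameters():
    if "norm" in name or name.endswith("bias"):
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01, "lr": 1e-3},
        {"params": no_decay, "weight_decay": 0.0, "lr": 1e-4},
    ]
)
```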