In the Llama recipes repository, there is a function called `freeze_transformer_layers`. This function can be invoked when sharding the model via FSDP.
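For context, the helper does roughly the following (a paraphrased sketch, not the exact llama-recipes source; the toy model below is hypothetical):

```python
import torch.nn as nn

def freeze_transformer_layers(model: nn.Module, num_layers: int) -> None:
    # Freeze the first `num_layers` decoder blocks by disabling gradients
    # on their parameters. Sketch of the llama-recipes helper.
    for i, layer in enumerate(model.layers):
        if i < num_layers:
            for param in layer.parameters():
                param.requires_grad = False

# Toy "model" with 4 stand-in decoder blocks to demonstrate the effect:
model = nn.Module()
model.layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
freeze_transformer_layers(model, 2)

frozen = [not p.requires_grad
          for layer in model.layers
          for p in layer.parameters()]
# First two blocks (weight + bias each) are frozen, the rest are trainable.
```

Note that this flips `requires_grad` on a per-parameter basis inside blocks that FSDP may later flatten together, which is exactly the situation the docs below caution about.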
However, the nightly (or main) FSDP docs, at the time of writing, state:
> FSDP has some constraints on freezing parameters (i.e. setting `param.requires_grad=False`). For `use_orig_params=False`, each FSDP instance must manage parameters that are all frozen or all non-frozen. For `use_orig_params=True`, FSDP supports mixing frozen and non-frozen, but we recommend not doing so since then the gradient memory usage will be higher than expected (namely, equivalent to not freezing those parameters). This means that ideally, frozen parameters should be isolated into their own `nn.Module`s and wrapped separately with FSDP.
Given these constraints, why are we able to freeze / unfreeze individual parameters in this way? Should we instead be wrapping each decoder block in its own FSDP instance?
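To make the docs' recommendation concrete, one common pattern is to give each decoder block its own FSDP instance via an auto-wrap policy, so a fully frozen block never shares a flat parameter group with a trainable one. A hedged sketch (the `DecoderBlock` class here is a stand-in; llama-recipes targets the actual `LlamaDecoderLayer` class in its policy, if I recall correctly):

```python
import functools

import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP  # noqa: F401
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

class DecoderBlock(nn.Module):
    """Stand-in for a real transformer layer class, e.g. LlamaDecoderLayer."""
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(8, 8)

# Wrap every DecoderBlock in its own FSDP unit; with per-block wrapping,
# a block whose parameters are all frozen satisfies the
# use_orig_params=False constraint ("all frozen or all non-frozen").
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={DecoderBlock},
)

# Applying FSDP requires an initialized process group, so it is only
# shown here as a comment:
# sharded = FSDP(model, auto_wrap_policy=auto_wrap_policy,
#                use_orig_params=True)
```

With this layout, freezing whole blocks (as `freeze_transformer_layers` does) lines up with FSDP unit boundaries, rather than mixing frozen and trainable parameters inside one flattened group.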