Hello, I’m currently trying to wrap my model, which contains some frozen params, with nested FSDP. I came across this warning in the torch.distributed.fsdp docs:
> FSDP has some constraints on freezing parameters (i.e. setting `param.requires_grad=False`). For `use_orig_params=False`, each FSDP instance must manage parameters that are all frozen or all non-frozen. For `use_orig_params=True`, FSDP supports mixing frozen and non-frozen, but we recommend not doing so since then the gradient memory usage will be higher than expected (namely, equivalent to not freezing those parameters). This means that ideally, frozen parameters should be isolated into their own `nn.Module`s and wrapped separately with FSDP.
I didn’t want to set `use_orig_params=True`, so, as suggested, I isolated the frozen parameters into their own `nn.Module`s and then wrapped every submodule with FSDP, but I’m still getting this error:
```
ValueError: `FlatParameter` requires uniform `requires_grad`
```
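For reference, here is a stripped-down sketch of the wrapping pattern I’m using; the module names, dimensions, and process-group setup are placeholders rather than my actual code:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


class FrozenBlock(nn.Module):
    """Submodule whose parameters are all frozen."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        for p in self.parameters():
            p.requires_grad = False


class TrainableBlock(nn.Module):
    """Submodule whose parameters are all trainable."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)


class MyModel(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.frozen = FrozenBlock(dim)
        self.trainable = TrainableBlock(dim)

    def forward(self, x):
        return self.trainable(self.frozen(x))


if __name__ == "__main__":
    # Launched with torchrun, e.g.: torchrun --nproc_per_node=2 repro.py
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = MyModel().cuda()
    # Wrap the all-frozen and all-trainable submodules in their own FSDP
    # instances, then wrap the root module.
    model.frozen = FSDP(model.frozen, use_orig_params=False)
    model.trainable = FSDP(model.trainable, use_orig_params=False)
    model = FSDP(model, use_orig_params=False)
```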
Could someone please clarify what they meant by ‘ideally’ or help me overcome this issue?
Thanks!