Does FSDP have support for this? I only want to shard certain parts of the model. I know FullyShardedDataParallel has an ignored_modules arg, but I don't understand how that is supposed to work when one has nn.Linear layers that should sometimes be sharded and sometimes not.
I was also wondering whether one could manually call torch.distributed.fsdp.wrap on individual parts of the model, but some initial tests suggest that approach does not work (it seems the root module needs to be of type FullyShardedDataParallel)?
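If I understand the wrap utility correctly (a sketch of my reading of it, not authoritative), wrap() is only effective inside an enable_wrap context, which supplies the wrapper class; outside that context it returns the module unchanged, which would explain why bare calls appear to do nothing:

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import enable_wrap, wrap

def build_layer(shard: bool) -> nn.Module:
    if shard:
        # Inside the context, wrap() replaces the module with FSDP(module)
        # (this path needs an initialized process group to actually run).
        with enable_wrap(wrapper_cls=FSDP):
            return wrap(nn.Linear(8, 8))
    # Outside any enable_wrap context, wrap() is a no-op and the plain
    # nn.Linear comes back untouched.
    return wrap(nn.Linear(8, 8))
```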
Related to that is the question of what happens to the outer (root) FSDP unit during training. This is mentioned in the FSDP advanced tutorial on the transformer wrapping policy. Does the root unit get unloaded (resharded) while other (child) FSDP units are running, or is it always kept in the unsharded state?