FSDP - Fully Sharded Data Parallel on a model that has requires_grad=False

Is it possible to use FSDP to wrap a teacher model that is in eval mode and has requires_grad=False?
My setup is knowledge distillation between a huge teacher and a small student. I'd like to partition only the teacher's weights across 6 GPUs, or both the teacher's and the student's.
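
For context, here is a minimal sketch of what I have in mind (assuming the process group is already initialized, e.g. via torchrun; the teacher/student modules and the KL-based loss are placeholders for my actual models):

    import torch
    import torch.nn.functional as F
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    # Freeze the teacher before wrapping it
    teacher.eval()
    for p in teacher.parameters():
        p.requires_grad_(False)

    teacher = FSDP(teacher)   # shard only the frozen teacher
    student = FSDP(student)   # optionally shard the student as well

    def distill_loss(inputs, temperature=2.0):
        # Teacher forward pass needs no gradients
        with torch.no_grad():
            t_logits = teacher(inputs)
        s_logits = student(inputs)
        return F.kl_div(
            F.log_softmax(s_logits / temperature, dim=-1),
            F.softmax(t_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2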

FSDP should be able to work on the teacher only. If you see something like the gradients of your student model not being computed correctly, feel free to file an issue on GitHub!

Thank you! Indeed, I managed to wrap the transformer blocks of my teacher only.
However, I see there are multiple ways of defining the wrapping policy.
I have 2 questions here:

  1. How is it possible to have more FSDP modules than the number of ranks (GPUs)? Is there documentation explaining this? I understood so far that each rank stores only one shard of parameters/optimizer states/gradients, but can there be more shards/FSDP modules than the number of GPUs? If so, why?

  2. Which policy would you recommend? Are there any guidelines? What happens if I create too many FSDP modules? These are the two policies I am comparing:

    import functools
    from torch.distributed.fsdp.wrap import (
        transformer_auto_wrap_policy, size_based_auto_wrap_policy)

    # Option 1: one FSDP unit per GPT2Block
    auto_wrap_policy = functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={GPT2Block}
    )
    # Option 2: wrap any submodule with at least 2M parameters
    auto_wrap_policy = functools.partial(
        size_based_auto_wrap_policy, min_num_params=2000000
    )
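
For reference, this is roughly how I pass the chosen policy to the FSDP constructor (the device_id argument is just what I use in my setup):

    import torch
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    teacher = FSDP(
        teacher,
        auto_wrap_policy=auto_wrap_policy,   # either of the two policies above
        device_id=torch.cuda.current_device(),
    )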

Thanks for your feedback; it is very useful for my work!