Is it possible to run FSDP on a module that is not part of the optimizer?

I have a teacher and a student, both submodules of the same model.

Currently I add the whole model to the optimizer, but I only wrap the teacher transformer blocks in FSDP.
I assume this means that even though my teacher's forward pass runs under torch.no_grad, it still has fp16 weights (2 bytes), fp16 gradients (2 bytes), and optimizer states (fp32 weights plus Adam states, 12 bytes) stored. With mixed precision that is a total of 16 bytes/parameter, sharded across 6 GPUs, so 16/6 bytes/parameter per GPU. (In these calculations I assume that the DeepSpeed ZeRO stage 3 documentation and its way of computing memory usage also apply to FSDP.)
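
For reference, here is roughly what my current setup looks like. This is only a minimal sketch with placeholder names (`teacher`/`student` stand in for my real submodules), and it assumes torch.distributed has already been initialized, e.g. via torchrun:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Placeholder model: `teacher` and `student` stand in for my real submodules.
class Model(nn.Module):
    def __init__(self, d=512, n_blocks=4):
        super().__init__()
        self.teacher = nn.ModuleList([nn.Linear(d, d) for _ in range(n_blocks)])
        self.student = nn.ModuleList([nn.Linear(d, d) for _ in range(n_blocks)])

model = Model().cuda()

# fp16 mixed precision for the FSDP-wrapped parts
mp16 = MixedPrecision(param_dtype=torch.float16,
                      reduce_dtype=torch.float16,
                      buffer_dtype=torch.float16)

# Only the teacher transformer blocks get wrapped in FSDP...
for i, block in enumerate(model.teacher):
    model.teacher[i] = FSDP(block, mixed_precision=mp16)

# ...but the optimizer still sees the whole model (teacher + student),
# so the teacher presumably keeps grads + Adam states:
# 2 (fp16 weights) + 2 (fp16 grads) + 12 (fp32 weights + Adam states)
# = 16 bytes/param, sharded across 6 GPUs -> ~16/6 bytes/param per GPU.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```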

I wonder: is it possible to get rid of the gradients and optimizer states for my teacher and still apply FSDP to it? I found that if I don't include it in the optimizer I get an error with FSDP…
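
Concretely, what I would like is something like the following (same placeholder names as above); this is the kind of setup that gives me the error once the teacher blocks are FSDP-wrapped:

```python
# Build the optimizer from the student's parameters only,
# leaving the FSDP-wrapped teacher out of the optimizer entirely.
optimizer = torch.optim.AdamW(model.student.parameters(), lr=1e-4)
```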

Old method to reduce memory in the teacher: as an alternative to FSDP, I set my teacher to fp16 with requires_grad=False and no FSDP, so only the fp16 weights are stored - 2 bytes/param for the teacher.
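
In code, the old method was essentially this (placeholder names again):

```python
# No FSDP on the teacher: just cast it to fp16 and freeze it,
# so only the fp16 weights (2 bytes/param) are kept, replicated on every GPU.
model.teacher = model.teacher.half()
for p in model.teacher.parameters():
    p.requires_grad_(False)
```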

My question: would FSDP reduce memory more than my old method, i.e. is there a way to shard the remaining 2 bytes/param across the 6 GPUs using FSDP?
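
To make the comparison concrete, this is the back-of-the-envelope arithmetic I have in mind (purely illustrative parameter count):

```python
n_params = 1_000_000_000   # hypothetical teacher size
gpus = 6

old_method   = 2 * n_params          # fp16 weights replicated on each GPU
fsdp_sharded = 2 * n_params / gpus   # fp16 weights sharded across 6 GPUs (ideal case)

print(f"old method:   {old_method / 1e9:.2f} GB per GPU")
print(f"FSDP sharded: {fsdp_sharded / 1e9:.2f} GB per GPU")
```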