FSDP: Can users control which parameters are offloaded to CPU?

Hi all,

I’m training an RL actor model with FSDP. Rollout and training share the same backbone, so the same sharded weights are used for both.

  • The model weights, after sharding, fit on GPU.
  • However, the gradients and optimizer states do not fit on GPU.
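
To make the constraint concrete, here is a rough back-of-the-envelope sketch. All numbers are hypothetical placeholders rather than my actual configuration: a 7B-parameter model, 8 GPUs, fp32 sharded parameters and gradients, an Adam-style optimizer with two fp32 states per parameter, and an assumed 10 GiB per-GPU budget left over after rollout buffers and activations.

```python
def per_gpu_gib(num_params: int, world_size: int, bytes_per_param: int) -> float:
    """Per-GPU size in GiB of a tensor group sharded evenly across ranks."""
    return num_params * bytes_per_param / world_size / 2**30


# Hypothetical values chosen only for illustration.
NUM_PARAMS = 7_000_000_000   # assumed model size
WORLD_SIZE = 8               # assumed number of GPUs (FSDP ranks)
GPU_BUDGET_GIB = 10.0        # assumed memory left for model state per GPU

weights = per_gpu_gib(NUM_PARAMS, WORLD_SIZE, 4)  # fp32 sharded parameters
grads = per_gpu_gib(NUM_PARAMS, WORLD_SIZE, 4)    # fp32 sharded gradients
optim = per_gpu_gib(NUM_PARAMS, WORLD_SIZE, 8)    # two fp32 Adam states per parameter

weights_fit = weights <= GPU_BUDGET_GIB
all_fit = weights + grads + optim <= GPU_BUDGET_GIB

print(f"weights: {weights:.2f} GiB, grads: {grads:.2f} GiB, optimizer: {optim:.2f} GiB")
print(f"weights alone fit: {weights_fit}; weights + grads + optimizer fit: {all_fit}")
```

Under these assumptions the sharded weights fit comfortably on their own, but adding gradients and optimizer states exceeds the budget, which is the situation described above.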

My goal is to keep the weights on GPU for rollout while offloading gradients and optimizer states to CPU.

As far as I can tell, FSDP’s CPUOffloadPolicy currently offloads the sharded weights, gradients, and optimizer states together, and I haven’t found an option to offload only the gradients and optimizer states.

Questions:

  • Is there a way in FSDP to control which tensors (parameters, gradients, or optimizer states) are offloaded to CPU?
  • If not, are there recommended workarounds for this scenario?