How to configure FSDP to train a 13B model on 2 A100 80GB GPUs


I am trying to finetune a 13B model on 2 A100 80GB GPUs with FSDP, but I keep hitting OOM errors. I have enabled CPU offloading, yet it does not resolve the problem. Any suggestions? Can FSDP actually make it possible to train a model this large on such limited GPU resources?
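One thing worth checking: CPU offloading on its own does not help if FSDP wraps the model as a single flat unit, because the full parameter set still gets all-gathered onto the GPU at once. A per-transformer-block wrap policy plus mixed precision is usually what makes the difference. Below is a minimal sketch of the kind of configuration I would expect to fit a 13B model on 2×80GB. The `shard_model` function and `layer_cls` argument are placeholders of mine, not part of any library; only the `torch.distributed.fsdp` names are real API.

```python
import functools

import torch
from torch.distributed.fsdp import (
    CPUOffload,
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


def shard_model(model, layer_cls):
    """Wrap `model` with FSDP settings aimed at fitting ~13B params on 2 GPUs.

    `layer_cls` is your model's transformer block class (an assumption here,
    e.g. LlamaDecoderLayer). Wrapping at block granularity lets FSDP free
    each block's full parameters right after its forward/backward pass.
    """
    return FSDP(
        model,
        # Shard parameters, gradients, and optimizer state across both GPUs.
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        # Keep the sharded parameters on CPU; stream them to GPU per block.
        cpu_offload=CPUOffload(offload_params=True),
        # bf16 params/grads roughly halve activation and communication memory.
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
            buffer_dtype=torch.bfloat16,
        ),
        # Without a per-block policy, FSDP all-gathers the whole model at
        # once and OOMs regardless of offloading.
        auto_wrap_policy=functools.partial(
            transformer_auto_wrap_policy, transformer_layer_cls={layer_cls}
        ),
        # Rate-limit prefetch so at most one extra block's full parameters
        # are resident on the GPU at a time.
        limit_all_gathers=True,
        device_id=torch.cuda.current_device(),
    )
```

If this still OOMs, activation checkpointing on the same block class is the next lever, since with full sharding plus offload the activations tend to dominate GPU memory at 13B scale.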


Also interested in this.