Hi,
I am trying to finetune a 13B model on 2 A100 80GB GPUs with FSDP, but I keep hitting OOM errors. I have enabled CPU offloading, but it does not seem to help. Any suggestions? More generally, does FSDP make it possible to train very large models on limited GPU resources?
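For context, here is my rough memory estimate (just a back-of-envelope sketch, assuming a standard mixed-precision Adam setup and ignoring activations and temporary buffers), which already suggests the sharded training state alone may be close to the limit:

```python
# Back-of-envelope memory estimate for finetuning a 13B model with FSDP.
# Assumptions: mixed-precision Adam -- fp16 params and grads, plus an
# fp32 master copy of the weights and two fp32 optimizer moments.
# Activations, buffers, and fragmentation are NOT counted.

PARAMS = 13e9
GB = 1024 ** 3

fp16_params  = PARAMS * 2      # model weights in fp16 (2 bytes/param)
fp16_grads   = PARAMS * 2      # gradients in fp16
fp32_master  = PARAMS * 4      # fp32 copy of weights kept by the optimizer
adam_moments = PARAMS * 4 * 2  # exp_avg + exp_avg_sq, both fp32

total = fp16_params + fp16_grads + fp32_master + adam_moments
per_gpu = total / 2            # ideal FULL_SHARD split across 2 GPUs

print(f"total:   {total / GB:.0f} GiB")    # ~194 GiB
print(f"per GPU: {per_gpu / GB:.0f} GiB")  # ~97 GiB, already over 80 GiB
```

If this estimate is roughly right, even a perfect 2-way shard of parameters, gradients, and optimizer state exceeds 80 GB per GPU before activations are counted, which is why I would expect CPU offloading (or more GPUs) to be necessary.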
Thanks!