Does FSDP1 and/or FSDP2 work on GPU only? I didn’t see any explicit documentation saying so. Conceptually I don’t see much difference between sharding across GPUs (processes) and sharding across CPUs (processes), but it doesn’t look like PyTorch FSDP can do it. I was looking to shard across CPU processes bound to NUMA nodes within a single machine.
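For reference, a minimal sketch of the kind of setup I had in mind. The Gloo backend and the `fully_shard` wrapping are my assumptions about how this would be wired up, not a working recipe:

```python
# Sketch of a CPU-only FSDP2 setup (assumed, not known to work).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard  # FSDP2 entry point (PyTorch >= 2.6)

def main():
    # Gloo is the CPU-capable backend; NCCL requires GPUs.
    dist.init_process_group(backend="gloo")
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024),
        torch.nn.Linear(1024, 1024),
    )
    fully_shard(model)  # the step that currently appears to assume CUDA/ROCm
    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launched via torchrun, one process per NUMA node
```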
FSDP1/FSDP2 use CUDA `wait_event` and `wait_stream` heavily to achieve compute/communication overlap. That’s why we always assume CUDA/ROCm availability.
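For readers unfamiliar with the pattern, here is a minimal sketch (not FSDP’s actual implementation) of how `wait_stream` overlaps an all-gather with compute; the function names are illustrative:

```python
# Overlapping communication with compute via a side CUDA stream.
import torch
import torch.distributed as dist

comm_stream = torch.cuda.Stream()

def prefetch_all_gather(shard, out):
    # Order the side stream after work already queued on the current stream,
    # then enqueue the all-gather there; the CPU returns immediately, and
    # compute kernels on the default stream can run concurrently.
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        dist.all_gather_into_tensor(out, shard)

def consume_gathered(out):
    # GPU-side sync only: the default stream waits for the gather to finish
    # before kernels that read `out` may start; the CPU never blocks.
    torch.cuda.current_stream().wait_stream(comm_stream)
    return out @ out.T  # stand-in for the layer's forward compute
```

CPU backends like Gloo don’t expose these stream hooks, which is what the effort linked below is about.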
There is an ongoing effort to make FSDP2 + Gloo work: Work: wait_stream API by d4l3k · Pull Request #156883 · pytorch/pytorch · GitHub
cc @d4l3k if you have more info
Fair enough. I was afraid I was doing something wrong.