A rather simple question that may be complex to answer.
Anyone know how to test distributed PyTorch code on a single GPU? I don’t care how slow it executes, just that it is numerically accurate.
I’m writing a no-nonsense blog post on different distributed PyTorch methods, but would rather not have to rent a cluster just to write it.
With JAX, I could set the XLA flag xla_force_host_platform_device_count (via the XLA_FLAGS environment variable) to make a single CPU appear as many devices for testing purposes. Does anything like this exist in PyTorch?
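For reference, the JAX version looks roughly like this (the device count of 8 is arbitrary):

```python
import os
# Must be set before JAX initializes its backends.
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"

import jax
print(jax.devices("cpu"))  # eight virtual CPU devices on a single host
```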
Hi, thanks for the question. This depends on what aspect of distributed PyTorch you would like to test. For example, if you just need to call collectives (e.g. dist.all_reduce, dist.broadcast) or use DDP, then you can use the gloo backend, which runs on CPU (dist.init_process_group(backend="gloo")).
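As a minimal sketch, something like this runs an all_reduce across four CPU processes (the world size, rendezvous address, and port here are arbitrary choices):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Rendezvous over localhost; the port just needs to be free.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

    # Each rank contributes a tensor filled with its own rank;
    # after all_reduce(SUM), every rank holds the same summed tensor.
    t = torch.full((4,), float(rank))
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```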
If you need something specific to GPU, for example testing NCCL collectives or using FSDP, then unfortunately we don’t currently have a way to virtually split a single GPU to act as multiple.
> This depends on what aspect of distributed PyTorch you would like to test.
Sadly, I want to test… all of them! My lofty goal is 3D parallelism, but I would settle for 2D.
So that would include things like FSDP, NCCL collectives, and pipelining.
> We don’t currently have a way to virtually split a single GPU to act as multiple.
So the conclusion is that I should probably start using my wallet to rent a small cluster.
I saw that a few datacentre inference GPUs do support splitting a physical GPU into virtual ones, but nothing on consumer hardware. If PyTorch doesn’t support it, do you know of anything on a consumer GPU that could work?
I assume you are referring to MIG on NVIDIA data center GPUs. If so, note that no communication between MIG slices is supported, and only a single process per MIG instance (without IPC) is supported, so MIG won’t help.
MIG was one, yes. I think vGPU may also be a candidate (as in, the subcommand nvidia-smi vgpu); however, this is also not available on my hardware. I wasn’t aware that MIG didn’t support comms between slices, though.