Data Partition to GPU Mapping

Currently, in DeepSpeed, it seems that the data-parallel ranks are interleaved across the different cards on a single node for a given number of pipeline stages. Could anyone suggest how I can map the data ranks onto the same card for an arbitrary number of pipeline stages?
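For concreteness, here is a rough sketch of how I understand the default mapping and how a custom topology might change it (class names are from DeepSpeed's `deepspeed.runtime.pipe.topology` module; the axis ordering below is my assumption about what causes the interleaving I'm seeing):

```python
# Sketch only: how axis ordering in DeepSpeed's ProcessTopology controls
# which consecutive global ranks (and hence which GPUs) share a data rank.
from deepspeed.runtime.pipe.topology import ProcessTopology, PipeDataParallelTopology

# Default pipe+data topology: axes=['pipe', 'data'], so the 'data' axis varies
# fastest -- consecutive ranks on a node end up with different data ranks.
default_topo = PipeDataParallelTopology(num_pp=4, num_dp=2)
print([default_topo.get_coord(r) for r in range(8)])

# Flipping the axis order makes 'pipe' vary fastest, so all pipeline stages of
# one data-parallel replica land on consecutive ranks/GPUs instead.
custom_topo = ProcessTopology(axes=['data', 'pipe'], dims=[2, 4])
print([custom_topo.get_coord(r) for r in range(8)])

# I believe the custom topology can then be passed to
# deepspeed.pipe.PipelineModule(..., topology=custom_topo).
```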

Sorry, I’m unsure what “data rank” means in this context. For pipeline parallelism, data should only be going into the first stage of the pipeline group, right? In interleaved 1F1B (https://arxiv.org/pdf/2104.04473.pdf) one GPU (rank) holds multiple pipeline stages; you can do this by assigning each model chunk to that specific GPU, as in the sketch below.
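A rough illustration of what I mean by assigning model chunks to a GPU (plain PyTorch, all names made up for the example):

```python
import torch
import torch.nn as nn

# Hypothetical model split into 4 chunks for a 2-GPU interleaved schedule,
# so each GPU (rank) holds two non-adjacent pipeline stages.
chunks = [nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(4)]

rank = 0  # this process's GPU index (assumed; normally taken from the launcher)
device = torch.device(f"cuda:{rank}" if torch.cuda.is_available() else "cpu")

# Interleaved assignment: rank 0 gets chunks 0 and 2, rank 1 gets chunks 1 and 3.
my_chunks = [c.to(device) for i, c in enumerate(chunks) if i % 2 == rank]
```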

FYI, PyTorch has an official pipeline parallelism package (GitHub - pytorch/PiPPy: Pipeline Parallelism for PyTorch) that is going to be upstreamed into the PyTorch repository in the coming months (you will be able to use these APIs with just `import torch`). The package provides support for the most popular schedules (GPipe, 1F1B, interleaved 1F1B, BFS) (PiPPy/pippy/PipelineSchedule.py at main · pytorch/PiPPy · GitHub).
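Once it lands upstream, driving one stage with a schedule should look roughly like the sketch below (this assumes the `torch.distributed.pipelining` module and class names of the eventual upstream package; the API inside PiPPy itself may differ, so treat this as illustrative only and check the repo's examples):

```python
# Illustrative sketch: run one pipeline stage under a 1F1B schedule.
# Launch one process per GPU with torchrun; `stage_module` stands in for
# this rank's chunk of the real model.
import torch
import torch.distributed as dist
from torch.distributed.pipelining import PipelineStage, Schedule1F1B

dist.init_process_group(backend="nccl")
rank, world_size = dist.get_rank(), dist.get_world_size()
device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

stage_module = torch.nn.Linear(1024, 1024).to(device)  # placeholder chunk
stage = PipelineStage(stage_module, stage_index=rank,
                      num_stages=world_size, device=device)

schedule = Schedule1F1B(stage, n_microbatches=8)

x = torch.randn(32, 1024, device=device)
if rank == 0:
    schedule.step(x)   # first stage feeds the full batch; it gets microbatched
else:
    schedule.step()    # later stages just participate in the schedule
```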