How to get outputs in the same order as inputs with multiple spawned processes running on multiple GPUs and batches of data processed by each?

I am using the PyTorch DistributedDataParallel approach, spawning multiple processes from a parent process, each running on a separate GPU. I am using the PyTorch DistributedSampler along with a DataLoader to load batches of input data in each process.
My questions:

  1. Under the hood, how do the PyTorch DistributedSampler and DataLoader slice the input data? For simplicity, say we have 4 GPUs, 400 input samples, and a batch size of 50. Will the DistributedSampler (together with the DataLoader) send the first 50 samples to GPU-0, the next 50 to GPU-1, the next 50 to GPU-2, then GPU-3, and then the next 50 to GPU-0 again, i.e. in order of GPU device number? Or is the GPU for the next batch chosen at random, based on which GPU finished its previous batch first? Or are the 400 samples first divided into 4 parts, so that GPU-0 gets the first 100 samples (50 at a time), GPU-1 the next 100 (50 at a time), and so on? In that case, even if GPU-3 starts its second batch earlier than GPU-0, would GPU-0 still hold the first 100 samples and GPU-3 the last 100?

  2. How can I retrieve the output data in the same order as the input data, so that the final consolidated output (outputs from all processes combined in one data structure) is in the same order as the original inputs and each output corresponds to the right input?

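Thanks for posting, @kaleemiqb. If you are using the recommended DistributedDataParallel (DDP) mode, where there is a dedicated process for each GPU, DDP itself does not split the input data. Each process has its own data loader and its own DDP instance; DDP only helps by automatically computing the globally averaged gradient in the backward pass. So which samples a process sees depends on how the data loader in that process picks its next batch, not on which GPU finishes first. With DistributedSampler that choice is tied to the rank rather than being random: every process builds the same index list (shuffled with a shared seed if shuffle=True), and rank r takes every num_replicas-th index starting at r. In your example with shuffle=False, GPU-0 would get samples 0, 4, 8, ... (50 at a time), so it is neither contiguous blocks of 50 nor contiguous blocks of 100.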
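To make the per-rank assignment concrete, here is a minimal plain-Python sketch of the index arithmetic that, as far as I understand it, DistributedSampler performs when shuffle=False and drop_last=False. The helper names `shard_indices` and `batches` are made up for illustration and are not part of PyTorch; the numbers are the 400-sample, 4-GPU, batch-size-50 case from the question.

```python
def shard_indices(num_samples, num_replicas, rank):
    """Indices one rank would see; a model of DistributedSampler(shuffle=False).

    Assumes num_samples >= num_replicas (hypothetical helper, not a PyTorch API).
    """
    indices = list(range(num_samples))
    # Round the total up to a multiple of num_replicas and pad by repeating
    # the leading indices, so every rank gets the same number of samples.
    total_size = -(-num_samples // num_replicas) * num_replicas
    indices += indices[: total_size - num_samples]
    # Each rank takes a strided slice: rank, rank + num_replicas, ...
    return indices[rank:total_size:num_replicas]


def batches(indices, batch_size):
    """Consecutive chunks, the way a DataLoader groups what the sampler yields."""
    return [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]


for rank in range(4):
    per_rank = batches(shard_indices(400, 4, rank), 50)
    # Show the number of batches and the first few indices of each batch.
    print(rank, len(per_rank), per_rank[0][:3], per_rank[1][:3])
```

Under these assumptions each rank gets 2 batches of 50, and rank 0's batches start at indices 0, 4, 8 and 200, 204, 208 respectively. With shuffle=True the same striding is applied to a permuted index list, so the assignment is still fixed by rank and seed, not by GPU timing.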

For the second question, you can record each input batch and the corresponding model output in a map within its own process. If you want to concatenate them, do an all_gather manually, but note that the gathered batches will not necessarily come back in the original dataset order across processes.
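One way to restore the original order after gathering is to key every output by its dataset index and sort on that key. The sketch below shows only the merge step in plain Python, assuming each process has recorded (dataset_index, output) pairs and that the per-rank lists have already been collected on one process (for example with torch.distributed.all_gather_object, if your PyTorch version provides it). The function name `merge_in_input_order` and the toy data are hypothetical.

```python
def merge_in_input_order(gathered, num_samples):
    """Combine per-rank (index, output) pairs into one list ordered by input index.

    `gathered` holds one entry per rank, each a list of
    (dataset_index, output) pairs recorded by that rank.
    """
    merged = {}
    for rank_pairs in gathered:
        for index, output in rank_pairs:
            # A sampler that pads to equalise ranks can repeat an index;
            # keep only the first occurrence.
            merged.setdefault(index, output)
    missing = [i for i in range(num_samples) if i not in merged]
    if missing:
        raise ValueError(f"no output recorded for inputs {missing[:5]}")
    return [merged[i] for i in range(num_samples)]


# Toy data: 2 ranks, 5 inputs; rank 1 holds a padded duplicate of index 0.
gathered = [
    [(0, "out0"), (2, "out2"), (4, "out4")],
    [(1, "out1"), (3, "out3"), (0, "out0")],
]
print(merge_in_input_order(gathered, 5))
```

Keying on the index rather than on arrival order means the result does not depend on which rank finished first or on how the sampler interleaved the data.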

Also, if you want detailed control over how the data is generated and consumed by the processes, consider writing a custom dataset or data loader; see "Writing Custom Datasets, DataLoaders and Transforms" in the PyTorch Tutorials (1.9.1+cu102) documentation.
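A small customisation that helps with ordering is to have the dataset return each sample's index along with the sample, so every process knows which inputs its outputs belong to. The wrapper below is a minimal sketch in plain Python; `IndexedDataset` is a hypothetical name, and the list of strings stands in for a real dataset. In real code you would probably subclass torch.utils.data.Dataset and pass the wrapped dataset to the DataLoader together with the sampler.

```python
class IndexedDataset:
    """Wrap a map-style dataset so each item also carries its own index."""

    def __init__(self, base):
        self.base = base

    def __len__(self):
        return len(self.base)

    def __getitem__(self, index):
        # Return the index first so the loop that runs the model can record it
        # next to the corresponding output.
        return index, self.base[index]


samples = ["a", "b", "c", "d"]  # stand-in for a real dataset
wrapped = IndexedDataset(samples)
print(len(wrapped), wrapped[2])
```

Because the index travels with the sample, it does not matter how the sampler distributes or shuffles the data: the outputs can always be matched back to their inputs afterwards.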