I am using Pytorch Distributed Data Parallel approach and spawning multiple processes from parent process, each running on separate GPU.I am using Pytorch Distributed Data Sampler along with Data Loader for loading batches of input data to each process.
My questions:
- Under the hood, how does Pytorch Distributed Data Sampler, Data Loader make slices of input data? Just for simplicity say we have 4 GPUs, and 400 input samples and batch size of say 50, then will Pytorch Distributed Data Sampler (together with Data Loader) make first 50 samples go to GPU-0, next 50 to GPU-1., next 50 to GPU-2, then GPU-3 and then again next 50 to GPU-0 i.e. in the order of GPU device number? or the order of GPU to select for next batch of input is random based on which GPU has finished its previous batch first? or is it like 400 samples get divided into 4 parts first and then GPU-0 would get first 100 samples of input data (50 at a time ), GPU-1 will get next 100 samples ( 50 at a time) and so on…and in this case no matter if say GPU-3 gets its second batch started earlier than GPU-0, but still with respect to input data, GPU-0 would still have first 100 samples and GPU-3 would have last 100?
2). My Second question is how to retrieve output data in same order as input data so that final consolidated output ( having outputs from all processes combined in one data structure) is in same order as original inputs and each output corresponds to the right input