Hello Patrick! I hope you are doing well. I have another workaround for this problem, but I am running into some knowledge gaps while implementing it, and I was hoping to get your help with it.
The idea is this: instead of me playing directly with the DataLoader, deciding per-GPU batch sizes and all of that, the DataLoader would sit in shared memory, and the DDP process on each of my GPUs would request its respective batches from it.
My thought is: is it possible to place an iterator over my DataLoader in shared memory and then pass that one iterator to each GPU process? That way, whenever a GPU requests data from the DataLoader, it is basically calling next() on the shared object and fetching a batch. A faster GPU will simply keep fetching whenever it is done, and the other GPU will do the same, but I no longer have to decide how many batches to send to each of them, which is what was causing the sync issues for me.
Does this make sense, or should I explain the workaround in more detail? Someone on the forum did something along these lines, but I don't know how to implement it the way he did, because shared memory expects a tensor, whereas what I want to share is an iterator.
(How to share data among DataLoader processes to save memory - Memory Format - PyTorch Forums)
Right now, whenever I create these processes, two separate iterators get created, one per process.
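To show what I mean, here is a rough sketch of the direction I was thinking of (not working code from my project). Since, as far as I understand, I cannot put the iterator object itself into shared memory, the sketch instead has one producer process that owns the DataLoader and pushes batches into a torch.multiprocessing.Queue, and each rank process just does queue.get() whenever it is ready, so a faster GPU naturally pulls more batches. The function names and the toy dataset are just made up for illustration; in the real setup the consumers would be my DDP ranks.

```python
import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset


def produce_batches(queue, num_consumers):
    # The only process that ever touches the DataLoader / its iterator.
    dataset = TensorDataset(torch.arange(32).float().unsqueeze(1))
    loader = DataLoader(dataset, batch_size=4, shuffle=True)
    for batch in loader:
        queue.put(batch)          # tensors are moved to shared memory here
    for _ in range(num_consumers):
        queue.put(None)           # one "end of epoch" sentinel per consumer


def consume_batches(rank, queue):
    # In the real setup this would be one DDP rank doing forward/backward;
    # here it just pulls batches whenever it is ready.
    while True:
        batch = queue.get()       # this plays the role of next(iterator)
        if batch is None:
            break
        (x,) = batch
        print(f"rank {rank} got a batch of shape {tuple(x.shape)}")


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    queue = ctx.Queue(maxsize=8)   # bounded so the producer cannot run too far ahead
    num_consumers = 2              # stand-in for my two GPUs

    producer = ctx.Process(target=produce_batches, args=(queue, num_consumers))
    consumers = [
        ctx.Process(target=consume_batches, args=(rank, queue))
        for rank in range(num_consumers)
    ]

    producer.start()
    for p in consumers:
        p.start()
    producer.join()
    for p in consumers:
        p.join()
```

Is a single queue-feeding process like this a reasonable way to get the "shared iterator" behaviour, or is there a cleaner way to do it directly with the DataLoader?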
(I am new to all of this, so my apologies in advance if I make mistakes while describing it.)