I am new to distributed training on multiple GPUs. Following the tutorial here, I believe my model is replicated on all GPUs, and the DistributedSampler gives each process its own subset of the data.
Assuming that this is correct, each of my GPUs has access to a subset of my data, and I should have a separate loader_train
for each rank.
Now, mid-training (let's say after epoch 10), I need to gather all the dataloader objects into a list. My solution is to use all_gather_object.
When I try the following code:
dloader_all = [None for _ in range(torch.distributed.get_world_size())]
torch.distributed.all_gather(dloader_all, loader_train[torch.distributed.get_rank()])
I get the error:
TypeError: 'DataLoader' object is not subscriptable
I wrote the code following the documentation. What am I missing here? How can I gather the objects into a list where list[0]
is the dataloader from rank 0, and so on?
I would appreciate any help.