I am new to distributed training on multiple GPUs. Following the tutorial here, I believe my model is replicated on all GPUs, and the DistributedSampler gives each process its own subset of the data.
Assuming that this is correct, each of my GPUs has access to a subset of my data, and I should have a separate loader_train
for each rank.
Now, mid-training (let's say after epoch 10), I need to gather all the dataloader objects into a list. My solution is to use all_gather_object.
When I try the following code:
dloader_all = [None for _ in range(torch.distributed.get_world_size())]
torch.distributed.all_gather(dloader_all, loader_train[torch.distributed.get_rank()])
I get the error:
TypeError: 'DataLoader' object is not subscriptable
I wrote the code following the documentation. What am I missing here? How can I gather the objects into a list where list[0]
is the dataloader from rank 0, and so on?
I would appreciate any help.