My requirement is that the batch composition be non-deterministic, which is easy to handle with a custom BatchSampler in a single-GPU or DataParallel setup. For a DistributedDataParallel scenario, however, I can't think of any way to make this work, so I'd like to discuss a potential solution I have in mind for DDP.
Potential Solution
Since only one sequence of indices should be generated, is it sensible to restrict sampling to rank 0 and use torch.distributed.broadcast so that the indices produced by rank 0 are shared with all other ranks?
Something like this:
import torch

# Assumes torch.distributed is already initialised and args.rank holds this
# process's rank (which here also serves as its CUDA device index).

# Simulate the different per-rank ordering a non-deterministic batch sampler would produce
samples = torch.randperm(12).tolist()
print('Starting samples for rank:', args.rank, samples)

### Sync'd solution: broadcast rank 0's indices to every rank
samples = torch.LongTensor(samples).to(args.rank)
handle = torch.distributed.broadcast(samples, src=0, async_op=True)
handle.wait()  # blocks until the broadcast completes, so every rank now holds rank 0's indices
print('Samples in rank:', args.rank, samples)
==output==
Starting samples for rank: 1 [6, 0, 8, 11, 3, 4, 7, 5, 2, 9, 10, 1]
Starting samples for rank: 0 [3, 5, 6, 9, 7, 11, 10, 1, 8, 0, 2, 4]
Samples in rank: 1 tensor([ 3, 5, 6, 9, 7, 11, 10, 1, 8, 0, 2, 4], device='cuda:1')
Samples in rank: 0 tensor([ 3, 5, 6, 9, 7, 11, 10, 1, 8, 0, 2, 4], device='cuda:0')
I can then move this inside a CustomDistributedSampler that broadcasts rank 0's indices across all ranks and returns each rank's slice, similar to how PyTorch's DistributedSampler does it, roughly as sketched below.
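A minimal sketch of what that sampler might look like, assuming the process group is already initialised; the class name, constructor arguments, and the torch.randperm stand-in for the actual non-deterministic sampling logic are all placeholders:

import torch
import torch.distributed as dist
from torch.utils.data import Sampler

class CustomDistributedSampler(Sampler):
    """Rank 0 draws the ordering, broadcasts it, and each rank yields its own slice."""

    def __init__(self, dataset, num_replicas, rank, device):
        self.dataset = dataset
        self.num_replicas = num_replicas
        self.rank = rank
        self.device = device
        # Round up so every rank yields the same number of indices
        self.num_samples = (len(dataset) + num_replicas - 1) // num_replicas
        self.total_size = self.num_samples * num_replicas

    def __iter__(self):
        if self.rank == 0:
            # Stand-in for the real non-deterministic sampling logic
            indices = torch.randperm(len(self.dataset))
        else:
            # Other ranks only allocate a buffer to receive rank 0's ordering
            indices = torch.empty(len(self.dataset), dtype=torch.int64)
        indices = indices.to(self.device)

        dist.broadcast(indices, src=0)  # everyone ends up with rank 0's ordering
        indices = indices.cpu().tolist()

        # Pad and slice the way torch.utils.data.distributed.DistributedSampler does
        indices += indices[: self.total_size - len(indices)]
        return iter(indices[self.rank : self.total_size : self.num_replicas])

    def __len__(self):
        return self.num_samples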
Are there any issues with this approach? Is there a cleaner and/or more optimized solution?