I am currently using DistributedDataParallel to speed up model training.
Originally, my code looked like the following and worked well:
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm

train_sampler = DistributedSampler(data, num_replicas=self.world_size, rank=dist.get_rank())
train_sampler.set_epoch(epoch)  # seed the shuffle so each epoch sees a different ordering
data_loader = DataLoader(data, batch_size=self.batch_size, shuffle=False, num_workers=0, pin_memory=self.pin_memory, sampler=train_sampler)
for batch in tqdm(data_loader, leave=False, desc='Batch Training {:<3}'.format(epoch), ncols=100, mininterval=1):
    # === train model ===
When running the above code, the data is distributed across all GPUs as expected. For instance, suppose we have training data of length 1,000 and we train with batch size 100. In a single-GPU environment, the number of iterations per epoch is 1,000 / 100 = 10. In a multi-GPU environment (say, 2 GPUs) using DDP and DistributedSampler, each process gets 1,000 / 2 = 500 samples, so the number of iterations per epoch should be 500 / 100 = 5.
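This expectation is easy to verify in isolation. Below is a minimal sketch (not my training code; it uses a dummy TensorDataset and passes num_replicas/rank explicitly, so no process group is needed) showing that each rank's loader runs 5 iterations:

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

data = TensorDataset(torch.randn(1000, 8))  # dummy dataset of length 1,000
for rank in range(2):  # simulate both ranks in one process
    sampler = DistributedSampler(data, num_replicas=2, rank=rank)
    loader = DataLoader(data, batch_size=100, shuffle=False, sampler=sampler)
    print(f"rank {rank}: {len(sampler)} samples, {len(loader)} iterations")
# prints 500 samples and 5 iterations for each rank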
However, when I define a separate function that builds the dataloader, DistributedSampler no longer behaves as expected.
def make_dataloader(self, data, batch_size, epoch):
    train_sampler = DistributedSampler(data, num_replicas=self.world_size, rank=dist.get_rank())
    train_sampler.set_epoch(epoch)
    data_loader = DataLoader(data, batch_size=batch_size, shuffle=False, num_workers=0, pin_memory=self.pin_memory, sampler=train_sampler)
    return data_loader
data_loader = self.make_dataloader(data, self.batch_size, epoch)
for batch in tqdm(data_loader, leave=False, desc='Batch Training {:<3}'.format(epoch), ncols=100, mininterval=1):
    # === train model ===
When running the code above, the data does not get distributed as expected. That is, if I train a model under the same conditions as in the example (data length 1,000, batch size 100, 2 GPUs), each process runs 10 iterations per epoch, not 5.
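For reference, this is the kind of check I can add inside make_dataloader (a minimal diagnostic sketch; it assumes dist.init_process_group has already been called on every rank):

# If len(train_sampler) is 1,000 rather than 500 on each rank, the sampler is
# not splitting the data at all, which would mean self.world_size is 1 here.
print(f"rank {dist.get_rank()}: world_size passed = {self.world_size}, "
      f"dist.get_world_size() = {dist.get_world_size()}, "
      f"sampler = {len(train_sampler)} samples, loader = {len(data_loader)} iterations")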
Why would DistributedSampler behave differently in this case?