DistributedSampler works in a strange manner

I am currently using DistributedDataParallel to speed up model training.

Originally, my code looked like the following and worked well:

train_sampler = DistributedSampler(data, num_replicas=self.world_size, rank=dist.get_rank())
train_sampler.set_epoch(epoch)
data_loader = DataLoader(data, batch_size=self.batch_size, shuffle=False, num_workers=0, pin_memory=self.pin_memory, sampler=train_sampler)

for batch in tqdm(data_loader, leave=False, desc='Batch Training {:<3}'.format(epoch), ncols=100, mininterval=1):
    pass  # === train model ===

When running the above code, the data gets distributed across all GPUs as expected. For instance, say we have training data of length 1,000 and we want to train with a batch size of 100. In a single-GPU environment, the number of iterations per epoch is 1,000 / 100 = 10. In a multi-GPU environment (say, 2 GPUs) using DDP & DistributedSampler, the number of iterations per epoch on each process should be 5.
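For reference, this is the arithmetic I expect; a minimal sketch using only the numbers from the example above (not my actual training code):

import math

num_samples = 1000
batch_size = 100
world_size = 2

# DistributedSampler gives each replica roughly len(data) / num_replicas samples
samples_per_replica = math.ceil(num_samples / world_size)      # 500 per process
iters_per_epoch = math.ceil(samples_per_replica / batch_size)  # 5 iterations per process
print(samples_per_replica, iters_per_epoch)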

However, when the dataloader is built inside a separate function, DistributedSampler does not seem to work as expected.

def make_dataloader(data, batch_size, epoch):
    train_sampler = DistributedSampler(data, num_replicas=self.world_size, rank=dist.get_rank())       
    train_sampler.set_epoch(epoch)
    data_loader = DataLoader(data, batch_size=batch_size, shuffle=False, num_workers=0, pin_memory=self.pin_memory, sampler=train_sampler)
    return data_loader

data_loader = make_dataloader(data, self.batch_size, epoch) 
for batch in tqdm(data_loader, leave=False, desc='Batch Training {:<3}'.format(epoch), ncols=100, mininterval=1):
    pass  # === train model ===

When running the code above, the data does not get distributed as expected. That is, if I train a model under the same conditions as in the example (data length: 1,000, batch size: 100, num GPUs: 2), each process runs 10 iterations per epoch, not 5.

Why would DistributedSampler behave differently?

What is the value of self.world_size in the make_dataloader function?

If it is 1, each process will run 10 iterations.

self.world_size is set to the number of GPUs to be used, so in the example it is 2.
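For what it's worth, a quick sanity check I could run (assuming the default process group has already been initialized with init_process_group) would be:

import torch.distributed as dist

# Compare the stored value against what the process group reports
world_size = dist.get_world_size()  # expected to be 2 here
rank = dist.get_rank()
print(f"rank = {rank}, world_size = {world_size}")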

I am not seeing a problem when I run a basic example:

import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader

# 1,000 dummy samples
train_dataset = []
for _ in range(1000):
    train_dataset.append(torch.rand(4, 4))

def basic_sampler(rank, world_size):
    sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset,
        num_replicas=world_size,
        rank=rank
    )
    data_loader = DataLoader(train_dataset, batch_size=100, sampler=sampler)
    print(f"rank = {rank} train_dataset sample count = {len(train_dataset)}")
    print(f"rank = {rank} data_loader batch count = {len(list(data_loader))}")


if __name__ == "__main__":
    # spawn 2 processes; world_size is passed through args
    mp.spawn(basic_sampler, nprocs=2, args=(2,))

rank = 0 train_dataset sample count = 1000
rank = 0 data_loader batch count = 5
rank = 1 train_dataset sample count = 1000
rank = 1 data_loader batch count = 5

Are you able to add some debugging logs to confirm that everything is set correctly?
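For example, something along these lines inside make_dataloader would show what each rank actually receives (a sketch reusing the variable names from your snippet):

def make_dataloader(data, batch_size, epoch):
    rank = dist.get_rank()
    train_sampler = DistributedSampler(data, num_replicas=self.world_size, rank=rank)
    train_sampler.set_epoch(epoch)
    data_loader = DataLoader(data, batch_size=batch_size, shuffle=False, num_workers=0,
                             pin_memory=self.pin_memory, sampler=train_sampler)
    # Debug: per-rank view of the split
    print(f"rank = {rank}, num_replicas = {self.world_size}, "
          f"sampler len = {len(train_sampler)}, batch count = {len(data_loader)}")
    return data_loader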

Sorry for the late reply. I was using a cloud instance and it worked fine this time. I guess something went wrong elsewhere…