Hi, I am new to DistributedDataParallel. I want to use DDP with a custom dataset that changes dynamically after each epoch, and I am not sure how to set this up correctly. Below is the pseudo-code I wrote; I don't know if it is correct.
import torch
import torch.nn as nn
import torch.distributed as dist

def train(gpu, args):
    ############################################################
    rank = args.nr * args.gpus + gpu
    dist.init_process_group(
        backend='nccl',
        init_method='env://',
        world_size=args.world_size,
        rank=rank
    )
    ############################################################
    model = MyModel()
    torch.cuda.set_device(gpu)
    model.cuda(gpu)
    batch_size = ...
    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), 1e-4)
    ############################################################
    # Wrap the model
    model = nn.parallel.DistributedDataParallel(model,
                                                device_ids=[gpu])
    train_set = MyDataset()
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_set)
    train_loader = torch.utils.data.DataLoader(
        dataset=train_set,
        batch_size=batch_size,
        sampler=train_sampler)
    for epoch in range(args.epoch):
        # the training code
        ...
        # the validation code
        ...
        # The dataset changes here. Should I reuse the DistributedSampler,
        # or is it correct to create a new one as in the following lines?
        new_trainset = MyDataset()
        new_train_sampler = torch.utils.data.distributed.DistributedSampler(new_trainset)
        new_train_loader = torch.utils.data.DataLoader(
            dataset=new_trainset,
            batch_size=batch_size,
            sampler=new_train_sampler)
        ...
What confuses me is this: after I modify the original dataset, do I need to create a new DistributedSampler, or will the existing sampler automatically pick up the change? And if I do need a new sampler each epoch, should I call dist.barrier() somewhere to keep the processes in sync?
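To make the question more concrete, here is a minimal sketch of the per-epoch rebuild I have in mind, reusing the names from the snippet above (MyDataset, args, rank, batch_size). The set_epoch() call and the dist.barrier() placement are my own guesses, not something I have verified:

import torch
import torch.distributed as dist

for epoch in range(args.epoch):
    # Rebuild the dataset, sampler, and loader every epoch, since (as far as I
    # understand) the sampler keeps a reference to the dataset it was built with.
    train_set = MyDataset()
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_set, num_replicas=args.world_size, rank=rank)
    train_sampler.set_epoch(epoch)  # my guess: makes shuffling differ per epoch
    train_loader = torch.utils.data.DataLoader(
        dataset=train_set, batch_size=batch_size, sampler=train_sampler)
    # ... training and validation ...
    dist.barrier()  # my guess: keep all ranks in sync before rebuilding the dataset

Is something like this the right approach, or are the new sampler and the barrier unnecessary?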
Thank you very much for your time and your help!