Hi,
I’m a PyTorch beginner. I want to distribute data across several GPUs with DistributedDataParallel.
I run this code on a 4-GPU node, so my world size is 4.
My question is about the validation batch_size and about num_workers (for both validation and training).
Will each process get a validation batch_size of 4/world_size and num_workers of 4/world_size?
Will each process get a training num_workers of 16/world_size?
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def get_dataset(data_folder, train_images, val_images, CROP_SIZE, UPSCALE_FACTOR):
    world_size = dist.get_world_size()

    # Create training and validation datasets
    train_set = TrainDatasetFromFolder(data_folder, crop_size=CROP_SIZE,
                                       upscale_factor=UPSCALE_FACTOR,
                                       image_list=train_images)
    val_set = ValDatasetFromFolder(data_folder, upscale_factor=UPSCALE_FACTOR,
                                   image_list=val_images)

    # One sampler per loader so that every rank sees a distinct shard of the data
    train_sampler = DistributedSampler(train_set, num_replicas=world_size)
    val_sampler = DistributedSampler(val_set, num_replicas=world_size)

    # Split the global training batch of 80 across the processes
    batch_size = int(80 / float(world_size))
    print(world_size, batch_size)

    train_loader = DataLoader(
        dataset=train_set,
        sampler=train_sampler,
        batch_size=batch_size,
        num_workers=16,
        pin_memory=True,
    )
    val_loader = DataLoader(
        dataset=val_set,
        sampler=val_sampler,
        batch_size=4,
        num_workers=4,
    )
    return train_loader, val_loader, batch_size
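
To make the question concrete, here is a small self-contained sketch I put together to inspect what each rank actually sees. It assumes a launch with torchrun --nproc_per_node=4 (so RANK/WORLD_SIZE are set by the launcher) and uses a dummy TensorDataset of 40 items in place of my real dataset classes; the dataset size and file name are just placeholders, not my real code:

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT in the environment
    dist.init_process_group(backend="gloo")  # "nccl" on GPU nodes
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Dummy dataset standing in for ValDatasetFromFolder (hypothetical size 40)
    dataset = TensorDataset(torch.arange(40).float().unsqueeze(1))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, sampler=sampler, batch_size=4, num_workers=4)

    # len(sampler) is this rank's share of the samples;
    # the first batch shows the batch size this rank's loader actually yields.
    first_batch = next(iter(loader))[0]
    print(f"rank {rank}: samples on this rank = {len(sampler)}, "
          f"first batch size = {first_batch.shape[0]}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

Run as, for example, torchrun --nproc_per_node=4 check_loader.py (check_loader.py being whatever file this sketch is saved in) and compare the printed per-rank sample counts and batch sizes against what you would expect from the batch_size=4 / num_workers=4 arguments above.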