Hello
I have a single node with 8 GPUs, and I am interested in using DistributedDataParallel (DDP) to distribute data processing across the GPU devices. I have two functions (I'm basing these partly on this demo):
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

import networks
# (plus imports of D1_Dataset, D2_Dataset, and parse_args from my own modules)

def main(local_world_size, local_rank, args):
    # These are the environment variables used to initialize the process group
    env_dict = {
        key: os.environ[key]
        for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")
    }
    print(f"[{os.getpid()}] Initializing process group with: {env_dict}")
    dist.init_process_group(backend="nccl")
    print(
        f"[{os.getpid()}] world_size = {dist.get_world_size()}, "
        f"rank = {dist.get_rank()}, backend = {dist.get_backend()}"
    )

    # Train the model
    train(local_world_size, local_rank, args)

    # Tear down the process group
    dist.destroy_process_group()
and
DATASET_MAP = {'D1': D1_Dataset,
               'D2': D2_Dataset}

def train(local_world_size, local_rank, args):
    # Map this process onto a contiguous slice of the visible GPUs
    n = torch.cuda.device_count() // local_world_size
    device_ids = list(range(local_rank * n, (local_rank + 1) * n))
    print(
        f"[{os.getpid()}] rank = {dist.get_rank()}, "
        f"world_size = {dist.get_world_size()}, n = {n}, device_ids = {device_ids}"
    )

    root_dir = f"/mnt/efs/ml_data/{args.dataset}/"
    datasets = {
        subset: DATASET_MAP[args.dataset](root=root_dir,
                                          data_subset=subset,
                                          in_memory=False)
        for subset in ['train', 'val']
    }
    training_sampler = DistributedSampler(datasets['train'], rank=local_rank, shuffle=True)
    loaders = {
        'train': DataLoader(datasets['train'],
                            batch_size=args.batch_size,
                            num_workers=4,
                            sampler=training_sampler),
        'val': DataLoader(datasets['val'],
                          batch_size=args.batch_size,
                          num_workers=4),
    }

    model = networks.Network(
        raw_dim=args.num_classes,
        hidden_dim=args.hidden_dim,
        output_size=args.output_dim,
    )
    model = model.cuda(device_ids[0])
    ddp_model = DDP(model, device_ids=device_ids)

    for epoch in range(args.num_epochs):
        # Re-seed the sampler so each epoch gets a different shuffle
        training_sampler.set_epoch(epoch)
        print(f"Epoch: {epoch}.")
        for i, graph in enumerate(loaders['train']):
            graph = graph.cuda(device_ids[0])
            output = ddp_model(graph)
            print(output.device)
and finally
if __name__ == "__main__":
    args = parse_args()
    main(args.local_world_size, args.local_rank, args)
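(parse_args isn't shown above; it's essentially the argparse setup from the demo. A sketch of what my code assumes, with placeholder defaults:)

import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to each spawned process
    parser.add_argument("--local_rank", type=int, default=0)
    parser.add_argument("--local_world_size", type=int, default=1)
    parser.add_argument("--dataset", choices=["D1", "D2"], default="D1")
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--num_epochs", type=int, default=10)
    parser.add_argument("--num_classes", type=int, default=16)
    parser.add_argument("--hidden_dim", type=int, default=128)
    parser.add_argument("--output_dim", type=int, default=16)
    return parser.parse_args()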
My training and validation datasets are quite large, so I load samples at call time (as the DataLoader is iterated), rather than up front when constructing the custom Dataset objects. The in_memory=False parameter of the custom Dataset class ensures that the __getitem__ method does the loading, not the __init__ method.
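For reference, the lazy-loading part of my Dataset looks roughly like this (a simplified sketch; the file layout and torch.load format are stand-ins for the real code):

import os
import torch
from torch.utils.data import Dataset

class D1_Dataset(Dataset):
    def __init__(self, root, data_subset, in_memory=False):
        subset_dir = os.path.join(root, data_subset)
        self.files = sorted(
            os.path.join(subset_dir, f) for f in os.listdir(subset_dir)
        )
        self.in_memory = in_memory
        # Only materialize everything up front when in_memory=True
        self.samples = [torch.load(f) for f in self.files] if in_memory else None

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        if self.in_memory:
            return self.samples[idx]
        # Lazy path: load the sample from disk at call time
        return torch.load(self.files[idx])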
The downside of this approach is that samples are redundantly loaded from disk every epoch. I am wondering at which point in the distributed setup the dataset creation should happen: should I create the datasets in main, or continue to do so in train? Because the train/val datasets are large, I want to make sure I am not putting each whole dataset onto each individual GPU device.
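My understanding of why this should be safe either way (please correct me if I'm wrong) is that DistributedSampler hands each rank a disjoint ~1/world_size slice of the indices, so with a lazy __getitem__ each process only ever reads its own shard from disk, wherever the Dataset object is constructed. A standalone sketch of that check (num_replicas and rank are passed explicitly so no process group is needed):

from torch.utils.data import Dataset
from torch.utils.data.distributed import DistributedSampler

class TinyDataset(Dataset):
    def __len__(self):
        return 80

    def __getitem__(self, idx):
        return idx

dataset = TinyDataset()
for rank in range(8):
    sampler = DistributedSampler(dataset, num_replicas=8, rank=rank, shuffle=True)
    indices = list(iter(sampler))
    print(rank, len(indices))  # each rank sees 10 of the 80 indices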
Does this make sense?
Thanks for your help.
Kristian