Speed up model training

Hello Torch users,

I’m currently implementing a 3D ResNet-18 on fMRI data of shape [27, 75, 93, 81]. I couldn’t finish a single epoch in 48 hours on two A100 GPUs.

I have already tried converting my data directly into NumPy arrays to speed up the process.

I already use the following code to run the model on 2 GPUs:

import torch
import torch.nn as nn

if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

By the way, how can I check whether both GPUs are actually being used?

I use this for my train loader:

train_loader = torch.utils.data.DataLoader(train_set,
                                           batch_size=64,
                                           shuffle=True,
                                           num_workers=0)

Any ideas or tricks to speed up the process?

nn.DataParallel is not the recommended approach for data parallelism, as it can create imbalanced memory usage across the devices and is slower than DistributedDataParallel (DDP). Use DDP with a single process per GPU (even on a single node) for better performance.
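
For reference, a minimal single-node DDP sketch could look like the code below (launched with torchrun; build_model, train_set, and num_epochs are placeholders for your own model, dataset, and schedule, not part of your script):

# Minimal single-node DDP sketch, launched with e.g.:
#   torchrun --nproc_per_node=2 train_ddp.py
# build_model, train_set, and num_epochs are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = build_model().cuda(local_rank)           # placeholder model
    model = DDP(model, device_ids=[local_rank])

    sampler = DistributedSampler(train_set)          # placeholder dataset
    loader = DataLoader(train_set, batch_size=64, sampler=sampler,
                        num_workers=4, pin_memory=True)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(num_epochs):                  # placeholder schedule
        sampler.set_epoch(epoch)                     # reshuffle shards each epoch
        for data, target in loader:
            data = data.cuda(local_rank, non_blocking=True)
            target = target.cuda(local_rank, non_blocking=True)
            optimizer.zero_grad()
            loss = criterion(model(data), target)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Each process owns one GPU and its own shard of the dataset, so there is no single device gathering the outputs as in nn.DataParallel.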

nvidia-smi shows the GPU utilization as well as the memory usage of each device and is often good enough as a quick check of how many devices are being used.
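
If you also want to check it from inside the script, a quick sketch like the following prints the memory allocated on each device; after a few iterations both GPUs should report non-zero values if DataParallel really uses them:

import torch

# Quick in-script check: run after a few forward/backward passes.
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} ({torch.cuda.get_device_name(i)}): "
          f"{torch.cuda.memory_allocated(i) / 1024**2:.1f} MiB allocated")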

Generally, check the performance guide for some common issues and recommendations.
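
As a starting point, the suggestions from the guide that would likely matter most here are loading data with multiple workers, pinning host memory, enabling cudnn benchmark mode, and mixed-precision training. A rough sketch (the num_workers value is an assumption to be tuned; model, optimizer, criterion, and device refer to the ones in your script):

import torch
from torch.utils.data import DataLoader
from torch.cuda.amp import autocast, GradScaler

# Illustrative values only:
# - num_workers > 0 and pin_memory=True keep the GPUs fed,
# - cudnn benchmark mode helps for fixed input shapes,
# - automatic mixed precision reduces compute and memory cost.
torch.backends.cudnn.benchmark = True

train_loader = DataLoader(train_set,        # train_set as in your post
                          batch_size=64,
                          shuffle=True,
                          num_workers=8,    # assumption: tune to your CPU cores
                          pin_memory=True)

scaler = GradScaler()
for data, target in train_loader:
    data = data.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)
    optimizer.zero_grad()
    with autocast():
        loss = criterion(model(data), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()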

Thanks a lot for your response! I’ll try to implement DDP. In my current script, do you think nn.DataParallel speeds up the process, or could it even slow it down?

I would guess the current approach could yield a speedup, but you would need to profile the use case on your system to check it.
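
If you want to measure it, a sketch like the following runs a few iterations under torch.profiler and shows where the time goes (model, criterion, optimizer, device, and train_loader are assumed to be the ones from your script):

import torch
from torch.profiler import profile, ProfilerActivity

# Rough profiling sketch: time a handful of training iterations and see
# whether data loading, CPU ops, or GPU kernels dominate.
model.train()
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for i, (data, target) in enumerate(train_loader):
        if i == 10:          # a few iterations are enough for a first look
            break
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))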