Speed up model training

Hello Torch users,

I’m currently implementing a 3D ResNet-18 on fMRI data of shape [27, 75, 93, 81]. I couldn’t finish a single epoch in 48 hours on two A100 GPUs.

I have already tried converting my data directly into NumPy arrays to speed up the process.

I already use the following code to run the model on 2 GPUs:

import torch
import torch.nn as nn

if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

By the way, how can I check whether both GPUs are actually being used?

I use this for my train loader:

train_loader = torch.utils.data.DataLoader(train_set,
                                           batch_size=64,
                                           shuffle=True,
                                           num_workers=0)

Any ideas or tricks to speed up the process?

nn.DataParallel is not the recommended approach for data parallelism, as it can create imbalanced memory usage across the devices and is slower than DistributedDataParallel (DDP). Use DDP with a single process per GPU (even on a single node) for better performance.
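
For reference, a minimal single-node DDP sketch could look like the code below (launched with torchrun; build_model, train_set, and num_epochs are placeholders for your own model, dataset, and schedule, not part of your script):

# Minimal single-node DDP sketch, launched with e.g.:
#   torchrun --nproc_per_node=2 train_ddp.py
# build_model, train_set, and num_epochs are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = build_model().cuda(local_rank)           # placeholder model
    model = DDP(model, device_ids=[local_rank])

    sampler = DistributedSampler(train_set)          # placeholder dataset
    loader = DataLoader(train_set, batch_size=64, sampler=sampler,
                        num_workers=4, pin_memory=True)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(num_epochs):                  # placeholder schedule
        sampler.set_epoch(epoch)                     # reshuffle shards each epoch
        for data, target in loader:
            data = data.cuda(local_rank, non_blocking=True)
            target = target.cuda(local_rank, non_blocking=True)
            optimizer.zero_grad()
            loss = criterion(model(data), target)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Each process owns one GPU and its own shard of the dataset, so there is no single device gathering the outputs as in nn.DataParallel.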

nvidia-smi shows the GPU utilization as well as the memory usage of each device and is often good enough as a quick check of how many devices are being used.
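
If you also want to check it from inside the script, a quick sketch like the following prints the memory allocated on each device; after a few iterations both GPUs should report non-zero values if DataParallel really uses them:

import torch

# Quick in-script check: run after a few forward/backward passes.
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} ({torch.cuda.get_device_name(i)}): "
          f"{torch.cuda.memory_allocated(i) / 1024**2:.1f} MiB allocated")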

Generally, check the performance guide for some common issues and recommendations.
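
As a starting point, the suggestions from the guide that would likely matter most here are loading data with multiple workers, pinning host memory, enabling cudnn benchmark mode, and mixed-precision training. A rough sketch (the num_workers value is an assumption to be tuned; model, optimizer, criterion, and device refer to the ones in your script):

import torch
from torch.utils.data import DataLoader
from torch.cuda.amp import autocast, GradScaler

# Illustrative values only:
# - num_workers > 0 and pin_memory=True keep the GPUs fed,
# - cudnn benchmark mode helps for fixed input shapes,
# - automatic mixed precision reduces compute and memory cost.
torch.backends.cudnn.benchmark = True

train_loader = DataLoader(train_set,        # train_set as in your post
                          batch_size=64,
                          shuffle=True,
                          num_workers=8,    # assumption: tune to your CPU cores
                          pin_memory=True)

scaler = GradScaler()
for data, target in train_loader:
    data = data.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)
    optimizer.zero_grad()
    with autocast():
        loss = criterion(model(data), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()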

Thanks a lot for your response! I’ll try to implement DDP. In my current script, do you think nn.DataParallel speeds up the process, or could it even slow it down?

I would guess the current approach could yield a speedup, but you would need to profile the use case on your system to check it.
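
If you want to measure it, a sketch like the following runs a few iterations under torch.profiler and shows where the time goes (model, criterion, optimizer, device, and train_loader are assumed to be the ones from your script):

import torch
from torch.profiler import profile, ProfilerActivity

# Rough profiling sketch: time a handful of training iterations and see
# whether data loading, CPU ops, or GPU kernels dominate.
model.train()
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for i, (data, target) in enumerate(train_loader):
        if i == 10:          # a few iterations are enough for a first look
            break
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))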