DDP training slow during first epoch

Praveen_Srinivasan · January 28, 2021, 12:54am

Hi,
I am training a model with DDP on 8 GPUs (one process per GPU), using a DataLoader with 8 data loading workers. Training is slow during the first epoch and speeds up significantly from the second epoch onwards. Has anyone seen this before?

Dwight_Foster · January 28, 2021, 1:43am

Yes, this is not unusual. The first epoch is always the slowest because everything needs to be loaded onto the GPU first. After that, the model and gradients are already on the GPU, so it is much faster.

Praveen_Srinivasan · January 28, 2021, 3:17am

Do you mean the first batch item? I’m talking about the first pass through the entire dataset.

ptrblck · January 28, 2021, 7:21am

Is your system using some caching for the data loading?
I’ve seen servers that load data over the network and use a built-in caching mechanism: once all samples have been loaded, later reads come from cached files on a fast SSD installed in the server, which avoids the network transfer.

Dwight_Foster · January 28, 2021, 12:08pm

I am talking about the first pass through the entire dataset too. This is the first time the data is loaded onto the GPU, so it should take the longest. Or at least it does for me.

Praveen_Srinivasan · January 28, 2021, 11:52pm

No, based on timing I think the issue is in the synchronization between workers during the forward and backward passes.
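
For anyone who wants to collect this kind of timing, here is a minimal sketch that separates time spent waiting for the next batch from time spent in the training step. The names `time_epoch`, `step`, and `sync` are hypothetical and not from the original post. It assumes `step` runs forward, backward, and the optimizer update for one batch. On a GPU you would likely pass `torch.cuda.synchronize` as `sync`, since CUDA kernels run asynchronously and would otherwise not be fully counted.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def time_epoch(
    batches: Iterable,
    step: Callable[[object], None],
    sync: Optional[Callable[[], None]] = None,
) -> Tuple[float, float]:
    """Return (seconds waiting for data, seconds spent in step) for one epoch."""
    data_s = 0.0
    step_s = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time blocked on the data loader
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)  # hypothetical: forward + backward + optimizer update
        if sync is not None:
            sync()  # e.g. torch.cuda.synchronize, so queued GPU work is counted
        t2 = time.perf_counter()
        data_s += t1 - t0
        step_s += t2 - t1
    return data_s, step_s
```

Comparing the two numbers for the first and second epoch gives a rough hint: if the data-wait time shrinks, loading or caching is the likely cause; if the step time shrinks, the cause is more likely in the compute or synchronization path.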

Praveen_Srinivasan · January 30, 2021, 6:23am

It turns out the issue was torch.backends.cudnn.benchmark = True: the model has variable-length tensors, which led to a slowdown. With it set to False, the issue was resolved.
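
As a sketch of the fix described above: in benchmark mode, cuDNN tries several algorithms the first time it sees a given input configuration and keeps the fastest, so inputs whose shapes keep changing can trigger that search repeatedly. Assuming the flag is set once before training starts (where it belongs in a real script is up to you), turning it off looks roughly like this:

```python
import torch

# Assumption: input shapes (e.g. sequence lengths) vary from batch to batch.
# With benchmark mode on, cuDNN may repeat its algorithm search for each new
# shape, so disable it before the training loop begins.
torch.backends.cudnn.benchmark = False
```

If the input shapes are fixed, leaving benchmark mode on can still be worthwhile, since the search cost is paid only once per shape.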