I am training a model with DDP on 8 GPUs (one process per GPU), with a DataLoader using 8 data loading workers. Training is slow during the first epoch and then speeds up significantly from the second epoch onwards. Has anyone seen this before?
Yes, this is not unusual. The first epoch is usually the slowest because everything has to be moved onto the GPU first. After that the model and gradients are already on the GPU, so it is much faster.
Do you mean the first batch? I’m talking about the first pass through the entire dataset.
Is your system using some caching for the data loading?
In the past I’ve seen servers that load data over the network and use a built-in caching mechanism to avoid the network transfer: once all samples have been read, the cached files are served from a fast SSD installed in the server.
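A cache-on-first-read layer like that can be sketched in plain Python. This is a minimal illustration, not any specific server stack; the `CachingLoader` name and its methods are hypothetical:

```python
import os
import shutil


class CachingLoader:
    """Copy each file to a local cache directory on first access.

    The first pass pays the (slow) copy cost once per file; every
    later pass reads the fast local copy instead. This mirrors why a
    caching setup makes the first epoch slower than the rest.
    """

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def load(self, source_path):
        local_path = os.path.join(self.cache_dir, os.path.basename(source_path))
        if not os.path.exists(local_path):
            # First epoch: slow path, e.g. a network transfer.
            shutil.copy(source_path, local_path)
        # Later epochs: fast local read.
        with open(local_path, "rb") as f:
            return f.read()
```

If a mechanism like this is in play, the first epoch is bounded by the slow source and later epochs by the local disk, which matches the symptom described.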
I am talking about the first pass through the entire dataset too. That is the first time the data is loaded onto the GPU, so it should take the longest. At least it does for me.
No, based on timing I think the issue is in the synchronization between workers during the forward and backward passes.
It turns out the issue was benchmark=True: the model receives variable-length input tensors, so new shapes keep triggering fresh benchmark runs, which led to the slowdown. With it set to False, the issue was resolved.
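Assuming this refers to the cuDNN autotuner flag (the post only says "benchmark"), the fix is a one-line config change. The flag is only a win when input shapes are static, since each distinct shape triggers its own benchmarking pass:

```python
import torch

# cuDNN autotuning picks the fastest convolution algorithm per input
# shape, but every *new* shape pays a benchmarking cost up front.
# With variable-length inputs that cost recurs throughout the first
# epoch, so disabling it is usually faster for dynamic shapes.
torch.backends.cudnn.benchmark = False
```

For fixed-shape workloads (e.g. fixed-size images) leaving it at True is typically the better trade-off.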