Hi everyone. I am training a transformer model. Since my data is huge, I split it into a number of files, each containing tensors. I then use threads to open those files and feed their tensors one by one into a queue, and a generator over this queue serves as __iter__ in my dataset class.
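For context, this is roughly what the pipeline looks like (a simplified sketch, not my exact code; the shard pattern, per-file format, and thread count are placeholders):

```python
import glob
import queue
import threading

import torch
from torch.utils.data import DataLoader, IterableDataset


class QueueDataset(IterableDataset):
    """Reader threads push tensors from pre-serialized shard files into a
    bounded queue; __iter__ simply drains that queue."""

    def __init__(self, file_pattern, num_threads=4, max_queue_size=1000):
        self.files = sorted(glob.glob(file_pattern))  # hypothetical shard layout
        self.num_threads = num_threads
        self.q = queue.Queue(maxsize=max_queue_size)

    def _reader(self, files):
        for path in files:
            for tensor in torch.load(path):  # assumes each file holds a list of tensors
                self.q.put(tensor)
        self.q.put(None)  # sentinel: this reader thread is done

    def __iter__(self):
        for i in range(self.num_threads):
            threading.Thread(
                target=self._reader,
                args=(self.files[i::self.num_threads],),
                daemon=True,
            ).start()
        finished = 0
        while finished < self.num_threads:
            item = self.q.get()
            if item is None:
                finished += 1
            else:
                yield item


# loader = DataLoader(QueueDataset("shards/*.pt"), batch_size=32)
```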
My model has around 40M parameters (4 layers, d_model 512, d_feedforward 2048) and the batches are of size 32. Each batch takes 2.5 seconds with mixed precision training and activation checkpointing (to prevent CUDA out-of-memory errors), so each epoch (~10,000 iterations) takes around 7 hours. My GPU is an RTX 4080 with 16 GB of memory.
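Concretely, my training step looks roughly like the following sketch (not my actual code: the model construction, loss, and hyperparameters are placeholders that just match the sizes above):

```python
import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast
from torch.utils.checkpoint import checkpoint

# Placeholder model matching the stated sizes: 4 layers, d_model=512, ff=2048.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048,
                                   batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=4).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = GradScaler()


def train_step(batch):  # batch: (32, seq_len, 512), already on the GPU
    optimizer.zero_grad(set_to_none=True)
    with autocast():
        # checkpoint() discards intermediate activations and recomputes them
        # during backward, trading extra compute for lower memory use.
        out = checkpoint(model, batch, use_reentrant=False)
        loss = out.pow(2).mean()  # placeholder loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```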
My take is that this training is slow: I read somewhere that a similar model took 2 days for 40 epochs, whereas mine would take about 13 days. I'm skeptical about my setup. Can you provide some insights? Is this parallel data-feeding approach legitimate, or am I doing something wrong here?
I would recommend profiling the code first to understand where the current bottleneck is before trying to optimize anything. For example, if data loading is indeed the bottleneck, you might want to profile it separately to check whether the multi-threading approach is causing the slow execution time.
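A few iterations under torch.profiler will already show whether the time goes into data fetching, the forward pass, or the backward pass. A minimal sketch (where `loader` and `train_step` stand in for your existing data loader and training step):

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Profile a handful of training iterations on CPU and GPU.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    for step, batch in enumerate(loader):
        train_step(batch.cuda(non_blocking=True))
        if step >= 10:  # a few iterations are enough for a first look
            break

# Show the operators that dominate GPU time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```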
I don’t think it’s the data loading, since the queue has a max size of 1000 tensors while each batch is only 32, so there’s no waiting on the queue, considering the 2.5 seconds per batch (unless the threads themselves are somehow slowing everything down).
Also, mixed precision training reduces the time from 2.9 to 2.5 seconds per batch.
Is it possible that torch.utils.checkpoint is the culprit?
Or is 2.5 seconds normal for this model with these parameters?
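One quick check I could run is a rough A/B timing of a single forward+backward on synthetic data, with and without checkpointing (`model` below stands in for my encoder, and the batch shape is made up):

```python
import time

import torch
from torch.utils.checkpoint import checkpoint


def avg_step_time(use_checkpoint, iters=20, warmup=5):
    """Rough forward+backward timing; `model` is a stand-in for my encoder."""
    x = torch.randn(32, 128, 512, device="cuda")  # made-up batch shape
    for i in range(warmup + iters):
        if i == warmup:
            torch.cuda.synchronize()
            start = time.perf_counter()
        out = checkpoint(model, x, use_reentrant=False) if use_checkpoint else model(x)
        out.pow(2).mean().backward()
        model.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


print("with checkpointing:   ", avg_step_time(True))
print("without checkpointing:", avg_step_time(False))
```

As far as I understand, activation checkpointing typically adds roughly one extra forward pass to the backward, so a much larger gap than that would point to something else.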
I would recommend profiling your workload instead of guessing to understand which part of the code is slow.
I did profile my model. Most of the time is spent in the backward pass, which, given the activation checkpointing, makes sense. I need to get an idea of whether 2.5 seconds per batch is normal for a model with ~40M parameters on this GPU.
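To be extra sure the queue isn’t a factor, I also plan to split each iteration into data-wait time and GPU compute time with CUDA events, roughly like this (`loader` and `train_step` refer to my existing loader and training step; the iteration count is arbitrary):

```python
import time

import torch

# CUDA events so that asynchronous kernel launches are measured correctly.
start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

data_s, compute_s, iters = 0.0, 0.0, 50
it = iter(loader)
for _ in range(iters):
    t0 = time.perf_counter()
    batch = next(it).cuda(non_blocking=True)  # time spent waiting on the queue
    data_s += time.perf_counter() - t0

    start_evt.record()
    train_step(batch)            # forward + backward + optimizer step
    end_evt.record()
    torch.cuda.synchronize()     # make the event timings valid
    compute_s += start_evt.elapsed_time(end_evt) / 1000.0

print(f"avg data wait: {data_s / iters:.3f}s   avg compute: {compute_s / iters:.3f}s")
```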