I created a custom dataset to load frames from a video and corresponding labels and combined them together using ConcatDataset to create a dataloader.
I noticed that in the first epoch the GPU is well utilized at ~90% usage with n_workers = 8. But from the second epoch on, the average usage drops and there is always a period of no usage; the workers are not feeding the GPU fast enough.
I have a Ryzen 2700X CPU with an RTX 2080 Ti and am loading data from an SSD.
I am wondering if this slowdown has something to do with the CPU (not being overclocked), in which case why does the slowdown not happen until the second epoch? I don't see this effect at any time during the first epoch.
Any help would be appreciated. Thanks.
I was wondering if anyone can add some additional information. I can provide further details; I just don't know how to attack the problem. What I have noticed is that if I cold start the training, the first epoch is always at least 30% faster, and the rest of the epochs take almost the same duration as each other. I don't know if this has something to do with PyTorch on AMD or something in my code.
My first training epoch on a small dataset takes ~90 seconds. After that, the dataloader loop (regardless of whether it is for training or validation), with the same batch size, runs significantly slower. The validation takes 180 seconds and the second training epoch (onward) takes 2~3 times as long as the first epoch. This remains more or less constant. On a large dataset, the effect is more pronounced.
What is noticeable is that in the first epoch the GPU is hardly ever idle, and from the 2nd epoch (onward) it is idle almost all the time. I don't know whether this is a hardware issue or a code issue. I am at my wits' end and hoping someone can help diagnose the issue.
As a test, I stripped the code of all training and just created batches, skipping through the rest of the operations except applying . For batch size = 32, 60 iterations/epoch, and 10 epochs, I noted the following timings:
Epoch 0: 15.64
Epoch 1: 61.19
Epoch 2: 59.28
Epoch 3: 63.12
Epoch 4: 62.46
Epoch 5: 60.43
Epoch 6: 60.72
Epoch 7: 59.05
Epoch 8: 64.78
Epoch 9: 62.96
Same thing with 5 epochs and 300 iterations/epoch:
Epoch 0: 61.79
Epoch 1: 253.94
Epoch 2: 255.45
Epoch 3: 255.12
Epoch 4: 255.57
So the problem is definitely in the data loading part. I should mention that my data is being loaded using a custom dataset from a video. I am not sure why it should affect the performance from the 2nd epoch onward.
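For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a data-loading-only timing loop. It is not the original poster's code: `time_epochs` and `max_iters` are names made up for illustration, and the plain list used as a loader is only a stand-in. In practice you would pass your own `DataLoader`.

```python
import time


def time_epochs(loader, n_epochs, max_iters=None):
    """Iterate over `loader` with no training step; return seconds per epoch."""
    durations = []
    for _ in range(n_epochs):
        start = time.perf_counter()
        for i, _batch in enumerate(loader):
            # Stop after `max_iters` batches if a limit was given.
            if max_iters is not None and i + 1 >= max_iters:
                break
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in for a DataLoader: any object that can be iterated repeatedly works.
fake_loader = [list(range(32)) for _ in range(60)]

for epoch, secs in enumerate(time_epochs(fake_loader, n_epochs=3, max_iters=60)):
    print(f"Epoch {epoch}: {secs:.2f}")
```

Because nothing but batch creation happens inside the loop, any epoch-to-epoch difference in these numbers comes from the loading pipeline rather than from the model or the GPU.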
Did you figure out the problem?
How big is the data?
Nope, not yet. I have narrowed the problem down to data loading.
I created a custom dataset and just wrap it with the PyTorch DataLoader. So I don't know if the problem is with the PyTorch dataloader in general, an issue with the PyTorch dataloader on AMD, or something specific to my hardware configuration.
The data is basically a collection of videos. I randomly choose a video and from that choose a random frame. To do that I create a dataset for each video and combine all of them using ConcatDataset.
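To make the setup concrete, here is a rough sketch of one dataset per video combined with `ConcatDataset`. It is an assumption-laden stand-in, not the code from this thread: `VideoFramesDataset` is a hypothetical class, and the frames are random tensors held in memory instead of being decoded from video files.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, Dataset


class VideoFramesDataset(Dataset):
    """Hypothetical per-video dataset; one item is a (frame, label) pair."""

    def __init__(self, frames, labels):
        # Assumption: frames are already decoded into a tensor of shape (N, C, H, W).
        # A real dataset would decode frame `idx` from the video file instead.
        self.frames = frames
        self.labels = labels

    def __len__(self):
        return len(self.frames)

    def __getitem__(self, idx):
        return self.frames[idx], self.labels[idx]


# Three fake "videos" with different frame counts.
videos = [
    VideoFramesDataset(torch.rand(n, 3, 8, 8), torch.randint(0, 2, (n,)))
    for n in (10, 20, 30)
]
dataset = ConcatDataset(videos)

# num_workers=0 keeps the sketch simple; raise it to load in worker processes.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)

frames, labels = next(iter(loader))
print(len(dataset), tuple(frames.shape), tuple(labels.shape))
```

Note that `shuffle=True` over a `ConcatDataset` samples uniformly over all frames, so longer videos are picked more often. That is not quite the same as first picking a video at random and then a frame from it.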
I solved my problem using a custom dataloader developed by the DKFZ group, https://github.com/MIC-DKFZ/batchgenerators. They have their own version of a multi-threaded dataloader, developed by Fabian Isensee. The timings I now get are more reasonable and “consistent”:
batch size=32 and 60 iterations/epoch and 10 epochs
Epoch 0: 10.64
Epoch 1: 10.19
Epoch 2: 10.12
Epoch 3: 10.19
Epoch 4: 11.27
Epoch 5: 10.87
Epoch 6: 10.48
Epoch 7: 10.3
Epoch 8: 10.23
Epoch 9: 10.48
Same thing with 5 epochs and 300 iterations/epoch:
Epoch 0: 51.95
Epoch 1: 50.35
Epoch 2: 51.69
Epoch 3: 50.38
Epoch 4: 51.71
I’m working on a project using the 20BN-Something-Something dataset. Somehow, I run into exactly the same issue with a baseline that uses a PyTorch dataloader. From the second epoch onward, performance slows down ~2x. I’m going to try the dataloader you suggested, but I’m still curious why the PyTorch dataloader has this issue. Does anyone have a clue? Thanks in advance! @albanD
There are a lot of things that can impact dataloader performance, so it’s hard to say for sure.
Which version of pytorch do you use? Where is the dataset stored physically (local drive or network drive)? What are the other tasks running on the same machine?
We use PyTorch 1.0. The dataset is stored on a high speed local M.2 drive. The task exclusively occupies this server.
We tried a lot of different things but nothing really solved the problem. Our current solution is to downsample the dataset and load the whole dataset into memory at the beginning. The multi-worker data loader is removed from the pipeline and the problem is temporarily solved. Once we need higher resolution or a larger dataset, I will investigate this issue again. Thanks a lot!
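For reference, the workaround described above could look roughly like the sketch below. The tensor sizes, the 32x32 target resolution, and the random data are all assumptions chosen for illustration; the actual pipeline was not posted.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the full-resolution data: 256 frames of 3x64x64 with binary labels.
full_res = torch.rand(256, 3, 64, 64)
labels = torch.randint(0, 2, (256,))

# Downsample once, up front, so the whole dataset fits in memory.
small = F.interpolate(full_res, size=(32, 32), mode="bilinear", align_corners=False)

# Everything is already in RAM, so no worker processes are used (num_workers=0).
dataset = TensorDataset(small, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)

for epoch in range(2):
    n_batches = sum(1 for _ in loader)
    print(f"Epoch {epoch}: {n_batches} batches")
```

The trade-off is the one already mentioned: this only works while the downsampled dataset fits in memory.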
I found a related issue. My data is stored in multiple np.memmap files, and I use a dataloader to access entries in these different np.memmap files. The dataloader shuffles indices at the start of every epoch.
At the beginning of every epoch, the data loader is slow, but then towards the end it speeds up quite a bit. I wonder whether this is because the dataloader clears all the data loaded to main memory at the end of every epoch.
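A setup like this can be sketched as follows. `MemmapDataset`, the file names, and the array shapes are hypothetical, and the sketch neither confirms nor rules out the explanation guessed above. It only shows one common pattern: opening the memmaps lazily on first access, so that each worker process creates its own file handles.

```python
import os
import tempfile

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class MemmapDataset(Dataset):
    """Hypothetical dataset over several np.memmap files with the same row shape."""

    def __init__(self, paths, rows_per_file, row_shape, dtype=np.float32):
        self.paths = paths
        self.rows_per_file = rows_per_file
        self.row_shape = row_shape
        self.dtype = dtype
        self._maps = None  # opened lazily, on first access

    def __len__(self):
        return len(self.paths) * self.rows_per_file

    def __getitem__(self, idx):
        if self._maps is None:
            shape = (self.rows_per_file, *self.row_shape)
            self._maps = [
                np.memmap(p, dtype=self.dtype, mode="r", shape=shape)
                for p in self.paths
            ]
        file_idx, row = divmod(idx, self.rows_per_file)
        # np.array(...) copies the row out of the mapped file into regular memory.
        return torch.from_numpy(np.array(self._maps[file_idx][row]))


# Write two small files so the example is self-contained.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(2):
    path = os.path.join(tmpdir, f"part{i}.dat")
    mm = np.memmap(path, dtype=np.float32, mode="w+", shape=(100, 16))
    mm[:] = i  # file 0 holds zeros, file 1 holds ones
    mm.flush()
    del mm
    paths.append(path)

dataset = MemmapDataset(paths, rows_per_file=100, row_shape=(16,))
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)
batch = next(iter(loader))
print(len(dataset), tuple(batch.shape))
```

With `num_workers > 0`, the DataLoader by default shuts its worker processes down at the end of each epoch and starts new ones for the next, so anything a worker opened is reopened every epoch. Newer PyTorch versions have a `persistent_workers=True` option that keeps the workers alive between epochs; whether that helps in this case is untested here.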