I have a training scheme where the training data is spread across many sources.
I’ve defined a dataset per source, ending up with many (30-60) datasets, which I combine with torch.utils.data.ConcatDataset.
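Roughly, the setup looks like the following sketch (the sizes, batch size, and worker count are placeholders, and TensorDataset just stands in for the real per-source datasets):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Stand-in for the real per-source datasets (30-60 of them in the actual run);
# TensorDataset is only a placeholder here.
num_sources = 40
source_datasets = [
    TensorDataset(torch.randn(1000, 64), torch.randint(0, 10, (1000,)))
    for _ in range(num_sources)
]

# All sources are combined into one dataset for training.
train_dataset = ConcatDataset(source_datasets)

train_loader = DataLoader(
    train_dataset,
    batch_size=32,   # placeholder value
    shuffle=True,
    num_workers=20,  # worker processes prepare batches in parallel
    pin_memory=True,
)
```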
I noticed a slowdown while watching the GPU load during training: once in a while the GPU load drops to 0 (not periodically).
I tried looking at transform time vs. number of workers, but everything seems to work well (when the work is spread across all workers, the get_batch time is smaller than the network computation time).
Everything seems to run smoothly (GPU well loaded) until the number of datasets goes above a certain threshold (e.g. 30).
Then I experience the non-periodic slowdown.
I thought about memory… but it seems like I have enough free memory during the run.
I was hoping one of you might have an idea of where I should look next.
Are you lazily loading the data in all “small” datasets or are you preloading the complete data?
In the former case, could you check whether a specific Dataset added to the ConcatDataset creates the slowdown?
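A minimal way to check this could be to time __getitem__ of each sub-dataset separately, before wrapping them in ConcatDataset, along these lines (the list name and the placeholder datasets are just assumptions for illustration):

```python
import random
import time
import torch
from torch.utils.data import TensorDataset

# Placeholder list of per-source datasets; in the real code this would be the
# list that is passed to ConcatDataset.
source_datasets = [TensorDataset(torch.randn(1000, 64)) for _ in range(40)]

# Time a handful of random samples from each sub-dataset separately.
for i, ds in enumerate(source_datasets):
    indices = random.sample(range(len(ds)), k=min(100, len(ds)))
    start = time.perf_counter()
    for idx in indices:
        _ = ds[idx]
    elapsed = time.perf_counter() - start
    print(f"dataset {i}: {elapsed / len(indices) * 1e3:.3f} ms per sample")
```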
Each dataset has access to a preloaded lidar point cloud, saved in frames; at run time the relevant frame is returned. All data from all dataloaders fits in memory (I thought it was a memory-loading issue… but no such luck).
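For illustration, each per-source dataset is conceptually something like this sketch (class name, frame count, and the random placeholder data are hypothetical):

```python
import torch
from torch.utils.data import Dataset

class PreloadedFrameDataset(Dataset):
    """Hypothetical sketch of one per-source dataset: all point-cloud frames
    for the source are held in memory, and __getitem__ only indexes into them."""

    def __init__(self, num_frames=1000, points_per_frame=2048):
        # Preload all frames for this source up front (random placeholder data).
        self.frames = [torch.randn(points_per_frame, 4) for _ in range(num_frames)]

    def __len__(self):
        return len(self.frames)

    def __getitem__(self, idx):
        # At run time only the relevant, already-loaded frame is returned.
        return self.frames[idx]
```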
Dependency on a specific dataset:
I’ve tried checking for a dependency on a specific dataset, but found no correlation.
I also tried duplicating a single dataset N times (e.g. N = 30) and was able to reproduce the effect (see the sketch below).
I’m now trying to see whether a larger N for the single-dataset duplication leads to a longer delay.
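The duplication experiment is essentially the following (single_dataset is a random placeholder standing in for one real per-source dataset):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Stand-in for one real per-source dataset.
single_dataset = TensorDataset(torch.randn(1000, 64), torch.randint(0, 10, (1000,)))

# Duplicate the same dataset N times and train on the concatenation; the
# slowdown reproduces once N is large enough (e.g. 30).
N = 30
duplicated = ConcatDataset([single_dataset] * N)
loader = DataLoader(duplicated, batch_size=32, shuffle=True, num_workers=20)
```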
It looks like you have plenty of free memory, but the swap also seems to be used at least a bit.
You could have a look at this website to change the swappiness.
It seems I have been wrong to say there is no pattern to the slowdown.
Giving it another look, the slowdown seems to correlate with the number of workers.
Working with 20 worker processes (in addition to the main process) gives a slowdown roughly every 20 iterations (not exactly, but close, with some variance).
Moving to 10 processes leads to shorter intervals between slowdowns.
Am I missing some other root cause for my slowdowns? Could it be related to the initialization of the workers?
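One way to make the pattern measurable is to log how long each iteration waits for the next batch, e.g. with a rough sketch like this (the placeholder dataset and the threshold are assumptions):

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset/loader; in the real run this would be the training DataLoader.
dataset = TensorDataset(torch.randn(10000, 64), torch.randint(0, 10, (10000,)))
train_loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=20)

data_times = []
end = time.perf_counter()
for it, (x, y) in enumerate(train_loader):
    # Time spent waiting for the next batch, i.e. pure data-loading time.
    wait = time.perf_counter() - end
    data_times.append(wait)

    # ... forward / backward / optimizer step would go here ...

    # Flag iterations where the wait is much longer than the running average.
    if wait > 5 * (sum(data_times) / len(data_times)):
        print(f"iteration {it}: waited {wait:.3f}s for the next batch")

    end = time.perf_counter()
```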
This pattern would point towards a data loading bottleneck, i.e. the workers are not fast enough at preparing the next batch relative to the model’s training step.
This could be the case if your model workload is small or if your data loading itself is slow, as explained e.g. here.
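If the workers indeed cannot keep up, the usual DataLoader knobs to experiment with would be more workers, a larger prefetch_factor, persistent workers, and pinned memory, roughly as in this sketch (all values are placeholders that would need tuning for the actual setup):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; in the real run this would be the ConcatDataset of all sources.
train_dataset = TensorDataset(torch.randn(10000, 64), torch.randint(0, 10, (10000,)))

loader = DataLoader(
    train_dataset,
    batch_size=32,            # placeholder
    shuffle=True,
    num_workers=32,           # scale until the GPU stays busy (or CPU/IO saturates)
    prefetch_factor=4,        # batches each worker prepares ahead of time (default is 2)
    persistent_workers=True,  # keep workers alive between epochs
    pin_memory=True,          # faster host-to-device copies
)
```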