I have a training scheme where the training data is spread across many sources.
I’ve defined a dataset per source, ending up with many (30-60) datasets, which I combine with torch.utils.data.ConcatDataset.
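Roughly, the setup looks like the following sketch (the sizes, batch size, and worker count are placeholders, and TensorDataset just stands in for the real per-source datasets):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Stand-in for the real per-source datasets (30-60 of them in the actual run);
# TensorDataset is only a placeholder here.
num_sources = 40
source_datasets = [
    TensorDataset(torch.randn(1000, 64), torch.randint(0, 10, (1000,)))
    for _ in range(num_sources)
]

# All sources are combined into one dataset for training.
train_dataset = ConcatDataset(source_datasets)

train_loader = DataLoader(
    train_dataset,
    batch_size=32,   # placeholder value
    shuffle=True,
    num_workers=20,  # worker processes prepare batches in parallel
    pin_memory=True,
)
```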
I noticed a slowdown while watching the GPU load during training: once in a while the GPU load drops to 0 (not periodically).
I tried looking at transform time vs. number of workers, but everything seems to work well (when the work is spread across all workers, the get_batch time is smaller than the network computation time).
Everything seems to run smoothly (GPU well loaded) until the number of datasets goes above a certain threshold (e.g. 30).
Then I experience the non-periodic slowdown.
I thought about memory… but it seems like I have enough free memory during the run.
I was hoping one of you might have an idea of where I should look next.
Are you lazily loading the data in all “small” datasets or are you preloading the complete data?
In the former case, could you check whether a specific Dataset added to the ConcatDataset creates the slowdown?
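A minimal way to check this could be to time __getitem__ of each sub-dataset separately, before wrapping them in ConcatDataset, along these lines (the list name and the placeholder datasets are just assumptions for illustration):

```python
import random
import time
import torch
from torch.utils.data import TensorDataset

# Placeholder list of per-source datasets; in the real code this would be the
# list that is passed to ConcatDataset.
source_datasets = [TensorDataset(torch.randn(1000, 64)) for _ in range(40)]

# Time a handful of random samples from each sub-dataset separately.
for i, ds in enumerate(source_datasets):
    indices = random.sample(range(len(ds)), k=min(100, len(ds)))
    start = time.perf_counter()
    for idx in indices:
        _ = ds[idx]
    elapsed = time.perf_counter() - start
    print(f"dataset {i}: {elapsed / len(indices) * 1e3:.3f} ms per sample")
```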
Each dataset has access to a preloaded lidar point cloud, saved in frames; at run time the relevant frame is returned. All data from all dataloaders fits in memory (I thought it was a memory-loading issue… but no such luck).
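For illustration, each per-source dataset is conceptually something like this sketch (class name, frame count, and the random placeholder data are hypothetical):

```python
import torch
from torch.utils.data import Dataset

class PreloadedFrameDataset(Dataset):
    """Hypothetical sketch of one per-source dataset: all point-cloud frames
    for the source are held in memory, and __getitem__ only indexes into them."""

    def __init__(self, num_frames=1000, points_per_frame=2048):
        # Preload all frames for this source up front (random placeholder data).
        self.frames = [torch.randn(points_per_frame, 4) for _ in range(num_frames)]

    def __len__(self):
        return len(self.frames)

    def __getitem__(self, idx):
        # At run time only the relevant, already-loaded frame is returned.
        return self.frames[idx]
```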
Dependency on a specific dataset:
I’ve tried checking for a dependency on a specific dataset, but found no correlation.
I also tried duplicating a single dataset N times (e.g. N = 30) and was able to reproduce the effect (see the sketch below).
I’m now trying to see whether a larger N for the single-dataset duplication leads to a longer delay.
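The duplication experiment is essentially the following (single_dataset is a random placeholder standing in for one real per-source dataset):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Stand-in for one real per-source dataset.
single_dataset = TensorDataset(torch.randn(1000, 64), torch.randint(0, 10, (1000,)))

# Duplicate the same dataset N times and train on the concatenation; the
# slowdown reproduces once N is large enough (e.g. 30).
N = 30
duplicated = ConcatDataset([single_dataset] * N)
loader = DataLoader(duplicated, batch_size=32, shuffle=True, num_workers=20)
```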
It looks like you have plenty of free memory, but the swap also seems to be used at least a bit.
You could have a look at this website to change the swappiness.
It seems I have been wrong to say there is no pattern to the slowdown.
Giving it another look, the slowdown seems to correlate with the number of workers.
Working with 20 worker processes (in addition to the main process) gives a slowdown roughly every 20 iterations (not exactly, but close, with some variance).
Moving to 10 processes leads to shorter intervals between slowdowns.
Am I missing some other root cause for my slowdowns? Could it be related to the initialization of the workers?
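One way to make the pattern measurable is to log how long each iteration waits for the next batch, e.g. with a rough sketch like this (the placeholder dataset and the threshold are assumptions):

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset/loader; in the real run this would be the training DataLoader.
dataset = TensorDataset(torch.randn(10000, 64), torch.randint(0, 10, (10000,)))
train_loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=20)

data_times = []
end = time.perf_counter()
for it, (x, y) in enumerate(train_loader):
    # Time spent waiting for the next batch, i.e. pure data-loading time.
    wait = time.perf_counter() - end
    data_times.append(wait)

    # ... forward / backward / optimizer step would go here ...

    # Flag iterations where the wait is much longer than the running average.
    if wait > 5 * (sum(data_times) / len(data_times)):
        print(f"iteration {it}: waited {wait:.3f}s for the next batch")

    end = time.perf_counter()
```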
This pattern would point towards a data loading bottleneck, i.e. the workers are not fast enough at preparing the next batch relative to the model’s training step.
This could be the case if your model workload is small or if your data loading itself is slow, as explained e.g. here.
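If the workers indeed cannot keep up, the usual DataLoader knobs to experiment with would be more workers, a larger prefetch_factor, persistent workers, and pinned memory, roughly as in this sketch (all values are placeholders that would need tuning for the actual setup):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; in the real run this would be the ConcatDataset of all sources.
train_dataset = TensorDataset(torch.randn(10000, 64), torch.randint(0, 10, (10000,)))

loader = DataLoader(
    train_dataset,
    batch_size=32,            # placeholder
    shuffle=True,
    num_workers=32,           # scale until the GPU stays busy (or CPU/IO saturates)
    prefetch_factor=4,        # batches each worker prepares ahead of time (default is 2)
    persistent_workers=True,  # keep workers alive between epochs
    pin_memory=True,          # faster host-to-device copies
)
```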