I am training a GAN drawing samples from LMDBs. I have multiple LMDBs - I create a dataset for each and concatenate them to make the final one. I was initially training with about 4K samples split across 8 LMDBs, and recently scaled up to about 60K samples split across 20 LMDBs. I noticed a significant slowdown in training after increasing the dataset size (measured after 5 epochs, since I know the initial epoch can be slow due to caching, initialization, etc.).
I’m trying to figure out why the training speed (images/second) is 5-10x slower per iteration on the large dataset than on the smaller one. I observed that as I reduce the size of the training set, training progressively gets faster until it reaches the expected speed. Up to ~30K images there is no slowness; beyond that, speeds drop dramatically. It doesn’t appear to matter which images are in the reduced dataset - as long as the number of images is above ~30K, the training speed drops dramatically.
I’m training on 4 V100’s on a single node, using DataParallel() and apex AMP. Using PyTorch 1.4. My minibatches are ~96. Increasing num_workers beyond 6 doesn’t seem to help. Whether I use 1, 2, or 4 GPUs at a time, I still observe the “slower training with more data” phenomenon. The only difference between fast training and slow training in my testing is the number of images included in the dataset. Any suggestions appreciated. Thanks!
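For context, the multi-LMDB setup described above can be sketched roughly like this. The `LMDBDataset` class, paths, and per-shard counts here are placeholders, not the poster's actual code:

```python
from torch.utils.data import ConcatDataset, DataLoader, Dataset


class LMDBDataset(Dataset):
    """Placeholder standing in for the poster's per-LMDB dataset."""

    def __init__(self, path, n=3000):
        self.path, self.n = path, n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return idx  # real code would decode an image from the LMDB


# One dataset per LMDB, concatenated into the final training dataset
datasets = [LMDBDataset(f"/data/shard_{i}.lmdb") for i in range(20)]
full_dataset = ConcatDataset(datasets)  # 20 x 3000 = 60000 samples

loader = DataLoader(full_dataset, batch_size=96, shuffle=True, num_workers=0)
```

With `batch_size=96` and the default `drop_last=False`, one epoch over the 60K concatenated samples is 625 iterations.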
Try to visualise this:
Imagine you create a for loop that iterates through an array X, computes the negative exponent of the square of each value, and appends the result to a new array. If array X holds a few hundred values, the loop finishes in a few microseconds; but if it holds a hundred million values, it might take more than a minute, because it has to iterate a hundred million times.
So when your dataset increases to 60K with a minibatch of 96, your training loop iterates through the dataset 60K/96 ≈ 625 times per epoch, versus 30K/96 ≈ 312 times per epoch with a 30K dataset.
Also, in case you are wondering why training time with a 4K dataset is almost the same as with 10K or 15K but increases drastically beyond 30K: it is the same thing that happens when a for loop iterates over an array of size 100 versus 1000. To the human eye the timings look the same, but to the machine there is a real difference, and as the data size grows the difference eventually becomes visible to a human. Try using the time module to record the start and stop times of the iteration with different dataset sizes and you will see the difference.
Hope this helped.
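The timing experiment suggested above can be done with the time module. A minimal sketch (the function name and array sizes are illustrative):

```python
import math
import time


def neg_exp_of_square(xs):
    """Compute exp(-x^2) for each value, appending to a new list."""
    return [math.exp(-x * x) for x in xs]


# Time the same loop at different array sizes
for n in (100, 1000, 100_000):
    xs = list(range(n))
    start = time.perf_counter()
    neg_exp_of_square(xs)
    elapsed = time.perf_counter() - start
    print(f"n={n}: {elapsed:.6f}s")
```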
This could be due to the variable holding the training data consuming a lot of CPU memory in the background, making computation during training somewhat more difficult.
It’s like trying to copy a 2GB file onto a hard drive with only 2.1GB of free space, compared to a hard drive with 6GB free.
If you’ve coded in assembly before, you’ll most likely know what I’m talking about.
Hope this helps.
Sounds to me like some OS memory management issue (i.e., data swapping to/from disk). The fact that you have several hundred GB of free RAM during training is strange, though. How are you measuring the free RAM?
To sandbox the issue, do you observe similar symptoms when you don’t use Amp and/or DataParallel at all?
I have some questions regarding the perf measurements.
Have you profiled which step is the bottleneck in every iteration? Is it data loading, or forward, or backward, or optimizer?
For GPU training, time.time() might not give the accurate measure, as CUDA ops return immediately after it is added to the stream. To get more accurate numbers, elapsed_time is a better option. The following post can serve as an example:
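A minimal sketch of event-based timing with a CPU fallback; the `timed` helper is hypothetical, not from any post in this thread:

```python
import torch


def timed(fn, *args):
    """Time a call with CUDA events when a GPU is available.

    CUDA ops are asynchronous, so wall-clock timing requires either
    events or an explicit synchronize before reading the clock.
    """
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        out = fn(*args)
        end.record()
        torch.cuda.synchronize()  # events complete asynchronously
        return out, start.elapsed_time(end)  # milliseconds
    import time
    t0 = time.perf_counter()
    out = fn(*args)
    return out, (time.perf_counter() - t0) * 1000.0
```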
This is on an nvidia DGX2. I confirmed that all data is cached on the NVMe cache (as the cache does not grow as training progresses)
I removed DataParallel()/DDP()/AMP, and am training on one GPU as @mcarilli suggested
All hyperparams including minibatch and n_workers for the dataloader are kept the same for all tests
I invoke time.time() after each iteration to get the seconds per iteration.
While training with a small dataset (4k samples), it takes 1.2 seconds per iteration, and that speed is consistent after tens of thousands of iterations
While training with a large dataset (65k samples), it takes an average of 2 seconds per iteration. The way the slowness manifests is a handful of “fast” iteration at 1.2 seconds/iter, followed by a slow iteration that takes 4-10 seconds. It averages out close to 2 seconds/iter. When scaling up to many GPUs, the speed penalty (average seconds/iteration) with the large dataset is 5x or more (and only a negligible speed penalty with a small dataset).
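The per-iteration numbers described above can be collected with a small helper like this (a sketch, not the poster's actual training loop; `step_fn` stands in for the forward/backward/optimizer step):

```python
import time


def iteration_times(loader, step_fn):
    """Return wall-clock seconds per iteration for one pass over `loader`.

    With CUDA, call torch.cuda.synchronize() inside step_fn before it
    returns, so pending GPU work is included in each measurement.
    """
    times = []
    t0 = time.time()
    for batch in loader:
        step_fn(batch)
        t1 = time.time()
        times.append(t1 - t0)
        t0 = t1
    return times
```

Plotting or averaging these times makes the “several fast iterations followed by one slow one” pattern easy to see.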
To profile the RAM, I am calling the watch free command. It stays fairly consistent (doesn’t blow up as training progresses):
              total        used        free      shared  buff/cache   available
Mem:     1583405972    32676452     1541340      351212  1549188180  1548865308
Swap:       3904508      279808     3624700
I imagine the slowness is from loading the data, but I will also try the elapsed_time-based profiling that @mrshenli suggested
@jspisak I have not yet filed a GitHub issue - let me know if I should do that
@mcarilli Thanks for the reply! On one GPU, no DP/DDP/Amp with 6 workers I observe the slowness with the large (65k sample) dataset, but increasing the number of workers to 12 removes the slowness (e.g. training with the large dataset is just as fast as training with the smaller dataset). Wrapping the networks with DataParallel() but keeping it at 1GPU/12 workers/minibatch=12 did not cause any slowness. When I scale up to 2 GPUs, num_workers=24 and using minibatch=24, I start to see the slowness again. The pattern of slow training is as described above - several “fast” iterations followed by a slow iteration of 5-50 seconds, and this behavior persists for tens of thousands of iters.
I tried using AMP on 1 GPU, effectively increasing the minibatch size from 12 to 30, and I observed the slow training with the large dataset even after increasing the number of workers to 24. With a small dataset of ~4K samples, only 3 workers are required and I get consistently fast speeds.
The machine has 96 CPUs and 16 GPUs so I would think 6 workers per GPU would be ideal. When working with the small dataset, only 3 workers are required to get peak training speeds.
The dataset is 2.8TB total spanning approx. 40 LMDBs
I have set torch.set_num_threads(96) and torch.set_num_interop_threads(48) at the beginning of the training script.
Whether I set pin_memory=True or False doesn’t change the behavior. I’m passing in the SequentialSampler() into the dataloader, but the slow training still happens if I set sampler=False and shuffle=True.
Here’s how I open each LMDB in the dataset’s __init__ class: self.env = lmdb.open(lmdbPath, readonly=True, lock=False, readahead=False, meminit=False, max_readers=512)
readahead=True made things worse, meminit doesn’t appear to be a factor, and increasing max_readers didn’t help either
Here’s how I am reading imgs from the LMDB in the dataset’s __getitem__ method:
with env.begin(write=False) as txn:
buf = txn.get(path.encode('ascii'))
buf_meta = txn.get((path + '.meta').encode('ascii')).decode('ascii')
Please do let me know of any suggestions! I think the key finding here is that only 3 workers are required for top training speeds with a small dataset and 1 GPU, but 12 were required with the large dataset (with same minibatch/hyperparams). With effectively larger minibatches from multi-GPU or AMP, I still saw slowness after increasing the num_workers beyond 12.