I am training a GAN drawing samples from LMDBs. I have multiple LMDBs - I create a dataset for each and concatenate them to make the final one. I was initially training with about 4K samples split across 8 LMDBs, and recently scaled up to about 60K samples split across 20 LMDBs. I noticed a significant slowdown in training after increasing the dataset size (measured after 5 epochs, since I know the initial epoch can be slow due to caching, initialization, etc.).
I’m trying to figure out why the training speed (images/second) is 5-10x slower per iteration on the large dataset than on the smaller one. I observed that as I reduce the size of the training set, training progressively gets faster until it reaches the expected speed. Up to ~30K images there is no slowness; beyond that, speeds drop dramatically. It doesn’t appear to matter which images are in the reduced dataset - as long as the number of images is above ~30K, the training speed drops dramatically.
I’m training on 4 V100’s on a single node, using DataParallel() and apex AMP. Using PyTorch 1.4. My minibatches are ~96. Increasing num_workers beyond 6 doesn’t seem to help. Whether I use 1, 2, or 4 GPUs at a time, I still observe the “slower training with more data” phenomenon. The only difference between fast training and slow training in my testing is the number of images included in the dataset. Any suggestions appreciated. Thanks!
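For context, the multi-LMDB setup described above can be sketched roughly like this. The `LMDBDataset` class, paths, and per-shard counts here are placeholders, not the poster's actual code:

```python
from torch.utils.data import ConcatDataset, DataLoader, Dataset


class LMDBDataset(Dataset):
    """Placeholder standing in for the poster's per-LMDB dataset."""

    def __init__(self, path, n=3000):
        self.path, self.n = path, n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return idx  # real code would decode an image from the LMDB


# One dataset per LMDB, concatenated into the final training dataset
datasets = [LMDBDataset(f"/data/shard_{i}.lmdb") for i in range(20)]
full_dataset = ConcatDataset(datasets)  # 20 x 3000 = 60000 samples

loader = DataLoader(full_dataset, batch_size=96, shuffle=True, num_workers=0)
```

With `batch_size=96` and the default `drop_last=False`, one epoch over the 60K concatenated samples is 625 iterations.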
Try to visualise this:
Imagine you create a for loop that iterates through an array X, computes the negative exponent of the square of each value, and appends the result to a new array. If array X holds a few hundred values, the loop finishes in a few microseconds; but if it holds a hundred million values, it might take more than a minute, because it has to iterate a hundred million times.
So when your dataset increases to 60K with a minibatch of 96, your training loop iterates through the dataset 60K/96 ≈ 625 times per epoch, versus 30K/96 ≈ 312 times per epoch with a 30K dataset.
Also, in case you are wondering why training time with a 4K dataset is almost the same as with 10K or 15K but increases drastically beyond 30K: it is the same thing that happens when a for loop iterates over an array of size 100 versus 1000. To the human eye the timings look the same, but to the machine there is a real difference, and as the data size grows the difference eventually becomes visible to a human. Try using the time module to record the start and stop times of the iteration with different dataset sizes and you will see the difference.
Hope this helped.
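The timing experiment suggested above can be done with the time module. A minimal sketch (the function name and array sizes are illustrative):

```python
import math
import time


def neg_exp_of_square(xs):
    """Compute exp(-x^2) for each value, appending to a new list."""
    return [math.exp(-x * x) for x in xs]


# Time the same loop at different array sizes
for n in (100, 1000, 100_000):
    xs = list(range(n))
    start = time.perf_counter()
    neg_exp_of_square(xs)
    elapsed = time.perf_counter() - start
    print(f"n={n}: {elapsed:.6f}s")
```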
This could be due to the variable holding the training data consuming a lot of CPU memory in the background, making computation during training somewhat more difficult.
It’s like trying to copy a 2GB file onto a hard drive with only 2.1GB of free space, compared to a hard drive with 6GB free.
If you’ve coded in assembly before, you’ll most likely know what I’m talking about.
Hope this helps.
Sounds to me like some OS memory management issue (i.e., data swapping to/from disk). The fact that you have several hundred GB of free RAM during training is strange, though. How are you measuring the free RAM?
To sandbox the issue, do you observe similar symptoms when you don’t use Amp and/or DataParallel at all?
I have some questions regarding the perf measurements.
Have you profiled which step is the bottleneck in every iteration? Is it data loading, or forward, or backward, or optimizer?
For GPU training, time.time() might not give the accurate measure, as CUDA ops return immediately after it is added to the stream. To get more accurate numbers, elapsed_time is a better option. The following post can serve as an example:
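A minimal sketch of event-based timing with a CPU fallback; the `timed` helper is hypothetical, not from any post in this thread:

```python
import torch


def timed(fn, *args):
    """Time a call with CUDA events when a GPU is available.

    CUDA ops are asynchronous, so wall-clock timing requires either
    events or an explicit synchronize before reading the clock.
    """
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        out = fn(*args)
        end.record()
        torch.cuda.synchronize()  # events complete asynchronously
        return out, start.elapsed_time(end)  # milliseconds
    import time
    t0 = time.perf_counter()
    out = fn(*args)
    return out, (time.perf_counter() - t0) * 1000.0
```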
This is on an nvidia DGX2. I confirmed that all data is cached on the NVMe cache (as the cache does not grow as training progresses)
I removed DataParallel()/DDP()/AMP, and am training on one GPU as @mcarilli suggested
All hyperparams including minibatch and n_workers for the dataloader are kept the same for all tests
I invoke time.time() after each iteration to get the seconds per iteration.
While training with a small dataset (4k samples), it takes 1.2 seconds per iteration, and that speed is consistent after tens of thousands of iterations
While training with a large dataset (65k samples), it takes an average of 2 seconds per iteration. The way the slowness manifests is a handful of “fast” iteration at 1.2 seconds/iter, followed by a slow iteration that takes 4-10 seconds. It averages out close to 2 seconds/iter. When scaling up to many GPUs, the speed penalty (average seconds/iteration) with the large dataset is 5x or more (and only a negligible speed penalty with a small dataset).
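The per-iteration numbers described above can be collected with a small helper like this (a sketch, not the poster's actual training loop; `step_fn` stands in for the forward/backward/optimizer step):

```python
import time


def iteration_times(loader, step_fn):
    """Return wall-clock seconds per iteration for one pass over `loader`.

    With CUDA, call torch.cuda.synchronize() inside step_fn before it
    returns, so pending GPU work is included in each measurement.
    """
    times = []
    t0 = time.time()
    for batch in loader:
        step_fn(batch)
        t1 = time.time()
        times.append(t1 - t0)
        t0 = t1
    return times
```

Plotting or averaging these times makes the “several fast iterations followed by one slow one” pattern easy to see.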
To profile the RAM, I am calling the watch free command. It stays fairly consistent (doesn’t blow up as training progresses):
              total        used        free      shared  buff/cache   available
Mem:     1583405972    32676452     1541340      351212  1549188180  1548865308
Swap:       3904508      279808     3624700
I imagine the slowness is from loading the data, but I will also try the elapsed_time-based profiling that @mrshenli suggested
@jspisak I have not yet filed a GitHub issue - let me know if I should do that
@mcarilli Thanks for the reply! On one GPU, no DP/DDP/Amp with 6 workers I observe the slowness with the large (65k sample) dataset, but increasing the number of workers to 12 removes the slowness (e.g. training with the large dataset is just as fast as training with the smaller dataset). Wrapping the networks with DataParallel() but keeping it at 1GPU/12 workers/minibatch=12 did not cause any slowness. When I scale up to 2 GPUs, num_workers=24 and using minibatch=24, I start to see the slowness again. The pattern of slow training is as described above - several “fast” iterations followed by a slow iteration of 5-50 seconds, and this behavior persists for tens of thousands of iters.
I tried using AMP on 1 GPU, effectively increasing the minibatch size from 12 to 30, and I observed the slow training with the large dataset even after increasing the number of workers to 24. With a small dataset of ~4K samples, only 3 workers are required and I get consistently fast speeds.
The machine has 96 CPUs and 16 GPUs so I would think 6 workers per GPU would be ideal. When working with the small dataset, only 3 workers are required to get peak training speeds.
The dataset is 2.8TB total spanning approx. 40 LMDBs
I have set torch.set_num_threads(96) and torch.set_num_interop_threads(48) at the beginning of the training script.
Whether I set pin_memory=True or False doesn’t change the behavior. I’m passing in the SequentialSampler() into the dataloader, but the slow training still happens if I set sampler=False and shuffle=True.
Here’s how I open each LMDB in the dataset’s __init__ class: self.env = lmdb.open(lmdbPath, readonly=True, lock=False, readahead=False, meminit=False, max_readers=512)
readahead=True made things worse, meminit doesn’t appear to be a factor, and increasing max_readers didn’t help either
Here’s how I am reading imgs from the LMDB in the dataset’s __getitem__ method:
with env.begin(write=False) as txn:
buf = txn.get(path.encode('ascii'))
buf_meta = txn.get((path + '.meta').encode('ascii')).decode('ascii')
Please do let me know of any suggestions! I think the key finding here is that only 3 workers are required for top training speeds with a small dataset and 1 GPU, but 12 were required with the large dataset (with same minibatch/hyperparams). With effectively larger minibatches from multi-GPU or AMP, I still saw slowness after increasing the num_workers beyond 12.