I am training a GAN drawing samples from LMDBs. I have multiple LMDBs: I create a dataset for each one and concatenate them to form the final training set. I was initially training with about 4K samples split across 8 LMDBs, and recently scaled up to about 60K samples split across 20 LMDBs. I noticed a significant slowdown in training after increasing the dataset size (measured after 5 epochs, since I know the initial epoch can be slow due to caching, initialization, etc.).
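For reference, the per-LMDB datasets are built roughly like this (a stripped-down sketch, not my exact code; the key scheme, paths, and transform are simplified placeholders):

```python
import io

import lmdb
from PIL import Image
from torch.utils.data import ConcatDataset, Dataset
from torchvision import transforms


class LMDBImageDataset(Dataset):
    """Reads encoded images stored under integer string keys ("0", "1", ...)."""

    def __init__(self, lmdb_path, transform=None):
        # readonly + lock=False avoids writer locks; readahead=False can help
        # with random access patterns
        self.env = lmdb.open(lmdb_path, readonly=True, lock=False,
                             readahead=False, max_readers=64)
        with self.env.begin() as txn:
            self.length = txn.stat()["entries"]
        self.transform = transform

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        with self.env.begin() as txn:
            buf = txn.get(str(index).encode("ascii"))
        img = Image.open(io.BytesIO(buf)).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img


# One dataset per LMDB, concatenated into the final training set.
paths = ["/data/shard_00.lmdb", "/data/shard_01.lmdb"]  # placeholder paths
to_tensor = transforms.ToTensor()
train_set = ConcatDataset([LMDBImageDataset(p, transform=to_tensor) for p in paths])
```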
I’m trying to figure out why the training speed (images/second) is 5-10x slower per iteration on the large dataset than on the smaller one. I observed that as I reduce the size of the training set, the training speed progressively recovers until it reaches the expected speed. Up to ~30K images there is no slowness; beyond that, the speed drops dramatically. It doesn’t appear to matter which images are in the reduced dataset; as long as the number of images is above ~30K, training slows down dramatically.
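For what it’s worth, this is roughly how I compute images/second (a simplified sketch; `train_step` is a placeholder for the actual G/D update, and the warm-up count is arbitrary):

```python
import time

import torch

# Time a fixed window of training iterations, skipping the first few so
# warm-up/caching doesn't skew the number.
def measure_images_per_sec(loader, train_step, n_iters=100, warmup=10):
    data_iter = iter(loader)
    for _ in range(warmup):
        train_step(next(data_iter))
    torch.cuda.synchronize()
    start = time.time()
    seen = 0
    for _ in range(n_iters):
        batch = next(data_iter)
        train_step(batch)
        seen += batch.size(0)
    torch.cuda.synchronize()
    return seen / (time.time() - start)
```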
I’m training on 4 V100s on a single node, using DataParallel() and apex AMP, with PyTorch 1.4. My minibatch size is ~96. Increasing num_workers beyond 6 doesn’t seem to help. Whether I use 1, 2, or 4 GPUs, I still observe the “slower training with more data” phenomenon. The only difference between fast and slow training in my testing is the number of images in the dataset. Any suggestions appreciated. Thanks!
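The rest of the setup looks roughly like this (again a sketch; the tiny nets below are placeholders for the real generator/discriminator, and train_set is the ConcatDataset from above):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from apex import amp

# Placeholder models/optimizers standing in for the real G and D.
netG = nn.Sequential(nn.ConvTranspose2d(128, 3, 4, 2, 1)).cuda()
netD = nn.Sequential(nn.Conv2d(3, 1, 4, 2, 1)).cuda()
optG = torch.optim.Adam(netG.parameters(), lr=2e-4, betas=(0.5, 0.999))
optD = torch.optim.Adam(netD.parameters(), lr=2e-4, betas=(0.5, 0.999))

# apex AMP is initialized before wrapping the models in DataParallel.
[netG, netD], [optG, optD] = amp.initialize([netG, netD], [optG, optD],
                                            opt_level="O1")
netG = nn.DataParallel(netG)  # 4 V100s, single node
netD = nn.DataParallel(netD)

loader = DataLoader(train_set, batch_size=96, shuffle=True,
                    num_workers=6, pin_memory=True, drop_last=True)
```

The training loop itself is the standard alternating D/G update, with `with amp.scale_loss(loss, opt):` around each backward pass.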