I’ve created a Conditional DCGAN to work on the MNIST data. It involves applying some transformations and reversing them before feeding the data to the Conditional DCGAN for training, so I’ve created a custom dataset that applies custom transformations to MNIST.
However, the training time is too long: one epoch took about 3 hours. I’ve modified the DataLoader to use num_workers=6 and pin_memory=True, but training is still painfully slow. How can I speed this up?
Using num_workers=16 throws an error:
assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process
Is there any other way to increase training speed?
Full code can be found at: https://colab.research.google.com/drive/1gUr54oAwONwqCRWcWc2lMKSQ8ZOJxw0N
Could you try to time your data loading using data_time from the ImageNet example?
Alternatively, could you just time the transformations?
Since they seem to perform some OpenCV workload it might be the bottleneck.
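To make the suggestion concrete, here is a minimal sketch of measuring per-batch data-loading time in the style of the `data_time` meter from the PyTorch ImageNet example. Random tensors stand in for your transformed MNIST samples; substitute your own dataset and DataLoader settings.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data for the transformed MNIST samples (replace with your dataset).
images = torch.randn(1024, 1, 28, 28)
labels = torch.randint(0, 10, (1024,))
loader = DataLoader(TensorDataset(images, labels), batch_size=128)

data_time_total = 0.0
num_batches = 0
end = time.time()
for imgs, lbls in loader:
    # Time spent waiting for the DataLoader to yield this batch.
    data_time_total += time.time() - end
    num_batches += 1
    # ... the actual training step (forward/backward) would run here ...
    end = time.time()

print(f"avg data loading time per batch: {data_time_total / num_batches:.6f}s")
```

If the average data time is close to your total per-iteration time, the loading/transform pipeline is the bottleneck rather than the model itself.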
Is this meant to influence training speed?
If your data loading is the bottleneck, your actual training will have to wait until the next batch is provided, so yes it’ll influence the training time.
Would it help to apply the transformations to MNIST once and save the results for use as and when needed? I thought this would speed up fetching the data. Please let me know your opinion.
If the transformations are static/deterministic, i.e. yielding the same result for each call, then you could store the transformed tensors and just load them (lazily) for your training.
This should speed up the training if the transformations really are the bottleneck, so you should check that first before optimizing code paths unnecessarily.
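A minimal sketch of that caching idea, assuming the transformations are deterministic: apply them once, store the tensors with `torch.save`, and load the cached file on subsequent runs. `my_transform` here is a hypothetical placeholder for the actual OpenCV pipeline, and the random input tensors stand in for the raw MNIST images.

```python
import os
import torch
from torch.utils.data import Dataset, DataLoader

def my_transform(img):
    # Hypothetical placeholder for the real (expensive) transformations.
    return img.float() / 255.0

class CachedDataset(Dataset):
    def __init__(self, cache_path, raw_images=None, raw_labels=None):
        if os.path.exists(cache_path):
            # Reload the previously transformed tensors.
            self.images, self.labels = torch.load(cache_path)
        else:
            # Apply the transformations once, then cache the results.
            self.images = torch.stack([my_transform(x) for x in raw_images])
            self.labels = raw_labels.clone()
            torch.save((self.images, self.labels), cache_path)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]

# Stand-in raw data; replace with the actual MNIST images and targets.
raw = torch.randint(0, 256, (100, 1, 28, 28), dtype=torch.uint8)
lbl = torch.randint(0, 10, (100,))
ds = CachedDataset("mnist_cache.pt", raw, lbl)
loader = DataLoader(ds, batch_size=32)
```

With the tensors precomputed, `__getitem__` becomes a cheap index, so the DataLoader should no longer stall the training loop. For very large datasets you could instead save per-sample files and load them lazily in `__getitem__`.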