Add automatic tuning flags to utils.data.dataloader

Hi! I’d like to highlight a feature request made on the GitHub repo for automatic tuning of batch_size and num_workers, and start some discussion around this topic.

Much like TensorFlow has introduced the tf.data.experimental.AUTOTUNE flag to automatically tune these parameters, I think this feature would be very relevant for PyTorch users as well.

I have a couple of questions for the community to start building consensus:

  • Have you previously thought about this autotuning flag?
  • If you have thought about it before, what was the blocker to implementing it?
  • If this feature was introduced, would you use it?
  • What parameters do you use for batch_size and num_workers right now, and how do you set them?

It would be interesting to know how many users choose the batch size based on computational performance vs. methodological considerations (e.g. poor training behavior with tiny or huge batch sizes).
For instance, if hypothetically a very small batch size yielded the best speedup, wouldn’t that also require some architectural changes (replacing batch norm layers with group norm, etc.)?
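As a quick illustration of the kind of architectural change meant here (the channel and group counts are arbitrary placeholders): BatchNorm statistics become noisy with tiny per-GPU batches, while GroupNorm does not depend on the batch dimension at all.

```python
import torch.nn as nn

# BatchNorm normalizes over the batch dimension, so its statistics degrade
# when the per-GPU batch is tiny.
bn = nn.BatchNorm2d(64)

# GroupNorm normalizes over channel groups and is independent of batch size.
# 64 channels / 8 groups here are arbitrary example values.
gn = nn.GroupNorm(num_groups=8, num_channels=64)
```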


I think a flag like this might be most useful for inference jobs, where we care exclusively about performance, without regard for training behavior. You’re right - adjusting batch size for users automatically will have side effects for training; but we can avoid this issue by at least narrowing the scope to inference.

Additionally, tuning num_workers would improve performance for both training and inference, especially for large-scale (GB/TB) inference jobs.

Also, experimentally, it seems that large batch sizes tend to yield the best speedups, up until data can no longer fit in memory.
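As a rough illustration of what such a flag might automate, here is a minimal sketch of benchmarking num_workers by hand: time a fixed number of batches for several candidate values and keep the fastest. The dataset, batch size, and candidate values below are placeholders, not recommendations.

```python
import time
from torch.utils.data import DataLoader

def benchmark_num_workers(dataset, batch_size=64, candidates=(0, 2, 4, 8), n_batches=100):
    """Time n_batches of a DataLoader for each num_workers candidate."""
    timings = {}
    for workers in candidates:
        loader = DataLoader(dataset, batch_size=batch_size, num_workers=workers)
        start = time.perf_counter()
        for i, _ in enumerate(loader):
            if i >= n_batches:
                break
        timings[workers] = time.perf_counter() - start
    # Return the fastest setting along with all measurements.
    return min(timings, key=timings.get), timings

# best_workers, timings = benchmark_num_workers(my_dataset)  # my_dataset is a placeholder
```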


batch_size: use the largest batch size that fits in GPU memory. Scale learning rate and warmup according to the method of “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.”

num_workers: how do you tune a Winnebago to get a good lap time at Le Mans? Step 1: replace it with a Ferrari. Step 2: tune the Ferrari. :wink:
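To make the batch_size suggestion concrete, here is a minimal sketch of the linear scaling rule with warmup from that paper, expressed as a LambdaLR schedule. The base LR (0.1 at batch size 256), warmup length, and placeholder model are illustrative assumptions, not a recipe tuned for any particular setup.

```python
import torch

base_lr = 0.1          # reference LR tuned for the reference batch size
base_batch_size = 256
batch_size = 1024      # the (large) batch size actually used for training
warmup_epochs = 5
total_epochs = 90

# Linear scaling rule: scale the LR proportionally to the batch size.
scaled_lr = base_lr * batch_size / base_batch_size

model = torch.nn.Linear(10, 10)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)

def lr_factor(epoch):
    # Ramp linearly from base_lr up to scaled_lr over the warmup epochs,
    # then hold scaled_lr (a real recipe would add step/cosine decay after).
    if epoch < warmup_epochs:
        alpha = epoch / warmup_epochs
        return (base_lr / scaled_lr) * (1 - alpha) + alpha
    return 1.0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(total_epochs):
    # train_one_epoch(model, optimizer)  # your training loop here
    scheduler.step()
```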

Hey Andrew! Your suggestion on batch_size makes a lot of sense.

For num_workers - are you suggesting a TF/other DL library solution or a different consumer of data within PyTorch?

@Sean_O_Bannon I think it makes sense to have an optimized implementation of ImageFolder and transforms for PyTorch. The current API is nice, but the implementation is inefficient, and it slows down the entire system when it needs to feed 8x V100s and a small network (e.g. mobilenet_v2).

Hi, I am also doing research in this area. Do you mean that I can follow this approach by using DistributedParralleDataset?

I am very puzzled about this.

Thanks!

For such systems, you could try out DALI.
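For context, here is a rough sketch of what an image pipeline can look like with DALI's fn API; the data directory, image size, and batch size are placeholders, and exact operator names/arguments may differ across DALI versions, so check the DALI docs for your release.

```python
# Sketch of a GPU-accelerated image loading pipeline with NVIDIA DALI,
# consumed from PyTorch. Paths and parameters are placeholders.
from nvidia.dali import pipeline_def, fn
from nvidia.dali.plugin.pytorch import DALIGenericIterator

@pipeline_def
def image_pipeline(data_dir):
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True, name="Reader")
    images = fn.decoders.image(jpegs, device="mixed")        # decode JPEGs on the GPU
    images = fn.resize(images, resize_x=224, resize_y=224)   # resize on the GPU
    return images, labels

pipe = image_pipeline("/path/to/images", batch_size=256, num_threads=4, device_id=0)
pipe.build()
loader = DALIGenericIterator([pipe], ["data", "label"], reader_name="Reader")

for batch in loader:
    images, labels = batch[0]["data"], batch[0]["label"]
    # feed images/labels to the model here
```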