Benchmarking optimizer timing while training DNNs

Hello,
I have a custom PyTorch optimizer, and I would like to compare its average execution time against SGD/Adam for each call to optimizer.step(), and also for a specific operation (say, a matrix-matrix multiplication) inside my custom optimizer.

What is the best and correct way to do this so that other background processes do not skew the measurements?
I want to do this while the network trains, to see how the optimizers scale as the number of trainable weights (out of all initialized weights) increases over the epochs.
This needs to be done for both CPU and GPU training.

Any suggestions for doing a fair comparison would help. Thanks

I think the benchmarking module (torch.utils.benchmark) could be useful for this.

@eqy Can you be more specific on how to use it? My understanding is that each call to benchmark.Timer(stmt="optimizer.step()") will do multiple runs of optimizer.step() for each batch, and I don't want that behaviour because I am also training the network (see the sketch below of what I mean).
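For example, my understanding of the usage is roughly the sketch below (a toy model/optimizer just for illustration, not my actual setup), where the timed statement is repeated many times:

```python
import torch
from torch.utils.benchmark import Timer

# Toy setup just for illustration.
model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Populate gradients so that optimizer.step() has something to do.
model(torch.randn(32, 1024)).sum().backward()

timer = Timer(stmt="optimizer.step()", globals={"optimizer": optimizer})
# Both of these repeat optimizer.step() many times to get a stable estimate,
# which means many extra parameter updates on top of normal training.
print(timer.timeit(100))
print(timer.blocked_autorange())
```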

Yes, typical benchmarking would use multiple runs to avoid measuring one-off startup costs. If you need to run this live, I would check whether simply instrumenting your training loop with timing measurements (e.g., even just a crude t1 = time.time() ... t2 = time.time()) plus appropriate torch.cuda.synchronize() calls provides enough precision.
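As a rough, untested sketch (model, optimizer, loader, criterion, and device are placeholders for your actual training setup):

```python
import time
import torch

# model, optimizer, loader, criterion, device are assumed to come from your training script.

def timed_step(optimizer, device):
    # Wait for backward() to actually finish before starting the clock.
    if device.type == "cuda":
        torch.cuda.synchronize()
    t1 = time.perf_counter()
    optimizer.step()
    # Wait for the kernels launched by step() to finish before stopping it.
    if device.type == "cuda":
        torch.cuda.synchronize()
    t2 = time.perf_counter()
    return t2 - t1

step_times = []
for inputs, targets in loader:
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    step_times.append(timed_step(optimizer, device))

print(f"mean optimizer.step() time: {sum(step_times) / len(step_times):.6f} s")
```

time.perf_counter() is just a slightly higher-resolution alternative to time.time(); either works here.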

Thanks for the pointers. Where should I call torch.cuda.synchronize()? Once at the top when importing packages, or within the training loop for each batch or epoch?

It might add extra overhead end to end, but the goal is to avoid leaving GPU operations in flight when timing. For example, you might synchronize once to ensure that all GPU operations issued before your operation of interest have finished, and again so that your timing stops only after all GPU operations of interest have completed.
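Concretely, for the matrix-matrix multiplication inside your custom step(), the placement might look like the sketch below (A, B, and self.matmul_times are hypothetical names, not anything from your code):

```python
import time
import torch

# Somewhere inside your custom optimizer's step(), around the op of interest.
# A and B stand in for whatever tensors your optimizer actually multiplies.
if A.is_cuda:
    torch.cuda.synchronize()   # everything queued before the matmul has finished
t1 = time.perf_counter()
C = A @ B                      # the operation being timed
if A.is_cuda:
    torch.cuda.synchronize()   # the matmul itself has finished
t2 = time.perf_counter()
self.matmul_times.append(t2 - t1)   # hypothetical bookkeeping on the optimizer
```

On CPU-only runs the synchronize calls are unnecessary, so gating them on .is_cuda keeps the same code usable for both your CPU and GPU comparisons.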