Benchmarking optimizer timing while training DNNs

I have a custom PyTorch optimizer, and I would like to compare its average execution time against SGD/Adam for each call to optimizer.step(), and also for a specific operation (say, a matrix-matrix multiplication) within my custom optimizer.

What is the best and correct way to do this without the impact of other background processes?
I want to do this while the network trains, to see how the optimizers scale as the number of trainable network weights (among all initialized weights) increases over epochs.
This needs to be done for both CPU and GPU training.

Any suggestions for a fair comparison would help. Thanks!

I think the benchmarking module (torch.utils.benchmark) could be useful for this.
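A minimal sketch of what that might look like (the model, sizes, and the `one_step` helper are made up for illustration): torch.utils.benchmark.Timer runs the statement several times, handles warmup, and synchronizes CUDA for you.

```python
import torch
import torch.nn as nn
from torch.utils import benchmark

# Hypothetical tiny model and optimizer, just to have something to time.
model = nn.Linear(64, 64)
x = torch.randn(32, 64)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

def one_step():
    # One full forward/backward/step; you could also time opt.step() alone.
    opt.zero_grad()
    model(x).sum().backward()
    opt.step()

t = benchmark.Timer(stmt="one_step()", globals={"one_step": one_step})
m = t.timeit(20)   # runs the statement 20 times and collects timings
print(m.mean)      # average seconds per run
```

Swapping `opt` for an instance of your custom optimizer (or Adam) gives a like-for-like comparison, since Timer controls warmup and synchronization the same way for each.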

@eqy Can you be more specific on how to use it? I think every call to benchmark.Timer(stmt="optimizer.step()") would do multiple runs over optimizer.step() for each batch, and I don’t want this behaviour because I am training the network at the same time.

Yes, typical benchmarking would use multiple runs to avoid measuring any startup costs. If you need to run this live, I would see if simply instrumenting your training loop with timing measurements (e.g., even just a crude t1 = time.time() ... t2 = time.time(), with appropriate torch.cuda.synchronize() calls) provides enough precision.
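A rough sketch of that instrumentation (the model, sizes, and loop structure are placeholders): time only the optimizer.step() call, synchronizing on the GPU so no kernels are still in flight, and average over the batches of an epoch.

```python
import time
import torch
import torch.nn as nn

def timed_step(optimizer):
    """Return wall-clock seconds spent in a single optimizer.step()."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # make sure backward has finished
    t0 = time.perf_counter()
    optimizer.step()
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # make sure step's kernels have finished
    return time.perf_counter() - t0

# Placeholder training loop accumulating a per-epoch average.
model = nn.Linear(32, 32)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
step_times = []
for _ in range(10):                # stands in for the batches of one epoch
    opt.zero_grad()
    model(torch.randn(8, 32)).sum().backward()
    step_times.append(timed_step(opt))
avg_step_time = sum(step_times) / len(step_times)
```

Logging `avg_step_time` per epoch lets you plot how each optimizer scales as the number of trainable weights grows, without re-running the step multiple times per batch.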

Thanks for the pointers. Where should I use torch.cuda.synchronize()? One time at the top while importing packages, or within the training loop for each batch or epoch?

It might add extra overhead end to end, but the goal is to avoid leaving GPU operations in flight when timing. For example, you might synchronize once to ensure that all GPU operations issued before your operations of interest have finished, and again to ensure that your timing stops only after all GPU operations of interest have completed.
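Concretely, that bracketing pattern might look like this for the matrix-matrix multiplication mentioned earlier (a sketch; the sizes and helper name are arbitrary, and the synchronize calls are skipped for CPU tensors, where operations run eagerly):

```python
import time
import torch

def time_matmul(a, b):
    # Synchronize before: drain earlier queued GPU work so it is not
    # attributed to the matmul we want to measure.
    if a.is_cuda:
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    c = a @ b
    # Synchronize after: GPU kernels launch asynchronously, so stop the
    # clock only once the matmul has actually completed.
    if a.is_cuda:
        torch.cuda.synchronize()
    return c, time.perf_counter() - t0

a = torch.randn(256, 256)
b = torch.randn(256, 256)
c, seconds = time_matmul(a, b)
```

The same bracketing works inside your custom optimizer's step(): synchronize, start the clock, run the operation of interest, synchronize again, stop the clock.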