Reliably measure module latency + Repeatability

For NAS (Neural Architecture Search) I need to measure the latency of the operations that are present in my search space.

Therefore, I tried several approaches to measure the latency of an nn.Module:

  • the PyTorch autograd profiler
  • plain time.time() measurements
  • CUDA events

Of course I used torch.cuda.synchronize() to account for the asynchronous execution.
Also, I averaged all my measurements over more than 10,000 data points.
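For reference, the CUDA-event variant of my measurement looks roughly like this (a minimal sketch; the module, shapes, and iteration counts are just examples):

```python
import torch
import torch.nn as nn

def measure_latency_ms(module, inp, n_warmup=100, n_iters=1000):
    """Average forward latency in milliseconds, measured with CUDA events."""
    module = module.cuda().eval()
    inp = inp.cuda()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    with torch.no_grad():
        for _ in range(n_warmup):          # warmup, excluded from the timing
            module(inp)
        torch.cuda.synchronize()

        start.record()
        for _ in range(n_iters):
            module(inp)
        end.record()
        torch.cuda.synchronize()           # wait until all kernels have finished

    return start.elapsed_time(end) / n_iters   # elapsed_time returns milliseconds

# the two cases from point 1 below
conv = nn.Conv2d(48, 48, kernel_size=3)
print(measure_latency_ms(conv, torch.randn(1, 48, 48, 48)))
print(measure_latency_ms(conv, torch.randn(1, 48, 24, 24)))
```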

My problem now is that I can’t get:

  1. Reliable measurements: nn.Conv2d(48, 48, kernel_size=3) called with torch.randn(1, 48, 48, 48) is sometimes slower than the same module called with torch.randn(1, 48, 24, 24). The same goes for different channel counts: more channels should result in higher latency, but sometimes the opposite is the case.
  2. Repeatability: almost always, a second execution of the test script gives different results. Remember that I used torch.cuda.synchronize() everywhere it’s needed.

A reason for 1. could be the massive parallelism of modern GPUs: the GPU just doesn’t care whether the convolution uses 128 or 256 channels; it computes all results in parallel anyway, resulting in more or less the same latency.

For 2. I don’t have any clue. Why does my GPU behave differently at different times? This makes all NAS papers, or other papers that claim SOTA latency, look basically unreliable to me.

tl;dr:
Is there any method to measure module latency in a reliable and reproducible way?

  1. The execution time highly depends on the algorithm that is used. E.g. you could see worse performance with cudnn for atypical input and kernel shapes. You could set torch.backends.cudnn.benchmark = True to call into the cudnn heuristics and execute a benchmark run in your first iteration to select the fastest kernels (see the sketch after this list).

  2. Getting stable results is not that easy and I would generally recommend to:

  • run a few warmup iterations right before the synchronization and the actual profiling
  • time a number of iterations and calculate the average instead of relying on isolated runs
  • lock the GPU clocks, if you want stable results (a sketch follows below)
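A rough sketch combining these points (the clock locking needs root permissions and a driver/GPU that supports it at all; the clock value is just a placeholder, so pick one your device actually reports as supported):

```python
import subprocess
import torch

# let cudnn benchmark the available algorithms and select the fastest one
torch.backends.cudnn.benchmark = True

# lock the GPU clocks so boost behaviour doesn't change between runs
# (-lgc / --lock-gpu-clocks needs root and a fairly recent driver)
subprocess.run(["nvidia-smi", "-lgc", "1350,1350"], check=True)
try:
    # warmup iterations (these also trigger the cudnn benchmark run),
    # followed by the synchronized, averaged timing loop
    ...
finally:
    # restore the default clock behaviour (-rgc / --reset-gpu-clocks)
    subprocess.run(["nvidia-smi", "-rgc"], check=True)
```

Note that cudnn re-runs its benchmark for every new input shape, so each (module, input shape) combination should get its own warmup iterations.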

Also, in case you are interested in some background knowledge, have a look at these GTC 2019 slides on best practices for benchmarking CUDA applications.

Thank you for your answer!

  1. I am aware of cudnn’s benchmark feature and tried with it both enabled and disabled. Unfortunately, enabling the cudnn optimization even led to more unstable results.

    • I already do a warmup (i.e. 100 iterations)
    • I don’t measure isolated runs. In fact, I calculate the average over a full 6-second period (which can be 10k iterations or more)
    • This sounds promising, I will try that.

Thank you for the link!

I will come back to this thread once I have tried the GPU clock lock.

Unfortunately, nvidia-smi -q -d SUPPORTED_CLOCKS just outputs N/A.

Do you know any other way to set the clock?

Hi @ptrblck, do you have any idea how I can proceed?

I am really starting to doubt the quality of some publications in that regard if we can’t reliably measure the latency of our models.

If the option is not supported, you shouldn’t worry about it too much, as the other steps (warmup, averaging over runs) should already give you some stability.
How large is your current run-to-run variance, and could you post a (minimal) code snippet so that we could profile it?

If the variance is still large, could you check whether your system is overheating and thus might be reducing the clocks?

I found out why it was so unstable: my input tensors were just too small (e.g. 1xCx12x12).

It seems that measuring very small tensors is just way too noisy, as the latency is somewhere in the microsecond range.

I ended up multiplying the spatial size by a constant factor of 8 for “measuring” the latency of my modules. This of course does not reflect the real latency, but at least I can now compare different modules for my NAS algorithm.
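In code, the workaround is just something like this (a sketch; the factor 8 is the constant I settled on, and the measurement itself uses a synchronized timing helper like the one sketched in my first post):

```python
import torch

SCALE = 8  # constant factor applied to H and W before measuring

def scaled_input(c, h, w, batch=1):
    # upscale the spatial size so the kernel runtime dominates the launch
    # overhead; the resulting latency is only a proxy, but it lets me rank
    # candidate modules against each other
    return torch.randn(batch, c, h * SCALE, w * SCALE)

# e.g. measure_latency_ms(module, scaled_input(48, 12, 12))
```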

Hey maaft!
I saw your post here while searching for benchmarking tools for the SqueezeNAS paper by Albert Shaw. I am trying to reproduce the paper’s results, and I suppose you have also worked on this. As I am a bit new to this field, I wanted to ask if you could share how you measured the latency values?

Hi themozel!

Unfortunately I didn’t find a solution for measuring cells with small tensors. A “hacky” solution was to just multiply H and W by a constant factor of e.g. 8. Although this leads to “correct” (i.e. expected) latency differences between operations of different complexity, it can also produce misleading results for large tensor sizes, as the number of tensor cores that can work on your data in parallel is limited.

If you find a reliable “real” solution, I would be interested to hear about it. Until then, I will take all latency claims from NAS authors with a huge grain of salt.

Edit:
Oh, and to answer your question: you need to do multiple forward passes and measure the time, averaging all results in the end. And don’t forget to call torch.cuda.synchronize().
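In its simplest form (a minimal sketch; the iteration counts are arbitrary):

```python
import time
import torch

def avg_forward_time_ms(module, inp, n_warmup=100, n_iters=1000):
    module = module.cuda().eval()
    inp = inp.cuda()
    with torch.no_grad():
        for _ in range(n_warmup):      # warmup passes, not timed
            module(inp)
        torch.cuda.synchronize()       # make sure the warmup kernels finished
        t0 = time.time()
        for _ in range(n_iters):
            module(inp)
        torch.cuda.synchronize()       # wait for the last kernel before stopping the clock
        elapsed = time.time() - t0
    return elapsed / n_iters * 1000.0
```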

Edit2:

You might also look into latency estimation with a second network: you sample e.g. 100,000 architectures and measure the latency of each candidate. The prediction network is then trained to predict the latency from the sampled architecture parameters. When training your NAS supernet, you use this pretrained second network to predict the latency of your model.

Compared with measuring only single cells, this reduces measurement noise significantly. It also gives you more accurate final latency results for your architecture (instead of adding up single-cell latencies).
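A very rough sketch of that idea (the encoding length, network size, and training data are placeholders; in practice each sampled architecture is encoded as a fixed-length vector, e.g. one entry per searchable choice):

```python
import torch
import torch.nn as nn

ENC_DIM = 32  # placeholder: length of the architecture-encoding vector

# small MLP that maps an architecture encoding to a predicted latency (ms)
predictor = nn.Sequential(
    nn.Linear(ENC_DIM, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),
)

optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# encodings: (N, ENC_DIM) tensor of sampled architectures
# latencies: (N, 1) tensor of measured latencies for those architectures
def train_step(encodings, latencies):
    optimizer.zero_grad()
    loss = criterion(predictor(encodings), latencies)
    loss.backward()
    optimizer.step()
    return loss.item()

# during supernet training, the (frozen) predictor then provides a cheap
# latency estimate per candidate, e.g. as an additional loss term
```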

Thank you for your fast reply and thorough explanation.
I am now using CUDA events for benchmarking on an NVIDIA Xavier in 30W mode, and I am getting lower latency values than the ones in the paper, and also a different value for each run! I measure the model.evaluate() part of the code, so I am not sure if I am on the right track.