Time profile the model prediction on gpu in pytorch?

Is this the right way to time profile the model prediction on a test sample?

import torch
import torch.autograd.profiler as profiler

x = x.to("cuda:0")       # test sample; to() is not in-place, so assign the result
model.to("cuda:0")       # the model parameters must be on the same device as x
with profiler.profile(use_cuda=True) as prof:
    with torch.no_grad():
        pred = model(x)

I would recommend adding some warmup iterations and also computing the actual runtime as a mean over many iterations. This is done automatically for you by torch.utils.benchmark, available on the current master branch or in the nightly binaries.
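A minimal sketch of that approach, using `torch.utils.benchmark.Timer` (the `nn.Linear` model, input shapes, and the CPU fallback are placeholder assumptions; substitute your own model and sample):

```python
import torch
import torch.nn as nn
import torch.utils.benchmark as benchmark

# Placeholder model and input; replace with your own.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 10).to(device).eval()
x = torch.randn(32, 128, device=device)

# Timer performs warmup and CUDA synchronization internally and
# reports statistics over repeated runs of the statement.
timer = benchmark.Timer(
    stmt="with torch.no_grad(): model(x)",
    globals={"torch": torch, "model": model, "x": x},
)
m = timer.timeit(100)  # run the forward pass 100 times
print(m)               # mean/median wall time per run
```

`timeit(n)` returns a `Measurement` whose `mean` and `median` fields give the per-iteration wall time in seconds; `blocked_autorange()` is an alternative that picks the number of runs for you.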

Thanks for your reply. I suppose you meant warmup iterations? In any case, I got the general idea: take the mean over a number of iterations.
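For reference, doing the warmup and averaging by hand might look like this (a sketch; the `nn.Linear` model, iteration counts, and CPU fallback are illustrative assumptions):

```python
import time
import torch
import torch.nn as nn

# Placeholder model and input; replace with your own.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 10).to(device).eval()
x = torch.randn(32, 128, device=device)

with torch.no_grad():
    # Warmup: the first iterations pay one-time costs
    # (CUDA context setup, caching-allocator warmup).
    for _ in range(10):
        model(x)
    if device.startswith("cuda"):
        torch.cuda.synchronize()  # drain pending kernels before timing

    n_iters = 100
    start = time.perf_counter()
    for _ in range(n_iters):
        model(x)
    if device.startswith("cuda"):
        torch.cuda.synchronize()  # CUDA kernels launch asynchronously
    mean_s = (time.perf_counter() - start) / n_iters

print(f"mean forward time: {mean_s * 1e3:.3f} ms")
```

The `torch.cuda.synchronize()` calls matter: without them the timer only measures kernel launch overhead, not the actual GPU execution.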

Also, it prints this at the end:

This might be a silly question, but please bear with me as I am curious: the CUDA time corresponds to the time taken to compute the prediction

pred = model(x)

What exactly does the CPU time correspond to here?