Profiling PyTorch: pred.cpu() is responsible for 95% of runtime?

I am currently profiling my Torch implementation of a graphical model using Python's cProfile module. Inference on my model takes around 12 seconds on GPU (Torch), 15 seconds on CPU (Torch) and 14 seconds on CPU (numpy). The very low speed-up from CPU to GPU surprised me, which is why I am profiling my model.

The profiling output (see below) indicates that 95% of the runtime is spent in a single .cpu() call. This call is made at the very end of the prediction pipeline in order to copy the results into numpy.

[Profiling output graphs: Torch implementation (GPU: 12 seconds) and Numpy implementation (CPU: 15 seconds)]


The main computation of the model is done in the kernel.compute() function, which is called 10 times (see pseudo-code below). In the numpy implementation this takes up about 93% of the runtime, which seems about right. kernel.compute() mainly consists of convolutions, dot products, matrix multiplications, reshapes and transposes. Surprisingly, the kernel.compute() call does not even show up in the torch profile graph. This means that less than 1% of the computational time is spent inside this function.
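One possible explanation for kernel.compute() not appearing in the graph is that cProfile only measures Python-side time. A minimal sketch of profiling the same kind of loop with PyTorch's built-in autograd profiler instead, which records per-operator time including asynchronously launched CUDA kernels (the matmul/softmax below are stand-ins for whatever kernel.compute() actually does, and the tensor sizes are made up):

```python
import torch

# Fall back to CPU so the sketch runs on any machine.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(256, 256, device=device)

# use_cuda=True makes the profiler record CUDA kernel times as well,
# so GPU work is attributed to the op that launched it.
with torch.autograd.profiler.profile(use_cuda=torch.cuda.is_available()) as prof:
    for _ in range(10):
        y = x @ x                     # stand-in for kernel.compute()
        y = torch.softmax(y, dim=1)   # stand-in for exp_and_normalize()

# Per-operator table; matmul/softmax show up here with their real cost,
# unlike in a cProfile trace.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```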


def inference(self, num_iter=10):
    # Essentially softmax
    prediction = exp_and_normalize(-self.unary)
    for i in range(num_iter):
        tmp1 = -self.unary

        # Main computational effort
        tmp2 = self.kernel.compute(prediction) 

        tmp1 = tmp1 - tmp2
        prediction = exp_and_normalize(tmp1)

    return prediction.cpu().data.numpy()

So overall I am very surprised by this finding and wonder how to interpret it. At the moment I see three possibilities:

  1. The single .cpu() call indeed takes around 11 seconds. Maybe this is due to a bug? In that case I would like to know how I can work around it. Maybe by doing the data copying in parallel? Also, how can I test whether this hypothesis is true?

  2. PyTorch does all the GPU processing asynchronously. The rest of the code only queues up computations for the GPU. The .cpu() call then essentially is a join operation, making the Python interpreter wait until all the GPU computations are finished. Can somebody confirm this? If this is true, is there a way for me to profile my implementation meaningfully?

  3. There are a few hidden .cpu() calls I (and the profiler) don't know about. I am creating a few torch variables inside self.kernel.compute(prediction). Maybe this is where the computation time goes?
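A minimal sketch of how hypothesis 2 could be tested: time the compute loop once without synchronization and then explicitly wait for the GPU. If launches are asynchronous, the first timing will be tiny and almost all of the cost will appear at the synchronization point, which is exactly what .cpu() does implicitly. (The relu/matmul loop is a stand-in for the real kernel.compute() calls.)

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)

t0 = time.perf_counter()
for _ in range(10):
    x = torch.relu(x @ x.t())     # on GPU these launches return immediately
t1 = time.perf_counter()

if device == "cuda":
    torch.cuda.synchronize()      # block until all queued kernels finish
t2 = time.perf_counter()

# On GPU, (t1 - t0) is just the launch overhead and (t2 - t1) is the real
# compute time; on CPU the work is synchronous and (t2 - t1) is near zero.
print(f"launch/loop time: {t1 - t0:.4f}s, time hidden behind sync: {t2 - t1:.4f}s")
```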

I think the problem here is that all CUDA calls made with PyTorch are asynchronous, so the only point where it actually waits for the computations to be done is when you use their results: when you want to send them back to the CPU.
You can use torch.cuda.synchronize() after some operations to force synchronization.
You can also try setting the CUDA_LAUNCH_BLOCKING=1 environment variable to force the CUDA API to become synchronous. Note that this will most certainly slow down your code.
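A small sketch of how the environment variable can be applied: it must be set before CUDA is initialized, so the safest way is from the shell (`CUDA_LAUNCH_BLOCKING=1 python profile_model.py`, where `profile_model.py` is a placeholder name). Setting it from Python also works, as long as it happens before any CUDA work:

```python
import os

# Must be set before the first CUDA call creates a context;
# afterwards the variable has no effect on the running process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # importing torch does not by itself initialize CUDA

# ... now run the model under cProfile as before; with blocking launches,
# time is attributed to the op doing the work rather than to .cpu().
```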


Thanks a lot. Yes, the asynchronous computation explains this output. CUDA_LAUNCH_BLOCKING=1 is a great hint; it is very useful for meaningful profiling.

I posted a follow-up question using the insights gained from additional profiling runs in a new thread.

Regarding forum etiquette: is it generally preferred to post follow-up questions in the same thread or to create a new thread for new questions?