Getting "can't export a trace that didn't finish running" error with profiler

Hi, I have a transformer encoder and I’m testing it with the following code:

import torch
import torch.autograd.profiler as profiler

encoder = torch.jit.load('eval/')

tmp = torch.ones([1, 7, 80])
lengths = torch.Tensor([7])  # avoid shadowing the builtin len

# Two warmup calls before profiling
encoder.forward(tmp, lengths)
encoder.forward(tmp, lengths)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    # Input is of the form [batch_size, len, feature_depth]
    out, _, _ = encoder.forward(torch.ones([1, 403, 80]), torch.Tensor([403]))
    print("DONE WITH FORWARD")

And I’m getting the following error:

RuntimeError: can't export a trace that didn't finish running

Even though I see that DONE message (which I presume means the encoder forward is done). Does anyone know why this might be happening?

The encoder code itself is pretty similar to fairseq/ at master · pytorch/fairseq · GitHub, but unfortunately I can’t give you the jit file or the exact code.

Solved: The print(prof) line should be outside the with block. The profiler only finalizes its results when the block exits.
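For anyone landing here later, here is a minimal sketch of the working pattern. It assumes a small torch.nn.Linear as a hypothetical stand-in for the encoder, since the original TorchScript file was not shared; the point is only that the results are read after the with block has exited.

```python
import torch
import torch.autograd.profiler as profiler

# Hypothetical stand-in for the encoder (the original jit file is not available).
model = torch.nn.Linear(80, 16)
x = torch.ones([1, 7, 80])

with profiler.profile() as prof:
    out = model(x)
    # Do not read the results here: the trace is still being recorded.

# The with block has exited, so the results are finalized and safe to read.
summary = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5)
print(summary)
```

The same applies to any other call that reads the finished trace, such as prof.export_chrome_trace.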


Hello, I have a question about the warmup part of the code that was posted.

Why are there two warmup calls to forward? I have found that I sometimes need this as well. I was wondering if this is just a heuristic, or if somebody could point me towards documentation or concrete reasons why this is a good idea.


I have no idea either. Before torch 1.17, I noticed warmup took significantly longer but needed only one step; after 1.17, warmup is faster but needs two steps. I know they wanted to make warmup faster in 1.17, but I’m not sure exactly why two steps are needed.

Thanks for the reply!

I am also curious as to why this is.

Have you seen or heard of it in others’ code?

I don’t know where this code is coming from and thus cannot guarantee what the author intended to do, but warmup iterations are needed for:

  • if I’m not mistaken, the JIT uses (a few) passes to optimize the graph and would thus need these warmup iterations for proper profiling
  • if you are using the GPU, the first CUDA operation will create the context, allocate memory, and put it into the cache. The following CUDA operations would reuse the cache and avoid slow and blocking memory allocations (if possible)
  • if you are using cudnn.benchmark (torch.backends.cudnn.benchmark = True), the first iteration for each new input shape will run internal profiling via cudnnFind in order to select the fastest kernel for this workload and will thus be slow
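All three points come down to keeping the first few calls out of the measurement. Below is a rough sketch of that idea, not code from the original post: time_with_warmup and workload are hypothetical names, the default of two warmup calls simply mirrors the snippet in this thread, and the sleep stands in for one-time setup such as context creation or kernel selection.

```python
import time

def time_with_warmup(fn, warmup=2, iters=10):
    """Call fn `warmup` times untimed, then return mean seconds over `iters` timed calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

# Toy workload whose first call pays a simulated one-time setup cost.
calls = []
def workload():
    calls.append(1)
    if len(calls) == 1:
        time.sleep(0.05)

mean_s = time_with_warmup(workload, warmup=2, iters=10)
print(f"mean time per call after warmup: {mean_s * 1e6:.1f} us")
```

On the GPU, kernels are launched asynchronously, so you would also call torch.cuda.synchronize() before reading the clock at the start and end of the timed loop.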

Hi @ptrblck

Thanks for the response!

I have a couple questions about the points mentioned,

If I am importing a trace (.pt) and using it for inference, are the JIT passes you mentioned still needed? I thought that TorchScript took care of this. (And if they are indeed still needed, could you point me in the direction of places to read about this?)

I assumed that CUDA needs to allocate memory in the first pass. What I am confused about is the two-shot warmup that I see is needed here (and in other places).


I don’t know enough about the currently used JIT vs. the legacy JIT, but I saw someone mention that two(?) iterations are needed for graph optimizations etc. in a bug report a while ago.
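One way to check how many warmup iterations a given setup actually needs is to time each of the first few calls separately and see where the latency levels off. A small sketch of that, using a toy function whose first two calls are artificially slow (first_call_latencies and fake_forward are hypothetical names, and the slowdown is simulated, not real JIT behavior):

```python
import time

def first_call_latencies(fn, n=5):
    """Return the wall-clock seconds taken by each of the first n calls to fn."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies

# Toy stand-in for the model: the first two calls pay a simulated optimization cost.
state = {"calls": 0}
def fake_forward():
    state["calls"] += 1
    if state["calls"] <= 2:
        time.sleep(0.02)

latencies = first_call_latencies(fake_forward, n=5)
for i, seconds in enumerate(latencies, start=1):
    print(f"call {i}: {seconds * 1e3:.2f} ms")
```

Replacing fake_forward with a lambda that calls the loaded module on a fixed input should show whether the second call is still slow on a given PyTorch version.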

Sweet! Thanks @ptrblck.

Would you be able to point me in the direction of this bug report? I took a look under “oncall: jit” and “jit-backlog” but if you remember anything about it that would be great!


I searched for it yesterday as well, but couldn’t find it. Let me check it again and ask around.

Thank you! I appreciate it.