TorchScript hangs with warning on DenseNet on PyTorch 1.12

I compile a DenseNet169 model with TorchScript. When I use the model for inference, it hangs for a while (about half a minute), then returns the output and prints the following warning.

UserWarning: operator() profile_node %201 : bool = prim::profile_ivalue(%training.24)
 does not have profile information (Triggered internally at  ../torch/csrc/jit/codegen/cuda/graph_fuser.cpp:104.)
  embedding = model.forward(image)

I am using the latest PyTorch 1.12 containers on AWS EC2 with Python 3.10. The behavior is the same on CPU and GPU.
I tried the same with ResNet50 and it works fine.

Here is how I do it.

import torch
import torchvision.models as models

model = models.densenet169(weights=None)
model = model.cpu()
model.eval()
scripted_model = torch.jit.script(model)
smodel_file = 'densenet169.pt'
torch.jit.save(scripted_model, smodel_file)

# Use the model (in a different script)
import torch

device = torch.device('cuda')
#device = torch.device('cpu')
smodel_file = 'densenet169.pt'
model = torch.jit.load(smodel_file, map_location=device)
# load image, transform it and forward
embeddings = model.forward(image)

Any ideas what the problem is? It looks like it is related to the DenseNet model, since it does not happen with ResNet. Any solutions?

Thanks.

I cannot reproduce it using 1.12.0+cu116 and don’t see a warning or a hang.
Note that I’ve initialized image = torch.randn(1, 3, 224, 224).cuda(), so I'm unsure which shapes you are using.

Thanks Peter.

It happens only with DenseNet169; I have tried ResNet and ConvNeXt so far and they work fine. As for the input, I load a real image and transform it to 224x224, similar to your input.
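
To be concrete, the preprocessing I mean is something like the standard torchvision pipeline below; the image path and the exact resize/normalization values are illustrative, not necessarily what my actual script uses:

from PIL import Image
from torchvision import transforms

# Standard ImageNet-style preprocessing to a (1, 3, 224, 224) tensor.
# 'my_image.jpg' is a placeholder path.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open('my_image.jpg').convert('RGB')
image = preprocess(img).unsqueeze(0)  # shape: (1, 3, 224, 224)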

I installed PyTorch in a conda environment as follows:

conda create -n pytorch12 python=3.10.5 ipython
conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge

This is on an AWS EC2 p3.16xlarge instance with V100 GPUs and CUDA 11.6.

I have not yet tested it on any other platform, but a colleague reported the same issue with PyTorch 1.12 + TorchScript + DenseNet169.

That's a bit strange, as it doesn't seem to show any issues in 1.12.0 for me. Let me rerun it with 1.12.1, which I assume is the version you've installed, on a V100.

Checked my version now: 1.12.0+cu116
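
For reference, I checked it with something like this:

import torch

print(torch.__version__)   # 1.12.0+cu116
print(torch.version.cuda)  # 11.6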

I tried another EC2 p3 instance, launched from the PyTorch 1.12 containers, and got the same output for DenseNet169 (only the path to graph_fuser.cpp differs; the warning and the hang are the same).

It also works in 1.12.1+cu116 for me:

root@28366e9e60bd:/workspace/src# cat tmp.py 
import torch
import torchvision.models as models

model = models.densenet169(weights=None)
model = model.cpu()
model.eval()
scripted_model = torch.jit.script(model)
smodel_file = 'densenet169.pt'
torch.jit.save(scripted_model, smodel_file)

root@28366e9e60bd:/workspace/src# cat lala.py 
import torch

# Use the model (in a different script)
device = torch.device('cuda')
smodel_file = 'densenet169.pt'
model = torch.jit.load(smodel_file, map_location=device)
# load image, transform it and forward
image = torch.randn(1, 3, 224, 224).cuda()
embeddings = model.forward(image)
print(embeddings.shape)

root@28366e9e60bd:/workspace/src# python tmp.py 
root@28366e9e60bd:/workspace/src# python lala.py 
torch.Size([1, 1000])

Thanks a lot. Weirdly, I can reproduce it every time I run it…

Just to make sure we are comparing the same builds: was any previous PyTorch version installed in these AWS containers and if so, do you know how it was installed (pip wheel, conda binary, source build)?
If something already ships in the container, could you uninstall every torch and torchvision installation you can find and install the latest stable release?
I don’t fully understand why 1.12.0 is being installed using your command even though 1.12.1 is the latest one.

I installed 1.12.0 some time ago; that is why it is not the latest. This instance also has another PyTorch container (inactive), which I do not use.

I've also tested it on an instance started from one of the PyTorch 1.12 images on AWS (it is also 1.12.0 and does not have any other installation or container) and got the same output. I will also test earlier versions, e.g., 1.11, and the latest release, 1.12.1.

Btw, I did a Google search and found that a similar issue has been reported before.

I have installed PyTorch 1.12.1 and reproduced the issue there as well. I think I found out how to trigger it, but I have no idea why it happens.

Here is how to reproduce it:

# model is the scripted DenseNet169 loaded with torch.jit.load, device is torch.device('cuda')
torch.set_grad_enabled(False)  # culprit!
for i in range(3):
    image = torch.rand(1, 3, 224, 224).to(device)
    model.forward(image)

If I remove torch.set_grad_enabled(False), or if I run it on only one image, the issue does not occur. So I need to run it on two or more images sequentially with torch.set_grad_enabled(False) to reproduce it. Using a with torch.no_grad(): block instead results in the same behavior.
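
Putting the pieces together, here is a self-contained sketch of the repro, assuming the scripted model was saved as densenet169.pt as in my first post and using random tensors in place of the real, transformed images:

import torch

device = torch.device('cuda')
model = torch.jit.load('densenet169.pt', map_location=device)  # scripted DenseNet169 from above

torch.set_grad_enabled(False)  # culprit!
for i in range(3):
    # random tensors stand in for the transformed 224x224 images
    image = torch.rand(1, 3, 224, 224).to(device)
    model.forward(image)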

Sorry, I did not provide these details in the original post, as I did not think they could have caused the issue (I still do not understand why they do).