TorchScript hangs with warning on DenseNet on PyTorch 1.12

I compile a DenseNet169 model with TorchScript. When I use the model for inference, it hangs for about half a minute, then returns the output and prints the following warning.

UserWarning: operator() profile_node %201 : bool = prim::profile_ivalue(%training.24)
 does not have profile information (Triggered internally at  ../torch/csrc/jit/codegen/cuda/graph_fuser.cpp:104.)
  embedding = model.forward(image)

I am using the latest PyTorch 1.12 containers on AWS EC2, with Python 3.10. The behavior is the same on CPU and GPU.
I tried the same code with ResNet50 and it works fine.

Here is how I do it.

import torch
import torchvision.models as models

model = models.densenet169(weights=None)
model = model.cpu()
model.eval()
scripted_model = torch.jit.script(model)
smodel_file = 'densenet169.pt'
torch.jit.save(scripted_model, smodel_file)

# Use the model (in a different script)
device = torch.device('cuda')
#device = torch.device('cpu')
model = torch.jit.load(smodel_file, map_location=device)
# load image, transform it and forward
embeddings = model.forward(image)

Any ideas what the problem is? It looks like it is related to the DenseNet model, as it does not happen with ResNet. Any solutions?

Thanks.

I cannot reproduce it using 1.12.0+cu116 and don’t see a warning or a hang.
Note that I’ve initialized image = torch.randn(1, 3, 224, 224).cuda() so unsure which shapes you are using.

Thanks Peter.

I load a real image and transform it to 224x224, similar to your input. It happens only with DenseNet169; I have tried ResNet and ConvNeXt so far, and they work fine.
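The preprocessing looks roughly like this (just a sketch: I am showing standard torchvision transforms with ImageNet normalization, the exact transforms may differ, and 'example.jpg' is only a placeholder for the real image file):

from PIL import Image
import torch
import torchvision.transforms as T

device = torch.device('cuda')

# resize + center-crop to 224x224 with standard ImageNet normalization (assumed)
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# 'example.jpg' stands in for the real input image
image = preprocess(Image.open('example.jpg').convert('RGB')).unsqueeze(0).to(device)
print(image.shape)  # torch.Size([1, 3, 224, 224])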

I installed PyTorch in a conda environment as follows:

conda create -n pytorch12 python=3.10.5 ipython
conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge

This is on AWS EC2 (a p3.16xlarge instance) with V100 GPUs and CUDA 11.6.

I have not yet tested it on any other platform, but I heard from a colleague about the same issue with PyTorch 1.12 + TorchScript + DenseNet169.

That’s a bit strange, as it doesn’t seem to show any issues in 1.12.0. Let me rerun it on a V100 with 1.12.1, which I assume is what you’ve installed.

Checked my version now: 1.12.0+cu116
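For reference, the installed build can be checked with a quick snippet like this (the inline values are just what I would expect on this instance):

import torch

print(torch.__version__)              # 1.12.0+cu116
print(torch.version.cuda)             # CUDA version the binaries were built against
print(torch.cuda.get_device_name(0))  # the V100 on this p3 instance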

I tried it on another EC2 p3 instance, instantiated from the PyTorch 1.12 containers, and got the same output for DenseNet169 (only the path to graph_fuser.cpp is different; the warning and the hang are the same).

It also works in 1.12.1+cu116 for me:

root@28366e9e60bd:/workspace/src# cat tmp.py 
import torch
import torchvision.models as models

model = models.densenet169(weights=None)
model = model.cpu()
model.eval()
scripted_model = torch.jit.script(model)
smodel_file = 'densenet169.pt'
torch.jit.save(scripted_model, smodel_file)

root@28366e9e60bd:/workspace/src# cat lala.py 
import torch

# Use the model (in a different script)
device = torch.device('cuda')
smodel_file = 'densenet169.pt'
model = torch.jit.load(smodel_file, map_location=device)
# load image, transform it and forward
image = torch.randn(1, 3, 224, 224).cuda()
embeddings = model.forward(image)
print(embeddings.shape)

root@28366e9e60bd:/workspace/src# python tmp.py 
root@28366e9e60bd:/workspace/src# python lala.py 
torch.Size([1, 1000])

Thanks a lot. Weirdly, I can reproduce it every time I run it…

Just to make sure we are comparing the same builds: was any previous PyTorch version installed in these AWS containers and if so, do you know how it was installed (pip wheel, conda binary, source build)?
If something already ships in the container, could you uninstall every torch and torchvision installation you can find and install the latest stable release?
I don’t fully understand why 1.12.0 is being installed using your command even though 1.12.1 is the latest one.

I installed 1.12.0 some time ago, before 1.12.1 was released; that is why. This instance has another PyTorch container (inactive), which I do not use.

I’ve also tested it on an instance started from one of the PyTorch 1.12 images on AWS (it is also 1.12.0, and it does not have any other installation or container). I got the same output. I will also test it on earlier versions, e.g., 1.11, and on the latest version, 1.12.1.

By the way, I did a Google search and found a similar issue reported before.

I have installed PyTorch 1.12.1 and reproduced the issue. I think I found how to reproduce it, but I have no idea why it happens.

Here is how to reproduce it:

torch.set_grad_enabled(False)  # culprit!
for i in range(3):
    image = torch.rand(1, 3, 224, 224).to(device)
    model.forward(image)
If I remove torch.set_grad_enabled(False), or if I run it on only one image, I do not have the issue. So I need to run it on two or more images sequentially with torch.set_grad_enabled(False) to reproduce it. Using with torch.no_grad(): results in the same behavior.
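Putting it together, a minimal standalone script that triggers the warning and the hang on my setup looks roughly like this (it assumes the densenet169.pt scripted in the first post):

import torch

# load the TorchScript model saved earlier
device = torch.device('cuda')
model = torch.jit.load('densenet169.pt', map_location=device)

torch.set_grad_enabled(False)  # culprit: removing this makes the issue go away
for i in range(3):
    # two or more sequential forward passes are needed to trigger it
    image = torch.rand(1, 3, 224, 224).to(device)
    model.forward(image)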

Sorry, I did not provide these details in the original post, as I did not think they could be causing the issue (I still do not understand why they do).

Hello there,
I have run into a very similar issue when converting another model to TorchScript. At first I thought it froze on the second batch. IMO the problem is connected with BatchNorm; here is code that should explain it:

import torch
from tqdm import tqdm
import time
from typing import Type
from argparse import ArgumentParser


class Layer(torch.nn.Module):
    def __init__(self, num_input_features: int, num_output_features: int) -> None:
        super().__init__()
        self.num_input_features = num_input_features
        self.num_output_features = num_output_features


class WithBN(Layer):
    def __init__(self, num_input_features: int, num_output_features: int) -> None:
        super().__init__(num_input_features, num_output_features)
        self.bn = torch.nn.BatchNorm2d(num_input_features)
        self.conv = torch.nn.Conv2d(num_input_features, num_output_features, kernel_size=1, stride=1, bias=False)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.bn(x)
        x = self.conv(x)
        return x


class WithoutBN(Layer):
    def __init__(self, num_input_features: int, num_output_features: int) -> None:
        super().__init__(num_input_features, num_output_features)
        self.conv = torch.nn.Conv2d(num_input_features, num_output_features, kernel_size=1, stride=1, bias=False)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)
        return x


class DeepNetwork(torch.nn.Module):
    def __init__(self, num_input_features: int, num_layers: int, layer: Type[Layer], growth: int) -> None:
        super().__init__()
        for i in range(num_layers):
            num_output_features = num_input_features + growth
            l = layer(num_input_features, num_output_features)
            self.add_module("denselayer%d" % (i + 1), l)
            num_input_features = num_output_features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _, layer in self.named_children():
            x = layer(x)
        return x        


def loop(model: torch.nn.Module, use_tqdm: bool = True) -> None:
    model = torch.jit.script(model)
    with torch.no_grad():  # culprit
        for i in tqdm(range(3), colour='green', leave=False, disable=not use_tqdm):
            start = time.time()
            model(glob_input)
            stop = time.time()
            print(f"It {i}th time:{stop - start:.2f}s")


if __name__ == '__main__':
    global device
    global glob_input

    parser = ArgumentParser()
    parser.add_argument('--gpu', action='store_true', help='use gpu')
    parser.add_argument('--growth', type=int, default=16, help='growth rate')
    parser.add_argument('--tqdm', action='store_true', help='use tqdm')
    args = parser.parse_args()

    device = torch.device("cuda:0" if torch.cuda.is_available() and args.gpu else "cpu")
    num_layers = 99
    growth = args.growth
    num_input_features = 64
    glob_input = torch.rand(2, num_input_features, 224, 224).to(device)

    print("Model with BatchNorm")
    model = DeepNetwork(num_input_features, num_layers, WithBN, growth).to(device)
    model.eval()
    loop(model, use_tqdm=args.tqdm)

    print("Model without BatchNorm")
    model = DeepNetwork(num_input_features, num_layers, WithoutBN, growth).to(device)
    model.eval()
    loop(model, use_tqdm=args.tqdm)

I tested my hypothesis with networks of different widths, on CPU and on GPU. Here are the outputs, with the command used at the top of each:

python3 freeze_example.py --growth=2
Model with BatchNorm
It 0th time:4.08s
It 1th time:12.35s
It 2th time:3.77s
Model without BatchNorm
It 0th time:2.96s
It 1th time:2.93s
It 2th time:2.88s

python3 freeze_example.py --growth=4
Model with BatchNorm
It 0th time:7.05s
It 1th time:15.53s
It 2th time:6.59s
Model without BatchNorm
It 0th time:5.95s
It 1th time:5.76s
It 2th time:5.53s

python3 freeze_example.py --growth=8
Model with BatchNorm
It 0th time:13.31s
It 1th time:21.57s
It 2th time:13.02s
Model without BatchNorm
It 0th time:11.77s
It 1th time:10.68s
It 2th time:10.75s


python3 freeze_example.py --growth=2 --gpu
Model with BatchNorm
It 0th time:2.29s
It 1th time:8.38s
It 2th time:0.01s
Model without BatchNorm
It 0th time:0.05s
It 1th time:0.05s
It 2th time:0.01s

python3 freeze_example.py --growth=16 --gpu
Model with BatchNorm
It 0th time:3.03s
It 1th time:8.32s
It 2th time:0.01s
Model without BatchNorm
It 0th time:0.05s
It 1th time:0.05s
It 2th time:0.00s

python3 freeze_example.py --growth=64 --gpu
Model with BatchNorm
It 0th time:13.77s
It 1th time:8.44s
It 2th time:0.01s
Model without BatchNorm
It 0th time:8.74s
It 1th time:0.05s
It 2th time:0.00s

I tested it in Docker with:

FROM nvidia/cuda:11.6.2-cudnn8-devel-ubuntu20.04
RUN pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116