TorchScript hangs with warning on DenseNet on PyTorch 1.12

I compile a DenseNet169 model with TorchScript. When I use the model for inference, it hangs for about half a minute, then returns the output and prints the following warning.

UserWarning: operator() profile_node %201 : bool = prim::profile_ivalue(%training.24)
 does not have profile information (Triggered internally at  ../torch/csrc/jit/codegen/cuda/graph_fuser.cpp:104.)
  embedding = model.forward(image)

I am using the latest PyTorch 1.12 containers on AWS EC2, with Python 3.10. The behavior is the same on CPU and GPU.
I tried the same code with ResNet50 and it works fine.

Here is how I do it.

import torch
import torchvision.models as models

model = models.densenet169(weights=None)
model = model.cpu()
model.eval()
scripted_model = torch.jit.script(model)
smodel_file = 'densenet169.pt'
torch.jit.save(scripted_model, smodel_file)

# Use the model (in a different script)
device = torch.device('cuda')
#device = torch.device('cpu')
model = torch.jit.load(smodel_file, map_location=device)
# load image, transform it and forward
embeddings = model.forward(image)

Any ideas what the problem is? It looks like it is related to the DenseNet model, as it does not happen with ResNet. Any solutions?

Thanks.

I cannot reproduce it using 1.12.0+cu116 and don’t see a warning or a hang.
Note that I’ve initialized image = torch.randn(1, 3, 224, 224).cuda() so unsure which shapes you are using.

Thanks Peter.

I load a real image and transform it to 224x224, similar to your input. It happens only with DenseNet169; I have tried ResNet and ConvNeXt so far, and they work fine.
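The preprocessing looks roughly like this (just a sketch: I am showing standard torchvision transforms with ImageNet normalization, the exact transforms may differ, and 'example.jpg' is only a placeholder for the real image file):

from PIL import Image
import torch
import torchvision.transforms as T

device = torch.device('cuda')

# resize + center-crop to 224x224 with standard ImageNet normalization (assumed)
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# 'example.jpg' stands in for the real input image
image = preprocess(Image.open('example.jpg').convert('RGB')).unsqueeze(0).to(device)
print(image.shape)  # torch.Size([1, 3, 224, 224])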

I installed PyTorch in a conda environment as follows:

conda create -n pytorch12 python=3.10.5 ipython
conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge

This is on AWS EC2 (a p3.16xlarge instance) with V100 GPUs and CUDA 11.6.

I have not yet tested it on any other platform, but I heard from a colleague about the same issue with PyTorch 1.12 + TorchScript + DenseNet169.

That’s a bit strange, as it doesn’t seem to show any issues in 1.12.0. Let me rerun it on a V100 with 1.12.1, which I assume is what you’ve installed.

Checked my version now: 1.12.0+cu116
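For reference, the installed build can be checked with a quick snippet like this (the inline values are just what I would expect on this instance):

import torch

print(torch.__version__)              # 1.12.0+cu116
print(torch.version.cuda)             # CUDA version the binaries were built against
print(torch.cuda.get_device_name(0))  # the V100 on this p3 instance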

I tried it on another EC2 p3 instance, instantiated from the PyTorch 1.12 containers, and got the same output for DenseNet169 (only the path to graph_fuser.cpp is different; the warning and the hang are the same).

It also works in 1.12.1+cu116 for me:

root@28366e9e60bd:/workspace/src# cat tmp.py 
import torch
import torchvision.models as models

model = models.densenet169(weights=None)
model = model.cpu()
model.eval()
scripted_model = torch.jit.script(model)
smodel_file = 'densenet169.pt'
torch.jit.save(scripted_model, smodel_file)

root@28366e9e60bd:/workspace/src# cat lala.py 
import torch

# Use the model (in a different script)
device = torch.device('cuda')
smodel_file = 'densenet169.pt'
model = torch.jit.load(smodel_file, map_location=device)
# load image, transform it and forward
image = torch.randn(1, 3, 224, 224).cuda()
embeddings = model.forward(image)
print(embeddings.shape)

root@28366e9e60bd:/workspace/src# python tmp.py 
root@28366e9e60bd:/workspace/src# python lala.py 
torch.Size([1, 1000])

Thanks a lot. Weirdly, I can reproduce it every time I run it…

Just to make sure we are comparing the same builds: was any previous PyTorch version installed in these AWS containers and if so, do you know how it was installed (pip wheel, conda binary, source build)?
If something already ships in the container, could you uninstall every torch and torchvision installation you can find and install the latest stable release?
I don’t fully understand why 1.12.0 is being installed using your command even though 1.12.1 is the latest one.

I installed 1.12.0 some time ago, before 1.12.1 was released; that is why. This instance has another PyTorch container (inactive), which I do not use.

I’ve also tested it on an instance started from one of the PyTorch 1.12 images on AWS (it is also 1.12.0, and it does not have any other installation or container). I got the same output. I will also test it on earlier versions, e.g., 1.11, and on the latest version, 1.12.1.

By the way, I did a Google search and found a similar issue reported before.

I have installed PyTorch 1.12.1 and reproduced the issue. I think I found how to reproduce it, but I have no idea why it happens.

Here is how to reproduce it:

torch.set_grad_enabled(False)  # culprit!
for i in range(3):
    image = torch.rand(1, 3, 224, 224).to(device)
    model.forward(image)
If I remove torch.set_grad_enabled(False), or if I run it on only one image, I do not have the issue. So I need to run it on two or more images sequentially with torch.set_grad_enabled(False) to reproduce it. Using with torch.no_grad(): results in the same behavior.
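Putting it together, a minimal standalone script that triggers the warning and the hang on my setup looks roughly like this (it assumes the densenet169.pt scripted in the first post):

import torch

# load the TorchScript model saved earlier
device = torch.device('cuda')
model = torch.jit.load('densenet169.pt', map_location=device)

torch.set_grad_enabled(False)  # culprit: removing this makes the issue go away
for i in range(3):
    # two or more sequential forward passes are needed to trigger it
    image = torch.rand(1, 3, 224, 224).to(device)
    model.forward(image)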

Sorry, I did not provide these details in the original post, as I did not think they could be causing the issue (I still do not understand why they do).

Hello there,
I have run into a very similar issue when converting another model to TorchScript. At first I thought it froze on the second batch. IMO the problem is connected with BatchNorm; here is code that should explain it:

import torch
from tqdm import tqdm
import time
from typing import Type
from argparse import ArgumentParser


class Layer(torch.nn.Module):
    def __init__(self, num_input_features: int, num_output_features: int) -> None:
        super().__init__()
        self.num_input_features = num_input_features
        self.num_output_features = num_output_features


class WithBN(Layer):
    def __init__(self, num_input_features: int, num_output_features: int) -> None:
        super().__init__(num_input_features, num_output_features)
        self.bn = torch.nn.BatchNorm2d(num_input_features)
        self.conv = torch.nn.Conv2d(num_input_features, num_output_features, kernel_size=1, stride=1, bias=False)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.bn(x)
        x = self.conv(x)
        return x


class WithoutBN(Layer):
    def __init__(self, num_input_features: int, num_output_features: int) -> None:
        super().__init__(num_input_features, num_output_features)
        self.conv = torch.nn.Conv2d(num_input_features, num_output_features, kernel_size=1, stride=1, bias=False)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)
        return x


class DeepNetwork(torch.nn.Module):
    def __init__(self, num_input_features: int, num_layers: int, layer: Type[Layer], growth: int) -> None:
        super().__init__()
        for i in range(num_layers):
            num_output_features = num_input_features + growth
            l = layer(num_input_features, num_output_features)
            self.add_module("denselayer%d" % (i + 1), l)
            num_input_features = num_output_features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _, layer in self.named_children():
            x = layer(x)
        return x        


def loop(model: torch.nn.Module, use_tqdm: bool = True) -> None:
    model = torch.jit.script(model)
    with torch.no_grad():  # culprit
        for i in tqdm(range(3), colour='green', leave=False, disable=not use_tqdm):
            start = time.time()
            model(glob_input)
            stop = time.time()
            print(f"It {i}th time:{stop - start:.2f}s")


if __name__ == '__main__':
    global device
    global glob_input

    parser = ArgumentParser()
    parser.add_argument('--gpu', action='store_true', help='use gpu')
    parser.add_argument('--growth', type=int, default=16, help='growth rate')
    parser.add_argument('--tqdm', action='store_true', help='use tqdm')
    args = parser.parse_args()

    device = torch.device("cuda:0" if torch.cuda.is_available() and args.gpu else "cpu")
    num_layers = 99
    growth = args.growth
    num_input_features = 64
    glob_input = torch.rand(2, num_input_features, 224, 224).to(device)

    print("Model with BatchNorm")
    model = DeepNetwork(num_input_features, num_layers, WithBN, growth).to(device)
    model.eval()
    loop(model, use_tqdm=args.tqdm)

    print("Model without BatchNorm")
    model = DeepNetwork(num_input_features, num_layers, WithoutBN, growth).to(device)
    model.eval()
    loop(model, use_tqdm=args.tqdm)

I tested my hypothesis with networks of different widths, on CPU and on GPU. Here are the outputs, with the command used at the top of each:

python3 freeze_example.py --growth=2
Model with BatchNorm
It 0th time:4.08s
It 1th time:12.35s
It 2th time:3.77s
Model without BatchNorm
It 0th time:2.96s
It 1th time:2.93s
It 2th time:2.88s

python3 freeze_example.py --growth=4
Model with BatchNorm
It 0th time:7.05s
It 1th time:15.53s
It 2th time:6.59s
Model without BatchNorm
It 0th time:5.95s
It 1th time:5.76s
It 2th time:5.53s

python3 freeze_example.py --growth=8
Model with BatchNorm
It 0th time:13.31s
It 1th time:21.57s
It 2th time:13.02s
Model without BatchNorm
It 0th time:11.77s
It 1th time:10.68s
It 2th time:10.75s


python3 freeze_example.py --growth=2 --gpu
Model with BatchNorm
It 0th time:2.29s
It 1th time:8.38s
It 2th time:0.01s
Model without BatchNorm
It 0th time:0.05s
It 1th time:0.05s
It 2th time:0.01s

python3 freeze_example.py --growth=16 --gpu
Model with BatchNorm
It 0th time:3.03s
It 1th time:8.32s
It 2th time:0.01s
Model without BatchNorm
It 0th time:0.05s
It 1th time:0.05s
It 2th time:0.00s

python3 freeze_example.py --growth=64 --gpu
Model with BatchNorm
It 0th time:13.77s
It 1th time:8.44s
It 2th time:0.01s
Model without BatchNorm
It 0th time:8.74s
It 1th time:0.05s
It 2th time:0.00s

I tested it in Docker with:

FROM nvidia/cuda:11.6.2-cudnn8-devel-ubuntu20.04
RUN pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116