Second forward call of torchscripted module breaks on CUDA

I have a torchscripted module which I am testing out for prediction. Each instance is a dictionary of tensors. The first instance I pass into the torchscripted module runs through the model correctly and generates an acceptable output. Passing the second instance to the same TorchScript object causes the error below. This only occurs when running on a CUDA device, not on CPU. I have ensured it's not a problem with the data by passing in the exact same instance twice and observing the error get thrown on the second forward pass. Any idea what this could be?
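For reference, the prediction code is essentially the following sketch (the batch keys, shapes, and the model variable are placeholders, not my actual names):

import torch

script = torch.jit.script(model)          # model is the trained nn.Module (placeholder name)
script = script.to('cuda')

# Each instance is a dictionary of tensors (keys and shapes here are placeholders).
batch = {
    'tokens': torch.randint(0, 100, (1, 67), device='cuda'),
    'mask': torch.ones(1, 67, device='cuda'),
}

output_greedy = script(batch)   # first forward pass: runs fine
output_greedy = script(batch)   # second forward pass with the same instance: raises the error below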

Traceback (most recent call last):
  File "predict.py", line 60, in <module>
    output_greedy = script(batch)
  File "xxx/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
RuntimeError: default_program(22): error: extra text after expected end of number

1 error detected in the compilation of "default_program".

nvrtc compilation failed: 

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)


template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}

extern "C" __global__
void fused_neg_add_mul(float* t0, float* aten_mul) {
{
  if (512 * blockIdx.x + threadIdx.x<67 ? 1 : 0) {
    float v = __ldg(t0 + (512 * blockIdx.x + threadIdx.x) % 67);
    aten_mul[512 * blockIdx.x + threadIdx.x] = ((0.f - v) + 1.f) * -1.000000020040877e+20.f;
  }
}
}

Are you scripting the same model in the same file, and does the second invocation of torch.jit.script(model) raise this issue?
If so, do you see the error on any model or just your custom one? Could you also post the output of python -m torch.utils.collect_env here, please?

Are you scripting the same model in the same file, and does the second invocation of torch.jit.script(model) raise this issue?
I tried the following and got the same error both ways: scripting in the same file and then running the two prediction calls, and scripting in the file, saving to disk, reloading in the same file, and running the two inference calls. I was originally scripting with module.to_torchscript() in PyTorch Lightning, so I also tried switching to invoking torch.jit.script directly. Same error.
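For concreteness, the save-to-disk path looked roughly like this (a sketch; the file name is a placeholder):

import torch

scripted = torch.jit.script(module)              # also tried module.to_torchscript() in Lightning
torch.jit.save(scripted, 'model_scripted.pt')    # save to disk

reloaded = torch.jit.load('model_scripted.pt', map_location='cuda')
out = reloaded(batch)    # first inference call works
out = reloaded(batch)    # second call fails with the same nvrtc error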
If so, do you see the error on any model or just your custom one? Could you also post the output of python -m torch.utils.collect_env here, please?

Collecting environment information...
PyTorch version: 1.8.1+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2

Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Tesla V100-SXM2-16GB
Nvidia driver version: 460.80
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.4
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] pytorch-lightning==1.3.5
[pip3] torch==1.8.1
[pip3] torch4u==0.0.1
[pip3] torchmetrics==0.3.2
[conda] Could not collect

I tried a basic example with another module and it does not seem to have the issue. However, the module I am torchscripting is much more complex. Any advice on how one might proceed? I additionally tried running on an A100 with CUDA 11 and hit the same issue, where it fails on the second attempt at a forward pass.

import torch
from torch import nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
            nn.ReLU()
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

module = NeuralNetwork().to(device='cuda')

# Script the module and run two forward passes on the same input.
script = torch.jit.script(module)
input = torch.rand((1, 28, 28)).to(device='cuda')

out = script(input)  # first forward pass
out = script(input)  # second forward pass
print(out)

I’m getting the same issue.

I load my scripted model and simply loop over images. The second image results in this error every time (even when it's the same image). Is there a way forward?
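The loop is essentially the following sketch (the model path, image_paths, and load_and_preprocess are placeholders for my data loading):

import torch

model = torch.jit.load('model.pt', map_location='cuda')  # 'model.pt' is a placeholder path
model.eval()

with torch.no_grad():
    for path in image_paths:                 # placeholder list of image files
        img = load_and_preprocess(path)      # placeholder: returns a CUDA tensor
        out = model(img)                     # fails on the second iteration every time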

I tried the example from @AndriyMulyar and the error does not occur.

The error showing up on the second call is because that is when the JIT fusers try to produce an optimized kernel by default.
This could be a problem with your environment if all JIT compilation fails, or a bug in the fuser (a quick way to check this is sketched after the list below).

Maybe you can isolate the issue by

  • making sure something works (from a fuser tutorial or so),
  • if it does, try to reduce your example by leaving out half of the computation to see if it still happens. If you do this a few times, you can replace the inputs with random tensors of the same dtype, shape and stride. This might get us a shareable, self-contained example.
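To check whether the TensorExpr fuser is the thing producing the broken kernel, you could also try disabling it and see whether the second call then goes through. Note that these are internal toggles rather than a stable public API, so treat this as a diagnostic sketch, not a fix:

import torch

# Internal switches (not a stable, public API): disable the TensorExpr fuser
# and GPU fusion, then run the two forward passes again.
torch._C._jit_set_texpr_fuser_enabled(False)
torch._C._jit_override_can_fuse_on_gpu(False)

out = script(batch)
out = script(batch)   # if this now succeeds, the fuser is producing the broken kernel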

Best regards

Thomas