F.interpolate and Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

When running my model on a server, the following call in the forward method

F.interpolate(g, size=input_size[2:], mode=upsample_mode)

fails with this error:

File "/usr/local/lib/python3.10/dist-packages/torch/_decomp/decompositions.py", line 2821, in upsample_bilinear2d
    x_ceil = torch.ceil(x).clamp(max=in_h - 1).to(torch.int64)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument max in method wrapper_CUDA_clamp_Tensor)

When running the same model locally on GPU this error does not appear.

The error also does not appear on the server when the size is specified directly, e.g. F.interpolate(g, size=(64, 64), mode=upsample_mode), but I need the size to be computed rather than hard-coded.

I have also seen a suggestion in the following thread: `F.interpolate` uses incorrect size when `align_corners=True` · Issue #76487 · pytorch/pytorch · GitHub

Do you know any reason for this problem or any workaround? I do not think it is related to moving the model or a tensor to CUDA.

I don’t know where the device mismatch comes from, so could you print all inputs to F.interpolate?

It seems to be related to the following issue: [pt2] The min and max parameters of torch.clamp do not support numpy format · Issue #94174 · pytorch/pytorch · GitHub, i.e. somehow related to how torch.clamp works.

I am not completely able to follow the explanation of what is happening with dynamo and inductor, but when I downgraded PyTorch to 2.0.1 on the server (to match the local machine, where it worked without this error), it worked on the server as well.

So basically, running the same code on the server in this image: PyTorch Release 23.05 - NVIDIA Docs, or locally with PyTorch 2.0.1, F.interpolate does not throw the device-mismatch error.
When running in the following image: rel-23-08, it throws the above-mentioned device-mismatch error unless the size is explicitly fixed with numbers, e.g. size=(64, 64).
So I assume there was some change between PyTorch 2.0.1 and PyTorch 2.1 which causes this error.

As to the inputs to F.interpolate, I have tried the following (a minimal sketch of these variants follows the list):

  1. input_size[2:], which is a torch.Size - works on PyTorch 2.0.1, but not on PyTorch 2.1
  2. [int(h), int(w)] and (int(h), int(w)) - work on PyTorch 2.0.1, but not on PyTorch 2.1
  3. (64, 64) - the only one that worked on both PyTorch 2.0.1 and PyTorch 2.1
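
For reference, here is a minimal standalone sketch of the three variants; the tensor shapes, the variable names g, h and w, and the upsample mode are assumptions. In plain eager mode all three run fine, and the mismatch only showed up on the server, where the traceback goes through torch/_decomp/decompositions.py:

    import torch
    import torch.nn.functional as F

    upsample_mode = "bilinear"                        # assumed mode used in the forward method
    g = torch.randn(1, 3, 32, 32, device="cuda")      # feature map to upsample (assumed shape)
    input_size = torch.Size([1, 3, 64, 64])           # shape of the reference input (assumed)
    h, w = input_size[2], input_size[3]

    # 1. torch.Size slice computed from the input - reported to fail on PyTorch 2.1
    out1 = F.interpolate(g, size=input_size[2:], mode=upsample_mode)

    # 2. plain Python ints - also reported to fail on PyTorch 2.1
    out2 = F.interpolate(g, size=(int(h), int(w)), mode=upsample_mode)

    # 3. hard-coded constants - the only variant reported to work on PyTorch 2.1
    out3 = F.interpolate(g, size=(64, 64), mode=upsample_mode)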

Hi,
I’m experiencing a similar issue with version 2.3.0.

When I train my model everything runs without any errors, but when I try to save the trained model with ‘torch.jit.trace’ I get the following error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument max in method wrapper_CUDA_clamp_Tensor)

When I call ‘torch.jit.trace’ both the model and the input are on the GPU.

This is the call in my code that generates the error:

flow_map = F.interpolate(
    input=-prob,
    size=self.input_size,
    mode="bilinear",
    align_corners=False,
)

Here ‘-prob’ is a CUDA tensor and ‘self.input_size’ is the tuple (336, 336).
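
For context, a minimal sketch of the tracing setup described above; the module definition, the input shape, and the names Upsampler and example are placeholders, not the actual model:

    import torch
    import torch.nn.functional as F

    class Upsampler(torch.nn.Module):
        """Placeholder module standing in for the real model."""
        def __init__(self, input_size=(336, 336)):
            super().__init__()
            self.input_size = input_size

        def forward(self, prob):
            # the call from this post that triggers the device mismatch during tracing
            return F.interpolate(
                input=-prob,
                size=self.input_size,
                mode="bilinear",
                align_corners=False,
            )

    model = Upsampler().cuda().eval()
    example = torch.randn(1, 1, 84, 84, device="cuda")   # assumed input shape
    traced = torch.jit.trace(model, example)              # reported to raise the RuntimeError on 2.3.0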

The exception is actually raised by the ‘get_values’ function of the ‘torch/_decomp/decompositions.py’ module at line 3488:

xp1 = (x + 1).clamp(max=inp_size - 1)

It somehow happens that, during scripting, the inp_size variable becomes a CPU tensor, while during training the variable is an int.
This seems to be the problem, because if I do this check on inp_size:

        # inp_size arrives as a CPU tensor during tracing, so move it to the
        # device of x before it is used as the clamp bound
        if type(inp_size) is not int:
            inp_size = inp_size.to(x.device)
        xp1 = (x + 1).clamp(max=inp_size - 1)

The error does not occur anymore.

If I downgrade to version 2.0.1 everything works.

I would be very grateful if you could help me because this issue is preventing us from upgrading the PyTorch version.

Thanks in advance!

I was not able to fix it and just downgraded PyTorch to 2.0.1.
I assume that this bug ([pt2] The min and max parameters of torch.clamp do not support numpy format · Issue #94174 · pytorch/pytorch · GitHub) has to be fixed.
But maybe @ptrblck could elaborate.

TorchScript is deprecated and in “maintenance” mode, which means it won’t receive any major updates or bug fixes. You could try to use torch.compile and check if the issue disappears.
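
For reference, a minimal sketch of that suggestion; the model and input below are placeholders standing in for the actual trained model:

    import torch
    import torch.nn as nn

    # placeholder module and input standing in for the trained model
    model = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1)).cuda().eval()
    example = torch.randn(1, 1, 84, 84, device="cuda")

    compiled_model = torch.compile(model)   # compile instead of exporting with torch.jit.trace
    out = compiled_model(example)           # compilation is triggered on the first call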

@alessandro_bonvini uses torch.jit.trace, so the linked issue is unrelated, as it points towards torch.compile treating numpy arrays as CPU tensors.

Thank you for the reply!

I’m using TorchScript because I need to run on Windows. Additionally, I have a runtime engine written in C++ using libtorch that reads the scripted model and executes it at runtime.

Is there any other way to do this? I mean training a model in PyTorch, saving it, and executing it at runtime with libtorch without knowing the model implementation?
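
For concreteness, this is the kind of Python-side export step I mean (the module and file name below are placeholders); the C++ engine then loads the saved file with libtorch at runtime:

    import torch
    import torch.nn as nn

    # placeholder for the trained model and a representative input
    model = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1)).cuda().eval()
    example = torch.randn(1, 1, 336, 336, device="cuda")

    traced = torch.jit.trace(model, example)   # trace the trained model with an example input
    traced.save("model_traced.pt")             # saved archive is loaded by the C++ libtorch runtime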

Thank you again!