In-place element-wise multiplication of 3-dim tensors with CUDA

Hi, I have a tensor x of shape [8, 8, 89] and a second tensor s of shape [8, 8] containing only the values 1 and -1.
Now I want to expand s to the same shape as x:

s = s.unsqueeze(2).expand(x.shape)

and multiply them element-wise:

x = x*s

Two questions:

  1. Why do I get a RuntimeError: CUDA error: invalid configuration argument when running this code with CUDA?
  2. Is there a better operator that does this in-place?

Thank You!


  1. I can’t reproduce this error when running on Colab.
    Does the following code work for you?
import torch

s = torch.rand(8, 8, device="cuda")
x = torch.rand(8, 8, 89, device="cuda")

s = s.unsqueeze(2).expand(x.shape)

s * x
  2. You can do this in-place with x *= s.
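A minimal sketch of the in-place variant: because PyTorch broadcasting expands a trailing size-1 dimension automatically, the explicit expand() call is not actually needed before the multiply (CPU tensors are used here so the sketch runs anywhere; the same code works with device="cuda").

```python
import torch

x = torch.rand(8, 8, 89)
# Values drawn from {-1.0, 1.0}, like the sign tensor in the question
s = torch.randint(0, 2, (8, 8)).float() * 2 - 1

# In-place multiply: unsqueeze(2) makes s [8, 8, 1], and broadcasting
# stretches it over the last dimension, so no explicit expand() is needed.
x *= s.unsqueeze(2)
```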

Thank you!
OK, you are right. The issue must be something deeper. I just copied the part of the model which raised the issue, but now all operations on x seem to hit the same error.
I'll have to investigate this in more detail…

Is it possible to dump all relevant information of a tensor to better reproduce issues like that?

x: torch.Size([8, 8, 89]) torch.float32 cuda:0
s: torch.Size([8, 8, 89]) torch.float32 cuda:0
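Beyond shape, dtype, and device, the strides and contiguity of a tensor often matter when reproducing bugs. A small helper along these lines can dump them in one line (the name `describe` is mine, not part of PyTorch):

```python
import torch

def describe(name, t):
    """Return (and print) the attributes that usually matter
    when trying to reproduce a tensor bug."""
    info = (f"{name}: shape={tuple(t.shape)} dtype={t.dtype} device={t.device} "
            f"strides={t.stride()} contiguous={t.is_contiguous()} "
            f"requires_grad={t.requires_grad}")
    print(info)
    return info

x = torch.rand(8, 8, 89)
info = describe("x", x)
```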

I found the issue. The problem came from the preceding operation which calculates x. It was caused by a custom CUDA kernel which spawned 89 threads and 8x8 blocks.
It seems that PyTorch evaluates lazily, and the error only popped up at the following operation.


It is the CUDA API that is asynchronous, so errors can indeed surface later than the operation that caused them.
You can force synchronous execution by setting the CUDA_LAUNCH_BLOCKING=1 environment variable, which makes the error point at the right place.


OK, the actual error was in my custom operator.
The mistake was related to CUDA, not directly to PyTorch.
When launching the CUDA kernel with a block size of (1, 1, N) and N > 64 threads, the launch fails, because the z-dimension of a thread block is limited to 64 threads. When you pass the number of threads as a scalar instead of a 3-d vector, it goes into the x-dimension (whose limit is 1024), and everything works.
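A sketch of the two launch configurations, assuming a simple element-wise kernel (the kernel name and wrapper are mine, not from the original code):

```cuda
#include <cuda_runtime.h>

__global__ void scale_kernel(float* x, int n) {
    // With a scalar thread count, the thread index lives in x;
    // with dim3(1, 1, N) it would live in threadIdx.z instead.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= -1.0f;
}

void launch(float* d_x, int n) {
    // dim3(1, 1, 89) fails: the z-dimension of a block is capped at
    // 64 threads, so N > 64 yields "invalid configuration argument".
    // dim3 bad_block(1, 1, 89);   // exceeds the z-dimension limit
    dim3 good_block(89, 1, 1);     // equivalent to passing the scalar 89
    scale_kernel<<<dim3(8, 8), good_block>>>(d_x, n);
}
```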