Inplace elementwise multiplication of 3 dim tensors with CUDA

Hi, I have a tensor x of shape [8, 8, 89] and a second tensor s of shape [8, 8] containing only the values 1 and -1.
Now I want to expand s to the same shape as x:

s = s.unsqueeze(2).expand(x.shape)

and multiply them element-wise:

x = x*s

Two questions:

  1. Why do I get "RuntimeError: CUDA error: invalid configuration argument" when running this code with CUDA?
  2. Is there a better operator that does this in place?

Thank You!


  1. I can’t reproduce this error when running on Colab.
    Does the following code work for you?
import torch

s = torch.rand(8, 8, device="cuda")
x = torch.rand(8, 8, 89, device="cuda")

s = s.unsqueeze(2).expand(x.shape)

s * x
  2. You can do this in place with x *= s.
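
As a minimal sketch of the in-place variant: the explicit expand is not strictly needed, because a tensor of shape [8, 8, 1] broadcasts against [8, 8, 89]. The random sign tensor and the CPU fallback below are assumptions made only so the snippet runs anywhere.

```python
import torch

# Fall back to the CPU so the sketch also runs on a machine without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.rand(8, 8, 89, device=device)
# Random signs in {-1, +1} with shape [8, 8] (illustrative data only).
s = torch.randint(0, 2, (8, 8), device=device).to(x.dtype) * 2 - 1

# Out-of-place reference result, kept only to check the in-place version.
expected = x * s.unsqueeze(2)

# In place: s.unsqueeze(2) has shape [8, 8, 1] and broadcasts over the last dim.
x.mul_(s.unsqueeze(2))
```

x.mul_(...) and x *= ... are equivalent here. If x takes part in autograd, the in-place version may be rejected, in which case the out-of-place x = x * s is the safer choice.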

Thank you!
Ok, you are right. The issue must be related to something deeper. I just copied the part of the model that raised the error, but now all operations on x seem to have the same issue.
I have to investigate this in more detail…

Is it possible to dump all relevant information of a tensor to better reproduce issues like that?

x: torch.Size([8, 8, 89]) torch.float32 cuda:0
s: torch.Size([8, 8, 89]) torch.float32 cuda:0
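
A dump like the one above can be produced with a small helper. This is only a sketch: describe is a hypothetical name, and which properties count as "relevant" is an assumption (strides and contiguity are included because an expanded tensor differs from a freshly allocated one in exactly those).

```python
import torch

def describe(name, t):
    """Return a one-line summary of tensor properties useful for a bug report."""
    return (
        f"{name}: shape={tuple(t.shape)} dtype={t.dtype} device={t.device} "
        f"stride={t.stride()} contiguous={t.is_contiguous()} "
        f"requires_grad={t.requires_grad}"
    )

x = torch.rand(8, 8, 89)
# Same construction as in the question: expand a [8, 8] tensor to x's shape.
s = torch.ones(8, 8).unsqueeze(2).expand(x.shape)

print(describe("x", x))
print(describe("s", s))
```

The expanded s reports a stride of 0 in its last dimension and is not contiguous, which is the kind of detail that shape, dtype and device alone do not reveal.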

I found the issue. The problem came from the preceding operation, the one that computes x. It was caused by a custom CUDA kernel launched with 8x8 blocks of 89 threads each.
It seems that PyTorch evaluates the graph lazily, so the error only popped up at the following operation.

It is the CUDA API that is asynchronous, not PyTorch evaluating lazily. So errors can indeed show up later than the call that caused them.
You can force it to be synchronous by setting the CUDA_LAUNCH_BLOCKING=1 environment variable, so that the error points at the right place.
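
The usual way is to set the variable on the command line when starting the script. As a sketch of doing the same from inside Python, assuming the variable has to be in place before CUDA is initialised (hence before importing torch here):

```python
import os

# Set this before CUDA is initialised; doing it before importing torch is the safe order.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

if torch.cuda.is_available():
    x = torch.rand(8, 8, 89, device="cuda")
    y = x * 2
    # Alternative while debugging: wait for all queued kernels here, so a
    # pending launch error is reported at this line instead of at a later op.
    torch.cuda.synchronize()
```

Synchronous launches slow everything down, so this is meant for debugging runs only.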

Ok, the actual error was in my custom operator.
The mistake was related to CUDA, not directly to PyTorch.
Launching the CUDA kernel with a block shape of (1,1,N) fails when N > 64, because the z dimension of a thread block is limited to 64 threads. When the number of threads is passed as a scalar instead of a 3D vector, it ends up in the x dimension, which has a much higher limit, and everything works well.
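
To make the failing configuration concrete, here is a rough pure-Python check of a block shape. The limits are assumptions taken from the values commonly documented for current NVIDIA GPUs (1024, 1024 and 64 threads in x, y and z, and 1024 threads per block in total); the authoritative numbers come from querying the device, e.g. with cudaGetDeviceProperties. check_block_dim is a hypothetical helper, not part of CUDA or PyTorch.

```python
# Assumed limits; query the actual device to be sure.
MAX_BLOCK_DIM = (1024, 1024, 64)   # per-dimension limits for (x, y, z)
MAX_THREADS_PER_BLOCK = 1024       # limit on x * y * z

def check_block_dim(block):
    """Return a list of problems with an (x, y, z) block shape; empty if none found."""
    problems = []
    for axis, size, limit in zip("xyz", block, MAX_BLOCK_DIM):
        if size < 1 or size > limit:
            problems.append(f"block.{axis}={size} outside [1, {limit}]")
    total = block[0] * block[1] * block[2]
    if total > MAX_THREADS_PER_BLOCK:
        problems.append(f"{total} threads per block exceeds {MAX_THREADS_PER_BLOCK}")
    return problems

# The shape from this thread: 89 threads placed in the z dimension.
print(check_block_dim((1, 1, 89)))
# What a scalar thread count of 89 corresponds to: 89 threads in x.
print(check_block_dim((89, 1, 1)))
```

Under these assumed limits the first shape is flagged for its z dimension and the second passes, which matches the behaviour described above.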