How to tell if calculations are happening on GPU?

I am writing a custom linear layer and I want to execute it on the GPU. The tensors are correctly located on the GPU (.is_cuda returns True for both input and weight). I am performing the multiplications directly using nested loops and I am getting the same answer as the built-in PyTorch Linear layer, but I’m not sure if this calculation is actually being done on the GPU. Is there a way to check?

def forward(ctx, X, weight, bias):
    ctx.save_for_backward(X, weight, bias)

    print(f"Input:: GPU: {X.is_cuda}, Weight:: GPU: {weight.is_cuda}")

    (m, n_W) = X.shape
    (A, _) = weight.shape

    # Transpose the weights
    weight = torch.transpose(weight, 0, 1)

    # Allocate the output on the same device and with the same dtype as the input
    output = torch.zeros(m, A, device=X.device, dtype=X.dtype)

    for k in range(m):
        for l in range(A):
            accumulation = 0.0
            for j in range(n_W):
                accumulation += X[k, j] * weight[j, l]
            output[k, l] = accumulation + bias[l]

    return output


If everything is indeed a CUDA tensor, then the computation should be happening on the GPU. You can run the script under nvprof (e.g., nvprof python followed by your script) to get a sense of which kernels are being executed on the GPU.

Alternatively, if you are running the script interactively (or it is long-running enough), you can do something like watch nvidia-smi to get a sense of GPU usage as the script is running.
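
A similar check can also be made from inside Python. The sketch below is only an illustration and is not from the thread: it assumes a PyTorch version that ships torch.profiler, uses made-up tensor shapes, and falls back to the CPU when no GPU is present. It prints the device of the result and the operators the profiler recorded; when CUDA is available, the table also lists the GPU kernels that were launched.

```python
import torch
import torch.nn.functional as F
from torch.profiler import ProfilerActivity, profile

# Fall back to the CPU so the sketch still runs on a machine without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Illustrative shapes only (assumption, not taken from the thread).
X = torch.randn(8, 16, device=device)
weight = torch.randn(4, 16, device=device)
bias = torch.randn(4, device=device)

activities = [ProfilerActivity.CPU]
if device.type == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    output = F.linear(X, weight, bias)
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels to finish

# The result is created on the same device as the inputs.
print("output device:", output.device)

# Names of the recorded operators, then a summary table of where time went.
op_names = [event.key for event in prof.key_averages()]
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

If the printed device is cuda and the table shows CUDA kernels, the work ran on the GPU.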

Hi @eqy, thank you. Yes, the script does run for a long time, so I can try the commands you mentioned. But what I really wanted to check is whether this simple Python code gets converted to NVIDIA PTX code automatically so that it runs on the GPU. I am doing the multiply-accumulate manually, and until I came across a tutorial I imagined I would have to write CUDA C++ code to use the GPU. That’s why I wanted to ask and double-check.

Normally, when you use common operations such as matrix multiplication, matrix-vector multiplication, elementwise multiply-add, or convolution, they are dispatched to precompiled kernels in ATen (PyTorch’s tensor operator library), cuBLAS, or cuDNN. No compilation to PTX or similar happens unless you are doing some kind of JIT compilation (e.g., TorchScript).

Rather than Python code being “converted,” usually what is happening is the Python code is calling into C++ code that calls the relevant CUDA kernels.
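
To make that concrete, here is a small sketch (my own illustration, not code from the thread) that writes the same linear layer two ways. In the loop version, each scalar multiply and add is dispatched as its own tiny tensor operation; in the second version, one torch.addmm call hands the whole computation to a precompiled routine. The shapes, the seed, and the helper names linear_loops and linear_single_call are assumptions chosen for the example, and the code falls back to the CPU if no GPU is available.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

torch.manual_seed(0)
X = torch.randn(3, 5, device=device)
weight = torch.randn(2, 5, device=device)  # (out_features, in_features) layout
bias = torch.randn(2, device=device)


def linear_loops(X, weight, bias):
    """Loop version: every multiply and add below is a separate tensor op."""
    m, n_W = X.shape
    A, _ = weight.shape
    output = torch.zeros(m, A, device=X.device, dtype=X.dtype)
    for k in range(m):
        for l in range(A):
            accumulation = torch.zeros((), device=X.device, dtype=X.dtype)
            for j in range(n_W):
                accumulation = accumulation + X[k, j] * weight[l, j]
            output[k, l] = accumulation + bias[l]
    return output


def linear_single_call(X, weight, bias):
    """One dispatched call computing bias + X @ weight.T."""
    return torch.addmm(bias, X, weight.t())


loop_out = linear_loops(X, weight, bias)
single_out = linear_single_call(X, weight, bias)
print(torch.allclose(loop_out, single_out, atol=1e-5))
```

Both versions run on whatever device the tensors live on; the difference is how many separate operations Python has to dispatch.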

Thank you. I tried nvprof as you suggested and it seems to be spending a lot of time in this function:
void at::native::vectorized_elementwise_kernel

How can I find where this function is located? I also have the entire PyTorch source downloaded and when I grep for this function there, I just get hits in a bunch of .json files but no source files.

Alternatively, is there somewhere layers are defined in CUDA code directly? If I could get that, I could modify that and plug that in via PyBind. It’s a little hard to parse through the various backends supported, like iDeep, MKLDNN etc. So not sure where I might just find the base CUDA code.

For elementwise kernels, these are typically implemented with PyTorch’s TensorIterator to generate kernels rather than with handwritten CUDA, so there isn’t really any CUDA code to directly inspect in this case.

Thank you! I get it now. I’ve also been following this tutorial to see how to write my own efficient GPU version of the layer. Since I need to change the multiplication operation, I need fairly low-level access to the operation without sacrificing too much performance. I hadn’t seen that page about the TensorIterator, so I’ll be sure to check that out as well!