What is the best practice for choosing between CPU and GPU dispatch for custom C++ extensions?

Hi everyone,

I am currently writing C++ extensions on top of PyTorch that implement custom layers. I have both CPU and GPU implementations of my layers, which I developed following the guide for writing C++ and CUDA extensions (Custom C++ and CUDA Extensions — PyTorch Tutorials 1.9.0+cu102 documentation).

The guide is very helpful; however, one question remains: what is the best way to select which back-end (CPU or GPU) implementation of the layer should be executed at runtime? The guide shows how to provide a C++ extension for an explicit CPU or GPU implementation, but not for both.

This can be seen here: Custom C++ and CUDA Extensions — PyTorch Tutorials 1.9.0+cu102 documentation. Let’s consider only the forward pass of the layer in the example (see below for a simplified code excerpt from the link):

#include <torch/extension.h>
#include <vector>

std::vector<torch::Tensor> lltm_cuda_forward(
    torch::Tensor input,
    torch::Tensor weights,
    torch::Tensor bias,
    torch::Tensor old_h,
    torch::Tensor old_cell);

#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

std::vector<torch::Tensor> lltm_forward(
    torch::Tensor input,
    torch::Tensor weights,
    torch::Tensor bias,
    torch::Tensor old_h,
    torch::Tensor old_cell) {
  CHECK_INPUT(input);
  CHECK_INPUT(weights);
  CHECK_INPUT(bias);
  CHECK_INPUT(old_h);
  CHECK_INPUT(old_cell);

  return lltm_cuda_forward(input, weights, bias, old_h, old_cell);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("forward", &lltm_forward, "LLTM forward (CUDA)");
}
  • lltm_cuda_forward - the function that launches the CUDA kernel is declared.
  • lltm_forward - a wrapper function is defined that calls lltm_cuda_forward. However, lltm_forward has its inputs checked to ensure they reside in GPU memory (using the CHECK_INPUT macro).
  • The lltm_forward routine is then bound to Python with PyBind11.

To me it makes sense to have a condition in lltm_forward that checks where the input tensors reside (in CPU or GPU memory) and selects the appropriate back-end, as sketched below. However, is this good practice? Or do you have any better suggestions?
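For concreteness, here is a minimal sketch of what I have in mind. Note that lltm_cpu_forward is a hypothetical CPU counterpart that I would provide myself; it is not part of the tutorial:

std::vector<torch::Tensor> lltm_cpu_forward(
    torch::Tensor input,
    torch::Tensor weights,
    torch::Tensor bias,
    torch::Tensor old_h,
    torch::Tensor old_cell); // hypothetical CPU implementation

std::vector<torch::Tensor> lltm_forward(
    torch::Tensor input,
    torch::Tensor weights,
    torch::Tensor bias,
    torch::Tensor old_h,
    torch::Tensor old_cell) {
  // Branch on the device of the input tensor and pick the back-end.
  if (input.is_cuda()) {
    CHECK_INPUT(input);
    CHECK_INPUT(weights);
    CHECK_INPUT(bias);
    CHECK_INPUT(old_h);
    CHECK_INPUT(old_cell);
    return lltm_cuda_forward(input, weights, bias, old_h, old_cell);
  }
  return lltm_cpu_forward(input, weights, bias, old_h, old_cell);
}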

The only other option I can see is exposing both the CPU and GPU back-ends to Python through the bindings, and then checking the input device on the Python side to decide which back-end to call. This seems a bit less desirable to me.
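For reference, the bindings for that option might look something like this (again assuming the hypothetical lltm_cpu_forward from above, with the device check left to the Python caller):

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  // Expose both back-ends; the Python caller chooses one,
  // e.g. based on input.is_cuda.
  m.def("forward_cpu", &lltm_cpu_forward, "LLTM forward (CPU)");
  m.def("forward_cuda", &lltm_forward, "LLTM forward (CUDA)");
}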

Thanks for reading this, and for any suggestions/advice!

There is a newer registration mechanism with back-end dispatching, described here, though I’m not sure whether it is the currently recommended approach. If you’re not building a library with more than two back-end implementations, a plain CPU/CUDA branch seems good enough to me.
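For reference, a rough sketch of what that registration style looks like with the TORCH_LIBRARY macros; the schema below is my adaptation to the LLTM example, so the details may differ from the linked tutorial:

#include <torch/library.h>

// Declare the operator schema once, under a custom namespace.
TORCH_LIBRARY(lltm, m) {
  m.def("forward(Tensor input, Tensor weights, Tensor bias, "
        "Tensor old_h, Tensor old_cell) -> Tensor[]");
}

// Register one implementation per back-end; the dispatcher then
// routes each call based on the device of the input tensors.
TORCH_LIBRARY_IMPL(lltm, CPU, m) {
  m.impl("forward", &lltm_cpu_forward);
}

TORCH_LIBRARY_IMPL(lltm, CUDA, m) {
  m.impl("forward", &lltm_cuda_forward);
}

With this in place, the op would be callable from Python as torch.ops.lltm.forward(...), with the dispatcher selecting the CPU or CUDA kernel automatically.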

Thank you, Alex, for your reply.

I hadn’t come across that link yet, so that is very helpful and seems to be what I was looking for. Yes, ideally my plan is to then tie the layers in with TorchScript, but the link you posted covers that process to some extent too.