Hi everyone,
I am currently writing C++ extensions for PyTorch that implement custom layers. I have both CPU and GPU implementations of my layers, which I developed following the guide for writing C++ and CUDA extensions (Custom C++ and CUDA Extensions — PyTorch Tutorials 1.9.0+cu102 documentation).
The guide is very helpful; however, one question remains: what is the best way to select, at runtime, which back-end (CPU or GPU) implementation of the layer should be executed? The guide shows how to provide a C++ extension with either an explicit CPU or an explicit GPU implementation, but not both.
This can be seen here: Custom C++ and CUDA Extensions — PyTorch Tutorials 1.9.0+cu102 documentation. Let's consider only the forward pass of the layer in the example (see the simplified code excerpt from the link below):
#include <torch/extension.h>

#include <vector>

// Forward declaration of the function that launches the CUDA kernel.
std::vector<torch::Tensor> lltm_cuda_forward(
    torch::Tensor input,
    torch::Tensor weights,
    torch::Tensor bias,
    torch::Tensor old_h,
    torch::Tensor old_cell);

// Input checks (the tutorial's x.type().is_cuda() is deprecated; x.is_cuda() is equivalent).
#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

// Wrapper that validates the inputs and calls the CUDA implementation.
std::vector<torch::Tensor> lltm_forward(
    torch::Tensor input,
    torch::Tensor weights,
    torch::Tensor bias,
    torch::Tensor old_h,
    torch::Tensor old_cell) {
  CHECK_INPUT(input);
  CHECK_INPUT(weights);
  CHECK_INPUT(bias);
  CHECK_INPUT(old_h);
  CHECK_INPUT(old_cell);
  return lltm_cuda_forward(input, weights, bias, old_h, old_cell);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("forward", &lltm_forward, "LLTM forward (CUDA)");
}
In this excerpt:

- lltm_cuda_forward: the function which launches the CUDA kernel is declared.
- lltm_forward: a wrapper function is defined which calls lltm_cuda_forward. However, lltm_forward has its inputs checked (via the CHECK_INPUT macro) to ensure they reside in GPU memory.
- The lltm_forward routine is then bound to Python with pybind11.
To me it makes sense to put a condition in lltm_forward that checks where the input tensors reside (in CPU or GPU memory) and selects the appropriate back-end. However, is this good practice? Do you have any better suggestions?
The only other option I can see is exposing both the CPU and GPU back-ends to Python through the bindings, and again checking the input tensors' device, this time on the Python side, to choose which back-end to call. This seems a bit less desirable to me.
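Concretely, that alternative would bind both wrappers separately (lltm_cpu_forward is hypothetical here; only the CUDA version appears in the tutorial):

```cpp
// Expose both back-ends and leave the device check to the caller.
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("forward_cpu", &lltm_cpu_forward, "LLTM forward (CPU)");
  m.def("forward_cuda", &lltm_cuda_forward, "LLTM forward (CUDA)");
}
```

The Python side would then have to select the function itself, e.g. calling forward_cuda when input.is_cuda is true and forward_cpu otherwise, which spreads the dispatch logic across two languages.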
Thanks for reading this, and for any suggestions/advice!