Accelerating layers with Numba

Hi all,
Has anyone had success integrating functions decorated with numba.cuda.jit into a PyTorch workflow? http://pytorch.org/about/ mentions using your ‘favorite libraries…such as Numba’ but I’m not clear on how to actually do that.
On the CPU it's of course no problem to decorate a layer's computation with numba.jit and convert inputs and outputs to and from NumPy arrays, although even that doesn't feel particularly elegant because of all the NumPy/Tensor conversions required (I'm aware they are zero-overhead, though).
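For concreteness, a minimal sketch of the CPU version I mean (the jitted function body and the names are just placeholders):

import numba
import torch


@numba.jit(nopython=True)
def cpu_kernel(x):
    # placeholder for the actual per-element computation
    return x * 2.0


def numba_layer(t):
    # Tensor -> ndarray -> jitted function -> ndarray -> Tensor
    # (from_numpy/.numpy() share memory, so no copies, but it still feels clunky)
    return torch.from_numpy(cpu_kernel(t.numpy()))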

I got a Numba device-array view of a torch CUDA tensor like this:

import os
import ctypes

import numpy
import torch
from numba import cuda

# paths are system specific; they point Numba at the CUDA toolkit's libdevice and NVVM
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/lib/nvidia-cuda-toolkit/libdevice/'
os.environ['NUMBAPRO_NVVM'] = '/usr/lib/x86_64-linux-gnu/libnvvm.so.3.1.0'


@cuda.jit('(float32[:,:], float32[:,:], float32[:,:], float32[:,:], float32[:,:], float32[:,:], int32, int32, int32)')
def cu_something(A, target, c, d, u, v, b, n, m):
    ...


def get_devicendarray(t):
    """Wrap a torch.cuda.FloatTensor as a Numba DeviceNDArray without copying."""
    assert t.type() == 'torch.cuda.FloatTensor'
    ctx = cuda.cudadrv.driver.driver.get_context()
    # raw device pointer and size in bytes (4 bytes per float32); strides are in bytes, too
    mp = cuda.cudadrv.driver.MemoryPointer(ctx, ctypes.c_ulong(t.data_ptr()), t.numel() * 4)
    return cuda.cudadrv.devicearray.DeviceNDArray(t.size(), [i * 4 for i in t.stride()], numpy.dtype('float32'),
                                                  gpu_data=mp, stream=torch.cuda.current_stream().cuda_stream)

I then used it with torch.cuda.FloatTensors A, target, c, d, u, v:

# wrap each tensor as a device array, then launch the kernel;
# BLOCK is the thread-block edge length and the grid covers a b x n domain
Ad, targetd, cd, dd, ud, vd = (get_devicendarray(x) for x in (A, target, c, d, u, v))
cu_something[((b - 1) // BLOCK + 1, (n - 1) // BLOCK + 1), (BLOCK, BLOCK)](Ad, targetd, cd, dd, ud, vd, b, n, m)

It seemed to work, but people who know this better than I do warned me that it is dangerous and potentially unstable (e.g., whether the context obtained above is the right one).
I have not done much more with it.

Best regards

Thomas


Thank you @tom, this works like a charm for me, even without the explicit os.environ settings.
Suggestion for @smth: it would be great if PyTorch officially supported converting a torch tensor to a DeviceNDArray (Numba's GPU-resident NumPy-like array). It could be named torch.Tensor.numba(); that would avoid relying on complicated, unsupported code and would not require any extra data movement from GPU to CPU and back to GPU.

There is a great PR open for numba integration:


Hello, using the __cuda_array_interface__, how can I build the Numba array?

numba_cuda_array = numba.cuda.as_cuda_array(cuda_tensor)
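For instance, a minimal end-to-end sketch (the kernel and the names here are just for illustration):

import numba.cuda
import torch


@numba.cuda.jit
def double_inplace(x):
    # toy elementwise kernel, just to show the wiring
    i = numba.cuda.grid(1)
    if i < x.size:
        x[i] *= 2.0


cuda_tensor = torch.arange(16, dtype=torch.float32, device='cuda')
numba_cuda_array = numba.cuda.as_cuda_array(cuda_tensor)  # zero-copy view
double_inplace[1, 32](numba_cuda_array)
print(cuda_tensor)  # the tensor sees the kernel's writes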

The Numba integration test script is probably a good source of information on what works between PyTorch and Numba.

Best regards

Thomas


Thanks, Thomas V! One more thing I don't find in the test script, regarding CUDA streams: can we pass a numba.cuda stream to PyTorch, or the other way around?

You could try to use the torch Stream object's cuda_stream property and feed that to numba.cuda.external_stream, but I haven't tried it.
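An untested sketch of what I mean, assuming a Numba version that provides numba.cuda.external_stream:

import numba.cuda
import torch

torch_stream = torch.cuda.Stream()
# wrap the raw CUDA stream pointer in a Numba stream object
numba_stream = numba.cuda.external_stream(torch_stream.cuda_stream)
# numba_stream can then be passed in the launch configuration, e.g. some_kernel[grid, block, numba_stream](...);
# the reverse direction would presumably go through torch.cuda.ExternalStream with the raw
# stream pointer, but I have not tried that either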

Does this work with autograd? I want to be able to insert a Numba-accelerated operation inside my model and have backpropagation work as normal. Is this possible? Cheers

No, since autograd won't be able to track third-party library operations. You would thus need to implement a custom autograd.Function, as described here, including the backward pass.
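A rough sketch of how that could look, with a Numba CPU function standing in for the accelerated op (the names are placeholders, and since the toy op is just multiplication by 2, the hand-written backward is trivial):

import numba
import torch


@numba.jit(nopython=True)
def my_numba_op(x):
    # placeholder for the accelerated forward computation
    return x * 2.0


class NumbaFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp):
        out = torch.from_numpy(my_numba_op(inp.detach().cpu().numpy()))
        return out.to(inp.device)

    @staticmethod
    def backward(ctx, grad_output):
        # d/dx (2 * x) = 2; in general you would call a second Numba
        # kernel implementing the backward pass here
        return grad_output * 2.0


x = torch.randn(4, dtype=torch.float64, requires_grad=True)
y = NumbaFunction.apply(x).sum()
y.backward()
print(x.grad)  # a tensor of 2s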