Accelerating layers with Numba

Hi all,
Has anyone had success integrating functions decorated with numba.cuda.jit into a PyTorch workflow? http://pytorch.org/about/ mentions using your ‘favorite libraries…such as Numba’ but I’m not clear on how to actually do that.
On the CPU it's of course no problem to decorate a layer's computation with numba.jit and convert inputs and outputs to and from NumPy arrays, although even that doesn't feel particularly elegant because of all the NumPy/Tensor conversions required (I'm aware they are zero-overhead, though).
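For concreteness, a minimal sketch of the CPU version I mean (the jitted function body and the names are just placeholders):

import numba
import torch


@numba.jit(nopython=True)
def cpu_kernel(x):
    # placeholder for the actual per-element computation
    return x * 2.0


def numba_layer(t):
    # Tensor -> ndarray -> jitted function -> ndarray -> Tensor
    # (from_numpy/.numpy() share memory, so no copies, but it still feels clunky)
    return torch.from_numpy(cpu_kernel(t.numpy()))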

I got a Numba device-array view of a torch CUDA tensor like this:

import os
import ctypes

import numpy
import torch
from numba import cuda

# paths are system specific; they point Numba at the CUDA toolkit's libdevice and NVVM
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/lib/nvidia-cuda-toolkit/libdevice/'
os.environ['NUMBAPRO_NVVM'] = '/usr/lib/x86_64-linux-gnu/libnvvm.so.3.1.0'


@cuda.jit('(float32[:,:], float32[:,:], float32[:,:], float32[:,:], float32[:,:], float32[:,:], int32, int32, int32)')
def cu_something(A, target, c, d, u, v, b, n, m):
    ...


def get_devicendarray(t):
    """Wrap a torch.cuda.FloatTensor as a Numba DeviceNDArray without copying."""
    assert t.type() == 'torch.cuda.FloatTensor'
    ctx = cuda.cudadrv.driver.driver.get_context()
    # raw device pointer and size in bytes (4 bytes per float32); strides are in bytes, too
    mp = cuda.cudadrv.driver.MemoryPointer(ctx, ctypes.c_ulong(t.data_ptr()), t.numel() * 4)
    return cuda.cudadrv.devicearray.DeviceNDArray(t.size(), [i * 4 for i in t.stride()], numpy.dtype('float32'),
                                                  gpu_data=mp, stream=torch.cuda.current_stream().cuda_stream)

I then used it with torch.cuda.FloatTensors A, target, c, d, u, v:

# wrap each tensor as a device array, then launch the kernel;
# BLOCK is the thread-block edge length and the grid covers a b x n domain
Ad, targetd, cd, dd, ud, vd = (get_devicendarray(x) for x in (A, target, c, d, u, v))
cu_something[((b - 1) // BLOCK + 1, (n - 1) // BLOCK + 1), (BLOCK, BLOCK)](Ad, targetd, cd, dd, ud, vd, b, n, m)

It seemed to work, but people who know this better than I do warned me that it is dangerous and potentially unstable (e.g., whether the context obtained above is the right one).
I have not done much more with it.

Best regards

Thomas


Thank you @tom, this works like a charm for me, even without the explicit os.environ settings.
Suggestion for @smth: it would be great if PyTorch officially supported converting a torch tensor to a DeviceNDArray (Numba's GPU-resident NumPy-like array). It could be named torch.Tensor.numba(); that would avoid relying on complicated, unsupported code and would not require any extra data movement from GPU to CPU and back to GPU.

There is a great PR open for numba integration:


Hello, using the __cuda_array_interface__, how can I build the Numba array?

numba_cuda_array = numba.cuda.as_cuda_array(cuda_tensor)
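For instance, a minimal end-to-end sketch (the kernel and the names here are just for illustration):

import numba.cuda
import torch


@numba.cuda.jit
def double_inplace(x):
    # toy elementwise kernel, just to show the wiring
    i = numba.cuda.grid(1)
    if i < x.size:
        x[i] *= 2.0


cuda_tensor = torch.arange(16, dtype=torch.float32, device='cuda')
numba_cuda_array = numba.cuda.as_cuda_array(cuda_tensor)  # zero-copy view
double_inplace[1, 32](numba_cuda_array)
print(cuda_tensor)  # the tensor sees the kernel's writes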

The Numba integration test script is probably a good source of information on what works between PyTorch and Numba.

Best regards

Thomas


Thanks, Thomas V! One more thing I don't find in the test script, regarding CUDA streams: can we pass a numba.cuda stream to PyTorch, or the other way around?

You could try to use the torch Stream object's cuda_stream property and feed that to numba.cuda.external_stream, but I haven't tried it.
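An untested sketch of what I mean, assuming a Numba version that provides numba.cuda.external_stream:

import numba.cuda
import torch

torch_stream = torch.cuda.Stream()
# wrap the raw CUDA stream pointer in a Numba stream object
numba_stream = numba.cuda.external_stream(torch_stream.cuda_stream)
# numba_stream can then be passed in the launch configuration, e.g. some_kernel[grid, block, numba_stream](...);
# the reverse direction would presumably go through torch.cuda.ExternalStream with the raw
# stream pointer, but I have not tried that either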

Does this work with autograd? I want to be able to insert a Numba-accelerated operation inside my model and have backpropagation work as normal. Is this possible? Cheers

No, since autograd won't be able to track third-party library operations. You would thus need to implement a custom autograd.Function, as described here, including the backward pass.
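A rough sketch of how that could look, with a Numba CPU function standing in for the accelerated op (the names are placeholders, and since the toy op is just multiplication by 2, the hand-written backward is trivial):

import numba
import torch


@numba.jit(nopython=True)
def my_numba_op(x):
    # placeholder for the accelerated forward computation
    return x * 2.0


class NumbaFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp):
        out = torch.from_numpy(my_numba_op(inp.detach().cpu().numpy()))
        return out.to(inp.device)

    @staticmethod
    def backward(ctx, grad_output):
        # d/dx (2 * x) = 2; in general you would call a second Numba
        # kernel implementing the backward pass here
        return grad_output * 2.0


x = torch.randn(4, dtype=torch.float64, requires_grad=True)
y = NumbaFunction.apply(x).sum()
y.backward()
print(x.grad)  # a tensor of 2s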