Integer convolution on GPU

I’m using pytorch to perform some image processing computations (no AI involved).

In particular, I’m trying to perform integer convolutions on GPU, expecting a significant boost in performance compared to float32 (is that really the case? I observed some strange behavior, like float16 convolutions being slower than float32, so I’m not sure anymore…).

However, when I try this, I get various errors, such as the type not being supported or CUDNN_STATUS_BAD_PARAM.

I imagine that integer convolution must be implemented somewhere for quantization computation, but I wasn’t able to find it.
Is there a way to efficiently perform a convolution between int16 tensors with PyTorch? (And will it really be much faster than float32 convolution in practice?)

Thank you for your help !

Integer inputs and parameters shouldn’t be supported in native nn.Modules, and you should see an error like this when converting the module to an integer dtype:
> TypeError: only accepts floating point or complex dtypes, but got desired dtype=torch.int16

If you use integer inputs this error will be raised:

x =
out = conv(x)
> RuntimeError: Input type (short int) and bias type (float) should be the same
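A minimal sketch that triggers both kinds of failure on CPU, for anyone who wants to reproduce them. The layer and input shapes here are hypothetical (they are not the ones from the original snippet), and the exact wording of the messages may differ between PyTorch versions.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3)  # hypothetical layer with float32 parameters

# Casting the module's parameters to an integer dtype is rejected.
try:
    conv.to(torch.int16)
    module_error = None
except TypeError as err:
    module_error = err

# Feeding an integer input to the float32 module fails as well.
x = torch.randint(0, 256, (1, 1, 8, 8), dtype=torch.int16)  # hypothetical input
try:
    conv(x)
    input_error = None
except RuntimeError as err:
    input_error = err

print(type(module_error).__name__, module_error)
print(type(input_error).__name__, input_error)
```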

Could you describe how you got the cuDNN error and post a minimal and executable code snippet to reproduce the issue, please?

Thank you for your answer.

Here is my code:

import torch
import torch.nn.functional as F

device = torch.device("cuda")  # assumed: not defined in the original snippet
dtype = torch.int32

ref = torch.randint(0, 256, (1, 100, 100), dtype=dtype).to(device)
tocompare = torch.randint(0, 256, (30, 1, 100, 100), dtype=dtype).to(device)

result = F.conv2d(ref, tocompare, padding=30)

(To give some context: this lets me find an image in a batch of images, one of which is the reference image but translated.)

This gives me the following error with dtype=torch.int32, but works fine with dtype=torch.float32:

File "home/user/.local/lib/python3.8/site-packages/torch/utils/", line 62, in __torch_function__
    return func(*args, **kwargs)
RuntimeError: CUDNN_BACKEND_OPERATION: cudnnFinalize Failed cudnn_status: CUDNN_STATUS_BAD_PARAM

Is there a way to accelerate 2D convolutions on integers, or am I stuck with the float32 convolution? In that case, how does quantized convolution work? Is there a way to hack it to perform my convolution more efficiently?

Thank you

Thanks for the code. The error message is indeed a bit misleading; it should instead fail with a disallowed-dtype message, as seen in my code snippet using modules.

For inference you could try to use TorchTRT which could quantize the model and speed it up.
I don’t know if another backend already uses integer ops for performance reasons.

The main way to speed up integer math on gpu is probably using torch.compile along with triton. We’re taking this avenue ourselves (the quantization/ao team) for implementing quantization on GPU for certain application areas. I’m not sure if the int8 convolution kernels are exposed as of now but the torch._int_mm kernel is exposed and you can theoretically convert a convolution into a matmul.
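To make the "convolution as a matmul" idea concrete, here is a minimal sketch, assuming stride 1, no padding, and no dilation. The function name conv2d_as_matmul is hypothetical. It runs on CPU, where integer matmul is available; on CUDA the final matmul would have to be replaced by a kernel that supports the dtype. The patch matrix it builds is much larger than the input, which is where the extra cost of this approach comes from.

```python
import torch
import torch.nn.functional as F

def conv2d_as_matmul(x, w):
    """Stride-1, unpadded 2D cross-correlation of integer tensors via one matmul.

    x: (N, C, H, W) integer tensor, w: (O, C, kh, kw) integer tensor.
    """
    n, c, _, _ = x.shape
    o, _, kh, kw = w.shape
    # Sliding windows as a view: (N, C, H_out, W_out, kh, kw)
    patches = x.unfold(2, kh, 1).unfold(3, kw, 1)
    h_out, w_out = patches.shape[2], patches.shape[3]
    # One row per output position, one column per (C, kh, kw) window element
    cols = patches.permute(0, 2, 3, 1, 4, 5).reshape(n * h_out * w_out, c * kh * kw)
    # (N*H_out*W_out, C*kh*kw) @ (C*kh*kw, O)
    out = cols @ w.reshape(o, -1).t()
    return out.reshape(n, h_out, w_out, o).permute(0, 3, 1, 2)

torch.manual_seed(0)
x = torch.randint(0, 256, (2, 1, 12, 12), dtype=torch.int32)
w = torch.randint(0, 256, (3, 1, 5, 5), dtype=torch.int32)

out = conv2d_as_matmul(x, w)
# Reference in float32; exact here because every sum stays below 2**24
ref = F.conv2d(x.float(), w.float()).to(torch.int32)
print(out.shape, torch.equal(out, ref))
```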

Thank you both for these relevant answers.
I will look into TorchTRT and Triton. They look like good solutions for my needs.

Converting the convolution into matrix multiplication is also a great idea, I’m just wondering if building the equivalent matrix won’t slow down the operation.

> torch._int_mm kernel is exposed

On my side I wasn’t able to perform an int matrix multiplication; both @ and .mm give:

RuntimeError: "addmm_cuda" not implemented for 'Int'

And torch._int_mm isn’t recognized either. (I tested with torch ‘2.0.0a0+8aa34602.nv23.03’)
Maybe this is only valid in a Triton context or something similar?

Thank you!

i’m not sure about that error. see this test for an example usage:

otherwise, yes, converting to a matrix multiplication may be slow, i haven’t tested it personally. Also note: _int_mm is for int8 tensors (which is what we use for quantization) not int16 so it should be a ~4x speedup.

fp16 should be ~2x faster than fp32, that’s what a lot of people are using on gpu. not sure whether your shapes are in a weird range where the fp16 kernel is slower than fp32, but that shouldn’t be true in general, at least for big shapes.

overall though, i think GPUs aren’t really set up for integer math at the moment. we’re currently making our first foray into native quantization on GPU and we haven’t even looked at conv, so it may just be bad timing. if you hop forward in time 1 year I expect i’d have a better solution for you.
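One way to check this on a given machine is to time the same convolution in both dtypes. Below is a minimal sketch using torch.utils.benchmark; the helper name time_conv2d and the default shapes are assumptions chosen to run quickly, so substitute the real shapes to get a meaningful comparison. float16 is only timed when a CUDA device is available.

```python
import torch
import torch.nn.functional as F
from torch.utils import benchmark

def time_conv2d(dtype, device, in_shape=(1, 1, 64, 64), w_shape=(8, 1, 16, 16), padding=8):
    """Median time in seconds of one F.conv2d call for the given dtype and device."""
    x = torch.rand(in_shape, device=device, dtype=dtype)
    w = torch.rand(w_shape, device=device, dtype=dtype)
    timer = benchmark.Timer(
        stmt="F.conv2d(x, w, padding=padding)",
        globals={"F": F, "x": x, "w": w, "padding": padding},
    )
    # Timer takes care of warm-up and CUDA synchronization
    return timer.blocked_autorange(min_run_time=0.2).median

device = "cuda" if torch.cuda.is_available() else "cpu"
dtypes = [torch.float32, torch.float16] if device == "cuda" else [torch.float32]
timings = {dtype: time_conv2d(dtype, device) for dtype in dtypes}

for dtype, seconds in timings.items():
    print(f"{dtype}: {seconds * 1e3:.3f} ms")
```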

You could also create a pytorch feature request to expose a basic 2d conv op on gpu, similar to torch._int_mm

I was testing with int16, but I retried with int8 and got the same error (with ‘Char’ instead of ‘Int’).

Yes, I was surprised to get better performance with float32 than with float16 (I work on an AGX ORIN). It is possible that my shapes are weird (kernels are something like (500,1,50,50)).

OK, I understand your point about GPUs. Do you think adding a basic 2D int conv op like torch._int_mm would result in a significant performance gain? More precisely, my question is: are the issues with int on GPU due to hardware not being designed for it, or only to software support?

Thank you !

I’ve seen an error like that; it occurred when my triton version didn’t match my version of pytorch. If you use nightlies for both, that may fix it.

More importantly IDK how you are calling the addmm_cuda kernel, because torch._int_mm doesn’t call that kernel as far as I know.

note: torch._int_mm only works on 2-D matrices, so the shape you mention may require reshaping first.

About performance, yes, torch._int_mm is like 2x faster than fp16, which in our benchmarks is 2x faster than fp32. It’s mostly a software issue: until torch._int_mm was exposed, you couldn’t do matmuls quickly either.
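A minimal sketch of calling it on 2-D int8 matrices. The wrapper name int8_matmul and its fallback are assumptions, not part of PyTorch: torch._int_mm is a private function that is missing from older builds, and as far as I know it also restricts the allowed shapes on CUDA (for example dimensions that are multiples of 8), so the wrapper falls back to a plain int32 matmul when the tensors are on CPU or the function is absent.

```python
import torch

def int8_matmul(a, b):
    """Multiply two 2-D int8 matrices and return an int32 result."""
    assert a.dim() == 2 and b.dim() == 2, "reshape to 2-D first"
    assert a.dtype == torch.int8 and b.dtype == torch.int8
    if a.is_cuda and hasattr(torch, "_int_mm"):
        # Private API; availability and shape constraints depend on the build
        return torch._int_mm(a, b)
    # Fallback (assumed for illustration): widen to int32 and use a regular matmul
    return a.to(torch.int32) @ b.to(torch.int32)

torch.manual_seed(0)
# Kept on CPU here; move both tensors to a CUDA device to exercise torch._int_mm
a = torch.randint(-128, 128, (32, 64), dtype=torch.int8)
b = torch.randint(-128, 128, (64, 32), dtype=torch.int8)

out = int8_matmul(a, b)
ref = a.to(torch.int64) @ b.to(torch.int64)
print(out.shape, out.dtype, torch.equal(out.to(torch.int64), ref))
```

A 4-D convolution weight such as (500,1,50,50) would first be flattened with w.reshape(500, -1) to fit this 2-D interface.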

I haven’t tried Triton yet; it was just a pure PyTorch test.

The addmm_cuda error was raised when trying to perform an int matmul in pure PyTorch.

I’ve noted the matrix shape requirement; however, my torch version (‘2.0.0a0+8aa34602.nv23.03’) doesn’t even seem to have torch._int_mm: AttributeError: module 'torch' has no attribute '_int_mm'

Thank you for the performance figures, that’s really interesting. That means I could expect up to a 4x gain in performance compared to my current float32 implementation (and in addition I would save time by skipping the int16->f32->int16 conversions, which also seem to add some latency).

Exposing a basic 2d int conv on gpu would be a great improvement for me (but I imagine my use case is quite atypical, I’m not even doing AI but simple algorithms in int16)

Thank you !