ATen CUDA kernel dispatch for type

I am using ATen and following this example from here.
I have a kernel that needs one array of floats (for the input) and one array of ints (for the labels). Should I still use AT_DISPATCH_FLOATING_TYPES?

  1. Does this line mean only floats will be marshalled to the CUDA kernel, or is it just specifying scalar_t?

    AT_DISPATCH_FLOATING_TYPES(X.type(), "lltm_forward_cuda", ...

  2. Please correct me if I am wrong: I guess the integral types (int, bool, …) need no special treatment?

P.S. It looks like both lines 123 and 157 pass "lltm_forward_cuda" as the second argument; that does not seem right if it is meant for debugging.


Re 1: AT_DISPATCH_FLOATING_TYPES will handle float and double automatically.
You dispatch on the type of X; if you have an int-tensor argument Y, it stays what it is. You cannot pass an int-tensor as X here, though.
If you want half, there is also an _AND_HALF version.
Re 2: These don't get any special treatment; you would have to handle them on your own.
For the P.S.: using backward in the second string (at line 157) is probably better.
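
For illustration, a minimal sketch of that (forward_kernel, input, labels, blocks, threads and n are made-up names, not the tutorial's code): dispatch only on the floating-point tensor and pass the integer tensor with a fixed C++ type.

    // Kernel templated only on the floating dtype; labels are fixed to int64_t.
    template <typename scalar_t>
    __global__ void forward_kernel(const scalar_t* __restrict__ input,
                                   const int64_t* __restrict__ labels,
                                   int64_t n) {
      const int64_t i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
        // ... use input[i] (float/double) and labels[i] (int64_t) ...
      }
    }

    // Host side: the switch generated by AT_DISPATCH_FLOATING_TYPES only looks
    // at input's dtype; the labels tensor needs no dispatch of its own.
    AT_DISPATCH_FLOATING_TYPES(input.type(), "forward_cuda", ([&] {
      forward_kernel<scalar_t><<<blocks, threads>>>(
          input.data_ptr<scalar_t>(),
          labels.data_ptr<int64_t>(),
          n);
    }));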

Best regards

Thomas


Hello! If anybody is looking for it now, there are multiple AT_DISPATCH macros here.


@tom, @dashesy Hey, what does _AND_HALF do? And is there still no way to mix tensor types (long and float) when you pass them to the CUDA kernel?

@Jakub_Mitura Did you figure out which one is the right one? Dispatchers of the form AT_DISPATCH_CASE_ALL_TYPES_AND_... seem to be candidates.

No, this is not for mixing types. For quite a while the default floating-point types were FP32 and FP64, so _AND_HALF adds FP16 (and maybe BF16?) support.
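
For example (just a sketch, with a made-up op and kernel name), the half-enabled variant is used exactly like the plain one:

    AT_DISPATCH_FLOATING_TYPES_AND_HALF(X.scalar_type(), "my_op_cuda", ([&] {
      // scalar_t can now also be at::Half, in addition to float and double
      my_op_kernel<scalar_t><<<blocks, threads>>>(X.data_ptr<scalar_t>(), n);
    }));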
Given that each dtype produces a separate kernel (through C++ templating), there is no built-in mechanism for mixing two dtypes in one dispatch (I did build that manually for CTC, if I recall correctly, because cuDNN wants 32-bit ints while PyTorch defaults to 64-bit ints). These macros really are just a switch statement, so you could easily write your own, but beware of the explosion of the number of kernels.
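
A sketch of what rolling your own could look like (my_kernel, data, labels, blocks, threads and n are placeholders; here the data is fixed to float and only the label dtype varies):

    switch (labels.scalar_type()) {
      case at::ScalarType::Int:
        my_kernel<float, int32_t><<<blocks, threads>>>(
            data.data_ptr<float>(), labels.data_ptr<int32_t>(), n);
        break;
      case at::ScalarType::Long:
        my_kernel<float, int64_t><<<blocks, threads>>>(
            data.data_ptr<float>(), labels.data_ptr<int64_t>(), n);
        break;
      default:
        TORCH_CHECK(false, "my_kernel: unsupported label dtype");
    }

If the data dtype should vary too, you would nest this inside an AT_DISPATCH_FLOATING_TYPES over data, and that product of cases is exactly where the number of compiled kernels starts to explode.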

Best regards

Thomas


@tom First of all, thank you for your answer. To clarify my statement: I do not want to mix different kernels, but just mix the dtypes of the arguments. An example:

If all the inputs are of long type, the following would work:

    AT_DISPATCH_INTEGRAL_TYPES(at::ScalarType::Long,"Some long error message", [&] {
      some_cuda_kernel<scalar_t><<<blocks, threads>>>(
        some_long_tensor.packed_accessor<scalar_t,2,torch::RestrictPtrTraits,size_t>(),
       .... 
      );
      }
    );

where I have used the packed_accessor.

If I only have float tensors, the following works:

  AT_DISPATCH_FLOATING_TYPES(at::ScalarType::Float,"Some floating error message", [&] {
    some_cuda_kernel<scalar_t><<<blocks, threads>>>(
      some_float_tensor.packed_accessor<scalar_t,2,torch::RestrictPtrTraits,size_t>(),
     .... 
    );
    }
  );

How do I dispatch a kernel that takes both some_long_tensor and some_float_tensor?

Attempts with AT_DISPATCH_ALL_TYPES

AT_DISPATCH_ALL_TYPES(at::ScalarType,"Hello !",[&] {
    some_cuda_kernel<scalar_t><<<blocks, threads>>>(
      some_float_tensor.packed_accessor<scalar_t,2,torch::RestrictPtrTraits,size_t>(),
      some_long_tensor.packed_accessor<scalar_t,2,torch::RestrictPtrTraits,size_t>(),
     .... 
    );
})

or AT_DISPATCH_INTEGRAL_TYPES_AND

  AT_DISPATCH_INTEGRAL_TYPES_AND(at::ScalarType::Float,
                                 at::ScalarType::Long,
                                 "some combined error message", [&] {
    some_cuda_kernel<scalar_t><<<blocks, threads>>>(...)
  }
);

all failed.

For any raw CUDA kernel, this is not a problem. So if there is any other way to use a raw kernel in this situation, I would appreciate a hint.

You want the first param of packed_accessor (and the PackedAccessor declaration) to be the C++ type you need:

some_float_tensor.packed_accessor<float, 2, ...>(),
some_long_tensor.packed_accessor<int64_t, 2, ...>(),

The AT_DISPATCH macro is a switch over an at::ScalarType (roughly the Python dtype) of a tensor that might have any of several dtypes, and it provides the matching C++ type as scalar_t, to be used for templating the accessor and the kernel call.
If you have fixed dtypes for your arguments, you would not need AT_DISPATCH at all.
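
Concretely, for your snippets that could look roughly like this (keeping your names; the kernel signature below is my assumption of what it might be):

    // Kernel templated only on the floating dtype; the long tensor's accessor
    // uses a fixed int64_t.
    template <typename scalar_t>
    __global__ void some_cuda_kernel(
        torch::PackedTensorAccessor<scalar_t, 2, torch::RestrictPtrTraits, size_t> floats,
        torch::PackedTensorAccessor<int64_t, 2, torch::RestrictPtrTraits, size_t> longs) {
      // ...
    }

    AT_DISPATCH_FLOATING_TYPES(some_float_tensor.scalar_type(), "some_cuda_kernel", [&] {
      some_cuda_kernel<scalar_t><<<blocks, threads>>>(
          some_float_tensor.packed_accessor<scalar_t, 2, torch::RestrictPtrTraits, size_t>(),
          some_long_tensor.packed_accessor<int64_t, 2, torch::RestrictPtrTraits, size_t>());
    });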
