ATen CUDA kernel dispatch for type

I am using ATen and following this example from here.
I have a kernel that needs one array of floats (for the input) and one array of ints (for the labels). Should I still use AT_DISPATCH_FLOATING_TYPES?

  1. Does this line mean only floats will be marshalled to the CUDA kernel, or is it just specifying scalar_t?

    AT_DISPATCH_FLOATING_TYPES(X.type(), "lltm_forward_cuda", ...

  2. Please correct me if I am wrong: I guess the integral types (int, bool, …) need no special treatment?

P.S. It looks like both lines 123 and 157 pass "lltm_forward_cuda" as the second argument; that does not seem right if it is meant for debugging.


Re 1: AT_DISPATCH_FLOATING_TYPES will handle float and double automatically.
You dispatch on the type of X; if you have an int-tensor argument Y, it stays what it is. You cannot pass an int-tensor as X here, though.
If you want half, there is also an _AND_HALF version.
Re 2: These don't get any special treatment; you would have to handle them on your own.
For the P.S.: using backward in the second string (at line 157) is probably better.
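
For illustration, a minimal sketch of that (forward_kernel, input, labels, blocks, threads and n are made-up names, not the tutorial's code): dispatch only on the floating-point tensor and pass the integer tensor with a fixed C++ type.

    // Kernel templated only on the floating dtype; labels are fixed to int64_t.
    template <typename scalar_t>
    __global__ void forward_kernel(const scalar_t* __restrict__ input,
                                   const int64_t* __restrict__ labels,
                                   int64_t n) {
      const int64_t i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
        // ... use input[i] (float/double) and labels[i] (int64_t) ...
      }
    }

    // Host side: the switch generated by AT_DISPATCH_FLOATING_TYPES only looks
    // at input's dtype; the labels tensor needs no dispatch of its own.
    AT_DISPATCH_FLOATING_TYPES(input.type(), "forward_cuda", ([&] {
      forward_kernel<scalar_t><<<blocks, threads>>>(
          input.data_ptr<scalar_t>(),
          labels.data_ptr<int64_t>(),
          n);
    }));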

Best regards

Thomas


Hello! If anybody is looking for it now, there are multiple AT_DISPATCH macros here.


@tom, @dashesy Hey, what does _AND_HALF do? And is there still no way to mix tensor types (long and float) when you pass them to the CUDA kernel?

@Jakub_Mitura Did you figure out which one is the right one? Dispatchers of the form AT_DISPATCH_CASE_ALL_TYPES_AND_... seem to be candidates.

No, this is not for mixing types. For quite a while the default floating-point types were FP32 and FP64, so _AND_HALF adds FP16 (and maybe BF16?) support.
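
For example (just a sketch, with a made-up op and kernel name), the half-enabled variant is used exactly like the plain one:

    AT_DISPATCH_FLOATING_TYPES_AND_HALF(X.scalar_type(), "my_op_cuda", ([&] {
      // scalar_t can now also be at::Half, in addition to float and double
      my_op_kernel<scalar_t><<<blocks, threads>>>(X.data_ptr<scalar_t>(), n);
    }));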
Given that each dtype produces a separate kernel (through C++ templating), there is no built-in mechanism for mixing two dtypes in one dispatch (I did build that manually for CTC, if I recall correctly, because cuDNN wants 32-bit ints while PyTorch defaults to 64-bit ints). These macros really are just a switch statement, so you could easily write your own, but beware of the explosion of the number of kernels.
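
A sketch of what rolling your own could look like (my_kernel, data, labels, blocks, threads and n are placeholders; here the data is fixed to float and only the label dtype varies):

    switch (labels.scalar_type()) {
      case at::ScalarType::Int:
        my_kernel<float, int32_t><<<blocks, threads>>>(
            data.data_ptr<float>(), labels.data_ptr<int32_t>(), n);
        break;
      case at::ScalarType::Long:
        my_kernel<float, int64_t><<<blocks, threads>>>(
            data.data_ptr<float>(), labels.data_ptr<int64_t>(), n);
        break;
      default:
        TORCH_CHECK(false, "my_kernel: unsupported label dtype");
    }

If the data dtype should vary too, you would nest this inside an AT_DISPATCH_FLOATING_TYPES over data, and that product of cases is exactly where the number of compiled kernels starts to explode.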

Best regards

Thomas


@tom First of all, thank you for your answer. To clarify my statement: I do not want to mix different kernels, but just mix the dtypes of the arguments. An example:

If all the inputs are of long type, the following would work:

    AT_DISPATCH_INTEGRAL_TYPES(at::ScalarType::Long,"Some long error message", [&] {
      some_cuda_kernel<scalar_t><<<blocks, threads>>>(
        some_long_tensor.packed_accessor<scalar_t,2,torch::RestrictPtrTraits,size_t>(),
       .... 
      );
      }
    );

where I have used the packed_accessor.

If I only have float tensors, the following works:

  AT_DISPATCH_FLOATING_TYPES(at::ScalarType::Float,"Some floating error message", [&] {
    some_cuda_kernel<scalar_t><<<blocks, threads>>>(
      some_float_tensor.packed_accessor<scalar_t,2,torch::RestrictPtrTraits,size_t>(),
     .... 
    );
    }
  );

How do I dispatch a kernel that takes both some_long_tensor and some_float_tensor?

Attempts with AT_DISPATCH_ALL_TYPES

AT_DISPATCH_ALL_TYPES(at::ScalarType,"Hello !",[&] {
    some_cuda_kernel<scalar_t><<<blocks, threads>>>(
      some_float_tensor.packed_accessor<scalar_t,2,torch::RestrictPtrTraits,size_t>(),
      some_long_tensor.packed_accessor<scalar_t,2,torch::RestrictPtrTraits,size_t>(),
     .... 
    );
})

or AT_DISPATCH_INTEGRAL_TYPES_AND

  AT_DISPATCH_INTEGRAL_TYPES_AND(at::ScalarType::Float,
                                 at::ScalarType::Long,
                                 "some combined error message", [&] {
    some_cuda_kernel<scalar_t><<<blocks, threads>>>(...)
  }
);

all failed.

For any raw CUDA kernel, this is not a problem. So if there is any other way to use a raw kernel in this situation, I would appreciate a hint.

You want the first param of packed_accessor (and the PackedAccessor declaration) to be the C++ type you need:

some_float_tensor.packed_accessor<float, 2, ...>(),
some_long_tensor.packed_accessor<int64_t, 2, ...>(),

The AT_DISPATCH macro is a switch over an at::ScalarType (roughly the Python dtype) of a tensor that might have any of several dtypes, and it provides the matching C++ type as scalar_t, to be used for templating the accessor and the kernel call.
If you have fixed dtypes for your arguments, you would not need AT_DISPATCH at all.
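
Concretely, for your snippets that could look roughly like this (keeping your names; the kernel signature below is my assumption of what it might be):

    // Kernel templated only on the floating dtype; the long tensor's accessor
    // uses a fixed int64_t.
    template <typename scalar_t>
    __global__ void some_cuda_kernel(
        torch::PackedTensorAccessor<scalar_t, 2, torch::RestrictPtrTraits, size_t> floats,
        torch::PackedTensorAccessor<int64_t, 2, torch::RestrictPtrTraits, size_t> longs) {
      // ...
    }

    AT_DISPATCH_FLOATING_TYPES(some_float_tensor.scalar_type(), "some_cuda_kernel", [&] {
      some_cuda_kernel<scalar_t><<<blocks, threads>>>(
          some_float_tensor.packed_accessor<scalar_t, 2, torch::RestrictPtrTraits, size_t>(),
          some_long_tensor.packed_accessor<int64_t, 2, torch::RestrictPtrTraits, size_t>());
    });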
