Which type of CUDA Math API does torch.sin() use?

CUDA has two types of math API to compute transcendental functions such as sin, cos, sqrt, exp, etc.
One is the accurate version, e.g.

__device__​ double sin ( double  x )

which can be found in
The other is the fast approximate version, e.g.

__device__​ float __sinf ( float  x )

which can be found in

MY question is, if I call torch.sin(), which type of CUDA API does it use?
If it uses the accurate version, how can I use the fast approximate version?

The native implementation should use std::sin as seen here.
To use the fast approximate implementation you could use a custom CUDA extension.

1 Like

Thanks for your fast reply! I solved the problem following https://pytorch.org/tutorials/advanced/cpp_extension.html.

BWT I found my implementation of the fast version is slower than I directly call torch.xxx. I know this is maybe a CUDA problem, but it’ll be very appreciated if you could help me.

for example, when I implement the fast version of exp().
In the cuda file, my implementation is

template <typename scalar_t>
__global__ void fastexp_cuda_kernel(
    const torch::PackedTensorAccessor32<scalar_t, 1, torch::RestrictPtrTraits> input,
    torch::PackedTensorAccessor32<scalar_t, 1, torch::RestrictPtrTraits> output) {
  const int index = blockIdx.x * blockDim.x + threadIdx.x;
  output[index] = expf(input[index]);

torch::Tensor fastexp_cuda(torch::Tensor input) {
  const auto size = input.size(0);
  auto output = torch::empty_like(input);
  const int threads = 1024;
  const dim3 blocks((size + threads - 1) / threads);
  AT_DISPATCH_FLOATING_TYPES(input.type(), "fastexp_cuda", ([&] {
    fastexp_cuda_kernel<scalar_t><<<blocks, threads>>>(
        input.packed_accessor32<scalar_t, 1, torch::RestrictPtrTraits>(),
        output.packed_accessor32<scalar_t, 1, torch::RestrictPtrTraits>());
  return output;

I use the JIT compiling method, I do find it can be faster when I use --use_fast_math flag.
BUT anyway it is slower than torch.exp().
does torch.exp() use any optimization? Or does my implementation have any redundant?