'cuda.h' missing during torch.compile in new environment

I am facing the following issue:

PackagesNotFoundError: The following packages are not available from current channels:

Unfortunately, torchtriton is only available from the pytorch channel, where it is outdated and causes an exception when using torch.compile. conda-forge only has the triton package, which doesn’t seem to work.

Some additional information:

This happens with conda-forge. Here is what I am doing:

conda create -n pytorch312 python=3.12
conda activate pytorch312
conda config --append channels conda-forge
conda install pytorch pytorch-gpu torchvision

Okay, so I have PyTorch, CUDA works, and everything is fine.

But when I run this:

import torch

@torch.compile
def square(x):
    return x ** 2

x = torch.randn(1)
square(x)

I get this error:

Fatal error: cuda.h: No such file or directory

BackendCompilerFailed: backend='inductor' raised:
CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpddetji5a/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpddetji5a/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/var/home/jj/distrobox/fedora/miniconda3/envs/pytorch312/lib/python3.12/site-packages/triton/backends/nvidia/lib', '-L/lib64', '-L/lib', '-I/var/home/jj/distrobox/fedora/miniconda3/envs/pytorch312/lib/python3.12/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpddetji5a', '-I/var/home/jj/distrobox/fedora/miniconda3/envs/pytorch312/include/python3.12', '-I/var/home/jj/distrobox/fedora/miniconda3/envs/pytorch312/targets/x86_64-linux/include']' returned non-zero exit status 1.

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
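
For anyone trying to reproduce this, a minimal sketch of how to check which of the -I directories from the gcc command above actually contain cuda.h. The directory layout is an assumption copied from that single command line (it may differ on other setups), and find_header and candidate_include_dirs are hypothetical helper names, not part of torch or triton.

```python
import os
import sys


def find_header(header, include_dirs):
    """Return the directories from include_dirs that contain `header`."""
    return [d for d in include_dirs if os.path.isfile(os.path.join(d, header))]


def candidate_include_dirs(prefix=sys.prefix):
    """Include directories modeled on the -I flags in the gcc command above.

    Assumption: the environment follows the same layout as in that command.
    The temporary build directory (/tmp/...) is left out on purpose.
    """
    pyver = f"python{sys.version_info.major}.{sys.version_info.minor}"
    return [
        os.path.join(prefix, "lib", pyver, "site-packages",
                     "triton", "backends", "nvidia", "include"),
        os.path.join(prefix, "include", pyver),
        os.path.join(prefix, "targets", "x86_64-linux", "include"),
    ]


if __name__ == "__main__":
    dirs = candidate_include_dirs()
    hits = find_header("cuda.h", dirs)
    for d in dirs:
        # Report each directory so it is clear where the header is missing.
        print(("found   " if d in hits else "missing ") + d)
```

Run it with the Python interpreter of the activated environment. If every line says "missing", gcc has no place to find cuda.h in the paths it was given, which would match the error above.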

I have tried setting C_INCLUDE_PATH to the include directory next to the path printed by which nvcc. I also tried setting CUDA_HOME to the conda environment folder. Neither solution worked. What did work is installing torchtriton from the pytorch conda channel; however, that leads to a different issue when using torch.compile:

torch._dynamo.exc.BackendCompilerFailed: backend='compile_fn' raised:
ImportError: /tmp/torchinductor_ezyang/triton/0/faeb8676474438b8709860bca883a025/cuda_utils.so: undefined symbol: cuModuleGetFunction

To fix this one, I tried deleting the ~/.triton folder as well as the /tmp/torchinductor_jj folder, but neither fixed the issue. I believe that, since the pytorch channel is deprecated, the torchtriton version there is 3.1.0 instead of 3.2.0, and that causes the error.
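
To confirm the version mismatch suspected above, a small sketch that prints which versions are actually installed in the active environment. It only uses importlib.metadata from the standard library. The distribution names in the list are assumptions (I am not sure whether the conda torchtriton package registers itself as "triton" or "torchtriton"), and installed_versions is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed distribution names; adjust if the packages register differently.
    for name, version in installed_versions(["torch", "triton", "torchtriton"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Comparing this output before and after installing torchtriton from the pytorch channel should show whether the triton version really changes between the two setups.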