CUDA 11.8 package version issues and CUDA 11.7 version issue

Brit_W · September 13, 2023, 4:18pm

My custom-built model trains fine with CUDA 11.8 and these package versions:
pytorch-triton==2.1.0+e6216047b8
torch==2.1.0.dev20230804+cu118
torchaudio==2.1.0.dev20230804+cu118
torchvision==0.16.0.dev20230804+cu118

Today I created a new conda environment and followed the instructions to install PyTorch 2.0 with CUDA 11.8. I noticed that the installed versions are now different:
pytorch-triton==2.1.0+6e4932cda8
torch==2.2.0.dev20230913+cu118
torchaudio==2.2.0.dev20230913+cu118
torchvision==0.17.0.dev20230913+cu118
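
To make the difference between the two environments explicit, the two lists above can be compared programmatically. The following is a minimal sketch, not part of the original post: the helper names `parse_pins` and `diff_pins` are hypothetical, and the input is assumed to be `name==version` lines such as those printed by `pip freeze`.

```python
def parse_pins(text):
    """Parse 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "==" not in line:
            continue  # skip blank lines and anything that is not a pin
        name, version = line.split("==", 1)
        pins[name] = version
    return pins


def diff_pins(old, new):
    """Return {name: (old_version, new_version)} for every pin that differs."""
    changed = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changed[name] = (old.get(name), new.get(name))
    return changed


# Versions copied from the two lists in the post.
WORKING = """
pytorch-triton==2.1.0+e6216047b8
torch==2.1.0.dev20230804+cu118
torchaudio==2.1.0.dev20230804+cu118
torchvision==0.16.0.dev20230804+cu118
"""

BROKEN = """
pytorch-triton==2.1.0+6e4932cda8
torch==2.2.0.dev20230913+cu118
torchaudio==2.2.0.dev20230913+cu118
torchvision==0.17.0.dev20230913+cu118
"""

changed = diff_pins(parse_pins(WORKING), parse_pins(BROKEN))
for name, (before, after) in changed.items():
    print(f"{name}: {before} -> {after}")
```

All four packages differ between the two environments, including pytorch-triton, whose version number is the same but whose commit hash changed.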

My model no longer trains, and I now get this error, which I did not see before:

"message": "backend='inductor' raised:\nCalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpypr9vj84/main.c', '-O3', '-I/SD5/people/s1208875/miniforge3/envs/torch_test/lib/python3.9/site-packages/triton/common/…/third_party/cuda/include', '-I/SD5/people/s1208875/miniforge3/envs/torch_test/include/python3.9', '-I/tmp/tmpypr9vj84', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpypr9vj84/cuda_utils.cpython-39-x86_64-linux-gnu.so', '-L/lib64', '-L/lib64']' returned non-zero exit status 1.\n\nSet TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information\n\n\nYou can suppress this exception and fall back to eager by setting:\n import torch._dynamo\n torch._dynamo.config.suppress_errors = True\n"

This becomes a problem if other people need to reproduce my conda environment with CUDA 11.8. I read in another post that CUDA 11.7 is more stable, so I will try installing that version instead.

Any thoughts as to what happened with CUDA 11.8?

ptrblck · September 13, 2023, 5:34pm

The error points to triton in the paths, so I would start by looking into it instead of CUDA.
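
The message above reports only that the gcc command exited with status 1; it does not show the compiler's own output. One way to start looking into it is to run the command again with stderr captured. The following is a minimal sketch under stated assumptions: `run_and_report` is a hypothetical helper, and the command shown is a stand-in that fails on purpose. To use it for real, replace `demo_cmd` with the gcc argument list from the error, keeping in mind that the temporary directory named there may no longer exist.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and return (returncode, stdout, stderr) instead of raising."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


# Stand-in command (an assumption for illustration): a Python one-liner that
# writes to stderr and exits with status 1, mimicking a failed compiler run.
demo_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('boom'); sys.exit(1)",
]

code, out, err = run_and_report(demo_cmd)
print("exit status:", code)
print("stderr:", err)
```

Whatever gcc prints on stderr (for example a missing header or a library that cannot be linked) is usually more informative than the exit status alone.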