Cuda version conundrum

Ed_Moman · August 4, 2023, 1:37pm

Hello,

Transformers relies on Pytorch, Tensorflow or Flax. I typically use the first.

In any case, the latest versions of Pytorch and Tensorflow are, at the time of this writing, compatible with Cuda 11.8.

Lucky me, for Cuda 11.8 is supposed to be the first version to support the RTX 4090 cards.

Well, not fully, apparently:

MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 8.9 is undefined.  Default to use Hopper
GPU Device 0: "Hopper" with compute capability 8.9

I believe the 4090 to be an Ada Lovelace, not a Hopper.

Will the fact that the card is not correctly identified by Cuda have any effect on resource utilisation and/or performance?

Is there anything we could do about that?

Does anyone know if Torch works with a more recent Cuda? Or can the MapSMtoCores and MapSMtoArchName variables be somehow hard-coded? Or is this completely irrelevant?

Best,

Ed

Ed_Moman · August 4, 2023, 1:38pm

For Pytorch, the nightly version is compatible with Cuda 12.1, which fully supports the card and simplifies things considerably.

ptrblck · August 4, 2023, 1:41pm

That’s not true, since sm_89 is binary compatible with sm_86 and sm_80 and will thus work with any PyTorch binary using CUDA >= 11.1.

I don’t know where the error message is raised from, but your 4090 is also not a Hopper GPU.

Again, I doubt that CUDA itself fails to detect the card. The message also points to the wrong architecture name, so I guess it comes from a 3rd party library.
Run a pure PyTorch workload and check that the device is properly recognized and working. A simple smoke test would be python -c "import torch; print(torch.cuda.get_device_properties(0)); print(torch.randn(1).cuda())".

PyTorch supports all recent CUDA versions and the nightly binaries ship with 11.8 and 12.1.
Both releases support GPU architectures up to sm_90 (Hopper).
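
To make the binary-compatibility point concrete, here is a minimal sketch of the rule as I understand it: a binary built for sm_XY runs on a device with the same major version X and a minor version of at least Y. The helper name compatible_archs is hypothetical (it is not a PyTorch API), and the sketch deliberately ignores PTX entries such as compute_XX, which can be JIT-compiled for newer devices. The torch calls used in the demo (torch.version.cuda, torch.cuda.get_device_capability, torch.cuda.get_arch_list) are only reached if PyTorch and a GPU are available.

```python
import re


def compatible_archs(device_cc, arch_list):
    """Return the sm_XY entries whose binaries can run on a device with
    compute capability device_cc = (major, minor).

    Assumed rule: same major version, and the binary's minor version must
    not exceed the device's. Entries that are not plain sm_XY (for example
    compute_XX PTX entries) are skipped.
    """
    dev_major, dev_minor = device_cc
    usable = []
    for arch in arch_list:
        match = re.fullmatch(r"sm_(\d+)(\d)", arch)
        if match is None:
            continue  # not a plain binary architecture entry
        major, minor = int(match.group(1)), int(match.group(2))
        if major == dev_major and minor <= dev_minor:
            usable.append(arch)
    return usable


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        cc = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("CUDA used by this PyTorch build:", torch.version.cuda)
        print("Device capability:", cc)
        print("Architectures in this build:", archs)
        print("Binaries usable on this device:", compatible_archs(cc, archs))
    else:
        # No GPU or no PyTorch: show the rule on a hand-written example list.
        print(compatible_archs((8, 9), ["sm_80", "sm_86", "sm_90"]))
```

For an RTX 4090 (capability 8.9), any build whose architecture list contains sm_80 or sm_86 should therefore yield a non-empty result, even if sm_89 itself is absent.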

Ed_Moman · August 4, 2023, 1:46pm

Thanks.

I have installed Cuda 12.1 and Pytorch nightly. Everything is fine now.

The messages came from the Cuda Samples (11.8). With Cuda Samples 12.1 the message is gone and the card is correctly recognised as Ada.

The messages are now:

GPU Device 0: "Ada" with compute capability 8.9
GPU Device 1: "Ada" with compute capability 8.9
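
For anyone wondering how a card can end up labelled "Hopper", the messages in the first post are consistent with a name lookup that falls back to the newest entry in its table when the compute capability is unknown. The sketch below illustrates that pattern only. The function arch_name and both tables are hypothetical and are not copied from the Cuda Samples sources; I am assuming the fallback picks the last table entry.

```python
def arch_name(major, minor, table):
    """Look up an architecture name for compute capability major.minor.

    If the capability is missing from the table, warn and fall back to the
    last (newest) entry, which is the assumed behaviour being illustrated.
    """
    for sm, name in table:
        if sm == (major, minor):
            return name
    fallback = table[-1][1]
    print(f"MapSMtoArchName for SM {major}.{minor} is undefined.  "
          f"Default to use {fallback}")
    return fallback


# Illustrative tables only: an older one without SM 8.9 and a newer one with it.
OLD_TABLE = [((8, 0), "Ampere"), ((8, 6), "Ampere"), ((9, 0), "Hopper")]
NEW_TABLE = [((8, 0), "Ampere"), ((8, 6), "Ampere"), ((8, 9), "Ada"), ((9, 0), "Hopper")]

if __name__ == "__main__":
    print(arch_name(8, 9, OLD_TABLE))  # warns, then falls back to the last entry
    print(arch_name(8, 9, NEW_TABLE))  # found directly
```

Under that assumption the label is purely cosmetic: only the printed name changes, not the compute capability the device reports.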