4090 25% slower than 3090 for pytorch transformers?

Is anyone else having severe performance issues on RTX 4090 cards with pytorch and transformers?

Using NVIDIA NGC Docker images (22.11-py3) didn’t help at all. I also tried PyTorch 2.0/nightly. Nothing works. Pure CUDA benchmarks show the 4090 can scale to 450 W on a CUDA workload, and performance is, as advertised, almost 2x the 3090. But on my PyTorch transformer workload using the Hugging Face translation pipeline, the 3090 is consistently 25% faster.
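For anyone trying to reproduce the comparison, a minimal timing harness like the sketch below can help rule out one-off startup costs (CUDA context creation, kernel autotuning) when timing the pipeline on each card. The warmup/iteration counts and the commented-out model name are illustrative assumptions, not the exact setup from my runs:

```python
import time
import statistics

def benchmark(fn, warmup=3, iters=10):
    """Time a zero-argument callable, discarding warmup runs
    (first GPU calls pay one-time kernel selection/caching costs)."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# On the actual machine (assumes transformers is installed; model name
# is illustrative), something like:
#   from transformers import pipeline
#   translate = pipeline("translation_en_to_de", model="t5-small", device=0)
#   print(benchmark(lambda: translate("Hello world")))
```

Comparing the median rather than the first run makes the 3090-vs-4090 numbers far less noisy.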

Thanks a bunch.


PyTorch 2.0 with CUDA 11.8 worked for me:

conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch-nightly -c nvidia

I also had similar issues with PyTorch 1.13, and I tried different CUDA versions (11.6, 11.7); these were slower than the 3090.


Any update on this? Did you try CUDA 11.8? I’m curious because I have a machine with two 4090s and I want to find the best environment.

CUDA 11.8 is recommended because it is the first version that includes 4090 (SM 8.9) support. If you want something more bleeding edge, CUDA 12.1 nightlies are available, e.g., via pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121.
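As a quick sanity check after installing, you can compare the CUDA version your PyTorch build reports against 11.8. The helper below is a sketch (the `cuda_supports_ada` name is mine, not a PyTorch API); the real version string comes from `torch.version.cuda`:

```python
def cuda_supports_ada(cuda_version: str) -> bool:
    """CUDA 11.8 is the first toolkit with sm_89 (Ada / RTX 4090) support."""
    major, minor = (int(x) for x in cuda_version.split(".")[:2])
    return (major, minor) >= (11, 8)

# On the actual machine:
#   import torch
#   print(cuda_supports_ada(torch.version.cuda))
print(cuda_supports_ada("11.7"))  # False
print(cuda_supports_ada("12.1"))  # True
```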


If you don’t mind elaborating, what can be expected with SM 8.9 support?

The RTX 4090 is compute capability 8.9, so CUDA 11.8 having SM 8.9 support means it is the earliest toolkit version that can compile code specifically for it. While SM 8.9 can run kernels compiled for SM 8.6, the performance could be worse, as you may have already experienced.
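To make that concrete, here is a small sketch (the arch list is a hypothetical example and the helper names are mine); PyTorch exposes the real values via `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`:

```python
def sm_name(capability):
    """Format a (major, minor) compute capability as an sm_XY arch string."""
    major, minor = capability
    return f"sm_{major}{minor}"

def compiled_natively(device_capability, binary_archs):
    """True if the binary ships kernels for the device's exact architecture.
    Same-major kernels built for a lower minor version (e.g. sm_86 on an
    sm_89 card) still run, but possibly with reduced performance."""
    return sm_name(device_capability) in binary_archs

# Hypothetical arch list for a pre-11.8 PyTorch build (no sm_89 entry):
pre_118_archs = ["sm_60", "sm_70", "sm_75", "sm_80", "sm_86"]
print(compiled_natively((8, 9), pre_118_archs))  # RTX 4090 -> False
print(compiled_natively((8, 6), pre_118_archs))  # RTX 3090 -> True
```

This is why the 3090 (SM 8.6) runs native kernels on older builds while the 4090 falls back to compatibility mode.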

Small addition: compiling natively for sm_89 will not by itself improve performance, but using a newer CUDA toolkit and cuDNN should restore the expected performance. @eqy’s recommendation is correct: use the PyTorch binaries shipped with CUDA 11.8+.