Does the CUDA runtime version influence performance?

Hi! I’m trying to understand why the same program running in two different environments, one with an RTX A4000 GPU and the other with a V100 GPU, has very different execution speeds. The program runs much faster in the environment with the RTX A4000 (but the V100 is supposed to be faster than the A4000, right?). The CPUs in both environments are similar. There are differences in the CUDA version installed on each host: the V100 environment has 11.3 (I tested with PyTorch builds for CUDA 11.7, 11.6, and 11.4; I did not test with 11.3 because I would have to build from source, and the PyTorch wheel with the CUDA version closest to 11.3 is the one built for 11.4), and the A4000 environment has 11.6 (matching PyTorch’s CUDA version). Can this difference in CUDA versions influence performance? Thanks in advance!

If you are not compiling from source, then there are multiple factors affecting performance: the CUDA toolkit packaged with each version of PyTorch, the libraries (e.g., cuDNN) packaged with each version of PyTorch, and changes to PyTorch itself between releases. In other words, installing different versions of PyTorch, or PyTorch binaries built against different versions of the CUDA toolkit, can certainly affect performance. However, the CUDA version of the surrounding environment (the system’s CUDA) should not affect performance, as it will be overridden by whatever the PyTorch binary was packaged with.
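
For reference, a quick way to see which CUDA and cuDNN builds your PyTorch binary actually uses (this reflects what ships with the wheel, not the system-wide toolkit):

```python
import torch

# CUDA runtime version the PyTorch binary was built against
# (this is what your workloads actually run with, not the system toolkit)
print("torch.version.cuda:", torch.version.cuda)

# cuDNN version bundled with the binary
print("cuDNN:", torch.backends.cudnn.version())

# The PyTorch release itself also changes between builds
print("torch:", torch.__version__)
```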

Additionally, I would not expect the A4000 to be uniformly slower or faster than the V100 across all benchmarks. Consider, for instance, that the A4000 has TF32 tensor core support while the V100 does not.
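
For example, you can check the compute capability of each card; TF32 tensor cores require compute capability 8.0 or higher (the A4000 is Ampere, 8.x, while the V100 is Volta, 7.0), so the two GPUs take different code paths for FP32 math:

```python
import torch

# Ampere cards (e.g., RTX A4000) report 8.x; Volta (V100) reports 7.0.
# Only devices with compute capability 8.0+ have TF32 tensor cores.
major, minor = torch.cuda.get_device_capability()
print(f"{torch.cuda.get_device_name()}: compute capability {major}.{minor}")
```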


Hi, @eqy, thanks for the great and quick response. OK, but do you mean that compiling from source will always give superior performance (in terms of execution speed)? About the tests I did: the system CUDA versions in the two environments were different, but the toolkit versions bundled with PyTorch were the same (11.6), as was the PyTorch version (1.12.0).

Considering what you said about the system’s CUDA version not mattering, should I compile from source with the latest versions of PyTorch and CUDA for the best performance?

On the V100 test, I tweaked my code to use automatic mixed precision (Automatic Mixed Precision package - torch.amp — PyTorch 2.0 documentation), and there was at least a 2x speed gain, even though I’m still using a higher number of CPUs in this V100 test. I will still run a test with the same number of CPUs I used in the A4000 test (in any case, even before using this higher number of CPUs I was getting a much lower execution speed on the V100 than on the A4000). I believe this fits with what you said, that the A4000 has TF32 tensor core support while the V100 does not, which would be why the V100 is not better than the A4000 in all benchmarks: without TF32, it is often not worthwhile to perform an operation in single-precision floating-point math (FP32), so it pays off to perform it in half precision (float16), for example. In this test, however, the results were still worse than on the A4000.
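
For reference, this is roughly the pattern I followed from the torch.amp docs (a toy sketch with a made-up model and random data, not my actual training code):

```python
import torch
import torch.nn as nn

# Toy setup just to illustrate the AMP pattern (not my real model/data)
model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    inputs = torch.randn(64, 128, device="cuda")
    targets = torch.randint(0, 10, (64,), device="cuda")
    optimizer.zero_grad()

    # The forward pass runs in mixed precision where it is safe to do so
    with torch.cuda.amp.autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    # Scale the loss to avoid FP16 gradient underflow, then step and update
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```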

I’m still finishing another test that should confirm my hypothesis that using a different GPU influences the results; this other test performs the same run, but without automatic mixed precision (AMP).

But anyway, can using a different GPU affect the convergence results of a neural network trained with PyTorch? Thanks in advance.

In general, having newer versions of the libraries should provide better performance, but you do not necessarily need to build from source to get current versions; I would consider checking out the NGC containers @ PyTorch | NVIDIA NGC for a prebuilt environment that should have current versions of all libraries.

In most cases, using a different GPU should not affect convergence unless different settings are used, but different GPUs can have different numerics (e.g., TF32 being a default on Ampere vs. FP32 on Volta, as you have observed). Additionally, you may see slightly different inference results when, e.g., training a model on one GPU and deploying it on a different GPU, as different kernels may be used.
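
If you want to compare convergence between the two GPUs with closer numerics, one option is to turn TF32 off on the Ampere card so FP32 math behaves more like it does on the V100 (a small sketch; note that the defaults for these flags have changed between PyTorch releases):

```python
import torch

# Disable TF32 for matmuls and cuDNN convolutions on Ampere so that
# FP32 operations use full FP32 precision, as they do on Volta (V100)
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```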


Hi!

Sorry for my lack of knowledge, but I couldn’t understand: if I just install the latest versions of PyTorch with CUDA via conda or pip, wouldn’t that be the same as compiling PyTorch with CUDA from source, or as using that PyTorch Docker image made by NVIDIA?

I also think that using a different GPU should not significantly affect the convergence results, as long as the same hyperparameters are used, but that is what I am seeing in my tests. In my case, could this be due to the difference in numerics you mentioned, or would that only affect the accuracy of the results? As for deployment, that’s not my case; I’m only doing training.

Thanks again!

It should be basically the same, although you might see a delay in when versions of, e.g., cuDNN are updated in the pip wheels compared to the monthly container releases.

If you are seeing a convergence difference without changing things such as the batch size, it would be relevant if you could share a reproducible example (feel free to start a separate thread for this).

Okay, I’ll consider opening a new thread to ask about this, although I think I’d have to make the entire program code available. But do you think this difference in results could be due solely to the differences in numerical precision mentioned above?

Thanks again!

Hi @eqy,

Apologies for reviving a three-month-old thread.

Let’s assume we are executing everything directly on the host machine, i.e., no containers or the like. How would it be possible for the CUDA version of the surrounding environment to differ from the CUDA version the PyTorch binaries are built against? Are you referring to CUDA’s backward compatibility, i.e., when the binaries are built against a version that is older than the host’s version? Perhaps I’m fundamentally misunderstanding how PyTorch binaries are built, shipped, and run against CUDA. I’d greatly appreciate it if someone could clarify further.

Your locally installed CUDA toolkit won’t be used if you installed the PyTorch binaries, as they ship with their own runtime dependencies.
In fact, you would only need to install a proper NVIDIA driver (no other CUDA component) and could then directly install the PyTorch binaries and execute workloads.
Your local CUDA toolkit will only be used if you are building PyTorch from source or building a custom CUDA extension.
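
As a quick sanity check (a minimal sketch; the output just illustrates the point):

```python
import shutil
import torch

# The pip/conda binaries ship their own CUDA runtime and libraries
print("CUDA runtime bundled with PyTorch:", torch.version.cuda)

# A local toolkit (nvcc) is optional; it is only needed when building
# PyTorch from source or compiling a custom CUDA extension
print("Local nvcc:", shutil.which("nvcc") or "not installed (fine for the binaries)")

# The only hard host requirement is a recent enough NVIDIA driver
if torch.cuda.is_available():
    print("GPU visible via the driver:", torch.cuda.get_device_name(0))
```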


Huh, TIL. Thank you. In hindsight, this makes a lot of sense.