How to build PyTorch with CUDA for best performance?

There are so many environment variables that need to be set, such as USE_CUDA, USE_CUDNN…
For best performance, which variables should be set when building with CUDA?
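Not an authoritative list, but one quick way to see which of these options a given binary was actually built with is to print torch's compile-time configuration; a minimal sketch, assuming a working torch install:

```python
import torch

# Print the compile-time configuration of this torch build
# (shows the CUDA/cuDNN versions, BLAS backend, build flags, etc.).
print(torch.__config__.show())

# Runtime components this build was compiled against.
print("CUDA runtime:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
```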

The pre-built binaries already ship with all the needed libraries. You could additionally enable TF32 for matmuls as described here, and check the performance guide.
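For reference, a minimal sketch of the TF32 switches mentioned above (the exact flags may vary slightly across PyTorch versions):

```python
import torch

# Allow TF32 on Ampere and newer GPUs: faster matmuls and convolutions
# at slightly reduced precision compared to full FP32.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```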


I want to forward the CUDA API, so I have to compile PyTorch myself so that it does not use the static cudart.
But I found that the performance of my CUDA forwarding was poor, so I had to find the difference between the official builds and mine.
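One rough way to compare the two builds is to check whether cudart ended up linked statically or dynamically. This is only a sketch and assumes Linux, since it reads /proc/self/maps:

```python
import torch  # importing torch maps its shared libraries into this process

# On Linux, libcudart shows up in the process memory map only when the
# CUDA runtime is loaded as a shared library (i.e. not statically linked).
with open("/proc/self/maps") as f:
    loaded = f.read()

print("torch:", torch.__version__, "built against CUDA:", torch.version.cuda)
print("dynamic libcudart mapped:", "libcudart" in loaded)
```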