Are you building PyTorch from source using your locally installed 11.1 CUDA toolkit?
If not, then note that the PyTorch binaries will ship with their own CUDA runtime (as well as cuDNN, cuBLAS, NCCL etc.) which you can specify in the install instructions.
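To see which CUDA runtime and cuDNN version the installed binaries actually ship with (as opposed to the locally installed toolkit), a quick check could look like this:

```python
import torch

# These report the versions bundled with the pip/conda binaries,
# not whatever CUDA toolkit is installed locally on the system.
print(torch.version.cuda)              # CUDA runtime the binary was built with, e.g. '11.7'
print(torch.backends.cudnn.version())  # bundled cuDNN version as an int, e.g. 8500
print(torch.cuda.is_available())       # whether a usable GPU/driver was found
```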
Since you are not using torch.compile, do you see the speedup in plain eager mode, or are you torch.jit.scripting the model?
I installed PT 2.0 with conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch-nightly -c nvidia, and print(torch.version.cuda) gives 11.7.
However, when I use torch.compile(model), it fails with torch._dynamo.exc.BackendCompilerFailed: compile_fn raised RuntimeError: Triton requires CUDA 11.4+.
torch.compile with the pytorch-triton backend uses your locally installed CUDA toolkit (in particular the locally installed ptxas) to compile the kernels during its code-generation step and requires CUDA >= 11.4. In future releases the needed ptxas should be packaged into the binaries so that your local CUDA toolkit will no longer be needed.
Until then you would need to update your 11.1 CUDA toolkit to 11.4 or newer.
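To check which toolkit version the local ptxas comes from, a small sketch (assuming ptxas is on your PATH; the exact release string in the output is an example) could be:

```python
import shutil
import subprocess

# torch.compile's Triton backend invokes the locally installed ptxas,
# so its version is what needs to be CUDA >= 11.4.
ptxas = shutil.which("ptxas")
if ptxas is None:
    print("ptxas not found on PATH - install or update the CUDA toolkit")
else:
    result = subprocess.run([ptxas, "--version"], capture_output=True, text=True)
    print(result.stdout)  # look for a line like 'Cuda compilation tools, release 11.7'
```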
Since you are seeing a general speedup without scripting the model or using torch.compile, my guess is that some CUDA math libraries shipped in the PyTorch binaries with CUDA 11.7, such as cuBLAS or cuDNN, perform better for your GPU (assuming your previous release used an older CUDA runtime).
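To isolate that effect, a rough eager-mode benchmark like the sketch below (run under both installs) would show a cuBLAS-level difference without torch.compile or torch.jit.script being involved; it falls back to CPU if no GPU is available, and the shapes are arbitrary:

```python
import time
import torch

def bench_matmul(n=4096, iters=10, device=None):
    # Plain eager-mode matmul timing; on CUDA this exercises cuBLAS directly.
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # make sure setup work has finished
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the queued kernels to complete
    return (time.perf_counter() - t0) / iters

print(f"avg matmul time: {bench_matmul(n=1024, iters=3):.4f}s")
```

Comparing the averages between the old and new binaries on the same GPU should make the library-level speedup visible.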
Yes, pytorch-triton is already installed as a dependency in the nightly releases.
We also recently discussed how to package ptxas, so the dependency on your locally installed CUDA toolkit should also disappear.