Amazing: code runs much faster after updating from PyTorch 1.2 to PyTorch 2.0

When running my code on PyTorch 1.2, it takes 305s per epoch; on PyTorch 2.0 it speeds up significantly and takes only 220s.

I do not use torch.compile since the CUDA version in my environment is only 11.1.

It is really a big surprise for me.

I wonder how it could run so fast?

Are you building PyTorch from source using your locally installed 11.1 CUDA toolkit?
If not, then note that the PyTorch binaries will ship with their own CUDA runtime (as well as cuDNN, cuBLAS, NCCL etc.) which you can specify in the install instructions.
Since you are not using torch.compile, do you see the speedup in plain eager mode, or are you scripting the model with torch.jit.script?
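For reference, a quick sanity check of what the installed binaries ship versus what your local toolkit reports could look like this (a minimal sketch):

```python
import torch

# CUDA runtime the binaries were built with
# (independent of the local toolkit reported by `nvcc --version`)
print(torch.version.cuda)

# cuDNN version bundled with the binaries
print(torch.backends.cudnn.version())

# confirm the GPU is usable with this runtime
print(torch.cuda.is_available())
```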

The version of my local CUDA toolkit is 11.1.

I installed PT 2.0 with conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch-nightly -c nvidia, and print(torch.version.cuda) gives 11.7.

However, when I use torch.compile(model), it fails with torch._dynamo.exc.BackendCompilerFailed: compile_fn raised RuntimeError: Triton requires CUDA 11.4+.

I am not using torch.jit.script.

torch.compile with the pytorch-triton backend uses your locally installed CUDA toolkit (in particular the local ptxas) to compile the generated kernels during its code-generation step, and it requires CUDA >= 11.4. In future releases the needed ptxas should be packaged into the binaries so that your local CUDA toolkit will not be needed anymore.
Until then you would need to update your 11.1 CUDA toolkit to 11.4 or newer.
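A minimal sketch of how you could verify the local ptxas version (assuming ptxas is on your PATH):

```python
import shutil
import subprocess

# Triton compiles its generated kernels with the locally installed ptxas,
# so a CUDA toolkit >= 11.4 is needed for torch.compile to work.
ptxas = shutil.which("ptxas")
if ptxas is None:
    print("ptxas not found on PATH - install a CUDA toolkit >= 11.4")
else:
    print(subprocess.run([ptxas, "--version"], capture_output=True, text=True).stdout)
```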

Since you are seeing a general speedup without scripting the model or using torch.compile, I would guess some CUDA math libraries, such as cuBLAS or cuDNN, perform better for your GPU in the PyTorch binaries built with CUDA 11.7 (assuming your previous release used an older CUDA runtime).
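If you want to profile this yourself, a rough eager-mode timing sketch with proper CUDA synchronization could look like this (the model and input are just placeholders for your own workload):

```python
import time
import torch

model = torch.nn.Linear(4096, 4096).cuda()  # placeholder for your model
x = torch.randn(64, 4096, device="cuda")    # placeholder input

with torch.no_grad():
    # warmup so lazy init and cuBLAS/cuDNN heuristics don't skew the timing
    for _ in range(10):
        model(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(100):
        model(x)
    # kernels launch asynchronously, so synchronize before stopping the clock
    torch.cuda.synchronize()

print(f"{(time.perf_counter() - start) / 100 * 1e3:.3f} ms/iter")
```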

Hello, thank you for your prompt response. I have one more question: will the released PT 2.0 include pytorch-triton?

Yes, pytorch-triton is already installed as a dependency in the nightly releases.
We also recently discussed how to package ptxas, so the dependency on your locally installed CUDA toolkit should also disappear.
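A quick way to confirm it in your environment could be (a sketch; the backend listing is what I would expect on a recent nightly):

```python
import torch
import triton  # installed as the pytorch-triton dependency of the nightly binaries

print(triton.__version__)
# the Triton-based "inductor" backend should show up here
print(torch._dynamo.list_backends())
```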

I installed the PyTorch nightly (2.0.0.dev20230213+cu117). It errors with:

```
name, asm, shared_mem = _triton.code_gen.compile_ttir(backend, module, device, num_warps, num_stages, extern_libs, cc)
RuntimeError: Triton requires CUDA 11.4+
```

That’s expected for now, and you would need to install a newer local CUDA toolkit (11.4+) as explained above.