torch.compile() segfaults on CUDA 11.6

cakeislife100 · February 20, 2023, 7:16pm

When I run a model wrapped in torch.compile() with PyTorch 2.0 and CUDA 11.6, my code segfaults, even with just some sample code. When I remove torch.compile(), the code executes just fine. Any insight into what might be going on would be greatly appreciated.

Here is my CUDA environment:

NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.6

The exact python packages:

pytorch-triton           2.0.0+0d7e753227
torch                    2.0.0.dev20230202+cu116
torchaudio               2.0.0.dev20230201+cu116
torchvision              0.15.0.dev20230201+cu116
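
For anyone comparing environments, a list like the one above can also be produced from Python. This is a minimal sketch using only the standard library; the `installed_versions` helper and the `PACKAGES` list are illustrative names, not part of PyTorch.

```python
from importlib import metadata

# Distribution names taken from the list above; adjust as needed.
PACKAGES = ["pytorch-triton", "torch", "torchaudio", "torchvision"]


def installed_versions(names):
    """Return {name: version string, or None if the package is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name:<24} {version or 'not installed'}")
```

If torch imports cleanly, `torch.__version__` and `torch.version.cuda` additionally report the build and the CUDA version the wheel was compiled against.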

And the code that I’m trying to execute:

import torch
import torchvision.models as models

if __name__ == "__main__":
    model = models.resnet18().cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    # compiled_model = model # Works fine when not actually compiled
    compiled_model = torch.compile(model)

    x = torch.randn(16, 3, 224, 224).cuda()
    optimizer.zero_grad()
    out = compiled_model(x)
    out.sum().backward()
    optimizer.step()
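
A segfault kills the interpreter before any Python exception can surface, so one way to get more information is to run the script in a child process with the standard-library faulthandler enabled. This is a rough sketch under assumptions: `repro.py` is a hypothetical file holding the snippet above, `run_with_faulthandler` is an illustrative helper, and the signal check applies to POSIX systems only.

```python
import signal
import subprocess
import sys


def run_with_faulthandler(script_args):
    """Run `python -X faulthandler <script_args>` and report how it exited.

    Returns (segfaulted, stderr_text).
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", *script_args],
        capture_output=True,
        text=True,
    )
    # On POSIX, a negative return code means the child was killed by that signal.
    segfaulted = proc.returncode == -signal.SIGSEGV
    return segfaulted, proc.stderr


if __name__ == "__main__":
    # "repro.py" is a hypothetical filename for the sample code above.
    crashed, trace = run_with_faulthandler(["repro.py"])
    print("segfault" if crashed else "no segfault detected")
    print(trace)
```

With faulthandler enabled, the Python-level traceback at the moment of the crash is written to stderr, which can help show whether the fault happens during compilation or during the forward/backward pass.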

ptrblck · February 20, 2023, 10:09pm

The pytorch-triton module was recently updated. Could you update PyTorch to the latest nightly and check whether you still see the segfault?

cakeislife100 · February 21, 2023, 1:16am

@ptrblck As far as I can see from the cu116 index, the latest version of torch for cu116 is the one from 2023-02-02 which is what I already have installed: https://download.pytorch.org/whl/nightly/torch/. (And that’s the one that’s installed when I do pip3 install numpy --pre torch torchvision torchaudio --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu116)

Is there a newer one that you know of?

The strange part is, when I tried this about a month ago, this code worked (no segfault) with the same CUDA + NVIDIA driver setup. But I’m not sure which versions of the different Torch libraries I had then…

ptrblck · February 21, 2023, 3:38am

You are right, the binaries using CUDA 11.6 were deprecated in favor of 11.7 and 11.8.
Could you install one of these newer nightly releases and rerun your code, please?

cakeislife100 · February 21, 2023, 7:29am

@ptrblck Ah okay. Is PyTorch 2.0 no longer meant to be compatible with CUDA 11.6?

Also, when I tried it with 11.7 and 11.8, I got this “RuntimeError: Cannot find ptxas” error. I’m guessing I should just hold tight and wait for a resolution to that issue?

ptrblck · February 21, 2023, 9:42am

Yes, the binaries with CUDA 11.6 will be deprecated soon and the PyTorch 2.0 binary release will support CUDA 11.7 and 11.8 as described here.
Note that you will still be able to build PyTorch from source with other CUDA versions and the “deprecation” only means the pip wheels and conda binaries will ship with CUDA 11.7 and 11.8 in the next release.

The ptxas issue seems to be a regression and is tracked here and here.
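
In the meantime, a quick way to see whether a ptxas binary is visible at all from a given environment is sketched below. It only checks `PATH` and a `CUDA_HOME/bin` directory, which is an assumption about where a CUDA toolkit might live; it does not reproduce how Triton itself searches for ptxas, and `find_ptxas` is an illustrative helper.

```python
import os
import shutil


def find_ptxas(env=None):
    """Return the path to a ptxas executable, or None if none is found.

    Looks on PATH first, then under $CUDA_HOME/bin (assumed toolkit layout).
    """
    env = os.environ if env is None else env

    # 1. Anything named ptxas on the search path.
    found = shutil.which("ptxas", path=env.get("PATH", ""))
    if found:
        return found

    # 2. A CUDA toolkit pointed to by CUDA_HOME, if that variable is set.
    cuda_home = env.get("CUDA_HOME")
    if cuda_home:
        return shutil.which("ptxas", path=os.path.join(cuda_home, "bin"))

    return None


if __name__ == "__main__":
    location = find_ptxas()
    print(location or "ptxas not found on PATH or under CUDA_HOME/bin")
```

If this prints a path but the error persists, the lookup inside the installed pytorch-triton package is the more likely culprit, which would be consistent with the regression mentioned above.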

cakeislife100 · March 1, 2023, 1:46am

@ptrblck Just tried the new 0228 versions with CUDA 11.7 and that worked. Thanks for your help!