Segmentation fault (core dumped) with torch.compile

Describe the Bug
When I run the code below, it crashes with “Segmentation fault (core dumped)”. Does anyone know how to resolve it?

import torch

batch_n = 100        # batch size
input_data = 10000   # input feature dimension
hidden_layer = 100   # hidden layer width
output_data = 10     # output feature dimension

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lr1 = torch.nn.Linear(input_data, hidden_layer, bias=False)
        self.relu = torch.nn.ReLU()
        self.lr2 = torch.nn.Linear(hidden_layer, output_data, bias=False)
    def forward(self, x):
        x = self.lr1(x)
        x = self.relu(x)
        x = self.lr2(x)
        return x

device = torch.device("cuda:0")
input = torch.randn(batch_n, input_data).to(device)
input.requires_grad = True
label = torch.randn(batch_n, output_data).to(device)

model = MyModel().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

compiled_model = torch.compile(model)  # default backend (TorchInductor)
compiled_model.train()
optimizer.zero_grad()
out = compiled_model(input)
loss = loss_fn(out, label)
loss.backward()
optimizer.step()

Environment

Python version: 3.8.13
PyTorch version: 1.14.0
CUDA version: 11.7
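
For a fuller picture, PyTorch ships an environment collector that prints the report its bug template asks for; it can also be invoked from Python:

# Equivalent to running `python -m torch.utils.collect_env` in a shell;
# prints OS, Python, PyTorch, CUDA, and driver details in one report.
from torch.utils.collect_env import main

main()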

Looks like you’re getting the error even without torch.compile(). I’ve seen that error show up when something is off with my CUDA installation. Worth trying out a fresh environment.
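
A quick way to confirm that is to run the same training step eagerly, reusing the names from the script above (this is just the sanity check, not a fix):

# Same training step as in the report, but through the eager (uncompiled)
# model. If this also segfaults, torch.compile is not the culprit and the
# CUDA installation is the first thing to check.
optimizer.zero_grad()
out = model(input)          # note: model, not compiled_model
loss = loss_fn(out, label)
loss.backward()
optimizer.step()
print("eager step completed without crashing")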

I have the same issue when compiling the model; without compiling, there is no segmentation fault. I am running on an RTX 2080 Ti.

NVIDIA-SMI 510.108.03 Driver Version: 510.108.03 CUDA Version: 11.6

I also had this issue, which seemed to occur on both CPU and GPU.

However, the issue went away on CPU when I set CUDA_VISIBLE_DEVICES="" first. With that, CPU inference through torch.compile ran successfully.
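
For reference, the same effect can be had from inside Python, as long as the variable is set before torch is imported. A minimal sketch with a small stand-in model:

import os

# Hide all CUDA devices before importing torch, equivalent to running
# the script with CUDA_VISIBLE_DEVICES="" in the shell.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch

model = torch.nn.Linear(8, 2)       # small stand-in model, stays on CPU
compiled = torch.compile(model)
print(compiled(torch.randn(4, 8)))  # CPU inference through the compiled model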

I assume that something is up with my CUDA config. My PyTorch version is 2.0.1+cu117, but my system CUDA version is 12.1. In principle that should not be an issue, since the wheel bundles its own CUDA 11.7 runtime and only needs a driver new enough to run it. Will try a fresh Docker image.
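
Before switching images, a quick check of what the installed wheel actually sees (the 12.1 reported by nvidia-smi is the driver’s maximum supported runtime, not necessarily what PyTorch uses):

import torch

print(torch.__version__)                # e.g. 2.0.1+cu117
print(torch.version.cuda)               # CUDA runtime the wheel was built against, e.g. 11.7
print(torch.cuda.is_available())        # False would point at a broken driver/runtime setup
print(torch.backends.cudnn.version())   # cuDNN bundled with the wheel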