`cuda.h` missing error with `torch.compile`

Tom_Ginsberg · February 2, 2023, 8:53pm

Hello, I am using PyTorch 2.0 and having issues with torch.compile.

I installed the nightly build of torch with CUDA 11.7, following the instructions on the PyTorch website.

# pytorch install command
❯ conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch-nightly -c nvidia

# confirming versions
❯ python -V
Python 3.10.8
❯ python -c 'import torch;print(torch.__version__)'
2.0.0.dev20230105
❯ conda list | grep cuda
cuda                      11.7.1                        0    nvidia
...
pytorch                   2.0.0.dev20230105 py3.10_cuda11.7_cudnn8.5.0_0    pytorch-nightly
pytorch-cuda              11.7                 h67b0de4_2    pytorch-nightly

The first torch.compile example from the docs runs without issues:

In [1]: import torch
   ...:
   ...: def foo(x, y):
   ...:     a = torch.sin(x)
   ...:     b = torch.cos(x)
   ...:     return a + b
   ...: opt_foo1 = torch.compile(foo)
   ...: print(opt_foo1(torch.randn(10, 10), torch.randn(10, 10)))
# works !

However, the next example, which runs a resnet18 model on the GPU, produces many errors of the following form:

/tmp/tmpsov1pory/main.c:2:10: fatal error: cuda.h: No such file or directory
 #include "cuda.h"
          ^~~~~~~~
compilation terminated.

This message is printed about 30 times, each time with a different tmp directory.

Is there an environment variable that needs to be set so that torch can find cuda.h? I tried running find /path/to/my/conda/env -name cuda.h and adding the resulting directories to my LD_LIBRARY_PATH, but I still got the same error.
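A note on the attempt described above: LD_LIBRARY_PATH is read by the dynamic loader when it looks for shared libraries at run time. The C compiler does not consult it when resolving #include lines, so adding a header directory there is not expected to help. The search for the header can also be done from Python. The sketch below is only an illustration: find_header_dirs is a hypothetical helper name, and the demo builds a throwaway directory tree instead of scanning a real conda environment.

```python
import tempfile
from pathlib import Path


def find_header_dirs(root, header="cuda.h"):
    """Return the sorted directories under root that contain the given header."""
    return sorted({str(p.parent) for p in Path(root).rglob(header) if p.is_file()})


# Demo on a temporary tree so the example runs anywhere.
# In practice, root would be the conda environment or the CUDA install prefix.
with tempfile.TemporaryDirectory() as tmp:
    include_dir = Path(tmp) / "envs" / "demo" / "include"
    include_dir.mkdir(parents=True)
    (include_dir / "cuda.h").write_text("/* placeholder header */\n")

    found = find_header_dirs(tmp)
    print(found)
    demo_ok = found == [str(include_dir)]
```

Each directory returned this way is a candidate for a header search path such as C_INCLUDE_PATH, not for LD_LIBRARY_PATH.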

Please let me know if anyone can suggest what to troubleshoot next.

Thank you

ptrblck · February 3, 2023, 5:07am

Thanks for reporting; this issue should not happen. Could you try running CUDA_HOME=/path/to/cuda python script.py args, as described here, please?
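The inline form CUDA_HOME=... python script.py sets the variable only for that one command. The same effect can be sketched from Python by handing a child process a modified copy of the environment. The helper names below are hypothetical, /path/to/cuda is the placeholder from the post, and the child here only echoes the variable as a stand-in for script.py.

```python
import os
import subprocess
import sys


def env_with_cuda_home(cuda_home):
    """Return a copy of the current environment with CUDA_HOME set."""
    env = dict(os.environ)
    env["CUDA_HOME"] = cuda_home
    return env


def run_with_cuda_home(cuda_home, argv):
    """Run argv as a child process that sees CUDA_HOME=cuda_home."""
    return subprocess.run(
        argv,
        env=env_with_cuda_home(cuda_home),
        capture_output=True,
        text=True,
        check=True,
    )


# Stand-in for `python script.py args`: the child just prints the variable.
child = [sys.executable, "-c", "import os; print(os.environ['CUDA_HOME'])"]
result = run_with_cuda_home("/path/to/cuda", child)
print(result.stdout.strip())
```

Because only the copy is modified, the parent process keeps its original environment, which mirrors what the one-line shell form does.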

chinsengi · February 13, 2023, 12:46am

I encountered the same problem and solved it by setting the environment variable $C_INCLUDE_PATH.

The cause in my case is that I am using my institution's HPC cluster, where CUDA is installed in an unusual location. For the GCC compiler to find cuda.h, I had to add that path to $C_INCLUDE_PATH manually.

You can find where CUDA is installed by running which nvcc. Mine is /sw/cuda/11.8.0/bin/nvcc, so the path I am looking for is /sw/cuda/11.8.0/include.

Then run export C_INCLUDE_PATH=$C_INCLUDE_PATH:/the/path/to/library/include in your terminal. Note that the path should end with include.

This solves my problem.
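The steps in this post can be sketched in Python as well, which may be convenient in a notebook where exporting shell variables is awkward. This is a minimal sketch under assumptions: the helper names are hypothetical, the include directory is assumed to sit next to bin as in the /sw/cuda/11.8.0 example, and the variable has to be set before torch.compile first invokes the C compiler. Empty entries are filtered out because, as far as I know, GCC treats an empty entry in this variable as the current working directory.

```python
import os
import shutil
from pathlib import Path


def include_dir_from_nvcc(nvcc_path):
    """Map <prefix>/bin/nvcc to <prefix>/include."""
    return Path(nvcc_path).parent.parent / "include"


def extend_c_include_path(current, include_dir):
    """Append include_dir to a C_INCLUDE_PATH value, skipping empty and duplicate entries."""
    parts = [p for p in current.split(os.pathsep) if p]
    if str(include_dir) not in parts:
        parts.append(str(include_dir))
    return os.pathsep.join(parts)


nvcc = shutil.which("nvcc")  # None when nvcc is not on PATH
if nvcc is not None:
    include_dir = include_dir_from_nvcc(nvcc)
    # Only touch the environment when the header is really there.
    if (include_dir / "cuda.h").is_file():
        os.environ["C_INCLUDE_PATH"] = extend_c_include_path(
            os.environ.get("C_INCLUDE_PATH", ""), include_dir
        )
        print("C_INCLUDE_PATH =", os.environ["C_INCLUDE_PATH"])
```

Child processes started afterwards, including a compiler launched from the same Python process, inherit the updated variable.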

yanielc · April 21, 2023, 10:09am

This solution worked for me.

Ethan0 · April 23, 2023, 9:51pm

from torchvision.datasets import MNIST

download_root = './MNIST_DATASET'

train_dataset = MNIST(download_root, train=True, download=True) # training data
test_dataset = MNIST(download_root, train=False, download=True) # test data

from torchvision import transforms

mnist_transform = transforms.Compose([
    transforms.ToTensor(), 
    ])

train_dataset = MNIST(download_root, transform=mnist_transform, train=True, download=True) # training data
test_dataset = MNIST(download_root, transform=mnist_transform, train=False, download=True) # test data

print(len(train_dataset), 'train samples')
print(len(test_dataset), 'test samples')

# Note!
# torch.nn.CrossEntropyLoss takes integer class labels directly,
# so no separate one-hot encoding step is needed.

import torch

import torch.nn.functional as F

num_classes = 10 
# print first ten (integer-valued) training labels
print('Integer-valued labels:')
y_train_list = [train_dataset[i][1] for i in range(10) ]
print(y_train_list)

# one-hot encode the labels
# convert class vectors to binary class matrices
y_train = F.one_hot(torch.tensor(y_train_list), num_classes)

# print first ten (one-hot) training labels
print('One-hot labels:')
print(y_train[:10])

%pip install torchsummary ## install torchsummary to summarize the model
from torchsummary import summary as summary_ ## import the torchsummary function to inspect model information

## helper that prints the model structure
def summary_model(model,input_shape=(1, 28, 28)):
    model = model.cuda()
    summary_(model, input_shape) ## (model, (input shape))

##################################################
## Define the CNN model structure
##################################################
from torch.nn import Sequential
from torch.nn import Conv2d, MaxPool2d, Flatten, Linear, Dropout, ReLU, ZeroPad2d, Softmax

## nn.Sequential instantiates the network and takes the layers as arguments, in the order they are applied.
model = Sequential(
            Conv2d(1,32,kernel_size=3,padding='same'),  ## reference: Conv2d(in_channels, out_channels, kernel_size, padding, bias)
            ReLU(),
            MaxPool2d(kernel_size=2),    
            Conv2d(32,64,kernel_size=3,padding='same'),
            ReLU(),
            MaxPool2d(kernel_size=2),   
            Flatten(),
            Linear(3136,64),
            ReLU(),
            Linear(64,10)  ## outputs raw logits; CrossEntropyLoss applies log-softmax itself, so no Softmax layer here
        ).cuda()

print(model)
summary_model(model) ## summarize the model

# Train the model
import time

## Loaders for the train and test datasets
## The batch size and whether to shuffle the data can be set here
trainloader = torch.utils.data.DataLoader(train_dataset, batch_size=32,shuffle=True) ## train loader
testloader = torch.utils.data.DataLoader(test_dataset, batch_size=1,shuffle=False) ## test loader

# Hyperparameters
total_epoch = 12
best_loss = 100 ## initialized to 100 so that the best checkpoint can be chosen by loss

learning_rate = 0.001 

optimizer = torch.optim.RMSprop(model.parameters(),lr = learning_rate, alpha=0.9, eps=1e-07) ## use RMSprop as the optimizer; see the documentation for the parameters
loss = torch.nn.CrossEntropyLoss().cuda() ## classification problem, so use CrossEntropyLoss

for epoch in range(total_epoch):
    start = time.time()
    print(f'Epoch {epoch}/{total_epoch}')

    # train
    train_loss = 0
    correct = 0
    for x, target in trainloader: ## load one batch at a time and train the model
        
        optimizer.zero_grad() ## reset gradients so they do not accumulate from the previous step
        y_pred = model(x.cuda()) ## model output
        cost = loss(y_pred, target.cuda()) ## compute the error with the loss function
    
        cost.backward() # compute gradients
        optimizer.step() # update the model parameters
        train_loss += cost.item()
        
        pred = y_pred.data.max(1, keepdim=True)[1] ## index of the class with the highest score
        correct += pred.cpu().eq(target.data.view_as(pred)).sum() # count the predictions that match the target:
                                                                  # view_as reshapes target to the shape of pred, and .eq() marks where pred and target agree
        
    train_loss /= len(trainloader) 
    train_accuracy = correct / len(trainloader.dataset)

    
    # eval
    eval_loss = 0
    correct = 0
    with torch.no_grad(): ## disable gradient tracking during evaluation
        model.eval() # switch to evaluation mode
        for x, target in testloader:
            y_pred = model(x.cuda()) ## model output
            cost = loss(y_pred,target.cuda()) ## compute the error on the test data with the loss function
            eval_loss += cost.item() ## accumulate a Python float rather than a CUDA tensor
            
            pred = y_pred.data.max(1, keepdim=True)[1] ## index of the class with the highest score
            correct += pred.cpu().eq(target.data.view_as(pred)).cpu().sum() # count the predictions that match the target
            
        eval_loss /= len(testloader)
        eval_accuracy = correct / len(testloader.dataset)
        
        ## save a checkpoint when the test loss is lower than the previous best loss
        if eval_loss < best_loss:
            torch.save({
                'epoch': epoch,
                'model': model,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'loss': cost.item(),
                }, './bestCheckPiont.pth')
            
            print(f'Epoch {epoch:05d}: val_loss improved from {best_loss:.5f} to {eval_loss:.5f}, saving model to bestCheckPiont.pth')
            best_loss = eval_loss
        else:
            print(f'Epoch {epoch:05d}: val_loss did not improve')
        model.train()
        
    print(f'{int(time.time() - start)}s - loss: {train_loss:.5f} - acc: {train_accuracy:.5f} - val_loss: {eval_loss:.5f} - val_acc: {eval_accuracy:.5f}')

## Load the saved best checkpoint

best_model = torch.load('./bestCheckPiont.pth')['model']  # loads the entire pickled model; the model's class definitions must be importable
best_model.load_state_dict(torch.load('./bestCheckPiont.pth')['model_state_dict'])  # load the state_dict and copy it into the model


correct = 0
with torch.no_grad(): ## disable gradient tracking during evaluation
    best_model.eval()
    for x, target in testloader:
        y_pred = best_model(x.cuda())
        
        pred = y_pred.data.max(1, keepdim=True)[1]
        correct += pred.cpu().eq(target.data.view_as(pred)).cpu().sum()
        
    eval_accuracy = correct / len(testloader.dataset)

print(f"Test accuracy: {100.* eval_accuracy:.4f}")
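The note in this post about CrossEntropyLoss and one-hot encoding can be checked with plain arithmetic. The sketch below is not PyTorch's implementation, only the textbook formula written with the standard library: the loss applies log-softmax to the raw scores and picks out the entry for the target class, which gives the same number as multiplying by a one-hot vector. All function names and the sample values are made up for illustration.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    peak = max(scores)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_total for s in scores]


def cross_entropy_from_index(scores, target_index):
    """Loss computed from an integer class label."""
    return -log_softmax(scores)[target_index]


def cross_entropy_from_one_hot(scores, one_hot):
    """Loss computed from a one-hot target vector."""
    return -sum(t * lp for t, lp in zip(one_hot, log_softmax(scores)))


scores = [2.0, 0.5, -1.0]   # raw model outputs for one sample (made-up values)
target_index = 0            # integer label
one_hot = [1.0, 0.0, 0.0]   # the same label, one-hot encoded

loss_index = cross_entropy_from_index(scores, target_index)
loss_one_hot = cross_entropy_from_one_hot(scores, one_hot)
print(math.isclose(loss_index, loss_one_hot))
```

Since the softmax step is part of the loss, the loss function expects raw scores as its input.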

Ethan0 · April 23, 2023, 10:36pm

import torch

# Select the device
DEVICE = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f"DEVICE:{DEVICE}")


# model summary
%pip install torchsummary ## install torchsummary to summarize the model
from torchsummary import summary as summary_ ## import the torchsummary function to inspect model information

## helper that prints the model structure
def summary_model(model,input_shape=(1, 28, 28)):
    model = model.cuda()
    summary_(model, input_shape) ## (model, (input shape))

train_path  = '/kaggle/input/2023-dls-w7/train'
test_path  = '/kaggle/input/2023-dls-w7/test'
from PIL import Image
tmp = Image.open("/kaggle/input/2023-dls-w7/test/img/1.jpg")
tmp.size

## Build the loaders with Torchvision functions
import torch.nn as nn 
import torchvision
from torchvision import transforms

batch_size = 100

## Build the loaders with Torchvision functions
## https://pytorch.org/vision/stable/generated/torchvision.datasets.ImageFolder.html
trainSet = torchvision.datasets.ImageFolder(root = train_path,
                                            transform=transforms.ToTensor())
testSet = torchvision.datasets.ImageFolder(root = test_path,
                                            transform=transforms.ToTensor())

## loader
trainloader = torch.utils.data.DataLoader(trainSet, batch_size=batch_size, shuffle=True)
testloader = torch.utils.data.DataLoader(testSet, batch_size=1, shuffle=False)

images, labels = next(iter(trainloader))
print("image shape:", images.shape)
print("label shape:", labels.shape)

##################################################
## Define the CNN model structure
##################################################
from torch.nn import Sequential
from torch.nn import Conv2d, MaxPool2d, Flatten, Linear, Dropout, ReLU, ZeroPad2d, Softmax

## nn.Sequential instantiates the network and takes the layers as arguments, in the order they are applied.
model = Sequential(
            Conv2d(3,32,kernel_size=3,padding='same'),  ## reference: Conv2d(in_channels, out_channels, kernel_size, padding, bias)
            ReLU(),
            MaxPool2d(kernel_size=2),    
            Conv2d(32,64,kernel_size=3,padding='same'),
            ReLU(),
            MaxPool2d(kernel_size=2),   
            Flatten(),
            Linear(3136,64),
            ReLU(),
            Linear(64,10)  ## outputs raw logits; CrossEntropyLoss applies log-softmax itself, so no Softmax layer here
        ).cuda()

print(model)
summary_model(model, input_shape=(3,28,28)) ## check the model summary

# Train the model
import time

# Hyperparameters
total_epoch = 5
best_loss = 100 ## initialized to 100 so that the best checkpoint can be chosen by loss

lr_rate = 0.0001
optimizer = torch.optim.Adam(model.parameters(), lr=lr_rate) ## use Adam as the optimizer; see the documentation for the parameters

loss = torch.nn.CrossEntropyLoss().cuda() ## classification problem, so use CrossEntropyLoss

## lists that store the accuracies for plotting
train_accuracys=[]
eval_accuracys=[]

for epoch in range(total_epoch):
    start = time.time()
    print(f'Epoch {epoch}/{total_epoch}')

    # train
    train_loss = 0
    correct = 0
    for x, target in trainloader: ## load one batch at a time and train the model

        optimizer.zero_grad() ## reset gradients so they do not accumulate from the previous step
        
        y_pred = model(x.cuda()) ## model output
        
        cost = loss(y_pred, target.cuda()) ## compute the error with the loss function

        cost.backward() # compute gradients
        optimizer.step() # update the model parameters
        train_loss += cost.item()
        
        pred = y_pred.data.max(1, keepdim=True)[1] ## index of the class with the highest score
        correct += pred.cpu().eq(target.data.view_as(pred)).sum() # count the predictions that match the target:
                                                                  # view_as reshapes target to the shape of pred, and .eq() marks where pred and target agree
    
    train_loss /= len(trainloader)
        
    train_accuracy = correct / len(trainloader.dataset)
    train_accuracys.append(train_accuracy) ## stored in a list for plotting

    print(f'{int(time.time() - start)}s - loss: {train_loss:.5f} - acc: {train_accuracy:.5f}')

import tqdm
# prediction

correct = 0
total = 0
pred_=[]

with torch.no_grad(): ## disable gradient tracking during inference
    model.eval()
    for data, label in tqdm.tqdm(testloader):

        print(data,label)

        outputs = model(data.cuda())
        _, predicted = torch.max(outputs.data, 1)
        pred_.append(predicted.cpu().item()) ## collect the predictions in pred_ for the submission

## Read the sample submission CSV file with pandas
import pandas as pd

sample = pd.read_csv("/kaggle/input/2023-dls-w7/sample_submit.csv")
sample.head()

for i, info in enumerate(testloader.dataset.imgs):
    path = info[0]
    img_id = (path.split("/")[-1]).split(".")[-2] ## extract the test image ID from the file name
    sample.loc[sample['index'] == int(img_id), 'label'] = pred_[i] ## write the prediction into the row with the matching ID

## Save as a CSV file
sample.to_csv("baseline.csv",index=False,header=True)
from matplotlib import pyplot as plt

plt.plot(train_accuracys, label='train')
plt.plot(eval_accuracys, label='test')
plt.legend()
plt.show()

Ethan0 · April 23, 2023, 10:41pm

train_path  = '/kaggle/input/2023-1-DLS-W6P2/train'
valid_path  = '/kaggle/input/2023-1-DLS-W6P2/valid'
test_path  = '/kaggle/input/2023-1-DLS-W6P2/test'

from torchvision.models import VGG16_Weights ## import to use the pretrained VGG16

transform_ = VGG16_Weights.DEFAULT.transforms() ## reuse the transform that VGG16 was trained with
transform_.resize_size=[224] ## set the resize size to 224
print(transform_)  ## check the transform

## Build the loaders with Torchvision functions
import torch
import torch.nn as nn 
import torchvision

## Build the loaders with Torchvision functions
## https://pytorch.org/vision/stable/generated/torchvision.datasets.ImageFolder.html 
trainSet = torchvision.datasets.ImageFolder(root = train_path,
                                            transform = transform_)
validSet = torchvision.datasets.ImageFolder(root = valid_path,
                                            transform = transform_)
testSet = torchvision.datasets.ImageFolder(root = test_path,
                                            transform = transform_)

## loader
trainloader = torch.utils.data.DataLoader(trainSet, batch_size=10, shuffle=True)
validloader = torch.utils.data.DataLoader(validSet, batch_size=30, shuffle=False)
testloader = torch.utils.data.DataLoader(testSet, batch_size=50, shuffle=False)

images, labels = next(iter(trainloader))
print("image shape:", images.shape)
print("label shape:", labels.shape)

from torchvision.models import VGG16_Weights
## Take the model without the classifier part
base_model = torchvision.models.vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features
## Append global average pooling at the end
base_model.global_avg_pool2d = nn.AdaptiveAvgPool2d((1,1))
print(base_model)
summary_model(base_model, (3,224,224))

## Freeze the front layers so that only the back of the model is trained

for para in base_model[:-10].parameters(): ## -5 in Keras but -10 in PyTorch, because a Keras layer definition includes its activation function
    para.requires_grad = False

summary_model(base_model,(3,224,224)) ## check the model summary

## Build the classifier
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512,10), ## 512, 7, 7
    # nn.Softmax() ## not used because CrossEntropyLoss already includes softmax
)

base_model.classifier = classifier ## attach it to the model under the name classifier
print(base_model)
summary_model(base_model, (3,224,224)) ## check the model summary

# Train the model
import time

# Hyperparameters
total_epoch = 15
best_loss = 100 ## initialized to 100 so that the best checkpoint can be chosen by loss

lr_rate = 0.0001
optimizer = torch.optim.Adam(base_model.parameters(), lr=lr_rate) ## use Adam as the optimizer; see the documentation for the parameters

loss = torch.nn.CrossEntropyLoss().cuda() ## classification problem, so use CrossEntropyLoss

## lists that store the accuracies for plotting
train_accuracys=[]
eval_accuracys=[]

for epoch in range(total_epoch):
    start = time.time()
    print(f'Epoch {epoch}/{total_epoch}')

    # train
    train_loss = 0
    correct = 0
    for x, target in trainloader: ## load one batch at a time and train the model

        optimizer.zero_grad() ## reset gradients so they do not accumulate from the previous step
        
        y_pred = base_model(x.cuda()) ## model output
        
        cost = loss(y_pred, target.cuda()) ## compute the error with the loss function

        cost.backward() # compute gradients
        optimizer.step() # update the model parameters
        train_loss += cost.item()
        
        pred = y_pred.data.max(1, keepdim=True)[1] ## index of the class with the highest score
        correct += pred.cpu().eq(target.data.view_as(pred)).sum() # count the predictions that match the target:
                                                                  # view_as reshapes target to the shape of pred, and .eq() marks where pred and target agree
    
    train_loss /= len(trainloader)
        
    train_accuracy = correct / len(trainloader.dataset)
    train_accuracys.append(train_accuracy) ## stored in a list for plotting

    
    #Evaluate
    eval_loss = 0
    correct = 0
    with torch.no_grad(): ## disable gradient tracking during evaluation
        base_model.eval() # switch to evaluation mode
        for x, target in validloader:
            y_pred = base_model(x.cuda()) ## model output
            cost = loss(y_pred, target.cuda()) ## compute the error on the validation data with the loss function
            eval_loss += cost.item() ## accumulate a Python float rather than a CUDA tensor
            
            pred = y_pred.data.max(1, keepdim=True)[1] ## index of the class with the highest score
            correct += pred.cpu().eq(target.data.view_as(pred)).cpu().sum() # count the predictions that match the target
            
        eval_loss /= len(validloader)
        eval_accuracy = correct / len(validloader.dataset)
        eval_accuracys.append(eval_accuracy) ## stored in a list for plotting
        
        ## save a checkpoint when the validation loss is lower than the previous best loss
        if eval_loss < best_loss:
            torch.save({
                'epoch': epoch,
                'model': base_model,
                'model_state_dict': base_model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'loss': cost.item(),
                }, './bestCheckPoint.pth')
            
            print(f'Epoch {epoch:05d}: val_loss improved from {best_loss:.5f} to {eval_loss:.5f}, saving model to bestCheckPoint.pth')
            best_loss = eval_loss
        else:
            print(f'Epoch {epoch:05d}: val_loss did not improve')
        base_model.train()
        
    print(f'{int(time.time() - start)}s - loss: {train_loss:.5f} - acc: {train_accuracy:.5f} - val_loss: {eval_loss:.5f} - val_acc: {eval_accuracy:.5f}')

## Load the saved best checkpoint
best_model = torch.load('./bestCheckPoint.pth')['model']  # loads the entire pickled model; the model's class definitions must be importable
best_model.load_state_dict(torch.load('./bestCheckPoint.pth')['model_state_dict'])  # load the state_dict and copy it into the model

#prediction

correct = 0
total = 0
pred_ = []

with torch.no_grad(): ## disable gradient tracking during inference
    best_model.eval()
    for data, label in testloader:
        print(data.shape)
        outputs = best_model(data.cuda())
        _, predicted = torch.max(outputs.data, 1)
        pred_.append(predicted.cpu()) ## collect the predictions in pred_ for the submission
        total += label.size(0)
        correct += (predicted == label.cuda()).sum().item()
        
print('Test accuracy: %d %%' % (100 * correct / total))

## Read the sample.csv file with pandas
import pandas as pd

sample = pd.read_csv("/kaggle/input/2023-1-DLS-W6P2/sample.csv")
sample.head()
result = torch.cat(pred_).numpy() ## concatenate the per-batch predictions into one array
for i, info in enumerate(testloader.dataset.imgs):
    path = info[0]
    img_id = (path.split("/")[-1]).split(".")[-2] ## extract the test image ID from the file name
    sample.loc[sample['ID'] == img_id, 'label'] = result[i] ## write the prediction into the row with the matching ID

sample.to_csv("baseline.csv",index=False,header=True)