Hello, I am trying to run PyTorch with CUDA on two machines, and in both cases I do not get the expected results. I will describe both cases, to see whether I am doing something wrong or whether there is some incompatibility I am not seeing.
Case 1)
- CUDA installed (nvcc --version): 11.2
- torch 1.12.1+cu113
- torchaudio 0.12.1+cu113
- torchvision 0.13.1+cu113
I run the following code (some parts are omitted for simplicity):
print(f"Is CUDA supported by this system? {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
# Storing ID of current CUDA device
cuda_id = torch.cuda.current_device()
print(f"ID of current CUDA device: {torch.cuda.current_device()}")
print(f"Name of current CUDA device: {torch.cuda.get_device_name(cuda_id)}")
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.cuda()
model = model.eval()
with torch.no_grad():
for i, (x, y) in enumerate(val_loader):
x = x.cuda()
y = y.cuda()
y_pred = model(x)
The result I get is the following:
Is CUDA supported by this system? True
CUDA version: 11.3
ID of current CUDA device: 0
Name of current CUDA device: NVIDIA A100-PCIE-40GB
inputs torch.Size([1024, 3, 224, 224]) \ rep 0 \ labels torch.Size([1024])
Segmentation fault (core dumped)
The failure happens when the model runs the inference, and I don't know why; when I run the example under gdb, it just gets stuck and never finishes.
If I monitor the GPU while the script runs (for example with nvidia-smi -lms), my process shows up, but with 0% GPU utilization.
If, for instance, I remove the x.cuda() and y.cuda() calls, the program finishes fine, and nvidia-smi shows some MiB of GPU memory allocated but still 0% utilization. I suppose this happens because the model is on the GPU but the inference is not.
I also changed the code to use model.to(device) and x.to(device) instead of model.cuda() and x.cuda(), with the same result.
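For reference, this is the device-based variant I tried (a minimal sketch; model and val_loader come from the omitted parts of the script):

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Same logic as above, but with .to(device) instead of .cuda()
model = model.to(device)
model.eval()

with torch.no_grad():
    for i, (x, y) in enumerate(val_loader):
        x = x.to(device)
        y = y.to(device)
        y_pred = model(x)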
Case 2)
- CUDA installed (nvcc --version): 11.8
- torch 2.4.1+cu118
- torchaudio 2.4.1+cu118
- torchvision 0.19.1+cu118
Same code, and this is the output:
Is CUDA supported by this system? True
CUDA version: 11.8
ID of current CUDA device: 0
Name of current CUDA device: Tesla V100-PCIE-32GB
Segmentation fault (core dumped)
I get the same output with different CUDA configurations and different GPUs, and from my code I can detect not only the CUDA version but also the GPU itself, with functions like current_device(), get_device_name() or is_available().
Again, if I don't use x.to(device) or y.to(device), the program works, but with 0% GPU utilization. Surprisingly, model.to(device) and model.cuda() never fail. For reference, this is the nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:3D:00.0 Off | 0 |
| N/A 35C P0 35W / 250W | 1684MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
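To isolate the problem, this is a minimal sanity check I am considering, independent of the model: it prints the device information that already works for me, then launches a single kernel (a sketch; if even this segfaults, I assume the issue is in the PyTorch/CUDA/driver setup rather than in my inference code):

import torch

# Device information (these calls already work in both cases)
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))  # e.g. (7, 0) for V100, (8, 0) for A100
print(torch.cuda.get_arch_list())           # architectures this torch build was compiled for

# Smallest possible GPU workload: one matmul plus an explicit sync
a = torch.randn(1024, 1024, device='cuda')
b = torch.randn(1024, 1024, device='cuda')
c = a @ b
torch.cuda.synchronize()  # force the kernel to actually execute
print(c.sum().item())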
The only explanation I can come up with is that I am somehow using the wrong code to send the model and the tensors to the GPU, or that my torch build is not compatible with my setup.
Any help would be appreciated,
Thank you.