I am trying to use a GPU in a virtual machine with PyTorch.
Setup:
Running the following PyTorch functions to check the availability of the GPU:
print(torch.cuda.current_device())
print(torch.cuda.get_device_name())
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.device(0))
print(torch.cuda.get_device_name(0))
results in the following output:
0
GRID V100S-16C
True
1
<torch.cuda.device object at 0x7f7e596bdb80>
GRID V100S-16C
Furthermore, nvidia-smi prints the following:
Sun Sep 3 17:00:29 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.142.00 Driver Version: 450.142.00 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GRID V100S-16C On | 00000000:02:02.0 Off | N/A |
| N/A N/A P0 N/A / N/A | 1104MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Problem:
Running the following code:
import torch
torch.ones((1, 1)).to('cuda')
results in the following error message:
RuntimeError Traceback (most recent call last)
Cell In[3], line 3
1 import torch
----> 3 torch.ones((1, 1)).to('cuda')
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
I tried setting CUDA_LAUNCH_BLOCKING=1, but the error still occurs and the error message is unchanged.
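For reference, this is roughly how I set the variable (my understanding is that it must be set before torch initializes the CUDA context, so I set it at the very top of the script, before import torch; the script name here is just an example):

import os

# CUDA_LAUNCH_BLOCKING must be in the environment before torch
# initializes CUDA, so set it before `import torch`.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # imported only after the variable is set
print(os.environ["CUDA_LAUNCH_BLOCKING"])  # confirms the variable is set

Alternatively I also tried exporting it in the shell (export CUDA_LAUNCH_BLOCKING=1) before launching Python, with the same result.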
Do you have any idea what the problem might be and how I can fix it? Please let me know if you need any further information to understand my problem. Thanks in advance, Daniel.