The error message “CUDA driver error: invalid argument” indicates that an invalid argument was passed to an underlying CUDA API call. In this specific case, a few factors could be responsible.
One possibility is that the size of the tensor being passed to the prod() function is too large to fit into the GPU memory. You can check the available memory on your GPU by running torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated(). If the tensor is too large, you may need to reduce its size or consider using a larger GPU.
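A minimal sketch of that memory check, guarded so it also runs on a machine without CUDA (the device index 0 is just an example; adapt it to your setup):

```python
import torch

if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated()        # bytes currently allocated by tensors
    peak = torch.cuda.max_memory_allocated()         # peak allocation since the start
    total = torch.cuda.get_device_properties(0).total_memory  # total memory on GPU 0
    print(f"allocated: {allocated} B, peak: {peak} B, total: {total} B")
else:
    print("CUDA not available; skipping the memory check")
```

Note that memory_allocated() only counts memory held by PyTorch tensors; the CUDA context and caching allocator use additional memory on top of it.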
Another possibility is that the values in the tensor are not valid for the operation being performed. For example, if the tensor contains negative values and the operation requires positive values only, you may encounter this error. You can check the values in the tensor by printing it before calling the prod() function.
Additionally, this error could also be caused by a bug in the CUDA driver or an issue with the CUDA installation. In this case, you may need to reinstall the CUDA toolkit or update the driver to a newer version.
To better diagnose the issue, you can also try running the code on the CPU instead of the GPU by removing the .cuda() call and seeing whether the error persists. Hope this helps!
Check the available memory on your GPU by running torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated() to see if the tensor you are trying to process is too large for your GPU. If the tensor is too large, you may need to reduce its size or consider using a larger GPU.
Check the values in the tensor by printing it before calling the prod() function. Ensure that the tensor contains valid values for the operation being performed.
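A small sanity-check sketch along those lines, using a hypothetical helper (check_values is my name, not part of any API) that flags NaN/Inf and prints the value range before prod() is called:

```python
import torch

def check_values(t: torch.Tensor) -> None:
    # NaN/Inf only make sense for floating-point tensors
    if t.is_floating_point():
        assert torch.isfinite(t).all(), "tensor contains NaN or Inf"
    print("min:", t.min().item(), "max:", t.max().item())

check_values(torch.as_tensor([[32, 32], [16, 16], [8, 8]]))
# prints: min: 8 max: 32
```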
If the tensor is not too large and contains valid values, try updating your CUDA driver to the latest version or reinstalling the CUDA toolkit. You can find instructions for doing this on the NVIDIA website.
Alternatively, try running the code on CPU instead of GPU by removing the .cuda() call and see if the error persists. If the code runs without errors on CPU, it may be a problem with your GPU or CUDA installation.
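The CPU-fallback check could look like this, using the tensor from this thread; it runs prod() on the CPU first, then repeats the call on the GPU only if one is available and compares the results:

```python
import torch

data = torch.as_tensor([[32, 32], [16, 16], [8, 8]])
cpu_result = data.prod(1)            # tensor([1024, 256, 64])
print(cpu_result)

if torch.cuda.is_available():
    gpu_result = data.cuda().prod(1) # the call that fails on the broken setup
    assert torch.equal(gpu_result.cpu(), cpu_result)
```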
If none of the above solutions work, try searching for the error message online to see if other people have encountered similar issues and found solutions. You could also post your code and error message on online forums such as Stack Overflow to get help from the community.
Actually, I am able to train some neural networks on this configuration without any problem. So far, the only problem I see is the prod method on a CUDA tensor (CPU works fine).
I am using a cluster, and I have access to both V100 and A100 GPUs and can confirm that I receive this error on both. I also tried 3 different CUDA installations in 3 different conda environments, and they all have this problem. Lastly, I asked another user of the server to reproduce the error on the same setup (identical PyTorch/CUDA versions, GPU, and driver), but, surprisingly, the code worked in their case. Therefore, I wonder what might prevent me from calling the prod method on my setup.
data = torch.as_tensor([[32, 32], [16, 16], [8, 8]])
data = data.cuda()  # perhaps specify a device?
result = data.prod(1)
You could specify a device ID, like cuda(0) or do .to(device='cuda') and see if that helps.
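Spelled out, the suggested variants look like this ("cuda:0" simply means the first GPU; all three forms place the tensor on the same device):

```python
import torch

data = torch.as_tensor([[32, 32], [16, 16], [8, 8]])
if torch.cuda.is_available():
    a = data.cuda(0)             # explicit device ID
    b = data.to(device="cuda")   # equivalent .to() form
    c = data.to("cuda:0")        # device-string form
    print(a.device, b.device, c.device)
```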
After a look at Stack Overflow, it seems the error can be tied to the installed version of CUDA. If the conda envs keep giving the same error, you could try a pip-based virtual env instead?
I tried your code suggestions, but none of them worked. On the cluster, I also have access to another conda installation that provides a set of modules with preconfigured PyTorch/CUDA combinations. Instead of using my own conda environments, I tried pytorch-gpu/py3/1.13.0 with cuda/11.2 and pytorch-gpu/py3/2.0.0 with cuda/11.7.1 via module load, and the same problem still occurred.
I don’t understand what’s happening, but I won’t pursue it any further. Instead, I will rewrite the code as below and leave it at that.
import torch

a = torch.tensor([[32, 32], [16, 16], [8, 8]]).cuda()
# accumulate the per-row product manually instead of calling prod()
result = torch.ones(a.shape[0], dtype=a.dtype, device=a.device)
for i in range(a.shape[1]):
    result *= a[:, i]
print(result)
I don’t think any of these suggestions are a valid fix, but I’m also unable to reproduce the issue locally.
If you have any information on how to reproduce the issue, it would be great if you could share it.