PyTorch giving me a weird CUDA error

Here is my SO question regarding the same: https://stackoverflow.com/q/45861767/4993513

Could someone help me understand why PyTorch is giving me that CUDA error?

Error 30 is usually unrelated to PyTorch itself (or to your code change). You likely need to run your program once with sudo so that the NVIDIA driver gets initialized.
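
To see whether CUDA initializes at all before touching the training script, something like the following might help. It is only a rough sketch: `cuda_report` is a made-up helper name, and it just records what `torch.cuda.is_available()` and `torch.cuda.device_count()` say, then tries one tiny allocation on the GPU so that any initialization failure shows up as a message instead of a crash.

```python
def cuda_report():
    """Collect basic facts about CUDA visibility without crashing.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    report = {
        "torch_installed": False,
        "cuda_available": False,
        "device_count": 0,
        "error": None,
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter
        return report
    report["torch_installed"] = True
    try:
        report["cuda_available"] = bool(torch.cuda.is_available())
        report["device_count"] = int(torch.cuda.device_count())
        if report["cuda_available"]:
            # Moving a tensor to the GPU forces CUDA initialization,
            # so driver or runtime failures surface here.
            torch.zeros(1).cuda()
    except Exception as exc:
        report["error"] = str(exc)
    return report


print(cuda_report())
```

If `cuda_available` is False or `error` is set, the problem is most likely in the driver or CUDA setup rather than in the script.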


I have done that too. I ran sudo python imagenet2.py --world-size 1 --arch 'alexnet' ImageNet2 and still get the exact same error.

And does the script work if you don't modify it? I.e., does the default examples/imagenet/main.py work out of the box?

Not really. I get this error in the beginning:

Traceback (most recent call last):
  File "imagenetoriginal.py", line 316, in <module>
    main()
  File "imagenetoriginal.py", line 83, in main
    model.features = torch.nn.DataParallel(model.features)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 47, in __init__
    output_device = device_ids[0]
IndexError: list index out of range

But then I added device_ids=[0, 1] to the DataParallel call and ran it again, and it gave the same error (30).
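
The IndexError in the traceback comes from `device_ids[0]` on an empty list, which suggests PyTorch did not see any GPU when DataParallel built its default device list. A minimal sketch of a guard, assuming you want to check this before wrapping the model (`pick_device_ids` is a hypothetical helper, not a PyTorch function):

```python
def pick_device_ids(visible_gpus, requested=None):
    """Return the GPU indices that are safe to pass as device_ids.

    Hypothetical helper: visible_gpus is the number of GPUs PyTorch reports,
    requested is an optional list of indices the caller would like to use.
    """
    if visible_gpus <= 0:
        return []
    if requested is None:
        return list(range(visible_gpus))
    # Drop any index that does not correspond to a visible device
    return [i for i in requested if 0 <= i < visible_gpus]


try:
    import torch
    visible = torch.cuda.device_count() if torch.cuda.is_available() else 0
except ImportError:
    # Assumption for this sketch: treat a missing PyTorch install as "no GPUs"
    visible = 0

device_ids = pick_device_ids(visible, requested=[0, 1])
if device_ids:
    print("usable GPUs:", device_ids)
else:
    print("no usable CUDA device; DataParallel would fail here")
```

If the list comes back empty, hard-coding device_ids only moves the failure to the first CUDA call, which would match getting the same error again.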

If an HDMI cable is connected to the GPU, e.g. for a monitor, then CUDA may not work, perhaps because the GPU is occupied by the display. A possible solution is to disconnect the HDMI cable and use ssh to control your server.

Did you encounter this problem?
In my case I’m losing a bit of the GPU RAM, but I’ve never seen any problems using CUDA.