Model.cuda() fails

import torch

class TinyModel(torch.nn.Module):

    def __init__(self):
        super(TinyModel, self).__init__()

        self.linear1 = torch.nn.Linear(100, 200)
        self.activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(200, 10)
        self.softmax = torch.nn.Softmax()

    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        x = self.softmax(x)
        return x

tinymodel = TinyModel()
tinymodel.cuda()

^^^^^^^^^^^^^^^^^^^^^^

Produces:


RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_34/1502743690.py in <module>
     17
     18 tinymodel = TinyModel()
---> 19 tinymodel.cuda()

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in cuda(self, device)
    456             Module: self
    457         """
--> 458         return self._apply(lambda t: t.cuda(device))
    459
    460     def cpu(self: T) -> T:

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    352     def _apply(self, fn):
    353         for module in self.children():
--> 354             module._apply(fn)
    355
    356         def compute_should_use_set_data(tensor, tensor_applied):

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    374             # with torch.no_grad():
    375             with torch.no_grad():
--> 376                 param_applied = fn(param)
    377             should_use_set_data = compute_should_use_set_data(param, param_applied)
    378             if should_use_set_data:

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in <lambda>(t)
    456             Module: self
    457         """
--> 458         return self._apply(lambda t: t.cuda(device))
    459
    460     def cpu(self: T) -> T:

RuntimeError: CUDA error: device-side assert triggered

pip list | grep torch:
torch 1.6.0
torchvision 0.7.0

nvcc -V:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

nvidia-smi:
Fri Mar 25 16:39:08 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Tesla K80    Off  | 00000001:00:00.0 Off |                    0 |
| N/A   37C    P0    69W / 149W |    707MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

(Running from within docker container)


Does anyone have any idea why model.cuda() raises the exception listed above?

This sounds like a setup issue. Were you able to use the GPU on this system before and if so, did you change anything (e.g. updated the drivers without a restart etc.)?
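
One quick way to rule out a setup problem is to ask PyTorch what it can see from inside the container. The snippet below is only a sketch: `cuda_summary` is a hypothetical helper name chosen for illustration, and it does nothing beyond printing what the standard `torch.cuda` queries report.

```python
import torch


def cuda_summary():
    """Hypothetical helper: collect basic facts about the CUDA setup."""
    info = {
        "torch": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }
    # Only query the device name when a GPU is actually visible.
    if info["cuda_available"]:
        info["device_0"] = torch.cuda.get_device_name(0)
    return info


for key, value in cuda_summary().items():
    print(f"{key}: {value}")
```

If `cuda_available` is False or `device_count` is 0 here, the problem is likely in the driver or container configuration rather than in the model code.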

I did the setup a long time ago and have not changed anything since.

Just before running the code above, I ran PyTorch on the GPU (confirmed by the GPU utilization, memory, and power usage shown in “watch nvidia-smi”).
The code I ran, in a Jupyter notebook next to the failing one (see above), is siamese-triplet/Experiments_MNIST.ipynb at master · adambielski/siamese-triplet · GitHub.
The steps I ran were:

  • Prepare dataset
  • Common setup
  • Baseline: Classification with softmax

Everything worked flawlessly, with no restarts of the Docker container and no changes on the host.

If I understand your use case correctly the GPU was indeed working, but you are now hitting the device-side assert when trying to run any CUDA code in your notebook?
If so, then you were most likely already running into an assert before and the CUDA context is corrupted.
I.e. once CUDA runs into an assert the context is corrupt and all following CUDA operations will fail until you restart the process.
Try restarting the notebook and see if the simple to() operation is working again.
If so, then debug which assert was triggered before that.
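
For tracking down the original assert, two approaches often help: setting `CUDA_LAUNCH_BLOCKING=1` before the first CUDA call, so the error is reported at the operation that caused it, and re-running the suspect code on the CPU, where the same mistake usually raises an ordinary Python exception with a readable message. The sketch below illustrates both. `cpu_error` is a hypothetical helper, and the out-of-range embedding index is only an assumed example of a mistake that can trigger a device-side assert on the GPU; the thread does not confirm what the actual cause was.

```python
import os

# Assumption: this runs before any CUDA call in the process, so kernel
# launches become synchronous and errors surface at the offending operation.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import torch


def cpu_error(module, inputs):
    """Hypothetical helper: run one forward pass on the CPU and return the
    exception it raises, or None if the pass succeeds."""
    try:
        with torch.no_grad():
            module.cpu()(inputs.cpu())
    except Exception as exc:
        return exc
    return None


# Illustrative failure: index 10 is out of range for a 10-row embedding table.
embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=3)
bad_indices = torch.tensor([0, 5, 10])
good_indices = torch.tensor([0, 5, 9])

print("bad input :", repr(cpu_error(embedding, bad_indices)))
print("good input:", repr(cpu_error(embedding, good_indices)))
```

Running the check in a fresh process matters: once the context is corrupted, even correct code fails, so the notebook has to be restarted before either approach gives a trustworthy result.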

Wow! Thank you very much!
The restart helped. As for what exactly caused the error, it seems that while building models I tried to create some with mismatching layer sizes (that’s a hypothesis; I’m not 100% sure).

Best wishes!