I finally got my code working on my local CPU and now want to begin training in earnest on a cloud (AWS) machine with a GPU. I thought switching from CPU to GPU would be relatively simple but, wow, was I wrong.
To the best of my understanding I have the proper versions of PyTorch and CUDA running…
%> nvidia-smi
Wed Mar 24 18:53:53 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P0    26W /  70W |   2251MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
__Python VERSION: 3.6.8 (default, Nov 16 2020, 16:55:22)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
__pyTorch VERSION: 1.8.0
__CUDA VERSION
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
__CUDNN VERSION: 7605
__Number CUDA Devices: 1
__Devices
Active CUDA Device: GPU 0
Available devices 1
Current cuda device 0
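(The version information above was printed with roughly the following diagnostic snippet:)

import sys
from subprocess import call
import torch

print('__Python VERSION:', sys.version)
print('__pyTorch VERSION:', torch.__version__)
print('__CUDA VERSION')
call(['nvcc', '--version'])  # prints the nvcc build info shown above
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('__Number CUDA Devices:', torch.cuda.device_count())
print('__Devices')
print('Active CUDA Device: GPU', torch.cuda.current_device())
print('Available devices', torch.cuda.device_count())
print('Current cuda device', torch.cuda.current_device())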
I’ve reached a situation where, without changing anything in the code, I get a different error on alternating runs. The first attempt errors out with:
/dev/lib64/python3.6/site-packages/torch/nn/modules/conv.py in _conv_forward(self, input, weight, bias)
394 _pair(0), self.dilation, self.groups)
395 return F.conv2d(input, weight, bias, self.stride,
--> 396 self.padding, self.dilation, self.groups)
397
398 def forward(self, input: Tensor) -> Tensor:
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
Simply running again will then produce:
38 for x_ims, x, y in dataloader:
---> 39 x_ims = x_ims.to( DEVICE)
40 x = x.to( DEVICE)
41 y = y.to( DEVICE)
RuntimeError: CUDA error: invalid device function
Alternating runs switch back and forth between these two errors.
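For context, the failing steps boil down to roughly the following minimal check (DEVICE, the layer, and the tensor shapes here are placeholders, not my actual model):

import torch
import torch.nn as nn

DEVICE = torch.device("cuda:0")

# Move a batch-shaped tensor onto the GPU -- the step that raises
# "CUDA error: invalid device function" inside my dataloader loop.
x = torch.randn(8, 3, 64, 64).to(DEVICE)

# Run a small convolution on the GPU -- the step that raises
# "cuDNN error: CUDNN_STATUS_NOT_INITIALIZED" in the model's forward pass.
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1).to(DEVICE)
y = conv(x)

print(y.shape, y.device)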
When I then go through and enter just the lines where the problem arose by hand (along the lines of the check above), I get no error at all. So I’m utterly confused. This is my first attempt at using a GPU for computation and, to be honest, I’m a little daunted by all the management that seems to be required.