Really strange OOM error on new pytorch install

I am experiencing a really strange issue with out-of-memory errors on a fresh PyTorch install:

$ python -c "import torch; torch.randn(12,12).cuda()"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: CUDA error: out of memory

Background information:
I just created a new environment with the following (from the documentation):

conda create -n pytorch-test
conda install -n pytorch-test -c pytorch pytorch cudatoolkit=11.0
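To narrow down where the failure happens, a small diagnostic can separate "torch missing", "no usable device", and a genuine runtime error before the bare OOM message. This is a minimal sketch, not an official troubleshooting tool; the interpretation in the strings is my assumption:

```python
def check_cuda() -> str:
    """Report why a small CUDA tensor can or cannot be created."""
    try:
        import torch
    except ImportError:
        return "torch not installed in this environment"
    if not torch.cuda.is_available():
        # Covers driver/runtime mismatches and devices hidden by
        # CUDA_VISIBLE_DEVICES; no memory is allocated by this check.
        return "no usable CUDA device"
    try:
        torch.randn(12, 12).cuda()  # same allocation as in the report above
    except RuntimeError as e:
        return f"CUDA runtime error: {e}"
    return "ok"

print(check_cuda())
```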

And my GPU:

$ echo $CUDA_VISIBLE_DEVICES                                                                                           
$ nvidia-smi -i 2                                                                                                      
Thu Nov 19 12:52:56 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 455.38       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   2  Tesla P100-PCIE...  Off  | 00000000:0E:00.0 Off |                    2 |
| N/A   47C    P0    29W / 250W |      2MiB / 16280MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Any help is much appreciated, I'm kinda desperate here :sweat_smile:

[EDIT]: This seems to be more of a hardware issue. I am experiencing the same problem with a new installation of tensorflow, following the exact same steps… Still, if you have any pointers to solving this, I'll take them.

Could you switch the GPU from EXCLUSIVE_PROCESS to the default compute mode via nvidia-smi -i 2 -c 0 and check if that changes the behavior?

Thank you for your answer. Unfortunately, I do not have sufficient permissions to do that.
Nonetheless, here is some more information that makes me think your approach is probably the right one:

$ nvidia-smi -q | grep "Compute Mode"
    Compute Mode                          : Exclusive_Process

Any way I can go around this problem without changing the compute mode?
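In Exclusive_Process mode only one process may hold a CUDA context per GPU, so one thing that can be done without elevated permissions is to pin each process to a single visible device before any CUDA library initializes. This is a sketch of that idea; the index 2 is taken from the nvidia-smi output above and must match your setup:

```python
import os

def select_gpu(index: int) -> None:
    """Restrict this process to one GPU.

    Must run before importing torch (or any other CUDA library),
    because CUDA reads CUDA_VISIBLE_DEVICES only at initialization.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)

select_gpu(2)
# import torch  # import only *after* the environment variable is set
```

Note this only avoids contention between your own processes; if another user's process already holds the exclusive context on that GPU, the OOM error will persist until that context is released.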