Computer with two RTX 2080 Tis gets stuck when running PyTorch

My computer is equipped with two RTX 2080 Ti GPUs. I have installed CUDA 10.1 on the system, and the driver version is 418.39.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0  On |                  N/A |
| 41%   36C    P8    36W / 260W |    637MiB / 10986MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:02:00.0 Off |                  N/A |
| 40%   31C    P8    13W / 260W |      1MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2410      G   /usr/lib/xorg/Xorg                            28MiB |
|    0      2593      G   /usr/bin/gnome-shell                          98MiB |
|    0     12630      G   /usr/lib/xorg/Xorg                           252MiB |
|    0     12734      G   /opt/teamviewer/tv_bin/TeamViewer             19MiB |
|    0     12793      G   /usr/bin/gnome-shell                         186MiB |
|    0     13362      G   ...uest-channel-token=17060840034489939627    49MiB |
+-----------------------------------------------------------------------------+

I have tried installing PyTorch through Anaconda, but the machine gets stuck a few minutes after training starts: the mouse and keyboard stop responding and the screen freezes. I also cannot ssh into the computer. I cannot figure out what the problem is, because everything is frozen before I can see any error, so I have to restart the computer by force.

It seems that the RTX 2080 Ti requires at least CUDA 10 to run efficiently. In Anaconda I can create several different environments to test, and I have tried several combinations of cudatoolkit and PyTorch versions, but the problem persists.
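
For reference, the kind of minimal test I run in each conda environment looks roughly like this (just a sketch; the sizes and iteration count are arbitrary, and the real training scripts that freeze are longer):

    import torch

    # Sustained load on a single GPU, standing in for a real training job.
    # The freeze happens a few minutes into this kind of workload.
    x = torch.randn(8192, 8192, device="cuda:0")
    for step in range(10000):
        y = x @ x                      # heavy matmul to keep the GPU busy
        if step % 100 == 0:
            torch.cuda.synchronize()   # force the queued kernels to finish
            print(step, y.mean().item())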

On this computer I can install tensorflow-gpu 1.9 with the CUDA 9 toolkit through Anaconda, because TensorFlow does not yet support CUDA 10. As for PyTorch, CUDA 10 only seems to work properly on other GPUs such as the 1080 Ti.

ref: https://github.com/pytorch/pytorch/issues/12977

Does anyone have an idea?

I'm having a similar issue when training on a machine with multiple 2080 Tis using DataParallel. When using only one GPU it seems to run fine, but it freezes and crashes in the same way as described above when using DataParallel. This is with PyTorch 1.1 and CUDA 10.0.

Hm, do you have an HDMI cable plugged into one of the cards to drive the GUI? If yes, can you try to run the code without DataParallel on just the card the HDMI cable is plugged into? I am wondering whether this really is a DataParallel issue or whether it's due to exhausting all the GPU memory, which would make the GUI freeze.
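
To test that, something along these lines restricts the run to a single physical card (just a sketch; the index "0" and the tiny model are placeholders, so use whatever index nvidia-smi shows for the card with the HDMI cable and your own network):

    import os

    # Hide all but one physical GPU before CUDA is initialized; whichever
    # index you list here shows up inside PyTorch as cuda:0.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    import torch
    import torch.nn as nn

    device = torch.device("cuda:0")
    model = nn.Linear(1024, 10).to(device)   # dummy model standing in for the real network
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Plain single-GPU loop, no DataParallel wrapper, so the comparison is clean.
    for step in range(1000):
        x = torch.randn(64, 1024, device=device)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print("finished without freezing")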

PS: I have a machine with 8 2080 Tis, and when I use DataParallel on those I don't have any issues. I am running a headless install of Ubuntu there though, so no GUI. On my second machine I have 4 cards, with an HDMI cable plugged into one of them for the GUI, and I notice that the interface is indeed slower when I run model training on that card, but I guess this is expected? (It never crashes or completely freezes, though.)


Hmm, good question. The machine in question has no display; I ssh into it. I'm glad to hear you have such a setup working, though! Do you mind giving me a few details about how you installed PyTorch and CUDA, i.e., which versions and how they were installed? That would be very helpful.

My machine can only run for less than 10 minutes before it gets stuck, and then I cannot ssh into it anymore, even when I turn off the display. I was also wondering how you managed to assemble a machine with 8 GPUs.

Could you please share the specs of the motherboard, CPU, and other components? I assembled this machine myself, and I am not sure whether some other component is causing the issue.

Not sure, but does that really turn off the GPU use (you can check via nvidia-smi; memory use should be ~0)? I would just try temporarily unplugging the cable from the graphics card.

motherboard, cpu and other components.

Motherboard:
Supermicro X11DPG-OT-CPU

CPUs:
4x Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz (8 cores each, 11M cache)

Memory:
376GiB

how you installed pytorch and cuda?

I am using conda; I noticed it's much faster than a custom-compiled version, probably because of the MKL that ships with conda, since I have Intel CPUs.

I.e., I am just using

conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
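
After installing that way, a quick sanity check along these lines (just a sketch) shows which CUDA runtime the conda build was compiled against and whether it sees all the cards:

    import torch

    print(torch.__version__)                 # PyTorch version from the pytorch channel
    print(torch.version.cuda)                # CUDA runtime the build uses, e.g. 10.0
    print(torch.cuda.is_available())         # should be True
    print(torch.cuda.device_count())         # number of visible GPUs
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))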

I faced a similar issue. Do monitor the temperature of the GPUs using watch nvidia-smi and check whether it exceeds 90 degrees or so. You can also check whether thermal throttling is active with nvidia-smi -a.

Very good point regarding the temperature. On the 8-GPU machine, mine are usually around 60-70 Celsius:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.93       Driver Version: 410.93       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:1A:00.0 Off |                  N/A |
| 41%   66C    P2   259W / 250W |   4429MiB / 10989MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:1B:00.0 Off |                  N/A |
| 43%   69C    P2   247W / 250W |   4429MiB / 10989MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:3D:00.0 Off |                  N/A |
| 30%   49C    P2   134W / 250W |   4365MiB / 10989MiB |     84%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:3E:00.0 Off |                  N/A |
| 30%   47C    P2   127W / 250W |   4365MiB / 10989MiB |     79%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce RTX 208...  Off  | 00000000:88:00.0 Off |                  N/A |
| 36%   59C    P2   250W / 250W |   4429MiB / 10989MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce RTX 208...  Off  | 00000000:89:00.0 Off |                  N/A |
| 35%   27C    P0    69W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce RTX 208...  Off  | 00000000:B1:00.0 Off |                  N/A |
| 39%   62C    P2   252W / 250W |   4429MiB / 10989MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce RTX 208...  Off  | 00000000:B2:00.0 Off |                  N/A |
| 38%   62C    P2   242W / 250W |   4429MiB / 10989MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

where throttling is set to start at 89C:

        GPU Shutdown Temp           : 94 C
        GPU Slowdown Temp           : 91 C
        GPU Max Operating Temp      : 89 C

It could be that your cooling is not sufficient and you approach the slowdown/shutdown temperature when your GPUs are under load, which could be an explanation for

My machine can only run for less than 10 minutes.

Maybe do a new run and keep an eye on the temperatures reported by nvidia-smi to see whether there is a correlation.
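
Since the machine freezes before you can read anything off the screen, it may also help to log the readings to a file so they survive a hard reboot. A rough sketch (it just polls nvidia-smi's CSV query output; the fields, file name, and interval are only examples):

    import subprocess, time

    # Poll nvidia-smi every few seconds and append the readings to a log
    # file, so the last values before a freeze are still on disk afterwards.
    QUERY = "timestamp,index,temperature.gpu,utilization.gpu,memory.used,power.draw"

    with open("gpu_log.csv", "a", buffering=1) as log:
        while True:
            out = subprocess.run(
                ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
                capture_output=True, text=True,
            )
            log.write(out.stdout)
            time.sleep(5)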

I use nvidia-smi to watch the status of the GPUs while the programs are running. There are two stages:

  1. First, nvidia-smi fails to obtain the status of the GPUs while the programs are still running; the error says the computer needs to be rebooted.
  2. After a while, the computer becomes totally stuck and cannot be ssh-ed into.

The temperature is normal even at the point when it gets stuck.

Hi Sebastian, if you don't mind, could you tell me what sort of power supply you're using on that machine?

Update: I seem to have gotten my net training on the machine with multiple 2080 Tis. Part of the issue was in the code itself, where a few tensors were inadvertently allocated on device 0 instead of on the GPU they belonged on. Once these were tracked down and fixed, training would run on two GPUs, but the machine would shut down when running on three or four. That was traced to the machine's power supply, and once it was upgraded the code trains fine on all four GPUs.
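
For anyone hitting the first half of that, the pattern that bit me was roughly the one below (a sketch with made-up layer names, not the actual code): tensors created with a hard-coded device inside the model end up on GPU 0 in every DataParallel replica, whereas buffers and tensors derived from the inputs follow each replica to its own GPU.

    import torch
    import torch.nn as nn

    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(128, 128)
            # Bad: a hard-coded device pins this tensor to GPU 0 in every replica.
            # self.scale = torch.ones(128, device="cuda:0")
            # Better: register it as a buffer so DataParallel moves it with the module.
            self.register_buffer("scale", torch.ones(128))

        def forward(self, x):
            # Tensors derived from x (or from parameters/buffers) stay on
            # the replica's own device.
            noise = torch.randn_like(x)
            return self.fc(x * self.scale + noise)

    model = nn.DataParallel(Net().to("cuda:0"))   # replicates across all visible GPUs
    out = model(torch.randn(256, 128, device="cuda:0"))
    print(out.device)                             # gathered back on cuda:0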

Hi David, I’m experiencing similar problems to yours with two 2080Ti’s. Do you mind sharing how you traced the problem to the power supply?

What happened is that training on all the GPUs would begin normally, but within a minute or so the machine would simply shut down, i.e., power itself off without any warning. The guy here who built the rig immediately suspected the 1300W power supply wasn't big enough and replaced it with a 1600W unit. No issues after that. I can ask him for more details if you'd like.