When I put CUDA_VISIBLE_DEVICES before python main.py, like this:
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --args...
the command makes the other GPUs, i.e. [4, 5, 6, 7], suddenly disappear, not only for my account but for other users as well.
When I type nvidia-smi, I can only see four GPUs:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 23% 33C P2 54W / 250W | 595MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 23% 34C P2 55W / 250W | 569MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 23% 34C P2 55W / 250W | 569MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 23% 33C P2 54W / 250W | 569MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
While it ought to show all eight GPUs:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 23% 30C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 23% 31C P0 58W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 23% 30C P0 58W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 23% 29C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 108... Off | 00000000:84:00.0 Off | N/A |
| 23% 30C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 108... Off | 00000000:85:00.0 Off | N/A |
| 23% 29C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:88:00.0 Off | N/A |
| 23% 30C P0 57W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 108... Off | 00000000:89:00.0 Off | N/A |
| 23% 27C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
PyTorch version: 0.4.0
CUDA version: V8.0.61
GPU driver: 384.111
Compared to another user, who can correctly run CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --args..., the only difference I can figure out is that I am using an Anaconda environment. I don't know how that could affect the run. I am not using sudo before CUDA_VISIBLE_DEVICES.
Thank you for replying!
aplassard (Andrew Plassard), July 20, 2018, 1:59pm, #2
Thank you for your reply!
I don't think this is expected behaviour. People use this command to assign specific GPUs to a task in order to spare the other GPUs for other tasks. If my use of CUDA_VISIBLE_DEVICES could affect the GPUs that other users observe, then they couldn't use the GPUs my task omits.
What's more, my coworker can use the command CUDA_VISIBLE_DEVICES=0,1,2 python main.py to specify the GPUs his task uses without influencing my task.
Again, thanks for your reply, but I still don't understand.
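For illustration, a minimal sketch of how the variable is normally expected to behave: it only restricts the devices visible to the process it is set for, so other users and other shells should be unaffected. (The file name check_visible.py is hypothetical, used only for this example.)
# Minimal sketch: run as  CUDA_VISIBLE_DEVICES=0,1,2 python check_visible.py
import os
import torch

print(os.environ.get("CUDA_VISIBLE_DEVICES"))  # "0,1,2" in this process only
print(torch.cuda.device_count())               # 3; other processes still see all GPUs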
albanD (Alban D), July 20, 2018, 2:08pm, #4
Hi,
How do you bring these GPUs back up after they disappear? Do you have to reboot, or does opening a new shell work?
Closing the old shell makes them reappear, both in my newly opened shell and in my coworker's shell.
albanD (Alban D), July 20, 2018, 2:17pm, #6
Does that happen for any script that you execute, or just for this one?
If it's only this one, what does it do? Is it doing multi-GPU work? Are you doing anything else fancy with your GPUs?
Concretely, even when I just type CUDA_VISIBLE_DEVICES=0,1,2,3 after entering the conda environment, without running any Python code, this phenomenon also happens.
In fact, main.py does simple PyTorch-based neural network training, with a DataLoader and DataParallel in it.
More info:
DataParallel using 20 workers.
Instead of setting the environment variable on the command line, using os.environ["CUDA_VISIBLE_DEVICES"] = "0, 1, 2, 3" inside the Python code also causes the problem. Closing the shell, killing the running process, or waiting for it to finish resolves it.
The loop
for target in loader:
    target_var = target.cuda(async=True)
is used. Maybe async=True will cause some problem?
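As an aside, a minimal sketch of both points, assuming PyTorch 0.4 or later: the variable has to be set from inside Python before CUDA is first initialized, and async=True is, as far as I know, the old spelling of what newer releases call non_blocking=True; neither should hide GPUs on its own.
# Minimal sketch, assuming a CUDA machine; TensorDataset stands in for the real dataset.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"   # must be set before any CUDA call

import torch
from torch.utils.data import DataLoader, TensorDataset

loader = DataLoader(TensorDataset(torch.randn(256, 16)),
                    batch_size=64, num_workers=4, pin_memory=True)

for (target,) in loader:
    target_var = target.cuda(non_blocking=True)  # spelled async=True in PyTorch 0.4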
Thanks for your quick reply!
aplassard (Andrew Plassard), July 20, 2018, 2:33pm, #8
Can you send us a few things to help?
- The output of nvidia-smi in your shell when you open a new terminal
- The output of nvidia-smi in your coworker's shell
- The output of CUDA_VISIBLE_DEVICES=0,1,2,3 nvidia-smi in your coworker's shell
- The output of nvidia-smi in your shell after they have run that
- The output of CUDA_VISIBLE_DEVICES=0,1 nvidia-smi in your shell
- The output of nvidia-smi in your coworker's shell after you have run that
albanD (Alban D), July 20, 2018, 2:57pm, #9
@aplassard Note that nvidia-smi is not impacted by the CUDA_VISIBLE_DEVICES env variable.
@PengZhenghao, starting from a state where it works, could you open two new shells?
- Check that nvidia-smi shows all the GPUs in both.
- Run CUDA_VISIBLE_DEVICES=0,1 in one shell.
- Check that nvidia-smi still shows all the GPUs in both. Is that the case?
- Run export CUDA_VISIBLE_DEVICES=0,1 in one shell.
- Check that nvidia-smi still shows all the GPUs in both. Is that still the case?
- In each shell, run python, then import torch and print(torch.cuda.device_count()). One should return 2 (the shell that had the export command) and the other 8. Is that the case?
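A minimal sketch of that last check, which also prints what the shell passed down:
# Run this in each shell; the one that exported CUDA_VISIBLE_DEVICES=0,1 should print 2.
import os
import torch

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("device_count =", torch.cuda.device_count())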
The nvidia-smi output in a newly opened terminal, for me or my coworker:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 23% 31C P0 57W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 23% 32C P0 58W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 23% 32C P0 58W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 23% 30C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 108... Off | 00000000:84:00.0 Off | N/A |
| 23% 31C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 108... Off | 00000000:85:00.0 Off | N/A |
| 23% 30C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:88:00.0 Off | N/A |
| 23% 31C P0 57W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 108... Off | 00000000:89:00.0 Off | N/A |
| 23% 28C P0 56W / 250W | 0MiB / 11172MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
My coworker sees exactly the same as I do if I do nothing.
Nothing happens after “CUDA_VISIBLE_DEVICES=0,1,2,3 nvidia-smi”
Whether in my shell or my coworker's shell, nvidia-smi shows the same output as above, with or without the conda environment.
When running the Python script, the GPUs disappear!
When I run sh train.sh inside the conda environment, where train.sh contains the following:
#!/usr/bin/env bash
CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python main.py --fuse-type=average --batch-size=64 --tsn --model=mlp --epochs=1 --lr=5e-4 2>&1 &
At that point, I can watch the GPUs disappear one by one, starting from GPU 7, until GPUs 4-7 have all disappeared.
The same phenomenon can be observed in both my shell and my coworker's shell.
While the script is still running and half of the GPUs have disappeared, nvidia-smi shows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 23% 29C P2 53W / 250W | 525MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 23% 31C P2 55W / 250W | 521MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 23% 29C P2 54W / 250W | 521MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 23% 26C P8 10W / 250W | 18MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 50371 C python 515MiB |
| 1 50371 C python 511MiB |
| 2 50371 C python 511MiB |
| 3 50371 C python 8MiB |
+-----------------------------------------------------------------------------+
After the script finishes, everything is restored:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 23% 30C P0 55W / 250W | 595MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 23% 31C P0 56W / 250W | 569MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 23% 31C P0 56W / 250W | 569MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 23% 29C P0 55W / 250W | 569MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 108... Off | 00000000:84:00.0 Off | N/A |
| 23% 28C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 108... Off | 00000000:85:00.0 Off | N/A |
| 23% 28C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:88:00.0 Off | N/A |
| 23% 30C P0 57W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 108... Off | 00000000:89:00.0 Off | N/A |
| 23% 27C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
After “Run export CUDA_VISIBLE_DEVICES=0,1 in one shell”, nvidia-smi in both shells still shows 8 GPUs.
Checking torch.cuda.device_count() in both shells after one of them ran the export: the phenomenon you expected happens; the user that ran the export gets 2, while the other gets 8.
In short, everything happens as you expected.
albanD (Alban D), July 20, 2018, 3:16pm, #12
OK,
So it is something in your actual script that causes this.
Could you disable the DataParallel? It will run on only one of the 4 available GPUs, but that should work.
When it is disabled, do you still see the GPUs disappear while the script runs?
Also, what do you mean by “DataParallel using 20 workers”? There is no notion of workers for DataParallel. Do you repeat some GPUs in the device_ids list? If so, try setting each GPU only once and see what it does.
Yes, all GPUs are still there!
I only commented out DataParallel in my code and reran the script:
model = model.cuda()
# model = torch.nn.DataParallel(model)
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 23% 26C P2 56W / 250W | 549MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 23% 23C P8 9W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 23% 22C P8 9W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 23% 22C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 108... Off | 00000000:84:00.0 Off | N/A |
| 23% 23C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 108... Off | 00000000:85:00.0 Off | N/A |
| 23% 23C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:88:00.0 Off | N/A |
| 23% 25C P8 9W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 108... Off | 00000000:89:00.0 Off | N/A |
| 23% 23C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 51957 C python 539MiB |
+-----------------------------------------------------------------------------+
In both shells, things look exactly the same!
So in your judgement, what is the key factor that triggers this phenomenon?
Thanks a lot!
albanD (Alban D), July 20, 2018, 3:31pm, #14
DataParallel uses quite complex backend methods in CUDA and has some behaviour that is “unpredictable” (by me) when it is not used in the classical way.
In particular it uses NCCL2, which has some system-wide impact on the GPUs.
When the GPUs disappear, is your code actually running? If so, does it use multiple GPUs or just one?
From reading the doc, I am not sure what happens when you do not provide any device_ids to DataParallel. I would replace your line with model = torch.nn.DataParallel(model, device_ids=[0, 1, 2, 3]) and check whether that works and properly uses all 4 GPUs. Setting only [0, 1] should use only 2.
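To make the index mapping concrete, a minimal sketch (assuming the process was launched with CUDA_VISIBLE_DEVICES=2,3,4,5; nn.Linear is just a stand-in model): after remapping, device_ids refer to the renumbered devices 0-3, not the physical GPU numbers.
# Minimal sketch: logical device_ids 0..3 map to physical GPUs 2..5 here.
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()                        # lands on logical device 0 (physical GPU 2)
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])  # spreads over physical GPUs 2,3,4,5
out = model(torch.randn(64, 128).cuda())
print(out.shape)                                         # torch.Size([64, 10])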
Yes, it works when using DataParallel(model, device_ids=[0,1,2,3])!
First I tried:
CUDA_VISIBLE_DEVICES=2,3,4,5
...
model = model.cuda()
model = torch.nn.DataParallel(model, device_ids=[0,1])
And we got:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 23% 24C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 23% 25C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 23% 27C P2 54W / 250W | 595MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 23% 26C P2 53W / 250W | 569MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 108... Off | 00000000:84:00.0 Off | N/A |
| 23% 26C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 108... Off | 00000000:85:00.0 Off | N/A |
| 23% 25C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:88:00.0 Off | N/A |
| 23% 27C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 108... Off | 00000000:89:00.0 Off | N/A |
| 23% 25C P8 9W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 52106 C python 585MiB |
| 3 52106 C python 559MiB |
+-----------------------------------------------------------------------------+
Then I tried:
CUDA_VISIBLE_DEVICES=2,3,4,5
...
model = model.cuda()
model = torch.nn.DataParallel(model, device_ids=[0,1,2,3])
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 23% 24C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 23% 25C P8 9W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 23% 32C P2 55W / 250W | 595MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 23% 30C P2 54W / 250W | 569MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 108... Off | 00000000:84:00.0 Off | N/A |
| 23% 30C P2 53W / 250W | 569MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 108... Off | 00000000:85:00.0 Off | N/A |
| 23% 29C P2 53W / 250W | 569MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:88:00.0 Off | N/A |
| 23% 27C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 108... Off | 00000000:89:00.0 Off | N/A |
| 23% 25C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 52926 C python 585MiB |
| 3 52926 C python 559MiB |
| 4 52926 C python 559MiB |
| 5 52926 C python 559MiB |
+-----------------------------------------------------------------------------+
Now you can see everything works well!
Thank you very much for your patient help! I never thought this problem would be triggered by DataParallel; it should use all the GPUs visible to Python. I don't know why it fails when I don't provide device_ids.
Sincere thanks to @albanD and @aplassard.
The PyTorch community thrives thanks to your quick and helpful replies. (I never thought a library forum could be this active before I joined here.)
albanD (Alban D), July 20, 2018, 3:54pm, #16
Great, everything works as expected there!
I will report the problem that occurs when no device_ids is provided; I'm not sure what the intended behaviour was, but I'm sure making GPUs disappear is not what we want!
I am sorry, but today this phenomenon happened again.
After tons of testing, I found the key combination of factors:
- using device_ids=[0,1,2,3] in DataParallel
- using nohup to run the Python script
- using a bash script to set CUDA_VISIBLE_DEVICES and run Python
I don't have time to find out why this happens, so for now I just work around those odd conditions when running my code. The code is provided below in case it helps somebody, or the staff, figure out the reason.
In my bash script:
#!/usr/bin/env bash
CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python main.py --fuse-type=cnn --batch-size=256 --model=cnn --epochs=50 --lr=5e-4 2>&1 &
echo 'Start training!'
In my code:
if args.model == 'cnn':
    model = CNNBase(4, feature_length, num_class)
elif args.model == 'mlp':
    model = MLPBase(feature_length, num_class)
model = model.cuda()
model = torch.nn.DataParallel(model, device_ids=[0,1,2,3])
UPDATE:
I found a cleaner way to deal with this problem: declare device_ids in DataParallel without setting the environment variable at all.
When I use device_ids=[0,1,2,3] and let torch.cuda.device_count() return 8, the program still runs on 4 GPUs, and all 8 GPUs remain visible in nvidia-smi.
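For completeness, a minimal sketch of that cleaner setup (a reconstruction; nn.Linear stands in for the CNNBase/MLPBase models from the thread): no CUDA_VISIBLE_DEVICES anywhere, and the restriction comes purely from device_ids.
# Minimal sketch of the workaround: no environment variable is set.
import torch
import torch.nn as nn

print(torch.cuda.device_count())                         # 8: nothing is hidden from the process

model = nn.Linear(128, 10).cuda()                        # parameters on GPU 0
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])  # training restricted to GPUs 0-3

out = model(torch.randn(64, 128).cuda())                 # GPUs 4-7 stay idle but visible in nvidia-smi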