When I put CUDA_VISIBLE_DEVICES before python main.py, like this:
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --args...
the command makes the other GPUs, i.e. [4, 5, 6, 7], suddenly disappear, not only for my account but for other users as well.
When I type nvidia-smi, I can only see four GPUs:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 23% 33C P2 54W / 250W | 595MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 23% 34C P2 55W / 250W | 569MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 23% 34C P2 55W / 250W | 569MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 23% 33C P2 54W / 250W | 569MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
While it ought to show all eight GPUs:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 23% 30C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 23% 31C P0 58W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 23% 30C P0 58W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 23% 29C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 108... Off | 00000000:84:00.0 Off | N/A |
| 23% 30C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 108... Off | 00000000:85:00.0 Off | N/A |
| 23% 29C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:88:00.0 Off | N/A |
| 23% 30C P0 57W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 108... Off | 00000000:89:00.0 Off | N/A |
| 23% 27C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
PyTorch version: 0.4.0
CUDA version: V8.0.61
GPU driver: 384.111
Compared to another user, who can correctly run CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --args..., the only difference I can figure out is that I am using an Anaconda environment. I don't know how that could affect the run. I am not using sudo before CUDA_VISIBLE_DEVICES.
Thank you for replying!
aplassard (Andrew Plassard), July 20, 2018, 1:59pm, #2
Thank you for your reply!
I don't think this is expected behaviour. People use this command to assign specific GPUs to a task in order to spare the other GPUs for other tasks. If my use of CUDA_VISIBLE_DEVICES could affect the GPUs that other users observe, then they couldn't use the GPUs my task omits.
What's more, my coworker can use the command CUDA_VISIBLE_DEVICES=0,1,2 python main.py to specify the GPUs his task uses without influencing my task.
Again, thanks for your reply, but I still don't understand.
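For illustration, a minimal sketch of how the variable is normally expected to behave: it only restricts the devices visible to the process it is set for, so other users and other shells should be unaffected. (The file name check_visible.py is hypothetical, used only for this example.)
# Minimal sketch: run as  CUDA_VISIBLE_DEVICES=0,1,2 python check_visible.py
import os
import torch

print(os.environ.get("CUDA_VISIBLE_DEVICES"))  # "0,1,2" in this process only
print(torch.cuda.device_count())               # 3; other processes still see all GPUs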
albanD (Alban D), July 20, 2018, 2:08pm, #4
Hi,
How do you bring these GPUs back up after they disappear? Do you have to reboot, or does opening a new shell work?
Closing the old shell makes them reappear, both in my newly opened shell and in my coworker's shell.
albanD (Alban D), July 20, 2018, 2:17pm, #6
Does that happen for any script that you execute, or just for this one?
If it's only this one, what does it do? Is it doing multi-GPU work? Are you doing anything else fancy with your GPUs?
Concretely, even when I just type CUDA_VISIBLE_DEVICES=0,1,2,3 after entering the conda environment, without running any Python code, this phenomenon also happens.
In fact, main.py does simple PyTorch-based neural network training, with a DataLoader and DataParallel in it.
More info:
DataParallel using 20 workers.
Instead of setting the environment variable on the command line, using os.environ["CUDA_VISIBLE_DEVICES"] = "0, 1, 2, 3" inside the Python code also causes the problem. Closing the shell, killing the running process, or waiting for it to finish resolves it.
The loop
for target in loader:
    target_var = target.cuda(async=True)
is used. Maybe async=True will cause some problem?
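As an aside, a minimal sketch of both points, assuming PyTorch 0.4 or later: the variable has to be set from inside Python before CUDA is first initialized, and async=True is, as far as I know, the old spelling of what newer releases call non_blocking=True; neither should hide GPUs on its own.
# Minimal sketch, assuming a CUDA machine; TensorDataset stands in for the real dataset.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"   # must be set before any CUDA call

import torch
from torch.utils.data import DataLoader, TensorDataset

loader = DataLoader(TensorDataset(torch.randn(256, 16)),
                    batch_size=64, num_workers=4, pin_memory=True)

for (target,) in loader:
    target_var = target.cuda(non_blocking=True)  # spelled async=True in PyTorch 0.4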
Thanks for your quick reply!
aplassard (Andrew Plassard), July 20, 2018, 2:33pm, #8
Can you send us a few things to help?
- The output of nvidia-smi in your shell when you open a new terminal
- The output of nvidia-smi in your coworker's shell
- The output of CUDA_VISIBLE_DEVICES=0,1,2,3 nvidia-smi in your coworker's shell
- The output of nvidia-smi in your shell after they have run that
- The output of CUDA_VISIBLE_DEVICES=0,1 nvidia-smi in your shell
- The output of nvidia-smi in your coworker's shell after you have run that
albanD (Alban D), July 20, 2018, 2:57pm, #9
@aplassard Note that nvidia-smi is not impacted by the CUDA_VISIBLE_DEVICES env variable.
@PengZhenghao, starting from a state where it works, could you open two new shells?
- Check that nvidia-smi shows all the GPUs in both.
- Run CUDA_VISIBLE_DEVICES=0,1 in one shell.
- Check that nvidia-smi still shows all the GPUs in both. Is that the case?
- Run export CUDA_VISIBLE_DEVICES=0,1 in one shell.
- Check that nvidia-smi still shows all the GPUs in both. Is that still the case?
- In each shell, run python, then import torch and print(torch.cuda.device_count()). One should return 2 (the shell that had the export command) and the other 8. Is that the case?
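A minimal sketch of that last check, which also prints what the shell passed down:
# Run this in each shell; the one that exported CUDA_VISIBLE_DEVICES=0,1 should print 2.
import os
import torch

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("device_count =", torch.cuda.device_count())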
The nvidia-smi output in a newly opened terminal, for me or my coworker:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 23% 31C P0 57W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 23% 32C P0 58W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 23% 32C P0 58W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 23% 30C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 108... Off | 00000000:84:00.0 Off | N/A |
| 23% 31C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 108... Off | 00000000:85:00.0 Off | N/A |
| 23% 30C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:88:00.0 Off | N/A |
| 23% 31C P0 57W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 108... Off | 00000000:89:00.0 Off | N/A |
| 23% 28C P0 56W / 250W | 0MiB / 11172MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
My coworker sees exactly the same as I do if I do nothing.
Nothing happens after “CUDA_VISIBLE_DEVICES=0,1,2,3 nvidia-smi”
Whether in my shell or my coworker's shell, nvidia-smi shows the same output as above, with or without the conda environment.
When running the Python script, the GPUs disappear!
When I run sh train.sh inside the conda environment, where train.sh contains the following:
#!/usr/bin/env bash
CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python main.py --fuse-type=average --batch-size=64 --tsn --model=mlp --epochs=1 --lr=5e-4 2>&1 &
At that point, I can watch the GPUs disappear one by one, starting from GPU 7, until GPUs 4-7 have all disappeared.
The same phenomenon can be observed in both my shell and my coworker's shell.
While the script is still running and half of the GPUs have disappeared, nvidia-smi shows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 23% 29C P2 53W / 250W | 525MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 23% 31C P2 55W / 250W | 521MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 23% 29C P2 54W / 250W | 521MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 23% 26C P8 10W / 250W | 18MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 50371 C python 515MiB |
| 1 50371 C python 511MiB |
| 2 50371 C python 511MiB |
| 3 50371 C python 8MiB |
+-----------------------------------------------------------------------------+
After the script finishes, everything is restored:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 23% 30C P0 55W / 250W | 595MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 23% 31C P0 56W / 250W | 569MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 23% 31C P0 56W / 250W | 569MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 23% 29C P0 55W / 250W | 569MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 108... Off | 00000000:84:00.0 Off | N/A |
| 23% 28C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 108... Off | 00000000:85:00.0 Off | N/A |
| 23% 28C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:88:00.0 Off | N/A |
| 23% 30C P0 57W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 108... Off | 00000000:89:00.0 Off | N/A |
| 23% 27C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
After “Run export CUDA_VISIBLE_DEVICES=0,1 in one shell”, nvidia-smi in both shells still shows 8 GPUs.
Checking torch.cuda.device_count() in both shells after one of them ran the export: the phenomenon you expected happens; the user that ran the export gets 2, while the other gets 8.
In short, everything happens as you expected.
albanD (Alban D), July 20, 2018, 3:16pm, #12
OK,
So it is something in your actual script that causes this.
Could you disable the DataParallel? It will run on only one of the 4 available GPUs, but that should work.
When it is disabled, do you still see the GPUs disappear while the script runs?
Also, what do you mean by “DataParallel using 20 workers”? There is no notion of workers for DataParallel. Do you repeat some GPUs in the device_ids list? If so, try setting each GPU only once and see what it does.
Yes, all GPUs are still there!
I only commented out DataParallel in my code and reran the script:
model = model.cuda()
# model = torch.nn.DataParallel(model)
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 23% 26C P2 56W / 250W | 549MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 23% 23C P8 9W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 23% 22C P8 9W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 23% 22C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 108... Off | 00000000:84:00.0 Off | N/A |
| 23% 23C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 108... Off | 00000000:85:00.0 Off | N/A |
| 23% 23C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:88:00.0 Off | N/A |
| 23% 25C P8 9W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 108... Off | 00000000:89:00.0 Off | N/A |
| 23% 23C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 51957 C python 539MiB |
+-----------------------------------------------------------------------------+
In both shells, things look exactly the same!
So in your judgement, what is the key factor that triggers this phenomenon?
Thanks a lot!
albanD (Alban D), July 20, 2018, 3:31pm, #14
DataParallel uses quite complex backend methods in CUDA and has some behaviour that is “unpredictable” (by me) when it is not used in the classical way.
In particular it uses NCCL2, which has some system-wide impact on the GPUs.
When the GPUs disappear, is your code actually running? If so, does it use multiple GPUs or just one?
From reading the doc, I am not sure what happens when you do not provide any device_ids to DataParallel. I would replace your line with model = torch.nn.DataParallel(model, device_ids=[0, 1, 2, 3]) and check whether that works and properly uses all 4 GPUs. Setting only [0, 1] should use only 2.
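To make the index mapping concrete, a minimal sketch (assuming the process was launched with CUDA_VISIBLE_DEVICES=2,3,4,5; nn.Linear is just a stand-in model): after remapping, device_ids refer to the renumbered devices 0-3, not the physical GPU numbers.
# Minimal sketch: logical device_ids 0..3 map to physical GPUs 2..5 here.
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()                        # lands on logical device 0 (physical GPU 2)
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])  # spreads over physical GPUs 2,3,4,5
out = model(torch.randn(64, 128).cuda())
print(out.shape)                                         # torch.Size([64, 10])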
Yes, it works when using DataParallel(model, device_ids=[0,1,2,3])!
First I tried:
CUDA_VISIBLE_DEVICES=2,3,4,5
...
model = model.cuda()
model = torch.nn.DataParallel(model, device_ids=[0,1])
And we got:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 23% 24C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 23% 25C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 23% 27C P2 54W / 250W | 595MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 23% 26C P2 53W / 250W | 569MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 108... Off | 00000000:84:00.0 Off | N/A |
| 23% 26C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 108... Off | 00000000:85:00.0 Off | N/A |
| 23% 25C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:88:00.0 Off | N/A |
| 23% 27C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 108... Off | 00000000:89:00.0 Off | N/A |
| 23% 25C P8 9W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 52106 C python 585MiB |
| 3 52106 C python 559MiB |
+-----------------------------------------------------------------------------+
Then I tried:
CUDA_VISIBLE_DEVICES=2,3,4,5
...
model = model.cuda()
model = torch.nn.DataParallel(model, device_ids=[0,1,2,3])
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 23% 24C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 23% 25C P8 9W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 23% 32C P2 55W / 250W | 595MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 23% 30C P2 54W / 250W | 569MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 108... Off | 00000000:84:00.0 Off | N/A |
| 23% 30C P2 53W / 250W | 569MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 108... Off | 00000000:85:00.0 Off | N/A |
| 23% 29C P2 53W / 250W | 569MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:88:00.0 Off | N/A |
| 23% 27C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 108... Off | 00000000:89:00.0 Off | N/A |
| 23% 25C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 52926 C python 585MiB |
| 3 52926 C python 559MiB |
| 4 52926 C python 559MiB |
| 5 52926 C python 559MiB |
+-----------------------------------------------------------------------------+
Now you can see everything works well!
Thank you very much for your patient help! I never thought this problem would be triggered by DataParallel; it should use all the GPUs visible to Python. I don't know why it fails when I don't provide device_ids.
Sincere thanks to @albanD and @aplassard.
The PyTorch community thrives thanks to your quick and helpful replies. (I never thought a library forum could be this active before I joined here.)
albanD (Alban D), July 20, 2018, 3:54pm, #16
Great, everything works as expected there!
I will report the problem that occurs when no device_ids is provided; I'm not sure what the intended behaviour was, but I'm sure making GPUs disappear is not what we want!
I am sorry, but today this phenomenon happened again.
After tons of testing, I found the key combination of factors:
- using device_ids=[0,1,2,3] in DataParallel
- using nohup to run the Python script
- using a bash script to set CUDA_VISIBLE_DEVICES and run Python
I don't have time to find out why this happens, so for now I just work around those odd conditions when running my code. The code is provided below in case it helps somebody, or the staff, figure out the reason.
In my bash script:
#!/usr/bin/env bash
CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python main.py --fuse-type=cnn --batch-size=256 --model=cnn --epochs=50 --lr=5e-4 2>&1 &
echo 'Start training!'
In my code:
if args.model == 'cnn':
    model = CNNBase(4, feature_length, num_class)
elif args.model == 'mlp':
    model = MLPBase(feature_length, num_class)
model = model.cuda()
model = torch.nn.DataParallel(model, device_ids=[0,1,2,3])
UPDATE:
I found a cleaner way to deal with this problem: declare device_ids in DataParallel without setting the environment variable at all.
When I use device_ids=[0,1,2,3] and let torch.cuda.device_count() return 8, the program still runs on 4 GPUs, and all 8 GPUs remain visible in nvidia-smi.
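For completeness, a minimal sketch of that cleaner setup (a reconstruction; nn.Linear stands in for the CNNBase/MLPBase models from the thread): no CUDA_VISIBLE_DEVICES anywhere, and the restriction comes purely from device_ids.
# Minimal sketch of the workaround: no environment variable is set.
import torch
import torch.nn as nn

print(torch.cuda.device_count())                         # 8: nothing is hidden from the process

model = nn.Linear(128, 10).cuda()                        # parameters on GPU 0
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])  # training restricted to GPUs 0-3

out = model(torch.randn(64, 128).cuda())                 # GPUs 4-7 stay idle but visible in nvidia-smi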