I was trying to train my NLP model on multiple GPUs with 2 K80s. Each K80 has 2 cores, so there are 4 CUDA devices in total. My model works fine on the CPU, or on a single GPU with DataParallel or DistributedDataParallel, but as soon as I use 2 or more cores, something embarrassing happens: it always hangs.
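For context, PyTorch does see all four logical GPUs; this quick check runs fine:

import torch

print(torch.cuda.is_available())   # True
print(torch.cuda.device_count())   # 4 (two K80 boards, two GPUs each)

Here are the symptoms.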
DataParallel
clothModel = myModel.cuda()
clothModel = nn.DataParallel(clothModel) # <-- this works fine
······
out, loss = clothModel(input) # <-- the program always hangs on this line; I can't even shut it down with Ctrl+C. I found this out using the VS Code debugger.
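For reference, here is a minimal sketch of the pattern I am using (ToyModel is a stand-in for my real NLP model, which also returns a (output, loss) pair from forward):

import torch
import torch.nn as nn

# ToyModel stands in for my real model: any module that returns (output, loss)
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(128, 10)

    def forward(self, x):
        out = self.fc(x)
        loss = out.pow(2).mean()   # dummy loss computed inside forward, like my model does
        return out, loss

clothModel = ToyModel().cuda()
clothModel = nn.DataParallel(clothModel)   # replicate across all visible GPUs

input = torch.randn(64, 128).cuda()
out, loss = clothModel(input)              # <-- with 2+ GPUs this is where it hangs
loss.mean().backward()                     # DataParallel returns one loss per replica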
When I check nvidia-smi, I see this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:08:00.0 Off |                  Off |
| N/A   45C    P0    70W / 149W |   2113MiB / 12206MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:09:00.0 Off |                  Off |
| N/A   35C    P0    70W / 149W |    322MiB / 12206MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:86:00.0 Off |                  Off |
| N/A   39C    P0    57W / 175W |    311MiB / 12206MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:87:00.0 Off |                  Off |
| N/A   31C    P0    71W / 175W |    320MiB / 12206MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
After a whole night, it still looked exactly like this.
DistributedDataParallel
After DataParallel failed, I turned to DistributedDataParallel, which is the recommended approach, and it hangs at clothModel = nn.parallel.DistributedDataParallel(clothModel).
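In case it matters, this is roughly how I set things up (a sketch; ToyModel is the same stand-in as above, and I launch the script with python -m torch.distributed.launch --nproc_per_node=4 train.py):

import argparse
import torch
import torch.distributed as dist
import torch.nn as nn

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)   # filled in by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl', init_method='env://')

clothModel = ToyModel().cuda()   # ToyModel: stand-in for my real model, as above
clothModel = nn.parallel.DistributedDataParallel(
    clothModel, device_ids=[args.local_rank])   # <-- hangs here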
When I check nvidia-smi, it looks almost the same as with nn.DataParallel. This time I can use Ctrl+C, but the processes still remain on the GPUs, so the only way to exit is kill -9 PID. When I press Ctrl+C, it displays this:
File "/home/damaoooo/.conda/envs/test/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/damaoooo/.conda/envs/test/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/damaoooo/.conda/envs/test/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
main()
File "/home/damaoooo/.conda/envs/test/lib/python3.6/site-packages/torch/distributed/launch.py", line 256, in main
process.wait()
File "/home/damaoooo/.conda/envs/test/lib/python3.6/subprocess.py", line 1477, in wait
(pid, sts) = self._try_wait(0)
File "/home/damaoooo/.conda/envs/test/lib/python3.6/subprocess.py", line 1424, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
More interestingly, I built a simple CNN for MNIST, wrapped it in DataParallel or DistributedDataParallel, and it works perfectly (a sketch of that test is at the end of this post)… So I wonder: is there something wrong with my clothModel? If there is, why does it work fine when I switch to a single GPU? And how can I solve this confusing hang?
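For completeness, here is roughly the MNIST-style test that does work on the same machine (a sketch with random tensors standing in for the real dataset):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, 3, padding=1)
        self.fc = nn.Linear(16 * 28 * 28, 10)

    def forward(self, x):
        x = F.relu(self.conv(x))
        return self.fc(x.flatten(1))

model = nn.DataParallel(SmallCNN().cuda())
x = torch.randn(64, 1, 28, 28).cuda()      # fake MNIST batch
y = torch.randint(0, 10, (64,)).cuda()
loss = F.cross_entropy(model(x), y)        # runs fine across all 4 GPUs
loss.backward()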