DistributedDataParallel and DataParallel hang on a specific model

I was trying to train my NLP model on multiple GPUs with 2 K80s (each K80 board has two GPUs). My model works fine on the CPU or on a single GPU with DataParallel or DistributedDataParallel, but as soon as I use 2 or more GPUs an embarrassing thing happens: it always hangs. These are the symptoms.

DataParallel

clothModel = myModel.cuda()
clothModel = nn.DataParallel(clothModel) # <-- it works fine
······
out, loss = clothModel(input) #  <-- the program always hangs on this line, and I can't even use Ctrl+C to shut it down; I got this information from the VS Code debugger

When I check nvidia-smi, I see this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:08:00.0 Off |                  Off |
| N/A   45C    P0    70W / 149W |   2113MiB / 12206MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:09:00.0 Off |                  Off |
| N/A   35C    P0    70W / 149W |    322MiB / 12206MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:86:00.0 Off |                  Off |
| N/A   39C    P0    57W / 175W |    311MiB / 12206MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:87:00.0 Off |                  Off |
| N/A   31C    P0    71W / 175W |    320MiB / 12206MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

After a whole night it was still in this state.

DistributedDataParallel

After DataParallel failed, I turned to DistributedDataParallel, which is the recommended approach, and it hangs at clothModel = nn.parallel.DistributedDataParallel(clothModel).
nvidia-smi looks almost the same as it did with nn.DataParallel.
This time I can use Ctrl+C, but the processes still remain on the GPUs, so the only way to exit is kill -9 PID.
When I press Ctrl+C, it displays this:

  File "/home/damaoooo/.conda/envs/test/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/damaoooo/.conda/envs/test/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/damaoooo/.conda/envs/test/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/damaoooo/.conda/envs/test/lib/python3.6/site-packages/torch/distributed/launch.py", line 256, in main
    process.wait()
  File "/home/damaoooo/.conda/envs/test/lib/python3.6/subprocess.py", line 1477, in wait
    (pid, sts) = self._try_wait(0)
  File "/home/damaoooo/.conda/envs/test/lib/python3.6/subprocess.py", line 1424, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)

Even more interesting: I made a simple CNN for MNIST, switched it to DataParallel or DistributedDataParallel, and it works perfectly… Is there something wrong with my clothModel? If there is, why does it work fine when I switch to a single GPU?
And how can I solve this confusing hang?

You mean DDP hangs at the constructor? Can you attach the process to gdb and check the trace to see which line is causing the hang?

Have you set CUDA_VISIBLE_DEVICES or passed in the device_ids arg properly for DDP? Each DDP process should work exclusively on one GPU.
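
For example, something like this (just a sketch, assuming you start the script with torch.distributed.launch, which passes a --local_rank argument; adjust the names to your own code):

import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # supplied by torch.distributed.launch
args = parser.parse_args()

# Bind this process to exactly one GPU before doing any CUDA work
# and before constructing DistributedDataParallel.
torch.cuda.set_device(args.local_rank)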

Is there something wrong with my clothModel?

Given the trace, I assume you are using the launch script. With that, DDP should be constructed in the following way:

clothModel = DistributedDataParallel(clothModel, device_ids=[arg.local_rank], output_device=arg.local_rank)
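
A fuller sketch of that setup, under the same assumption that you use the launch script (myModel is a placeholder for your own module, and args.local_rank is the parsed --local_rank flag):

import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # supplied by the launch script
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)  # one process <-> one GPU

# The launch script exports MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE,
# so the default env:// rendezvous can pick them up.
dist.init_process_group(backend="nccl")

clothModel = myModel.cuda(args.local_rank)  # myModel: placeholder for your own nn.Module
clothModel = DistributedDataParallel(
    clothModel,
    device_ids=[args.local_rank],
    output_device=args.local_rank,
)

You would then start it with something like python -m torch.distributed.launch --nproc_per_node=4 train.py, where train.py stands for whatever your training script is called and 4 matches the number of GPUs on the node.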

Thanks a lot! That was the key to the problem. After trying it, I got it running successfully. But what about the DataParallel hang from the first question?


Not sure why DataParallel got stuck. The source code is here. Can you attach the process/threads to GDB and backtrace the stack?