DistributedDataParallel and DataParallel hang on a specific model

I was trying to train my NLP model on multiple GPUs with 2 K80s (each K80 board has two GPUs). My model works fine on the CPU or on a single GPU with DataParallel or DistributedDataParallel, but as soon as I use 2 or more GPUs an embarrassing thing happens: it always hangs. These are the symptoms.

DataParallel

clothModel = myModel.cuda()
clothModel = nn.DataParallel(clothModel) # <-- it works fine
······
out, loss = clothModel(input) #  <-- the program always hangs on this line, and I can't even use Ctrl+C to shut it down; I got this information from the VS Code debugger

When I check nvidia-smi, I see this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:08:00.0 Off |                  Off |
| N/A   45C    P0    70W / 149W |   2113MiB / 12206MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:09:00.0 Off |                  Off |
| N/A   35C    P0    70W / 149W |    322MiB / 12206MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:86:00.0 Off |                  Off |
| N/A   39C    P0    57W / 175W |    311MiB / 12206MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:87:00.0 Off |                  Off |
| N/A   31C    P0    71W / 175W |    320MiB / 12206MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

After a whole night it was still in this state.

DistributedDataParallel

After DataParallel failed, I turned to DistributedDataParallel, which is the recommended approach, and it hangs at clothModel = nn.parallel.DistributedDataParallel(clothModel).
nvidia-smi looks almost the same as it did with nn.DataParallel.
This time I can use Ctrl+C, but the processes still remain on the GPUs, so the only way to exit is kill -9 PID.
When I press Ctrl+C, it displays this:

  File "/home/damaoooo/.conda/envs/test/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/damaoooo/.conda/envs/test/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/damaoooo/.conda/envs/test/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/damaoooo/.conda/envs/test/lib/python3.6/site-packages/torch/distributed/launch.py", line 256, in main
    process.wait()
  File "/home/damaoooo/.conda/envs/test/lib/python3.6/subprocess.py", line 1477, in wait
    (pid, sts) = self._try_wait(0)
  File "/home/damaoooo/.conda/envs/test/lib/python3.6/subprocess.py", line 1424, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)

Even more interesting: I made a simple CNN for MNIST, switched it to DataParallel or DistributedDataParallel, and it works perfectly… Is there something wrong with my clothModel? If there is, why does it work fine when I switch to a single GPU?
And how can I solve this confusing hang?

You mean DDP hangs at the constructor? Can you attach the process to gdb and check the trace to see which line is causing the hang?

Have you set CUDA_VISIBLE_DEVICES or passed in the device_ids arg properly for DDP? Each DDP process should work exclusively on one GPU.
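
For example, something like this (just a sketch, assuming you start the script with torch.distributed.launch, which passes a --local_rank argument; adjust the names to your own code):

import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # supplied by torch.distributed.launch
args = parser.parse_args()

# Bind this process to exactly one GPU before doing any CUDA work
# and before constructing DistributedDataParallel.
torch.cuda.set_device(args.local_rank)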

Is there something wrong with my clothModel?

Given the trace, I assume you are using the launch script. With that, DDP should be constructed in the following way:

clothModel = DistributedDataParallel(clothModel, device_ids=[arg.local_rank], output_device=arg.local_rank)
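
A fuller sketch of that setup, under the same assumption that you use the launch script (myModel is a placeholder for your own module, and args.local_rank is the parsed --local_rank flag):

import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # supplied by the launch script
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)  # one process <-> one GPU

# The launch script exports MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE,
# so the default env:// rendezvous can pick them up.
dist.init_process_group(backend="nccl")

clothModel = myModel.cuda(args.local_rank)  # myModel: placeholder for your own nn.Module
clothModel = DistributedDataParallel(
    clothModel,
    device_ids=[args.local_rank],
    output_device=args.local_rank,
)

You would then start it with something like python -m torch.distributed.launch --nproc_per_node=4 train.py, where train.py stands for whatever your training script is called and 4 matches the number of GPUs on the node.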

Thanks a lot! That was the key to the problem. After trying it, I got it running successfully. But what about the DataParallel hang from the first question?


Not sure why DataParallel got stuck. The source code is here. Can you attach the process/threads to GDB and backtrace the stack?