DataParallel hangs on a machine with two K80s on board

I tried to parallelize the model across two K80 GPUs, but the code just hangs. It works fine on one GPU or on two GTX 1080s. What could be the issue?


Where is the code that you are referring to?

It is just
model = DataParallel(model).cuda()

If I use DataParallel and two GPUs, the code hangs.
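
For completeness, a minimal end-to-end sketch of the kind of setup that shows the behaviour (the layer sizes, dummy data and loss below are placeholders I made up, not the actual training script):

import torch
from torch import nn
from torch.nn import DataParallel

# Placeholder model and data; the real training script is not shown in the thread.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model = DataParallel(model).cuda()            # replicas on all visible GPUs

inputs = torch.randn(64, 512).cuda()
targets = torch.randint(0, 10, (64,)).cuda()

outputs = model(inputs)                       # forward pass scattered across GPUs
loss = nn.functional.cross_entropy(outputs, targets)
loss.backward()                               # gradients reduced back to GPU 0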

Any hints? I tested on several machines with two K80s on board and saw the same thing. @apaszke maybe you have an idea why it is happening?

See:

That answer is not relevant; the same code works on two GTX 1080s but hangs on two K80s.

@smth maybe you have an idea what could go wrong when using two K80s? Thank you in advance.

I’m not sure why it’s hanging on the K80.
If anyone could run the same program under gdb and, when it hangs, get me a stack trace, that would be helpful.

@smth Thank you for your reply! I tried to get a stack trace with gdb (sorry if I did it wrong, this is my first time using it):

#0  0x00002aaaab1e279b in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00002aaaab1e282f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00002aaaab1e28cb in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00002aaaaae60313 in PyThread_acquire_lock_timed (lock=0x2aab0c000c10, microseconds=-1000000, intr_flag=1) at Python/thread_pthread.h:352
#4  0x00002aaaaae66fe4 in acquire_timed (lock=0x2aab0c000c10, timeout=-1000000000) at ./Modules/_threadmodule.c:68
#5  0x00002aaaaae67126 in lock_PyThread_acquire_lock (self=0x2aaaf903ea08, args=<optimized out>, kwds=<optimized out>) at ./Modules/_threadmodule.c:151
#6  0x00002aaaaad90df2 in _PyCFunction_FastCallDict (func_obj=0x2aaaf9bbe8b8, args=0x2aaaf9bacaf8, nargs=<optimized out>, kwargs=0x0)
    at Objects/methodobject.c:231
#7  0x00002aaaaae164ec in call_function (pp_stack=0x7fffffffc078, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4798
#8  0x00002aaaaae1915d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#9  0x00002aaaaae14a60 in _PyEval_EvalCodeWithName (_co=0x2aaab2352a50, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1, 
    kwnames=0x0, kwargs=0x2aaaf9bb13a0, kwcount=0, kwstep=1, defs=0x2aaab2346f60, defcount=2, kwdefs=0x0, closure=0x0, name=0x2aaab2354420, 
    qualname=0x2aaab234f7b0) at Python/ceval.c:4128
#10 0x00002aaaaae1648a in fast_function (kwnames=<optimized out>, nargs=1, stack=<optimized out>, func=0x2aaab235dae8) at Python/ceval.c:4939
#11 call_function (pp_stack=0x7fffffffc318, oparg=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:4819
#12 0x00002aaaaae1915d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#13 0x00002aaaaae14a60 in _PyEval_EvalCodeWithName (_co=0x2aaab23529c0, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1, 
    kwnames=0x0, kwargs=0x6a6b62e0, kwcount=0, kwstep=1, defs=0x2aaab23562c8, defcount=1, kwdefs=0x0, closure=0x0, name=0x2aaaaaae2d18, qualname=0x2aaab2351a30)
    at Python/ceval.c:4128
#14 0x00002aaaaae1648a in fast_function (kwnames=<optimized out>, nargs=1, stack=<optimized out>, func=0x2aaab235da60) at Python/ceval.c:4939
#15 call_function (pp_stack=0x7fffffffc5b8, oparg=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:4819
#16 0x00002aaaaae1915d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#17 0x00002aaaaae14a60 in _PyEval_EvalCodeWithName (_co=0x2aaae39ceed0, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=4, 
    kwnames=0x0, kwargs=0x2aaaf9baadd8, kwcount=0, kwstep=1, defs=0x2aaae3727de0, defcount=2, kwdefs=0x0, closure=0x0, name=0x2aaae39d3730, 
    qualname=0x2aaae39d3730) at Python/ceval.c:4128
#18 0x00002aaaaae1648a in fast_function (kwnames=<optimized out>, nargs=4, stack=<optimized out>, func=0x2aaae39d6510) at Python/ceval.c:4939
#19 call_function (pp_stack=0x7fffffffc858, oparg=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:4819
#20 0x00002aaaaae1915d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#21 0x00002aaaaae13e74 in _PyFunction_FastCall (co=<optimized out>, args=<optimized out>, nargs=4, globals=<optimized out>) at Python/ceval.c:4880
#22 0x00002aaaaae165e8 in fast_function (kwnames=0x0, nargs=4, stack=<optimized out>, func=0x2aaae39ec158) at Python/ceval.c:4915
#23 call_function (pp_stack=0x7fffffffca88, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4819
#24 0x00002aaaaae1915d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#25 0x00002aaaaae14a60 in _PyEval_EvalCodeWithName (_co=0x2aaae39dc150, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=2, 
    kwnames=0x2aaaaaadd060, kwargs=0x2aaaaaadd068, kwcount=0, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x2aaae36a8ed8, 
    qualname=0x2aaae39e15d0) at Python/ceval.c:4128
#26 0x00002aaaaae14cfc in _PyFunction_FastCallDict (func=0x2aaae39e3f28, args=0x7fffffffccc0, nargs=2, kwargs=0x2aaaf9032090) at Python/ceval.c:5031
#27 0x00002aaaaad39ba6 in _PyObject_FastCallDict (func=0x2aaae39e3f28, args=0x7fffffffccc0, nargs=<optimized out>, kwargs=0x2aaaf9032090)
    at Objects/abstract.c:2295
#28 0x00002aaaaad39dfc in _PyObject_Call_Prepend (func=0x2aaae39e3f28, obj=0x2aaaf89735f8, args=0x2aaaf9048da0, kwargs=0x2aaaf9032090) at Objects/abstract.c:2358
#29 0x00002aaaaad39e96 in PyObject_Call (func=0x2aaab4146248, args=<optimized out>, kwargs=<optimized out>) at Objects/abstract.c:2246
#30 0x00002aaaaae1a236 in do_call_core (kwdict=0x2aaaf9032090, callargs=<optimized out>, func=0x2aaab4146248) at Python/ceval.c:5067
#31 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3366
#32 0x00002aaaaae14a60 in _PyEval_EvalCodeWithName (_co=0x2aaae3846660, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=2, 
    kwnames=0x0, kwargs=0x8, kwcount=0, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x2aaaaaae0170, qualname=0x2aaae3732ab0)
    at Python/ceval.c:4128
#33 0x00002aaaaae14cfc in _PyFunction_FastCallDict (func=0x2aaae397f510, args=0x7fffffffd0b0, nargs=2, kwargs=0x0) at Python/ceval.c:5031
#34 0x00002aaaaad39ba6 in _PyObject_FastCallDict (func=0x2aaae397f510, args=0x7fffffffd0b0, nargs=<optimized out>, kwargs=0x0) at Objects/abstract.c:2295
#35 0x00002aaaaad39dfc in _PyObject_Call_Prepend (func=0x2aaae397f510, obj=0x2aaaf89735f8, args=0x2aaaf9048d68, kwargs=0x0) at Objects/abstract.c:2358
#36 0x00002aaaaad39e96 in PyObject_Call (func=0x2aaae2b1fec8, args=<optimized out>, kwargs=<optimized out>) at Objects/abstract.c:2246
#37 0x00002aaaaadb1baf in slot_tp_call (self=0x2aaaf89735f8, args=0x2aaaf9048d68, kwds=0x0) at Objects/typeobject.c:6167
#38 0x00002aaaaad39ade in _PyObject_FastCallDict (func=0x2aaaf89735f8, args=<optimized out>, nargs=<optimized out>, kwargs=0x0) at Objects/abstract.c:2316
#39 0x00002aaaaae162bb in call_function (pp_stack=0x7fffffffd3a8, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4822
#40 0x00002aaaaae1915d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#41 0x00002aaaaae13e74 in _PyFunction_FastCall (co=<optimized out>, args=<optimized out>, nargs=0, globals=<optimized out>) at Python/ceval.c:4880
#42 0x00002aaaaae165e8 in fast_function (kwnames=0x0, nargs=0, stack=<optimized out>, func=0x2aaaf8965f28) at Python/ceval.c:4915
#43 call_function (pp_stack=0x7fffffffd5d8, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4819
#44 0x00002aaaaae1915d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#45 0x00002aaaaae14a60 in _PyEval_EvalCodeWithName (_co=0x2aaaaac12c90, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0, 
    kwnames=0x0, kwargs=0x8, kwcount=0, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4128
#46 0x00002aaaaae14ee3 in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, 
    argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4149
#47 0x00002aaaaae14f2b in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:695
#48 0x00002aaaaae476c0 in run_mod (arena=0x2aaaaab2f1e0, flags=0x7fffffffd930, locals=0x2aaaaab97048, globals=0x2aaaaab97048, filename=0x2aaaaac83eb0, 
    mod=0x6bac40) at Python/pythonrun.c:980
#49 PyRun_FileExFlags (fp=0x642460, filename_str=<optimized out>, start=<optimized out>, globals=0x2aaaaab97048, locals=0x2aaaaab97048, closeit=<optimized out>, 
    flags=0x7fffffffd930) at Python/pythonrun.c:933
#50 0x00002aaaaae48c83 in PyRun_SimpleFileExFlags (fp=0x642460, filename=<optimized out>, closeit=1, flags=0x7fffffffd930) at Python/pythonrun.c:396
#51 0x00002aaaaae640b5 in run_file (p_cf=0x7fffffffd930, filename=0x604270 L"train.py", fp=0x642460) at Modules/main.c:338
#52 Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:810
#53 0x0000000000400c1d in main (argc=7, argv=<optimized out>) at ./Programs/python.c:69

I compiled PyTorch from source, but the same problem persists.

It also happens to me. My code runs fine on a Titan X (Pascal), but it hangs on a Titan X if I use DataParallel.

Our system administrator found that when we use multiple GPUs, the process ends up blocked in the following system call:

futex(0x33ee6d60, FUTEX_WAIT_PRIVATE, 0, NULL)
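
In case it helps anyone debugging a similar hang, a Python-side complement to the gdb trace above is the standard faulthandler module (this is just a suggestion of mine, not something anyone in the thread used); it dumps every thread's Python stack on demand:

import faulthandler
import signal

# Add near the top of train.py. When the script hangs, `kill -USR1 <pid>`
# prints the Python stack of every thread to stderr, showing where it is stuck.
faulthandler.register(signal.SIGUSR1, all_threads=True)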

@smth could you take a look at the stack trace if you have time? It looks like this problem is reproduced by many others :frowning:

I’m investigating this. It is hard to find a K80 (we don’t have them). I found one possible deadlock in the NCCL primitive bindings and I am fixing it; I am hoping that will resolve the K80 issue. I will ping back here once I push the fixes to master.

Hi everyone, NVIDIA’s @ngimel has investigated this problem, and the hangs might not be related to PyTorch. She has written a detailed comment here on diagnosing the issue and working around it:

Please have a look and see if it applies to you.
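
For anyone who cannot open the link, one quick sanity check (my own suggestion, not a summary of that comment) is whether a plain GPU-to-GPU copy already hangs outside of DataParallel; if it does, the problem is in the system-level peer-to-peer path rather than in PyTorch itself:

import torch

# Requires at least two visible GPUs. A cross-device copy exercises the same
# peer-to-peer transfer path that DataParallel uses for broadcast/reduce.
x = torch.randn(1024, 1024, device='cuda:0')
y = x.to('cuda:1')            # if P2P is broken at the system level, this copy can hang
torch.cuda.synchronize()
print('cross-GPU copy finished on', y.device)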


Sadly, I hit the error again.
I am going to look into the issue to find a solution.

The error occurs in loss.backward():

0it [00:00, ?it/s]Traceback (most recent call last):
  File "train.py", line 310, in <module>
    train_batch_conf()
  File "train.py", line 148, in train_batch_conf
    train(params_dict)
  File "train.py", line 193, in train
    loss.backward()
  File "/data1/public/research_venv/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 156, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/data1/public/research_venv/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 98, in backward
    variables, grad_variables, retain_graph)
  File "/data1/public/research_venv/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 25, in backward
    return comm.reduce_add_coalesced(grad_outputs, self.input_device)
  File "/data1/public/research_venv/anaconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 122, in reduce_add_coalesced
    result = reduce_add(flattened, destination)
  File "/data1/public/research_venv/anaconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 92, in reduce_add
    nccl.reduce(inputs, outputs, root=destination)
  File "/data1/public/research_venv/anaconda3/lib/python3.6/site-packages/torch/cuda/nccl.py", line 161, in reduce
    assert(root >= 0 and root < len(inputs))
AssertionError
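
The assertion in torch/cuda/nccl.py means the reduction root (the device that should receive the summed gradients) is not among the devices that hold the inputs. A hedged guess at a trigger, not confirmed in this thread, is an inconsistent device_ids/output_device combination; a sketch of a configuration that keeps them consistent:

import torch
from torch import nn
from torch.nn import DataParallel

model = nn.Linear(128, 10).cuda(0)
# Keep output_device inside device_ids: the NCCL reduce root must be one of
# the devices that actually hold the gradient replicas.
model = DataParallel(model, device_ids=[0, 1], output_device=0)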