Training Job Stalls with no Logs & GPU Usage Spike

I am training a model that consists of some nn.Embeddings, an LSTM, a feed-forward layer, and torch.distributions.negative_binomial.NegativeBinomial.

After around 90 epochs (~7 hours with num_workers=7, ~13 hours with num_workers=0), the job stalls, stops outputting logs, and GPU usage spikes to 100%. I am running the job on a P100 on Google Cloud AI Platform, and I also see the problem when running on Google Cloud Compute Engine with a P100.

This is the result of ps -aux:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  18376  2968 ?        Ss   Jul17   0:00 /bin/bash ./execute.sh train production 2020-07-10
root         6 96.9 35.6 40669080 22036460 ?   Rl   Jul17 4154:35 python -m sales_forecasting.train --config production --ts 2020-07-10
root     22525  0.0 35.0 40641548 21673036 ?   Sl   Jul17   1:03 python -m sales_forecasting.train --config production --ts 2020-07-10
root     22526  0.0 35.0 40641560 21673048 ?   Sl   Jul17   0:57 python -m sales_forecasting.train --config production --ts 2020-07-10
root     22527  0.0 35.0 40638804 21670476 ?   Sl   Jul17   0:57 python -m sales_forecasting.train --config production --ts 2020-07-10
root     22528  0.0 35.0 40638816 21670476 ?   Sl   Jul17   0:57 python -m sales_forecasting.train --config production --ts 2020-07-10
root     22529  0.0 35.0 40641596 21673092 ?   Sl   Jul17   0:58 python -m sales_forecasting.train --config production --ts 2020-07-10
root     22530  0.0 35.0 40641608 21673148 ?   Sl   Jul17   0:56 python -m sales_forecasting.train --config production --ts 2020-07-10
root     22531  0.0 35.0 40641620 21673236 ?   Sl   Jul17   0:57 python -m sales_forecasting.train --config production --ts 2020-07-10

As mentioned, there is no stack trace with SIGINT or SIGQUIT, but when forcing a stack trace using the signal library it outputs:

Current thread 0x00007f021d6f3740 (most recent call first):
  File "/opt/cpml/sales_forecasting/sales_forecasting/DeepAR/loss.py", line 32 in neg_log_likelihood
  File "/opt/cpml/sales_forecasting/sales_forecasting/DeepAR/estimator.py", line 156 in train_batch
  File "/opt/cpml/sales_forecasting/sales_forecasting/train.py", line 53 in train_epoch
  File "/opt/cpml/sales_forecasting/sales_forecasting/train.py", line 96 in train
  File "/opt/cpml/sales_forecasting/sales_forecasting/train.py", line 246 in <module>
  File "/opt/conda/lib/python3.7/runpy.py", line 85 in _run_code
  File "/opt/conda/lib/python3.7/runpy.py", line 193 in _run_module_as_main

The neg_log_likelihood function is here:

def neg_log_likelihood(
        distribution: Distribution,
        target: torch.Tensor
) -> Tuple[torch.Tensor, int]:
    log_likelihood = distribution.log_prob(target)

    non_zero_mask = target != 0.0

    if not any(non_zero_mask):
        return -torch.sum(log_likelihood), len(target)

    return -torch.sum(log_likelihood[non_zero_mask]), torch.sum(non_zero_mask)
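
One thing that stands out in the function above, separately from the stall: Python's builtin any() iterates over the tensor and converts each element to a bool, and for a CUDA tensor every such conversion is its own device-to-host copy. Below is a minimal sketch of a variant that reduces on the device first, assuming target is 1-D. The name neg_log_likelihood_single_sync is made up for illustration, and this is not a confirmed fix for the hang; it only cuts down how often the host has to wait on the GPU.

```python
from typing import Tuple

import torch
from torch.distributions import Distribution, NegativeBinomial


def neg_log_likelihood_single_sync(
        distribution: Distribution,
        target: torch.Tensor
) -> Tuple[torch.Tensor, int]:
    log_likelihood = distribution.log_prob(target)

    non_zero_mask = target != 0.0

    # Tensor.any() reduces on the device; bool() then needs a single
    # device-to-host copy instead of one per element.
    if not bool(non_zero_mask.any()):
        return -torch.sum(log_likelihood), len(target)

    # int() forces one more host sync so the count matches the annotation.
    return -torch.sum(log_likelihood[non_zero_mask]), int(non_zero_mask.sum())
```

With a small CPU example such as NegativeBinomial(total_count=torch.tensor([5.0, 5.0, 5.0]), probs=torch.tensor([0.5, 0.5, 0.5])) and target torch.tensor([0.0, 2.0, 3.0]), the returned count is 2 and the loss sums only the two non-zero positions.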

Finally, resource usage changes can be seen here:

Since the job takes 8 hours to fail, I haven’t yet isolated the code to a minimum example that fails. Has anyone seen anything similar before? Any idea what the issue could be?

Also worth noting is that the issue only occurs when using GPU, not CPU.

Worth noting is that line 32 in loss.py is the if statement. Also, machine CPU/memory usage is here (memory usage differs from the ps -aux logs above because they are from different jobs):

Hi,

Do you think you will be able to get a stack trace of where it is stuck?
Also can it be a hardware or allocation issue?
Nothing looks suspicious in the code you shared at first glance.

No, unfortunately the code fails in such a way that there is no stack trace; it simply stalls. The program will not even respond to Ctrl+C. The best I could do for a stack trace was to use the faulthandler library, which produced the trace you see in my post.

With your hardware question - RAM, CPU, Disk Usage, and GPU memory usage all seem fine. I would be inclined to think it’s not a hardware issue since the issue happens both on AI-platform and a custom VM. Perhaps there’s an incompatibility with the P100 in particular? I don’t know how likely that is.

Happy to try some things out if you have any suggestions.
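
For anyone wanting to reproduce the faulthandler approach mentioned above, here is a minimal sketch using only the standard library. The helper name install_stack_dump_handler and the choice of SIGUSR1 are assumptions for illustration, not necessarily what was used in this job. faulthandler installs a C-level signal handler, so it can print the Python stack even while the interpreter is blocked inside a native call and an ordinary Ctrl+C is never processed.

```python
import faulthandler
import signal
import sys


def install_stack_dump_handler(stream=sys.stderr, signum=signal.SIGUSR1):
    """Dump the Python stack of every thread to `stream` when `signum` arrives."""
    # `stream` must stay open and expose fileno(); faulthandler writes to the
    # file descriptor directly from the signal handler.
    faulthandler.register(signum, file=stream, all_threads=True, chain=False)
```

Call install_stack_dump_handler() once at the start of training, then run kill -USR1 <pid> from another shell when the job appears stuck.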

I was thinking of a cpp stack trace with gdb or similar tools. But I understand that it can be tricky on cloud platforms :confused:

Happy to try some things out if you have any suggestions.

If you can manage to reduce your code sample enough (generate random data, shrink the net, remove logging, etc), then we definitely would be able to investigate it further by running it on different machines where we can use gdb and different hardware.

I was thinking of a cpp stack trace with gdb or similar tools. But I understand that it can be tricky on cloud platforms

Oh cool did not realize that was possible. This should be something I can do despite being on cloud since I can reproduce the error on a VM. Honestly I’m really not familiar with Python’s internals or C in general. I found this link: https://wiki.python.org/moin/DebuggingWithGdb that makes it look like I can just install gdb and run my training with that, and when it fails I’ll be able to see the C stack trace. Is that what you’re referring to?

In the mean time I’ll work on getting an MVE (minimum viable error :slight_smile: ) and post it here when I do.

If you normally run it from the command line as python foo.py --arg1 bar, then you can do the following:
Start gdb with gdb python. Then, at the gdb prompt, type run foo.py --arg1 bar.
This will just run your program as before (the start might be a bit slower but that’s it).
Once the program hangs, you can hit Ctrl+C to get back to the gdb prompt.
There you can use bt to get the backtrace,
and info threads to get info about the different threads and where they are.
Sharing both here should give us much more info!
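
Since the hang takes hours to appear, an alternative to starting the job under gdb is to attach gdb to the process once it is already stuck. The sketch below builds such a command from Python; the helper names are hypothetical, and it assumes gdb is installed and that you have permission to attach to the process (often this means running as root). The -p, -batch and -ex options are standard gdb flags; the two commands passed are the same bt and info threads described above.

```python
import subprocess
from typing import List


def gdb_attach_command(pid: int) -> List[str]:
    """Build a gdb command that attaches to `pid`, prints thread info and a backtrace."""
    return [
        "gdb", "-p", str(pid),
        "-batch",               # run the -ex commands, then exit (detaching)
        "-ex", "info threads",
        "-ex", "bt",
    ]


def dump_native_stack(pid: int) -> str:
    """Run gdb against `pid` and return everything it printed."""
    result = subprocess.run(
        gdb_attach_command(pid),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout
```

For example, print(dump_native_stack(6)) would target the main training process if it has PID 6, as in the ps -aux listing earlier in the thread.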

Alright, I ran it with GDB and this is what I saw:

#0 0x00007ffedccbab44 in clock_gettime ()

#1 0x00007fab70bdbea6 in __GI___clock_gettime (clock_id=4, tp=0x7ffedcc243a0) at ../sysdeps/unix/clock_gettime.c:115

#2 0x00007fa925c2470e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

#3 0x00007fa925cfd837 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

#4 0x00007fa925bc5b6c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

#5 0x00007fa925c01660 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

#6 0x00007fa925b38e98 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

#7 0x00007fa925b3942c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

#8 0x00007fab6a128a27 in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/libcudart-1581fefa.so.10.0

#9 0x00007fab6a1202a0 in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/libcudart-1581fefa.so.10.0

#10 0x00007fab6a12d6a7 in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/libcudart-1581fefa.so.10.0

#11 0x00007fab6a12f2c1 in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/libcudart-1581fefa.so.10.0

#12 0x00007fab6a12243e in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/libcudart-1581fefa.so.10.0

#13 0x00007fab6a111de8 in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/libcudart-1581fefa.so.10.0

#14 0x00007fab6a14323c in cudaMalloc () from /opt/conda/lib/python3.7/site-packages/torch/lib/libcudart-1581fefa.so.10.0

#15 0x00007fab6c9af477 in c10::cuda::CUDACachingAllocator::THCCachingAllocator::malloc(void**, unsigned long, CUstream_st*) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so

#16 0x00007fab6c9b0d5e in c10::cuda::CUDACachingAllocator::CudaCachingAllocator::allocate(unsigned long) const () from /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so

#17 0x00007fab01cad094 in at::native::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so

#18 0x00007fab0057d8d8 in at::CUDAType::(anonymous namespace)::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so

#19 0x00007faafdf0fc47 in c10::detail::wrap_kernel_functor_unboxed_<c10::detail::WrapRuntimeKernelFunctor_<at::Tensor (*)(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>), at::Tensor, c10::guts::typelist::typelist<c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat> > >, at::Tensor (c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so

#20 0x00007faaffecf8a5 in torch::autograd::VariableType::(anonymous namespace)::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so

#21 0x00007faafdf0fc47 in c10::detail::wrap_kernel_functor_unboxed_<c10::detail::WrapRuntimeKernelFunctor_<at::Tensor (*)(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>), at::Tensor, c10::guts::typelist::typelist<c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat> > >, at::Tensor (c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so

#22 0x00007faafdc78456 in at::native::to_impl(at::Tensor const&, c10::TensorOptions const&, bool, bool, c10::optional<c10::MemoryFormat>) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so

#23 0x00007faafdc79805 in at::native::to(at::Tensor const&, c10::TensorOptions const&, bool, bool, c10::optional<c10::MemoryFormat>) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so

#24 0x00007faafdfbdcaa in at::TypeDefault::to(at::Tensor const&, c10::TensorOptions const&, bool, bool, c10::optional<c10::MemoryFormat>) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so

#25 0x00007faaffca3976 in torch::autograd::VariableType::(anonymous namespace)::to(at::Tensor const&, c10::TensorOptions const&, bool, bool, c10::optional<c10::MemoryFormat>) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so

#26 0x00007faafe0086f2 in c10::detail::wrap_kernel_functor_unboxed_<c10::detail::WrapRuntimeKernelFunctor_<at::Tensor (*)(at::Tensor const&, c10::TensorOptions const&, bool, bool, c10::optional<c10::MemoryFormat>), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::TensorOptions const&, bool, bool, c10::optional<c10::MemoryFormat> > >, at::Tensor (at::Tensor const&, c10::TensorOptions const&, bool, bool, c10::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, at::Tensor const&, c10::TensorOptions const&, bool, bool, c10::optional<c10::MemoryFormat>) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so

#27 0x00007fab4448bd80 in torch::autograd::dispatch_to(at::Tensor const&, c10::Device, bool, bool, c10::optional<c10::MemoryFormat>) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so

#28 0x00007fab445b7920 in torch::autograd::THPVariable_cuda(_object*, _object*, _object*) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so

#29 0x0000561809be9c94 in _PyMethodDef_RawFastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:694

#30 0x0000561809bf0aef in _PyMethodDescr_FastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/descrobject.c:288

#31 0x0000561809c5537c in call_function (kwnames=0x0, oparg=2, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4593

#32 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3110

#33 0x0000561809b9959a in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3930

#34 0x0000561809be9497 in _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:433

#35 0x0000561809c50be6 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4616

#36 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3124

#37 0x0000561809be920b in function_code_fastcall (globals=<optimized out>, nargs=2, args=<optimized out>, co=<optimized out>) at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:283

#38 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:408

#39 0x0000561809c55229 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4616

#40 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3093

#41 0x0000561809be920b in function_code_fastcall (globals=<optimized out>, nargs=2, args=<optimized out>, co=<optimized out>) at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:283

#42 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:408

#43 0x0000561809c55229 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4616

#44 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3093

#45 0x0000561809b99b00 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3930

#46 0x0000561809be9497 in _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:433

#47 0x0000561809c55229 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4616

#48 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3093

#49 0x0000561809be920b in function_code_fastcall (globals=<optimized out>, nargs=1, args=<optimized out>, co=<optimized out>) at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:283

#50 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:408

#51 0x0000561809c55229 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4616

#52 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3093

#53 0x0000561809be920b in function_code_fastcall (globals=<optimized out>, nargs=2, args=<optimized out>, co=<optimized out>) at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:283

#54 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:408

#55 0x0000561809c50be6 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4616

#56 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3124

#57 0x0000561809b992b9 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3930

#58 0x0000561809b9a1d4 in PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3959

#59 0x0000561809b9a1fc in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:524

#60 0x0000561809c602ed in builtin_exec_impl.isra.12 (locals=0x7fab71213b40, globals=0x7fab71213b40, source=0x7fab70049540) at /tmp/build/80754af9/python_1588882889832/work/Python/bltinmodule.c:1079

#61 builtin_exec () at /tmp/build/80754af9/python_1588882889832/work/Python/clinic/bltinmodule.c.h:283

#62 0x0000561809be9b19 in _PyMethodDef_RawFastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:655

#63 0x0000561809be9db1 in _PyCFunction_FastCallKeywords (func=0x7fab7129be10, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:734

#64 0x0000561809c54e94 in call_function (kwnames=0x0, oparg=2, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4568

#65 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3124

#66 0x0000561809b992b9 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3930

#67 0x0000561809be9435 in _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:433

#68 0x0000561809c50be6 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4616

#69 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3124

#70 0x0000561809b992b9 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3930

#71 0x0000561809b9a3e5 in _PyFunction_FastCallDict () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:376

#72 0x0000561809ca8ce7 in pymain_run_module () at /tmp/build/80754af9/python_1588882889832/work/Modules/main.c:355

#73 0x0000561809cbb60b in pymain_run_python (pymain=0x7ffedcc27390) at /tmp/build/80754af9/python_1588882889832/work/Modules/main.c:2899

#74 pymain_main () at /tmp/build/80754af9/python_1588882889832/work/Modules/main.c:3442

#75 0x0000561809cbb6fc in _Py_UnixMain () at /tmp/build/80754af9/python_1588882889832/work/Modules/main.c:3477

#76 0x00007fab70accb97 in __libc_start_main (main=0x561809b7a3a0 <main>, argc=7, argv=0x7ffedcc274e8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffedcc274d8) at ../csu/libc-start.c:310

#77 0x0000561809c603c0 in _start () at ../sysdeps/x86_64/elf/start.S:103

Thread info is here:

Id   Target Id         Frame 
* 1    Thread 0x7fab712d5740 (LWP 1054) "python3" 0x00007ffedccbab44 in clock_gettime ()
  17   Thread 0x7fab4c36d700 (LWP 1936) "jemalloc_bg_thd" 0x00007fab70ea99f3 in futex_wait_cancelable (private=<optimized out>, 
    expected=0, futex_word=0x7faaf300a5f4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  18   Thread 0x7fab4eb6e700 (LWP 1937) "python3" 0x00007fab70ea99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x56180c828b40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  19   Thread 0x7fab5136f700 (LWP 1938) "python3" 0x00007fab70ea99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x56180c828b40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  20   Thread 0x7fab51b70700 (LWP 1939) "python3" 0x00007fab70ea99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x56180c828b40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  21   Thread 0x7faaf22a5700 (LWP 1940) "python3" 0x00007fab70ea99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x56180c828b40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  22   Thread 0x7faaf1aa4700 (LWP 1941) "python3" 0x00007fab70ea99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x56180c828b40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  23   Thread 0x7faaf12a3700 (LWP 1942) "python3" 0x00007fab70ea99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x56180c828b40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  24   Thread 0x7faaf0aa2700 (LWP 1943) "python3" 0x00007fab70ea99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x56180c828b40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  25   Thread 0x7faaebfff700 (LWP 1944) "python3" 0x00007fab70ea99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x56180c828b40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  26   Thread 0x7faaeb7fe700 (LWP 1945) "python3" 0x00007fab70ea99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x56180c828b40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  27   Thread 0x7faaeaffd700 (LWP 1946) "python3" 0x00007fab70ea99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x56180c828b40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  28   Thread 0x7faaea7fc700 (LWP 1947) "python3" 0x00007fab70ea99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x56180c828b40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  29   Thread 0x7faae9ffb700 (LWP 1948) "python3" 0x00007fab70ea99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x56180c828b40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  30   Thread 0x7faae97fa700 (LWP 1949) "python3" 0x00007fab70ea99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x56180c828b40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  31   Thread 0x7faae8ff9700 (LWP 1950) "python3" 0x00007fab70ea99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x56180c828b40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  32   Thread 0x7faae87f8700 (LWP 1951) "python3" 0x00007fab70ea99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x56180c828b40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  33   Thread 0x7faae7ff7700 (LWP 1952) "python3" 0x00007fab70ea99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x56180c828b40) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  34   Thread 0x7faae71ff700 (LWP 1953) "jemalloc_bg_thd" 0x00007fab70ea99f3 in futex_wait_cancelable (private=<optimized out>, 
    expected=0, futex_word=0x7faaf300a6c4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  35   Thread 0x7faae61ff700 (LWP 1954) "jemalloc_bg_thd" 0x00007fab70ea99f3 in futex_wait_cancelable (private=<optimized out>, 
    expected=0, futex_word=0x7faaf300a790) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  36   Thread 0x7faae4fff700 (LWP 1955) "jemalloc_bg_thd" 0x00007fab70ea99f3 in futex_wait_cancelable (private=<optimized out>, 
    expected=0, futex_word=0x7faaf300a864) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  39   Thread 0x7faad0dcb700 (LWP 2819) "python3" 0x00007fab70bce237 in accept4 (fd=10, addr=..., addr_len=0x7faad0dcadf8, 
    flags=524288) at ../sysdeps/unix/sysv/linux/accept4.c:32
  40   Thread 0x7fa942fff700 (LWP 2820) "python3" 0x00007fab70bbfbf9 in __GI___poll (fds=0x7fa8c0000bd0, nfds=10, timeout=100)
    at ../sysdeps/unix/sysv/linux/poll.c:29

Does that give us anything useful?

Still working on a pared down version so I can post some code.

And out of curiosity, do you always end up in that clock_gettime() function if you restart from scratch and rerun this?

cc @ngimel does this ring a bell?

I’m running it again right now, but it takes up to ~10 hours to experience the error so I’ll let you know.

Sorry for the delay, I didn’t have a chance to get to it this weekend. I ran it again and the bt again ended with a call to clock_gettime(), but the frames leading up to it were different:

#0  0x00007ffe49b79b44 in clock_gettime ()
#1  0x00007f2a8a03fea6 in __GI___clock_gettime (clock_id=4, tp=0x7ffe49a56670) at ../sysdeps/unix/clock_gettime.c:115
#2  0x00007f281dd2470e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007f281ddfd837 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007f281de1f7a9 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007f281dcf9d88 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007f281dbfe612 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007f281dc00dcf in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8  0x00007f281dda4cb5 in cuMemcpyDtoHAsync_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#9  0x00007f2a8155834f in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/libcudart-1581fefa.so.10.0
#10 0x00007f2a81535643 in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/libcudart-1581fefa.so.10.0
#11 0x00007f2a81573af8 in cudaMemcpyAsync () from /opt/conda/lib/python3.7/site-packages/torch/lib/libcudart-1581fefa.so.10.0
#12 0x00007f2a1ab508a3 in at::native::_local_scalar_dense_cuda(at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so
#13 0x00007f2a19a2979b in at::CUDAType::(anonymous namespace)::_local_scalar_dense(at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so
#14 0x00007f2a173744dd in c10::detail::wrap_kernel_functor_unboxed_<c10::detail::WrapRuntimeKernelFunctor_<c10::Scalar (*)(at::Tensor const&), c10::Scalar, c10::guts::typelist::typelist<at::Tensor const&> >, c10::Scalar (at::Tensor const&)>::call(c10::OperatorKernel*, at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so
#15 0x00007f2a18f25d6f in torch::autograd::VariableType::(anonymous namespace)::_local_scalar_dense(at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so
#16 0x00007f2a173744dd in c10::detail::wrap_kernel_functor_unboxed_<c10::detail::WrapRuntimeKernelFunctor_<c10::Scalar (*)(at::Tensor const&), c10::Scalar, c10::guts::typelist::typelist<at::Tensor const&> >, c10::Scalar (at::Tensor const&)>::call(c10::OperatorKernel*, at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so
#17 0x00007f2a17097dfc in at::_local_scalar_dense(at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so
#18 0x00007f2a1709895f in at::native::item(at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so
#19 0x00007f2a174540e0 in at::TypeDefault::item(at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so
#20 0x00007f2a18fc525b in torch::autograd::VariableType::(anonymous namespace)::item(at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so
#21 0x00007f2a173744dd in c10::detail::wrap_kernel_functor_unboxed_<c10::detail::WrapRuntimeKernelFunctor_<c10::Scalar (*)(at::Tensor const&), c10::Scalar, c10::guts::typelist::typelist<at::Tensor const&> >, c10::Scalar (at::Tensor const&)>::call(c10::OperatorKernel*, at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so
#22 0x00007f2a16e085fc in at::Tensor::item() const () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so
#23 0x00007f2a170d454b in at::native::is_nonzero(at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so
#24 0x00007f2a19312988 in torch::autograd::VariableType::(anonymous namespace)::is_nonzero(at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so
#25 0x00007f2a5da57c14 in std::result_of<bool c10::Dispatcher::callUnboxed<bool, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&) const::{lambda(c10::DispatchTable const&)#1} (c10::DispatchTable const&)>::type c10::LeftRight<c10::DispatchTable>::read<bool c10::Dispatcher::callUnboxed<bool, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&) const::{lambda(c10::DispatchTable const&)#1}>(bool c10::Dispatcher::callUnboxed<bool, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&) const::{lambda(c10::DispatchTable const&)#1}&&) const () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#26 0x00007f2a5d987d20 in torch::autograd::THPVariable_is_nonzero(_object*, _object*) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#27 0x000055c5e3dbbaba in _PyMethodDef_RawFastCallDict () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:482
#28 0x000055c5e3dbbc81 in _PyCFunction_FastCallDict (func=0x7f27f816aa50, args=<optimized out>, nargs=<optimized out>, kwargs=<optimized out>) at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:586
#29 0x000055c5e3de8700 in call_unbound_noarg (self=0x7f27f86539b0, func=0x7f27f816aa50, unbound=0) at /tmp/build/80754af9/python_1588882889832/work/Objects/typeobject.c:1518
#30 slot_nb_bool () at /tmp/build/80754af9/python_1588882889832/work/Objects/typeobject.c:6251
#31 0x000055c5e3d82d73 in PyObject_IsTrue () at /tmp/build/80754af9/python_1588882889832/work/Objects/object.c:1425
#32 0x000055c5e3db9418 in builtin_any () at /tmp/build/80754af9/python_1588882889832/work/Python/bltinmodule.c:427
#33 0x000055c5e3deaabd in _PyMethodDef_RawFastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:648
#34 0x000055c5e3deadb1 in _PyCFunction_FastCallKeywords (func=0x7f2a8a6ffaa0, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:734
#35 0x000055c5e3e55e94 in call_function (kwnames=0x0, oparg=1, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4568
#36 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3124
#37 0x000055c5e3dea20b in function_code_fastcall (globals=<optimized out>, nargs=2, args=<optimized out>, co=<optimized out>) at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:283
#38 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:408
#39 0x000055c5e3e51be6 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4616
#40 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3124
#41 0x000055c5e3dea20b in function_code_fastcall (globals=<optimized out>, nargs=3, args=<optimized out>, co=<optimized out>) at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:283
#42 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:408
#43 0x000055c5e3e56229 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4616
#44 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3093
#45 0x000055c5e3d9ab00 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3930
#46 0x000055c5e3dea497 in _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:433
#47 0x000055c5e3e51be6 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4616
#48 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3124
#49 0x000055c5e3dea20b in function_code_fastcall (globals=<optimized out>, nargs=7, args=<optimized out>, co=<optimized out>) at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:283
#50 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:408
#51 0x000055c5e3e51be6 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4616
#52 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3124
#53 0x000055c5e3d9a2b9 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3930
#54 0x000055c5e3d9b1d4 in PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3959
#55 0x000055c5e3d9b1fc in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:524
#56 0x000055c5e3e612ed in builtin_exec_impl.isra.12 (locals=0x7f2a8a677b40, globals=0x7f2a8a677b40, source=0x7f2a894b04b0) at /tmp/build/80754af9/python_1588882889832/work/Python/bltinmodule.c:1079
#57 builtin_exec () at /tmp/build/80754af9/python_1588882889832/work/Python/clinic/bltinmodule.c.h:283
#58 0x000055c5e3deab19 in _PyMethodDef_RawFastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:655
#59 0x000055c5e3deadb1 in _PyCFunction_FastCallKeywords (func=0x7f2a8a6ffe10, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:734
#60 0x000055c5e3e55e94 in call_function (kwnames=0x0, oparg=2, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4568
#61 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3124
#62 0x000055c5e3d9a2b9 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3930
#63 0x000055c5e3dea435 in _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:433
#64 0x000055c5e3e51be6 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4616
#65 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3124
#66 0x000055c5e3d9a2b9 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3930
#67 0x000055c5e3d9b3e5 in _PyFunction_FastCallDict () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:376
#68 0x000055c5e3ea9ce7 in pymain_run_module () at /tmp/build/80754af9/python_1588882889832/work/Modules/main.c:355
#69 0x000055c5e3ebc60b in pymain_run_python (pymain=0x7ffe49a597e0) at /tmp/build/80754af9/python_1588882889832/work/Modules/main.c:2899
#70 pymain_main () at /tmp/build/80754af9/python_1588882889832/work/Modules/main.c:3442
#71 0x000055c5e3ebc6fc in _Py_UnixMain () at /tmp/build/80754af9/python_1588882889832/work/Modules/main.c:3477
#72 0x00007f2a89f30b97 in __libc_start_main (main=0x55c5e3d7b3a0 <main>, argc=7, argv=0x7ffe49a59938, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe49a59928) at ../csu/libc-start.c:310
#73 0x000055c5e3e613c0 in _start () at ../sysdeps/x86_64/elf/start.S:103

Thread Info:

(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7f2a8a739740 (LWP 154) "python3" 0x00007ffe49b79b44 in clock_gettime ()
  17   Thread 0x7f2a65791700 (LWP 1039) "jemalloc_bg_thd" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f2a0c40a5f4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  18   Thread 0x7f2a65f92700 (LWP 1040) "python3" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c5e8211dc4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  19   Thread 0x7f2a6a793700 (LWP 1041) "python3" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c5e8211dc4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  20   Thread 0x7f2a6af94700 (LWP 1042) "python3" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c5e8211dc4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  21   Thread 0x7f2a0b6a5700 (LWP 1043) "python3" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c5e8211dc4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  22   Thread 0x7f2a0aea4700 (LWP 1044) "python3" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c5e8211dc4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  23   Thread 0x7f2a0a6a3700 (LWP 1045) "python3" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c5e8211dc4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  24   Thread 0x7f2a09ea2700 (LWP 1046) "python3" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c5e8211dc4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  25   Thread 0x7f2a096a1700 (LWP 1047) "python3" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c5e8211dc4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  26   Thread 0x7f2a08ea0700 (LWP 1048) "python3" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c5e8211dc4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  27   Thread 0x7f2a03fff700 (LWP 1049) "python3" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c5e8211dc4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  28   Thread 0x7f2a037fe700 (LWP 1050) "python3" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c5e8211dc4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  29   Thread 0x7f2a02ffd700 (LWP 1051) "python3" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c5e8211dc4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  30   Thread 0x7f2a027fc700 (LWP 1052) "python3" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c5e8211dc4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  31   Thread 0x7f2a01ffb700 (LWP 1053) "python3" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c5e8211dc4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  32   Thread 0x7f2a017fa700 (LWP 1054) "python3" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c5e8211dc4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  33   Thread 0x7f2a00ff9700 (LWP 1055) "python3" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c5e8211dc4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  34   Thread 0x7f29dffff700 (LWP 1056) "jemalloc_bg_thd" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f2a0c40a6c4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  35   Thread 0x7f29d73ff700 (LWP 1057) "jemalloc_bg_thd" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f2a0c40a790) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  36   Thread 0x7f29defff700 (LWP 1058) "jemalloc_bg_thd" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f2a0c40a860) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  39   Thread 0x7f29d5bdd700 (LWP 1922) "python3" 0x00007f2a8a032237 in accept4 (fd=10, addr=..., addr_len=0x7f29d5bdcdf8, flags=524288) at ../sysdeps/unix/sysv/linux/accept4.c:32
  40   Thread 0x7f284283f700 (LWP 1923) "python3" 0x00007f2a8a023bf9 in __GI___poll (fds=0x7f27b8000bd0, nfds=10, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
  48   Thread 0x7f2804c3f700 (LWP 1952) "python3" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c7ef6201fc) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  49   Thread 0x7f27f69e1700 (LWP 1953) "python3" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c7ef61ffbc) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  50   Thread 0x7f281d23f700 (LWP 1961) "python3" 0x00007f2a8130faf1 in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/libgomp-7c85b1e2.so.1
  51   Thread 0x7f281daf9700 (LWP 1962) "python3" 0x00007f2a8130faf1 in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/libgomp-7c85b1e2.so.1
  52   Thread 0x7f28325fe700 (LWP 1963) "python3" 0x00007f2a8130faf1 in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/libgomp-7c85b1e2.so.1
  53   Thread 0x7f281ca3e700 (LWP 1964) "python3" 0x00007f2a8130faf1 in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/libgomp-7c85b1e2.so.1
  1279 Thread 0x7f27f59df700 (LWP 20593) "python3" 0x00007f2a8a3106d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f2738002d50) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  1280 Thread 0x7f27f61e0700 (LWP 20594) "python3" 0x00007f2a8a3106d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f278c003860) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  1281 Thread 0x7f27f3c7f700 (LWP 20595) "python3" 0x00007f2a8a3106d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f275c0047f0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  1282 Thread 0x7f2832dff700 (LWP 20596) "python3" 0x00007f2a8a3106d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f272c004600) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  1283 Thread 0x7f28408bf700 (LWP 20597) "python3" 0x00007f2a8a3106d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f27a40050c0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  1284 Thread 0x7f283097f700 (LWP 20598) "python3" 0x00007f2a8a3106d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f2734001440) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  1285 Thread 0x7f27f347e700 (LWP 20599) "python3" 0x00007f2a8a3106d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f273c003d70) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
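
For anyone skimming the thread list above, a small sketch that tallies what each thread is doing can make the picture easier to see. It assumes the text is plain gdb `info threads` output in the same layout as pasted here; `tally_threads` and `THREAD_ROW` are hypothetical names, and the sample rows are copied from the dump above.

```python
import re
from collections import Counter

# One row of gdb `info threads`: (LWP n) "thread name" 0xADDR in FUNC (
# Assumption: rows follow the layout pasted above; wrapped continuation
# lines simply do not match and are ignored.
THREAD_ROW = re.compile(r'\(LWP \d+\) "([^"]+)" 0x[0-9a-f]+ in (\S+) \(')


def tally_threads(info_threads_text):
    """Count threads by (thread name, innermost function)."""
    return Counter(THREAD_ROW.findall(info_threads_text))


SAMPLE = """\
* 1    Thread 0x7f2a8a739740 (LWP 154) "python3" 0x00007ffe49b79b44 in clock_gettime ()
  17   Thread 0x7f2a65791700 (LWP 1039) "jemalloc_bg_thd" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f2a0c40a5f4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  18   Thread 0x7f2a65f92700 (LWP 1040) "python3" 0x00007f2a8a30d9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c5e8211dc4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  39   Thread 0x7f29d5bdd700 (LWP 1922) "python3" 0x00007f2a8a032237 in accept4 (fd=10, addr=..., addr_len=0x7f29d5bdcdf8, flags=524288) at ../sysdeps/unix/sysv/linux/accept4.c:32
  50   Thread 0x7f281d23f700 (LWP 1961) "python3" 0x00007f2a8130faf1 in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/libgomp-7c85b1e2.so.1
"""

counts = tally_threads(SAMPLE)
for (name, func), n in counts.most_common():
    print(f"{n:3d}  {name:16s} {func}")
```

Run over the full dump, this should show a single thread (the main one) inside `clock_gettime` while the rest sit in futex waits, `accept4`, `poll`, or libgomp.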

@ngimel could this be due to an upstream issue with CUDA?

Any chance it could be related to NCCL? I found the thread "Distributed training hangs", which has a similar stack trace and references this GitHub issue: https://github.com/pytorch/pytorch/issues/20630.

I have pytorch 1.4.0+cu100 and apt search nccl returned

libnccl-dev/unknown 2.7.6-1+cuda11.0 amd64
  NVIDIA Collectives Communication Library (NCCL) Development Files

libnccl2/unknown 2.7.6-1+cuda11.0 amd64
  NVIDIA Collectives Communication Library (NCCL) Runtime

Is that potentially a version mismatch? On the other hand, I’m not using multiple GPUs, which is what NCCL seems to be meant for.

NCCL shouldn’t be used for a single-GPU run, and since you installed the binaries (as it seems based on the version number), the system NCCL, CUDA, and cuDNN won’t be used.

Could you update to 1.6.0 and rerun the code? Also, is the repository public, so that we could try to rerun it on a P100 machine?

I’ll run it with 1.6.0 today. The data is unfortunately locked down, but I’ll check whether the code is OK for me to share. I’m working on finding a minimal version that fails but haven’t had much luck yet. It’s a slow process since it takes so long to hit the error.
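
Since each reproduction takes hours, one option is to leave a watchdog armed around every training step so that a stall dumps the Python stacks of all threads by itself. The following is only a sketch using the standard-library `faulthandler` module; `run_with_watchdog`, `step_fn`, and the 600-second default are hypothetical and not taken from the code discussed here.

```python
import faulthandler
import sys


def run_with_watchdog(step_fn, n_steps, timeout_s=600.0, out=None):
    """Call step_fn(step) for each step with a stall watchdog armed.

    If a single step runs longer than timeout_s seconds, faulthandler
    writes the Python traceback of every thread to `out` (stderr by
    default). The process is left running (exit=False) so a debugger
    such as gdb can still be attached afterwards.
    """
    out = sys.stderr if out is None else out
    for step in range(n_steps):
        # Arm the timer for this step only.
        faulthandler.dump_traceback_later(timeout_s, exit=False, file=out)
        try:
            step_fn(step)
        finally:
            # Disarm so a slow data load between steps is not reported.
            faulthandler.cancel_dump_traceback_later()


# Hypothetical usage: train_one_step(step) would run one batch or epoch.
# run_with_watchdog(train_one_step, n_steps=100, timeout_s=1800.0)
```

The timeout needs to be comfortably longer than the slowest healthy step, otherwise normal steps will produce dumps as well.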

Also just FYI I also experience the error with a V100.

Thanks, I appreciate the help. I’ll also look into some of the new memory profiling tools in torch and see if they give any additional info.

It still failed with 1.6.0 - similar stack trace:

#0  0x00007ffe767b7b44 in clock_gettime ()
#1  0x00007f5c2ea25056 in __GI___clock_gettime (clock_id=4, tp=0x7ffe7663c3a0) at ../sysdeps/unix/clock_gettime.c:115
#2  0x00007f59ee03370e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007f59ee10c837 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007f59ee12e7a9 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007f59ee008d88 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007f59edf0d612 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007f59edf0fdcf in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8  0x00007f59ee0b3cb5 in cuMemcpyDtoHAsync_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#9  0x00007f5c25ac795f in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/libcudart-1b201d85.so.10.1
#10 0x00007f5c25aa3b13 in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/libcudart-1b201d85.so.10.1
#11 0x00007f5c25ae3118 in cudaMemcpyAsync () from /opt/conda/lib/python3.7/site-packages/torch/lib/libcudart-1b201d85.so.10.1
#12 0x00007f5bc071918f in at::native::_local_scalar_dense_cuda(at::Tensor const&)::{lambda()#1}::operator()() const ()
   from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so
#13 0x00007f5bc071ac27 in at::native::_local_scalar_dense_cuda(at::Tensor const&) ()
   from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so
#14 0x00007f5bbf9b24c8 in at::CUDAType::_local_scalar_dense(at::Tensor const&) ()
   from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so
#15 0x00007f5bbf9dc7ad in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::Scalar (*)(at::Tensor const&), c10::Scalar, c10::guts::typelist::typelist<at::Tensor const&> >, c10::Scalar (at::Tensor const&)>::call(c10::OperatorKernel*, at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so
#16 0x00007f5bf306e90d in at::_local_scalar_dense(at::Tensor const&) ()
   from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#17 0x00007f5bf4cc602c in torch::autograd::VariableType::_local_scalar_dense(at::Tensor const&) ()
   from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#18 0x00007f5bf2ff437d in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::Scalar (*)(at::Tensor const&), c10::Scalar, c10::guts::typelist::typelist<at::Tensor const&> >, c10::Scalar (at::Tensor const&)>::call(c10::OperatorKernel*, at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#19 0x00007f5bf306e90d in at::_local_scalar_dense(at::Tensor const&) ()
   from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#20 0x00007f5bf2cf5b5b in at::native::item(at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#21 0x00007f5bf317e6b8 in at::TypeDefault::item(at::Tensor const&) ()
   from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#22 0x00007f5bf4d0d1b9 in torch::autograd::VariableType::item(at::Tensor const&) ()
   from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#23 0x00007f5bf2ff437d in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::Scalar (*)(at::Tensor const&), c10::Scalar, c10::guts::typelist::typelist<at::Tensor const&> >, c10::Scalar (at::Tensor const&)>::call(c10::OperatorKernel*, at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#24 0x00007f5bf320cbbd in at::Tensor::item() const () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#25 0x00007f5bf2d38b01 in at::native::is_nonzero(at::Tensor const&) ()
   from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#26 0x00007f5c0228beda in torch::autograd::THPVariable_is_nonzero(_object*, _object*) ()
   from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#27 0x000055d4a4a75aba in _PyMethodDef_RawFastCallDict () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:482
#28 0x000055d4a4a75c81 in _PyCFunction_FastCallDict (func=0x7f5a61f18730, args=<optimized out>, nargs=<optimized out>, 
    kwargs=<optimized out>) at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:586
#29 0x000055d4a4aa2700 in call_unbound_noarg (self=0x7f59d8a71fa0, func=0x7f5a61f18730, unbound=0)
    at /tmp/build/80754af9/python_1588882889832/work/Objects/typeobject.c:1518
#30 slot_nb_bool () at /tmp/build/80754af9/python_1588882889832/work/Objects/typeobject.c:6251
#31 0x000055d4a4a3cd73 in PyObject_IsTrue () at /tmp/build/80754af9/python_1588882889832/work/Objects/object.c:1425
#32 0x000055d4a4a73418 in builtin_any () at /tmp/build/80754af9/python_1588882889832/work/Python/bltinmodule.c:427
#33 0x000055d4a4aa4abd in _PyMethodDef_RawFastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:648
#34 0x000055d4a4aa4db1 in _PyCFunction_FastCallKeywords (func=0x7f5c2f0e4aa0, args=<optimized out>, nargs=<optimized out>, 
    kwnames=<optimized out>) at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:734
#35 0x000055d4a4b0fe94 in call_function (kwnames=0x0, oparg=1, pp_stack=<synthetic pointer>)
    at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4568
#36 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3124
#37 0x000055d4a4aa420b in function_code_fastcall (globals=<optimized out>, nargs=2, args=<optimized out>, co=<optimized out>)
    at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:283
#38 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:408
#39 0x000055d4a4b0bbe6 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>)
    at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4616
#40 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3124
#41 0x000055d4a4aa420b in function_code_fastcall (globals=<optimized out>, nargs=3, args=<optimized out>, co=<optimized out>)
    at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:283
#42 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:408
#43 0x000055d4a4b10229 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>)
    at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4616
#44 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3093
#45 0x000055d4a4a54b00 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3930
#46 0x000055d4a4aa4497 in _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:433
#47 0x000055d4a4b0bbe6 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>)
    at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4616
#48 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3124
#49 0x000055d4a4aa420b in function_code_fastcall (globals=<optimized out>, nargs=7, args=<optimized out>, co=<optimized out>)
    at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:283
#50 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:408
#51 0x000055d4a4b0bbe6 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>)
    at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4616
#52 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3124
#53 0x000055d4a4a542b9 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3930
#54 0x000055d4a4a551d4 in PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3959
#55 0x000055d4a4a551fc in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>)
    at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:524
#56 0x000055d4a4b1b2ed in builtin_exec_impl.isra.12 (locals=0x7f5c2f05cb40, globals=0x7f5c2f05cb40, source=0x7f5c2de93540)
    at /tmp/build/80754af9/python_1588882889832/work/Python/bltinmodule.c:1079
#57 builtin_exec () at /tmp/build/80754af9/python_1588882889832/work/Python/clinic/bltinmodule.c.h:283
#58 0x000055d4a4aa4b19 in _PyMethodDef_RawFastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:655
#59 0x000055d4a4aa4db1 in _PyCFunction_FastCallKeywords (func=0x7f5c2f0e4e10, args=<optimized out>, nargs=<optimized out>, 
    kwnames=<optimized out>) at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:734
#60 0x000055d4a4b0fe94 in call_function (kwnames=0x0, oparg=2, pp_stack=<synthetic pointer>)
    at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4568
#61 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3124
#62 0x000055d4a4a542b9 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3930
#63 0x000055d4a4aa4435 in _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:433
#64 0x000055d4a4b0bbe6 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>)
    at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:4616
#65 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3124
#66 0x000055d4a4a542b9 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1588882889832/work/Python/ceval.c:3930
#67 0x000055d4a4a553e5 in _PyFunction_FastCallDict () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:376
#68 0x000055d4a4b63ce7 in pymain_run_module () at /tmp/build/80754af9/python_1588882889832/work/Modules/main.c:355
#69 0x000055d4a4b7660b in pymain_run_python (pymain=0x7ffe7663edd0)
    at /tmp/build/80754af9/python_1588882889832/work/Modules/main.c:2899
#70 pymain_main () at /tmp/build/80754af9/python_1588882889832/work/Modules/main.c:3442
#71 0x000055d4a4b766fc in _Py_UnixMain () at /tmp/build/80754af9/python_1588882889832/work/Modules/main.c:3477
#72 0x00007f5c2e915b97 in __libc_start_main (main=0x55d4a4a353a0 <main>, argc=7, argv=0x7ffe7663ef28, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe7663ef18) at ../csu/libc-start.c:310
#73 0x000055d4a4b1b3c0 in _start () at ../sysdeps/x86_64/elf/start.S:103
  Id   Target Id         Frame 
* 1    Thread 0x7f5c2f11e740 (LWP 1197) "python" 0x00007ffe767b7b44 in clock_gettime ()
  17   Thread 0x7f5c0a1b6700 (LWP 2079) "jemalloc_bg_thd" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, 
    expected=0, futex_word=0x7f5bb6a0a5f4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  18   Thread 0x7f5c0a9b7700 (LWP 2080) "python" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55d4a7d28cc0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  19   Thread 0x7f5c0f1b8700 (LWP 2081) "python" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55d4a7d28cc0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  20   Thread 0x7f5c0f9b9700 (LWP 2082) "python" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55d4a7d28cc0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  21   Thread 0x7f5bb5a24700 (LWP 2083) "python" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55d4a7d28cc0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  22   Thread 0x7f5bb5223700 (LWP 2084) "python" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55d4a7d28cc0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  23   Thread 0x7f5bb4a22700 (LWP 2085) "python" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55d4a7d28cc0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  24   Thread 0x7f5baffff700 (LWP 2086) "python" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55d4a7d28cc0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  25   Thread 0x7f5baf7fe700 (LWP 2087) "python" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55d4a7d28cc0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  26   Thread 0x7f5baeffd700 (LWP 2088) "python" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55d4a7d28cc0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  27   Thread 0x7f5bae7fc700 (LWP 2089) "python" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55d4a7d28cc0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  28   Thread 0x7f5badffb700 (LWP 2090) "python" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55d4a7d28cc0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  29   Thread 0x7f5bad7fa700 (LWP 2091) "python" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55d4a7d28cc0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  30   Thread 0x7f5bacff9700 (LWP 2092) "python" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55d4a7d28cc0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  31   Thread 0x7f5bac7f8700 (LWP 2093) "python" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55d4a7d28cc0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  32   Thread 0x7f5babff7700 (LWP 2094) "python" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55d4a7d28cc0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  33   Thread 0x7f5bab7f6700 (LWP 2095) "python" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55d4a7d28cc0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  34   Thread 0x7f5baa5ff700 (LWP 2096) "jemalloc_bg_thd" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, 
    expected=0, futex_word=0x7f5bb6a0a6c4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  35   Thread 0x7f5ba97ff700 (LWP 2097) "jemalloc_bg_thd" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, 
    expected=0, futex_word=0x7f5bb6a0a794) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  36   Thread 0x7f5b93fff700 (LWP 2098) "jemalloc_bg_thd" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, 
    expected=0, futex_word=0x7f5bb6a0a864) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  39   Thread 0x7f5b5b78b700 (LWP 2962) "python" 0x00007f5c2ea173e7 in accept4 (fd=10, addr=..., addr_len=0x7f5b5b78adf8, 
    flags=524288) at ../sysdeps/unix/sysv/linux/accept4.c:32
  40   Thread 0x7f5a608ff700 (LWP 2963) "python" 0x00007f5c2ea08cf9 in __GI___poll (fds=0x7f5980000bd0, nfds=10, timeout=100)
    at ../sysdeps/unix/sysv/linux/poll.c:29
  48   Thread 0x7f59e95ba700 (LWP 2992) "python" 0x00007f5c2ecf29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55d6cc21be9c) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  49   Thread 0x7f59eadbd700 (LWP 3000) "python" 0x00007f5c2aba5af1 in ?? ()
   from /opt/conda/lib/python3.7/site-packages/torch/lib/libgomp-7c85b1e2.so.1
  50   Thread 0x7f59ea5bc700 (LWP 3001) "python" 0x00007f5c2aba5af1 in ?? ()
   from /opt/conda/lib/python3.7/site-packages/torch/lib/libgomp-7c85b1e2.so.1
  51   Thread 0x7f59ed607700 (LWP 3002) "python" 0x00007f5c2aba5af1 in ?? ()
   from /opt/conda/lib/python3.7/site-packages/torch/lib/libgomp-7c85b1e2.so.1
  52   Thread 0x7f59ebdbf700 (LWP 3003) "python" 0x00007f5c2aba5af1 in ?? ()
   from /opt/conda/lib/python3.7/site-packages/torch/lib/libgomp-7c85b1e2.so.1
  53   Thread 0x7f59ede08700 (LWP 3004) "python" 0x00007f5c2aba5af1 in ?? ()
   from /opt/conda/lib/python3.7/site-packages/torch/lib/libgomp-7c85b1e2.so.1
  54   Thread 0x7f59eb5be700 (LWP 3005) "python" 0x00007f5c2aba5af1 in ?? ()
   from /opt/conda/lib/python3.7/site-packages/torch/lib/libgomp-7c85b1e2.so.1
  55   Thread 0x7f59e9dbb700 (LWP 3006) "python" 0x00007f5c2aba5af1 in ?? ()
   from /opt/conda/lib/python3.7/site-packages/torch/lib/libgomp-7c85b1e2.so.1
  1365 Thread 0x7f59dbfff700 (LWP 21113) "python" 0x00007f5c2ecf56e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, 
    expected=0, futex_word=0x7f59240046c0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  1366 Thread 0x7f59e670d700 (LWP 21114) "python" 0x00007f5c2ecf56e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, 
    expected=0, futex_word=0x7f5960002d40) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  1367 Thread 0x7f59e570b700 (LWP 21115) "python" 0x00007f5c2ecf56e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, 
    expected=0, futex_word=0x7f5920004a80) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  1368 Thread 0x7f59aa13f700 (LWP 21116) "python" 0x00007f5c2ecf56e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, 
    expected=0, futex_word=0x7f5914003d70) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  1369 Thread 0x7f59e8db9700 (LWP 21117) "python" 0x00007f5c2ecf56e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, 
    expected=0, futex_word=0x7f593c000d50) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  1370 Thread 0x7f59e5f0c700 (LWP 21118) "python" 0x00007f5c2ecf56e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, 
    expected=0, futex_word=0x7f5938000d50) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  1371 Thread 0x7f59e4f0a700 (LWP 21119) "python" 0x00007f5c2ecf56e6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, 
    expected=0, futex_word=0x7f591c005100) at ../sysdeps/unix/sysv/linux/futex-internal.h:205

Unfortunately I can’t share the code as is, so I’ll work on finding a minimal version that reproduces the error and that I’m able to share.
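
To make the backtrace above easier to read, here is a rough sketch that keeps only the frames gdb could name. `named_frames` and `FRAME` are hypothetical helpers, and the sample lines are copied from the trace in this post.

```python
import re

# A gdb backtrace row: "#N  [0xADDR in ]FUNCTION (" ; the address is optional.
# Assumption: the function name ends at the first " (" on the line, which
# truncates some long C++ template names but keeps the short ones intact.
FRAME = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?(.+?) \(", re.MULTILINE)


def named_frames(bt_text):
    """Return (frame number, function name) pairs, skipping unresolved '??' frames."""
    return [(int(n), name) for n, name in FRAME.findall(bt_text) if name != "??"]


SAMPLE = """\
#0  0x00007ffe767b7b44 in clock_gettime ()
#2  0x00007f59ee03370e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8  0x00007f59ee0b3cb5 in cuMemcpyDtoHAsync_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#11 0x00007f5c25ae3118 in cudaMemcpyAsync () from /opt/conda/lib/python3.7/site-packages/torch/lib/libcudart-1b201d85.so.10.1
#20 0x00007f5bf2cf5b5b in at::native::item(at::Tensor const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#25 0x00007f5bf2d38b01 in at::native::is_nonzero(at::Tensor const&) ()
   from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#30 slot_nb_bool () at /tmp/build/80754af9/python_1588882889832/work/Objects/typeobject.c:6251
#32 0x000055d4a4a73418 in builtin_any () at /tmp/build/80754af9/python_1588882889832/work/Python/bltinmodule.c:427
"""

frames = named_frames(SAMPLE)
for number, name in frames:
    print(f"#{number:<3d} {name}")
```

Read from the bottom up, the condensed chain is Python's built-in `any()` asking a tensor for its truth value, which goes through `is_nonzero` and `item()` into a device-to-host copy. That copy has to wait for the GPU work already queued, so the host may simply be the place where the stall becomes visible, with the actual cause in an earlier kernel. This is one reading of the trace, not a confirmed diagnosis.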

Hi,

This seems to be the same as this issue, right? https://github.com/pytorch/pytorch/issues/22259

Definitely sounds similar. I’m not manually starting any threads, just using num_workers in dataloader.

Sounds like an NVCC version mismatch can cause things to hang? nvidia-smi shows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

But I am using

torch                      1.6.0+cu101
torchvision                0.6.0+cu101

Think that could be the issue? There is some other talk about NCCL in that issue but I don’t think that applies since I’m using a single GPU.
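
As a quick sanity check on the two version numbers, the sketch below compares the CUDA tag in the wheel version with the "CUDA Version" field of the nvidia-smi header. It relies on two assumptions: that the `+cuXYZ` suffix names the CUDA runtime bundled with the binaries, with the last digit as the minor version, and that nvidia-smi reports the highest CUDA version the driver supports. The function names are hypothetical.

```python
import re


def wheel_cuda_version(torch_version):
    """Parse the local tag of a version string: '1.6.0+cu101' -> (10, 1)."""
    m = re.search(r"\+cu(\d+)$", torch_version)
    if m is None:
        return None  # no +cuXYZ tag, e.g. a CPU-only build
    digits = m.group(1)
    # Assumption: last digit is the minor version (cu92 -> 9.2, cu101 -> 10.1).
    return int(digits[:-1]), int(digits[-1])


def driver_cuda_version(nvidia_smi_header):
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi."""
    m = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_header)
    return (int(m.group(1)), int(m.group(2))) if m else None


HEADER = "| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.2     |"

wheel = wheel_cuda_version("1.6.0+cu101")
driver = driver_cuda_version(HEADER)
# The bundled runtime should not be newer than what the driver supports.
print(wheel, driver, wheel <= driver)
```

Under those assumptions a cu101 build on a driver reporting 10.2 is within range, so the version numbers alone would not point to a mismatch.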

Could you install the PyTorch binaries with CUDA10.2 just for the sake of debugging?

I’m wondering if I have a similar issue. GPU memory usage is steady for some epochs, then it spikes and eventually hits a CUDA out-of-memory error. Is it possible for GPU memory to increase like that so many epochs into training?

Using pytorch-lightning 0.8.5 with pytorch 1.6.0