Futex_wait hang

I am running PyTorch C++ on an Amazon Linux AMI and encountered a hang.
Env:

(base) [ec2-user@ip-172-31-20-6 ~]$ uname -a
Linux ip-172-31-20-6.ec2.internal 4.14.173-137.228.amzn2.x86_64 #1 SMP Thu Mar 19 16:50:21 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

(base) [ec2-user@ip-172-31-20-6 ~]$ gcc --version
gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-6)
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

(base) [ec2-user@ip-172-31-20-6 ~]$ more workspaces/workspace_cpp/pytorch/version.txt 
1.5.0a0

Thread info:

(gdb) info threads
  Id   Target Id         Frame 
  1    Thread 0x7f6708b67d80 (LWP 21026) "rnnlentest" 0x00007f67003b462b in nanosleep () from /lib64/libc.so.6
  2    Thread 0x7d66ee5e4700 (LWP 21034) "rnnlentest" futex_wait (val=10551144, addr=0x309bb24)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
  3    Thread 0x7d66cf663700 (LWP 21039) "rnnlentest" 0x00007f6701710277 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  4    Thread 0x7d66cee62700 (LWP 21093) "rnnlentest" 0x00007f6701710277 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  5    Thread 0x7d66ce661700 (LWP 21094) "rnnlentest" 0x00007f6701710277 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  6    Thread 0x7d66cde60700 (LWP 21095) "rnnlentest" 0x00007f6701710277 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
* 7    Thread 0x7d668029d700 (LWP 23236) "rnnlentest" futex_wait (val=10551144, addr=0x309bb24)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
  8    Thread 0x7d667f81c700 (LWP 23237) "rnnlentest" futex_wait (val=10551144, addr=0x309bb24)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44

(gdb) thread 1
[Switching to thread 1 (Thread 0x7f6708b67d80 (LWP 21026))]
#0  0x00007f67003b462b in nanosleep () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f67003b462b in nanosleep () from /lib64/libc.so.6
#1  0x00007f67003b454a in sleep () from /lib64/libc.so.6
#2  0x00000000004b3db7 in trainLstmDbOverfit<GRUMaskNet, LmdbDataDefs> (validReader=..., net=..., optimizer=..., sampleNum=131072, seqLen=27, 
    lr=0.00100000005) at /home/ec2-user/workspaces/workspace_cpp/torchpractice/src/testrnnlen.cpp:412
#3  0x00000000004b372e in testTrainByQ<GRUMaskNet> (lr=0.00100000005, sampleNum=131072, seqLen=27)
    at /home/ec2-user/workspaces/workspace_cpp/torchpractice/src/testrnnlen.cpp:535
#4  0x00000000004b318e in main (argc=3, argv=0x7ffe277e66f8) at /home/ec2-user/workspaces/workspace_cpp/torchpractice/src/testrnnlen.cpp:686

(gdb) thread 2
[Switching to thread 2 (Thread 0x7d66ee5e4700 (LWP 21034))]
#0  futex_wait (val=10551144, addr=0x309bb24)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
44	in /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h
(gdb) bt
#0  futex_wait (val=10551144, addr=0x309bb24)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
#1  do_wait (addr=addr@entry=0x309bb24, val=val@entry=10551144)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/wait.h:67
#2  0x00007f6708b96623 in gomp_barrier_wait_end (bar=0x309bb20, state=10551144)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/bar.c:48
#3  0x00007f6708b946f7 in gomp_simple_barrier_wait (bar=0x309bb20)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/posix/simple-bar.h:60
#4  gomp_thread_start (xdata=<optimized out>)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/team.c:127
#5  0x00007f670170a40b in start_thread () from /lib64/libpthread.so.0
#6  0x00007f67003e3e7f in clone () from /lib64/libc.so.6
(gdb) 

(gdb) thread 3
[Switching to thread 3 (Thread 0x7d66cf663700 (LWP 21039))]
#0  0x00007f6701710277 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f6701710277 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f6700cacb0c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x00007f6704a2fe4d in torch::autograd::ReadyQueue::pop() ()
   from /home/ec2-user/workspaces/workspace_cpp/pytorch/build/lib.linux-x86_64-3.7/torch/lib/libtorch_cpu.so
#3  0x00007f6704a34d60 in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) ()
   from /home/ec2-user/workspaces/workspace_cpp/pytorch/build/lib.linux-x86_64-3.7/torch/lib/libtorch_cpu.so
#4  0x00007f6704a2ceac in torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&) ()
   from /home/ec2-user/workspaces/workspace_cpp/pytorch/build/lib.linux-x86_64-3.7/torch/lib/libtorch_cpu.so
#5  0x00007f6700cb2acf in ?? () from /lib64/libstdc++.so.6
#6  0x00007f670170a40b in start_thread () from /lib64/libpthread.so.0
#7  0x00007f67003e3e7f in clone () from /lib64/libc.so.6
(gdb) 

(gdb) thread 7
[Switching to thread 7 (Thread 0x7d668029d700 (LWP 23236))]
#0  futex_wait (val=10551144, addr=0x309bb24)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
44	in /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h
(gdb) bt
#0  futex_wait (val=10551144, addr=0x309bb24)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
#1  do_wait (addr=addr@entry=0x309bb24, val=val@entry=10551144)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/wait.h:67
#2  0x00007f6708b96623 in gomp_barrier_wait_end (bar=0x309bb20, state=10551144)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/bar.c:48
#3  0x00007f6708b946f7 in gomp_simple_barrier_wait (bar=0x309bb20)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/posix/simple-bar.h:60
#4  gomp_thread_start (xdata=<optimized out>)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/team.c:127
#5  0x00007f670170a40b in start_thread () from /lib64/libpthread.so.0
#6  0x00007f67003e3e7f in clone () from /lib64/libc.so.6
(gdb) 

(gdb) thread 8
[Switching to thread 8 (Thread 0x7d667f81c700 (LWP 23237))]
#0  futex_wait (val=10551144, addr=0x309bb24)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
44	in /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h
(gdb) bt
#0  futex_wait (val=10551144, addr=0x309bb24)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
#1  do_wait (addr=addr@entry=0x309bb24, val=val@entry=10551144)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/wait.h:67
#2  0x00007f6708b96623 in gomp_barrier_wait_end (bar=0x309bb20, state=10551144)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/bar.c:48
#3  0x00007f6708b946f7 in gomp_simple_barrier_wait (bar=0x309bb20)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/posix/simple-bar.h:60
#4  gomp_thread_start (xdata=<optimized out>)
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/team.c:127
#5  0x00007f670170a40b in start_thread () from /lib64/libpthread.so.0
#6  0x00007f67003e3e7f in clone () from /lib64/libc.so.6
(gdb) 

Thread 1 is my application thread and it is sleeping; the other threads are PyTorch threads, and they are all blocked in futex_wait or pthread_cond_wait. Is it possible that there is a futex_wait issue between the kernel and PyTorch?

Hi,

No, this is expected. Several of them are OMP worker threads (the ones in gomp_thread_start) and one of them (thread 3, in ReadyQueue::pop) is an autograd engine worker thread.
These are worker threads that are kept around so that we don’t have to recreate them every time we need them. OMP does that by default and we do it ourselves as well in the autograd engine.
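
To illustrate what "kept around" looks like, here is a minimal sketch of a long-lived worker that parks on a condition variable whenever it has nothing to do. This is not the autograd engine's code: ParkedQueue, run_parked_worker and the -1 shutdown task are hypothetical names made up for this example. It only shows the general pattern that produces an idle pthread_cond_wait frame like the one in thread 3.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical stand-in for a ready queue: pop() blocks until work arrives.
class ParkedQueue {
 public:
  void push(int task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push(task);
    }
    not_empty_.notify_one();  // wake the parked worker, if any
  }

  int pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    // An idle worker sits here. On Linux with libstdc++ a debugger will
    // typically show this as std::condition_variable::wait calling
    // pthread_cond_wait.
    not_empty_.wait(lock, [this] { return !tasks_.empty(); });
    int task = tasks_.front();
    tasks_.pop();
    return task;
  }

 private:
  std::mutex mutex_;
  std::condition_variable not_empty_;
  std::queue<int> tasks_;
};

// Starts one long-lived worker, feeds it the tasks 1..count, and returns the
// sum the worker computed. A task of -1 (an assumption of this sketch) tells
// the worker to exit.
int run_parked_worker(int count) {
  assert(count >= 0);
  ParkedQueue queue;
  int sum = 0;
  std::thread worker([&] {
    for (;;) {
      int task = queue.pop();  // parked here whenever the queue is empty
      if (task < 0) break;
      sum += task;
    }
  });
  for (int i = 1; i <= count; ++i) queue.push(i);
  queue.push(-1);
  worker.join();  // join synchronizes, so reading sum afterwards is safe
  return sum;
}
```

If you attach gdb to a program like this while the queue is empty, the worker shows up blocked in a wait even though nothing is wrong. To find a real hang, look at what the thread doing the actual work is blocked on; in your dump that is thread 1, which is in sleep() called from your own code at testrnnlen.cpp:412.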