I am running the PyTorch C++ API on an Amazon Linux 2 AMI and my program hangs.
Environment:
(base) [ec2-user@ip-172-31-20-6 ~]$ uname -a
Linux ip-172-31-20-6.ec2.internal 4.14.173-137.228.amzn2.x86_64 #1 SMP Thu Mar 19 16:50:21 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
(base) [ec2-user@ip-172-31-20-6 ~]$ gcc --version
gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-6)
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
(base) [ec2-user@ip-172-31-20-6 ~]$ more workspaces/workspace_cpp/pytorch/version.txt
1.5.0a0
Thread info and backtraces from gdb:
(gdb) info threads
Id Target Id Frame
1 Thread 0x7f6708b67d80 (LWP 21026) "rnnlentest" 0x00007f67003b462b in nanosleep () from /lib64/libc.so.6
2 Thread 0x7d66ee5e4700 (LWP 21034) "rnnlentest" futex_wait (val=10551144, addr=0x309bb24)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
3 Thread 0x7d66cf663700 (LWP 21039) "rnnlentest" 0x00007f6701710277 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
4 Thread 0x7d66cee62700 (LWP 21093) "rnnlentest" 0x00007f6701710277 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
5 Thread 0x7d66ce661700 (LWP 21094) "rnnlentest" 0x00007f6701710277 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
6 Thread 0x7d66cde60700 (LWP 21095) "rnnlentest" 0x00007f6701710277 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
* 7 Thread 0x7d668029d700 (LWP 23236) "rnnlentest" futex_wait (val=10551144, addr=0x309bb24)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
8 Thread 0x7d667f81c700 (LWP 23237) "rnnlentest" futex_wait (val=10551144, addr=0x309bb24)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f6708b67d80 (LWP 21026))]
#0 0x00007f67003b462b in nanosleep () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f67003b462b in nanosleep () from /lib64/libc.so.6
#1 0x00007f67003b454a in sleep () from /lib64/libc.so.6
#2 0x00000000004b3db7 in trainLstmDbOverfit<GRUMaskNet, LmdbDataDefs> (validReader=..., net=..., optimizer=..., sampleNum=131072, seqLen=27,
lr=0.00100000005) at /home/ec2-user/workspaces/workspace_cpp/torchpractice/src/testrnnlen.cpp:412
#3 0x00000000004b372e in testTrainByQ<GRUMaskNet> (lr=0.00100000005, sampleNum=131072, seqLen=27)
at /home/ec2-user/workspaces/workspace_cpp/torchpractice/src/testrnnlen.cpp:535
#4 0x00000000004b318e in main (argc=3, argv=0x7ffe277e66f8) at /home/ec2-user/workspaces/workspace_cpp/torchpractice/src/testrnnlen.cpp:686
(gdb) thread 2
[Switching to thread 2 (Thread 0x7d66ee5e4700 (LWP 21034))]
#0 futex_wait (val=10551144, addr=0x309bb24)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
44 in /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h
(gdb) bt
#0 futex_wait (val=10551144, addr=0x309bb24)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
#1 do_wait (addr=addr@entry=0x309bb24, val=val@entry=10551144)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/wait.h:67
#2 0x00007f6708b96623 in gomp_barrier_wait_end (bar=0x309bb20, state=10551144)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/bar.c:48
#3 0x00007f6708b946f7 in gomp_simple_barrier_wait (bar=0x309bb20)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/posix/simple-bar.h:60
#4 gomp_thread_start (xdata=<optimized out>)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/team.c:127
#5 0x00007f670170a40b in start_thread () from /lib64/libpthread.so.0
#6 0x00007f67003e3e7f in clone () from /lib64/libc.so.6
(gdb)
(gdb) thread 3
[Switching to thread 3 (Thread 0x7d66cf663700 (LWP 21039))]
#0 0x00007f6701710277 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00007f6701710277 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f6700cacb0c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2 0x00007f6704a2fe4d in torch::autograd::ReadyQueue::pop() ()
from /home/ec2-user/workspaces/workspace_cpp/pytorch/build/lib.linux-x86_64-3.7/torch/lib/libtorch_cpu.so
#3 0x00007f6704a34d60 in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) ()
from /home/ec2-user/workspaces/workspace_cpp/pytorch/build/lib.linux-x86_64-3.7/torch/lib/libtorch_cpu.so
#4 0x00007f6704a2ceac in torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&) ()
from /home/ec2-user/workspaces/workspace_cpp/pytorch/build/lib.linux-x86_64-3.7/torch/lib/libtorch_cpu.so
#5 0x00007f6700cb2acf in ?? () from /lib64/libstdc++.so.6
#6 0x00007f670170a40b in start_thread () from /lib64/libpthread.so.0
#7 0x00007f67003e3e7f in clone () from /lib64/libc.so.6
(gdb)
(gdb) thread 7
[Switching to thread 7 (Thread 0x7d668029d700 (LWP 23236))]
#0 futex_wait (val=10551144, addr=0x309bb24)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
44 in /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h
(gdb) bt
#0 futex_wait (val=10551144, addr=0x309bb24)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
#1 do_wait (addr=addr@entry=0x309bb24, val=val@entry=10551144)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/wait.h:67
#2 0x00007f6708b96623 in gomp_barrier_wait_end (bar=0x309bb20, state=10551144)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/bar.c:48
#3 0x00007f6708b946f7 in gomp_simple_barrier_wait (bar=0x309bb20)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/posix/simple-bar.h:60
#4 gomp_thread_start (xdata=<optimized out>)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/team.c:127
#5 0x00007f670170a40b in start_thread () from /lib64/libpthread.so.0
#6 0x00007f67003e3e7f in clone () from /lib64/libc.so.6
(gdb)
(gdb) thread 8
[Switching to thread 8 (Thread 0x7d667f81c700 (LWP 23237))]
#0 futex_wait (val=10551144, addr=0x309bb24)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
44 in /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h
(gdb) bt
#0 futex_wait (val=10551144, addr=0x309bb24)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
#1 do_wait (addr=addr@entry=0x309bb24, val=val@entry=10551144)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/wait.h:67
#2 0x00007f6708b96623 in gomp_barrier_wait_end (bar=0x309bb20, state=10551144)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/linux/bar.c:48
#3 0x00007f6708b946f7 in gomp_simple_barrier_wait (bar=0x309bb20)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/config/posix/simple-bar.h:60
#4 gomp_thread_start (xdata=<optimized out>)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgomp/team.c:127
#5 0x00007f670170a40b in start_thread () from /lib64/libpthread.so.0
#6 0x00007f67003e3e7f in clone () from /lib64/libc.so.6
(gdb)
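Thread 1 is my application thread, and it is in sleep(). The other threads belong to PyTorch: threads 2, 7, and 8 are blocked in futex_wait inside a libgomp barrier (under gomp_thread_start), and threads 3 to 6 are in pthread_cond_wait (thread 3's backtrace shows torch::autograd::ReadyQueue::pop()). Is it possible that there is a futex_wait issue in the combination of this kernel and PyTorch?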
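
To help read the pthread_cond_wait frames, here is a minimal sketch of a worker that blocks on an empty task queue. It is only an illustration under my own assumptions: the ReadyQueue class and sum_with_worker function below are hypothetical stand-ins written with the standard library, not PyTorch's actual implementation. While the queue is empty, the worker sits in std::condition_variable::wait, which is the same kind of frame thread 3 shows, so that state alone may just mean the worker is idle.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a worker's task queue; not PyTorch's real class.
class ReadyQueue {
 public:
  void push(int task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push_back(task);
    }
    not_empty_.notify_one();
  }

  // Marks the queue as closed so that a blocked consumer can return.
  void close() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      closed_ = true;
    }
    not_empty_.notify_all();
  }

  // Blocks while the queue is empty and still open. A thread parked here
  // is waiting on the condition variable, not spinning.
  // Returns false once the queue is closed and drained.
  bool pop(int& task) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return !tasks_.empty() || closed_; });
    if (tasks_.empty()) {
      return false;
    }
    task = tasks_.front();
    tasks_.pop_front();
    return true;
  }

 private:
  std::mutex mutex_;
  std::condition_variable not_empty_;
  std::deque<int> tasks_;
  bool closed_ = false;
};

// Starts one worker, feeds it the given tasks, and returns their sum.
int sum_with_worker(const std::vector<int>& tasks) {
  ReadyQueue queue;
  int sum = 0;
  std::thread worker([&queue, &sum] {
    int task = 0;
    while (queue.pop(task)) {
      sum += task;
    }
  });
  for (int task : tasks) {
    queue.push(task);
  }
  queue.close();
  worker.join();  // after join(), the worker's writes to sum are visible
  return sum;
}
```

If nothing ever calls push() or close(), the worker stays in pop() indefinitely, so the question for the real trace is which thread was supposed to hand out the next piece of work.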