Difficulty limiting thread usage in PyTorch

Hi,

I am attempting to run my code on an HPC cluster and want to limit the number of threads used to the number of CPUs I request. I find, however, that during the backward step thread usage shoots up.
For example: I request 1 CPU, but during the backward step 4 threads are used (and they remain in use afterwards).

I have set the following environment variables at the start of the code:

import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

and also set

import torch
torch.set_num_threads(1)

But thread usage during the backward step remains at 4.
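
(For completeness, a condensed sketch of the setup above; the torch.set_num_interop_threads call is an extra, untested addition on top of it, meant to cap PyTorch's second, inter-op thread pool, and it is supposed to run before any parallel work.)

import os

# Thread caps must be set before torch (or numpy) is imported, otherwise the
# OpenMP/MKL runtimes may already have been initialised with more threads.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

import torch

torch.set_num_threads(1)          # intra-op pool (parallelism inside one operator)
torch.set_num_interop_threads(1)  # extra, untested call: caps the inter-op pool
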
Another thing: the following command in the Linux terminal (where PID is the process id)

ps -o nlwp {PID}

and the method

torch.get_num_threads()

return different results: the former tells me my process is using 4 threads, while the latter reports only 1. I am inclined to believe the former.
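
(A minimal, Linux-only sketch for comparing the two numbers from inside the process; the Threads field in /proc/self/status is the same count that ps -o nlwp reports:)

import re
import torch

def os_thread_count():
    # Linux-specific: "Threads:" in /proc/self/status counts every native
    # thread in this process, i.e. the number ps -o nlwp shows.
    with open("/proc/self/status") as f:
        return int(re.search(r"Threads:\s*(\d+)", f.read()).group(1))

print("OS-level threads:", os_thread_count())
print("torch intra-op threads:", torch.get_num_threads())  # only the intra-op pool size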

Help would be greatly appreciated.

Hi,

I guess your cluster has 4 GPUs?
The autograd engine also uses threading so that it can send work to the GPUs fast enough during the backward pass.
Is that a problem in your setup?

You probably just want to limit the concurrency of the compute-heavy work, so set OMP_NUM_THREADS=1 and num_workers=1. Thread switching is generally fast, though, so you won't gain much from getting rid of threads that sit idle most of the time. You can see what all the threads are doing with "sudo gdb -p … thread apply all bt".
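
(For anyone unfamiliar: num_workers here refers to the torch.utils.data.DataLoader argument controlling how many worker processes load batches; a minimal sketch with a placeholder dataset:)

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset, just for illustration.
dataset = TensorDataset(torch.randn(100, 8), torch.randint(0, 2, (100,)))

# num_workers controls how many worker processes load batches;
# 0 means all loading happens in the main process.
loader = DataLoader(dataset, batch_size=16, num_workers=1)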

@albanD
My cluster is divided into several nodes; the node I am currently using does not have any GPUs.

@Yaroslav_Bulatov
I am currently not using a DataLoader, so I don't believe num_workers=1 is relevant to my case.
The main reason I want to limit the number of threads is that in the past my jobs have overloaded the compute nodes by using more threads than the number of CPUs allotted to them.

If you don’t have GPUs, then the autograd won’t use extra threads.
Are you using the jit?

No, this is the first time I've heard of it.

hmmm not sure where these threads could be coming from then…
@VitalyFedyunin might have an idea?

Another solution would be to start your job with gdb and, when 4 threads are in use, interrupt it and check the stack trace of each thread. That should tell us why this happens.

Here is the trace for the threads:

(gdb) thread apply all bt

Thread 4 (Thread 0x7facf308c700 (LWP 16191)):
#0 0x00007fad7454bd1f in accept4 () from /lib64/libc.so.6
#1 0x00007facf32b6a5a in ?? () from /lib64/libcuda.so.1
#2 0x00007facf32a85bd in ?? () from /lib64/libcuda.so.1
#3 0x00007facf32b8118 in ?? () from /lib64/libcuda.so.1
#4 0x00007fad74f2ae65 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fad7454a88d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7facf288b700 (LWP 16192)):
#0 0x00007fad74f2e9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fad644e582c in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
from /lib64/libstdc++.so.6
#2 0x00007fad1dc8ce23 in torch::autograd::ReadyQueue::pop() ()
from /home/nfs/USER/.local/lib/python3.6/site-packages/torch/lib/libtorch.so
#3 0x00007fad1dc8ed7c in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) ()
from /home/nfs/USER/.local/lib/python3.6/site-packages/torch/lib/libtorch.so
#4 0x00007fad1dc88979 in torch::autograd::Engine::thread_init(int) ()
from /home/nfs/USER/.local/lib/python3.6/site-packages/torch/lib/libtorch.so
#5 0x00007fad64d9408a in torch::autograd::python::PythonEngine::thread_init(int) ()
from /home/nfs/USER/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#6 0x00007fad659afdef in execute_native_thread_routine ()
from /home/nfs/USER/.local/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
#7 0x00007fad74f2ae65 in start_thread () from /lib64/libpthread.so.0
#8 0x00007fad7454a88d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7facf208a700 (LWP 16193)):
#0 0x00007fad74f2e9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fad644e582c in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
from /lib64/libstdc++.so.6
#2 0x00007fad1dc8ce23 in torch::autograd::ReadyQueue::pop() ()
from /home/nfs/USER/.local/lib/python3.6/site-packages/torch/lib/libtorch.so
#3 0x00007fad1dc8ed7c in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) ()
from /home/nfs/USER/.local/lib/python3.6/site-packages/torch/lib/libtorch.so
#4 0x00007fad1dc88979 in torch::autograd::Engine::thread_init(int) ()
from /home/nfs/USER/.local/lib/python3.6/site-packages/torch/lib/libtorch.so
#5 0x00007fad64d9408a in torch::autograd::python::PythonEngine::thread_init(int) ()
from /home/nfs/USER/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#6 0x00007fad659afdef in execute_native_thread_routine ()
from /home/nfs/USER/.local/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
#7 0x00007fad74f2ae65 in start_thread () from /lib64/libpthread.so.0
#8 0x00007fad7454a88d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fad75853740 (LWP 16174)):
#0 0x00007fad744d2424 in memalign () from /lib64/libc.so.6
#1 0x00007fad744d404c in posix_memalign () from /lib64/libc.so.6
#2 0x00007fad1960ab4a in c10::alloc_cpu(unsigned long) ()
from /home/nfs/USER/.local/lib/python3.6/site-packages/torch/lib/libc10.so
#3 0x00007fad1960c5fa in c10::DefaultCPUAllocator::allocate(unsigned long) const ()
from /home/nfs/USER/.local/lib/python3.6/site-packages/torch/lib/libc10.so
#4 0x00007fad1b93c78a in at::native::empty_cpu(c10::ArrayRef, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) ()
from /home/nfs/USER/.local/lib/python3.6/site-packages/torch/lib/libtorch.so
#5 0x00007fad1bb8530b in at::CPUType::(anonymous namespace)::empty(c10::ArrayRef, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) ()
from /home/nfs/USER/.local/lib/python3.6/site-packages/torch/lib/libtorch.so
#6 0x00007fad1bbccb37 in c10::detail::wrap_kernel_functor_unboxed_<c10::detail::WrapRuntimeKernelFunctor_<at::Tensor ()(c10::ArrayRef, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>), at::Tensor, c10::guts::typelist::typelist<c10::ArrayRef, c10::TensorOptions const&, c10::optional<c10::MemoryFormat> > >, at::Tensor (c10::ArrayRef, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel, c10::ArrayRef, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) ()
from /home/nfs/USER/.local/lib/python3.6/site-packages/torch/lib/libtorch.so
#7 0x00007fad1db8c795 in torch::autograd::VariableType::(anonymous namespace)::empty(c10::A

It does seem like it is autograd that is using extra threads, despite the fact that there is no GPU to make use of.

So Thread 4 seems to be a CUDA driver thread. Not sure we can do anything about this? (cc @ptrblck)

Thread 1 is your main worker thread.

Threads 2/3 are autograd worker threads (not sure why there are 2), but the CPU worker thread should only run while Thread 1 is blocked waiting for it to finish, so that shouldn't use more than one core at a time.

Just in case: OMP_NUM_THREADS controls ONLY the number of operator threads; it does nothing for autograd threads or data-loading threads.
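
(A quick way to check what the various knobs actually ended up as, assuming a PyTorch version that exposes torch.__config__.parallel_info():)

import torch

# Prints the intra-op / inter-op thread counts and which backend
# (OpenMP, MKL, native) provides them, so you can confirm that the
# environment variables and set_num_threads calls took effect.
print(torch.__config__.parallel_info())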