Only 1 thread for backward?


I am training my model with the C++ frontend on CPU, with one worker thread per core. I found that most of the time only one CPU is fully utilized. I dumped the stacks and found that all my worker threads are waiting in Engine::execute from engine.cpp. I read the source code and found that the engine has only one thread for the CPU and one for each GPU.
Is there any way to run multiple backward() calls concurrently, one per CPU core, or should I use multiprocessing?

#0 pthread_cond_wait@@GLIBC_2.3.2 () at …/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007f65ea9d991c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/
#2 0x00007f65f53de39f in torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) () from /root/libtorch/lib/
#3 0x00007f65f564e465 in torch::autograd::Variable::Impl::backward(c10::optional<at::Tensor>, bool, bool) () from /root/libtorch/lib/
#4 0x00007f65f5650b80 in torch::autograd::VariableType::backward(at::Tensor&, c10::optional<at::Tensor>, bool, bool) const () from /root/libtorch/lib/
#5 0x000000000040f8dc in at::Tensor::backward(c10::optional<at::Tensor>, bool, bool) ()
#6 0x000000000040d20b in Env::train_critic(bool) ()

(Alban D) #2


The autograd engine was designed on the assumption that each task in the backward pass is more expensive than the overhead of running the engine itself. Note that this holds for most current CNN applications.
With that in mind, only one thread is needed to drive the computation. Heavy operations such as mm or convolutions can still use multithreading within each op when that gives the best performance.
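
To make the design concrete, here is a rough sketch of the pattern described above: a single driver thread walks the tasks in order, and each heavy task may fan its own work out across several threads. It uses only the C++ standard library and is not the actual engine code. The names `parallel_sum` and `run_tasks` are hypothetical, and a chunked sum stands in for a heavy op such as mm.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical "heavy op": sums `data` by splitting it into contiguous
// chunks, one per worker thread (intra-op parallelism).
double parallel_sum(const std::vector<double>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        // Each worker writes only to its own slot, so no locking is needed.
        workers.emplace_back([&data, &partial, t, begin, end] {
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0.0);
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}

// Hypothetical single-threaded driver: runs the queued tasks one after
// another, the way one thread can drive a sequence of expensive ops.
double run_tasks(const std::vector<std::function<double()>>& tasks) {
    double total = 0.0;
    for (const auto& task : tasks) total += task();
    return total;
}
```

The driver itself never needs more than one thread as long as each task is expensive compared to the loop around it; the parallelism lives inside the tasks. If the individual ops are small, this pattern leaves cores idle, which matches the behaviour reported in the question.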