I am training my model with the C++ frontend on CPU, with one worker thread per core. I found that most of the time only one CPU core is fully utilized. I dumped the stacks and saw that all my worker threads are waiting inside Engine::execute in engine.cpp. Reading the source code, it appears there is only one autograd engine thread for the CPU plus one per GPU.
Is there any way to run multiple backward() calls concurrently, one per CPU core? Or should I use multi-processing instead? Here is the stack of one waiting worker:
#0 pthread_cond_wait@@GLIBC_2.3.2 () at …/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007f65ea9d991c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x00007f65f53de39f in torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) () from /root/libtorch/lib/libtorch.so.1
#3 0x00007f65f564e465 in torch::autograd::Variable::Impl::backward(c10::optional<at::Tensor>, bool, bool) () from /root/libtorch/lib/libtorch.so.1
#4 0x00007f65f5650b80 in torch::autograd::VariableType::backward(at::Tensor&, c10::optional<at::Tensor>, bool, bool) const () from /root/libtorch/lib/libtorch.so.1
#5 0x000000000040f8dc in at::Tensor::backward(c10::optional<at::Tensor>, bool, bool) ()
#6 0x000000000040d20b in Env::train_critic(bool) ()
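For reference, here is roughly what each worker thread does (a minimal sketch with placeholder names like train_step, not my actual Env::train_critic): every worker builds its own independent graph and calls backward() on it, yet they all serialize inside the engine.

```cpp
#include <torch/torch.h>
#include <thread>
#include <vector>

// Placeholder for the per-worker training step. Each worker owns an
// independent graph; no tensors are shared between threads.
void train_step() {
    auto w = torch::randn({1000, 1000}, torch::requires_grad());
    for (int i = 0; i < 100; ++i) {
        auto loss = (w * w).sum();
        loss.backward();    // every worker ends up blocked here, queued
                            // behind the single CPU engine thread
        w.grad().zero_();   // reset accumulated gradients between steps
    }
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 8; ++i)  // one worker per core
        workers.emplace_back(train_step);
    for (auto& t : workers) t.join();
    return 0;
}
```

Even though the graphs are completely independent, all eight backward() calls appear to be funneled through that one engine thread, which matches the single busy core I see in top.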