Libtorch does not (really) support multithreading

One of the use cases for Libtorch was supposed to be multithreading. OK, I can read data this way and I can call forward() from several threads, but when it comes to training, the autograd machinery appears to be serialized per device. As a result backward(), which in my case takes a lot of time, is serialized as well.

Here is a comment in engine.cpp:
// XXX: Changes to the way multithreading works in execute should be done with
// great care. Right now the implementation guarantees that a single function’s
// apply will never be entered concurrently (even if multiple graphs are
// executed at the same time). Adding multiple threads per-device or removing
// engine thread affinity to the device can break this invariant, and we depend
// on it in a few places (e.g. AccumulateGrad function).
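
To make the invariant in that comment concrete, here is a toy model of how I read it. It uses only the standard library and is not the actual engine code; `DeviceWorker` and `max_concurrent_tasks` are names I made up for illustration. The assumption is one worker thread per device draining one queue, so tasks submitted from several independent "graphs" never overlap on that device.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for a per-device engine worker: a single thread
// drains a single queue, so two tasks for this "device" never run at once.
class DeviceWorker {
 public:
  DeviceWorker() : thread_([this] { run(); }) {}

  ~DeviceWorker() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      done_ = true;
    }
    cv_.notify_one();
    thread_.join();  // the worker drains remaining tasks before exiting
  }

  void submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // only reached once done_ is set
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // runs outside the lock, but only on this one thread
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool done_ = false;
  std::thread thread_;  // declared last so the members above exist first
};

// Submits `per_graph` backward-like tasks from each of `graphs` caller
// threads and returns the largest number of tasks seen running at once.
int max_concurrent_tasks(int graphs, int per_graph) {
  std::atomic<int> running{0}, max_seen{0}, finished{0};
  {
    DeviceWorker device;
    std::vector<std::thread> callers;
    for (int g = 0; g < graphs; ++g) {
      callers.emplace_back([&] {
        for (int i = 0; i < per_graph; ++i) {
          device.submit([&] {
            int now = ++running;
            int prev = max_seen.load();
            while (now > prev && !max_seen.compare_exchange_weak(prev, now)) {}
            std::this_thread::sleep_for(std::chrono::microseconds(50));
            --running;
            ++finished;
          });
        }
      });
    }
    for (auto& t : callers) t.join();
  }  // ~DeviceWorker runs here: queue drained, worker joined
  assert(finished == graphs * per_graph);
  return max_seen.load();
}
```

In this model, calling `max_concurrent_tasks(4, 25)` submits work from four caller threads, yet the tasks still execute one at a time. If the real engine behaves this way, independent networks on the same device would queue behind each other in backward() in the same manner.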

So, is there any hope that this will get fixed and that Libtorch will really support multithreaded training?
My use case is concurrent training of several independent NNs, each with its own trainer etc., so synchronization between them is not a problem.