Worker blocked in forward pass when parallelized

Hi everyone,

I’m facing a deadlock I can’t solve on my own.
I’m trying to implement the Reactor algorithm (https://openreview.net/forum?id=rkHVZWZAZ) to train an AI that plays the French Tarot card game.
When I implemented training worker parallelisation, things seemed to work well until I tried to increase the size of some hidden layers.
Then the worker seemed to block (as if in an infinite loop) when passing through the first layer of the first forward pass (more precisely, in the first matmul, checked with the pudb debugger).

I tried a few things:

  • When the worker is called from the main process, everything is fine, whatever the layer sizes are
  • When the worker’s exploration is performed in a separate thread (inside the secondary process, with the main thread of that secondary process doing the training), the exploration is fine and the blocking occurs in the training
  • On the contrary, if exploration & training are performed alternately, the blocking occurs in the first exploration step

The multiprocessing (and multithreading) is done with fork (Linux environment), by subclassing the multiprocessing and threading classes and then calling the start method.
The forward pass is only performed on a local copy of the shared network (one copy for exploration, another for training).
I verified the copy process: the local networks seem to match the shared network perfectly.
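Roughly, the pattern looks like this (a simplified sketch with placeholder names, not my actual code):

```python
import copy
import torch.nn as nn
import torch.multiprocessing as mp

class TrainingWorker(mp.Process):
    """Simplified sketch of the worker: subclassed process, started with fork on Linux."""
    def __init__(self, shared_net):
        super().__init__()
        self.shared_net = shared_net  # network living in shared memory in the parent

    def run(self):
        # one local copy for exploration, one for training;
        # forward passes only ever run on these local copies
        explore_net = copy.deepcopy(self.shared_net)
        train_net = copy.deepcopy(self.shared_net)
        # ... exploration / training loops, periodically re-syncing with
        # explore_net.load_state_dict(self.shared_net.state_dict())

if __name__ == "__main__":
    shared_net = nn.Linear(78, 80)  # stand-in for the real network
    shared_net.share_memory()
    TrainingWorker(shared_net).start()
```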

I suspect some memory issue in the secondary process, but I have no other clue or direction to follow.
The network dimensions are pretty reasonable, I think (a rough code sketch follows below):

  • one input linear layer 78 -> n
  • two parallel recurrent layers (GRU) n,m -> n (m = hidden state dimension)
  • three heads with two linear layers each
    • actor layer m -> n -> 78
    • advantage layer m -> n -> 78
    • value layer m -> n -> 1

The degrees of freedom are n & m, which were both 80 initially. Blocking occurs in the input layer when n or m is greater than 110… (although m does not appear in the input layer dimensions…)
The training process is performed on 14 successive steps x batches of 5 trajectories.
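To make the architecture concrete, here is the rough sketch (how the two GRU outputs are wired to the three heads is simplified, and the layer names are illustrative):

```python
import torch
import torch.nn as nn

class TarotNet(nn.Module):
    """Rough sketch of the network described above."""
    def __init__(self, n=80, m=80):
        super().__init__()
        self.input_layer = nn.Linear(78, n)             # 78 -> n (the blocking matmul)
        self.gru_a = nn.GRU(n, m, batch_first=True)     # two parallel GRUs,
        self.gru_b = nn.GRU(n, m, batch_first=True)     # hidden state of size m
        self.actor = nn.Sequential(nn.Linear(m, n), nn.ReLU(), nn.Linear(n, 78))
        self.advantage = nn.Sequential(nn.Linear(m, n), nn.ReLU(), nn.Linear(n, 78))
        self.value = nn.Sequential(nn.Linear(m, n), nn.ReLU(), nn.Linear(n, 1))

    def forward(self, x, h_a=None, h_b=None):
        z = torch.relu(self.input_layer(x))             # x: (batch, seq, 78)
        out_a, h_a = self.gru_a(z, h_a)
        out_b, h_b = self.gru_b(z, h_b)
        return self.actor(out_a), self.advantage(out_a), self.value(out_b), (h_a, h_b)
```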

Does that kind of issue seem familiar to anyone?

Thanks in advance

Hey @driou

Are you using torch.multiprocessing only, or does your code also use any features from torch.distributed or torch.nn.parallel?

If it is just torch.multiprocessing, it might be related to forking the process. If the parent process used any CUDA-related feature, there will be a CUDA context in it, which does not survive a fork. Even if it is CPU only, the fork could have broken OMP’s internal state; see the discussion here: https://github.com/pytorch/pytorch/issues/41197

Can you check if using spawn from torch.multiprocessing solves the problem?
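Something along these lines, for example (a sketch with stand-in objects, to be adapted to your worker):

```python
import torch.nn as nn
import torch.multiprocessing as mp

def worker_fn(shared_net):
    # runs in a freshly spawned interpreter, so no OMP state or locks
    # are inherited from the parent by forking
    local_net = nn.Linear(78, 200)
    local_net.load_state_dict(shared_net.state_dict())
    # ... exploration / training on local_net

if __name__ == "__main__":
    mp.set_start_method("spawn")        # instead of the default "fork" on Linux
    shared_net = nn.Linear(78, 200)     # stand-in for your actual network
    shared_net.share_memory()
    p = mp.Process(target=worker_fn, args=(shared_net,))
    p.start()
    p.join()
```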

Hey,

I’m indeed using torch.multiprocessing (and neither torch.distributed nor torch.nn.parallel).
I’m not using any CUDA-related feature: it’s CPU only (in fact I did not perform the full PyTorch install with CUDA).
I will try with spawn, but I chose fork on purpose, because it was more practical to be able to initialize the process before starting it.
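If I understand correctly, the subclass pattern should still work with spawn as long as whatever I set before start() is picklable, something like this (sketch only, not my actual code):

```python
import torch.nn as nn
import torch.multiprocessing as mp

class Worker(mp.Process):
    def __init__(self, shared_net):
        super().__init__()
        # with spawn, this attribute is pickled and rebuilt in the child process,
        # so everything set before start() must be picklable
        self.shared_net = shared_net

    def run(self):
        pass  # same exploration / training logic as with fork

if __name__ == "__main__":
    mp.set_start_method("spawn")
    shared_net = nn.Linear(78, 80)  # stand-in for the real network
    shared_net.share_memory()
    Worker(shared_net).start()
```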

Before that, let me have a look at the thread you sent me: the symptoms look very similar!

Thanks a lot for your help