Hi everyone,
I’m facing a deadlock I can’t solve by myself.
I’m trying to implement Reactor algorithm (https://openreview.net/forum?id=rkHVZWZAZ) to train an AI playing at French Tarot card game.
When I tried to implement training worker parallelisation, things seemed to work well until I tried to increase the size of some hidden layer.
Then worker seemed to block (like infinity loop) when passing through the first layer of the first forward pass (more precisely, the first matmul -> checked with pudb debugger).
I tried a few things:
- When worker is called from the main process, everything is fine, whatever the layers size are
- When worker exploration is performed through a separate thread (inside the secondary process, the main thread of that secondary process being the training), the exploration is ok, and the blocking occures into the training
- At the contrary, if the exploration & training are performed alternatively, the blocking occures into the first exploration step
The multiprocess (and multithread) are performed with fork (linux environment), by subclassing multiprocess and multithread classes, and then by calling start method.
Forward pass is only performed on local copy of the shared network (one copy for exploration, another for training).
I verified copy process : local networks seem to match perfectly shared network.
I suspect some secondary process memory issue, but I have no other clue or direction to follow.
The network dimension are pretty reasonable I think :
- one input linear layer 78 -> n
- two parallel recurrent layer (GRU) n,m -> n (m = hidden state dimension)
- three head with two linear layers each
- actor layer m -> n -> 78
- advantage layer m -> n -> 78
- value layer m -> n -> 1
the degrees of freedom are n & m, which where 80 initially. Blocking occures in input layer when n or m is greater than 110… (although m does not appear in input layer dimension …)
training process is performed on 14 successive steps x batchs of 5 trajectories.
Does that kind of issue seems familiar to anyone ?
Thanks by advance