I am implementing something similar to (but different from) a GAN, in which there are multiple “agents” which interact with each other, and then each performs a gradient descent step. Each agent has its own optimizer, which is supposed to adjust only the agent’s own parameters, according to its own objective function. The objective functions contradict one another (as in a GAN), so I can’t simply combine them into a single loss function.
The question is how to implement this correctly, such that one agent’s gradients won’t interfere with another’s. A secondary question is how to tell whether it’s working correctly or not.
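The only check I've come up with so far for the second question is snapshotting parameters and asserting that a step on one agent's optimizer leaves the other agents' parameters bitwise unchanged. A minimal sketch with throwaway stand-in modules (all names here are hypothetical, not my real agents):

```python
import torch

torch.manual_seed(0)
a = torch.nn.Linear(3, 3)   # stand-in for agent A
b = torch.nn.Linear(3, 3)   # stand-in for agent B
opt_a = torch.optim.SGD(a.parameters(), lr=0.1)

before = [p.detach().clone() for p in b.parameters()]
loss = b(a(torch.randn(2, 3))).pow(2).mean()
loss.backward()
opt_a.step()   # should touch only agent A's parameters

# Agent B's parameters must be bitwise identical to the snapshot.
unchanged = all(torch.equal(p, q) for p, q in zip(b.parameters(), before))
```

Note that this check passes even though backward() has deposited gradients on B's parameters, so it verifies step isolation but not gradient contamination, which is part of why I'm unsure what "working correctly" should even look like here.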
Most of the GAN examples I’ve found just do the whole computation twice, once to train the discriminator and once to train the generator. This does work for my setup, but I want to avoid it if I can, because it means doing roughly N times more forward computation than I really need to, where N is the number of agents.
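For concreteness, here is the baseline pattern I mean, reduced to a hypothetical two-agent toy (the Linear "agents" and the objectives are placeholders, not my actual setup):

```python
import torch

torch.manual_seed(0)
a = torch.nn.Linear(4, 4)   # stand-in for agent A
b = torch.nn.Linear(4, 4)   # stand-in for agent B
opt_a = torch.optim.SGD(a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(b.parameters(), lr=0.1)

def interact(x):
    # One agent's output is the next agent's input.
    return b(a(x))

x = torch.randn(8, 4)

# Pass 1: train agent A.
loss_a = interact(x).pow(2).mean()    # placeholder objective for A
opt_a.zero_grad()
loss_a.backward()                     # also fills B's .grad, but...
opt_a.step()                          # ...only A's parameters are updated

# Pass 2: redo the whole computation to train agent B.
loss_b = -interact(x).pow(2).mean()   # B's opposing objective
opt_b.zero_grad()                     # clears the stale grads from pass 1
loss_b.backward()
opt_b.step()
```

With N agents this becomes N full forward passes per training step, which is the cost I'd like to avoid.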
I’ve tried all kinds of combinations: calling zero_grad on all my agents, using retain_graph or not, toggling requires_grad on various agents’ parameters, and so on. My problem is that each change I make has some subtle or not-so-subtle effect on the learning outcomes, but none of them seems to quite match the ‘correct’ behaviour that I get from doing separate training steps.
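To illustrate, the closest I've come to a working single-pass variant restricts each gradient computation with torch.autograd.grad, which only differentiates with respect to the listed inputs (again, agents and objectives below are placeholder stand-ins):

```python
import torch

torch.manual_seed(0)
a = torch.nn.Linear(4, 4)   # stand-in for agent A
b = torch.nn.Linear(4, 4)   # stand-in for agent B
opt_a = torch.optim.SGD(a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(b.parameters(), lr=0.1)

x = torch.randn(8, 4)
out = b(a(x))                 # one shared forward pass
loss_a = out.pow(2).mean()    # placeholder objectives that oppose each other
loss_b = -loss_a

# Unlike loss.backward(), torch.autograd.grad computes gradients only
# w.r.t. the given inputs, so B's parameters never accumulate
# d(loss_a)/d(theta_b) and vice versa. retain_graph is needed because
# the same graph is differentiated twice.
grads_a = torch.autograd.grad(loss_a, list(a.parameters()), retain_graph=True)
grads_b = torch.autograd.grad(loss_b, list(b.parameters()))

for p, g in zip(a.parameters(), grads_a):
    p.grad = g
for p, g in zip(b.parameters(), grads_b):
    p.grad = g
opt_a.step()
opt_b.step()
```

Even this doesn't quite reproduce the two-pass results, though, since both steps here use gradients from the same pre-update forward pass, whereas the two-pass version lets B see A's already-updated parameters.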
So I guess what I’m asking for is an overview of the “best practices” for this kind of thing - how do multiple optimizers interact in PyTorch, and what techniques can be used to prevent them from interfering with one another?
(One detail that might be important about my case: my agents are recurrent neural networks that are “interleaved” in time, with one agent’s output being the next agent’s input. Because of this, the computation graph has many nodes that depend on the parameters of all N agents.)
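Concretely, the interleaving looks something like this minimal hypothetical version (GRUCell agents and a single passed "message" are simplifications of my real setup):

```python
import torch

torch.manual_seed(0)
n_agents, dim, steps = 3, 4, 5
agents = [torch.nn.GRUCell(dim, dim) for _ in range(n_agents)]
hidden = [torch.zeros(1, dim) for _ in range(n_agents)]

msg = torch.zeros(1, dim)   # message passed from agent to agent
for t in range(steps):
    for i, agent in enumerate(agents):
        hidden[i] = agent(msg, hidden[i])  # consume previous agent's output
        msg = hidden[i]                    # becomes the next agent's input

# By the end, msg (and every intermediate hidden state) depends on all
# N agents' parameters, so a loss computed from it back-propagates
# through every agent -- there is no clean per-agent subgraph to isolate.
```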