Can I train an asynchronous actor-critic model without keeping a copy of the model?

I’m editing code from torchbeast to fit my use case. I want to use a discriminator to compute the reward at a given state.

I noticed that torchbeast keeps a copy of the model, called actor_model, for inference, and uses the original model for learning.
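As I understand it, the pattern looks roughly like this (a minimal sketch with a hypothetical tiny model standing in for the real network, not torchbeast's actual code):

```python
import torch.nn as nn

# Hypothetical stand-in for the actual network architecture.
model = nn.Linear(4, 2)        # the learner's model, updated by gradient steps
actor_model = nn.Linear(4, 2)  # the inference copy queried by actor threads

# After each learner step, the learner's weights are copied
# into the inference model so actors act with fresh parameters.
actor_model.load_state_dict(model.state_dict())
```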

This is part of the learn function (abridged):

def learn(...):
    for tensors in learner_queue:
        ...
        nn.utils.clip_grad_norm_(model.parameters(), flags.grad_norm_clipping)


I tried the same approach with the discriminator, but I ran out of memory and had to fall back to minibatches. I also thought of using a lock so that only one thread uses the discriminator at a time. Is there another solution to this problem, or is it impossible?
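For concreteness, the minibatch-plus-lock workaround I have in mind looks like this (a sketch; the discriminator, batch size, and function name are all placeholders, not torchbeast code):

```python
import threading

import torch
import torch.nn as nn

discriminator = nn.Linear(8, 1)  # hypothetical discriminator network
disc_lock = threading.Lock()     # serialize access across actor threads

def discriminator_reward(states, batch_size=32):
    """Score states in minibatches under no_grad to bound memory use."""
    rewards = []
    with disc_lock, torch.no_grad():
        for i in range(0, states.shape[0], batch_size):
            rewards.append(discriminator(states[i:i + batch_size]))
    return torch.cat(rewards, dim=0)
```

This bounds peak memory (only one minibatch of activations lives at a time, and no_grad skips the autograd graph), but the lock serializes all actors, which seems to defeat the point of asynchrony.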