Saving distributed models

iwkim · November 17, 2019, 9:40am

Hi,

I’m trying to train a multi-agent reinforcement learning setup. Each agent has its own separate (distributed) network model, so the overall model consists of several nn.Module instances, one per agent.

I want to save the parameters of all the networks so that I can evaluate the trained model. How can I save all of them?

As far as I know, using

d=model.state_dict()

and

torch.save(d, path)

is the appropriate way. But the model in the commands above seems to refer to only one network, not all of the separate models.

How can I save all of the separate models?

Thank you

ptrblck · November 18, 2019, 6:35am

I’m not sure what “distributed” means in your use case.
Are you working with different models? If so, you could just save the state_dict of each model using a separate file.
Or are you working in a distributed setup, where the models are scattered and gathered across different nodes?
In that case, you would most likely want to reduce the model to the main node and just store this state_dict.
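
For the first case (several independent models), a minimal sketch could look like the following. The `make_policy` helper, the layer sizes, the number of agents, and the `agent_{i}.pt` file names are all placeholders for illustration and are not taken from the original post.

```python
import os
import tempfile

import torch
import torch.nn as nn


def make_policy():
    # Hypothetical policy network; the real architecture is not shown in the thread.
    return nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))


# One independent policy network per agent (three agents assumed here).
agents = [make_policy() for _ in range(3)]

save_dir = tempfile.mkdtemp()  # placeholder directory for the checkpoints

# Save each agent's parameters to its own file.
for i, policy in enumerate(agents):
    torch.save(policy.state_dict(), os.path.join(save_dir, f"agent_{i}.pt"))

# For evaluation: rebuild the same architectures, then load the saved weights.
restored = [make_policy() for _ in range(len(agents))]
for i, policy in enumerate(restored):
    state = torch.load(os.path.join(save_dir, f"agent_{i}.pt"))
    policy.load_state_dict(state)
    policy.eval()  # switch to evaluation mode
```

Note that a state_dict stores only the parameters and buffers, so the model classes have to be recreated with the same architecture before calling load_state_dict.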

iwkim · November 18, 2019, 7:12am

Thank you for your reply.
Specifically, I have a number of agents, and each agent has its own policy network.
So I called it ‘distributed’, but that was quite vague… sorry for the confusion.

I think the first option you suggested applies here, right?
I created a number of policy networks (models), so each one needs to be stored separately.
Am I right?

Thank you
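
If keeping one file per agent becomes unwieldy, a possible variant is to bundle all the state_dicts into a single dictionary and save that once. This is only a sketch under assumptions: the agent names, the `nn.Linear(4, 2)` stand-in networks, and the `all_agents.pt` file name are hypothetical.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical agents: names and layer sizes are illustrative only.
policies = {name: nn.Linear(4, 2) for name in ("agent_0", "agent_1")}

path = os.path.join(tempfile.mkdtemp(), "all_agents.pt")

# Bundle every agent's state_dict into one dictionary and save it once.
torch.save({name: net.state_dict() for name, net in policies.items()}, path)

# Later: recreate the networks and restore each one from the bundle.
checkpoint = torch.load(path)
loaded = {name: nn.Linear(4, 2) for name in checkpoint}
for name, net in loaded.items():
    net.load_state_dict(checkpoint[name])
```

Each agent's parameters stay separate inside the file, keyed by agent name, so individual agents can still be restored on their own.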