When using distributed training, how do I write logs?

I implemented my code based on https://github.com/pytorch/examples/blob/master/imagenet/main.py , and now I have 2 questions:
1. Is this code currently the best way to do distributed training?
2. How should I write logs? In my opinion, we should find the main process and write from it, but can mp.spawn do this?
Many thanks for your replies!

1. Is this code currently the best way to do distributed training?

This depends on the application requirements. For an overview of the available tools, see: PyTorch Distributed Overview — PyTorch Tutorials 2.1.1+cu121 documentation

2. How should I write logs? In my opinion, we should find the main process and write from it, but can mp.spawn do this?

You can use the rank to control which process does the logging, e.g.:

import torch.distributed as dist

# get_rank() returns this process's rank in the default process group,
# so only the rank-0 process runs the logging branch
if dist.get_rank() == 0:
    print("logging from rank 0")  # replace with your actual logging call
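
To answer the mp.spawn part of the question: yes, mp.spawn passes the process index as the first argument to the target function, so each worker already knows its rank and you can gate logging on it. Below is a minimal sketch; the function name run_worker, the gloo backend, and the address/port values are illustrative assumptions, not fixed requirements:

import os

import torch.distributed as dist
import torch.multiprocessing as mp


def run_worker(rank, world_size):
    # mp.spawn passes the worker index as the first positional argument
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # ... training loop ...

    if dist.get_rank() == 0:
        # only the main (rank-0) process writes the log
        print(f"rank {rank} writes the log for {world_size} workers")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size, join=True)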

If you are using RPC, the counterpart API is get_worker_info.
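
For completeness, here is a single-process sketch of the RPC variant; the worker name "worker0" and the port number are illustrative assumptions, and in a real job each worker would call init_rpc with its own name and rank:

import os

import torch.distributed.rpc as rpc

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29501"
rpc.init_rpc("worker0", rank=0, world_size=1)

info = rpc.get_worker_info()  # WorkerInfo of the current worker
if info.id == 0:
    # only worker 0 writes the log
    print(f"{info.name} writes the log")

rpc.shutdown()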
