I was wondering if anybody has run into a problem while training with mp.spawn() and DistributedDataParallel on two GPUs (one process per GPU), where wandb gets stuck and won't let the training continue.
Are you using wandb in both processes or just one? I would suggest having a single process (e.g. rank 0) handle logging and collect info from the other processes, as in the sketch below.
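Something along these lines (a minimal sketch, assuming two GPUs; the project name, toy model, and training loop are just placeholders):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
import wandb

def train(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Only rank 0 talks to wandb; the other process never calls it.
    if rank == 0:
        wandb.init(project="ddp-example")  # placeholder project name

    model = torch.nn.Linear(10, 1).to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for step in range(100):
        optimizer.zero_grad()
        loss = ddp_model(torch.randn(20, 10, device=rank)).pow(2).mean()
        loss.backward()
        optimizer.step()
        if rank == 0:
            wandb.log({"loss": loss.item()}, step=step)

    if rank == 0:
        wandb.finish()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
```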
I am calling wandb.init() in both processes. I also tried logging from a single process, but it still gets stuck.