I usually use small backbones in my work, so if my machine has 4 GPUs I can tune hyperparameters by training 4 independent models, one per GPU. I therefore prepare each batch only once on the CPU and send it to all subprocesses through a queue, like this:
```python
ctx = mp.get_context('fork')
p = ctx.Process(target=secondary_training, args=(cfg, queue))
p.start()
```
It worked well before torch 1.12.
Now it raises an error saying that I should use another start method instead of 'fork'. But I do not understand how to configure it for my case, where the models are independent, with different hyperparameters and even different backbones, but the training data are the same.
Could you please tell me which parallel method suits this case, if 'fork' is no longer supported?
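For context, here is a stripped-down, torch-free sketch of what I assume the 'spawn' variant of my setup would look like (the `cfg` dicts, list batches, and the body of `secondary_training` are placeholders for my real configs, tensors, and training loop). Is this the right direction?

```python
import multiprocessing as mp

def secondary_training(cfg, queue):
    """Stand-in for the real training loop: cfg holds this model's
    hyperparameters, queue delivers the shared CPU-prepared batches."""
    total = 0.0
    while True:
        batch = queue.get()
        if batch is None:                  # sentinel: no more batches
            break
        total += sum(batch) * cfg["lr"]    # placeholder for a training step
    return total

def main():
    ctx = mp.get_context("spawn")           # 'spawn' instead of 'fork'
    configs = [{"lr": 0.1}, {"lr": 0.01}]   # one hyperparameter set per GPU
    # One queue per worker so every model sees every batch (a single
    # shared queue would *distribute* batches across workers, not
    # broadcast them).
    queues = [ctx.Queue() for _ in configs]
    workers = [ctx.Process(target=secondary_training, args=(cfg, q))
               for cfg, q in zip(configs, queues)]
    for p in workers:
        p.start()
    # Prepare each batch once on the CPU, then send it to all workers.
    for batch in ([1.0, 2.0], [3.0, 4.0]):
        for q in queues:
            q.put(batch)
    for q in queues:
        q.put(None)                         # one sentinel per worker
    for p in workers:
        p.join()

if __name__ == "__main__":
    main()
```

Note that with 'spawn' the worker function and its arguments must be picklable and defined at module level, which is why everything that children need lives outside the `__main__` guard.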