Problem computing loss in multi-CPU distributed training

@soumith chintala

Hello everyone. My friend and I are trying to implement multi-CPU distributed training in PyTorch using Python sockets. We looked at PyTorch's built-in functions, but we couldn't find any starter code for detecting whether the CPUs are connected, so we decided to implement it ourselves with sockets.

Main problem:

We distributed the data perfectly, but I have a theoretical doubt. My plan is to compute the loss on each distributed CPU for its own batch, send all those losses to the main computer, and compute the average there.

Now I will send that average loss back to all the distributed CPUs and update each model according to it, so that the model weights stay synchronized. But the average loss I get back from the server computer has no computation-graph relationship to the model stored on each client. How can I tackle this?

Main doubt is:

Can we update a model using a loss that has no computation-graph relationship to that model?

Typically you want to run the forward and backward pass on each process separately, average the gradients across all processes, and then run the optimizer independently on each process.
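The reason this works, while averaging the scalar losses does not, is that the gradient is linear in the loss: averaging per-process gradients gives exactly the gradient of the averaged loss, and each process computes its gradient against its own local graph. A minimal numeric sketch in plain Python (no PyTorch), using a one-parameter model `y = w*x` with squared-error loss:

```python
# Gradient of L = (w*x - t)^2 with respect to w is 2*(w*x - t)*x.
def grad(w, x, t):
    return 2 * (w * x - t) * x

w = 1.0
# Two "workers", each holding its own mini-batch (input, target).
batches = [(2.0, 1.0), (3.0, 4.0)]

# What gradient-averaging training does: each worker computes its own
# gradient locally, then the gradients are averaged across workers.
avg_grad = sum(grad(w, x, t) for x, t in batches) / len(batches)

# What you would get by averaging the losses first on one machine and
# differentiating the averaged loss: d/dw [(L1 + L2)/2].
central_grad = (grad(w, 2.0, 1.0) + grad(w, 3.0, 4.0)) / 2

# By linearity of the derivative, the two are identical.
assert avg_grad == central_grad
```

A bare scalar loss sent over a socket carries none of this: the receiver cannot differentiate it, because the graph that produced it lives on the sender.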

I’m wondering why you’re trying to build this yourself. PyTorch has a DistributedDataParallel module, which does all of this for you.

Hi sir. Thank you for replying. We tried it yesterday and forgot to update here.

First problem
Coming to your question: we looked at various articles, but we are confused about using this. First we have to find out whether PyTorch identifies our connected computers. Could you please provide starter code to check that?

Second problem
If our main computer runs the training code, what should our client computers run in order to accept and process the data? We don't have any idea. We connected our computers using a LAN cable. Thank you for answering.

You can find a tutorial for this here: in particular, the `init_process_group` call that runs on every node is how PyTorch identifies all the connected nodes.
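As a rough sketch of what that looks like on each machine: every node runs the same initialization with its own rank, pointing at the master node's address. The `gloo` backend works on CPU, and `MASTER_ADDR`/`MASTER_PORT` are the conventional environment variables; the IP address and default port below are placeholders you would replace with your master node's LAN address.

```python
import os

def init_method_from_env(default_addr="192.0.2.10", default_port="29500"):
    """Build the TCP rendezvous URL from environment variables.
    The defaults here are illustrative placeholders."""
    addr = os.environ.get("MASTER_ADDR", default_addr)
    port = os.environ.get("MASTER_PORT", default_port)
    return f"tcp://{addr}:{port}"

def init_distributed(rank, world_size):
    """Join the process group; blocks until all nodes have connected.
    torch is imported locally so the helper above works without it."""
    import torch.distributed as dist
    dist.init_process_group(
        backend="gloo",                       # CPU-friendly backend
        init_method=init_method_from_env(),
        rank=rank,                            # 0 on the master, 1..N-1 on clients
        world_size=world_size,                # total number of machines
    )
    return dist.get_rank()
```

If `init_process_group` returns on every machine, PyTorch has found all the nodes; if a node cannot reach the master address, the call hangs or times out, which answers the "are they connected" question directly.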

To use DDP, all your nodes actually run the same code and train the same model. DDP takes care of gradient synchronization across all nodes to ensure the models stay in sync.
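"Same code on every node" means each rank simply trains on its own slice of the dataset. PyTorch's `DistributedSampler` handles this for you when you use DDP, but the idea can be sketched in plain Python (the helper name below is illustrative, not a real API):

```python
def shard_indices(num_samples, world_size, rank):
    """Round-robin shard of dataset indices for one rank, similar in
    spirit to what torch.utils.data.DistributedSampler does."""
    return list(range(rank, num_samples, world_size))

# With 10 samples across 3 nodes, each rank sees a disjoint slice
# and together they cover the whole dataset:
#   rank 0 -> [0, 3, 6, 9]
#   rank 1 -> [1, 4, 7]
#   rank 2 -> [2, 5, 8]
```

So the client machines do not run a separate "accept data" program: each one runs the training script with a different rank, loads its own shard, and DDP's gradient all-reduce keeps every copy of the model identical after each step.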