I’m working on adapting my model (and my custom data loader) to the structure DDP expects. I haven’t tried running the code yet, but I’d like to understand the synchronization process better first.
According to the many great threads on this forum, DDP takes care of gradient synchronization during loss.backward(). But what if the number of samples in each data loader leads to different for-loop counts across processes? Would the process with n+1 iterations block, because the processes with only n iterations never reach the matching allreduce?
Say I have 401 images, distributed across 4 data loaders with 101, 100, 100, and 100 images respectively. With a batch size of 4, process 0 gets 26 iterations while the others get 25. Would my process group get stuck on the 26th iteration?
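To make the counts concrete, here is the per-rank iteration arithmetic I have in mind (plain Python, nothing DDP-specific):

```python
import math

per_rank_images = [101, 100, 100, 100]  # 401 images split across 4 ranks
batch_size = 4

# the last, smaller batch still counts as one iteration
iters = [math.ceil(n / batch_size) for n in per_rank_images]
print(iters)  # [26, 25, 25, 25] -> rank 0 runs one extra iteration
```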
Here is a simplified version of part of my code:
```python
# ......(some init process, including wrapping self.model in DDP;
#        dist is torch.distributed)......
for phase in ['train', 'eval']:
    dist.barrier()
    if phase == 'train':
        self.model.train()
        self.data_loader.train()
    else:
        self.model.eval()
        self.data_loader.eval()

    running_loss = 0
    for inputs, labels in self.data_loader:
        self.optimizer.zero_grad()
        with torch.set_grad_enabled(phase == 'train'):
            outputs = self.model(inputs)
            loss = self.loss(outputs, labels)
            if phase == 'train':
                ### Could this or the following line get stuck
                ### during the extra loop by process 0?
                loss.backward()
                self.optimizer.step()
        # weight the batch loss by the batch size (was `inputs.shape`,
        # which is a torch.Size and cannot be multiplied by a float)
        running_loss += loss.item() * inputs.shape[0]
        torch.cuda.empty_cache()

    epoch_loss = running_loss / len(self.data_loader)
```
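In case it is relevant to the answer: I have seen the join() context manager on DistributedDataParallel (available since PyTorch 1.7) mentioned as a way to handle exactly this kind of uneven input, by shadowing the collective calls for ranks that have exhausted their data. A minimal, untested sketch of how I would wrap my training loop with it (names taken from my code above):

```python
# Sketch only: self.model is assumed to already be wrapped in
# DistributedDataParallel. Inside join(), ranks that run out of batches
# keep participating in the allreduces, so ranks with an extra batch
# (e.g. process 0's 26th iteration) don't block forever.
with self.model.join():
    for inputs, labels in self.data_loader:
        self.optimizer.zero_grad()
        outputs = self.model(inputs)
        loss = self.loss(outputs, labels)
        loss.backward()
        self.optimizer.step()
```

I also understand that DistributedSampler with the default drop_last=False pads the index list with repeated samples so every rank gets the same number of batches, which would sidestep the problem; but since I use a custom loader, I'd still like to know whether the unguarded loop above can deadlock.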
Thanks for any helpful hints!