Hi, I am using distributed data parallel as shown in the tutorial. I have 2 GPUs on a single machine. I want to train the model on all ranks but run the evaluation only on the rank 0 process. I set up a barrier during evaluation, but the rank 1 process never gets past the barrier. My code flow is like this:
def train(self, resume=False):
    for i in range(self._epoch + 1, self._niter + 1):
        self._train_sampler.set_epoch(i)
        self._train()
        if self._testset is not None and i % 1 == 0:  # validate every epoch
            # only rank 0 actually runs the validation
            if not self.distributed or self.rank == 0:
                print('rank {} go to validation'.format(self.rank))
                self._validate()
            # all ranks should sync up here before saving the checkpoint
            if self.distributed:
                print('rank {} go to barrier'.format(self.rank))
                dist.barrier()
                print('rank {} go out of barrier'.format(self.rank))
        self._epoch = i
        self.save_training(self._cfg.path.CHECKPOINT.format(self._epoch))
        if hasattr(self, '_scheduler'):
            self._scheduler.step()
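To make the synchronization I am aiming for clearer, here is a stripped-down, runnable sketch of just that pattern (no model; the gloo backend and a sleep stand in for the real validation so it runs on CPU):

import os
import time

import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # minimal process-group setup; gloo so the sketch runs without GPUs
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

    for epoch in range(2):
        # ... training step on every rank would go here ...
        if rank == 0:
            time.sleep(1)  # stand-in for the real validation on rank 0
        print('rank {} go to barrier'.format(rank))
        dist.barrier()  # all ranks wait here so rank 0 can finish validating
        print('rank {} go out of barrier'.format(rank))

    dist.destroy_process_group()


if __name__ == '__main__':
    mp.spawn(worker, args=(2,), nprocs=2)

This sketch runs to completion on my machine, so the barrier pattern itself seems fine.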
However, rank 1 freezes after the validation. The output looks like this:
training…
start to validate rank 1 go to barrier
validating…
rank 0 go to barrier
rank 0 go out of barrier
checkpoint is saved in xxx…
Then it just freezes at the start of the next training epoch.
The validation code looks like this:
def _validate(self):
    # with DDP, only rank 0 should actually run the evaluation
    if isinstance(self._model, DDP):
        if self.rank != 0:
            return
    print('start to validate')
    self._model.eval()
    results = []
    with torch.no_grad():
        for idx, (inputs, targets) in enumerate(tqdm.tqdm(self._testset, 'evaluating')):
            inputs = self._set_device(inputs)
            output = self._model(inputs)
            batch_size = len(output['boxes'])
            for i in range(batch_size):
                if len(output['boxes'][i]) == 0:
                    continue
                # convert boxes from xyxy to xywh
                output['boxes'][i][:, 2] -= output['boxes'][i][:, 0]
                output['boxes'][i][:, 3] -= output['boxes'][i][:, 1]
                for j in range(len(output['boxes'][i])):
                    results.append({'image_id': int(targets[i]['image_id']),
                                    'category_id': output['labels'][i][j].cpu().numpy().tolist(),
                                    'bbox': output['boxes'][i][j].cpu().numpy().tolist(),
                                    'score': output['scores'][i][j].cpu().numpy().tolist()})
    with open('temp_result.json', 'w') as f:
        json.dump(results, f)
    self.eval_result(dataset=self._dataset_name)  # use the COCO tools to evaluate the output file
If I remove the evaluation code, the barrier works as expected and rank 1 gets past it.
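I suspect the problem is that rank 0 calls forward on the DDP-wrapped model while rank 1 is already sitting in the barrier: as far as I understand, DDP broadcasts the module's buffers at the start of every forward() when broadcast_buffers=True (the default), and that collective has no partner on rank 1. One workaround I am considering (untested against my full code) is to evaluate through the underlying .module so the forward pass bypasses DDP entirely. A minimal sketch of the idea, using a toy model with BatchNorm (so it has buffers) and the gloo backend so it runs on CPU:

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29501'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

    # BatchNorm gives the model buffers, which DDP broadcasts at the
    # start of every forward pass when broadcast_buffers=True (default)
    model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.BatchNorm1d(4))
    ddp_model = DDP(model)

    if rank == 0:
        ddp_model.eval()
        with torch.no_grad():
            # calling ddp_model(x) here, with rank 1 absent, is what I
            # suspect deadlocks; going through .module skips DDP entirely
            out = ddp_model.module(torch.randn(2, 4))
        print('rank 0 validated, output shape', out.shape)

    dist.barrier()
    print('rank {} passed the barrier'.format(rank))
    dist.destroy_process_group()


if __name__ == '__main__':
    mp.spawn(worker, args=(2,), nprocs=2)

In my _validate this would mean calling self._model.module(inputs) instead of self._model(inputs) on rank 0. Is that the right way to do it?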
Does anyone know how to solve the problem?