Help implementing Distributed Data Parallel?

I am trying to use DistributedDataParallel to run distributed training on one node/machine with 4 GPUs.
I think there is a problem with the saving/loading however.

I have the current code below:

    if config.local_rank == 0:
        torch.save(net.state_dict(), os.path.join(config.output_directory, "latest_net.pytorch"))
        torch.save(optimiser.state_dict(), os.path.join(config.output_directory, "latest_optimiser.pytorch"))

    # barrier() so every process waits for rank 0 to finish saving before loading
    torch.distributed.barrier()

    map_location = {'cuda:%d' % 0: 'cuda:%d' % config.local_rank}
    try:
        net.load_state_dict(torch.load(os.path.join(config.output_directory, "latest_net.pytorch"), map_location=map_location))
        optimiser.load_state_dict(torch.load(os.path.join(config.output_directory, "latest_optimiser.pytorch"), map_location=map_location))
    except Exception:
        print('It couldn\'t load on local rank %d' % config.local_rank)
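
In case it is relevant, the `map_location` I pass to `torch.load` is just a dict that remaps tensors saved from `cuda:0` onto the current process's GPU. Here is a minimal sketch of what it evaluates to (with a hard-coded `local_rank` standing in for `config.local_rank`):

```python
# Sketch of the map_location remapping dict built above.
# local_rank is hard-coded here as a stand-in for config.local_rank.
local_rank = 2

# Tensors saved on rank 0 (device cuda:0) get loaded onto this rank's GPU.
map_location = {'cuda:%d' % 0: 'cuda:%d' % local_rank}
print(map_location)  # {'cuda:0': 'cuda:2'}
```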

For every epoch, the loss doesn't decrease steadily the way it does in a single-process run. I suspect this is because the saving and loading is not occurring properly: I always get the printout "It couldn't load" on local ranks 1, 2, and 3. Can someone please provide some insight into what is blocking the saving/loading?