Help implementing Distributed Data Parallel?

I am trying to use DistributedDataParallel to perform distributed training on a single node/machine with 4 GPUs.
I think there is a problem with the checkpoint saving/loading, however.
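
For context, my setup follows the usual single-node pattern; a rough sketch is below (heavily simplified: the `nn.Linear` model and the SGD optimiser are just stand-ins, and in my real code the local rank comes from the launcher into `config.local_rank`):

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    # one process per GPU, launched with: torchrun --nproc_per_node=4 train.py
    # (torchrun sets LOCAL_RANK; in my real code this ends up in config.local_rank)
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    net = nn.Linear(128, 10).cuda(local_rank)                # stand-in for my real network
    net = DDP(net, device_ids=[local_rank], output_device=local_rank)
    optimiser = torch.optim.SGD(net.parameters(), lr=0.01)   # stand-in for my real optimiser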

The saving/loading part of my code is below:

    if config.local_rank == 0:
        torch.save(net.state_dict(), os.path.join(config.output_directory, "latest_net.pytorch"))
        torch.save(optimiser.state_dict(), os.path.join(config.output_directory, "latest_optimiser.pytorch"))

        # barrier() to finish saving before loading
        torch.distributed.barrier()

    map_location = {'cuda:%d' % 0: 'cuda:%d' % local_rank}

    try:
        net.load_state_dict(torch.load(os.path.join(config.output_directory, "latest_net.pytorch"), map_location=map_location))
        optimiser.load_state_dict(torch.load(os.path.join(config.output_directory, "latest_optimiser.pytorch"), map_location=map_location))
    except:
        print('It couldn\'t load on local rank %d' % local_rank)

Across epochs, the loss does not decrease as steadily as it does in a single-process (non-distributed) run. I suspect this is because the saving and loading are not happening properly: I always get the printout "It couldn't load on local rank %d" for local ranks 1, 2, and 3. Can someone please provide some insight into what is blocking the saving/loading?
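
For reference, my understanding from the DDP tutorial is that the intended pattern is roughly the sketch below (using my variable names; the key difference from my code is that `dist.barrier()` is called by every rank, not just rank 0):

    import os
    import torch
    import torch.distributed as dist

    # rank 0 writes the checkpoint; the other ranks only read it
    if config.local_rank == 0:
        torch.save(net.state_dict(), os.path.join(config.output_directory, "latest_net.pytorch"))
        torch.save(optimiser.state_dict(), os.path.join(config.output_directory, "latest_optimiser.pytorch"))

    # every rank waits here until rank 0 has finished saving
    dist.barrier()

    # the tensors were saved from GPU 0, so remap them onto this process's own GPU
    map_location = {'cuda:0': 'cuda:%d' % config.local_rank}
    net.load_state_dict(torch.load(os.path.join(config.output_directory, "latest_net.pytorch"), map_location=map_location))
    optimiser.load_state_dict(torch.load(os.path.join(config.output_directory, "latest_optimiser.pytorch"), map_location=map_location))

Is it the fact that my barrier() sits inside the `if config.local_rank == 0:` block (so only rank 0 ever reaches it) that breaks things, or is something else going on?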