Help implementing Distributed Data Parallel?

I am trying to use DistributedDataParallel to run distributed training on one node/machine with 4 GPUs.
I think there is a problem with the saving/loading however.

I have the current code below:

    if config.local_rank == 0:
        torch.save(net.state_dict(), os.path.join(config.output_directory, "latest_net.pytorch"))
        torch.save(optimiser.state_dict(), os.path.join(config.output_directory, "latest_optimiser.pytorch"))

    # barrier() so every process waits for rank 0 to finish saving before loading
    torch.distributed.barrier()

    map_location = {'cuda:%d' % 0: 'cuda:%d' % config.local_rank}
    try:
        net.load_state_dict(torch.load(os.path.join(config.output_directory, "latest_net.pytorch"), map_location=map_location))
        optimiser.load_state_dict(torch.load(os.path.join(config.output_directory, "latest_optimiser.pytorch"), map_location=map_location))
    except Exception:
        print('It couldn\'t load on local rank %d' % config.local_rank)
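
In case it is relevant, the `map_location` I pass to `torch.load` is just a dict that remaps tensors saved from `cuda:0` onto the current process's GPU. Here is a minimal sketch of what it evaluates to (with a hard-coded `local_rank` standing in for `config.local_rank`):

```python
# Sketch of the map_location remapping dict built above.
# local_rank is hard-coded here as a stand-in for config.local_rank.
local_rank = 2

# Tensors saved on rank 0 (device cuda:0) get loaded onto this rank's GPU.
map_location = {'cuda:%d' % 0: 'cuda:%d' % local_rank}
print(map_location)  # {'cuda:0': 'cuda:2'}
```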

For every epoch, the loss doesn't decrease steadily the way it does in a single-process run. I suspect this is because the saving and loading is not occurring properly: I always get the printout "It couldn't load" on local ranks 1, 2, and 3. Can someone please provide some insight into what is blocking the saving/loading?