DDP: evaluation, gathering outputs, losses, and metrics. How to?

Hi, sorry for possible redundancy with other threads, but I didn't find an answer.

I'm trying to do evaluation in DDP.
The forward pass on each GPU works fine.
But how can I gather all the outputs on a single GPU (the master, for example) to measure metrics over the ENTIRE minibatch, since each process forwards only a chunk of it?
Alternatively, we could compute the metric on each GPU, reduce it on a single one to limit the communication overhead, and then broadcast it to the others if necessary.

How to do that? It is essentially an all-reduce op over the outputs.
DDP seems to focus only on syncing gradients.
What if you want to sync outputs, losses, or other things?
For example, the computed loss is not the full loss over the ENTIRE minibatch, but only over a chunk.

So, is there a way to gather things on demand?
I'm still reading the docs and may have missed something, but I don't think I saw anything about this; DDP probably doesn't do it and only syncs gradients.

PyTorch Lightning, and also the thread linked here, seem to have started working on this.
I'm still reading their thread.
Please feel free to leave a message or a code snippet showing how to do it.

The idea is to be able to evaluate using distributed computation, then gather the outputs or some intermediate metrics on a single machine to do the averaging.

Also, I can't evaluate properly because every process has access only to a chunk of the minibatch. So you can't loop over the entire dataset unless you create a NEW dataloader and do the forward pass with ddp.module rather than the DDP wrapper: if only the master runs the DDP forward, it will get stuck waiting for the other processes to do theirs. We are also using DistributedSampler, which prevents a single process from seeing the entire dataset.

Again, I may have missed something.

thanks

First of all, PyTorch Lightning has done it! That's cool.
Metrics over distributed models, with an entire package just for this.

Based on these threads (one and two), here are some solutions.

  1. Drop distributed computation, meaning you lose the distributed compute power and evaluate only on the master, for example. To do this, you need to drop the DistributedSampler for validation and use it only for the train set; the master can then see the entire dataset and you can run the evaluation and get the performance there. Either you let the other processes do the same thing as the master, which is a waste of energy (they will all have the correct performance value, but it is useless since the master, the one you log from, already has it), or you block the other processes and let only the master evaluate. In that case you need to add some blocking barriers and free only the master (see the sketch after this list). So this solution does not exploit distributed computation for evaluation.

  2. Use distributed computation. In this case, each GPU works on a chunk of the minibatch in parallel. For this you need to manually sync the OUTPUT of your model, either across all GPUs using all_gather, or just gather at the master, and then compute the metric. One may think of computing an unnormalized metric on each GPU and sending just that, to avoid sending the entire OUTPUT of the network; some metrics may not allow this. Also, you may lose the chunk size, so you need to sync it as well. Once everything is synced across the GPUs, you can compute the metric.
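For solution 1, here is a minimal sketch of master-only evaluation. It assumes the process group is already initialized; the names val_dataset and compute_metric are placeholders for your own dataset and metric function.

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader

@torch.no_grad()
def evaluate_on_master_only(ddp_model, val_dataset, device, batch_size=32):
    # solution 1: only rank 0 (the master) evaluates, the others wait at a barrier.
    # val_dataset and compute_metric are placeholders for your own dataset/metric.
    metric_value = None
    if dist.get_rank() == 0:
        # plain DataLoader, no DistributedSampler, so the master sees the whole set.
        val_loader = DataLoader(val_dataset, batch_size=batch_size)
        model = ddp_model.module  # forward through the wrapped module, not the DDP
        model.eval()              # wrapper, so no collective op is triggered.
        all_logits, all_targets = [], []
        for images, targets in val_loader:
            all_logits.append(model(images.to(device)).cpu())
            all_targets.append(targets)
        metric_value = compute_metric(torch.cat(all_logits), torch.cat(all_targets))
    dist.barrier()  # the other processes block here until the master is done.
    return metric_value  # only rank 0 holds the value; broadcast it if others need it.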

Downsides of solution 2:
a. You need to deal with each type of output your network produces.
b. The sync may create an overhead that slows things down. Does anyone know how much torch.distributed.all_gather costs compared to torch.distributed.gather? The latter is expected to be cheaper, but you need to keep track in your code that only the master has the right metric (see the gather sketch after this list).
c. Your code will be filled with sync calls.
d. There are metrics that do not tolerate this at all because they need to process each sample alone. And if the metric is a class that updates its state after seeing each sample, there will be a problem: you need to store all the predictions, sync them, then do the update by looping over each sample. Metrics that simply average over the entire set of predictions are fine.
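For point b, here is a sketch of gathering only at the master instead of all_gather. Note that backend support for dist.gather varies: gloo supports it, while nccl support depends on your PyTorch version, so check the backend table in the torch.distributed docs before relying on it.

import torch
import torch.distributed as dist

def gather_tensor_at_master(t: torch.Tensor, dst: int = 0):
    # gather t from all processes onto rank dst only; cheaper than all_gather
    # since only one rank receives, but only that rank ends up with the data.
    world_size = dist.get_world_size()
    if dist.get_rank() == dst:
        gather_list = [torch.zeros_like(t) for _ in range(world_size)]
    else:
        gather_list = None  # non-destination ranks must pass None
    dist.gather(t, gather_list=gather_list, dst=dst)
    if dist.get_rank() == dst:
        return torch.cat(gather_list, dim=0)
    return None  # only the master holds the full tensor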

Based on code from PyTorch Lightning, here is code to sync a tensor across GPUs, and pseudo-code showing how to use it:

import torch
import torch.distributed as dist
from typing import Union


def sync_tensor_across_gpus(t: Union[torch.Tensor, None]
                            ) -> Union[torch.Tensor, None]:
    # t needs to have dim 0 for torch.cat below.
    # if not, you need to prepare it.
    if t is None:
        return None
    group = dist.group.WORLD
    group_size = dist.get_world_size(group)
    gather_t_tensor = [torch.zeros_like(t) for _ in range(group_size)]
    # all_gather works with the nccl backend when the tensors are on GPU;
    # for the gloo and mpi backends, the tensors need to be on CPU.
    # this also works on a single machine with multiple GPUs. for multiple
    # nodes, you could use dist.all_gather_multigpu; both have the same
    # definition, see https://pytorch.org/docs/stable/distributed.html.
    # somewhere on the same page it is mentioned that dist.all_gather_multigpu
    # is more for multi-node setups, but I still don't see its benefit; the
    # working case in the doc is vague...
    dist.all_gather(gather_t_tensor, t)
    return torch.cat(gather_t_tensor, dim=0)

Usage:

# size of images: (32, 3, 224, 224)
cl_logits = self.model(images)
# size of cl_logits: (32, 100) on one process.
cl_logits = sync_tensor_across_gpus(cl_logits)
# size now: (64, 100)
# with 2 processes, each holding a chunk of 32 samples,
# torch.distributed.all_gather gives every GPU the full minibatch.

Again, please let me know what the optimal way to sync is: all_gather, or just gather at the master? Is the cost of all_gather huge when the tensors are large?

Also, for some metrics you can simply sync the unnormalized value to avoid syncing the model output; you then need to sync the total number of samples for normalization (see the sketch below).
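For example, here is a sketch of that idea for a metric that is a sum over samples (e.g. the number of correct predictions); the names are placeholders.

import torch
import torch.distributed as dist

def average_metric_across_gpus(metric_sum: torch.Tensor, n_samples: int) -> torch.Tensor:
    # metric_sum: the metric summed over this process's chunk (unnormalized).
    # n_samples: the size of this process's chunk.
    metric_sum = metric_sum.clone()  # avoid modifying the caller's tensor in place
    count = torch.tensor([float(n_samples)], device=metric_sum.device)
    dist.all_reduce(metric_sum, op=dist.ReduceOp.SUM)  # total over all processes
    dist.all_reduce(count, op=dist.ReduceOp.SUM)       # total number of samples
    return metric_sum / count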

thanks

DDP provides gradient synchronization across processes. If you require data to be shared between processes, you need to communicate between them yourself using the distributed communication package (torch.distributed): https://pytorch.org/docs/stable/distributed.html.
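For example, averaging a per-process loss boils down to a single all_reduce call; a minimal sketch, assuming the process group is already initialized:

import torch
import torch.distributed as dist

def average_across_processes(value: torch.Tensor) -> torch.Tensor:
    # average a scalar tensor (e.g. the loss computed on this process's chunk)
    # across all processes, so every rank sees the full-minibatch value.
    value = value.detach().clone()
    dist.all_reduce(value, op=dist.ReduceOp.SUM)
    return value / dist.get_world_size()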