DDP: evaluation, gathering outputs, losses, and other stuff. How to?

sbelharbi · August 30, 2021, 1:31am

sorry for possible redundancy with other threads, but i didn't find an answer.
hi,

trying to do evaluation in ddp.
forward in each gpu works fine.
but how can i gather all the outputs on a single gpu (the master, for example) to measure metrics once over the ENTIRE minibatch? each process forwards only a chunk of the minibatch.
or we can compute the metric on each gpu, average on a single one to reduce the communication overhead, then broadcast to the others only if necessary, also to reduce communication.

how to do that? it is an all-reduce op over the outputs.
ddp seems to focus only on synching grads…
what if you want to synch outputs, losses, other stuff…
for example the computed loss is not the full loss over ENTIRE minibatch, but only a chunk.

so, is there a way to gather things on demand?
still reading the doc.
i may have missed something.
but i don't think i saw any doc about this. ddp probably does not do this and only synchs grads.

pytorch lightning, and also here, seem to have started working on this…
still reading their thread.
please feel free to leave a message or a snippet of code on how to do it.

the idea is to be able to evaluate using distributed computation, then gather the outputs or some intermediate metrics on a single machine to do the average.

also, you can't evaluate properly because every process has access to only a chunk of the minibatch. so you can't loop over the entire dataset unless you create a NEW dataloader and use ddp.module to do the forward, not ddp. if you try to use the ddp forward while only the master is doing the forward, it will get stuck waiting for the other processes to do their forward… also, we are using distributedsampler, which prevents a process from seeing the entire data.

again, i may have missed something.

thanks

sbelharbi · August 30, 2021, 3:16am

first of all, pytorch lightning has done it!!! that’s cool.
metrics over distributed models, an entire package just for this.

based on these threads, one and two, here are some solutions.

1. drop distributed computation for evaluation, meaning you lose the distributed compute power. evaluate only on the master, for example. to do this, you need to drop the distributed sampler for validation and use it only for the trainset. the master can now see the entire dataset, so you can run the evaluation and get the performance on the master. either you let the other processes do the same thing as the master, which is a waste of energy: they will have the correct value of the performance, but that is useless because you already have it at the master, the one you will use to log the perf. or you block the other processes and let only the master do the evaluation. in this case, you need to add some blocking barriers and free only the master. so this solution does not exploit distributed computation for evaluation.
2. use distributed computation. in this case, each gpu will work on a chunk of the minibatch in parallel. for this you need to manually synch the OUTPUT of your model, either between all gpus using all_gather or with just a gather at the master, then compute the metric. one may think of computing an unnormalized metric on each gpu, then sending only that, in order to avoid sending the entire OUTPUT of a network. some metrics may not like that. also, you may lose the chunk size, so you need to synch it as well. once all the gpus are synched, you can compute the metric.
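
A minimal sketch of solution 1 with the blocking barrier could look like the following. The name `evaluate_on_master`, the `(images, labels)` loader format, and accuracy as the metric are assumptions made for illustration; the loader is assumed to be built without a distributed sampler so rank 0 sees the whole validation set.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def _dist_ready() -> bool:
    # True only when a process group has been initialized.
    return dist.is_available() and dist.is_initialized()


@torch.no_grad()
def evaluate_on_master(model, loader, device="cpu"):
    """Return accuracy on rank 0 and None on every other rank."""
    rank = dist.get_rank() if _dist_ready() else 0
    accuracy = None
    if rank == 0:
        # Unwrap DDP so the forward pass does not wait for the other ranks.
        net = model.module if isinstance(model, DDP) else model
        was_training = net.training
        net.eval()
        correct, total = 0, 0
        for images, labels in loader:
            logits = net(images.to(device))
            preds = logits.argmax(dim=1).cpu()
            correct += (preds == labels.cpu()).sum().item()
            total += labels.numel()
        net.train(was_training)  # restore the previous mode
        accuracy = correct / max(total, 1)
    if _dist_ready():
        # All other ranks wait here until rank 0 has finished evaluating.
        dist.barrier()
    return accuracy
```

If the evaluation is long, the other ranks may hit the process group timeout while waiting at the barrier, so the `timeout` passed to `init_process_group` may need to be raised.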

downsides of solution 2:
a. you need to deal with each type of network output.
b. the synch may create an overhead that will slow things down. does anyone know how much torch.distributed.all_gather costs compared to torch.distributed.gather? the latter is expected to be cheaper, but you need to keep track in your code that only the master has the right metric…
c. your code will be filled with synch calls…
d. there are metrics that do not like this at all because they need to process each sample alone. and if the metric is a class that updates its state after seeing each sample, there will be a problem… you need to store all the predictions, then synch them, then do the update by looping over each sample… metrics that can simply average over the entire set of predictions are fine.
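
For the case where the chunk size differs between processes (for instance a smaller last batch), one possible approach is to synch the sizes first, pad every chunk to the largest size, gather, and then drop the padding. The sketch below assumes this pad-and-trim approach; `pad_dim0`, `unpad_cat`, and `all_gather_variable` are hypothetical helper names, not part of torch.distributed.

```python
import torch
import torch.distributed as dist


def pad_dim0(t: torch.Tensor, n: int) -> torch.Tensor:
    """Zero-pad t along dim 0 up to length n."""
    out = t.new_zeros((n, *t.shape[1:]))
    out[: t.shape[0]] = t
    return out


def unpad_cat(chunks, sizes) -> torch.Tensor:
    """Drop the padding of each chunk and concatenate along dim 0."""
    return torch.cat([c[:n] for c, n in zip(chunks, sizes)], dim=0)


def all_gather_variable(t: torch.Tensor) -> torch.Tensor:
    """Gather tensors whose dim-0 size may differ across processes."""
    if not (dist.is_available() and dist.is_initialized()):
        return t  # single process: nothing to gather
    world_size = dist.get_world_size()
    # 1) synch the local chunk sizes.
    local_n = torch.tensor([t.shape[0]], device=t.device, dtype=torch.long)
    size_list = [torch.zeros_like(local_n) for _ in range(world_size)]
    dist.all_gather(size_list, local_n)
    sizes = [int(s.item()) for s in size_list]
    # 2) pad to a common shape, since all_gather needs equal shapes.
    padded = pad_dim0(t, max(sizes))
    gathered = [torch.zeros_like(padded) for _ in range(world_size)]
    dist.all_gather(gathered, padded)
    # 3) remove the padding and concatenate in rank order.
    return unpad_cat(gathered, sizes)
```

After gathering predictions and targets this way, a metric object that must see one sample at a time can be updated by looping over `zip(all_preds, all_targets)` on every process, or only on the master.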

based on code from pytorch lightning, here is code to synch a tensor, and pseudo-code to use it:

from typing import Union

import torch
import torch.distributed as dist

def sync_tensor_across_gpus(t: Union[torch.Tensor, None]
                            ) -> Union[torch.Tensor, None]:
    # t needs to have a dim 0 for torch.cat below, and the same shape on
    # every process. if not, you need to prepare it.
    if t is None:
        return None
    group = dist.group.WORLD
    group_size = dist.get_world_size(group)
    gather_t_tensor = [torch.zeros_like(t) for _ in
                       range(group_size)]
    # with the nccl backend, tensors need to be on gpu. with the gloo and mpi
    # backends, tensors generally need to be on cpu. dist.all_gather works on a
    # single machine with multiple gpus and also across multiple nodes.
    # dist.all_gather_multigpu has a similar definition, but it targets the
    # case where one process drives several gpus; with one process per gpu it
    # is not needed. see https://pytorch.org/docs/stable/distributed.html
    dist.all_gather(gather_t_tensor, t, group=group)
    return torch.cat(gather_t_tensor, dim=0)

usage:

# size images: (32, 3, 224, 224)
cl_logits = self.model(images)
# size cl_logits: (32, 100)  # one process.
cl_logits = sync_tensor_across_gpus(cl_logits)
# size now: (64, 100)
# gathers the 2 processes, where each has a chunk of 32 samples.
# using torch.distributed.all_gather, every gpu now has the full minibatch.

again, please let me know which is the cheapest way to synch: all_gather, or just gather at the master? is the cost of all_gather huge when the tensors are large?

also, for some metrics, you can simply synch the unnormalized value to avoid synching the model output. you then also need to synch the total number of samples for the normalization.
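
As a sketch of this idea, assuming classification accuracy as the metric, each process can send only two numbers (its count of correct predictions and its number of samples) and sum them with `dist.all_reduce`. The function name `global_accuracy` is hypothetical, and `labels` is assumed to live on the same device as `logits`.

```python
import torch
import torch.distributed as dist


def global_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Accuracy over all processes from unnormalized per-process counts."""
    # Unnormalized local statistics: number correct and number of samples.
    correct = (logits.argmax(dim=1) == labels).sum()
    count = torch.tensor(labels.numel(), device=logits.device)
    stats = torch.stack([correct, count]).to(torch.float64)
    if dist.is_available() and dist.is_initialized():
        # Sum both values over all processes; every rank gets the result.
        dist.all_reduce(stats, op=dist.ReduceOp.SUM)
    # Normalize only after the reduction.
    return (stats[0] / stats[1].clamp(min=1)).item()
```

This moves a 2-element tensor instead of the full `(batch, classes)` output, and it stays correct when the processes hold chunks of different sizes, which averaging per-process accuracies would not.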

thanks

gcramer23 · August 30, 2021, 11:05pm

DDP provides gradient synchronization across processes. If you require data to be shared between processes, you need to communicate between the processes yourself; see the Distributed communication package (torch.distributed) in the PyTorch 1.9.0 documentation.