Following the example from the all_gather_object docs:
>>> # Note: Process group initialization omitted on each rank.
>>> import torch.distributed as dist
>>> # Assumes world_size of 3.
>>> gather_objects = ["foo", 12, {1: 2}] # any picklable object
>>> output = [None for _ in gather_objects]
>>> dist.all_gather_object(output, gather_objects[dist.get_rank()])
>>> output
['foo', 12, {1: 2}]
I used a similar approach to gather tensors into an output list during training. These tensors occupied too much GPU memory and caused a CUDA OOM in the next steps. I tried del and torch.cuda.empty_cache(), but the memory is not released. Is there another way to free the GPU memory?
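Roughly what I'm doing, as a minimal sketch (the tensor name local_feats and its shape are placeholders, not my real training code):

>>> import torch
>>> import torch.distributed as dist
>>> # Assumes dist.init_process_group() has already been called on each rank
>>> # and each rank is bound to its own GPU.
>>> rank = dist.get_rank()
>>> world_size = dist.get_world_size()
>>> # A CUDA tensor produced during training (placeholder shape).
>>> local_feats = torch.randn(4096, 4096, device=f"cuda:{rank}")
>>> # Gather every rank's tensor into a list, like the docs example above.
>>> output = [None for _ in range(world_size)]
>>> dist.all_gather_object(output, local_feats)
>>> # ... use output ...
>>> # My attempt to release the memory afterwards:
>>> del output, local_feats
>>> torch.cuda.empty_cache()
>>> # The memory still shows as allocated, and the next step hits CUDA OOM.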