How to free the GPU memory of a tensor list obtained by the all_gather_object API?

Following the example:

>>> # Note: Process group initialization omitted on each rank.
>>> import torch.distributed as dist
>>> # Assumes world_size of 3.
>>> gather_objects = ["foo", 12, {1: 2}] # any picklable object
>>> output = [None for _ in gather_objects]
>>> dist.all_gather_object(output, gather_objects[dist.get_rank()])
>>> output
['foo', 12, {1: 2}]

I used a similar approach to gather tensors into an output list during training. These tensors occupied too much GPU memory and caused a CUDA OOM in the later steps. I tried del and torch.cuda.empty_cache(), but the memory is not released. Is there another way to free the GPU memory?
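
Roughly what I am doing, as a minimal sketch (tensor names and shapes are made up, and the process group initialization is omitted):

import torch
import torch.distributed as dist

# Assumes the process group (e.g. the NCCL backend) is already initialized
# and each process has called torch.cuda.set_device(...) for its own rank.
feature = torch.randn(1024, 1024, device="cuda")   # per-rank CUDA tensor

gathered = [None for _ in range(dist.get_world_size())]
dist.all_gather_object(gathered, feature)          # every rank receives a copy from every rank

# What I tried afterwards to release the gathered copies:
del gathered
torch.cuda.empty_cache()   # only returns cached blocks whose tensors have no remaining references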

@roy_zhang could you please show how you detected that the memory was not released?

I just watched nvidia-smi. But this problem is solved now. I use all_gather in a complex scenario: the CUDA tensors are not actually transferred to the target GPU even though the target process can access all the tensors, so I guess it works through some kind of mapping? I am sure that each process creates a context on all GPUs, which makes the GPU memory usage grow. I moved the tensors to CPU in gather_objects, and now it is running…
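
The workaround looks roughly like this (a sketch with made-up names; process group setup omitted): the tensor is moved to CPU before the gather, so the copies that arrive on the other ranks live in host memory instead of opening CUDA contexts on every GPU.

import torch
import torch.distributed as dist

# Assumes the process group is already initialized.
feature = torch.randn(1024, 1024, device="cuda")

gathered = [None for _ in range(dist.get_world_size())]
dist.all_gather_object(gathered, feature.cpu())   # gather CPU copies instead of CUDA tensors

# If this rank needs one of the gathered tensors on its own GPU, move it back explicitly:
local_copy = gathered[0].to("cuda")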

Hi @roy_zhang,
What syntax are you using?
I am facing a similar situation: while gathering different parts of a model state dict, I am getting OOM.
But reading the docs, all_gather needs to have the objects on GPU when using NCCL.
My code is: torch.distributed.all_gather_object(full_model, model_state_dict)
So should I just add .cpu() to full_model?
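Or, following the CPU-first workaround above, would it be something like this sketch (the model here is just an illustrative stand-in, and the process group setup is omitted)?

import torch
import torch.nn as nn
import torch.distributed as dist

# Assumes the process group is already initialized.
model = nn.Linear(1024, 1024).cuda()   # stand-in for the real model

# Move every tensor in the state dict to CPU before gathering, so the
# gathered copies end up in host memory on the other ranks.
cpu_state_dict = {k: v.cpu() for k, v in model.state_dict().items()}

full_model = [None for _ in range(dist.get_world_size())]
dist.all_gather_object(full_model, cpu_state_dict)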