How to collect tensors on all GPUs for each batch and save them

Hi, I want to collect the tensors on all GPUs for each minibatch and save them. Can someone suggest how to do that?


If you are using DDP (DistributedDataParallel), you can save them the same way you would without DDP (using torch.save), because every process (i.e. every GPU) runs the saving code. Include the GPU index in the filename so that different processes don't write to the same file.

I want to collect the tensors on all GPUs for each minibatch and save them.

Do you want all tensors to be on a single process before saving?

You can save a tensor using torch.save (see the torch.save page in the PyTorch documentation).
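For reference, a minimal save/load round trip looks like this (the filename x.pt is just a placeholder):

import torch

x = torch.rand(2, 3)
torch.save(x, 'x.pt')          # serialize the tensor to disk
x_loaded = torch.load('x.pt')  # read it back later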

Yes, DDP is fine. The tensors can stay on the GPUs. Each tensor should be saved to its own file, and I want to make sure no tensor is saved twice or missed. Can you suggest some sample code?

To avoid repetition you can include the GPU index in the filename when saving. Something like the following:

import os
import torch
import torch.distributed
import torch.multiprocessing

def save_tensors(gpu, total_gpus):
    torch.cuda.set_device(gpu)
    torch.distributed.init_process_group(backend='nccl', init_method='env://', world_size=total_gpus, rank=gpu)
    for i in range(100):
        # each rank saves its own tensor; the GPU index in the filename prevents name collisions
        tensor = torch.rand(2, 3, device=f'cuda:{gpu}')
        torch.save(tensor, f'tensor_{i}_gpu_{gpu}.pt')
    torch.distributed.destroy_process_group()

def main():
    gpu_count = torch.cuda.device_count()
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '1234'
    torch.multiprocessing.spawn(save_tensors, nprocs=gpu_count, args=(gpu_count,))

if __name__ == '__main__':
    main()
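If you instead want the tensors from every GPU collected onto a single process before saving (the option asked about above), a rough sketch using all_gather could look like the following. It assumes the tensor has the same shape on every rank and is launched with the same kind of spawn call as main() above; the save_gathered name and the tensors_{i}.pt filename are just placeholders:

import torch
import torch.distributed

def save_gathered(gpu, total_gpus):
    torch.cuda.set_device(gpu)
    torch.distributed.init_process_group(backend='nccl', init_method='env://', world_size=total_gpus, rank=gpu)
    for i in range(100):
        tensor = torch.rand(2, 3, device=f'cuda:{gpu}')  # stands in for this rank's minibatch tensor
        # all_gather needs one pre-allocated buffer per rank, all with the same shape
        gathered = [torch.empty_like(tensor) for _ in range(total_gpus)]
        torch.distributed.all_gather(gathered, tensor)
        if gpu == 0:
            # only rank 0 writes, so each minibatch produces exactly one file
            torch.save(torch.stack(gathered).cpu(), f'tensors_{i}.pt')
    torch.distributed.destroy_process_group()

Here torch.stack just bundles the per-rank tensors into one tensor of shape (world_size, 2, 3) so that each minibatch produces a single file.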