Simple calculations using multiple GPUs

I am trying to use PyTorch to perform simple calculations across multiple GPUs; I am not trying to train a machine learning model. I've posted this in the distributed forum here, but I haven't gotten a response to one particular question. Here is the code I have so far:

import os

import torch
import torch.multiprocessing as mp
import torch.distributed as dist
import torch.nn.functional as F
import pandas as pd


def calc_cos_sims(rank, world_size):
    # Each spawned process needs the rendezvous address before
    # init_process_group, otherwise the process group can't form
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

    # Each rank reads its own pre-split slice of the data
    cuda_device = torch.device('cuda:' + str(rank))
    data_path = './embed_pairs_df_million_part_' + str(rank) + '.pkl'
    tmp_df = pd.read_pickle(data_path)

    embeds_a_tensor = torch.tensor(list(tmp_df['embeds_a']), device=cuda_device)
    embeds_b_tensor = torch.tensor(list(tmp_df['embeds_b']), device=cuda_device)

    # Row-wise cosine similarity; at this point each rank only
    # holds the result for its own slice
    cosine_tensor = F.cosine_similarity(embeds_a_tensor, embeds_b_tensor)

    dist.destroy_process_group()


def main():
    world_size = 4  # since I have 4 GPUs on a single machine
    mp.spawn(calc_cos_sims,
             args=(world_size,),
             nprocs=world_size,
             join=True)

if __name__ == '__main__':
    main()

Basically, the code calculates the cosine similarity between pairs of embeddings. I have 4 GPUs available, and I have split my data into 4 slices so that each slice runs on its own GPU (see the sketch below for how I did the split).
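For context, this is roughly how I produced the four pickle files (embed_pairs_df is just my name for the full DataFrame of embedding pairs):

import numpy as np

# Split the full DataFrame into one slice per GPU; np.array_split
# tolerates a length that doesn't divide evenly by 4, though the
# slices may then differ slightly in length
for rank, part in enumerate(np.array_split(embed_pairs_df, 4)):
    part.to_pickle('./embed_pairs_df_million_part_' + str(rank) + '.pkl')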

It was recommended that I use the PyTorch collective API to aggregate the results. I read through the docs, but I'm not entirely sure how to implement it. How would that be done in this case, or is there a better way to do all of this? I'd like to save the aggregated results to a file, or have them available for use later in my program.
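From my reading of the docs, something like the sketch below is what I'm imagining at the end of calc_cos_sims (just before the destroy_process_group() call). I'm assuming every rank's slice is the same length, since dist.all_gather requires equal shapes (otherwise I believe dist.all_gather_object would be needed), and ./cos_sims_all.pt is just a file name I picked:

    # The gloo backend's collectives work on CPU tensors, so move the
    # local slice off the GPU before gathering
    local_result = cosine_tensor.cpu()

    # Every rank provides a list of same-shaped buffers to receive into;
    # after the call, each rank holds all four slices
    gathered = [torch.zeros_like(local_result) for _ in range(world_size)]
    dist.all_gather(gathered, local_result)

    # Have only rank 0 write the combined result to disk
    if rank == 0:
        all_cos_sims = torch.cat(gathered)
        torch.save(all_cos_sims, './cos_sims_all.pt')

I think dist.gather to rank 0 would avoid copying everything to every rank, but all_gather seemed simpler to reason about. Is this on the right track?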

I welcome any feedback about potential improvements. Thank you in advance!
