How to store embeddings from different ranks in DistributedDataParallel mode?

I want to run my model on a dataset and store all embeddings using DistributedDataParallel. I created a dataloader with a DistributedSampler and now want to store all the embeddings in the form:
(image_name, embedding)

After that I want to save them as a CSV or pickle file.

Would it be correct to create a global list and store the data there, or will there be conflicts when writing to the list?

By “global list”, do you mean a Python global variable? Wouldn’t that create a separate global list per process? Who would be writing to the global list? BTW, any reason for not using nn.Embedding?

Yes, by “global list” I mean a global Python variable. I am using mp.spawn to start distributed training, so I thought that variables defined at module level in the script would be visible to all ranks. But after running the code, nothing had been written into the dict. What are the benefits of using nn.Embedding? I want to store image_name and embeddings.

Right, globals are per-process, so each spawned child process will have its own copy of the global variable.
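
A minimal sketch illustrating this, assuming mp.spawn from torch.multiprocessing: each child gets its own copy of the module-level list, so appends made in the workers never show up in the parent.

```python
import torch.multiprocessing as mp

results = []  # module-level ("global") list


def worker(rank):
    # Each spawned process has its own copy of `results`;
    # this append is invisible to the parent and to other ranks.
    results.append((rank, f"embedding_from_rank_{rank}"))
    print(f"rank {rank} sees {len(results)} item(s)")


if __name__ == "__main__":
    mp.spawn(worker, nprocs=2)
    # Back in the parent process the list is still empty.
    print("parent sees", len(results), "item(s)")  # -> 0
```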

What are the benefits of using nn.Embedding?

One benefit is that you can then run lookup ops on the GPU. And if you need to let the training process update the embeddings as well, nn.Embedding will make that easier.
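
For context, a minimal sketch of what a GPU-side lookup with nn.Embedding could look like. The name_to_idx mapping is an assumption here, since nn.Embedding is indexed by integers rather than by strings like image names:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical mapping from image names to row indices.
name_to_idx = {"img_001.jpg": 0, "img_002.jpg": 1, "img_003.jpg": 2}

emb = nn.Embedding(num_embeddings=len(name_to_idx), embedding_dim=128).to(device)

# The lookup runs on the GPU, and emb.weight can also be updated by training.
idx = torch.tensor([name_to_idx["img_002.jpg"]], device=device)
vector = emb(idx)  # shape: (1, 128)
```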

I want to store image_name and embeddings.

If you would like to pass that data back to the main process, one option is to use a multiprocessing SimpleQueue. See the example below.
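
A minimal sketch of that pattern, assuming two ranks and a hypothetical compute_embedding helper standing in for the real model forward pass: each worker pushes (image_name, embedding) pairs plus a sentinel into a SimpleQueue, and the main process drains the queue and pickles everything into one file.

```python
import pickle

import torch
import torch.multiprocessing as mp


def compute_embedding(name):
    # Placeholder for the real DDP forward pass on this rank's batch.
    return torch.randn(128)


def worker(rank, world_size, queue):
    # In the real application this loop would iterate over the rank's
    # shard of the dataloader (via DistributedSampler).
    for i in range(3):
        name = f"img_rank{rank}_{i}.jpg"
        emb = compute_embedding(name)
        # Plain Python data is copied by value through the queue.
        queue.put((name, emb.tolist()))
    queue.put(None)  # sentinel: this rank is done


if __name__ == "__main__":
    world_size = 2
    queue = mp.get_context("spawn").SimpleQueue()

    # join=False so the main process can drain the queue while the workers run.
    ctx = mp.spawn(worker, args=(world_size, queue),
                   nprocs=world_size, join=False)

    results, finished = [], 0
    while finished < world_size:
        item = queue.get()
        if item is None:
            finished += 1
        else:
            results.append(item)

    while not ctx.join():
        pass

    # Everything ends up in one file, written by the main process only.
    with open("embeddings.pkl", "wb") as f:
        pickle.dump(results, f)
```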

I am trying to understand the requirement. In your application, does each subprocess produce image embeddings independently and concurrently, and then you want to save those?

Yes, each subprocess generates embeddings from its dataloader batches. I want to process all my data (generate the embeddings) as fast as possible, which is why I want to use DistributedDataParallel, and then save everything in one file.
