DataParallel with pretrained embedding vectors

I need to use pretrained embedding vectors in my multi-GPU setup. These vectors contain about 50 million floats. Obviously I don’t want them to be broadcast and gathered on every batch. What’s the right way to use DataParallel to avoid this?

If the embeddings are the first layer of your network, it’s best to keep the 50 million floats on the CPU, look up the embeddings for each sequence there, and transfer only the looked-up vectors on the fly to the appropriate GPUs.

The embedding operation is dominated by sparse memory accesses, so it’s nicer to keep it on the CPU.
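
A minimal sketch of that setup might look like the following. The `Encoder` module, the table size, and the batch shape are all made up for illustration; the only point is that the embedding table lives outside the module wrapped in `nn.DataParallel`, so only the looked-up vectors are moved to the GPUs. It falls back to plain CPU execution if no GPU is available.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Hypothetical downstream network that consumes the embedded batch."""

    def __init__(self, embed_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(embed_dim, hidden_dim)

    def forward(self, embedded):
        # embedded: (batch, seq_len, embed_dim) -> (batch, hidden_dim)
        return self.proj(embedded).mean(dim=1)


# Small stand-in for the real pretrained table (assumption: 1000 x 50).
pretrained = torch.randn(1000, 50)

# The embedding table stays on the CPU and is frozen, so it is never
# replicated across GPUs and needs no gradient.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Only the rest of the network is placed on the GPU(s).
encoder = Encoder(embed_dim=50, hidden_dim=32).to(device)
if use_cuda and torch.cuda.device_count() > 1:
    encoder = nn.DataParallel(encoder)

# Token ids for one batch, kept on the CPU for the lookup.
token_ids = torch.randint(0, 1000, (8, 12))

embedded = embedding(token_ids)          # sparse lookup on the CPU
output = encoder(embedded.to(device))    # dense work on the GPU(s)
```

With this layout, each batch transfers only `batch * seq_len * embed_dim` floats to the GPUs rather than the whole table.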