Best practices for training model X on pre-trained embeddings of model Y?

I’d like to train a sequence-to-sequence model X(input, output), where the input is a list of pre-trained embedding vectors (number_of_embeddings x embedding_dim) and the output is a sentence. The input is originally a sentence that gets embedded with an external BERT model Y.

I’m running into trouble with loading the embeddings.

So far I’ve tried the following:

  1. Embed all input with Y a priori, save to a pickle, then load it as the training dataset for model X (see the sketch below);
  2. Embed input samples/batches on the fly.
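
For reference, method 1 is roughly the sketch below (simplified; it assumes a recent transformers version and uses distilbert-base-uncased as a stand-in for Y, with a placeholder sentences list):

    import pickle

    import torch
    from transformers import DistilBertModel, DistilBertTokenizer

    tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
    bert = DistilBertModel.from_pretrained("distilbert-base-uncased").eval()

    # Placeholder; in reality this is tens of millions of lines read from disk.
    sentences = ["An example input sentence.", "Another one."]

    embeddings = []
    with torch.no_grad():
        for i in range(0, len(sentences), 32):
            batch = sentences[i:i + 32]
            enc = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
            out = bert(**enc).last_hidden_state  # (batch, seq_len, 768)
            # One (seq_len, 768) tensor per sentence; padded positions are included.
            embeddings.extend(list(out))

    with open("embeddings.pkl", "wb") as f:
        pickle.dump(embeddings, f)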

Method 1 works for small datasets, but since my input can reach tens of millions of sentences, I run out of memory storing the entire tensor (embedding_dim is 768).

Method 2 works when samples are embedded on the CPU (I’m using fairseq, where data pre-processing is done on the CPU). However, this is pretty slow and is going to create a huge bottleneck. I tried doing it on the GPU instead: the raw text is loaded on the CPU, the BERT model is loaded onto the GPU, and then, batch by batch, samples are tokenized, moved to the GPU, embedded, and moved back to the CPU (sketched below). However, this throws the following error:

    RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
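
Concretely, the on-the-fly GPU path looks roughly like the sketch below. This is a simplified stand-alone version, not the actual fairseq integration, and it again assumes a recent transformers version with distilbert-base-uncased standing in for Y; embed_batch is what ends up being called from the data-loading code.

    import torch
    from transformers import DistilBertModel, DistilBertTokenizer

    device = torch.device("cuda")
    tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
    bert = DistilBertModel.from_pretrained("distilbert-base-uncased").to(device).eval()

    def embed_batch(sentences):
        # Tokenize on the CPU, move the resulting tensors to the GPU,
        # run DistilBERT, then move the embeddings back to the CPU.
        enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        enc = {k: v.to(device) for k, v in enc.items()}
        with torch.no_grad():
            hidden = bert(**enc).last_hidden_state  # (batch, seq_len, 768)
        return hidden.cpu()

As far as I can tell, the error is raised as soon as embed_batch runs inside one of the forked data-loading workers.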

For BERT, I’m using DistilBertTokenizer and DistilBertModel from huggingface/transformers.
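
If I read the error message correctly, the worker processes need to be started with 'spawn' instead of 'fork'. I assume that would mean something like the following near the top of the script, before any workers are created (untested sketch; set_start_method comes from the standard torch.multiprocessing / Python multiprocessing API):

    import torch.multiprocessing as mp

    if __name__ == "__main__":
        # Start worker processes with 'spawn' so that CUDA can be
        # (re-)initialized inside the subprocesses.
        mp.set_start_method("spawn", force=True)

I’m not sure how to wire that into fairseq’s pre-processing pipeline, though, or whether it would actually remove the bottleneck.
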
Is there a better/more efficient way to do this? Thanks in advance!