I have a pretty large embedding matrix (pretrained and frozen) and I don’t want to copy it to each GPU when using DataParallel.
My ideal setup: the embedding matrix lives on the CPU, the embedded inputs are pinned, and DataParallel sends each input to its respective GPU.
Is this possible? Or even reasonable? I’m kind of at a loss as to the right way to handle this.
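To make that concrete, here’s a rough sketch of the setup I have in mind (the DataLoader wiring is hypothetical, and embeddings, token_ids, Model, and batch_size are placeholders): do the lookup on the CPU inside the data-loading path so pin_memory=True pins the already-embedded batch, then let DataParallel scatter it.

import torch
from torch.utils.data import DataLoader, TensorDataset

embed = torch.nn.Embedding.from_pretrained(embeddings, freeze=True)  # stays on the CPU

def collate(batch):
    # Embed on the CPU; with pin_memory=True the DataLoader pins the result.
    ids = torch.stack([sample[0] for sample in batch])
    return embed(ids)

loader = DataLoader(TensorDataset(token_ids), batch_size=batch_size,
                    collate_fn=collate, pin_memory=True)

model = Model().cuda()
for emb in loader:
    # DataParallel chunks the pinned CPU batch and copies each chunk to its GPU.
    out = torch.nn.parallel.data_parallel(model, (emb,))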
I tried a few different settings. It seems the easiest thing to do is to ignore the pin_memory flag and embed everything on the CPU before calling DataParallel.
More or less this:
# `embeddings`, Model, batch_size, dim, and vocab_size are defined elsewhere.
import torch

embed = torch.nn.Embedding.from_pretrained(embeddings, freeze=True)  # stays on the CPU
model = Model()
model.cuda()

x = torch.LongTensor(batch_size, dim).random_(0, vocab_size)  # `to` is exclusive
# Pinning x at this point doesn't impact performance: the lookup below
# allocates a fresh (pageable) output tensor regardless.
emb = embed(x)
# Pinning emb at this point slightly slows things down, presumably because
# .pin_memory() does an extra host-side copy.
out = torch.nn.parallel.data_parallel(model, (emb, ))
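One caveat when timing these variants: CUDA calls are asynchronous, so without explicit synchronization the numbers don’t mean much. A minimal pattern (reusing model and emb from the snippet above):

import time

torch.cuda.synchronize()  # drain any pending GPU work before starting the clock
start = time.time()
out = torch.nn.parallel.data_parallel(model, (emb, ))
torch.cuda.synchronize()  # wait for the forward pass to actually finish
elapsed = time.time() - start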
Here’s some example code I used to try various settings: