How to use embeddings and pinned memory for multi-GPU?

I have a pretty large embedding matrix (pretrained and frozen) and I don’t want to copy it to each GPU when using DataParallel.

My ideal situation: the embedding matrix stays on the CPU, the embedded input is pinned, and the embedded chunks are sent to their respective GPUs when using DataParallel.

Is this possible? Or even reasonable? I’m kind of at a loss about the right way to handle this.
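For concreteness, roughly what I mean (a sketch with made-up sizes, assuming a frozen pretrained matrix):

import torch

# Made-up sizes; the real matrix is pretrained and frozen.
vocab_size, dim = 100000, 300
embed = torch.nn.Embedding.from_pretrained(
    torch.randn(vocab_size, dim), freeze=True)  # lives on the CPU only

x = torch.randint(0, vocab_size, (64, 20))      # a batch of token ids, on CPU
emb = embed(x).pin_memory()                     # CPU lookup, pinned result
# Pinning is what lets the host-to-GPU copy run asynchronously:
emb_gpu = emb.cuda(non_blocking=True)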

Posting a couple links that might help figure this out:

Example custom dataloader with pin_memory on individual examples or batches:
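The gist of that pattern, as I understand it (a sketch along the lines of the PyTorch docs’ memory-pinning example; the class name is mine):

import torch
from torch.utils.data import DataLoader, TensorDataset

class PinnedBatch:
    # Custom batch type: when pin_memory=True, the DataLoader calls
    # this object's pin_memory() method on each collated batch.
    def __init__(self, samples):
        inputs, targets = zip(*samples)
        self.inputs = torch.stack(inputs)
        self.targets = torch.stack(targets)

    def pin_memory(self):
        self.inputs = self.inputs.pin_memory()
        self.targets = self.targets.pin_memory()
        return self

dataset = TensorDataset(torch.randn(1000, 300), torch.randint(0, 10, (1000,)))
loader = DataLoader(dataset, batch_size=32, collate_fn=PinnedBatch,
                    pin_memory=True)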

smth mentioning that DataParallel tries to use async transfers by default:

“Yes, DataParallel will try to use async=True by default.”
(from the thread “DataParallel model and pin_memory()”)
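If I read that right, it means something like this should work (toy model, just to illustrate the scatter):

import torch

dp = torch.nn.DataParallel(torch.nn.Linear(300, 10).cuda())

x = torch.randn(64, 300).pin_memory()  # pinned CPU input
# DataParallel scatters x along dim 0; because x is pinned, the per-GPU
# host-to-device copies during scatter can run asynchronously.
out = dp(x)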

I tried a few different settings. It seems the easiest thing to do is to ignore the pin_memory flag and embed everything on the CPU before calling DataParallel.

More or less this:

import torch

# `embeddings`, `Model`, `batch_size`, `dim`, and `vocab_size` come from the
# rest of my setup; the embedding itself stays on the CPU.
embed = torch.nn.Embedding.from_pretrained(embeddings, freeze=True)
model = Model()
model.cuda()

# Random token ids in [0, vocab_size).
x = torch.LongTensor(batch_size, dim).random_(0, vocab_size)
# If you pin x at this point, it doesn't impact performance.
emb = embed(x)  # the lookup happens on the CPU
# If you pin emb at this point, it slightly slowed performance.
out = torch.nn.parallel.data_parallel(model, (emb,))
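With this arrangement, data_parallel scatters emb along dim 0, so each GPU only receives its own slice of the embedded batch, and the embedding matrix itself never leaves the CPU.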

Here’s some example code I used to try various settings:
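Something along these lines (a sketch rather than the exact code; the sizes and the toy model are placeholders):

import time
import torch

def time_settings(pin_input, pin_emb, iters=100):
    # Toy stand-ins for the real setup: sizes and model are made up.
    embed = torch.nn.Embedding.from_pretrained(
        torch.randn(100000, 300), freeze=True)
    model = torch.nn.Linear(300, 10).cuda()

    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        x = torch.randint(0, 100000, (256, 20))
        if pin_input:
            x = x.pin_memory()      # pin the token ids
        emb = embed(x)              # embed on the CPU
        if pin_emb:
            emb = emb.pin_memory()  # pin the embedded batch
        torch.nn.parallel.data_parallel(model, (emb,))
    torch.cuda.synchronize()
    return time.time() - start

for pin_input in (False, True):
    for pin_emb in (False, True):
        t = time_settings(pin_input, pin_emb)
        print("pin_input=%s pin_emb=%s: %.3fs" % (pin_input, pin_emb, t))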

Using pinned memory for a large embedding matrix is not recommended either, because pinned memory is page-locked and not pre-emptible: the OS can't page it out, so a large pinned allocation permanently takes that much physical RAM away from everything else.
