I have a pretty large embedding matrix (pretrained and frozen) and I don’t want to copy it to each GPU when using DataParallel.
My ideal setup: the embedding matrix lives on the CPU, the embedded inputs are pinned, and DataParallel sends each input to its respective GPU.
Is this possible? Or even reasonable? I’m kind of at a loss as to the right way to handle this.
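To make that concrete, here’s a rough sketch of the setup I have in mind (the DataLoader wiring is hypothetical, and embeddings, token_ids, Model, and batch_size are placeholders): do the lookup on the CPU inside the data-loading path so pin_memory=True pins the already-embedded batch, then let DataParallel scatter it.

import torch
from torch.utils.data import DataLoader, TensorDataset

embed = torch.nn.Embedding.from_pretrained(embeddings, freeze=True)  # stays on the CPU

def collate(batch):
    # Embed on the CPU; with pin_memory=True the DataLoader pins the result.
    ids = torch.stack([sample[0] for sample in batch])
    return embed(ids)

loader = DataLoader(TensorDataset(token_ids), batch_size=batch_size,
                    collate_fn=collate, pin_memory=True)

model = Model().cuda()
for emb in loader:
    # DataParallel chunks the pinned CPU batch and copies each chunk to its GPU.
    out = torch.nn.parallel.data_parallel(model, (emb,))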
I tried a few different settings. It seems the easiest thing to do is to ignore the pin_memory flag and embed everything on the CPU before calling DataParallel.
More or less this:
# `embeddings`, Model, batch_size, dim, and vocab_size are defined elsewhere.
import torch

embed = torch.nn.Embedding.from_pretrained(embeddings, freeze=True)  # stays on the CPU
model = Model()
model.cuda()

x = torch.LongTensor(batch_size, dim).random_(0, vocab_size)  # `to` is exclusive
# Pinning x at this point doesn't impact performance: the lookup below
# allocates a fresh (pageable) output tensor regardless.
emb = embed(x)
# Pinning emb at this point slightly slows things down, presumably because
# .pin_memory() does an extra host-side copy.
out = torch.nn.parallel.data_parallel(model, (emb, ))
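One caveat when timing these variants: CUDA calls are asynchronous, so without explicit synchronization the numbers don’t mean much. A minimal pattern (reusing model and emb from the snippet above):

import time

torch.cuda.synchronize()  # drain any pending GPU work before starting the clock
start = time.time()
out = torch.nn.parallel.data_parallel(model, (emb, ))
torch.cuda.synchronize()  # wait for the forward pass to actually finish
elapsed = time.time() - start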
Here’s some example code I used to try various settings: