Optimizing CPU-GPU data transfer with nn.DataParallel

When I wrap my model in nn.DataParallel, it requires me to move the model to the GPU (via .cuda()). However, it also requires moving the inputs to the forward pass to CUDA (e.g., via to(torch.device('cuda'))). I saw this post from 2017 mentioning DataParallel allows CPU inputs, but I’m running into issues when passing CPU tensors to the parallelized model (specifically, it’s saying it’s expecting cuda tensors but didn’t get cuda tensors). Maybe things changed since then.

I want to avoid a case where I put my input on one GPU, then DataParallel has to take it off that GPU and distribute it on the rest, making it really slow.

Are there any optimizations I can do to ensure that I’m not doing any unnecessary transfer between GPU and CPU? And is it correct that I have to pass cuda tensors to a parallelized module?


Can you share an example snippet that shows this problem? Looking at the code that gets called, there is an explicit mention of copying one big input tensor to all GPU devices you want to use.

My network is a seq2seq net with three inputs:

  • Word embeddings for the input sequence
  • Word embeddings for the output sequence
  • Image-like inputs for each token in the output sequence

When I try to keep any of these three on CPU before the forward pass with DataParallel it complains that the tensors aren’t cuda tensors.

E.g., here’s a snippet of my input encoder:

# Tensor containing sequence word type indices
torch_indices: torch.Tensor = torch.zeros((len(examples), max(seq_lens)), dtype=torch.long)
for idx, (sequence, sequence_length) in enumerate(zip(batch_indices, seq_lens)):
    torch_indices[idx, :sequence_length] = torch.tensor(sequence, dtype=torch.long)

# Now embed things
batch_embeddings: torch.Tensor = self._embedder(torch_indices.to(DEVICE))

len(examples) gives me my batch size, max(seq_lens) gives me the maximum sequence length in the batch, and I iterate over the indexed sequences (batch_indices), writing each sequence’s word type indices into the indices tensor. I then put it on the device: with a single GPU this is DEVICE=torch.device('cuda'); on CPU it is DEVICE=torch.device('cpu'); and when I have more than one GPU it also defaults to DEVICE=torch.device('cpu'). Perhaps I shouldn’t use to (and explicitly place on CPU) at all when using DataParallel?

I just tested removing the call to to(DEVICE) in the snippet above, and it still gives an error that it’s expecting a cuda tensor, e.g., in the call to the embedder:

RuntimeError: Expected object of type torch.cuda.LongTensor but found type torch.LongTensor for argument #3 'index'

self._embedder is an object of a class which extends nn.Module, and has an attribute of type nn.Embedding, which is on the GPU when I make this call to it.

I may be running into these problems given how I set my code up. I wanted the code to be adaptable for zero, one, or multiple GPUs, so I have a ModelWrapper class which keeps track of whether it’s being parallelized or not.

Internal to that I have a member model which is the actual nn.Module being parallelized. It extends both nn.Module and an abstract model class (I use an abstract model class so that I can have multiple kinds of model architectures, but the assumptions are that all models in my project have both an encoder and a decoder, and also implement forward as they are modules).

When initializing the ModelWrapper, I first create the model module. This object has attributes for the encoder and decoder (which are objects also extending nn.Module), and these attributes have attributes which are also modules, e.g., an embedding module, and so on. Once I create the model, if I have more than one GPU, I first wrap it in nn.DataParallel, and then put it on the GPU by calling model.cuda().

When I want to use the model during training, e.g., to compute the loss, I just call model(inputs) (do a forward pass), which returns a tensor.

Perhaps the call to nn.DataParallel is not actually distributing the model parameters on the GPU correctly, given how I wrapped everything in classes?

I did verify that all parameters in my model are on the GPU, and during training all three GPUs are being used by the process.

Sounds like this error is expected here.

The input encoder you posted earlier will always run on CPU, since you don’t pass a device kwarg to torch.zeros. I’m assuming you’re calling the encoder from within the forward function. If you want inputs to be distributed to all GPUs, you need to call the wrapped module (the resulting model after wrapping it with nn.DataParallel) with the CPU-side inputs, and nn.DataParallel will make sure the inputs are distributed accordingly. If you generate the encoded input from within the forward function, there is nowhere for nn.DataParallel to hook in and move the tensors around.
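A minimal sketch of that arrangement, assuming a toy model: the padded index tensor is built once on the CPU, outside forward, and handed to the wrapped module, which takes care of placing it on the GPU(s). ToyEncoder, pad_batch, the vocabulary size, and the embedding width are hypothetical stand-ins and not taken from the code posted above.

```python
from typing import List

import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for the real encoder: embeds indices, then mean-pools."""

    def __init__(self, vocab_size: int = 100, dim: int = 16) -> None:
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim, padding_idx=0)

    def forward(self, indices: torch.Tensor) -> torch.Tensor:
        # The indices arrive already placed on this replica's device,
        # so there is no .to(DEVICE) call inside forward.
        return self.embedding(indices).mean(dim=1)


def pad_batch(batch_indices: List[List[int]]) -> torch.Tensor:
    """Build a zero-padded (batch, max_len) LongTensor on the CPU."""
    max_len = max(len(seq) for seq in batch_indices)
    out = torch.zeros((len(batch_indices), max_len), dtype=torch.long)
    for row, seq in enumerate(batch_indices):
        out[row, : len(seq)] = torch.tensor(seq, dtype=torch.long)
    return out


model = ToyEncoder()
if torch.cuda.is_available():
    model = model.cuda()  # parameters go to the GPU before wrapping
parallel_model = nn.DataParallel(model)  # falls back to a plain call without GPUs

cpu_batch = pad_batch([[5, 6, 7], [8, 9], [10]])  # assembled once, on the CPU
output = parallel_model(cpu_batch)  # the wrapper moves/splits the CPU tensor
print(cpu_batch.device, tuple(output.shape))
```

The design choice here is that only tensors cross the DataParallel boundary; converting tokens to indices and padding happen before the call, so that work is not repeated per replica.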

I’m assuming you’re calling the encoder from within the forward function.

Yes, this code is all in the forward function for the instruction encoder Module (the Module object is an attribute of another Module that is wrapped in DataParallel, and its forward is called during the top-level forward call. It is very modular code!). The forward call for this model takes as input a list of string sequences seqs (List[List[str]]), and just before the snippet I posted, I convert them into lists of ints:

batch_indices: List[List[int]] = [[self._embedder.get_index(tok) for tok in seq] for seq in seqs]
seq_lens: List[int] = [len(seq) for seq in seqs]

I think I know what the issue is – does DataParallel require that the input to the forward calls be tensors so it can distribute them?

Another issue with assembling the batch in the forward function is that you end up doing the same work multiple times, depending on the number of GPUs you are using (the forward function is called N times).

Yes. If you pass the input batch (as a tensor) to nn.DataParallel, it will split along the batch dimension and distribute the smaller batches to participating GPUs.
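To get a feel for the split sizes, here is a small CPU-only illustration. It assumes the scatter step divides the batch along dimension 0 the same way torch.chunk does, which is my understanding of the current behavior; torch.chunk is used here only as a stand-in and is not the code path DataParallel itself runs. The batch shape and the GPU count of 3 are made up for the example.

```python
import torch

# Hypothetical batch: 8 examples with 4 features each.
batch = torch.arange(8 * 4).reshape(8, 4)
num_gpus = 3  # assumed GPU count, for illustration only

# Split along the batch dimension into one piece per (assumed) GPU.
chunks = torch.chunk(batch, num_gpus, dim=0)
chunk_sizes = [chunk.shape[0] for chunk in chunks]
print(chunk_sizes)  # [3, 3, 2]
```

When the batch size is not divisible by the number of GPUs, the last piece is smaller, so the forward function should not assume a fixed per-replica batch size.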
