When I wrap my model in nn.DataParallel, it requires me to move the model to the GPU (via .cuda()). However, it also requires moving the inputs to the forward pass to CUDA (e.g., via to(torch.device('cuda'))). I saw this post from 2017 mentioning that DataParallel allows CPU inputs, but I’m running into issues when passing CPU tensors to the parallelized model (specifically, it says it’s expecting CUDA tensors but didn’t get them). Maybe things have changed since then.
I want to avoid a case where I put my input on one GPU, then DataParallel has to take it off that GPU and distribute it on the rest, making it really slow.
Are there any optimizations I can do to ensure that I’m not doing any unnecessary transfer between GPU and CPU? And is it correct that I have to pass cuda tensors to a parallelized module?
Can you share an example snippet that shows this problem? Looking at the code that gets called, there is an explicit mention of copying one big input tensor to all the GPU devices you want to use.
My network is a seq2seq net with three inputs:
- Word embeddings for the input sequence
- Word embeddings for the output sequence
- Image-like inputs for each token in the output sequence
When I try to keep any of these three on CPU before the forward pass with DataParallel it complains that the tensors aren’t cuda tensors.
E.g., here’s a snippet of my input encoder:
# Tensor containing sequence word type indices
torch_indices: torch.Tensor = torch.zeros((len(examples), max(seq_lens)), dtype=torch.long)
for idx, (sequence, sequence_length) in enumerate(zip(batch_indices, seq_lens)):
    torch_indices[idx, :sequence_length] = torch.tensor(sequence, dtype=torch.long)
# Now embed things
batch_embeddings: torch.Tensor = self._embedder(torch_indices.to(DEVICE))
len(examples) gets me my batch size, max(seq_lens) gets me the maximum sequence length in the batch, and I iterate over the indexed sequences (batch_indices) and fill in the indices tensor with the word type indices. I then put it on the device: with a single GPU this is DEVICE=torch.device('cuda'), on CPU it is DEVICE=torch.device('cpu'), and with more than one GPU it is by default also DEVICE=torch.device('cpu'). Perhaps I shouldn’t use to (and explicitly place on CPU) at all when using DataParallel?
I just tested removing the call to to(DEVICE) in the snippet above, and it still gives an error saying it expects a CUDA tensor, e.g., in the call to the embedder:
RuntimeError: Expected object of type torch.cuda.LongTensor but found type torch.LongTensor for argument #3 'index'
self._embedder is an object of a class which extends nn.Module, and has an attribute of type nn.Embedding, which is on the GPU when I make this call to it.
I may be running into these problems given how I set my code up. I wanted the code to be adaptable for zero, one, or multiple GPUs, so I have a ModelWrapper class which keeps track of whether it’s being parallelized or not.
Internal to that I have a member model which is the actual nn.Module being parallelized. It extends both nn.Module and an abstract model class (I use an abstract model class so that I can have multiple kinds of model architectures, but the assumptions are that all models in my project have both an encoder and a decoder, and also implement forward as they are modules).
When initializing the ModelWrapper, I first create the model module. This object has attributes for the encoder and decoder (which are objects also extending nn.Module), and these attributes have attributes which are also modules, e.g., an embedding module, and so on. Once I create the model, if I have more than one GPU, I first wrap it in nn.DataParallel, and then put it on the GPU by calling
When I want to use the model during training, e.g., to compute the loss, I just call model(inputs) (do a forward pass), which returns a tensor.
Perhaps the call to nn.DataParallel is not actually distributing the model parameters across the GPUs correctly, given how I wrapped everything in classes?
I did verify that all parameters in my model are on the GPU, and during training all three GPUs are being used by the process.
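For anyone wanting to run the same check, here is a minimal sketch of one way to list the devices a module's parameters and buffers live on. The helper name parameter_devices and the small stand-in model are hypothetical, not taken from the code in this thread. Note that this only tells you where the weights are; it says nothing about tensors created inside forward.

```python
import torch
from torch import nn


def parameter_devices(module: nn.Module) -> set:
    """Return the set of devices holding the module's parameters and buffers."""
    devices = {p.device for p in module.parameters()}
    devices |= {b.device for b in module.buffers()}
    return devices


# Hypothetical stand-in model; any nn.Module can be inspected the same way.
model = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 4))
if torch.cuda.is_available():
    model = model.cuda()

# A single entry means every weight sits on the same device.
print(parameter_devices(model))
```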
Sounds like this error is expected here.
The input encoder you posted earlier will always build its index tensor on the CPU, since you don’t pass a device kwarg to torch.zeros. I’m assuming you’re calling the encoder from within the forward function. If you want inputs to be distributed to all GPUs, you need to call the wrapped module (the resulting model after wrapping it with nn.DataParallel) with the CPU side inputs, and nn.DataParallel will make sure the inputs are distributed accordingly. If you generate the encoded input from within the forward function, there is no place where nn.DataParallel could hook in and move the tensors around.
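A minimal sketch of that arrangement, assuming the padded index tensor can be built before the forward call. TinyEncoder and pad_indices are hypothetical names for illustration, not the classes from this thread. The batch is assembled on the CPU and handed to the wrapped module, which is then responsible for moving it to the GPUs; on a machine without GPUs, nn.DataParallel simply calls the module directly.

```python
from typing import List

import torch
from torch import nn


def pad_indices(batch_indices: List[List[int]]) -> torch.Tensor:
    """Build a zero-padded (batch, max_len) LongTensor on the CPU."""
    max_len = max(len(seq) for seq in batch_indices)
    padded = torch.zeros((len(batch_indices), max_len), dtype=torch.long)
    for row, seq in enumerate(batch_indices):
        padded[row, :len(seq)] = torch.tensor(seq, dtype=torch.long)
    return padded


class TinyEncoder(nn.Module):
    """Hypothetical encoder: forward receives a tensor, it never builds one."""

    def __init__(self, vocab_size: int = 50, dim: int = 8) -> None:
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)

    def forward(self, indices: torch.Tensor) -> torch.Tensor:
        # `indices` arrives already placed on this replica's device.
        return self.embedding(indices)


model = nn.DataParallel(TinyEncoder())
if torch.cuda.is_available():
    model = model.cuda()

# The batch is assembled once, on the CPU, outside of forward.
cpu_batch = pad_indices([[1, 2, 3], [4, 5], [6]])
output = model(cpu_batch)
print(output.shape)
```

The point of the design is that forward only ever sees tensors that were passed in as arguments, so there is no DEVICE constant to keep in sync with the replicas.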
I’m assuming you’re calling the encoder from within the forward function.
Yes, this code is all in the forward function for the instruction encoder Module (the Module object is an attribute of another Module which is wrapped in DataParallel, and its forward is called during the top-level forward call; it is very modular code!). The forward call for this model takes as input a list of string vectors seqs (List[List[str]]), and just before the call I posted, I convert them into lists of ints:
batch_indices: List[List[int]] = [[self._embedder.get_index(tok) for tok in seq] for seq in seqs]
seq_lens: List[int] = [len(seq) for seq in seqs]
I think I know what the issue is – does DataParallel require that the input to the forward calls be tensors so it can distribute them?
Another issue with assembling the batch in the forward function is that you end up doing the same work multiple times, depending on the number of GPUs you are using (the forward function is called N times).
Yes. If you pass the input batch (as a tensor) to nn.DataParallel, it will split it along the batch dimension and distribute the smaller batches to the participating GPUs.
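To get a feel for the split, here is a small CPU-only sketch that uses torch.chunk as a stand-in for the chunking along the batch dimension. The batch size of 8 and the device count of 3 are assumptions for illustration (3 matches the GPU count mentioned earlier in the thread); the actual scatter also copies each chunk to its GPU, which this sketch does not do.

```python
import torch

# 8 examples with 5 token indices each (illustrative sizes).
batch = torch.arange(8 * 5).reshape(8, 5)
num_devices = 3  # assumed GPU count

# Split along dim 0, the batch dimension.
chunks = torch.chunk(batch, num_devices, dim=0)
sizes = [chunk.shape[0] for chunk in chunks]
print(sizes)
```

When the batch size is not divisible by the number of devices, the last chunk is smaller than the others, so the last GPU gets slightly less work.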