Embedding layer: arguments located on different gpus

I am using nn.DataParallel and I get an error inside the embedding layer that says “RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorIndex.cu:403”. My network architecture is the following:

class WordEmbeddingNetwork(nn.Module):

    def __init__(self, word_embeddings_path, word2id, pad_token, unk_token, freeze=False):

        super(WordEmbeddingNetwork, self).__init__()
        self.pad_token = pad_token
        self.unk_token = unk_token
        self.word2id = word2id
        self.embedding_file = word_embeddings_path.split('/')[-1]

        embedding_weights = self.get_embeddings_weights(OOV_corrections)

        num_embeddings, self.embedding_dim = embedding_weights.shape
        self.embedding_layer = nn.Embedding(num_embeddings, self.embedding_dim)
        self.embedding_layer.load_state_dict({'weight': embedding_weights})
        if freeze:
            for p in self.embedding_layer.parameters():
                p.requires_grad = False

    def forward(self, batch):
        emb = self.embedding_layer(batch)
        return emb
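As a side note, recent PyTorch versions can do the weight loading and freezing above in one call with `nn.Embedding.from_pretrained` — a minimal CPU sketch (the 5x3 weight matrix is made up for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical pretrained weights: 5 words, 3-dimensional vectors.
weights = torch.randn(5, 3)

# from_pretrained copies the weights and, with freeze=True, sets
# requires_grad=False on them, matching the freeze flag above.
emb = nn.Embedding.from_pretrained(weights, freeze=True)

out = emb(torch.tensor([0, 2]))  # look up rows 0 and 2
```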

class MyNet(nn.Module):

    _HIDDEN_SIZE = 300

    def __init__(self, word_embeddings_path, word2id, pad_token, unk_token, seed, device='cpu'):
        super(MyNet, self).__init__()

        self.device = device
        self.word_embeddings_layer = WordEmbeddingNetwork(word_embeddings_path=word_embeddings_path,
                                                          word2id=word2id,
                                                          pad_token=pad_token,
                                                          unk_token=unk_token)

    def forward(self, utterances, ...):

I don’t understand why the embedding layer and the given input are on different GPUs. Can you help me?

I tried to remove the embeddings from the GPU and keep them on the CPU. Now I get the same error on the LSTM; it seems to me that nn.DataParallel moves things incorrectly from one GPU to another:

RuntimeError: Input and parameter tensors are not at the same device, found input tensor at cuda:0 and parameter tensor at cuda:1

Finally I have solved it. nn.DataParallel moves only tensors to the correct GPU: if the input to your model’s forward() method is a list of tensors, you need to move the tensors in the list to the correct GPU one by one. The correct GPU can be retrieved from the .device attribute of a tensor that nn.DataParallel has already moved automatically. Never force a .to(device) with the wrong device!
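A minimal sketch of this fix (the helper name is illustrative, not part of any API): inside forward(), read the device from a tensor that DataParallel has already placed, then move each list element there:

```python
import torch

def move_list_to_anchor_device(tensor_list, anchor):
    # `anchor` stands for a tensor that nn.DataParallel has already moved
    # to this replica's GPU; its .device tells us where the replica runs.
    # (On a CPU-only machine this is simply cpu, so the sketch runs anywhere.)
    device = anchor.device
    return [t.to(device) for t in tensor_list]

anchor = torch.zeros(2)  # stands in for an input scattered by DataParallel
parts = move_list_to_anchor_device([torch.ones(3), torch.ones(4)], anchor)
```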

Hey @Seo, IIUC, DataParallel should be able to automatically scatter the tensors in the input list to the correct devices along the batch dimension. It uses the following code. Is your use case different from this assumption?
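For context, the stock scatter splits each tensor along dim 0, the batch dimension; torch.chunk produces the same split sizes and shows the behavior on CPU (the 8-sample batch is made up for illustration):

```python
import torch

# A batch of 8 samples split for 3 devices: torch.chunk uses the same
# ceil-sized chunks that DataParallel's Scatter produces along dim 0.
batch = torch.arange(8).reshape(8, 1)
chunks = batch.chunk(3, dim=0)
sizes = [c.shape[0] for c in chunks]  # -> [3, 3, 2]
```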

Hi @mrshenli, thank you for your response. My case is actually different: I have a list of tensors and I want to chunk the list along its length, not chunk each tensor along its batch dimension. I solved it by implementing my own scatter method like this:

import math

import torch
from torch.nn.parallel._functions import Scatter


def scatter(inputs, target_gpus, dim=0):
    """Slices tensors into approximately equal chunks and
    distributes them across given GPUs. Duplicates
    references to objects that are not tensors.
    """
    def scatter_map(obj):
        if isinstance(obj, torch.Tensor):
            return Scatter.apply(target_gpus, None, dim, obj)
        if isinstance(obj, tuple) and len(obj) > 0:
            return list(zip(*map(scatter_map, obj)))
        if isinstance(obj, list) and len(obj) > 0:
            # The stock torch scatter puts the remaining samples on the
            # last GPU (e.g., batch=256, n_gpus=3 ==> chunks=[86, 86, 84])
            size = math.ceil(len(obj) / len(target_gpus))
            chunks = [obj[i * size:(i + 1) * size] for i in range(len(target_gpus) - 1)]
            # the last GPU gets whatever is left over
            chunks.append(obj[size * (len(target_gpus) - 1):])
            return chunks
        if isinstance(obj, dict) and len(obj) > 0:
            return list(map(type(obj), zip(*map(scatter_map, obj.items()))))
        return [obj for targets in target_gpus]

    # After scatter_map is called, a scatter_map cell will exist. This cell
    # has a reference to the actual function scatter_map, which has references
    # to a closure that has a reference to the scatter_map cell (because the
    # fn is recursive). To avoid this reference cycle, we set the function to
    # None, clearing the cell
    try:
        return scatter_map(inputs)
    finally:
        scatter_map = None
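The list-chunking arithmetic in the branch above can be checked with plain Python (chunk_sizes is a throwaway helper for illustration, not part of any API):

```python
import math

def chunk_sizes(n_items, n_gpus):
    # First n_gpus - 1 GPUs get ceil(n_items / n_gpus) items each;
    # the last GPU gets the (possibly smaller) remainder.
    size = math.ceil(n_items / n_gpus)
    sizes = [size] * (n_gpus - 1)
    sizes.append(n_items - size * (n_gpus - 1))
    return sizes

print(chunk_sizes(256, 3))  # -> [86, 86, 84]
```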