Arguments are located on different GPUs; couldn't solve it with register_buffer

Dear Pytorch team,

Greetings!

I am trying to run a model on multiple GPUs, but an "arguments are located on different GPUs" error occurs. I researched this problem on the PyTorch forum, but none of the solutions seem to work.

In particular, I have been trying to use register_buffer

self.register_buffer("positions", positions)

to solve the following error:

RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorMasked.cu:35

But the same error still occurs.

My torch version is 1.0.0, and I have referred to this post to try to solve it, but it didn’t work.

My code is below. It implements the positional embedding of the Transformer model. The error occurs at the very end of this snippet.

import math

import torch
import torch.nn as nn


class SinusoidalPositionalEmbedding(nn.Module):
    """This module produces sinusoidal positional embeddings of any length.
    Padding symbols are ignored, but it is necessary to specify whether padding
    is added on the left side (left_pad=True) or right side (left_pad=False).
    """

    def __init__(self, embedding_dim, padding_idx=0, left_pad=0, init_size=128):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.padding_idx = padding_idx
        self.left_pad = left_pad
        self.weights = SinusoidalPositionalEmbedding.get_embedding(
            init_size,
            embedding_dim,
            padding_idx,
        )
        # Here are the buffers
        self.register_buffer('_float_tensor', torch.FloatTensor(1))
        positions = None
        self.register_buffer("positions", positions)
        mask = None
        self.register_buffer("mask", mask)

    @staticmethod
    def get_embedding(num_embeddings, embedding_dim, padding_idx=None):
        """Build sinusoidal embeddings.
        This matches the implementation in tensor2tensor, but differs slightly
        from the description in Section 3.5 of "Attention Is All You Need".
        """
        half_dim = embedding_dim // 2
        emb = math.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, dtype=torch.float) * -emb)
        emb = torch.arange(num_embeddings, dtype=torch.float).unsqueeze(1) * emb.unsqueeze(0)
        emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1).view(num_embeddings, -1)
        if embedding_dim % 2 == 1:
            # zero pad
            emb = torch.cat([emb, torch.zeros(num_embeddings, 1)], dim=1)
        if padding_idx is not None:
            emb[padding_idx, :] = 0
        return emb

    def forward(self, input):
        """Input is expected to be of size [bsz x seqlen]."""
        bsz, seq_len = input.size()
        max_pos = self.padding_idx + 1 + seq_len
        if self.weights is None or max_pos > self.weights.size(0):
            self.weights = SinusoidalPositionalEmbedding.get_embedding(
                max_pos,
                self.embedding_dim,
                self.padding_idx,
            )
        self.weights = self.weights.type_as(self._float_tensor)
        max_pos = self.padding_idx + 1 + input.size(1)
        if not hasattr(make_positions, 'range_buf'):
            make_positions.range_buf = input.new()
        make_positions.range_buf = make_positions.range_buf.type_as(input)
        if make_positions.range_buf.numel() < max_pos:
            torch.arange(self.padding_idx + 1, max_pos, out=make_positions.range_buf)
        self.mask = input.ne(self.padding_idx)
        self.positions = make_positions.range_buf[:input.size(1)].expand_as(input)
        if self.left_pad:
            self.positions = self.positions - self.mask.size(1) + self.mask.long().sum(dim=1).unsqueeze(1)

        # !!! Here is where the error points to
        self.positions = input.clone().masked_scatter_(self.mask, self.positions[self.mask]).long()
        return self.weights.index_select(0, self.positions.view(-1)).view(bsz, seq_len, -1).detach()

The error I get:

  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/qianlim/dl_signal/transformer/modules/position_embedding.py", line 92, in forward
    self.positions = input.clone().masked_scatter_(self.mask, self.positions[self.mask]).long()
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorMasked.cu:35

Thank you so much in advance!

Could you print the device of self.positions, input, and self.mask before the line of code which throws the error?
Skimming through your code I cannot find any obvious mistake where you force a certain device.
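
Something like this right before the failing line in forward should do it (just a sketch):

# debug prints right before the masked_scatter_ call
print('position:', self.positions.device)
print('mask    :', self.mask.device)
print('input   :', input.device)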

Hi ptrblck,

Thank you so much for your reply!

position: cuda:0
mask    : cuda:0
input   : cuda:0

(The same pattern recurs multiple times before the error occurs.)

I printed the devices as you suggested. To my surprise, all of them are on cuda:0.

Do you only see cuda:0 as the device?
If you are using nn.DataParallel, you should also see the other GPUs.
How many GPUs are you using and how large is your batch size?

>>> import torch
>>> torch.cuda.device_count()
2

It turns out there are two CUDA devices. My batch size is 512.

Thanks for the information!
So you are seeing the cuda:0 output multiple times and never cuda:1?
Could you post some code snippets where you create your model and the training loop?

This might be a bit far-fetched, but is your GPU1 working without any issues in a single-GPU setup?
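
E.g. a quick sanity check on GPU1 alone could look like this (just a rough sketch):

import torch

device = torch.device('cuda:1')
x = torch.randn(1024, 1024, device=device)
y = torch.matmul(x, x)  # simple op just to exercise the device
print(y.device)  # should print cuda:1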

Hi ptrblck,

I will first check that GPU1 is working by setting the device to cuda:1, and if it works smoothly I will post the snippets.

Thank you so much for the quick reply again!

So I ran the test on a compute node with four devices [0, 1, 2, 3], of which devices 1 and 2 are accessible to me (according to nvidia-smi). When I run CUDA_VISIBLE_DEVICE=1 python3.6 train.py, the devices printed out for position, mask, and input are still

position: cuda:0
mask    : cuda:0
input   : cuda:0

Hi ptrblck,

For the code snippets I might need to double-check with my advisor, but I will reply to this post with the snippets once I am authorized.

Sure, if it’s not possible we might also debug it without the actual code.
I just wanted to make sure you are using something like this:

model = MyModel()
model = nn.DataParallel(model, device_ids=[0, 1])
model = model.to('cuda:0')  # assignment not necessarily needed here

for data, target in loader:
    data, target = data.to('cuda:0'), target.to('cuda:0')
    optimizer.zero_grad()
    output = model(data)
    ...

Also, no real data is needed if that would be an issue, as we can just use random inputs.
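
For your module a fake batch could be as simple as this (just a sketch, assuming the input is a [bsz x seqlen] tensor of integer token ids, with the shapes from your error message):

# random token ids of shape [bsz x seqlen], as expected by forward()
data = torch.randint(1, 1000, (512, 10), dtype=torch.long).to('cuda:0')
output = model(data)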

Hi ptrblck,

I have revised my code according to your example, but it didn’t seem to work. I will double-check that I did the .to(device) calls correctly, and I will reply once I am able to post my code.

Thank you!

position: cuda:0
mask    : cuda:0
input   : cuda:0
position: cuda:0
mask    : cuda:1
input   : cuda:1

It turns out that after I revised the code, positions and mask seem to be on different CUDA devices, even when I specify
CUDA_VISIBLE_DEVICE=1 python3.6 train.py before running.

Could you try to change this line

make_positions.range_buf = input.new()

to

make_positions.range_buf = input.new(input.size())

I’m not sure if an empty tensor might create this issue, but it might be worth a shot.

Hi ptrblck,

I tried what you suggested, and I get the following error:

File "/home/qianlim/dl_signal/transformer/modules/position_embedding.py", line 85, in forward
    self.positions = make_positions.range_buf[:input.size(1)].expand_as(input)
RuntimeError: The expanded size of the tensor (512) must match the existing size (10) at non-singleton dimension 0.  Target sizes: [512, 10].  Tensor sizes: [10, 10]

A temporary workaround might be to explicitly push positions to the current device via:

self.positions = self.positions.to(self.mask.device)
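
i.e. placed in forward right before the line that fails, roughly:

self.mask = input.ne(self.padding_idx)
self.positions = make_positions.range_buf[:input.size(1)].expand_as(input)
self.positions = self.positions.to(self.mask.device)  # workaround: force the same device
self.positions = input.clone().masked_scatter_(self.mask, self.positions[self.mask]).long()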

I can’t comment on the underlying cause here, but beware that the environment variable for selecting devices is CUDA_VISIBLE_DEVICES, not the singular form.


Hi ptrblck,

Thanks for the explicit solution. After I tried it, it still didn’t seem to work. I am now trying to wrap the whole SinusoidalPositionalEmbedding module in nn.DataParallel to see what happens.

Hi Pieter,

Do you mean I should specify the CUDA devices as CUDA_VISIBLE_DEVICES=0, 1 python3.6 train.py instead of setting them inside the Python files?

Yes, but without the space (in your example). If you set it inside the Python files, it won’t be picked up, unless you set it before running import torch, which is when CUDA is initialized.
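
If you really want to set it from Python, it has to happen before the import, roughly like this (sketch):

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'  # must be set before torch is imported

import torch
print(torch.cuda.device_count())  # now reflects only the visible devices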

Hi Pieter,

Sorry for late update!

Unfortunately, the last time I tried specifying the CUDA devices it didn’t work. My project has been postponed a bit, so I won’t be able to give prompt feedback, but I will post an update to the forum once I have any good news!

Thank you so much!