Using a CUDA tensor in a DataParallel model

There is a model wrapped in nn.DataParallel:

self.model = Bert(6, 12, 513, 384*4, 64, 64, 2, 384, self.base_task.max_vocab_indexes['input_ids'])
self.model = nn.DataParallel(self.model).cuda()

Inside the model there is a constant tensor named pos.

import torch
import torch.nn as nn

class Embedding(nn.Module):
    def __init__(self, maxlen, d_model, n_segments, vocab_size, device='cuda'):
        super(Embedding, self).__init__()
        self.device = device
        self.tok_embed = nn.Embedding(vocab_size, d_model)  # token embedding
        self.pos_embed = nn.Embedding(maxlen, d_model)  # position embedding
        self.seg_embed = nn.Embedding(n_segments, d_model)  # segment(token type) embedding
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, seg):
        seq_len = x.size(1)
        pos = torch.arange(seq_len, dtype=torch.long, device=self.device)
        pos = pos.unsqueeze(0).expand_as(x)  # (seq_len,) -> (batch_size, seq_len)
        embedding = self.tok_embed(x) + self.pos_embed(pos) + self.seg_embed(seg)
        return self.norm(embedding)

I used this code with just one GPU, and it works.
But this time I need to use more GPUs, so I have to change that tensor, which was manually placed with device=self.device, into something that is dynamically placed on each GPU by DataParallel.
… But this is hard for me, and the code below doesn't work. The tensor just stays on the CPU, even with .cuda().

How can I solve this issue? I have been struggling with it for almost the whole day…

class Embedding(nn.Module):
    def __init__(self, maxlen, d_model, n_segments, vocab_size, device='cuda'):
        super(Embedding, self).__init__()
        self.device = torch.device('cuda')
        self.tok_embed = nn.Embedding(vocab_size, d_model)  # token embedding
        self.pos_embed = nn.Embedding(maxlen, d_model).cpu()  # position embedding
        self.seg_embed = nn.Embedding(n_segments, d_model)  # segment(token type) embedding
        self.norm = nn.LayerNorm(d_model)
        self.pos = torch.arange(513, dtype=torch.long, requires_grad=False).unsqueeze(0)

    def forward(self, x, seg):
        # seq_len = x.size(1)
        print(f'pos device: {self.pos.device}') # printed by "cpu"
        pos = self.pos.expand_as(x)  # (seq_len,) -> (batch_size, seq_len)
        cuda_pos = self.pos_embed(pos).cuda()
        print(f'pos device: {cuda_pos.device}, x device : {x.device}, seg device: {seg.device}') # printed by "cpu" for pos
        embedding = self.tok_embed(x) + cuda_pos + self.seg_embed(seg)
        return self.norm(embedding)
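
If it helps, my current understanding (which may be wrong) is that a tensor stored as a plain attribute, like self.pos above, is neither a parameter nor a buffer, so model.cuda() and nn.DataParallel never move it for me. A quick toy check of that assumption (the Toy module and the names below are just for illustration):

import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)                         # parameter: moved by .cuda()
        self.plain_pos = torch.arange(4)                  # plain attribute: NOT moved
        self.register_buffer('buf_pos', torch.arange(4))  # buffer: moved by .cuda()

model = nn.DataParallel(Toy()).cuda()
print(model.module.fc.weight.device)  # cuda:0
print(model.module.plain_pos.device)  # cpu  <- stays behind
print(model.module.buf_pos.device)    # cuda:0

That would match the 'cpu' print I keep getting for self.pos.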

I also tried adding that tensor to the model's inputs. But in this case, when I use 2 GPUs, the result is only about half of one batch: a (64 x ...) tensor is returned, half of the full batch (128).

0/5120 [00:00<? ?it/s] Traceback (most recent call last):
  File "main.py", line 234, in <module>
    trainer.train()
  File "main.py", line 116, in train
    loss_lm = self.criterion(logits_lm.transpose(1, 2), batch.masked_tokens.transpose(0,1)) # for masked LM
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 932, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2317, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2113, in nll_loss
    .format(input.size(0), target.size(0)))
ValueError: Expected input batch_size (64) to match target batch_size (128).
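
I wonder if this half-batch result comes from the way DataParallel scatters its inputs: as far as I can tell, every positional input is chunked along dim 0 (the batch dimension), so a pos of shape (1, 513) yields only one chunk, and it seems only one replica ends up with a complete set of inputs. A rough CPU-only sketch of that chunking, with the shapes from my logs:

import torch

input_ids = torch.zeros(128, 513, dtype=torch.long)     # (batch, seq_len) after transpose
pos = torch.arange(513, dtype=torch.long).unsqueeze(0)  # the (1, 513) constant tensor

# DataParallel splits each input along dim 0, roughly like torch.chunk does:
print([t.shape for t in input_ids.chunk(2, dim=0)])  # two chunks of torch.Size([64, 513])
print([t.shape for t in pos.chunk(2, dim=0)])        # only ONE chunk: torch.Size([1, 513])

If that is what happens, one workaround might be to expand pos to the full batch size before passing it in, but building it inside forward seems cleaner.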

And below is the training loop.

        pos = torch.arange(513, dtype=torch.long, requires_grad=False).unsqueeze(0).to(device=torch.device('cuda'))
        for epoch in range(max_epoch):
            loss_sum, acc_sum, len_batch_sum = 0., 0., 0.
            ds_iter.init_epoch()
            tr_total = math.ceil(total_len / self.batch_size)
            tq_iter = tqdm(enumerate(ds_iter), total=tr_total, miniters=min_iters, unit_scale=self.batch_size,
                           bar_format='{n_fmt}/{total_fmt} [{elapsed}<{remaining} {rate_fmt}] {desc}')

            self.model.train()
            print('epoch starts')
            for i, batch in tq_iter:
                self.model.zero_grad()
                print('batch starts')
                device = torch.device('cuda')
                print(device)
                print(f'batch.input_ids device : {batch.input_ids.device}, batch.segment_ids : {batch.segment_ids.device}, batch.masekd_pos : {batch.masked_pos.device}')
                print(f'batch.input_ids shape : {batch.input_ids.shape}, batch.segment_ids : {batch.segment_ids.shape}, batch.masekd_pos : {batch.masked_pos.shape}, pos : {pos.shape}')
                logits_lm, logits_clsf = self.model(batch.input_ids.transpose(0,1).to(device=device), batch.segment_ids.transpose(0,1).to(device=device), batch.masked_pos.transpose(0,1).to(device=device), pos.to(device=device))
                print(f'logits_lm, logits_clsf shape : {logits_lm.shape}, {logits_clsf.shape}')

And these are the logs.

epoch starts
batch starts
cuda
batch.input_ids device : cuda:0, batch.segment_ids : cuda:0, batch.masekd_pos : cuda:0
batch.input_ids shape : torch.Size([513, 128]), batch.segment_ids : torch.Size([513, 128]), batch.masekd_pos : torch.Size([5, 128]), pos : torch.Size([1, 513])
pos device: cuda:0 # log from Embedding layer in the model
pos device: cuda:0, x device : cuda:0, seg device: cuda:0 # log from Embedding layer in the model
logits_lm, logits_clsf shape : torch.Size([64, 5, 6015]), torch.Size([64, 2])

logits_lm device: cuda:0, batch target device: cuda:0
0/5120 [00:00<? ?it/s] Traceback (most recent call last):
  File "main.py", line 236, in <module>
    trainer.train()
  File "main.py", line 118, in train
    loss_lm = self.criterion(logits_lm.transpose(1, 2), batch.masked_tokens.transpose(0,1)) # for masked LM
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 932, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2317, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2113, in nll_loss
    .format(input.size(0), target.size(0)))
ValueError: Expected input batch_size (64) to match target batch_size (128).

Hey @cybaj, the model needs to be moved to GPU before passing it to the DataParallel ctor. Have you tried changing the following code:

self.model = nn.DataParallel(self.model).cuda()

to

self.model = nn.DataParallel(self.model.to("cuda:0"))

BTW, here is the DataParallel tutorial: https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html

Thank you for the reply, @mrshenli! I already tried to("cuda:0"), and as far as I know, cuda() and to(device="cuda") are the same.
Below are the issues.

  1. In my model, I need to use a constant tensor, which I define in the model's forward, as you can see in the top post.
  2. So I pass torch.device('cuda') to the model and, inside the model, place that tensor on that device. More specifically:
device = torch.device('cuda')
...
self.model = Bert(6, 12, 513, 384*4, 64, 64, 2, 384,
                  self.base_task.max_vocab_indexes['input_ids'], device=device)
self.model = nn.DataParallel(self.model)
self.model = self.model.cuda()
  • I ran it with 2 GPUs, and they show up correctly:
    device_count = torch.cuda.device_count()
    print(f'gpu count: {torch.cuda.device_count()}')

gpu count: 2
  • the dataset iterator is created with device=torch.device('cuda')
  • the model is wrapped in DataParallel and moved with cuda()
  • the model returns logits_clsf and logits_lm
  • the loss is calculated
  • at the loss.backward() phase I think it gets stuck: the print after loss.backward() is never printed.
    Below are the logs and the code.
batch starts
logits_lm device: cuda:0, batch target device: cuda:0
loss_lm calculated
loss_clsf calculated
            print('epoch starts')
            for i, batch in tq_iter:
                self.model.zero_grad()
                print('batch starts')
                logits_lm, logits_clsf = self.model(batch.input_ids.transpose(0,1), batch.segment_ids.transpose(0,1), batch.masked_pos.transpose(0,1))
                print(f'logits_lm device: {logits_lm.device}, batch target device: {batch.masked_tokens.device}')
                loss_lm = self.criterion(logits_lm.transpose(1, 2), batch.masked_tokens.transpose(0,1)) # for masked LM
                print('loss_lm calculated')
                loss_lm = (loss_lm.float()).mean()
                loss_clsf = self.criterion(logits_clsf, batch.is_next) # for sentence classification
                print('loss_clsf calculated')
                loss = loss_lm + loss_clsf
                loss.backward()
                print('loss backwarded')

Without any error message or further logs, the 'loss backwarded' message never shows up. I think it's stuck.
I don't know what to do.
Everything I posted above in this thread is what I tried after this happened…

I assumed that, in the backward phase, 'replica 0' (dedicated to cuda:0) completed its backward fine and is waiting for 'replica 1' (dedicated to cuda:1) to finish its backward, but some error occurred in replica 1 and its backward failed, so replica 0 keeps waiting and the 'loss backwarded' log never shows up.

I assume this even though the embedding tensor in each replica's model looks like it is split and placed on the right device, like this:

class Embedding(nn.Module):
    def __init__ 
        ...
    def forward(self, x, seg):
        seq_len = x.size(1)
        pos = torch.arange(seq_len, dtype=torch.long, device=self.device)
        pos = pos.unsqueeze(0).expand_as(x)  # (seq_len,) -> (batch_size, seq_len)
        pos.requires_grad = False
        print(f'pos tensor device: {pos.device}, shape: {pos.shape}')
        embedding = self.tok_embed(x) + self.pos_embed(pos) + self.seg_embed(seg)
        return self.norm(embedding)


pos tensor device: cuda:1, shape: torch.Size([64, 513])
pos tensor device: cuda:0, shape: torch.Size([64, 513])

embedding tensor device: cuda:1, shape: torch.Size([64, 513, 384])
embedding tensor device: cuda:0, shape: torch.Size([64, 513, 384])

and the same goes for the other tensors created with cuda() in the model:

context tensor device: cuda:1, shape: torch.Size([64, 513, 768])
multihead output tensor device: cuda:1, shape: torch.Size([64, 513, 384])
context tensor device: cuda:1, shape: torch.Size([64, 513, 768])
multihead output tensor device: cuda:1, shape: torch.Size([64, 513, 384])
context tensor device: cuda:1, shape: torch.Size([64, 513, 768])
multihead output tensor device: cuda:1, shape: torch.Size([64, 513, 384])
context tensor device: cuda:0, shape: torch.Size([64, 513, 768])
multihead output tensor device: cuda:0, shape: torch.Size([64, 513, 384])
context tensor device: cuda:1, shape: torch.Size([64, 513, 768])
multihead output tensor device: cuda:1, shape: torch.Size([64, 513, 384])
context tensor device: cuda:0, shape: torch.Size([64, 513, 768])
multihead output tensor device: cuda:0, shape: torch.Size([64, 513, 384])
context tensor device: cuda:1, shape: torch.Size([64, 513, 768])
multihead output tensor device: cuda:1, shape: torch.Size([64, 513, 384])
context tensor device: cuda:0, shape: torch.Size([64, 513, 768])
multihead output tensor device: cuda:0, shape: torch.Size([64, 513, 384])
context tensor device: cuda:1, shape: torch.Size([64, 513, 768])
multihead output tensor device: cuda:1, shape: torch.Size([64, 513, 384])
context tensor device: cuda:0, shape: torch.Size([64, 513, 768])
multihead output tensor device: cuda:0, shape: torch.Size([64, 513, 384])
context tensor device: cuda:0, shape: torch.Size([64, 513, 768])
multihead output tensor device: cuda:0, shape: torch.Size([64, 513, 384])
context tensor device: cuda:0, shape: torch.Size([64, 513, 768])
multihead output tensor device: cuda:0, shape: torch.Size([64, 513, 384])

and the final logits are gathered correctly, too:

logits_lm shape : torch.Size([128, 5, 6015])

Some error could be happening in the backward phase, but I can't even imagine what it is…

Or maybe the loss sum loss = loss_lm + loss_clsf, or the loss_lm.float().mean() applied after the outputs are gathered, affects the backward phase somehow, but… I don't know what I should do.

                loss_lm = self.criterion(logits_lm.transpose(1, 2), batch.masked_tokens.transpose(0,1)) # for masked LM
                print('loss_lm calculated')
                print(f'loss_lm tensor device: {loss_lm.device}, shape: {loss_lm.shape}')
                loss_lm = (loss_lm.float()).mean()
                print(f'loss_lm tensor device: {loss_lm.device}, shape: {loss_lm.shape}')
                loss_clsf = self.criterion(logits_clsf, batch.is_next) # for sentence classification
                print('loss_clsf calculated')
                print(f'loss_clsf tensor device: {loss_clsf.device}, shape: {loss_clsf.shape}')
                loss = loss_lm + loss_clsf
                print(f'sumloss tensor device: {loss.device}, shape: {loss.shape}')
                loss.backward()
                self.optimizer.step()
                print('stepped')

Hey @cybaj, I am a little confused about the above code. IIUC, the self.device attribute on both replicas points to cuda:0, as replicate.py is not smart enough to change that for you. In that case, how did pos successfully get placed on different devices? It looks like expand_as would not automatically change the device either:

>>> import torch
>>> x = torch.arange(10, device="cuda:0")
>>> y = torch.ones(10, 10).to(1)
>>> z = x.expand_as(y)
>>> z.device
device(type='cuda', index=0)
>>> y.device
device(type='cuda', index=1)
>>> x.device
device(type='cuda', index=0)

If you would like to get the correct device, can you read it from x.device? (The input to forward should be scattered properly.)
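
For example, something along these lines (an untested sketch of your Embedding that derives the device from the scattered input instead of self.device):

class Embedding(nn.Module):
    def __init__(self, maxlen, d_model, n_segments, vocab_size):
        super(Embedding, self).__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)  # token embedding
        self.pos_embed = nn.Embedding(maxlen, d_model)      # position embedding
        self.seg_embed = nn.Embedding(n_segments, d_model)  # segment(token type) embedding
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, seg):
        seq_len = x.size(1)
        # x has already been scattered to this replica's device, so follow it
        pos = torch.arange(seq_len, dtype=torch.long, device=x.device)
        pos = pos.unsqueeze(0).expand_as(x)  # (seq_len,) -> (batch_size, seq_len)
        embedding = self.tok_embed(x) + self.pos_embed(pos) + self.seg_embed(seg)
        return self.norm(embedding)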

Thank you, @mrshenli. I changed all the self.device uses to something like x.device in forward.

… It looks like it works, but it still gets stuck after loss.backward().
All the losses are calculated, and the loss tensor is on cuda:0, which is the default output device.

loss_lm calculated
loss_lm : 72.27577209472656
loss_lm tensor device: cuda:0, shape: torch.Size([])
after mean loss_lm : 72.27577209472656
after mean loss_lm tensor device: cuda:0, shape: torch.Size([])
loss_clsf calculated
loss_clsf : 0.7298979759216309
loss_clsf tensor device: cuda:0, shape: torch.Size([])
sumloss tensor device: cuda:0, shape: torch.Size([])

Why does it get stuck when the cuda:0 loss tensor starts its backward?

Hey @cybaj, could you please share a self-contained min repro program? It will be hard to tell with just printed outputs.

It was totally my bad, thank you @mrshenli. From your advice I learned how to create and use the tensor in a module's forward under DataParallel.

os.environ['CUDA_LAUNCH_BLOCKING'] = '1' was the culprit…

For anyone who uses this CUDA option to check logs: because it stalls the DataParallel process, it is recommended to comment it out…
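
That is, something like this at the top of my script; the commented-out line below is the one that caused the hang:

import os

# CUDA_LAUNCH_BLOCKING=1 forces synchronous CUDA kernel launches. It is handy for
# debugging, but in my case it made the DataParallel backward pass hang, so keep it off:
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'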
