RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:0)

Greetings,

I am writing a diffusion model using the PyTorch Lightning framework and training on 2 GPUs, and I am running into the error below.

All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(

  | Name  | Type           | Params
-----------------------------------------
0 | model | DiffusionModel | 72.4 M
-----------------------------------------
68.1 M    Trainable params
4.2 M     Non-trainable params
72.4 M    Total params
289.470   Total estimated model params size (MB)
Sanity Checking: |                                                                                                                  | 0/? [00:00<?, ?it/s]/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
1
Traceback (most recent call last):
  File "/home/xz479/DiffuBot/train.py", line 148, in <module>
    train(args)
  File "/home/xz479/DiffuBot/train.py", line 134, in train
    trainer.fit(task, train_loader, val_loader)
  File "/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 545, in fit
    call._call_and_handle_interrupt(
  File "/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 581, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _run
    results = self._run_stage()
  File "/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1034, in _run_stage
    self._run_sanity_check()
  File "/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1063, in _run_sanity_check
    val_loop.run()
  File "/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/pytorch_lightning/loops/utilities.py", line 181, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 134, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 391, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_args)
  File "/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 402, in validation_step
    return self._forward_redirection(self.model, self.lightning_module, "validation_step", *args, **kwargs)
  File "/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 628, in __call__
    wrapper_output = wrapper_module(*args, **kwargs)
  File "/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 621, in wrapped_forward
    out = method(*_args, **_kwargs)
  File "/home/xz479/DiffuBot/train.py", line 91, in validation_step
    loss = self.model.loss_fn(batch['dst_matrix'])
  File "/home/xz479/DiffuBot/train.py", line 42, in loss_fn
    output, epsilon, alpha_bar = self.forward(x, idx=idx, get_target=True)
  File "/home/xz479/DiffuBot/train.py", line 54, in forward
    used_alpha_bars = self.alpha_bars[idx][:, None, None]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:0)
Sanity Checking DataLoader 0:   0%|                                                                                                 | 0/2 [00:00<?, ?it/s]0
0
0
torch.Size([])
Sanity Checking DataLoader 0:  50%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–Œ                                            | 1/2 [00:00<00:00, 51.59it/s]0
0
0
torch.Size([])
Sanity Checking DataLoader 0: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 2/2 [00:03<00:00,  0.55it/s]/home/xz479/miniconda3/envs/delibot/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:429: It is recommended to use `self.log('validation_epoch_average', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
[rank: 1] Child process with PID 1109548 terminated with code 1. Forcefully terminating all other processes to avoid zombies šŸ§Ÿ                           
Killed

My code is below:

import torch
import torch.nn as nn

class DiffusionModel(nn.Module):

    def __init__(self, device, beta_1, beta_T, T, decoder=None):
        '''
        The epsilon predictor of the diffusion process.

        beta_1  : beta_1 of the diffusion process
        beta_T  : beta_T of the diffusion process
        T       : number of diffusion steps
        decoder : optional decoder module (not used in the code shown)
        '''
        super().__init__()
        self.device = device
        # Noise schedule: cumulative product of (1 - beta_t), kept as a plain tensor attribute
        self.alpha_bars = torch.cumprod(1 - torch.linspace(start=beta_1, end=beta_T, steps=T), dim=0).to(device=self.device)
        self.backbone = Transformer_Denoiser(T, batch_first=True).to(device=self.device)

        self.to(device=self.device)

    def loss_fn(self, x, idx=None):
        '''
        Only called during the training phase.

        x   : real data if idx is None, otherwise perturbed data
        idx : if None (training phase), a random timestep index is sampled;
              otherwise (inference phase) it should be specified explicitly.
        '''
        output, epsilon, alpha_bar = self.forward(x, idx=idx, get_target=True)
        loss = (output - epsilon).square().mean()
        print(loss.shape)
        return loss

    def forward(self, x, idx=None, get_target=False):
        if idx is None:
            # Sample a random timestep index for the noise perturbation
            idx = torch.randint(0, len(self.alpha_bars), (x.size(0),)).to(self.device)
            print(idx.get_device())
            used_alpha_bars = self.alpha_bars[idx][:, None, None]
            print(self.alpha_bars.get_device())
            print(used_alpha_bars.get_device())
            epsilon = torch.randn_like(x).to(self.device)
            # TODO: lock noise with respect to turns
            x_tilde = torch.sqrt(used_alpha_bars) * x + torch.sqrt(1 - used_alpha_bars) * epsilon
        else:
            idx = torch.Tensor([idx for _ in range(x.size(0))]).long()
            x_tilde = x

        output = self.backbone(x_tilde, idx)

        return (output, epsilon, used_alpha_bars) if get_target else output

I am fairly sure all the tensors are on the same device: in the code I print the devices of the alpha_bars tensor and the index tensor, and as the output above shows, they are all on device 0.
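
For reference, the error itself is easy to reproduce in plain PyTorch once the index tensor and the indexed tensor end up on different GPUs. Here is a minimal standalone sketch (assuming two visible CUDA devices) that only illustrates the condition the message describes, not my actual setup:

import torch

# The schedule tensor lives on GPU 0, the device reported in the error message
alpha_bars = torch.rand(1000, device="cuda:0")

# The index tensor lives on GPU 1, e.g. the device of the second DDP rank
idx = torch.randint(0, 1000, (4,), device="cuda:1")

# Raises: RuntimeError: indices should be either on cpu or on the same device
# as the indexed tensor (cuda:0)
used_alpha_bars = alpha_bars[idx]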

I'd really appreciate it if someone could help.

Thank you very much.


I'm facing the same issue when using PyTorch Lightning 2.1.1 with the DDP strategy and the wandb logger. I could not find a workaround or a solution.
Please help, thanks.

Are you seeing these issues without PyTorch Lightning, in a pure PyTorch model?
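
One thing worth checking: nn.Module.to() only moves registered parameters and buffers, so a plain tensor attribute such as self.alpha_bars stays on whatever device it was created on even after the module is moved to another rank's GPU. Below is a minimal sketch of keeping the schedule as a registered buffer instead; this is an assumption about the cause, not a confirmed fix, and the class is a simplified, hypothetical variant of the posted DiffusionModel:

import torch
import torch.nn as nn

class DiffusionScheduleSketch(nn.Module):
    def __init__(self, beta_1, beta_T, T):
        super().__init__()
        betas = torch.linspace(start=beta_1, end=beta_T, steps=T)
        # A registered buffer is moved together with the module by .to(), Lightning, and DDP
        self.register_buffer("alpha_bars", torch.cumprod(1 - betas, dim=0))

    def forward(self, x, idx=None):
        if idx is None:
            # Create the timestep indices directly on the buffer's device,
            # instead of relying on a device stored at construction time
            idx = torch.randint(0, len(self.alpha_bars), (x.size(0),), device=self.alpha_bars.device)
        return self.alpha_bars[idx][:, None, None]

With a buffer there is no stored device attribute that can go stale: each DDP rank indexes its own copy of alpha_bars on its own GPU.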

I can't tell, since the only way I use distributed training is through PyTorch Lightning.
If I train on a single GPU, there are no problems.
But when I set devices=4, I receive the error:

"[rank: 2] Child process with PID 4049 terminated with code -6. Forcefully terminating all other processes to avoid zombies 🧟"

Thanks

I observed a similar issue with 4 GPUs and the DDP strategy in PyTorch Lightning: the run went for a couple of hours, completed 1 epoch, and crashed while completing the second epoch. I am not sure whether it has anything to do with logging; I use the TensorBoard logger and log at the end of each epoch.

[rank: 0] Metric val_loss improved. New best score: 3.062
[rank: 3] Metric val_loss improved. New best score: 3.065
[rank: 1] Metric val_loss improved. New best score: 3.070
[rank: 2] Metric val_loss improved. New best score: 3.053
Epoch 0, global step 4413: 'val_loss' reached 3.06154 (best 3.06154), saving model to '/local/home/user/sftws/mpg/output/tft_outlier_treated_no_data_sampling/model_checkpoints/epoch=0-step=4413.ckpt' as top 3
Epoch 0/19 ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā” 4413/4413 0:58:24 ā€¢ 0:00:00 1.31it/s v_num: 0.000 train_loss_step: 2.704 val_loss: 3.062 train_loss_epoch: 2.997
[rank: 1] Child process with PID 26604 terminated with code -9. Forcefully terminating all other processes to avoid zombies šŸ§Ÿ
Traceback (most recent call last):
  File "/home/user/lib/miniconda3/envs/py310/lib/python3.10/multiprocessing/resource_sharer.py", line 138, in _serve
    with self._listener.accept() as conn:
  File "/home/user/lib/miniconda3/envs/py310/lib/python3.10/multiprocessing/connection.py", line 466, in accept
    answer_challenge(c, self._authkey)
  File "/home/user/lib/miniconda3/envs/py310/lib/python3.10/multiprocessing/connection.py", line 752, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
  File "/home/user/lib/miniconda3/envs/py310/lib/python3.10/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/user/lib/miniconda3/envs/py310/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/user/lib/miniconda3/envs/py310/lib/python3.10/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

torch.__version__: '2.2.2+cu121'
pl.__version__: '2.2.2'
Python 3.10.14