I use a self-attention layer in my network, but unfortunately I get a CUDA out-of-memory error even though I reduced the batch_size to one.
So I tried to run the code on two GPUs, but then I got this error:
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)
I moved the model and the input image to cuda:0 so they would be on the same GPU, but every time I move one item to cuda:0 I find another item that needs to be on the same GPU too.
How can I resolve this problem, please? Any help is appreciated.
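To be concrete, here is a toy sketch of what I mean by moving things to the same device (the conv layer is just a stand-in, not my actual network):

import torch
import torch.nn as nn

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Toy stand-in for the real network; the point is only device placement.
model = nn.Conv2d(3, 8, kernel_size=3).to(device)

# The input must live on the same device as the weights, otherwise
# cudnn_convolution raises the device-mismatch RuntimeError above.
image = torch.rand(1, 3, 64, 64, device=device)

output = model(image)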
Could you paste a script? How are you running on "two GPUs"?
First I prepare the devices:
def _prepare_device(self, n_gpu_use):
    """
    Set up GPU device if available, move model onto the configured device.
    """
    n_gpu = torch.cuda.device_count()
    if n_gpu_use > 0 and n_gpu == 0:
        self._log.warning("Warning: There's no GPU available on this machine, "
                          "training will be performed on CPU.")
        n_gpu_use = 0
    if n_gpu_use > n_gpu:
        self._log.warning(
            "Warning: The number of GPUs configured to use is {}, "
            "but only {} are available.".format(n_gpu_use, n_gpu))
        n_gpu_use = n_gpu
    device = torch.device('cuda:0,1' if n_gpu_use > 0 else 'cpu')
    list_ids = list(range(n_gpu_use))
    return device, list_ids
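(As a side note, I'm not sure torch.device accepts a string like 'cuda:0,1'; as far as I know a device names a single GPU, so the usual pattern would be one primary device plus the id list for DataParallel, e.g.:

device = torch.device('cuda:0' if n_gpu_use > 0 else 'cpu')
list_ids = list(range(n_gpu_use)))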
Next, I initialize the model:
def _init_model(self, model):
    model = model.to(self.device)
    if self.cfg.pretrained_model:
        self._log.info("=> using pre-trained weights {}.".format(
            self.cfg.pretrained_model))
        epoch, weights = load_checkpoint(self.cfg.pretrained_model)
        model.load_state_dict(weights)
    else:
        self._log.info("=> Train from scratch.")
        model.init_weights()
    model = torch.nn.DataParallel(model, device_ids=self.device_ids)
    return model
Here device_ids = [0, 1] and self.device is cuda:0.
I would say you have some hard-coded device inside the model.
Basically, your input is on GPU 1 (this is done automatically by DataParallel).
If your layer is on GPU 0, it probably means you have some line with a fixed device.
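For example, something like this inside forward() breaks under DataParallel, because the model is replicated onto every GPU while the hard-coded tensor stays on cuda:0 (a toy sketch, not your actual code):

import torch
import torch.nn as nn

class Broken(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)

    def forward(self, x):
        # BAD: pinned to cuda:0; the replica running on cuda:1 will crash
        # with exactly the "device 1 does not equal 0" error.
        mask = torch.ones(1, 1, x.size(2), x.size(3), device='cuda:0')
        return self.conv(x) * mask

The fix is to derive the device from the input that DataParallel scattered to the replica:

    def forward(self, x):
        # OK: follow whatever device this replica's input lives on.
        mask = torch.ones(1, 1, x.size(2), x.size(3), device=x.device)
        return self.conv(x) * mask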
Could you try your script with a simple
model = torch.nn.Linear(1,2)
and then a silly input like
model(torch.rand(4,1))
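That is, as one self-contained snippet (my suggestion, just to isolate the model from the rest of the pipeline):

import torch

model = torch.nn.Linear(1, 2)
out = model(torch.rand(4, 1))  # plain CPU run, no DataParallel, no devices
print(out.shape)

The idea is to check whether the surrounding training code, rather than the model itself, is doing something device-specific.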
First, thanks for your reply. I used

model = torch.nn.Linear(1, 2)

instead of

model = torch.nn.DataParallel(model, device_ids=self.device_ids)

and got this error:
Traceback (most recent call last):
File "train.py", line 43, in <module>
basic_train.main(cfg, _log)
File "/home/ubuntu/Rokia/UnRigidFlow-master-2-smooth -attention/basic_train.py", line 49, in main
train_loader, valid_loader, model, loss, _log, cfg.save_root, cfg.train)
File "/home/ubuntu/Rokia/UnRigidFlow-master-2-smooth -attention/trainer/kitti_flow_trainer.py", line 13, in __init__
train_loader, valid_loader, model, loss_func, _log, save_root, config)
File "/home/ubuntu/Rokia/UnRigidFlow-master-2-smooth -attention/trainer/base_trainer.py", line 28, in __init__
self.optimizer = self._create_optimizer()
File "/home/ubuntu/Rokia/UnRigidFlow-master-2-smooth -attention/trainer/base_trainer.py", line 70, in _create_optimizer
{'params': bias_parameters(self.model.module),
File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 539, in __getattr__
type(self).__name__, name))
AttributeError: 'Linear' object has no attribute 'module'
Is this what you asked me to do?
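(Looking at the traceback, the AttributeError itself seems unrelated to devices: bias_parameters(self.model.module) assumes the model is wrapped in DataParallel, which exposes the real network under .module, while a bare Linear has no such attribute. A guard along these lines would handle both cases; unwrap is just a hypothetical helper name:

import torch.nn as nn

def unwrap(model):
    # DataParallel hides the real network under .module; plain modules don't.
    return model.module if isinstance(model, nn.DataParallel) else model

so the optimizer code could call bias_parameters(unwrap(self.model)) either way.)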