I use a self-attention layer in my network, but unfortunately I get a CUDA out-of-memory error even though I reduced the batch_size to one.
So I tried to run the code on two GPUs, but then I got this error:
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)
I moved the model and the input image to cuda:0 so they would be on the same GPU, but every time I move one item to cuda:0 I find another item that needs to be on the same GPU too.
How can I resolve this problem, please? Any help is appreciated.
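To be concrete, here is a toy sketch of what I mean by moving things to the same device (the conv layer is just a stand-in, not my actual network):

import torch
import torch.nn as nn

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Toy stand-in for the real network; the point is only device placement.
model = nn.Conv2d(3, 8, kernel_size=3).to(device)

# The input must live on the same device as the weights, otherwise
# cudnn_convolution raises the device-mismatch RuntimeError above.
image = torch.rand(1, 3, 64, 64, device=device)

output = model(image)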
Could you paste a script? How are you running on "two GPUs"?
First I prepare the devices:
def _prepare_device(self, n_gpu_use):
    """
    Set up GPU device if available, move model onto the configured device.
    """
    n_gpu = torch.cuda.device_count()
    if n_gpu_use > 0 and n_gpu == 0:
        self._log.warning("Warning: There's no GPU available on this machine, "
                          "training will be performed on CPU.")
        n_gpu_use = 0
    if n_gpu_use > n_gpu:
        self._log.warning(
            "Warning: The number of GPUs configured to use is {}, "
            "but only {} are available.".format(n_gpu_use, n_gpu))
        n_gpu_use = n_gpu
    device = torch.device('cuda:0,1' if n_gpu_use > 0 else 'cpu')
    list_ids = list(range(n_gpu_use))
    return device, list_ids
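(As a side note, I'm not sure torch.device accepts a string like 'cuda:0,1'; as far as I know a device names a single GPU, so the usual pattern would be one primary device plus the id list for DataParallel, e.g.:

device = torch.device('cuda:0' if n_gpu_use > 0 else 'cpu')
list_ids = list(range(n_gpu_use)))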
Next, I initialize the model:
def _init_model(self, model):
    model = model.to(self.device)
    if self.cfg.pretrained_model:
        self._log.info("=> using pre-trained weights {}.".format(
            self.cfg.pretrained_model))
        epoch, weights = load_checkpoint(self.cfg.pretrained_model)
        model.load_state_dict(weights)
    else:
        self._log.info("=> Train from scratch.")
        model.init_weights()
    model = torch.nn.DataParallel(model, device_ids=self.device_ids)
    return model
Here device_ids = [0, 1] and self.device is cuda:0.
I would say you have some hard-coded device inside the model.
Basically, your input is on GPU 1 (this is done automatically by DataParallel).
If your layer is on GPU 0, it probably means you have some line with a fixed device.
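For example, something like this inside forward() breaks under DataParallel, because the model is replicated onto every GPU while the hard-coded tensor stays on cuda:0 (a toy sketch, not your actual code):

import torch
import torch.nn as nn

class Broken(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)

    def forward(self, x):
        # BAD: pinned to cuda:0; the replica running on cuda:1 will crash
        # with exactly the "device 1 does not equal 0" error.
        mask = torch.ones(1, 1, x.size(2), x.size(3), device='cuda:0')
        return self.conv(x) * mask

The fix is to derive the device from the input that DataParallel scattered to the replica:

    def forward(self, x):
        # OK: follow whatever device this replica's input lives on.
        mask = torch.ones(1, 1, x.size(2), x.size(3), device=x.device)
        return self.conv(x) * mask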
Could you try your script with a simple
model = torch.nn.Linear(1,2)
and then a silly input like
model(torch.rand(4,1))
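That is, as one self-contained snippet (my suggestion, just to isolate the model from the rest of the pipeline):

import torch

model = torch.nn.Linear(1, 2)
out = model(torch.rand(4, 1))  # plain CPU run, no DataParallel, no devices
print(out.shape)

The idea is to check whether the surrounding training code, rather than the model itself, is doing something device-specific.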
First, thanks for your reply. I used

model = torch.nn.Linear(1, 2)

instead of

model = torch.nn.DataParallel(model, device_ids=self.device_ids)

and got this error:
Traceback (most recent call last):
File "train.py", line 43, in <module>
basic_train.main(cfg, _log)
File "/home/ubuntu/Rokia/UnRigidFlow-master-2-smooth -attention/basic_train.py", line 49, in main
train_loader, valid_loader, model, loss, _log, cfg.save_root, cfg.train)
File "/home/ubuntu/Rokia/UnRigidFlow-master-2-smooth -attention/trainer/kitti_flow_trainer.py", line 13, in __init__
train_loader, valid_loader, model, loss_func, _log, save_root, config)
File "/home/ubuntu/Rokia/UnRigidFlow-master-2-smooth -attention/trainer/base_trainer.py", line 28, in __init__
self.optimizer = self._create_optimizer()
File "/home/ubuntu/Rokia/UnRigidFlow-master-2-smooth -attention/trainer/base_trainer.py", line 70, in _create_optimizer
{'params': bias_parameters(self.model.module),
File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 539, in __getattr__
type(self).__name__, name))
AttributeError: 'Linear' object has no attribute 'module'
Is this what you asked me to do?
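(Looking at the traceback, the AttributeError itself seems unrelated to devices: bias_parameters(self.model.module) assumes the model is wrapped in DataParallel, which exposes the real network under .module, while a bare Linear has no such attribute. A guard along these lines would handle both cases; unwrap is just a hypothetical helper name:

import torch.nn as nn

def unwrap(model):
    # DataParallel hides the real network under .module; plain modules don't.
    return model.module if isinstance(model, nn.DataParallel) else model

so the optimizer code could call bias_parameters(unwrap(self.model)) either way.)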