I am trying to build a translation model with a Transformer in PyTorch. Since I have two GPUs (2 x RTX 2080 Ti) available, I want to train the model on both of them. They are visible as devices 0 and 1. My approach to multi-GPU training is to wrap the model object in nn.DataParallel.
The encoder, decoder, and model objects are declared as follows:
    enc = Encoder(INPUT_DIM, HIDDEN_DIM, ENC_LAYERS, ENC_HEADS, ENC_PF_DIM, ENC_DROPOUT, device)
    dec = Decoder(OUTPUT_DIM, HIDDEN_DIM, DEC_LAYERS, DEC_HEADS, DEC_PF_DIM, DEC_DROPOUT, device)
    model = nn.DataParallel(Transformer(enc, dec, SRC_PAD_IDX, TRG_PAD_IDX, device).to(device))
Since I have used this approach with several other PyTorch models before, I expected it to work without any problems. However, when I actually run training this way, I get the following error:
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    Cell In , line 87
         84 for epoch in range(N_EPOCHS):
         85     start_time = time.time()  # record start time
    ---> 87     train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
         88     valid_loss = evaluate(model, validation_iterator, criterion)
         90     end_time = time.time()  # record end time

    Cell In , line 15, in train(model, iterator, optimizer, criterion, clip)
         11 optimizer.zero_grad()
         13 # exclude the last index (<eos>) of the output words
         14 # so that the input starts from <sos>
    ---> 15 output, _ = model(src, trg[:,:-1])
         17 # output: [batch size, trg_len - 1, output_dim]
         18 # trg: [batch size, trg_len]
         20 output_dim = output.shape[-1]

    File ~/anaconda3/envs/jki_pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
       1126 # If we don't have any hooks, we want to skip the rest of the logic in
       1127 # this function, and just call forward.
       1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1129         or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1130     return forward_call(*input, **kwargs)
       1131 # Do not call functions when jit is used
    ...
        return forward_call(*input, **kwargs)
      File "/tmp/ipykernel_203464/284771533.py", line 31, in forward
        src = self.dropout((self.tok_embedding(src) * self.scale) + self.pos_embedding(pos))
    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
As the traceback shows, the error is raised at line 15 of the train() function, in the call to model(...). The train function is defined as:
    import torch
    from tqdm import tqdm

    def train(model, iterator, optimizer, criterion, clip):
        model.train()  # training mode
        epoch_loss = 0

        for i, batch in enumerate(tqdm(iterator, ascii=True, ncols=100)):
            src = batch.src
            trg = batch.trg

            optimizer.zero_grad()

            output, _ = model(src, trg[:,:-1])

            output_dim = output.shape[-1]
            output = output.contiguous().view(-1, output_dim)
            trg = trg[:,1:].contiguous().view(-1)

            loss = criterion(output, trg)
            loss.backward()

            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()

            epoch_loss += loss.item()

        return epoch_loss / len(iterator)
To fix this, I tried wrapping the encoder and decoder individually in nn.DataParallel, but the same error is raised. I also tried wrapping the model inside the train() function, without success.
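One possible cause, sketched under assumptions: the failing line in the traceback combines `self.tok_embedding(src)`, `self.scale`, and `self.pos_embedding(pos)`. The Encoder source is not shown in this post, so the following is only a guess that `self.scale` or `pos` is a tensor tied to the `device` argument passed to the constructor. nn.DataParallel copies parameters and registered buffers to each replica, but a tensor stored as a plain attribute is not copied, and a tensor explicitly placed on a fixed device such as cuda:0 stays there. The class name `EmbeddingFrontEnd` and its arguments are hypothetical stand-ins, not the original code.

```python
import math

import torch
import torch.nn as nn


class EmbeddingFrontEnd(nn.Module):
    """Hypothetical stand-in for the embedding step of the Encoder."""

    def __init__(self, input_dim, hidden_dim, max_length=100, dropout=0.1):
        super().__init__()
        self.tok_embedding = nn.Embedding(input_dim, hidden_dim)
        self.pos_embedding = nn.Embedding(max_length, hidden_dim)
        self.dropout = nn.Dropout(dropout)
        # A plain Python float needs no device placement at all.
        self.scale = math.sqrt(hidden_dim)

    def forward(self, src):
        # src: [batch_size, src_len]
        src_len = src.shape[1]
        # Build the position indices on the device of the incoming batch
        # instead of a device remembered from __init__.
        pos = torch.arange(src_len, device=src.device).unsqueeze(0)
        # pos_embedding(pos) is [1, src_len, hidden_dim] and broadcasts
        # over the batch dimension.
        return self.dropout(self.tok_embedding(src) * self.scale + self.pos_embedding(pos))


front_end = EmbeddingFrontEnd(input_dim=50, hidden_dim=16).eval()
batch = torch.randint(0, 50, (4, 7))
embedded = front_end(batch)
```

With this pattern no `device` argument is needed inside the module; every tensor created in forward() follows whichever GPU the current replica received its input on.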
I also traced the variables to see which CUDA device each tensor is assigned to, but the output only shows cuda, not cuda:0 or cuda:1. I have tried many different things and still cannot find the cause of this error. If anyone knows, please help me.
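A small diagnostic along these lines may help narrow it down. It is only a sketch: `find_unregistered_tensors` and the `Toy` module are hypothetical names introduced here for illustration. The helper lists tensors stored as plain attributes on any submodule, that is, tensors that are neither parameters nor registered buffers. Such tensors are not moved by `.to()` and are not copied to the other GPU by nn.DataParallel.

```python
import torch
import torch.nn as nn


def find_unregistered_tensors(module):
    """Return (attribute name, device) for plain tensor attributes.

    Parameters and buffers live in separate internal dicts, so any
    torch.Tensor found directly in a module's __dict__ is unregistered.
    """
    found = []
    for mod_name, sub in module.named_modules():
        for attr, value in vars(sub).items():
            if isinstance(value, torch.Tensor):
                full_name = f"{mod_name}.{attr}" if mod_name else attr
                found.append((full_name, str(value.device)))
    return found


class Toy(nn.Module):
    """Hypothetical module with one parameter layer, one buffer, one plain tensor."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 2)
        self.register_buffer("offset", torch.zeros(1))  # follows the module
        self.scale = torch.ones(1)                      # plain attribute


toy = Toy()
unregistered = find_unregistered_tensors(toy)
print(unregistered)
```

Running the helper on the wrapped model's underlying module (`model.module` for an nn.DataParallel object) would show whether attributes such as `scale` fall into this category.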
Please let me know if there is any additional code I should show.