Hi everyone,
I want to load a previously saved model and continue training. Below is the code I use to save the model. The network is trained on a GPU (and works well there).
# Save model
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, f'{chkp_Path}/epoch{epoch}_{datetime.now().strftime("%Y%m%d-%H%M%S")}.pt')
During the loading phase, I load the checkpoint on the CPU (to avoid out-of-memory issues) and then move the model back to the GPU:
chkPath = 'Path_to_chk_file.pt'
checkpoint = torch.load(chkPath, map_location='cpu')
start_epoch = checkpoint['epoch']
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
losslogger = checkpoint['loss']
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device) # the network should be on GPU, next(model.parameters()).is_cuda==False
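As an aside, here is a variation on the loading code above that may be worth comparing against. As far as I know, `optimizer.load_state_dict` casts the restored state to whatever device each parameter is on at the time of the call, so moving the model first changes where the optimizer state ends up. This is only a sketch: `restore_checkpoint` is a hypothetical helper name, and it assumes the checkpoint keys used in the saving code.

```python
def restore_checkpoint(model, optimizer, checkpoint, device):
    """Hypothetical helper: restore weights, move the model, then restore the optimizer."""
    # Restore the weights while everything is still on the CPU.
    model.load_state_dict(checkpoint['model_state_dict'])
    # nn.Module.to moves the parameters in place, so the optimizer
    # keeps pointing at the same Parameter objects afterwards.
    model.to(device)
    # Restore the optimizer state only after the parameters are on `device`.
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    return checkpoint['epoch'], checkpoint['loss']
```

It would be called as `start_epoch, losslogger = restore_checkpoint(model, optimizer, checkpoint, device)` after `torch.load(chkPath, map_location='cpu')`.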
Then, when I start the training phase, I get the error below. Is there any way to find out where the CPU tensors come from? BTW, the data from the data loader is on the GPU during training.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
The error is raised at this line, after the loss calculation:
opt.step()
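One way to see which restored tensors are still on the CPU might be to walk the optimizer's state dict and print each tensor's device. The sketch below is only a diagnostic idea, not a confirmed cause. `find_tensor_devices` is a hypothetical helper name, and it simply treats anything with a `.device` attribute as a tensor.

```python
def find_tensor_devices(obj, prefix=""):
    """Recursively yield (path, device) for every tensor-like object in a nested structure."""
    if hasattr(obj, "device"):
        # Tensor-like leaf: report where it lives.
        yield prefix, str(obj.device)
    elif isinstance(obj, dict):
        for key, value in obj.items():
            yield from find_tensor_devices(value, f"{prefix}[{key!r}]")
    elif isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):
            yield from find_tensor_devices(value, f"{prefix}[{index}]")
    # Plain numbers, strings, and None are skipped silently.
```

Running `for path, dev in find_tensor_devices(optimizer.state_dict()): print(path, dev)` just before the failing `step()` call should list every tensor held in the optimizer state. Entries reported as `cpu` while the model parameters are on `cuda:0` would point at the optimizer state as the source of the mismatch.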