Hi!
I’ve designed a network for a regression task using an LSTM. However, my loss becomes NaN when I try to train it and I don’t understand why.
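The model definition isn’t included here; as a rough sketch (the sizes below are placeholders, not my actual hyperparameters), it is an LSTM followed by a linear head that predicts one value per timestep:

    import torch
    from torch import nn

    class LSTMRegressor(nn.Module):
        # Roughly: an LSTM over the padded sequences, followed by a linear
        # head mapping each timestep's hidden state to a single value.
        def __init__(self, input_size=8, hidden_size=64, num_layers=1):
            super().__init__()
            self.lstm = nn.LSTM(input_size, hidden_size,
                                num_layers=num_layers, batch_first=True)
            self.head = nn.Linear(hidden_size, 1)

        def forward(self, x):
            out, _ = self.lstm(x)               # (batch, seq_len, hidden_size)
            return self.head(out).squeeze(-1)   # (batch, seq_len)

    lstm_model = LSTMRegressor()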
This is my training loop:
def training_loop(dataloader, model, loss_fn, report_loss_fn, optimizer, batch_size, epoch, report_freq=10, writer=writer):
    size = len(dataloader.dataset)
    model.train()
    for batch_n, (X, y, x_mask, y_mask) in enumerate(dataloader):
        pred = model(X)
        # keep only the non-padded predictions and targets
        valid_pred = torch.masked_select(pred, y_mask)
        valid_targets = torch.masked_select(y, y_mask)
        # calculate optimisation loss
        loss = loss_fn(valid_pred, valid_targets)
        loss.backward()
        # adjust params
        optimizer.step()
        optimizer.zero_grad()
        # get loss for each pass
        running_loss = loss.item()
        if batch_n % report_freq == 0:
            current_values = batch_n * batch_size + len(X)
            writer.add_scalar('Train optim loss', running_loss, epoch + 1)
            print(f"Train loss: {running_loss:>7f} [{current_values:>5d}/{size:>5d}]")
The print statements output the following; everything after the second report is nan:
Train loss: 1.087582 [ 4/ 1634]
Train loss: 0.840909 [ 104/ 1634]
Train loss: nan [ 204/ 1634]
Train loss: nan [ 304/ 1634]
Train loss: nan [ 404/ 1634]
I’m using this optimizer:
optim = torch.optim.Adam(params=lstm_model.parameters(recurse=True), weight_decay=1e-05, lr=0.0001)
I’ve tried changing the learning rate; when I increased it to 0.5, the first 3-4 reported losses were not nan, so I think it could be something related to that.
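In case it helps, this is a minimal sketch of the checks I could add inside the loop to narrow down where the nan first appears (anomaly detection, nan checks on the batch and the predictions, and gradient clipping; max_norm=1.0 is just a placeholder value):

    import torch

    # report the backward op that first produces nan/inf
    torch.autograd.set_detect_anomaly(True)

    for batch_n, (X, y, x_mask, y_mask) in enumerate(dataloader):
        # check the batch before it reaches the model
        if torch.isnan(X).any():
            print(f"nan in inputs at batch {batch_n}")
            break

        pred = model(X)
        if torch.isnan(pred).any():
            print(f"nan in predictions at batch {batch_n}")
            break

        loss = loss_fn(torch.masked_select(pred, y_mask),
                       torch.masked_select(y, y_mask))
        loss.backward()

        # clip exploding gradients before the update
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()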