Hi,
I’ve installed PyTorch into a mamba environment using the official install instructions.
I have checked my prediction and loss tensors for infinities/NaNs, and the forward loss is a real number. I’ve disabled CUDA for now. I still get a variety of memory errors during the backward pass (`single_loss.backward()`). This is the training loop:
```python
for i in range(epochs):
    epoch_loss = 0
    for inputs, groundTruth in dataloader:
        optimizer.zero_grad()
        y_pred = model(inputs)
        testForBadValues(y_pred, "Training forward pass outputs")
        single_loss = loss_function(y_pred, groundTruth)
        # The backward pass does not handle NaNs properly, it seems.
        testForBadValues(single_loss, "Training loss")
        single_loss.backward()
        # .item() keeps only the scalar value, so the autograd graph
        # from this step is not held alive by the running total.
        epoch_loss += single_loss.item()
        optimizer.step()
    train_losses.append(epoch_loss)

    model.eval()
    val_loss = 0
    with torch.no_grad():  # no graph needed for validation
        for inputs, groundTruth in val_dataloader:
            y_pred = model(inputs)
            single_loss = loss_function(y_pred, groundTruth)
            val_loss += single_loss.item()
    val_losses.append(val_loss)
    model.train()

    # Save model if validation loss has decreased
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'best_model.pth')
```
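(For context, `testForBadValues` is just a small finiteness check; the exact implementation isn’t important, but it looks roughly like this:)

```python
import torch

def testForBadValues(tensor, label):
    # Complain loudly if the tensor contains any NaNs or infinities.
    if not torch.isfinite(tensor).all():
        raise ValueError(f"{label}: tensor contains NaN/Inf values")
```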
Before I paste a lot of code, is there anything I can do to debug this? I see mentions of similar problems a number of times via Google, but each seems to be a different corner case, and it’s especially suspicious given that I’ve hidden the GPU through `CUDA_VISIBLE_DEVICES`.
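(Roughly like this; the exact mechanism may differ, but the point is that the variable is set before torch initializes CUDA:)

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # hide all GPUs from this process

import torch
print(torch.cuda.is_available())  # should be False, i.e. everything runs on CPU
```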
I’ve mostly been using Keras (I’m modifying someone else’s code), so I’m not massively familiar with Torch. It seems the official installation does not include debugging symbols, so I can’t inspect my backtrace in much detail, and I’d rather not spend hours building from source if I can avoid it. Is there any other way to get more information about the cause of the crash? The exit code is always 245.
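Would something like the standard-library `faulthandler` module or autograd’s anomaly detection give me more to go on? For example (untested sketch, just from reading the docs):

```python
import faulthandler
import torch

faulthandler.enable()  # dump a Python traceback if the interpreter dies on a fatal signal

# Run every backward pass with extra checks; autograd then reports
# which forward operation produced a NaN in its gradient.
torch.autograd.set_detect_anomaly(True)
```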
Thanks