Hi,
I’ve installed PyTorch into a mamba environment using the official install instructions.
I have checked my prediction and loss tensors for infinities/NaNs, and the forward loss is a real number. I’ve disabled CUDA for now. I still get a variety of memory errors during the backward pass (`single_loss.backward()`). This is the training loop:
```python
for i in range(epochs):
    epoch_loss = 0
    for inputs, groundTruth in dataloader:
        optimizer.zero_grad()
        y_pred = model(inputs)
        testForBadValues(y_pred, "Training forward pass outputs")
        single_loss = loss_function(y_pred, groundTruth)
        # The backward pass does not handle NaNs properly, it seems.
        testForBadValues(single_loss, "Training loss")
        single_loss.backward()
        # .item() keeps only the scalar value, so the autograd graph
        # from this step is not held alive by the running total.
        epoch_loss += single_loss.item()
        optimizer.step()
    train_losses.append(epoch_loss)

    model.eval()
    val_loss = 0
    with torch.no_grad():  # no graph needed for validation
        for inputs, groundTruth in val_dataloader:
            y_pred = model(inputs)
            single_loss = loss_function(y_pred, groundTruth)
            val_loss += single_loss.item()
    val_losses.append(val_loss)
    model.train()

    # Save model if validation loss has decreased
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'best_model.pth')
```
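(For context, `testForBadValues` is just a small finiteness check; the exact implementation isn’t important, but it looks roughly like this:)

```python
import torch

def testForBadValues(tensor, label):
    # Complain loudly if the tensor contains any NaNs or infinities.
    if not torch.isfinite(tensor).all():
        raise ValueError(f"{label}: tensor contains NaN/Inf values")
```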
Before I paste a lot of code, is there anything I can do to debug this? I see mentions of similar problems a number of times via Google, but each seems to be a different corner case, and it’s especially suspicious given that I’ve hidden the GPU through `CUDA_VISIBLE_DEVICES`.
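(Roughly like this; the exact mechanism may differ, but the point is that the variable is set before torch initializes CUDA:)

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # hide all GPUs from this process

import torch
print(torch.cuda.is_available())  # should be False, i.e. everything runs on CPU
```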
I’ve mostly been using Keras (I’m modifying someone else’s code), so I’m not massively familiar with Torch. It seems the official installation does not include debugging symbols, so I can’t inspect my backtrace in much detail, and I’d rather not spend hours building from source if I can avoid it. Is there any other way to get more information about the cause of the crash? The exit code is always 245.
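Would something like the standard-library `faulthandler` module or autograd’s anomaly detection give me more to go on? For example (untested sketch, just from reading the docs):

```python
import faulthandler
import torch

faulthandler.enable()  # dump a Python traceback if the interpreter dies on a fatal signal

# Run every backward pass with extra checks; autograd then reports
# which forward operation produced a NaN in its gradient.
torch.autograd.set_detect_anomaly(True)
```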
Thanks