CNN ASR getting nan and core dump after epoch 1 with custom dataset

user12233 · June 17, 2022, 4:48pm

I’m trying to run the code from here on a custom dataset. The code runs fine with the LibriSpeech dataset, but with my own dataset I get the following:

Train Epoch: 1 [0/2875 (0%)] Loss: 10.740855

The next loss value is then NaN.

Any help is appreciated!

I added gradient clipping here:

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 5)
optimizer.step()

Once I do that I get the following:

Train Epoch: 1 [0/2875 (0%)] Loss: 10.740855
Segmentation fault (core dumped)

My dataset has clips from 3 to 14 seconds.

Henry_Chibueze · June 17, 2022, 5:47pm

What loss function are you using? Can you also post the code block showing how you pass arguments to it?

user12233 · June 20, 2022, 7:17pm

Thanks for the reply! I’m using the CTC loss function:

criterion = nn.CTCLoss(blank=0).to(device)

Below is the training function, which shows the arguments I pass to the loss function:

def train(model, device, train_loader, criterion, optimizer, scheduler, epoch, iter_meter, experiment):
    model.train()
    data_len = len(train_loader.dataset)
    with experiment.train():
        for batch_idx, _data in enumerate(train_loader):
            spectrograms, labels, input_lengths, label_lengths = _data
            spectrograms, labels = spectrograms.to(device), labels.to(device)
            optimizer.zero_grad()
            output = model(spectrograms)  # (batch, time, n_class)
            output = F.log_softmax(output, dim=2)
            output = output.transpose(0, 1) # (time, batch, n_class)
            loss = criterion(output, labels, input_lengths, label_lengths)
            loss.backward()

            torch.nn.utils.clip_grad_norm_(model.parameters(), 5)

            experiment.log_metric('loss', loss.item(), step=iter_meter.get())
            experiment.log_metric('learning_rate', scheduler.get_last_lr()[0], step=iter_meter.get())

            optimizer.step()
            scheduler.step()
            iter_meter.step()
            if batch_idx % 100 == 0 or batch_idx == len(train_loader) - 1:
                print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                    epoch, batch_idx * len(spectrograms), data_len,
                    100. * batch_idx / len(train_loader), loss.item()))
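
One way to narrow this down is to validate each batch before it reaches the loss. nn.CTCLoss returns an infinite loss when an input sequence is too short to be aligned with its target, and label indices outside the model's class range can lead to invalid memory access, which may explain both the NaN and the segmentation fault. The following is a minimal sketch, not code from the tutorial: `find_bad_ctc_samples` and its parameters are hypothetical names, and it works on plain Python lists, so tensors would need `.tolist()` first. It assumes `input_lengths` already reflect any downsampling done by the model.

```python
def find_bad_ctc_samples(labels, input_lengths, label_lengths, n_class, blank=0):
    """Return (sample_index, reason) pairs for samples CTC loss cannot handle.

    labels: one sequence of class indices per sample (padding allowed at the end).
    input_lengths, label_lengths: one integer per sample.
    n_class: size of the model's output dimension (hypothetical parameter).
    """
    problems = []
    for i, (seq, t_in, t_lab) in enumerate(zip(labels, input_lengths, label_lengths)):
        target = list(seq)[:t_lab]  # drop padding
        if len(target) < t_lab:
            problems.append((i, "label_length exceeds number of labels"))
            continue
        if any(c < 0 or c >= n_class for c in target):
            problems.append((i, "label index out of range"))
        elif blank in target:
            problems.append((i, "blank index used as a label"))
        # CTC needs one frame per label, plus one blank frame between
        # every pair of identical adjacent labels.
        repeats = sum(1 for a, b in zip(target, target[1:]) if a == b)
        if t_in < t_lab + repeats:
            problems.append((i, "input too short for target"))
    return problems


# Example with made-up values: sample 1 has a repeated label and too few
# frames, sample 2 contains an index equal to n_class.
example = find_bad_ctc_samples(
    labels=[[1, 2, 3], [1, 1, 0], [5, 2, 0]],
    input_lengths=[10, 2, 10],
    label_lengths=[3, 2, 2],
    n_class=5,
)
print(example)
```

If the check flags samples, the usual options are to filter out those clips, fix the text-to-index mapping so every character in the custom transcripts has a valid index, or construct the loss with `zero_infinity=True` so that infinite losses are zeroed instead of propagating.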