Model returns NaN after first iteration

Hello everyone

I’m testing how suitable the models made available by torchvision are for, among other things, analyzing both images and audio. (Regarding the audio, I first extract MFCC features from the audio clip and turn those MFCC features into an image, as I’ve seen other people do and it apparently is somewhat common practice.)

However, after the first iteration, some of the model’s weights turn into NaN and it subsequently returns NaN as predictions. (I checked the weights using this code:

for a in [6, 8]:
    print(a, model_a.features[a].weight) # inspect the weights of two conv layers for NaNs

)
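
For reference, a more general version of that check over every parameter (just a sketch, assuming model_a is the full model) would be:

for name, param in model_a.named_parameters():
    if torch.isnan(param).any():
        print(f"NaN found in {name}")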

I’m currently using a pretrained AlexNet (although I eventually want to try this pipeline with other models such as VGG, GoogLeNet, etc.). Up until now I was using an LR of 0.001, but after finding a topic here on the forums in which someone suggested that a high LR can cause exploding gradients, I lowered it to 0.000001, which does not solve the problem.

Help would be greatly appreciated!

Thank you to everyone who reads this

The most probable situation is that you have a NaN in your input, or some operation is generating a NaN. I would focus on checking the correctness of your input data.
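
For example, a quick sanity check right before the forward pass could look like this (a sketch; audio_features stands for whatever batch you feed the model):

if not torch.isfinite(audio_features).all():
    print("non-finite value (NaN/Inf) in the input batch")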

Hello Juan!

First of all, thank you for spending your time reading and replying, I truly appreciate it.

After reading your response I changed my program to stop predicting and instead print whether the extracted features contain any NaNs.

I have 2 folders, ‘train’ and ‘val’, each with 2 folders inside, one per class (in my case ‘nv’ and ‘v’, which stand for not violent and violent, since I’m using the models to determine whether a given scene is violent or not). Each of these folders has 50 videos, so 200 videos in total.

This whole situation occurred with the videos in the ‘train’ folder, and since the videos were randomly selected, I tested all 100 videos from the ‘nv’ and ‘v’ folders, extracting their features using the following code:

import numpy as np
import librosa
import librosa.display

def audio_feature_extractor(video_name):
    y, sr = librosa.load(video_name) # load the audio track

    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, norm='ortho') # extract MFCCs, array of shape (20, x)

    mfccs = mfccs / np.linalg.norm(mfccs) # normalize by the global L2 norm

    quadmesh = librosa.display.specshow(mfccs) # plot the MFCC array as a quadmesh

    fig = quadmesh.get_figure() # get the figure containing the plot

    inputs = get_img_from_fig(fig) # render the figure to an image tensor (3, 224, 224)

    inputs = inputs / 255 # scale pixel values to [0, 1]

    return inputs

and then I checked the resulting tensor to see if it contained any NaNs with the following code:

# AUDIO ANALYSIS
audio_features = audio_feature_extractor(video_name)

print(torch.isnan(audio_features).any())

Unfortunately, every single clip came out without any NaNs, so this does not appear to be the problem here.

If you have any clue about other issues that might be causing this, I’d love to test them, since I’ve been stuck on this problem for the last 2 weeks :sweat_smile:

Kind Regards,

Francisco

Yes, but for example you are dividing by np.linalg.norm(mfccs).
If that norm is close to zero, you would be in trouble.
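
A simple guard is to clamp the denominator, e.g. (sketch):

norm = np.linalg.norm(mfccs)
mfccs = mfccs / max(norm, 1e-8) # avoid dividing by a (near-)zero norm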

Anyway, a good thing you can do is check for NaNs in the loss (it’s a cheap operation).
If there is a NaN you can follow the error back, revise the input or the layers, and try to find the reason by reproducing it.
There are very few reasons why you would get a NaN.
These are:
A NaN in the weights (probably due to a NaN loss that was backpropagated).
An operation over an empty tensor.
A normalization step over a zero (or near-zero) denominator.
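
As for the loss check mentioned above, it can be as simple as this (a sketch; a_loss is whatever your criterion returns):

if not torch.isfinite(a_loss):
    print("non-finite loss detected for this batch")
    # inspect the current inputs / weights here instead of calling optimizer.step()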

Anyway, a good thing you can do is check for NaNs in the loss (it’s a cheap operation).
If there is a NaN you can follow the error back, revise the input or the layers, and try to find the reason by reproducing it.

Can you tell me how I’d be able to do that? So far, I’ve only found suggestions about using something called autograd.set_detect_anomaly(True), which I used as follows:

with torch.set_grad_enabled(phase == 'train'):
    with torch.autograd.set_detect_anomaly(True):
        audio_output = model_a(audio_features) # prediction

        #audio_output = torch.nan_to_num(model_a(audio_features), neginf=0.0, posinf=1.0) # make prediction

        if isinstance(audio_output, GoogLeNetOutputs): # extract the logits tensor from GoogLeNetOutputs
            audio_output = audio_output.logits

        preds = torch.argmax(audio_output, 1) # get prediction
        # print(f"preds: {preds}, audio_output: {audio_output}, labels: {labels}\n")
        a_loss = criterion(audio_output, labels) # calculate loss
        print(f"A Loss: {a_loss}")
        #a_loss = a_loss.float() # convert from long to float

        if phase == 'train':
            print('\n--- IF ---\n')
            a_loss.backward()

            for name, param in model_a.named_parameters():
                print(name, torch.isfinite(param.grad).all()) # check that grads are finite

            model_a_optimizer.step()

            print('\n--- STEP ---\n')
            for name, param in model_a.named_parameters():
                print(name, torch.isfinite(param.grad).all()) # check that grads are finite

            # statistics
            print('a: ', a_loss, a_loss.item(), audio_features.size(0))
            a_running_loss += a_loss.item() * audio_features.size(0)
            a_running_corrects += torch.sum(preds == labels.data)

            audio_dataset_sizes[phase] = audio_dataset_sizes[phase] + 1

Should I be using set_detect_anomaly or should I use something else?

There are very few reasons why you would get a NaN.

I assume you mean getting a NaN in the loss component, correct?