I am trying to build an autoencoder whose encoder and decoder are nested TreeLSTMs. During training, after some iterations the loss becomes NaN. I have tried decreasing the learning rate, gradient clipping, and data normalization, but the loss still becomes NaN. What could be wrong?
This is the error message: RuntimeError: Function 'MulBackward0' returned nan values in its 0th output.
Depending on your model and your training setup, there can be a few reasons.
If your model is deep enough, it might suffer from vanishing or exploding gradients, especially if you use sigmoid/tanh activations and don't apply any regularization.
It can also be that your learning rate was too high, so the gradients became infinite and your model diverged.
These are typical scenarios, and a few tricks should help, such as choosing a smaller learning rate or applying regularization.
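A minimal sketch of those tricks in a PyTorch training step. The model, optimizer, and batch here are placeholders (a plain `Linear` standing in for the TreeLSTM autoencoder), not the poster's actual code; the relevant parts are `set_detect_anomaly`, the smaller learning rate, and `clip_grad_norm_`:

```python
import torch

torch.autograd.set_detect_anomaly(True)  # report the first op that produces NaN

model = torch.nn.Linear(8, 8)            # placeholder for the real autoencoder
opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # smaller learning rate
loss_fn = torch.nn.MSELoss()

x = torch.randn(32, 8)                   # placeholder batch
loss = loss_fn(model(x), x)

opt.zero_grad()
loss.backward()
# clip the global gradient norm so one bad batch cannot blow up the weights
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```

Anomaly detection slows training noticeably, so it is best enabled only while hunting for the source of the NaN.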
My training set is 3000 style and content tensors. When I reduce my training set to 100 tensors and train my model only on those, I don't get NaN in my loss, but when I increase the training set again I do. Could there be another problem going on?
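That symptom (NaN only with the larger set) is consistent with a few bad samples containing NaN/inf values. A quick sketch of how to check, where `dataset` is a hypothetical list of (style, content) tensor pairs standing in for the real data:

```python
import torch

# hypothetical dataset with one deliberately corrupted sample
dataset = [(torch.randn(4), torch.randn(4)) for _ in range(5)]
dataset[3] = (torch.tensor([1.0, float("nan"), 0.0, 2.0]), torch.randn(4))

# collect indices of samples with any non-finite (NaN or inf) values
bad = [i for i, (style, content) in enumerate(dataset)
       if not (torch.isfinite(style).all() and torch.isfinite(content).all())]
print(bad)  # → [3]
```

If this reports any indices on the real 3000-tensor set, those samples would explain why the small subset trains cleanly.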