Infinite value occurring in training dataset after exactly 7 epochs

I’m having a problem where the LogSoftmax layer of my network returns NaN values at the 0th output after exactly 7 epochs, or at the beginning of the 8th, to be exact. I believe this is due to the presence of an inf value somewhere in the training dataset being fed into the network. The confusion comes from the fact that the dataset in question does not originally contain an inf value. When I import the dataset from an h5py file and use np.isinf(feature_data).any(), it returns False. After I convert the dataset into a tensor, I once again confirm that there is no inf present using torch.isinf(feature_data).any(), and it again returns False. However, after the LogSoftmax layer returns a NaN value at the beginning of the 8th epoch, torch.isinf(feature_data).any() returns True. The only explanation I can think of is that this occurs during the normalization process, but there are no random transformations used within the dataset, just a simple normalization using the maximum and minimum values.

I’m also not sure if this helps, but for some reason the training loss progressively explodes over the first 5 epochs, then suddenly drops back to a reasonable value. It starts at approximately 2-3, increases exponentially into the millions by the 5th epoch, then returns to approximately 1 on the 6th. Even stranger, the validation loss remains relatively stable at approximately 1-3 for the duration of training, and the validation accuracy still increases over time.

Any help would be greatly appreciated. Thanks!

Fixed my own problem. In case anyone else runs into the same issue, the problem was my custom dataset class. I mistakenly believed that Python created copies of variables by default, e.g. that
a = [5, 5, 5], b = a
would create two distinct variables, when in actuality b is simply a reference to a. This meant that the normalization code in my dataset class was editing the values of my data in place every time a sample was passed through the dataset class, which eventually led to inf values appearing after a set number of epochs.
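A minimal sketch of how this can play out. The constants, loop count, and normalization form here are hypothetical (my actual dataset code isn't shown), but they illustrate how repeatedly rescaling the same buffer in place eventually overflows to inf:

```python
import numpy as np

data = np.array([0.05, 0.08])
sample = data                  # an alias of data, NOT a copy

# Hypothetical precomputed min/max with MAX - MIN < 1; each pass
# rescales the SAME underlying buffer, so values grow 10x per epoch.
MIN, MAX = 0.0, 0.1

for epoch in range(400):
    sample -= MIN
    sample /= MAX - MIN        # in-place: this also rewrites data

print(np.isinf(data).any())    # True -- the "original" data overflowed
```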

My solution was to use the ndarray.copy() method to avoid the original data being edited.
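For reference, a sketch of the fix; normalize here is a hypothetical stand-in for the normalization step inside the dataset class, not my exact code:

```python
import numpy as np

def normalize(sample):
    # Min-max normalize a copy so the source array is never mutated
    out = sample.copy()
    out -= out.min()
    out /= out.max()           # after subtracting min, max equals max - min
    return out

data = np.array([1.0, 2.0, 3.0])
norm = normalize(data)
print(data)                    # [1. 2. 3.] -- unchanged
print(norm)                    # [0.  0.5 1. ]
```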
