Code runs fine once, then is always broken. Help?

Trying to get started with Kaggle and I’ve run into what appears to be a somewhat unique problem. The code can be found here if I did everything right:
You’ll have to check it out to see what I’m talking about, since I’m new and can only embed one thing >_>


BCEWithLogitsLoss becomes unstable while training a fully connected network with four hidden layers, giving NaNs after anywhere between the first and tenth call, which breaks the training. I have been googling a lot and trying to figure out where the instability is coming from, but to no avail. Please help!

#Further details:
For the “understanding stage” of solving this problem, I’ve been plotting the value of the loss function along with the training and test set accuracy. On a typical run, the loss gives no proper plot, because its values turn to NaN fairly quickly. The training set accuracy shows the model learning for an epoch or two and then oscillating once the loss function becomes useless.

It’s worth noting that I have tried some of the tricks out there on Stack Exchange for loss functions that go to NaN. Interestingly, when I add a small value (1e-10) to the predictions or labels, this is enough to allow training to complete one time. When you track the value of the loss function and the train/test accuracy, the successful run gives this:
While this may feel like a success, the terrible truth is that this only sometimes works, and only once. After a successful run, rerunning the exact same code fails again, with NaNs creeping in somewhere along the training.

One final note. The documentation for BCEWithLogitsLoss specifies that the target values should be between zero and one. To that end, I tried mapping the 0’s and 1’s in the “survived” column to points within the interval (e.g. 0 -> 1/3, 1 -> 2/3), but this always gives even worse results! Unfortunately, the accuracy drops to zero when doing this, which certainly isn’t what I’m after.
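For reference, here is a minimal pure-Python sketch of the numerically stable form that binary cross-entropy on logits is usually written in. It is not PyTorch's actual implementation, and the helper name `bce_with_logits` is made up for illustration. It suggests that hard 0/1 targets are fine on their own even with large logits, while a single NaN input is enough to make the loss NaN.

```python
import math

def bce_with_logits(logit, target):
    """Stable binary cross-entropy for one raw logit and one target in [0, 1].

    Algebraically equal to -[t*log(sigmoid(x)) + (1-t)*log(1-sigmoid(x))],
    rewritten so that exp() never overflows.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# Hard labels (0.0 and 1.0) with extreme logits still give finite losses.
losses = [bce_with_logits(x, t) for x, t in [(-50.0, 0.0), (50.0, 1.0), (0.0, 1.0)]]

# A NaN logit (for example, from a NaN feature) poisons the loss.
nan_loss = bce_with_logits(math.nan, 1.0)

print(losses, nan_loss)
```

If this sketch is representative, remapping the labels away from 0 and 1 should not be needed for stability; a NaN loss points more towards the inputs than towards the targets.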

If anyone can offer insight or help, that would be greatly appreciated. This same code and network train successfully on the UCI wine dataset, and I cannot figure out what’s different here.

I’m embarrassed. This was a data processing error. There were some NaNs in a column where I didn’t expect them.
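For anyone who lands here with the same symptom, a quick per-column count of missing values before training would have caught this. Below is a rough standard-library sketch; the toy data, the column names, and the helpers `to_float` and `nan_counts` are all made up for illustration and are not from my notebook.

```python
import csv
import io
import math

# Hypothetical toy data; the empty Age cell stands in for a missing value.
RAW = """Survived,Age,Fare
0,22,7.25
1,,71.28
1,26,7.92
"""

def to_float(cell):
    """Parse a CSV cell, mapping empty strings to NaN."""
    return float(cell) if cell.strip() else math.nan

def nan_counts(text):
    """Return {column name: number of NaN entries} for numeric CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, cell in row.items():
            if math.isnan(to_float(cell)):
                counts[name] += 1
    return counts

counts = nan_counts(RAW)
bad_columns = [name for name, n in counts.items() if n > 0]
print(counts)
print("columns with NaNs:", bad_columns)
```

With a pandas DataFrame, `df.isna().sum()` gives the same per-column count in one line.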