Why does my deep-learning model reach only 30% accuracy even after 1000 epochs?

Hi, friends,

The use case of my self-built PyTorch deep-learning model is based on patients’ medical-appointment booking behaviour. Specifically, it predicts how many days in advance (excluding weekends and public holidays) a patient will book a medical appointment. The range is 2 to 20 days.

This is actually quite a simple use case. However, the accuracy of my model stays very low: even after 1,000 epochs, the accuracy is only 30.9%, and the average loss is still as high as 2.058354.

In my model, each input sample is a vector composed of the following 7 fields:

  1. Gender: 1 – male, 2 – female
  2. Age
  3. Area: I take the first 3 digits of a patient’s residential postal code and map them to an integer from 0 to 999.
  4. Medical Examination: 1 – yes, 0 – no
  5. Blood Test: 1 – yes, 0 – no
  6. Urine Test: 1 – yes, 0 – no
  7. Fasting: 1 – yes, 0 – no

The labels of my model are obtained by subtracting 2 from the number of days booked in advance, so they range from 0 to 18, 19 categories in total.

In total, I prepared 377 training samples and 94 testing samples for the deep-learning model.

Basically, my deep-learning model follows the Fashion-MNIST model from the tutorials on the official pytorch.org website (see Optimizing Model Parameters — PyTorch Tutorials 1.12.0+cu102 documentation). In the model, I use the following parameters: the hidden layers of the neural network have size 512, the learning rate is 0.01, and the batch size is 10. (At the end of this post, I attach my full Python source code for your reference.)

Can any experts help me analyse and diagnose why my accuracy is so low? I feel that the low accuracy may be caused by one or more of the reasons below:

  1. Training data size is too small.
  2. The number of epochs (1000 so far) is not enough.
  3. The model is not good enough (do I need to consider a different algorithm for the model, e.g., is the loss function appropriate?).
  4. Other parameters, e.g., batch size, learning rate, and so on, may be misconfigured.

Thanks a lot for any help. If anybody needs more info to investigate the issue, please let me know, and I will do my best to provide it.

The source code of my PyTorch deep-learning model:

import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor, Lambda
import numpy as np
import pandas as pd

class CustomDataset(Dataset):
    'Characterizes a dataset for PyTorch'
    def __init__(self, csv_file):
        """
        Args:
            csv_file (string): Path to the csv file.
        """
        raw_data = pd.read_csv(csv_file)
        raw_data = torch.tensor(raw_data.to_numpy())
        x_size = list(raw_data.size())[1]
        # All columns except the last are features; the last column is the label.
        self.data_tensor = raw_data[:, :x_size-1].clone()
        self.data_tensor = self.data_tensor.type(torch.float32)
        self.label_tensor = raw_data[:, x_size-1:].clone()
        self.label_tensor = self.label_tensor.flatten()
        print(f"self.label_tensor.size() = {self.label_tensor.size()}")

    def __len__(self):
        'Denotes the total number of samples'
        return len(self.data_tensor)

    def __getitem__(self, index):
        data, label = self.data_tensor[index], self.label_tensor[index]
        return data, label

# Raw strings keep the backslashes literal ("\t" would otherwise become a tab).
training_data_csv_file = r"D:\Tools\PyTorch\Deep-Learning-Model\input_data\training_data.csv"
training_data = CustomDataset(training_data_csv_file)
test_data_csv_file = r"D:\Tools\PyTorch\Deep-Learning-Model\input_data\testing_data.csv"
test_data = CustomDataset(test_data_csv_file)

train_dataloader = DataLoader(training_data, batch_size=10)
test_dataloader = DataLoader(test_data, batch_size=10)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(7, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 19),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork()

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    print(f'size = {size}')
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(X)
        print(f'pred = {pred}')
        print(f'y = {y}')
        loss = loss_fn(pred, y)
        print(f'loss = {loss}')
        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

learning_rate = 0.01
batch_size = 10

def test_loop(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

epochs = 1000
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    test_loop(test_dataloader, model, loss_fn)
print("Done!")

Hi Small!

My speculation is that you are treating your problem as a classification
problem, that is, you are trying to predict the number of days exactly.

I think you will get better results treating your use case as a regression.

A bit of explanation:

Let’s say that the actual number of days you are trying to predict is 14.
When training as classification, if your model predicts 14, it’s right, and
you get a low loss function. But if your model predicts 13 or 15, it’s
wrong, and it’s just as wrong as if it had predicted 3 or 22, even though
a prediction of 13 or 15 is really quite good, even if not perfect.

In a regression, you will penalize your model more for being way off (e.g.,
3 or 22) than for being close, but not perfect (e.g., 13 or 15).
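
A small sketch of this difference, using made-up logits and predictions (the helper `confident_logits` and the numbers are illustrative assumptions, not taken from the model in this thread). For a true label of 12 (14 days), cross-entropy gives the same loss whether the confident wrong guess is one class away or eleven classes away, while MSE grows with the distance:

```python
import torch
from torch import nn

true_class = torch.tensor([12])  # label 12 corresponds to 14 days (days - 2)

def confident_logits(predicted_class, num_classes=19, strength=10.0):
    """Hypothetical helper: logits that put most probability on one class."""
    logits = torch.zeros(1, num_classes)
    logits[0, predicted_class] = strength
    return logits

ce = nn.CrossEntropyLoss()
ce_near = ce(confident_logits(11), true_class).item()  # off by one class
ce_far = ce(confident_logits(1), true_class).item()    # off by eleven classes

mse = nn.MSELoss()
target = torch.tensor([12.0])
mse_near = mse(torch.tensor([11.0]), target).item()    # squared error 1
mse_far = mse(torch.tensor([1.0]), target).item()      # squared error 121

print(f"cross-entropy: near={ce_near:.4f}, far={ce_far:.4f}")
print(f"MSE:           near={mse_near:.1f}, far={mse_far:.1f}")
```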

Regression seems to me to be a better conceptual fit for what you are
actually trying to accomplish.

Furthermore, even though your use case is straightforward and easy to
describe, I could see it being quite a difficult problem if your measure of
success is getting your predictions exactly right. Realistically, based on
your input data, how is your model supposed to predict 14 instead of 15
(for the example where the correct result is 14)? That seems hard, so an
accuracy rate of 30% doesn’t sound so bad.

To perform a regression, change your final Linear layer to have
out_features = 1, that is, Linear (512, 1), and use MSELoss as
your loss criterion. (Let your predictions be floating-point numbers, e.g.,
“predicted number of days = 14.72.”)
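
A minimal sketch of that change, assuming the layer sizes from the original post. The class name `RegressionNetwork`, the random placeholder batch, and the `squeeze(1)` that makes the output shape match the targets are assumptions added for illustration:

```python
import torch
from torch import nn

class RegressionNetwork(nn.Module):  # hypothetical name
    def __init__(self):
        super().__init__()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(7, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 1),  # single floating-point output
        )

    def forward(self, x):
        # Drop the trailing dimension so the output has shape [batch],
        # the same shape as the label tensor.
        return self.linear_relu_stack(x).squeeze(1)

model = RegressionNetwork()
loss_fn = nn.MSELoss()

X = torch.rand(10, 7)                     # placeholder features
y = torch.randint(0, 19, (10,)).float()   # placeholder labels, cast to float
pred = model(X)
loss = loss_fn(pred, y)
print(pred.shape, loss.item())
```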

Then measure your performance based not on how often your model
gets the number of days exactly right, but rather, on how close it usually
is. So use the mean-squared-error or root-mean-squared-error of your
predictions to judge how well your model is doing.
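
As a rough illustration, root-mean-squared-error can be computed as below. The function name `rmse` and the sample numbers are made up for the example:

```python
import torch

def rmse(pred, target):
    """Root-mean-squared-error, in the same unit as the target (days)."""
    return torch.sqrt(torch.mean((pred - target) ** 2))

pred = torch.tensor([12.7, 3.1, 6.0])     # made-up predictions
target = torch.tensor([12.0, 5.0, 6.0])   # made-up true values
error_in_days = rmse(pred, target).item()
print(f"RMSE = {error_in_days:.3f} days")
```

An RMSE of about 1.2 here would mean the predictions are typically off by a little over one day.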

Best.

K. Frank

Great!
Thanks Frank, I will give it a try.

Hi, Frank,

I have now modified my PyTorch source code to use the regression approach instead of the classification approach.

The relevant code changes are highlighted below:

  1. In the final layer, change “out_features” to 1:
    ====================================================================
    class NeuralNetwork(nn.Module):
        def __init__(self):
            super(NeuralNetwork, self).__init__()
            self.flatten = nn.Flatten()
            self.linear_relu_stack = nn.Sequential(
                nn.Linear(7, 512),
                nn.ReLU(),
                nn.Linear(512, 512),
                nn.ReLU(),
                nn.Linear(512, 1),
            )

    ====================================================================

  2. Change the data type of both data and label tensors from Long to float32:
    ====================================================================
    class CustomDataset(Dataset):
        'Characterizes a dataset for PyTorch'
        def __init__(self, csv_file):
            """
            Args:
                csv_file (string): Path to the csv file.
            """
            raw_data = pd.read_csv(csv_file)
            raw_data = torch.tensor(raw_data.to_numpy())
            x_size = list(raw_data.size())[1]
            self.data_tensor = raw_data[:, :x_size-1].clone()
            self.data_tensor = self.data_tensor.type(torch.float32)
            self.label_tensor = raw_data[:, x_size-1:].clone()
            self.label_tensor = self.label_tensor.type(torch.float32)
            self.label_tensor = self.label_tensor.flatten()

    ====================================================================

However, when I ran the new model, the predicted tensor was full of “nan” values (see the output below):

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
y = tensor([ 5., 4., 3., 6., 3., 10., 8., 5., 3., 1.])
loss = nan
pred = tensor([[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan]], grad_fn=)
y = tensor([8., 5., 1., 0., 4., 6., 1., 6., 5., 8.])
loss = nan
pred = tensor([[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan]], grad_fn=)
y = tensor([8., 1., 5., 5., 4., 1., 2.])
loss = nan
Test Error:
Accuracy: 6.4%, Avg loss: nan
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

I suspect that the function in the final layer is totally wrong: in a regression deep-learning model, we should use a sigmoid function instead of a linear function, right?

Thanks.

Hi, Frank,

I have just changed the final layer of the NN to a sigmoid function instead of a linear function (see the code snippet below):

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(7, 512),
            nn.ReLU(),
            nn.Linear(512, 1),
            nn.ReLU(),
            nn.Sigmoid(),
        )

Then I ran the model, but this time it returned all 1 values in the predicted tensor (see the output below):
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
y = tensor([ 5., 4., 3., 6., 3., 10., 8., 5., 3., 1.])
loss = 20.799999237060547
pred = tensor([[1.],
[1.],
[1.],
[1.],
[1.],
[1.],
[1.],
[1.],
[1.],
[1.]], grad_fn=)
y = tensor([8., 5., 1., 0., 4., 6., 1., 6., 5., 8.])
loss = 19.0
pred = tensor([[1.],
[1.],
[1.],
[1.],
[1.],
[1.],
[1.]], grad_fn=)
y = tensor([8., 1., 5., 5., 4., 1., 2.])
loss = 13.0
Test Error:
Accuracy: 6.4%, Avg loss: 7.595000

Done!
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Any idea on this issue?

Thanks for your further help.

Hi Small!

No, replacing your final Linear layer with a Sigmoid “activation” doesn’t
make sense. Sigmoid returns a value that ranges from 0.0 to 1.0. You
are trying to predict a number of days that ranges from 0 to 18 so it’s
only natural that your network will learn to always predict 1.0 as that’s
(almost) always the value that it can predict that is closest to the target
value.
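
A quick sketch of why this happens, using arbitrary raw inputs and an arbitrary target of 5 (both are illustrative assumptions). Whatever the raw value, the sigmoid output stays between 0 and 1, so the output with the smallest squared error against a target above 1 is the one closest to 1.0:

```python
import torch

raw = torch.linspace(-20.0, 20.0, steps=9)   # arbitrary pre-activation values
out = torch.sigmoid(raw)                     # always within [0, 1]

target_label = 5.0                           # any target above 1 behaves the same
sq_err = (out - target_label) ** 2
best = out[sq_err.argmin()].item()           # output with the smallest error

print(f"sigmoid range: [{out.min().item():.4f}, {out.max().item():.4f}]")
print(f"best achievable output for target {target_label}: {best:.4f}")
```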

Going back to your previous architecture (with the three Linear layers),
I don’t see anything that is obviously wrong. There is nothing wrong with
using ReLU for your activations. (Once you get things working, you could
also try Sigmoid activations and see which works better for your use case.)

To debug the nans you are getting, first read through your code looking
for any obvious bugs. (Always a good first step.)

Do one forward pass. Do you get a sensible (and non-nan) loss? Then
do one backward pass. Do you get non-nan gradients?
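
One way to run this check is sketched below. It uses the layer sizes posted later in the thread and a random placeholder batch instead of the real CSV data, so both the batch and the seed are assumptions:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(7, 20), nn.ReLU(),
    nn.Linear(20, 20), nn.ReLU(),
    nn.Linear(20, 1),
)
loss_fn = nn.MSELoss()

X = torch.rand(10, 7)        # placeholder batch of 10 samples
y = torch.rand(10) * 18      # placeholder targets in [0, 18)

# One forward pass: is the loss finite?
pred = model(X).flatten()
loss = loss_fn(pred, y)
loss_is_finite = torch.isfinite(loss).item()

# One backward pass: are all gradients finite?
loss.backward()
grads_are_finite = all(
    torch.isfinite(p.grad).all().item() for p in model.parameters()
)
print(f"loss finite: {loss_is_finite}, gradients finite: {grads_are_finite}")
```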

Then try optimizing with plain-vanilla SGD (no momentum nor weight
decay) with a very small learning rate. Is everything okay after a single
batch / update? Try training for a few batches. Is your loss tending to
go down (even if it jitters up a little sometimes)?

If you can get that working, try increasing your learning rate so that you
can train faster, but without having your training become unstable.
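
A sketch of such a test run, with plain SGD, a small learning rate of 1e-4 picked only for illustration, and random placeholder data in place of the real dataset. The point is to record the loss after every update and look at its trend before raising the learning rate:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(7, 20), nn.ReLU(), nn.Linear(20, 1))
loss_fn = nn.MSELoss()
# Plain SGD: no momentum, no weight decay, small learning rate.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

X_all = torch.rand(50, 7)     # placeholder features
y_all = torch.rand(50) * 18   # placeholder targets

losses = []
for epoch in range(3):
    for start in range(0, 50, 10):           # batches of 10
        X, y = X_all[start:start + 10], y_all[start:start + 10]
        loss = loss_fn(model(X).flatten(), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

print([round(v, 2) for v in losses])
```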

If you keep getting nans with these simple tests, try using autograd’s
anomaly detection to see if that helps you track the issue down.
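
A minimal sketch of anomaly detection, using a deliberately contrived expression (not from this thread) whose backward pass produces a nan. Inside `torch.autograd.detect_anomaly()`, the backward pass raises an error that names the offending function instead of silently propagating the nan:

```python
import torch

x = torch.tensor([0.0], requires_grad=True)
caught_nan = False
try:
    with torch.autograd.detect_anomaly():
        # The gradient of sqrt at 0 is infinite; multiplying by 0 gives 0 * inf = nan.
        y = (torch.sqrt(x) * 0.0).sum()
        y.backward()
except RuntimeError as err:
    caught_nan = True
    print(f"anomaly detection raised: {err}")
```

Anomaly detection slows training down, so it is usually switched on only while debugging.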

If things are under control, you can try turning on momentum, again in
order to train faster. (You could also try the Adam optimizer. It often trains
much faster, but can jump around more and be less stable.)

If you get to a stage where you can train stably for a while, but then your
training becomes unstable and / or you get nans, you could try turning
on weight decay to see if that helps stabilize things.
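
For reference, the optimizer variants mentioned above could be constructed roughly as follows. The learning rate, momentum, and weight-decay values are placeholder assumptions, not tuned recommendations:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(7, 20), nn.ReLU(), nn.Linear(20, 1))

# SGD with momentum turned on.
sgd_momentum = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# Adam as an alternative optimizer.
adam = torch.optim.Adam(model.parameters(), lr=1e-3)

# SGD with momentum and weight decay.
sgd_decay = torch.optim.SGD(
    model.parameters(), lr=1e-3, momentum=0.9, weight_decay=1e-4
)
```

Only one of these would be passed to the training loop at a time.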

Good luck.

K. Frank

Hi, Frank,

Thanks for your feedback.
By the way, do you feel that the size of the hidden layers (512) is too big? Can I reduce it to about 20-30? How should we decide the hidden-layer size based on the number of input/output features?

Hi, Frank,

I just printed the weights of all layers of my deep-learning model, and was surprised to find that, except for Layer 0, both Layers 2 and 4 have all weight values as “nan” (see the output below)!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Model Parameters:

Layer: linear_relu_stack.0.weight | Size: torch.Size([20, 7]) | Values : tensor([[ 0.0312, 0.1015, -0.2662, 0.3075, 0.2584, -0.0975, -0.0711],
[-0.0832, -0.2944, -0.2128, -0.2432, 0.0232, 0.3756, -0.1670]],
grad_fn=)

Layer: linear_relu_stack.0.bias | Size: torch.Size([20]) | Values : tensor([-0.0503, -0.1474], grad_fn=)

Layer: linear_relu_stack.2.weight | Size: torch.Size([20, 20]) | Values : tensor([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]],
grad_fn=)

Layer: linear_relu_stack.2.bias | Size: torch.Size([20]) | Values : tensor([nan, nan], grad_fn=)

Layer: linear_relu_stack.4.weight | Size: torch.Size([1, 20]) | Values : tensor([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]],
grad_fn=)

Layer: linear_relu_stack.4.bias | Size: torch.Size([1]) | Values : tensor([nan], grad_fn=)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

But how can this happen?

For your reference, I share the code that builds the NN model below. All I have actually done is change the loss function from nn.CrossEntropyLoss() to nn.MSELoss(), but somehow I hit this issue.

======================================================================
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor, Lambda
import numpy as np
import pandas as pd

class CustomDataset(Dataset):
    'Characterizes a dataset for PyTorch'
    def __init__(self, csv_file):
        """
        Args:
            csv_file (string): Path to the csv file.
        """
        raw_data = pd.read_csv(csv_file)
        raw_data = torch.tensor(raw_data.to_numpy())
        x_size = list(raw_data.size())[1]
        self.data_tensor = raw_data[:, :x_size-1].clone()
        self.data_tensor = self.data_tensor.type(torch.float32)
        self.label_tensor = raw_data[:, x_size-1:].clone()
        self.label_tensor = self.label_tensor.type(torch.float32)
        self.label_tensor = self.label_tensor.flatten()
        print(f"self.label_tensor.size() = {self.label_tensor.size()}")

    def __len__(self):
        'Denotes the total number of samples'
        return len(self.data_tensor)

    def __getitem__(self, index):
        data, label = self.data_tensor[index], self.label_tensor[index]
        return data, label

# Raw strings keep the backslashes literal ("\t" would otherwise become a tab).
training_data_csv_file = r"D:\Tools\PyTorch\Medex-Deep-Learning-Model\input_data\training_data.csv"
training_data = CustomDataset(training_data_csv_file)
test_data_csv_file = r"D:\Tools\PyTorch\Medex-Deep-Learning-Model\input_data\testing_data.csv"
test_data = CustomDataset(test_data_csv_file)

train_dataloader = DataLoader(training_data, batch_size=10)
test_dataloader = DataLoader(test_data, batch_size=10)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(7, 20),
            nn.ReLU(),
            nn.Linear(20, 20),
            nn.ReLU(),
            nn.Linear(20, 1),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork()

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    print(f'size = {size}')
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(X)
        print(f'pred = {pred}')
        print(f'y = {y}')
        loss = loss_fn(pred, y)
        # print(f'loss = {loss}')
        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

learning_rate = 0.01
batch_size = 10

def test_loop(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            print(f'testing-->pred = {pred}')
            print(f'testing-->y = {y}')
            loss = loss_fn(pred, y).item()
            print(f'testing-->loss = {loss}')
            test_loss += loss
            print(f'testing-->pred.argmax(1) = {pred.argmax(1)}')
            print(f'testing-->(pred.argmax(1) == y) = {(pred.argmax(1) == y)}')
            corr = (pred.argmax(1) == y).type(torch.float).sum().item()
            print(f'testing-->corr = {corr}')
            correct += corr
    print(f'testing-->test_loss = {test_loss}, num_batches = {num_batches}')
    test_loss /= num_batches
    print(f'testing-->correct = {correct}, size = {size}')
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

epochs = 10
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    test_loop(test_dataloader, model, loss_fn)
print("Done!")

print(f"Model structure: {model}\n\n")

print(f"Model Parameters:\n\n")
for name, param in model.named_parameters():
    print(f"Layer: {name} | Size: {param.size()} | Values : {param[:2]} \n")

Thanks again for your kind help.

Hi, Frank,

I refer to your post on 08 Aug.

" Do one forward pass. Do you get a sensible (and non-nan ) loss?"

I just tested the model with only 1 epoch, and it hit the “nan” issue immediately after running the loss function nn.MSELoss(). Please refer to the output below:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
testing–>y = tensor([0., 4., 0., 1., 2., 1., 1., 2., 2., 1.])
testing–>loss = nan
testing–>pred.argmax(1) = tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
testing–>(pred.argmax(1) == y) = tensor([ True, False, True, False, False, False, False, False, False, False])
testing–>corr = 2.0
testing–>pred = tensor([[nan],
[nan],
[nan],
[nan]])
testing–>y = tensor([3., 1., 6., 3.])
D:\Tools\Python\lib\site-packages\torch\nn\modules\loss.py:530: UserWarning: Using a target size (torch.Size([4])) that is different to the input size (torch.Size([4, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
return F.mse_loss(input, target, reduction=self.reduction)
testing–>loss = nan
testing–>pred.argmax(1) = tensor([0, 0, 0, 0])
testing–>(pred.argmax(1) == y) = tensor([False, False, False, False])
testing–>corr = 0.0
testing–>test_loss = nan, num_batches = 10
testing–>correct = 6.0, size = 94
Test Error:
Accuracy: 6.4%, Avg loss: nan
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Do you have any idea about this MSELoss() “nan” issue?

Thanks.

Sorry, Frank,

Please ignore my reply above, because the info is incorrect.
I just used the “findstr” command to pick out all training loss values, and saw what really happens: the loss value increases rapidly, roughly exponentially, quickly goes out of range, and then hits the “nan” issue. Please refer to the output below:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
D:\Tools\PyTorch\Medex-Deep-Learning-Model\Program>python Medex_Booking_Model.py | findstr /c:"Training–>loss = " | more
D:\Tools\Python\lib\site-packages\torch\nn\modules\loss.py:530: UserWarning: Using a target size (torch.Size([10])) that is different to the input size (torch.Size([10, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
return F.mse_loss(input, target, reduction=self.reduction)
D:\Tools\Python\lib\site-packages\torch\nn\modules\loss.py:530: UserWarning: Using a target size (torch.Size([7])) that is different to the input size (torch.Size([7, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
return F.mse_loss(input, target, reduction=self.reduction)
D:\Tools\Python\lib\site-packages\torch\nn\modules\loss.py:530: UserWarning: Using a target size (torch.Size([4])) that is different to the input size (torch.Size([4, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
return F.mse_loss(input, target, reduction=self.reduction)
Training–>loss = 28.644052505493164
Training–>loss = 9922584576.0
Training–>loss = 2.5872629588033536e+16
Training–>loss = 2.4454491212850324e+24
Training–>loss = 9.77978754869175e+20
Training–>loss = 9.392507535610442e+20
Training–>loss = 9.020564390322656e+20
Training–>loss = 8.663350126878696e+20
Training–>loss = 8.320280684701889e+20
Training–>loss = 7.990798039650906e+20
Training–>loss = 7.674363167145344e+20
Training–>loss = 7.370456745853174e+20
Training–>loss = 7.078588305627477e+20
Training–>loss = 6.798274413195753e+20
Training–>loss = 6.529064004907822e+20
Training–>loss = 6.270512350300482e+20
Training–>loss = 6.022201459033317e+20
Training–>loss = 5.783721785015214e+20
Training–>loss = 5.5546866519969156e+20
Training–>loss = 5.457337345065749e+34
Training–>loss = nan
Training–>loss = nan
Training–>loss = nan
Training–>loss = nan
Training–>loss = nan
Training–>loss = nan
Training–>loss = nan
Training–>loss = nan
Training–>loss = nan
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Please help me analyze why the MSE loss value grows so explosively.
Thanks!

Don’t ignore this warning as it’s real:

UserWarning: Using a target size (torch.Size([10])) that is different to the input size (torch.Size([10, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
return F.mse_loss(input, target, reduction=self.reduction)

Fix the shape of the model output and the target and make sure they are matching. Otherwise the loss calculation will broadcast one tensor and will thus calculate a wrong loss value.
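
A small sketch of what the broadcasting does, with made-up values. A prediction of shape [3, 1] and a target of shape [3] are broadcast to a [3, 3] grid of differences, so the loss is non-zero even when every prediction equals its own target:

```python
import warnings

import torch
from torch import nn

pred = torch.tensor([[1.0], [2.0], [3.0]])   # shape [3, 1], like the model output
target = torch.tensor([1.0, 2.0, 3.0])       # shape [3], like the labels
loss_fn = nn.MSELoss()

broadcast_shape = (pred - target).shape      # differences form a [3, 3] grid

with warnings.catch_warnings():
    warnings.simplefilter("ignore")          # silence the size-mismatch UserWarning
    mismatched = loss_fn(pred, target).item()

matched = loss_fn(pred.flatten(), target).item()  # shapes [3] and [3]

print(f"broadcast shape: {tuple(broadcast_shape)}")
print(f"mismatched shapes: loss = {mismatched:.4f}")
print(f"matched shapes:    loss = {matched:.4f}")
```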

Now I understand why the loss value keeps increasing so badly: since there are 19 labels, ranging from 0 to 18, if we get an output value of 1, the MSE loss quickly jumps to (18-1)^2=289. In the next round, the value jumps further to (289-18)^2 = 271^2, and so on; this is a runaway “divergent” trend.

To prevent this divergence, I plan to take the following two approaches: (1) I will still apply a sigmoid function in the final layer of the NN, to limit every output value to between 0 and 1; (2) the crucial point is that the label values are too big and must also be confined to between 0 and 1. So I will redefine label 0 as 1/19, label 1 as 2/19, and so on, so that the original label 18 maps to 19/19, i.e., 1.

This way, hopefully we can keep the loss values between 0 and 1 and keep the training converging.
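
The label mapping described above could be sketched as follows. The function names `scale_labels` and `unscale_to_days` are hypothetical, and the inverse mapping back to days is an added assumption for illustration:

```python
import torch

NUM_CLASSES = 19  # labels 0..18, as described above

def scale_labels(labels):
    """Map label k to (k + 1) / 19, so 0 -> 1/19 and 18 -> 1."""
    return (labels.float() + 1.0) / NUM_CLASSES

def unscale_to_days(scaled):
    """Invert the scaling, round to the nearest label, and add the 2-day offset."""
    labels = torch.round(scaled * NUM_CLASSES - 1.0)
    return labels + 2

labels = torch.tensor([0, 1, 12, 18])
scaled = scale_labels(labels)
days = unscale_to_days(scaled)
print(scaled, days)
```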

Hi, ptrblck,

Thank you for the reminder. I have fixed the issue by flattening the pred tensor from 2D to 1D (see the code below):

===========================================================================

pred = model(X)
pred = pred.flatten()

Dear friends,

I have managed to implement the sigmoid function in the final layer of the NN, and to scale the label values into the range [0,1] (as mentioned in my last post). My new NN structure is shown in the Python source code below:

===================================================================
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(7, 20),
            nn.ReLU(),
            nn.Linear(20, 1),
            nn.ReLU(),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

===================================================================

I also scaled the label data to lie between 0 and 1 (see the Python source code that initializes the label tensor below).

===================================================================
class CustomDataset(Dataset):
    'Characterizes a dataset for PyTorch'
    def __init__(self, csv_file):
        """
        Args:
            csv_file (string): Path to the csv file.
        """
        raw_data = pd.read_csv(csv_file)
        raw_data = torch.tensor(raw_data.to_numpy())
        x_size = list(raw_data.size())[1]
        self.data_tensor = raw_data[:, :x_size-1].clone()
        self.data_tensor = self.data_tensor.type(torch.float32)
        self.label_tensor = raw_data[:, x_size-1:].clone()
        self.label_tensor = self.label_tensor.type(torch.float32)
        self.label_tensor = self.label_tensor.flatten()
        # projected to 0, 1/18, 2/18, ..., 17/18, 1
        self.label_tensor = self.label_tensor.div(18.0)
        print(f"self.label_tensor.size() = {self.label_tensor.size()}")

    def __len__(self):
        'Denotes the total number of samples'
        return len(self.data_tensor)

    def __getitem__(self, index):
        data, label = self.data_tensor[index], self.label_tensor[index]
        return data, label

Then I reran the model and found that the “nan” issue is gone, but I hit another issue: after 100 epochs, all predicted values converged to the middle value, 0.5, which corresponds to 11 days in advance. Please see the output below:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
testing–>pred = tensor([0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000])
testing–>y = tensor([0.0556, 0.0556, 0.0556, 0.3889, 0.3889, 0.0556, 0.1111, 0.0556, 0.1111,
0.2222])
testing–>loss = 0.13919752836227417
testing–>pred.argmax(0) = 0
testing–>(pred.argmax(0) == y) = tensor([False, False, False, False, False, False, False, False, False, False])
testing–>corr = 0.0
testing–>pred = tensor([0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000])
testing–>y = tensor([0.1111, 0.2222, 0.3333, 0.0556, 0.1111, 0.4444, 0.0000, 0.0556, 0.1111,
0.1111])
testing–>loss = 0.1358024775981903
testing–>pred.argmax(0) = 0
testing–>(pred.argmax(0) == y) = tensor([False, False, False, False, False, False, True, False, False, False])
testing–>corr = 1.0
testing–>pred = tensor([0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000])
testing–>y = tensor([0.1111, 0.1111, 0.0556, 0.1667, 0.1667, 0.1111, 0.0556, 0.2778, 0.0556,
0.2778])
testing–>loss = 0.13672839105129242
testing–>pred.argmax(0) = 0
testing–>(pred.argmax(0) == y) = tensor([False, False, False, False, False, False, False, False, False, False])
testing–>corr = 0.0
testing–>pred = tensor([0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000])
testing–>y = tensor([0.0000, 0.2222, 0.0000, 0.0556, 0.1111, 0.0556, 0.0556, 0.1111, 0.1111,
0.0556])
testing–>loss = 0.18209877610206604
testing–>pred.argmax(0) = 0
testing–>(pred.argmax(0) == y) = tensor([ True, False, True, False, False, False, False, False, False, False])
testing–>corr = 2.0
testing–>pred = tensor([0.5000, 0.5000, 0.5000, 0.5000])
testing–>y = tensor([0.1667, 0.0556, 0.3333, 0.1667])
testing–>loss = 0.11188270896673203
testing–>pred.argmax(0) = 0
testing–>(pred.argmax(0) == y) = tensor([False, False, False, False])
testing–>corr = 0.0
testing–>test_loss = 1.3430555164813995, num_batches = 10
testing–>correct = 6.0, size = 94
Test Error:
Accuracy: 6.4%, Avg loss: 0.134306
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Then the accuracy rate is only 6.4%.
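
Note that in the output above, `pred.argmax(0)` returns the position of the largest prediction in the batch, not a predicted label, so comparing it with `y` does not measure accuracy. A possible alternative for a scaled regression output is sketched below, assuming the divide-by-18 scaling from the code above and made-up prediction values:

```python
import torch

pred_scaled = torch.tensor([0.05, 0.40, 0.52, 0.10])   # made-up model outputs
y_scaled = torch.tensor([1, 7, 9, 4]) / 18.0            # labels scaled by 1/18

# Undo the scaling and round to the nearest integer label.
pred_label = torch.round(pred_scaled * 18.0)
true_label = torch.round(y_scaled * 18.0)
accuracy = (pred_label == true_label).float().mean().item()

# For comparison: argmax over the batch is just an index into the batch.
argmax_index = pred_scaled.argmax(0).item()

print(f"rounded-label accuracy = {accuracy:.2f}, argmax index = {argmax_index}")
```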

Can anybody help me diagnose this “converging to the same value” issue?

Thanks.
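
One property of the posted layer stack that may be relevant here: it ends with `nn.ReLU()` followed by `nn.Sigmoid()`. ReLU outputs are never negative, and the sigmoid of 0 is 0.5, so this combination cannot output anything below 0.5, and outputs exactly 0.5 whenever the value entering the ReLU is zero or negative. A small sketch with arbitrary input values:

```python
import torch
from torch import nn

head = nn.Sequential(nn.ReLU(), nn.Sigmoid())   # last two layers of the posted stack

z = torch.tensor([-50.0, -1.0, 0.0, 0.5, 3.0])  # arbitrary values entering the ReLU
out = head(z)
print(out)  # the first three entries are exactly 0.5
```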

I have now run the model for 1000 epochs, and found that this time all predicted values approach 0 (see the output below).

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
pred = tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
pred = tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
pred = tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
pred = tensor([1.1067e-35, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
pred = tensor([1.6154e-33, 1.6154e-33, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
pred = tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
pred = tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 7.7509e-16, 0.0000e+00,
pred = tensor([0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0002,
pred = tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
pred = tensor([0.0000e+00, 4.3497e-04, 1.0232e-17, 1.0232e-17, 1.0232e-17, 0.0000e+00,
pred = tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
pred = tensor([0.0000e+00, 4.8779e-15, 4.8779e-15, 4.8779e-15, 5.1095e-36, 0.0000e+00,
pred = tensor([1.8333e-32, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 2.7634e-14,
pred = tensor([0.0000e+00, 0.0000e+00, 6.7904e-24, 0.0000e+00, 0.0000e+00, 0.0000e+00,
pred = tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 9.7125e-13, 1.5379e-35, 4.2071e-17,
pred = tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 2.5491e-35, 0.0000e+00,
pred = tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
pred = tensor([0.0000e+00, 1.8329e-30, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
pred = tensor([0.0000e+00, 6.4101e-16, 0.0000e+00, 0.0000e+00, 1.2785e-36, 1.2785e-36,
pred = tensor([1.1297e-18, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
pred = tensor([0.0000e+00, 5.8366e-35, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
pred = tensor([1.1963e-12, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
pred = tensor([7.0908e-31, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 9.5245e-36,
pred = tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
pred = tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
pred = tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
pred = tensor([0., 0., 0., 0., 0., 0., 0.], grad_fn=)
testing–>pred = tensor([0.0000e+00, 1.2501e-15, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
testing–>pred = tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
testing–>pred = tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.8935e-14, 0.0000e+00,
testing–>pred = tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
testing–>pred = tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
testing–>pred = tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
testing–>pred = tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
testing–>pred = tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.2295e-15,
testing–>pred = tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 3.2930e-35,
testing–>pred = tensor([0., 0., 0., 0.])
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

So the trend is roughly this: as the number of epochs increases, the predicted values first all converge to 1, then keep decreasing; after around 100 epochs they all converge to 0.5, and finally they all approach 0.

Isn’t this quite an abnormal learning result? I suspect that something is wrong with the functions used in my NN.

Can any experts help me look into this “convergence” issue?

Thanks.

@KFrank posted great tips in this post and advised against using a Sigmoid layer as it wouldn’t match your use case.
Besides that, he also walked you through what to check first and how to try to stabilize your training.