Training and val loss nearly zero

I have a simulated dataset of an electric grid that I am feeding into two networks (an LSTM and a VAE). The task is just to reconstruct the input, but something seems off and I can’t find the bug: both the training loss and the validation loss are almost zero from the start and stay about the same (they only get minimally smaller).

I tried to figure out whether the model or the data pipeline is the issue, but with a simpler dataset I have been using in another project (also grids, but with different features) everything looks fine: the training loss starts relatively low and keeps decreasing throughout training, while the validation loss stays low throughout (it’s not the best dataset).

Could it be that my dataset is the culprit here?
Also, where should I look first?

A simple litmus test would be to check whether your setup behaves similarly on random train/test data as it does on your “real” data; random data is usually much harder for a model to fit.
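For instance, something along these lines (a minimal sketch; the shapes, batch size, and names are placeholders for whatever your pipeline actually uses):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical sequence shape; match this to your real data.
n_samples, seq_len, n_features = 1000, 50, 16

# Pure noise as both input and reconstruction target: there is no structure
# to learn, so the loss should stay noticeably higher than on real data.
x_random = torch.rand(n_samples, seq_len, n_features)
random_loader = DataLoader(TensorDataset(x_random, x_random),
                           batch_size=64, shuffle=True)

# Run your existing training loop unchanged, only swapping in `random_loader`.
# If the loss ends up just as low as on the real data, suspect the pipeline
# (e.g., a trivial identity path from input to output) rather than the data.
```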

Thanks for the quick answer. I did as you suggested, and this is what I got with the random data:

Epoch [1/10], Training Loss: 0.0689, Val Loss: 0.0672
Epoch [2/10], Training Loss: 0.0703, Val Loss: 0.0675
Epoch [3/10], Training Loss: 0.0687, Val Loss: 0.0672
Epoch [4/10], Training Loss: 0.0689, Val Loss: 0.0675
Epoch [5/10], Training Loss: 0.0691, Val Loss: 0.0674
Epoch [6/10], Training Loss: 0.0693, Val Loss: 0.0671
Epoch [7/10], Training Loss: 0.0688, Val Loss: 0.0670
Epoch [8/10], Training Loss: 0.0685, Val Loss: 0.0670
Epoch [9/10], Training Loss: 0.0692, Val Loss: 0.0676
Epoch [10/10], Training Loss: 0.0696, Val Loss: 0.0672

The initial run (with the full dataset):

Epoch [1/10], Training Loss: 0.0002, Val Loss: 0.0003
Epoch [2/10], Training Loss: 0.0001, Val Loss: 0.0003
Epoch [3/10], Training Loss: 0.0000, Val Loss: 0.0002
Epoch [4/10], Training Loss: 0.0000, Val Loss: 0.0003
Epoch [5/10], Training Loss: 0.0000, Val Loss: 0.0003
Epoch [6/10], Training Loss: 0.0000, Val Loss: 0.0003
Epoch [7/10], Training Loss: 0.0000, Val Loss: 0.0003
Epoch [8/10], Training Loss: 0.0000, Val Loss: 0.0003
Epoch [9/10], Training Loss: 0.0000, Val Loss: 0.0003
Epoch [10/10], Training Loss: 0.0000, Val Loss: 0.0004

And the simpler dataset (the other simulation):

Epoch [1/10], Training Loss: 0.1900, Val Loss: 0.1628
Epoch [2/10], Training Loss: 0.1407, Val Loss: 0.1215
Epoch [3/10], Training Loss: 0.0735, Val Loss: 0.0155
Epoch [4/10], Training Loss: 0.0653, Val Loss: 0.0112
Epoch [5/10], Training Loss: 0.0608, Val Loss: 0.0102
Epoch [6/10], Training Loss: 0.0535, Val Loss: 0.0098
Epoch [7/10], Training Loss: 0.0504, Val Loss: 0.0090
Epoch [8/10], Training Loss: 0.0483, Val Loss: 0.0091
Epoch [9/10], Training Loss: 0.0461, Val Loss: 0.0081
Epoch [10/10], Training Loss: 0.0438, Val Loss: 0.0080

Could you have a look at the results above?
Would it make sense to not train on the whole training data? And does it make sense to look at the training behavior at a lower level than epochs?

If random training data is more difficult to fit, then that is a good sign. However, without knowing what your loss function is or what the final downstream evaluation metrics are, it’s difficult to interpret the scale of the loss meaningfully.

Whether you want to reserve more data for validation is another question, but that depends on your problem scenario and on whether you think the validation data is representative of your target use case.

The loss function is MSE. Initially (for the VAE part) I had also included a KL loss term, but that gave me a negative loss, even with a small beta.
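For reference, the usual form of such a combined loss is roughly as follows (a sketch with placeholder names; the closed-form Gaussian KL is non-negative analytically, so a negative total usually points to a sign error or a mismatch in how the two terms are reduced):

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, logvar, beta=1e-3):
    """MSE reconstruction plus a beta-weighted KL(q(z|x) || N(0, I))."""
    recon = F.mse_loss(x_hat, x, reduction="mean")
    # Closed-form KL for a diagonal Gaussian against a standard normal,
    # averaged over batch and latent dimensions. This term is >= 0.
    kl = -0.5 * torch.mean(1.0 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```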

I tried tracking how the loss behaves within one epoch, and it seems that the loss converges toward its minimum after roughly 100 batches.
My assumption is that the data is highly repetitive and very easy to learn; I probably wouldn’t even need this much data, or this kind of network, to begin with.
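The per-batch logging was done roughly along these lines (a sketch with a toy model and loader standing in for the real ones, just to show the pattern):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the real model and data, only to illustrate the logging.
x = torch.rand(2048, 16)
train_loader = DataLoader(TensorDataset(x, x), batch_size=32, shuffle=True)
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 16))
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Log every 20th batch within the epoch instead of one number per epoch.
for batch_idx, (inputs, targets) in enumerate(train_loader):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    if batch_idx % 20 == 0:
        print(f"Batch [{batch_idx + 1}/{len(train_loader)}], "
              f"Batch Training Loss: {loss.item():.4f}")
```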

Batch [1/7439], Batch Training Loss: 0.0888
Batch [21/7439], Batch Training Loss: 0.0738
Batch [41/7439], Batch Training Loss: 0.0455
Batch [61/7439], Batch Training Loss: 0.0136
Batch [81/7439], Batch Training Loss: 0.0074
Batch [101/7439], Batch Training Loss: 0.0066
Batch [121/7439], Batch Training Loss: 0.0052
...
Batch [2000/7439], Batch Training Loss: 0.0006

Right, and the scale of the MSE would depend on your data (e.g., whether any normalization was applied during preprocessing).

Yes, normalization to the range 0–1 was applied, and I did look at the data manually as well.
To give you some more info on the data:
It’s an electric grid with generators, nodes, transformers, and loads. Additionally, we have static values for the cables. All input data were normalized beforehand.

Here is what the data looks like:
[Plots of the normalized voltage, current, load, generator, and transformer signals]

The voltage, current, and transformer values appear very small, so if any of those is a prediction target, the magnitude of the MSE will be small as well; it’s important to consider the magnitude of the prediction target alongside the MSE.
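One way to make that concrete is to report a scale-aware error next to the raw MSE, e.g. normalizing by the variance of the target (a sketch with synthetic numbers; a value around 1.0 then means "no better than predicting the mean"):

```python
import torch

def mse_and_nmse(y_true, y_pred, eps=1e-12):
    """Raw MSE plus MSE normalized by the target's variance."""
    mse = torch.mean((y_true - y_pred) ** 2)
    nmse = mse / (torch.var(y_true) + eps)  # ~1.0 == as bad as predicting the mean
    return mse.item(), nmse.item()

# Tiny targets make the raw MSE look tiny even when the fit is poor.
y_true = 0.01 * torch.rand(1000)
y_pred = y_true + 0.005 * torch.randn(1000)
print(mse_and_nmse(y_true, y_pred))  # small MSE, but NMSE well above 1
```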

Above, I’m using the input data as the target, so it’s pure reconstruction, but I get what you’re saying; it makes absolute sense. However, the normalization is category-wise, so even though the current and voltage are very small in this example, there are other examples where the signal reaches 1.
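For illustration, the category-wise scaling is along these lines (a sketch; the arrays and value ranges are made up, not the real data):

```python
import torch

def minmax_per_category(values, eps=1e-12):
    """Scale one category to [0, 1] using its own global min/max."""
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo + eps)

# Each category is normalized independently, so a signal that looks tiny in
# one sample can still reach 1.0 somewhere else within the same category.
raw = {
    "voltage": 0.95 + 0.10 * torch.rand(1000, 20),
    "current": 400.0 * torch.rand(1000, 20),
}
normalized = {name: minmax_per_category(vals) for name, vals in raw.items()}
```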

I tried changing the training targets from reconstruction to actual prediction tasks (generators, transformers, loads, cables → voltage/current).
Still the same learning outcome: it overfits pretty quickly, after only a few hundred batches.

I simulated the data again and this time made it 8x more complex (in the sense of fewer steady states), but the outcome is still similar.

I have also capped the KL divergence loss term (with torch.clamp) at min=0.1; otherwise it would go negative (a sketch of this is below, after the log). Here is the latest run again:

Batch [1/14999], Batch Training Loss: 3.3345
Batch [2/14999], Batch Training Loss: 3.1343
Batch [3/14999], Batch Training Loss: 2.9440
Batch [4/14999], Batch Training Loss: 2.7247
Batch [5/14999], Batch Training Loss: 2.4829
Batch [6/14999], Batch Training Loss: 2.2184
Batch [7/14999], Batch Training Loss: 1.8933
Batch [8/14999], Batch Training Loss: 1.5799
Batch [9/14999], Batch Training Loss: 1.2666
Batch [10/14999], Batch Training Loss: 0.8640
Batch [11/14999], Batch Training Loss: 0.6118
Batch [12/14999], Batch Training Loss: 0.3489
Batch [13/14999], Batch Training Loss: 0.1858
Batch [14/14999], Batch Training Loss: 0.0420
Batch [15/14999], Batch Training Loss: 0.0355
Batch [16/14999], Batch Training Loss: 0.0321
Batch [17/14999], Batch Training Loss: 0.0269
Batch [18/14999], Batch Training Loss: 0.0233
Batch [19/14999], Batch Training Loss: 0.0214
Batch [20/14999], Batch Training Loss: 0.0197
Batch [21/14999], Batch Training Loss: 0.0196
Batch [22/14999], Batch Training Loss: 0.0202
Batch [23/14999], Batch Training Loss: 0.0206
Batch [24/14999], Batch Training Loss: 0.0186
Batch [25/14999], Batch Training Loss: 0.0957
Batch [26/14999], Batch Training Loss: 0.0159
Batch [27/14999], Batch Training Loss: 0.0149
Batch [28/14999], Batch Training Loss: 0.0143
Batch [29/14999], Batch Training Loss: 0.0143
Batch [30/14999], Batch Training Loss: 0.0143
Batch [31/14999], Batch Training Loss: 0.0137
Batch [32/14999], Batch Training Loss: 0.0138
Batch [33/14999], Batch Training Loss: 0.0132
Batch [34/14999], Batch Training Loss: 0.0141
Batch [35/14999], Batch Training Loss: 0.0132
Batch [36/14999], Batch Training Loss: 0.0129
Batch [37/14999], Batch Training Loss: 0.0133
Batch [38/14999], Batch Training Loss: 0.0137
Batch [39/14999], Batch Training Loss: 0.0130
Batch [40/14999], Batch Training Loss: 0.0128
Batch [41/14999], Batch Training Loss: 0.0126
Batch [42/14999], Batch Training Loss: 0.0125
Batch [43/14999], Batch Training Loss: 0.0128
Batch [44/14999], Batch Training Loss: 0.0126
Batch [45/14999], Batch Training Loss: 0.0123
Batch [46/14999], Batch Training Loss: 0.0122
Batch [47/14999], Batch Training Loss: 0.0119
Batch [48/14999], Batch Training Loss: 0.0120
Batch [49/14999], Batch Training Loss: 0.0119
Batch [50/14999], Batch Training Loss: 0.0120
Batch [51/14999], Batch Training Loss: 0.0117
Batch [52/14999], Batch Training Loss: 0.0118
Batch [53/14999], Batch Training Loss: 0.0119
Batch [54/14999], Batch Training Loss: 0.0116
Batch [55/14999], Batch Training Loss: 0.0118
Batch [56/14999], Batch Training Loss: 0.0114
Batch [57/14999], Batch Training Loss: 0.0115
Batch [58/14999], Batch Training Loss: 0.0116
Batch [59/14999], Batch Training Loss: 0.0115
Batch [60/14999], Batch Training Loss: 0.0115
Batch [61/14999], Batch Training Loss: 0.0116
Batch [62/14999], Batch Training Loss: 0.0116
Batch [63/14999], Batch Training Loss: 0.0114
Batch [64/14999], Batch Training Loss: 0.0116
Batch [65/14999], Batch Training Loss: 0.0114
Batch [66/14999], Batch Training Loss: 0.0113
Batch [67/14999], Batch Training Loss: 0.0115
Batch [68/14999], Batch Training Loss: 0.0116
Batch [69/14999], Batch Training Loss: 0.0113
Batch [70/14999], Batch Training Loss: 0.0114
Batch [71/14999], Batch Training Loss: 0.0113
Batch [72/14999], Batch Training Loss: 0.0114
Batch [73/14999], Batch Training Loss: 0.0115
Batch [74/14999], Batch Training Loss: 0.0114
Batch [75/14999], Batch Training Loss: 0.0114
Batch [76/14999], Batch Training Loss: 0.0115
Batch [77/14999], Batch Training Loss: 0.0115
Batch [78/14999], Batch Training Loss: 0.0113
Batch [79/14999], Batch Training Loss: 0.0113
Batch [80/14999], Batch Training Loss: 0.0112
Batch [81/14999], Batch Training Loss: 0.0113
Batch [82/14999], Batch Training Loss: 0.0114
Batch [83/14999], Batch Training Loss: 0.0112
Batch [84/14999], Batch Training Loss: 0.0113
Batch [85/14999], Batch Training Loss: 0.0110
Batch [86/14999], Batch Training Loss: 0.0112
Batch [87/14999], Batch Training Loss: 0.0113
Batch [88/14999], Batch Training Loss: 0.0113
Batch [89/14999], Batch Training Loss: 0.0114
Batch [90/14999], Batch Training Loss: 0.0114
Batch [91/14999], Batch Training Loss: 0.0113
Batch [92/14999], Batch Training Loss: 0.0113
Batch [93/14999], Batch Training Loss: 0.0113
Batch [94/14999], Batch Training Loss: 0.0113
Batch [95/14999], Batch Training Loss: 0.0113
Batch [96/14999], Batch Training Loss: 0.0111
Batch [97/14999], Batch Training Loss: 0.0114
Batch [98/14999], Batch Training Loss: 0.0112
Batch [99/14999], Batch Training Loss: 0.0113
Batch [100/14999], Batch Training Loss: 0.0110
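
For completeness, the cap mentioned above amounts to something like this (a sketch with illustrative placeholder tensors; since the closed-form Gaussian KL cannot go negative, a persistently negative unclamped value is worth debugging rather than clamping away):

```python
import torch

# Placeholder encoder outputs and reconstruction loss, just for illustration.
mu = torch.randn(64, 16)
logvar = torch.randn(64, 16)
recon_loss = torch.tensor(0.01)
beta = 1e-3

# Closed-form Gaussian KL (>= 0 analytically).
kl_term = -0.5 * torch.mean(1.0 + logvar - mu.pow(2) - logvar.exp())

# Floor the KL contribution at 0.1, similar in spirit to the "free bits" trick.
kl_term = torch.clamp(kl_term, min=0.1)
loss = recon_loss + beta * kl_term
```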