I am forecasting traffic flow using Transformers (univariate: traffic flow is the only input and output), and I have created multiple models which share the same architecture but use different hyper-parameters.
During training, each model behaved differently. Also, when I load each model's parameters (.pth file) to instantiate the model and make predictions, I can see that the weights are different for each model.
However, when I make predictions on different input data, including inputs of different lengths, the results are exactly the same for every model, point by point.
One possibly important detail: I trained with batch size = 12 (one hour of traffic data), but at prediction time I could only use a single batch (batch size = size of the input data).
Could anyone help me with any tips? I am totally lost about the cause of this issue and I have not found any similar case.
Can you share a bit of the code, for example where you load two different models, you show that they have different parameters, and then show that they make identical predictions?
It sounds like maybe there’s a bug somewhere because there ought to be some (small) variability between runs even when training the same model with the same hyperparameters, let alone different ones.
Is the way you’re evaluating the model predictions categorical in some way, such that small differences in model output are bucketed under the same prediction?
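To make that last point concrete, here is a toy sketch of the check I mean (the two linear models are hypothetical stand-ins; in your case you would load your two trained Transformer checkpoints instead): two models with slightly different weights can produce raw outputs that differ, yet identical predictions once you round to whole cars.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for two trained models whose weights differ slightly.
model_a = nn.Linear(1, 1)
model_b = nn.Linear(1, 1)
with torch.no_grad():
    model_a.weight.fill_(0.30); model_a.bias.zero_()
    model_b.weight.fill_(0.31); model_b.bias.zero_()

x = torch.ones(5, 1)
model_a.eval(); model_b.eval()
with torch.no_grad():
    raw_a, raw_b = model_a(x), model_b(x)

# The raw float outputs differ...
print(torch.equal(raw_a, raw_b))                  # → False
# ...but after rounding, the predictions look identical.
print(torch.equal(raw_a.round(), raw_b.round()))  # → True
```

So it would be worth comparing the raw float outputs of your models, before any rounding, point by point.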
Thank you very much for your answer!
It depends on the input test data, but in general the prediction output is around 900 data points: float values which are rounded to integers (since the prediction is a number of cars). Besides changing the hyperparameters, I have also tried adding more encoder layers, and the predictions are still exactly the same - not only the error metrics (MAPE, RMSE, MAE, R2) but indeed point by point: all of them are equal across the different models for the same input test data.
After training I am saving the models using the following command: torch.save(model.state_dict(), '<model_name>.pth')
When predicting, I redefine the model classes and load the weights pth file using the following commands:
i) instantiate the model after defining it again: model = <model_class_name>().to(device)
ii) model.load_state_dict(torch.load(<model_name>, map_location=device))
I have verified that the models' weights and biases are different by printing the parameter .pth files with the following command (around 170 weights were different): print(torch.load('<model_name>.pth', map_location=torch.device('cpu')))
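For completeness, here is a more systematic version of that check, counting the differing tensors instead of eyeballing prints (the two toy linear models and the in-memory buffers are stand-ins; in my case these are the Transformer classes and the real .pth files on disk):

```python
import io
import torch
import torch.nn as nn

# Toy stand-ins for two separately trained models (different seeds so that
# their parameters differ, as with two real training runs).
torch.manual_seed(0)
model_a = nn.Linear(4, 2)
torch.manual_seed(1)
model_b = nn.Linear(4, 2)

# Round-trip through torch.save/torch.load, as with a .pth checkpoint file.
buf_a, buf_b = io.BytesIO(), io.BytesIO()
torch.save(model_a.state_dict(), buf_a)
torch.save(model_b.state_dict(), buf_b)
buf_a.seek(0); buf_b.seek(0)
sd_a = torch.load(buf_a, map_location='cpu')
sd_b = torch.load(buf_b, map_location='cpu')

# Count parameter tensors that actually differ between the two checkpoints.
n_diff = sum(not torch.equal(sd_a[k], sd_b[k]) for k in sd_a)
print(f'{n_diff} of {len(sd_a)} tensors differ')  # → 2 of 2 tensors differ
```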
Thank you very much!
I guess there are two distinct hypotheses, and I think you can debug them by following the same steps. The hypotheses are:
- The results are not real, caused by a subtle bug.
- The results are real: your data is such that all the different models you’ve trained converge on very similar solutions.
I would debug this by keeping one of your current models as a baseline, and then training further models with progressively fewer parameters. If your results are real, the performance of the reduced model should eventually degrade as you strip away layers / shrink the layers (sooner or later it should give different and worse results than the baseline). If your results are caused by a bug, the reduced model will keep matching the baseline forever, even once it's down to a really tiny version - but at that point it should become easy to debug, since you can follow exactly what it's doing.
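As a sketch of the comparison step in that procedure (the MLPs and sizes here are hypothetical stand-ins; in your case each model would be a trained Transformer checkpoint, loaded as you described):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins: a "baseline" model and progressively smaller variants.
def make_model(hidden):
    return nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

baseline = make_model(64).eval()
x = torch.linspace(0, 1, 900).unsqueeze(1)  # ~900 test points, as in your data

with torch.no_grad():
    ref = baseline(x)

for hidden in (32, 8, 2):
    reduced = make_model(hidden).eval()
    with torch.no_grad():
        out = reduced(x)
    # If even a tiny model still matches the baseline point by point,
    # suspect a bug; if the outputs diverge as capacity shrinks, the
    # original result may be real.
    print(f'hidden={hidden}: identical? {torch.equal(out, ref)}')
```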
Sorry I can’t offer more specific advice, this does indeed sound peculiar, but you ought to be able to figure out which of those two possibilities it is.