Hi everyone, I am a beginner in DL and PyTorch, and I would like to ask for help in a specific situation. I built a model and trained it for 200 epochs using a specific dataset. Now, I have a different dataset that supplements the previous one. Should I load the pretrained weights and keep training in the new dataset? Should I retrain the model merging both datasets? Should I train two different models and ensemble them? What is the easiest strategy? Thank you!
If your new data is basically the same as your original dataset – that is, it’s just more of the same, but you didn’t have it when you first started training – you should keep the pre-trained weights and then continue training with the combined dataset.
If your new dataset is for a similar, but distinct, problem, and you want
to build a model for the new problem, keep the pre-trained weights to
get a head start, but then train only with the new dataset.
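Either way, the mechanics are the same: load the saved `state_dict` and keep training. Here is a minimal sketch, assuming you saved your weights with `torch.save(model.state_dict(), ...)`; the toy `nn.Linear` model and the random tensors are stand-ins for your own architecture and data:

```python
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Toy stand-in for your model; substitute your own architecture.
model = nn.Linear(4, 2)

# Load the weights from the first round of training
# (assumes torch.save(model.state_dict(), "model.pt") was used earlier):
# model.load_state_dict(torch.load("model.pt"))

# Combine the old and new datasets so continued training sees both.
old_data = TensorDataset(torch.randn(8, 4), torch.randn(8, 2))
new_data = TensorDataset(torch.randn(8, 4), torch.randn(8, 2))
loader = DataLoader(ConcatDataset([old_data, new_data]),
                    batch_size=4, shuffle=True)

# A smaller learning rate is typical when continuing from pre-trained weights.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

for x, y in loader:  # one pass shown; loop over epochs in practice
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```

For the "similar but distinct problem" case, you would simply build `loader` from the new dataset alone.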
(I can’t think of a non-contrived version of this use case where I would
train two separate models and combine them.)
My second dataset was developed for a similar problem. However, its normalization parameters are a bit different due to the quality of the cameras and the weather. My expectation is to enhance the performance of the model on the first dataset by supplementing it with the second dataset. I am not sure if this is possible or right. I did a few tests keeping the weights as you suggested, and the performance on the first dataset got worse. So, I thought of merging both datasets and training on everything together.
If the difference is primarily one of normalization, you might be able
to normalize one or the other of the two datasets to make them match
better. But see below.
The important question is what does the real-world data to which you’ll
be applying your model for inference look like?
You have two (potentially competing) goals: You want to have the data
you train on be as representative as possible of the data on which you
will be running inference. But you also want to have enough training
data to be able to do a good job of training your model.
If you have more than enough training data, stick to the data that most
closely matches your inference data. But if you’re short on training data
it can be helpful to add more data, even if it’s not a perfect match for
your real-world data.
Which dataset best matches the normalization of your real-world data?
If both datasets otherwise match reasonably well, I would adjust the
normalization of the dataset whose normalization matches less well.
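As a sketch of what that adjustment could look like (the random tensors here are hypothetical stand-ins for your two datasets' images, shaped `(N, C, H, W)`): compute per-channel statistics for each dataset, standardize the worse-matching one with its own statistics, then rescale it to the other's:

```python
import torch

def channel_stats(images):
    """Per-channel mean/std over a batch of images shaped (N, C, H, W)."""
    mean = images.mean(dim=(0, 2, 3))
    std = images.std(dim=(0, 2, 3))
    return mean, std

# Hypothetical stand-ins for the two datasets' images.
imgs_a = torch.rand(16, 3, 8, 8)
imgs_b = torch.rand(16, 3, 8, 8) * 0.5 + 0.2  # dimmer, lower-contrast "camera"

mean_a, std_a = channel_stats(imgs_a)
mean_b, std_b = channel_stats(imgs_b)

# Re-map dataset B so its channel statistics match dataset A:
# standardize with B's own stats, then rescale to A's.
imgs_b_matched = (imgs_b - mean_b.view(1, 3, 1, 1)) / std_b.view(1, 3, 1, 1)
imgs_b_matched = imgs_b_matched * std_a.view(1, 3, 1, 1) + mean_a.view(1, 3, 1, 1)
```

This only matches first- and second-order channel statistics; it won't undo other differences between the cameras, so treat it as a first pass, not a guarantee.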
As a final comment, let’s say that your first dataset is really rather
different than your real-world data (and your second dataset matches
the real-world data reasonably well), but you spent a lot of time training
with it. Even if it would not make sense to use a model trained on the
first dataset on your real-world data, you can still use it as a pre-trained
model that you “fine tune” by further training it on the second dataset,
potentially saving significant time in comparison to training from scratch
on the second dataset.
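One common way to do that fine tuning is to freeze the earlier layers learned on the first dataset and train only the later layers on the second. A minimal sketch, with a toy `nn.Sequential` standing in for your pre-trained network:

```python
import torch
import torch.nn as nn

# Toy two-stage model standing in for a pre-trained network.
model = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(),  # "backbone" trained on the first dataset
    nn.Linear(8, 2),             # "head" to adapt to the second dataset
)

# Freeze the backbone; only the head will be updated.
for p in model[0].parameters():
    p.requires_grad = False

# Optimize only the trainable parameters, with a small learning rate.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

x, y = torch.randn(8, 4), torch.randn(8, 2)  # stand-in batch from dataset 2
frozen_before = model[0].weight.clone()

loss = nn.MSELoss()(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Whether to freeze anything at all, and how much, depends on how similar the two datasets are; with plenty of second-dataset data you may prefer to fine-tune all layers with a reduced learning rate instead.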
Remember, your goal is not narrowly to have the model perform
well on either the first or second dataset, but to have it perform well
on your real-world data. So you want it to perform well on whatever
test / training data best represents your real-world data.
Hi @KFrank, thanks for the explanation. If I understand correctly, if I train a model using dataset 1 and test it on dataset 2, which is a similar dataset, the model should generalize to the new dataset, correct?
When I test the model on the second dataset, the results are quite wrong. Is that an indication that the dataset is wrong, or that the model lacks training?
Another question I have: I am using transforms.Normalize to normalize the first dataset during training. Do I need to calculate the mean and std of the second dataset to retrain the model?
If I understand it correctly, you have two datasets taken in different conditions, and you want your model to apply well to both of them?
If you believe that the model should apply well to both (the subjects are similar, for instance) but the conditions they were taken in are interfering, I can see two main options:
- If you can know in advance if any unseen data will come from a good source or bad source (picture metadata etc., image statistics), then you can preprocess it to resemble Dataset 1.
- If you can’t predict this or run such preprocessing, then you can combine the two datasets and normalize them, then train on the combined dataset. This will, however, probably have worse accuracy than a model trained only on Dataset 1 or only on Dataset 2 when tested on those.
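For the second option, you would compute a single set of per-channel statistics over the merged data and feed those into `transforms.Normalize`. A sketch, where the random tensors are hypothetical stand-ins for your two datasets' images:

```python
import torch
from torch.utils.data import ConcatDataset, TensorDataset

# Hypothetical image tensors for the two datasets, shaped (N, C, H, W).
imgs_1 = torch.rand(10, 3, 8, 8)
imgs_2 = torch.rand(6, 3, 8, 8)

combined = ConcatDataset([
    TensorDataset(imgs_1),
    TensorDataset(imgs_2),
])

# Stack every image once to compute joint per-channel statistics;
# for large datasets, accumulate running sums instead of stacking.
all_imgs = torch.stack([img for (img,) in combined])
mean = all_imgs.mean(dim=(0, 2, 3))
std = all_imgs.std(dim=(0, 2, 3))

# These values would then be passed to
# transforms.Normalize(mean.tolist(), std.tolist()) in your pipeline.
```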
If the second dataset is sufficiently similar to the first, then the hope is that a model trained on the first dataset will generalise. If the results are wrong, it could mean any of the following:
- The model has too much training and has overfit your first dataset (it doesn’t generalise).
- The second dataset isn’t similar enough to the first (the model would generalise to unseen data, but only if it’s actually similar).
- The second dataset hasn’t been correctly processed (the model would generalise and the dataset is similar, but the format is off).
You could combine the two datasets into one and train a single model on them, but at that point your test set has become part of your training and validation set and you need new test data–but if it represents a large category you want to generalise to with or without additional image preprocessing, it might be worth it rather than having a narrowly specialised model.
Hi @cpeters, interesting points! My datasets vary in image quality (brightness, colors, etc.) and in scene elements (cars, people, etc.). I am working on a model to estimate a robot’s pose. It is known that the first dataset lacks image sequences where the robot is at high speeds. Thus, I am trying to create another dataset to supplement the first one, but it was collected in a different scenario and under different weather conditions (image quality). So, I am not sure if this strategy works. Maybe I can merge both datasets and test the trained model on a separate dataset, just as you suggested. Do you think it is a reasonable alternative?
It certainly seems worth a try; the obvious problem being that you might have it learning the noise as a differentiating measure, so if you got poor-quality low-speed poses or high-quality fast poses it might have trouble learning them.
It sounds like you might want to try processing the images in both datasets first to equalise features like brightness etc.?
I think it will take some time, and I am unsure whether it will enhance the final results. I will probably merge them and calculate the mean and std to use the