Data augmentation in Segmentation

Hello everyone,
I am currently doing my project on segmentation.
The problem with my dataset is that my training and validation data are slightly different from the actual raw (test) dataset.

So, I came up with a solution: since data augmentation tends to act as a regularizer and prevent overfitting, I added random noise and rotation to my training dataset.
Hence, my training set now consists of the 1000 original images plus 1000 noisy copies, while the validation dataset is unchanged.
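A minimal sketch of this doubling scheme (NumPy stand-ins, with a hypothetical `augment` helper; the key detail for segmentation is that geometric transforms like rotation must be applied to the image and mask together, while noise goes only on the image):

```python
import numpy as np

def augment(image, mask, noise_std=0.05, rng=None):
    """Return a noisy, rotated copy of (image, mask).

    The rotation is applied to both image and mask so they stay aligned;
    the additive Gaussian noise is applied to the image only.
    """
    rng = np.random.default_rng(rng)
    k = int(rng.integers(0, 4))              # random multiple of 90 degrees
    img = np.rot90(image, k).copy()
    msk = np.rot90(mask, k).copy()
    img = img + rng.normal(0.0, noise_std, img.shape)
    return np.clip(img, 0.0, 1.0), msk

# doubling the training set: originals + one augmented copy of each
images = [np.random.rand(64, 64) for _ in range(4)]    # stand-in for 1000 images
masks  = [np.zeros((64, 64), dtype=np.int64) for _ in images]
aug = [augment(im, m, rng=i) for i, (im, m) in enumerate(zip(images, masks))]
train_images = images + [a[0] for a in aug]
train_masks  = masks  + [a[1] for a in aug]
```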
However, my current model reaches around 94% accuracy with 0.87 IoU at epoch 50, whereas my original model reached 95% with 0.89 IoU at epoch 50.

Can someone explain whether my data augmentation technique has actually worsened my model’s performance, and if not, should I train for more epochs to see the benefit?

Data augmentation might increase the model’s performance on the validation dataset if it makes the training distribution more similar to the validation distribution.
E.g. if your validation data contains more Gaussian noise, adding this preprocessing step to the training might help.
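Such a preprocessing step could be written as a small callable transform that is applied to each training image (a sketch with a hypothetical `AddGaussianNoise` class operating on NumPy arrays; in a real pipeline it would be composed with the other transforms):

```python
import numpy as np

class AddGaussianNoise:
    """Hypothetical transform: adds zero-mean Gaussian noise to an image array."""
    def __init__(self, std=0.05, seed=None):
        self.std = std
        self.rng = np.random.default_rng(seed)

    def __call__(self, img):
        noisy = img + self.rng.normal(0.0, self.std, img.shape)
        return np.clip(noisy, 0.0, 1.0)   # keep pixel values in [0, 1]

transform = AddGaussianNoise(std=0.05, seed=0)
img = np.full((8, 8), 0.5)
out = transform(img)
```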

@ptrblck, Thank you for your reply!
Unfortunately, my main goal is to test on the raw dataset, so it is not possible to change its distribution
to match the validation dataset. My validation data also has a similar distribution to the training dataset.
Then my question is: should I also add Gaussian noise to my validation dataset, or do you still recommend adding noise to generate more training data in this case?
I know it would be difficult, but I want to maximize my validation data accuracy as well as accuracy on the raw dataset.
Furthermore, would it be better to train with the 1000 original images + 1000 copies with fixed noise, or with 1000 noisy copies where the noise changes every epoch, e.g. via random transforms in torchvision.transforms?
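The difference between the two options can be sketched with two toy dataset classes (NumPy stand-ins; the names are hypothetical): a "fixed" version draws the noise once at construction time and returns the same sample every epoch, while an "on-the-fly" version draws fresh noise on every access, which is what random torchvision transforms effectively do.

```python
import numpy as np

class FixedNoiseDataset:
    """Precomputed augmentation: noise is drawn once and reused every epoch."""
    def __init__(self, images, noise_std=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.samples = [im + rng.normal(0.0, noise_std, im.shape) for im in images]

    def __getitem__(self, i):
        return self.samples[i]

class OnTheFlyNoiseDataset:
    """On-the-fly augmentation: fresh noise every time a sample is fetched."""
    def __init__(self, images, noise_std=0.05, seed=0):
        self.images = images
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)

    def __getitem__(self, i):
        return self.images[i] + self.rng.normal(0.0, self.noise_std, self.images[i].shape)

imgs = [np.zeros((4, 4)) for _ in range(2)]
fixed = FixedNoiseDataset(imgs)
fly = OnTheFlyNoiseDataset(imgs)
```

With on-the-fly noise the model effectively never sees the exact same noisy sample twice, which usually gives a stronger regularization effect than a single fixed noisy copy.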

I think I misunderstood the use case and thought you might have a slight difference between the training and validation datasets, but it seems the difference is between training+validation vs. test?

I don’t think adding data augmentation to the validation set is a good idea, but also the difference of the test dataset is concerning. A common way would be to decrease the test set and use some of it in the training and validation set. However, if your test set is small or if you don’t have the targets, this won’t be easily possible.
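That re-splitting could look like the following index shuffle (a sketch with a hypothetical `split_indices` helper and made-up fractions; it assumes the test samples actually have targets, which may not hold here):

```python
import random

def split_indices(n, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle indices 0..n-1 and split them into train/val/test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(1000)
```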
In that case unfortunately I don’t know what the best approach would be to generalize to a “new” data domain.


@ptrblck Thank you for the reply!
Yeah, unfortunately it is not possible to use the test data for training as there is no target value.
So in this case, do you recommend not using data augmentation to add extra noise to my training dataset, and instead aiming to increase the validation accuracy?

That’s hard to tell, since your validation dataset is not a good proxy of the test data.
I.e. even if your training and validation performance is great, your model might just fail on the test dataset, since the distribution could be too different.
Note that “looking” into the test set is not a good idea (you should not observe the test predictions during training), as you would leak the test information into the training.

One other approach that comes to my mind would be to use a small portion of the test set to train the model in an unsupervised fashion, e.g. by reconstructing the input.
For this you could use a special “branch” for the output layers to make sure the output shape fits the input shape, and remove it later during the supervised training. This could maybe pretrain some early layers, but you would have to experiment with it.
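A structural sketch of that branch swap, using toy linear layers in place of real conv layers (all class names are hypothetical; in practice the encoder and heads would be PyTorch modules): the encoder is first paired with a reconstruction head whose output shape matches the input, then the head is replaced for the supervised task while the pretrained encoder is kept.

```python
import numpy as np

class Encoder:
    """Stand-in for the early layers that the unsupervised phase would pretrain."""
    def __init__(self, rng, in_dim=16, feat_dim=8):
        self.W = rng.normal(0.0, 0.1, (in_dim, feat_dim))

    def __call__(self, x):
        return np.maximum(x @ self.W, 0.0)   # linear + ReLU

class ReconstructionHead:
    """Temporary branch: maps features back to the input shape."""
    def __init__(self, rng, feat_dim=8, out_dim=16):
        self.W = rng.normal(0.0, 0.1, (feat_dim, out_dim))

    def __call__(self, feats):
        return feats @ self.W

class TaskHead:
    """Head attached after pretraining, replacing the reconstruction branch."""
    def __init__(self, rng, feat_dim=8, num_classes=3):
        self.W = rng.normal(0.0, 0.1, (feat_dim, num_classes))

    def __call__(self, feats):
        return feats @ self.W

rng = np.random.default_rng(0)
enc = Encoder(rng)
x = rng.normal(0.0, 1.0, (4, 16))          # unlabeled inputs

# phase 1: encoder + reconstruction branch (output shape fits input shape)
recon = ReconstructionHead(rng)(enc(x))

# phase 2: keep the (pretrained) encoder, swap in the task head
logits = TaskHead(rng)(enc(x))
```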