After training my network, I got similar cost errors on the training and development sets, but a cost error twice as large on the test set.
I used the cost error on the development set to tune the hyperparameters.
This looks like overfitting to the training and development sets.
How can I improve my network so that the test-set cost error comes down to the level of the train/dev cost error?
Interesting situation. Essentially, your development set is probably a poor representation of the distribution you're attempting to model.
For example, suppose I was training a model to classify dogs/cats/birds. It's possible that both my train set and dev set lack samples of bird images, whereas my test set is primarily bird images. That would produce behavior similar to what you described: tuning on the dev set looks fine, but the test set probes a part of the distribution the model never really saw.
I would suggest revisiting your data splits and considering whether they are reasonable. Is there something in the test set that isn't represented in dev or train? Is there a large class imbalance that could have seeped through? You really just need to examine your data to figure this out.
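One quick sanity check is to compare the class proportions across the three splits. A minimal sketch (the label lists and class names below are placeholders standing in for your actual split labels):

```python
from collections import Counter

def class_proportions(labels):
    """Return each class's share of a label list, rounded to 2 decimals."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: round(n / total, 2) for cls, n in counts.items()}

# Hypothetical labels illustrating a skewed split: birds are rare in
# train/dev but dominate the test set.
train_labels = ["dog"] * 450 + ["cat"] * 430 + ["bird"] * 20
dev_labels   = ["dog"] * 48  + ["cat"] * 47  + ["bird"] * 5
test_labels  = ["dog"] * 20  + ["cat"] * 20  + ["bird"] * 60

for name, labels in [("train", train_labels),
                     ("dev", dev_labels),
                     ("test", test_labels)]:
    print(name, class_proportions(labels))
```

If the proportions diverge badly, re-splitting with stratification (for example, scikit-learn's `train_test_split` with its `stratify` argument) keeps class shares aligned across train/dev/test.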
Thanks a lot for your reply.