Interesting situation. Essentially, your development set is probably not representative of the distribution you're attempting to model.
For example, suppose I were training a model to classify dogs/cats/birds. It's possible that both my train set and dev set lack bird images while my test set is primarily bird images. That would produce behavior similar to what you described.
I would suggest revisiting your data splits and considering whether they are reasonable. Is there something in the test set that never appears in dev or train? Is there a large class imbalance that could have crept in? You really need to examine your data to figure this out; comparing the label distribution of each split (see the sketch below) is a good place to start.
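Here's a minimal sketch of that check, assuming your labels are available as plain lists. The toy arrays below are made up for illustration; swap in your actual split labels.

```python
from collections import Counter

# Hypothetical label lists -- substitute your own splits.
y_train = ["dog", "cat", "dog", "dog", "cat"]
y_dev   = ["dog", "cat", "dog", "cat", "dog"]
y_test  = ["bird", "bird", "dog", "bird", "cat"]

def label_fractions(labels):
    """Return each class's share of the split, e.g. {'dog': 0.6, ...}."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: round(n / total, 3) for cls, n in counts.items()}

for name, labels in [("train", y_train), ("dev", y_dev), ("test", y_test)]:
    print(name, label_fractions(labels))
```

If the fractions diverge sharply between splits (like the bird-heavy test set in the example above), re-shuffle and re-split; a stratified split (e.g. scikit-learn's `train_test_split(..., stratify=y)`) will keep class proportions consistent across train, dev, and test.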