Keep in mind that ResNeSt is most likely pre-trained on ImageNet, a 1000-class RGB dataset, whereas your emotion detection dataset seems to have 7 classes and grayscale images, so you are missing several of the key representations the backbone was trained for. I think you wouldn’t find much of a difference if you trained from scratch.
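If you do want to keep a pre-trained backbone, the mismatch in channels and classes can be bridged roughly as follows. This is only a sketch in PyTorch: `TinyBackbone` is a hypothetical stand-in for the real network (so nothing has to be downloaded), the attribute name `fc` and the 48x48 input size are assumptions, and your actual model may name its head differently.

```python
import torch
import torch.nn as nn

NUM_EMOTIONS = 7  # class count mentioned above


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for an ImageNet-style backbone:
    3-channel input, 1000-way classification head."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fc = nn.Linear(8, 1000)

    def forward(self, x):
        return self.fc(self.features(x))


def replace_head(model, num_classes):
    # Swap the 1000-way head for a freshly initialised one.
    # Assumes the head is stored in an attribute called `fc`.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model


def gray_to_rgb(batch):
    # (N, 1, H, W) -> (N, 3, H, W) by copying the grey channel three times,
    # so the first convolution still receives the 3 channels it expects.
    return batch.repeat(1, 3, 1, 1)


model = replace_head(TinyBackbone(), NUM_EMOTIONS)
gray_batch = torch.rand(4, 1, 48, 48)  # fake grayscale images, size is arbitrary
rgb_batch = gray_to_rgb(gray_batch)
logits = model(rgb_batch)              # one score per emotion class
```

Copying the grey channel keeps the first-layer weights usable, but the colour-specific filters still see no colour information, which is why the gain from pre-training may be small here.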
Whether you use pre-trained weights or not, I’d recommend tuning other hyperparameters. Try increasing the batch size or the input resolution, or try a different optimizer (SGD rather than Adam if you see too much overfitting).
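To make the optimizer comparison easy, something like the helper below lets you switch with a single argument. It is a minimal PyTorch sketch: `make_optimizer` is a hypothetical helper, the `nn.Linear` model and random data are placeholders for your network and dataset, and the learning rate, momentum, weight decay and batch size are illustrative values, not tuned recommendations.

```python
import torch
import torch.nn as nn


def make_optimizer(model, name="sgd", lr=0.01):
    # Hypothetical helper: pick the optimizer by name so runs are easy to compare.
    if name == "sgd":
        return torch.optim.SGD(
            model.parameters(), lr=lr, momentum=0.9, weight_decay=1e-4
        )
    if name == "adam":
        return torch.optim.Adam(model.parameters(), lr=lr)
    raise ValueError(f"unknown optimizer: {name}")


model = nn.Linear(10, 7)  # placeholder model with 7 output classes
optimizer = make_optimizer(model, "sgd")

# One training step on random data, with a batch size of 32 as an example.
inputs = torch.rand(32, 10)
targets = torch.randint(0, 7, (32,))
loss = nn.functional.cross_entropy(model(inputs), targets)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Keeping everything else fixed and changing only `name` (or the batch size) between runs makes it easier to see which change actually affects the overfitting.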