Learning rate, decay rate, optimizer weights

I have recently been training a model using different RNNs (simple RNN, LSTM, GRU). So far the best one is a GRU with one layer and 256 hidden dimensions, giving me 87.5% OA and 76.5% MIOU. Here are my questions:
1- I have been using a learning rate of 0.1 that changes at a fixed epoch. I trained for 200 epochs and, at the 140th epoch, set learning_rate_new = learning_rate_old * decay_rate (0.1 * 0.7). I only did this once, at the 140th epoch, not repeatedly. Is there a better way to update the learning rate, using a built-in function or some other method?
2- For the optimizer weights, I used a tensor of the inverse square root of the class frequencies in my training dataset (I read on another forum that it performs well). Is there a better way of doing this?

Thank you. I just want to push the envelope one last bit and get that sweet 90% OA without overfitting.

Does OA mean overall accuracy?
If so, do you have an imbalanced dataset and how are the class frequencies?
You could also try just the inverse of the class frequencies and see how well your model performs.

For the learning rate schedule, you could try ReduceLROnPlateau.
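
As a rough sketch of the two built-in options: `MultiStepLR` reproduces the manual schedule from the question (one decay by 0.7 at epoch 140), while `ReduceLROnPlateau` decays whenever a monitored validation metric stops improving. The input size, the SGD optimizer, the `patience` value, and the `fake_val_loss` helper are assumptions made for illustration only; they do not come from the original setup.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import MultiStepLR, ReduceLROnPlateau

# Placeholder model: a one-layer GRU with 256 hidden units (input size is hypothetical).
model = nn.GRU(input_size=16, hidden_size=256, num_layers=1, batch_first=True)

# Option A: reproduce the manual schedule (a single decay by 0.7 at epoch 140).
opt_a = torch.optim.SGD(model.parameters(), lr=0.1)
step_sched = MultiStepLR(opt_a, milestones=[140], gamma=0.7)

# Option B: decay by 0.7 whenever the validation loss has not improved for 10 epochs.
opt_b = torch.optim.SGD(model.parameters(), lr=0.1)
plateau_sched = ReduceLROnPlateau(opt_b, mode="min", factor=0.7, patience=10)

def fake_val_loss(epoch):
    # Hypothetical stand-in for a real validation pass:
    # the loss improves until epoch 50, then stalls at 0.5.
    return max(1.0 - 0.01 * epoch, 0.5)

for epoch in range(200):
    # ... forward/backward passes for one training epoch would go here ...
    opt_a.step()
    opt_b.step()
    step_sched.step()                         # epoch-based schedule
    plateau_sched.step(fake_val_loss(epoch))  # metric-based schedule

lr_a = opt_a.param_groups[0]["lr"]
lr_b = opt_b.param_groups[0]["lr"]
print(f"MultiStepLR final lr: {lr_a:.4f}, ReduceLROnPlateau final lr: {lr_b:.6f}")
```

With `mode="min"` the scheduler expects a quantity that should decrease, such as validation loss. To monitor OA or MIOU directly, `mode="max"` would be the matching choice.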

Hello @ptrblck,
Yes, OA means overall accuracy, and yes, I have an imbalanced dataset: some classes have fewer examples than others. I tried just the inverse of the class frequencies, and it performed poorly compared to the inverse square root.
I’ll have a look at the scheduler; it has been on my radar lately.
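
The two weighting schemes compared above can be sketched as follows. This assumes the weights are passed as the `weight` argument of the loss (for example `nn.CrossEntropyLoss`), which the thread does not state explicitly. The class counts and the `normalize` helper are hypothetical.

```python
import torch
from torch import nn

# Hypothetical per-class sample counts for an imbalanced 4-class training set.
class_counts = torch.tensor([5000.0, 3000.0, 1000.0, 230.0])
freq = class_counts / class_counts.sum()

def normalize(w):
    # Rescale so the weights average to 1, keeping the loss scale comparable.
    return w * len(w) / w.sum()

w_inverse = normalize(1.0 / freq)               # inverse frequency
w_inv_sqrt = normalize(1.0 / torch.sqrt(freq))  # inverse square root of frequency

# Assumption: the class weights enter training through the loss function.
criterion = nn.CrossEntropyLoss(weight=w_inv_sqrt)

logits = torch.randn(8, 4)            # dummy model outputs: batch of 8, 4 classes
targets = torch.randint(0, 4, (8,))   # dummy integer labels
loss = criterion(logits, targets)

print("inverse:     ", w_inverse)
print("inverse sqrt:", w_inv_sqrt)
```

The square root compresses the spread between the weights, so the rarest class is emphasized less aggressively than with the plain inverse. That may be one reason the plain inverse performed worse here.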

Thanks for the info!
Interesting idea to use the square root.

Just to dig a bit deeper: are you calculating the OA by summing the right predictions (diag of confusion matrix) and dividing it by the sum of all samples?
This might be problematic in an imbalanced setup.

Yes, sir. OA = correct predictions (i.e. the trace of my confusion matrix) / all samples, and MIOU = mean( cm[i,i] / (sum(cm[i,:]) + sum(cm[:,i]) - cm[i,i]) ). So is my OA wrong?
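
For reference, here is a minimal NumPy sketch of the two metrics as described above, where the row and column terms are sums over the confusion matrix. The matrix values are made up, and rows are assumed to be true classes with columns as predictions.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted class.
cm = np.array([[50,  2, 3],
               [ 4, 30, 1],
               [ 5,  0, 5]])

def overall_accuracy(cm):
    # Correct predictions (trace) divided by all samples.
    return np.trace(cm) / cm.sum()

def per_class_iou(cm):
    tp = np.diag(cm)
    # Union = row sum (true count) + column sum (predicted count) - intersection.
    # Note: a class that never appears in either would give a division by zero.
    union = cm.sum(axis=1) + cm.sum(axis=0) - tp
    return tp / union

oa = overall_accuracy(cm)
iou = per_class_iou(cm)
miou = iou.mean()
print(f"OA = {oa:.3f}, per-class IoU = {np.round(iou, 3)}, MIOU = {miou:.3f}")
```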

No, not at all. Your OA calculation is right. It might still give you high accuracy values if some of your classes have very little support, even when those classes are predicted poorly, which can lead to the accuracy paradox.
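
A small made-up example of the accuracy paradox: a degenerate model that always predicts the majority class still reaches a high OA, while the mean per-class recall shows that the minority class is never detected. The numbers are purely illustrative.

```python
import numpy as np

# Hypothetical imbalanced test set: 950 samples of class 0 and 50 of class 1.
# The model predicts class 0 for every sample.
# Rows = true class, columns = predicted class.
cm = np.array([[950, 0],
               [ 50, 0]])

oa = np.trace(cm) / cm.sum()            # overall accuracy
recall = np.diag(cm) / cm.sum(axis=1)   # per-class recall
balanced_acc = recall.mean()            # mean per-class recall

print(f"OA = {oa:.2f}, per-class recall = {recall}, balanced accuracy = {balanced_acc:.2f}")
```

This is why a class-averaged metric, such as MIOU or mean per-class recall, is worth tracking alongside OA on an imbalanced dataset.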

I’ll try the scheduler, let’s hope for the best. As for the weights, should I leave them as the inverse square root? I’ll be getting a larger dataset soon-ish (the one I used has 9,230 examples; the new one will have about 250,000, but with 28 classes). That’s why I’m trying to tune the hyper-parameters and squeeze out a very good model now, so I won’t have to repeat the whole process on the large dataset.
Again, thank you for your help. Wish you all the best.