Audio pronunciation analyzer

I have a dataset of audio files and I want to create a model that rates user-provided audio as good, moderate, or bad with respect to pronunciation. Which approach can I use to build a model of this kind?

For sequential data, one of the classical approaches would be a recurrent neural network or an LSTM (though I have used 1D CNNs and they also worked well). A newer approach would be a transformer architecture for this kind of task, which in my experience has shown good performance and generalized better to out-of-distribution samples (see for instance [1]).

[1] Exploring the Limits of Out-of-Distribution Detection
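To make the 1D CNN option concrete, here is a minimal sketch in PyTorch. All names and shapes are illustrative assumptions: it assumes you have already extracted frame-level features (e.g. 13 MFCC coefficients per frame) from each clip, and it maps a feature sequence to three pronunciation classes (good / moderate / bad).

```python
import torch
import torch.nn as nn

class PronunciationCNN(nn.Module):
    """Hypothetical 1D CNN over a sequence of per-frame audio features."""

    def __init__(self, n_features=13, n_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            # Convolve along the time axis; channels = feature dimensions.
            nn.Conv1d(n_features, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            # Collapse the (variable-length) time axis to a fixed-size vector,
            # so clips of different durations produce the same embedding size.
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):
        # x: (batch, n_features, time)
        h = self.net(x).squeeze(-1)  # (batch, 64)
        return self.fc(h)            # unnormalized scores for the 3 classes

model = PronunciationCNN()
dummy = torch.randn(4, 13, 200)  # 4 clips, 13 features, 200 frames each
logits = model(dummy)
print(logits.shape)
```

You would train this with `nn.CrossEntropyLoss` against integer labels (0 = bad, 1 = moderate, 2 = good, or whatever encoding you choose); the adaptive pooling is what lets the same network handle clips of different lengths.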

Will look into it. Thanks!