Using nn.Transformer for audio spectrograms (speech recognition)?

Hi, does anyone know of any examples where people use nn.Transformer on spectrogram data?

I currently have a GRU model and want to implement a transformer model as well, but most examples are for NLP, and I get confused when trying to adapt them to my data.

My data consists of spectrogram bands, and the target labels are the words spoken in each spectrogram. The input has shape (Samples, Sequence Length, Features/bands), and an example label is label_example='blue'.
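
To make the shapes concrete, here is a minimal sketch of one possible encoder-only setup, not a tested recipe. It assumes one word label per spectrogram (so the task is plain classification), 40 bands, 100 frames, and a three-word vocabulary; the class name SpectrogramTransformer and all hyperparameters are made up for illustration. If each spectrogram contained several words, a decoder or a CTC-style loss would probably be needed instead.

```python
import math
import torch
import torch.nn as nn


class SpectrogramTransformer(nn.Module):
    """Hypothetical word classifier over spectrogram frames (illustration only)."""

    def __init__(self, n_bands, n_classes, d_model=64, n_heads=4,
                 n_layers=2, max_len=500):
        super().__init__()
        # Spectrogram frames are already continuous vectors, so a linear
        # projection takes the place of the nn.Embedding used in NLP examples.
        self.input_proj = nn.Linear(n_bands, d_model)

        # Fixed sinusoidal positional encoding (d_model is assumed to be even).
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

        # batch_first=True lets the encoder accept (batch, seq_len, d_model).
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=128,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, x):
        # x: (batch, seq_len, n_bands)
        h = self.input_proj(x) + self.pe[: x.size(1)]
        h = self.encoder(h)
        # Average over time to get one vector per clip, then classify.
        return self.classifier(h.mean(dim=1))


words = ["blue", "red", "green"]        # assumed vocabulary
model = SpectrogramTransformer(n_bands=40, n_classes=len(words))
batch = torch.randn(8, 100, 40)         # (Samples, Sequence Length, Bands)
logits = model(batch)
print(logits.shape)                     # torch.Size([8, 3])
```

The logits could then be trained with nn.CrossEntropyLoss against integer word indices. Batches with clips of different lengths would additionally need padding and a src_key_padding_mask passed to the encoder, which this sketch leaves out.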

Thanks for any help!