How to use the PyTorch Transformer for multi-class classification

Hi, I intend to build a multi-class classifier using a transformer.
The input data is an audio spectrogram, for example MFCCs, with shape (batch, steps, features).
The labels are integer class indices: 0, 1, 2, 3.

Can this be achieved using torch.nn.Transformer?
Or should I write the transformer myself, especially the decoder?
I ask because the PyTorch Transformer example suggests that ‘tgt’ is also a 3D tensor, which I believe is meant for NLP.
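
For concreteness, here is a minimal sketch of one possible encoder-only setup that avoids ‘tgt’ entirely: the spectrogram frames go through torch.nn.TransformerEncoder, the outputs are averaged over time, and a linear layer produces one logit per class. The class name SpectrogramClassifier, the sizes (40 features, 100 steps, d_model=64) and the mean pooling are assumptions chosen for illustration, not a recommended configuration. Positional encoding is left out for brevity, and batch_first=True assumes a reasonably recent PyTorch release.

```python
import torch
import torch.nn as nn


class SpectrogramClassifier(nn.Module):
    """Hypothetical encoder-only classifier for (batch, steps, features) input."""

    def __init__(self, n_features=40, d_model=64, n_heads=4, n_layers=2, n_classes=4):
        super().__init__()
        # Project each frame (e.g. one MFCC vector) to the model dimension.
        self.input_proj = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=128,
            batch_first=True,  # keep the (batch, steps, features) layout
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # One logit per class; no decoder and no 'tgt' sequence are involved.
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):
        h = self.input_proj(x)   # (batch, steps, d_model)
        h = self.encoder(h)      # (batch, steps, d_model)
        pooled = h.mean(dim=1)   # average over time: (batch, d_model)
        return self.head(pooled) # (batch, n_classes)


model = SpectrogramClassifier()
x = torch.randn(8, 100, 40)      # random stand-in for a batch of spectrograms
y = torch.randint(0, 4, (8,))    # integer labels in {0, 1, 2, 3}

logits = model(x)
loss = nn.CrossEntropyLoss()(logits, y)  # expects raw logits and class indices
```

In this sketch the labels stay a 1D tensor of class indices, so nothing shaped like a 3D target is needed.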