Non-speech audio embedding

I have a large dataset of machine sound waveforms (values normalized to the range -1 to 1) that I want to use in a transformer model. This means I need to convert the audio into a vector representation, but I do not know how to do this. I assume that models trained on speech, such as wav2vec, or on music are not the way to go.
Any tips or pointers are welcome!

[edit]: This might be a stupid question, but can the audio signal be treated as a vector in and of itself? In other words, could I skip the embedding step and feed the waveform directly into the transformer?
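
To make the idea in the edit concrete, here is a minimal sketch of what I have in mind, using only NumPy. It cuts the raw waveform into short overlapping patches and maps each patch to a fixed-size vector, so the transformer would see one token per patch instead of one token per sample. The function names (`frame_waveform`, `project_patches`), the patch and hop lengths, and the random projection matrix are all my own assumptions for illustration. In a real model the projection would presumably be a learned layer, and I do not know whether this works well for machine sounds.

```python
import numpy as np


def frame_waveform(waveform, patch_len=400, hop_len=200):
    """Cut a 1-D waveform into overlapping patches of shape (n_patches, patch_len)."""
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.ndim != 1:
        raise ValueError("expected a 1-D waveform")
    if waveform.shape[0] < patch_len:
        raise ValueError("waveform is shorter than one patch")
    # Trailing samples that do not fill a whole patch are dropped.
    n_patches = 1 + (waveform.shape[0] - patch_len) // hop_len
    starts = hop_len * np.arange(n_patches)
    # Row i of idx holds the sample indices belonging to patch i.
    idx = starts[:, None] + np.arange(patch_len)[None, :]
    return waveform[idx]


def project_patches(patches, d_model=64, seed=0):
    """Map each patch to a d_model-dimensional token.

    Assumption: the fixed random matrix below only stands in for a
    projection that would be learned together with the transformer.
    """
    rng = np.random.default_rng(seed)
    patch_len = patches.shape[1]
    weights = rng.standard_normal((patch_len, d_model)).astype(np.float32)
    weights /= np.sqrt(patch_len)  # keep the token scale comparable to the input scale
    return patches @ weights


# One second of a synthetic signal in [-1, 1], assuming a 16 kHz sample rate.
wave = np.sin(np.linspace(0.0, 100.0, 16000)).astype(np.float32)
tokens = project_patches(frame_waveform(wave))  # shape: (n_patches, d_model)
```

The resulting `tokens` array is the sequence that would go into the transformer, with positional information still to be added on top. Patching keeps the sequence length manageable, since feeding every individual sample as its own token would give 16000 tokens per second of audio in this example.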

Thanks in advance