I’m looking to train a RoBERTa model on protein sequences, which is in many ways similar to normal nlp training, but in others quite different.
In the language of proteins, I have 20 characters instead of the normal 26 characters used in english (it is 26 right? :D), so that is rather similar. The big difference is that you don’t really combine the characters in proteins to actual words, but rather just keep each character as a distinct token or class.
Hence essentially my input to the transformer model could just be a list of numbers ranging from 0-19. However that would mean that my input would only have 1 feature if I did that, and I’m not sure a transformer could work with that?
I’m thinking of just doing a onehot encoding of these characters, which would give me 20 input features. However this is of course still very low in comparison to how normal transformers are trained, where d_model is somewhere in the range of 128-512 if I understand correctly.
Does anyone have any experience with anything like this? any good advice for how it is most likely to work?