GPT-2 on DNA data

Dear community,

I’m trying to train a GPT-2 transformer from scratch (without any pretrained model) on DNA sequences, so that it can generate DNA sequences that extend shorter ones. I’m a bit stuck: I couldn’t find any repo that applies this kind of decoder-only transformer to DNA, so I have no clues about the best tokenization or the other technical choices involved.
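To make the tokenization question concrete, here is a minimal sketch of one possible option, fixed-length k-mer tokenization. Everything in it is an assumption for illustration rather than an established recipe: the function names `build_kmer_vocab`, `kmer_tokenize`, and `encode`, the `<pad>`/`<unk>` special tokens, and the choice of k are all hypothetical.

```python
from itertools import product


def build_kmer_vocab(k, alphabet="ACGT"):
    """Map every possible k-mer over the alphabet to an integer id.

    The two special tokens are an assumption for illustration.
    """
    specials = ["<pad>", "<unk>"]
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    return {tok: i for i, tok in enumerate(specials + kmers)}


def kmer_tokenize(seq, k, stride=None):
    """Split a sequence into k-mers.

    stride=k (the default) gives non-overlapping k-mers;
    stride=1 gives fully overlapping ones.
    Trailing bases that do not fill a whole k-mer are dropped.
    """
    stride = k if stride is None else stride
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def encode(seq, vocab, k, stride=None):
    """Convert a sequence to token ids; unknown k-mers (e.g. with N) map to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in kmer_tokenize(seq, k, stride)]


vocab = build_kmer_vocab(3)          # 4**3 k-mers plus 2 special tokens
tokens = kmer_tokenize("ACGTAC", 3)  # non-overlapping 3-mers
ids = encode("ACGTAC", vocab, 3)
```

The resulting ids could then be fed to a decoder-only model whose vocabulary size is `len(vocab)`. Whether k-mers, single nucleotides, or a learned subword vocabulary such as BPE works best is exactly what I am unsure about.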

Does anyone have any references, or thoughts on whether this is a good idea?

Thank you in advance!