How do I train a vision transformer model without pretrained parameters?

I’m new to transformers. I trained my ViT model on the Flowers102 dataset for 400 epochs, but no matter how I adjust my hyper-parameters, the test accuracy only reaches about 30%!
Then I tried the official PyTorch ViT-B model without any pretrained parameters, and the result was the same: whatever data augmentation options I select, the model’s test accuracy plateaus at about 30%.

Is it necessary to fine-tune from Google’s pretrained parameters to get a usable ViT model?
Sometimes I need to change the structure of the MSA or SW-MSA blocks to design a new model for my work, so I can’t reuse pretrained parameters.