I think your general approach sounds right and you should indeed use the target values directly (in the range [0, nb_classes-1]).
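As a quick sanity check on the targets, something like the following could be used. It is only a minimal sketch in plain Python: `check_targets`, the example labels, and `nb_classes = 3` are made up for illustration and are not taken from your code.

```python
def check_targets(targets, nb_classes):
    """Raise ValueError if any label is not an integer in [0, nb_classes - 1]."""
    bad = [t for t in targets if int(t) != t or not 0 <= t < nb_classes]
    if bad:
        raise ValueError(f"labels outside [0, {nb_classes - 1}]: {sorted(set(bad))}")
    return True

# Hypothetical labels for a 3-class problem
nb_classes = 3
targets = [0, 2, 1, 1, 0, 2]
targets_ok = check_targets(targets, nb_classes)
print("targets OK:", targets_ok)
```

If the labels are stored in a tensor or array, converting them to a plain list first (e.g. via `.tolist()`) should be enough for this check.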
The initial accuracy seems to already start at ~94% and increases a bit afterwards.
If so, I would guess that your dataset is heavily imbalanced and that the model is predicting only the majority class.
Could you check the number of samples for each class and see whether one class makes up ~94% of all samples?
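Here is a small sketch of how the class counts could be checked. It assumes the labels are available as a sequence of integers; the `targets` list below is invented to mimic a ~94% majority class and does not reflect your actual data, and `class_distribution` is just a helper name chosen for this example.

```python
from collections import Counter

def class_distribution(targets):
    """Return {class_index: (count, fraction)} for a sequence of integer labels."""
    counts = Counter(int(t) for t in targets)
    total = sum(counts.values())
    return {cls: (n, n / total) for cls, n in sorted(counts.items())}

# Hypothetical labels: 94 samples of class 0, 4 of class 1, 2 of class 2
targets = [0] * 94 + [1] * 4 + [2] * 2
dist = class_distribution(targets)
for cls, (n, frac) in dist.items():
    print(f"class {cls}: {n} samples ({frac:.1%})")

# Accuracy a model would get by always predicting the most frequent class
majority_class = max(dist, key=lambda cls: dist[cls][0])
baseline_acc = dist[majority_class][1]
print(f"majority class {majority_class}, baseline accuracy {baseline_acc:.1%}")
```

If the baseline accuracy printed at the end is close to the accuracy you see from the first epoch on, the model is most likely just predicting the majority class.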