The wav2vec 2.0 base 960h model never seems to return a beginning-of-sentence or end-of-sentence token (or, so far, the apostrophe or unknown tokens). Is that expected? I can't find this discussed anywhere. Why are those tokens in the decoding dictionary, and why are they options in the final emission matrix? Or am I just feeding in audio that is too easy for the model, so it never predicts eos/bos? If so, can someone provide a counter-example?
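For reference, here is roughly the kind of check I mean: count how often each vocabulary index wins the framewise argmax over the emission matrix. The vocabulary layout below is an assumption (fairseq-style dictionaries reserve the special tokens at the front), and the emission tensor is a random placeholder standing in for the model's actual logits:

```python
import torch
from collections import Counter

# Assumed, truncated vocabulary layout: fairseq-style dictionaries
# reserve <pad>/<s>/</s>/<unk> at the front, before the characters.
labels = ["<pad>", "<s>", "</s>", "<unk>", "|", "'", "E", "T", "A", "O"]

# Placeholder emission matrix (time_steps x vocab_size); in practice
# this would come from running the model on a waveform.
torch.manual_seed(0)
emissions = torch.randn(100, len(labels))

# Greedy framewise decoding: which vocabulary index wins at each frame?
predicted = emissions.argmax(dim=-1).tolist()
counts = Counter(labels[i] for i in predicted)

print("frames predicting <s>: ", counts.get("<s>", 0))
print("frames predicting </s>:", counts.get("</s>", 0))
```

With real emissions from the 960h checkpoint, the `<s>` and `</s>` counts come out as zero on everything I have tried, which is what prompted the question.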