Sorry to disturb you again. To be honest, I know that different transformer layers are suited to different downstream tasks: emotion recognition, ASR, PR, etc. each make use of different transformer layers (the original paper is SUPERB: Speech processing Universal PERformance Benchmark). However, they said: "PR, KS, SID, IC, ER are simple tasks that are solvable with linear downstream models. Hence, we use a frame-wise linear transformation for PR with CTC loss; mean-pooling followed by a linear transformation with cross-entropy loss for utterance-level tasks (KS, SID, IC, and ER)". I don't understand where the frame-wise linear transformation sits relative to the 12 transformer layers. Could you please give me some suggestions? Thanks
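To make my question concrete, here is a minimal PyTorch sketch of how I currently understand the setup (the class and parameter names are my own inventions, not from the paper or its code; I am assuming the upstream model exposes all 12 transformer layer outputs and that the downstream model learns a weighted sum over them before the linear head). Please correct me if this picture is wrong:

```python
import torch
import torch.nn as nn

class FrameLinearHead(nn.Module):
    """Frame-wise head for PR (my guess): learn a weighted sum over the
    12 layer outputs, then apply one shared linear layer per frame.
    The (batch, frames, classes) output would go into CTC loss."""
    def __init__(self, num_layers=12, hidden=768, num_classes=42):
        super().__init__()
        # one learnable scalar weight per transformer layer
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.proj = nn.Linear(hidden, num_classes)

    def forward(self, hidden_states):
        # hidden_states: (num_layers, batch, frames, hidden)
        w = torch.softmax(self.layer_weights, dim=0)
        mixed = (w[:, None, None, None] * hidden_states).sum(dim=0)
        return self.proj(mixed)  # (batch, frames, num_classes)

class UtteranceLinearHead(nn.Module):
    """Utterance-level head for KS/SID/IC/ER (my guess): same weighted
    sum, then mean-pool over frames, then a linear layer; the
    (batch, classes) output would go into cross-entropy loss."""
    def __init__(self, num_layers=12, hidden=768, num_classes=10):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.proj = nn.Linear(hidden, num_classes)

    def forward(self, hidden_states):
        w = torch.softmax(self.layer_weights, dim=0)
        mixed = (w[:, None, None, None] * hidden_states).sum(dim=0)
        pooled = mixed.mean(dim=1)  # mean over the frame axis
        return self.proj(pooled)  # (batch, num_classes)
```

So my understanding is that the frame-wise linear transformation sits after all 12 transformer layers (combined via the learned weights), not inside or after any single layer. Is that right?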