I'm having trouble feeding MFCC features into a Conv2d layer, which expects input of shape (N, C, H, W). My MFCC features have shape (N, time_frames, mel). Should I just add a channel dimension with audio_feature.unsqueeze(1) to get
(N, 1, time_frames, mel)? Is that correct?
How should I do it? Thanks
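For what it's worth, here is a minimal sketch of the unsqueeze(1) idea (the sizes 8, 200, and 40 are just placeholder values for batch, time frames, and MFCC coefficients):

```python
import torch
import torch.nn as nn

# Placeholder MFCC batch: 8 clips, 200 time frames, 40 coefficients
audio_feature = torch.randn(8, 200, 40)   # (N, time_frames, mel)

# Insert a channel dimension so the tensor matches Conv2d's (N, C, H, W) layout
x = audio_feature.unsqueeze(1)            # (N, 1, time_frames, mel)

# A Conv2d with in_channels=1 then accepts it directly;
# kernel_size=3 with padding=1 preserves the spatial dimensions
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
out = conv(x)
print(out.shape)  # torch.Size([8, 16, 200, 40])
```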
By the way, can I feed MFCC features into VGG/ResNet?
Technically this is possible.
In fact, I have seen papers that use MFCC spectrograms as direct input to CNNs.
Disclaimer: I am not an expert in speech processing and do not know the literature well.
Thank you for the reply. Do you know how to feed MFCC features directly into ResNet? I mean, if the shape of the MFCC feature is [batch, time_frame, mel], how do I do it?
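One common approach (a sketch, not something stated in this thread) is to add the channel dimension as above and then either repeat it to three channels so the standard ResNet stem accepts it, or replace the first conv layer with a single-channel one:

```python
import torch

# Placeholder MFCC batch: [batch, time_frame, mel]
audio_feature = torch.randn(8, 200, 40)

# Option 1: repeat the single channel to 3 so an unmodified
# (e.g. ImageNet-pretrained) ResNet stem accepts it
x3 = audio_feature.unsqueeze(1).repeat(1, 3, 1, 1)   # (N, 3, T, mel)
# model = torchvision.models.resnet18(weights=None)  # hypothetical usage
# out = model(x3)

# Option 2: keep 1 channel and swap the stem conv, e.g.
# model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

print(x3.shape)  # torch.Size([8, 3, 200, 40])
```

Option 1 is convenient with pretrained weights; Option 2 avoids the redundant channel copies if you train from scratch.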