In audio processing, a 2D spectrogram is the most common input representation for networks such as CNNs. Normally, one converts a 2D spectrogram into a 4D tensor as the input to the network, i.e., (n_frame, 1, n_freqbin, n_winsize), where n_frame, n_freqbin, and n_winsize refer to the number of time frames, the number of frequency bins, and the size of the context window*, respectively, and 1 is the channel dimension, analogous to image channels.
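For concreteness, here is a minimal sketch of this framing step (assuming PyTorch, a hypothetical `frame_spectrogram` helper, and an odd n_winsize so each window is centered on its frame):

```python
import torch
import torch.nn.functional as F

def frame_spectrogram(spec, n_winsize):
    """Turn a 2D spectrogram (n_freqbin, n_frame) into a 4D tensor
    (n_frame, 1, n_freqbin, n_winsize) of centered context windows."""
    half = n_winsize // 2
    # Pad the time axis so edge frames also get a full window.
    padded = F.pad(spec, (half, half))                            # (n_freqbin, n_frame + 2*half)
    # unfold extracts one window of n_winsize frames per original frame.
    windows = padded.unfold(dimension=1, size=n_winsize, step=1)  # (n_freqbin, n_frame, n_winsize)
    return windows.permute(1, 0, 2).unsqueeze(1)                  # (n_frame, 1, n_freqbin, n_winsize)

spec = torch.randn(128, 1000)           # e.g. 128 freq bins, 1000 time frames
x = frame_spectrogram(spec, n_winsize=9)
print(x.shape)                          # torch.Size([1000, 1, 128, 9])
```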
My problem is that I would like an additional dimension for the number of clips, so my input data has shape (n_clip, n_frame, 1, n_freqbin, n_winsize). Notice that we are still batching over frames in this case.
Now, the input goes through an auto-encoder, which yields a code layer of shape (n_clip, n_frame, codesize), where codesize refers to the predefined dimension of the latent code. So my question is: how do I pass a 5D tensor (n_clip, n_frame, 1, n_freqbin, n_winsize) through a CNN auto-encoder and generate latent codes of shape (n_clip, n_frame, codesize)? My current workaround is to iterate over n_clip and stack the output of every iteration.
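For reference, the workaround looks roughly like this (a sketch assuming PyTorch; the `encoder` below is a hypothetical stand-in for the encoder half of the auto-encoder, not my actual model):

```python
import torch
import torch.nn as nn

codesize = 32
# Hypothetical stand-in encoder: maps a batch of context windows
# (n_frame, 1, n_freqbin, n_winsize) to latent codes (n_frame, codesize).
encoder = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, codesize),
)

x = torch.randn(10, 1000, 1, 128, 9)   # (n_clip, n_frame, 1, n_freqbin, n_winsize)

# Workaround: run the encoder clip by clip and stack the per-clip codes.
codes = torch.stack([encoder(x[i]) for i in range(x.shape[0])])
print(codes.shape)                      # torch.Size([10, 1000, 32])
```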
Thanks for any tips!
*: a single time frame of a spectrogram is commonly represented as a context window, which includes the prior and posterior frames (n_winsize in total), thereby converting a single time frame into a 2D representation.