5D audio data input to convolutional auto-encoder

In the context of audio processing, a 2D spectrogram is the most common input representation for a network such as a CNN. Normally, one converts a 2D spectrogram into a 4D tensor as the input to the network, i.e., (n_frame, 1, n_freqbin, n_winsize), where n_frame, n_freqbin and n_winsize refer to the number of time frames, the number of frequency bins, and the size of the context window*, respectively, and 1 is the channel dimension, analogous to image channels.

My problem is that I would like to have an additional dimension for the number of clips, so my input data has size (n_clip, n_frame, 1, n_freqbin, n_winsize). Notice that we are still batching over the frames in this case.

The input would then go through an auto-encoder, producing a code layer of size (n_clip, n_frame, codesize), where codesize is the predefined dimension of the latent code. So my question is: how do I pass a 5D tensor (n_clip, n_frame, 1, n_freqbin, n_winsize) to a CNN auto-encoder and generate latent codes of size (n_clip, n_frame, codesize)? My current workaround is to iterate over n_clip and stack the outputs from every iteration.
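
For illustration, the loop-and-stack workaround described above might look like the following sketch. The encoder and all sizes are hypothetical placeholders chosen for the example, not the actual model.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
n_clip, n_frame, n_freqbin, n_winsize, codesize = 4, 10, 64, 11, 32

# Hypothetical stand-in for the encoder half of the auto-encoder:
# maps (N, 1, n_freqbin, n_winsize) -> (N, codesize).
encoder = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),  # (N, 8, 1, 1)
    nn.Flatten(),             # (N, 8)
    nn.Linear(8, codesize),   # (N, codesize)
)

x = torch.randn(n_clip, n_frame, 1, n_freqbin, n_winsize)

# Iterating over a tensor yields slices along dim 0, so each `clip`
# has shape (n_frame, 1, n_freqbin, n_winsize) and is a valid 4D batch.
codes = torch.stack([encoder(clip) for clip in x], dim=0)

print(codes.shape)  # (n_clip, n_frame, codesize)
```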

Thanks for any tip!

*: a single frame of a spectrogram is commonly represented as a context window that includes the preceding and following frames (n_winsize), thereby converting a single time frame into a 2D representation.

There are batch+channel+3D ops like conv3d (e.g. for video), but this likely isn’t exactly what you want. Depending on the relation you want between the clips, you could also see whether reshaping them with .view into some 4D shape makes sense: treat the clip dimension either like the batch (same weights for all clips, no interaction) or like the channel (per-clip weights and full interaction).
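
A minimal sketch of the "treat clips like batch" option, assuming the encoder has no layers that mix information across the batch: merge n_clip and n_frame into one batch dimension with .view, run the encoder once, and split the result back. The encoder and sizes below are hypothetical placeholders, not the model from the question.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
n_clip, n_frame, n_freqbin, n_winsize, codesize = 4, 10, 64, 11, 32

# Hypothetical stand-in encoder: (N, 1, n_freqbin, n_winsize) -> (N, codesize).
encoder = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, codesize),
)

x = torch.randn(n_clip, n_frame, 1, n_freqbin, n_winsize)

# Fold the clip dimension into the batch dimension:
# (n_clip, n_frame, 1, F, W) -> (n_clip * n_frame, 1, F, W).
flat = x.view(n_clip * n_frame, 1, n_freqbin, n_winsize)

# One forward pass, then unfold back to (n_clip, n_frame, codesize).
codes = encoder(flat).view(n_clip, n_frame, codesize)

# Sanity check against the per-clip loop: same weights, no interaction
# between clips, so both should agree up to floating-point noise.
with torch.no_grad():
    looped = torch.stack([encoder(clip) for clip in x], dim=0)
    same = torch.allclose(codes, looped, atol=1e-4)

print(codes.shape, same)
```

.view requires a contiguous tensor; if x has been permuted or sliced beforehand, .reshape (or .contiguous().view) is the safer call. Note also that layers which compute statistics over the batch, such as BatchNorm in training mode, would pool over all clips at once after the merge, so the result would no longer match the per-clip loop exactly.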

Best regards


I’m looking into a similar problem. What did you end up doing?