How to feed audio into a model?&Can a fully connected layer followed by a conv layer?

Hi, almighty guys :smiley:.
I’m a beginner in digital audio signal processing. And I met some questions during reading a paper.

  1. When I get the filter banks outputs from a 10s audio segment, should I send all of them into the model or just several frames?
  2. I’m a little confused about the model architecture in the paper. How does the dense layer connect with the encoder and decoder. Does it convert a Bx1x1x128 tensor to a Bx128xhxw tensor?