How to build a CNN-BLSTM hybrid model?


I’ve been trying to implement an end-to-end acoustic model for speech recognition, using the normalised raw signal as input. I’m trying to implement the following two architectures:

  1. CNN + DNN
  2. CNN + BLSTM

For model 1 the loss stays mostly constant, while for model 2 the loss fluctuates: sometimes increasing, sometimes decreasing, and sometimes becoming NaN. I can’t figure out how to get it to work, even though many papers have used this approach.

I’ve tried reducing the learning rate to a very low value (0.0001) and I’ve tried gradient clipping, but neither helps. I’ve also tried simpler architectures with only one or a few layers, but that doesn’t work either.
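For reference, the gradient clipping I tried looks roughly like this (a minimal PyTorch sketch; the tiny model, loss, and optimiser here are placeholders, not my real acoustic model):

```python
import torch
import torch.nn as nn

# Placeholder standing in for the real acoustic model.
model = nn.Linear(400, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)  # the very low LR I tried
criterion = nn.MSELoss()

x = torch.randn(8, 400)   # a batch of 400-sample frames
y = torch.randn(8, 10)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
# Clip the global gradient norm before the update to tame exploding gradients.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

The clipping happens between `backward()` and `step()`, so the parameter update itself uses the rescaled gradients.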

Model 1 uses ReLU activations for all layers, while model 2 uses tanh for all layers. The input is a 400-dimensional vector of raw waveform samples. There are two things I don’t understand.
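For clarity, the normalisation I apply to each raw 400-sample frame is along these lines (a NumPy sketch; zero-mean, unit-variance per frame is my assumption of what “normalised” should mean here):

```python
import numpy as np

def normalise_frame(frame, eps=1e-8):
    """Zero-mean, unit-variance normalisation of one raw-waveform frame."""
    frame = np.asarray(frame, dtype=np.float64)
    # eps guards against division by zero on silent (all-equal) frames.
    return (frame - frame.mean()) / (frame.std() + eps)

frame = np.random.randn(400) * 3.0 + 5.0   # a fake 400-sample frame
norm = normalise_frame(frame)
```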

  1. Do I need a linear layer between the CNN and the BLSTM? If so, why? And if I use a ReLU activation on the last CNN layer, do I still need the linear layer?

  2. What is a dense layer? I’ve seen this term used for a layer that comes after the LSTM unit.

Is there an available code snippet I could use as a starting point for implementing such an architecture?
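For the record, my current attempt at model 2 is roughly the following (a PyTorch sketch, not a known-good recipe; all layer sizes and the number of output classes are placeholders):

```python
import torch
import torch.nn as nn

class CNNBLSTM(nn.Module):
    """1-D CNN front end over raw 400-sample frames, followed by a BLSTM."""
    def __init__(self, n_classes=30):
        super().__init__()
        # The CNN treats each 400-sample frame as a 1-channel 1-D signal.
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        # Each frame's CNN output (64 channels x 100 steps) is flattened,
        # so the BLSTM sees one feature vector per frame in the sequence.
        self.blstm = nn.LSTM(64 * 100, 128, batch_first=True,
                             bidirectional=True)
        # "Dense" layer = fully connected (nn.Linear), projecting to classes.
        self.fc = nn.Linear(2 * 128, n_classes)

    def forward(self, x):
        # x: (batch, time, 400) raw-waveform frames
        b, t, d = x.shape
        feats = self.cnn(x.reshape(b * t, 1, d))   # (b*t, 64, 100)
        feats = feats.reshape(b, t, -1)            # (b, t, 6400)
        out, _ = self.blstm(feats)                 # (b, t, 256)
        return self.fc(out)                        # (b, t, n_classes)

model = CNNBLSTM()
logits = model(torch.randn(2, 7, 400))   # 2 utterances, 7 frames each
```

Here the CNN→BLSTM hand-off is done with a plain reshape rather than an extra linear layer, which is part of what my first question is about.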