I am trying to implement a TDNN-based speaker verification system (X-Vectors: Robust DNN Embeddings for Speaker Recognition).
I found several TDNN implementations on GitHub (jonasvdd, cvqluu, SiddGururani) and am using the one by jonasvdd. I verified that this implementation behaves as a TDNN should. Each TDNN layer consists of a series of sub-layers: Conv1D, ReLU, Dropout, and BatchNorm.
TDNN ( Conv1D --> ReLU --> Dropout --> BatchNorm )
The complete model is as follows:
Model ( TDNN1 --> TDNN2 --> TDNN3 --> TDNN4 --> TDNN5 --> StatsPool --> Linear --> ReLU --> Linear )
Here, the StatsPool layer calculates the mean and standard deviation of the output of the TDNN5 layer.
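To make the pooling step concrete, here is a minimal, framework-independent sketch in plain Python. It is not the code from the repository I am using. It assumes the statistics are taken over the time axis, uses the biased (population) variance, and adds a small variance floor `eps` before the square root; all three are assumptions of mine. The floor matters because the derivative of the square root is unbounded at zero, which is the case for any channel that is constant over time.

```python
import math

def stats_pool(frames, eps=1e-5):
    """Pool a (time, channels) nested list into [means..., stds...].

    eps is an assumed variance floor; it keeps the square root away
    from zero, where its derivative is unbounded.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    means, stds = [], []
    for c in range(num_channels):
        column = [frame[c] for frame in frames]
        mean = sum(column) / num_frames
        # Biased (population) variance over the time axis
        var = sum((x - mean) ** 2 for x in column) / num_frames
        means.append(mean)
        stds.append(math.sqrt(var + eps))
    return means + stds

# Two channels over three frames; the second channel is constant,
# so its variance is exactly zero before the floor is added.
frames = [[1.0, 2.0], [2.0, 2.0], [3.0, 2.0]]
pooled = stats_pool(frames)
```

The output has twice as many entries as there are channels, which is why the first Linear layer after StatsPool takes an input twice the width of TDNN5's output.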
The input to the network has shape (64, 480, 60), where 64 is the batch size, 480 is the sequence length, and 60 is the feature size. Since this is a multi-class classification task, I am using cross-entropy loss.
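For reference, this is what the loss computes for a single sample, written as a plain-Python sketch rather than the framework call I actually use. The three-class logits are made-up values for illustration. Subtracting the maximum logit before exponentiating is the usual log-sum-exp trick; it keeps the result finite even for very large logits.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one sample: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Hypothetical logits for a 3-class problem
loss_small = cross_entropy([2.0, 1.0, 0.1], target=0)
# Extreme logits still give a finite loss thanks to the max subtraction
loss_large = cross_entropy([1000.0, 0.0, -1000.0], target=0)
```

A loss written this way cannot overflow on large finite logits, but it still returns NaN if any logit fed into it is already NaN.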
The problem is that when I use BatchNorm, Dropout, or both in the TDNN layers, I get NaN values after a few iterations (only 8 to 10 batches). When I remove both layers from the network, it trains without any problem.
As suggested in other threads on this forum, I checked that my input does not contain NaN. I also tried gradient clipping, but that did not help either.
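The two checks can be sketched in plain Python as follows. This is an illustration under my own assumptions, not the exact code I ran: the helper names are hypothetical, and the clipping shown is by global L2 norm. Two things are worth noting: the finiteness check covers infinities as well as NaN, and clipping by norm cannot repair a gradient that already contains NaN or inf, because the norm itself is then non-finite.

```python
import math

def all_finite(values):
    """Return True if no value is NaN or +/-inf."""
    return all(math.isfinite(v) for v in values)

def clip_by_global_norm(grads, max_norm):
    """Scale grads so that their L2 norm is at most max_norm.

    Returns the (possibly scaled) gradients and the original norm.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if not math.isfinite(norm):
        # Scaling cannot repair NaN/inf entries; return them unchanged
        return grads, norm
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads], norm

# A finite gradient is scaled down to the requested norm
clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)

# A gradient containing NaN passes through clipping still containing NaN
bad, bad_norm = clip_by_global_norm([1.0, float("nan")], max_norm=1.0)
```

Running `all_finite` on the output of each layer in turn, rather than only on the input, would show which layer first produces a non-finite value.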
I have seen some questions about the effect of changing the order of layers. Is the order of layers in my model improper?
Please suggest a solution.