I’ve been working on encoding spatio-temporal features using 3D CNNs. I’m passing a sequence of images of shape (16, 22, 3, 28, 28) ----> (batch, sequence_length, channels, height, width) to an encoder-decoder model, trying to produce a compressed vector that represents the image sequence. But the network isn’t learning/converging: the loss doesn’t decrease, and sometimes it even becomes NaN.
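For reference, here is a minimal sketch of the tensor layout (not my exact code). Note that PyTorch’s `nn.Conv3d` expects (batch, channels, depth, height, width), so a (batch, sequence, channels, H, W) tensor has to be permuted before the first conv; the channel count 32 below is just a placeholder:

```python
import torch

# nn.Conv3d expects (N, C, D, H, W); the sequence axis plays the role of depth.
x = torch.randn(16, 22, 3, 28, 28)           # (batch, seq_len, channels, H, W)
x = x.permute(0, 2, 1, 3, 4).contiguous()    # -> (16, 3, 22, 28, 28)

conv = torch.nn.Conv3d(in_channels=3, out_channels=32,
                       kernel_size=3, padding=1)
out = conv(x)
print(out.shape)  # torch.Size([16, 32, 22, 28, 28])
```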
About the data I’m using:
I’m using synthetic data: 2000 generated image sequences, i.e. (2000, 22, 3, 28, 28). I augmented this 7× with techniques such as injecting random Gaussian noise, cropping, and changing the contrast and color space, giving (14000, 22, 3, 28, 28).
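As an example of the augmentations, the Gaussian-noise one looks roughly like this (the noise std of 0.05 is a placeholder value, and pixel values are assumed to be in [0, 1]):

```python
import torch

def add_gaussian_noise(seq, std=0.05):
    """Inject random Gaussian noise into one image sequence.

    seq: tensor of shape (seq_len, C, H, W), values assumed in [0, 1].
    std: noise standard deviation (placeholder value).
    """
    noisy = seq + torch.randn_like(seq) * std
    return noisy.clamp(0.0, 1.0)  # keep pixels in a valid range

seq = torch.rand(22, 3, 28, 28)
aug = add_gaussian_noise(seq)
print(aug.shape)  # torch.Size([22, 3, 28, 28])
```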
About the model:
It is an encoder-decoder architecture with 3D convolutional layers (5 in encoder and 5 in decoder).
Loss: MSE loss
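To make the setup concrete, here is a small sketch of that kind of 3D-conv autoencoder (fewer layers and different channel widths than my actual model, just to show the shape handling; the final `Sigmoid` assumes inputs normalized to [0, 1] to match the MSE target):

```python
import torch
import torch.nn as nn

class Conv3dAutoencoder(nn.Module):
    """Toy 3D-conv encoder-decoder; layer/channel counts are placeholders."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # downsample
            nn.Conv3d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Conv3d(64, 32, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(32, 16, 3, stride=2,
                               padding=1, output_padding=1), nn.ReLU(),  # upsample
            nn.Conv3d(16, 3, 3, padding=1), nn.Sigmoid(),  # inputs assumed in [0, 1]
        )

    def forward(self, x):  # x: (N, C, D, H, W)
        return self.decoder(self.encoder(x))

model = Conv3dAutoencoder()
x = torch.rand(16, 3, 22, 28, 28)  # permuted batch
recon = model(x)
print(recon.shape)  # torch.Size([16, 3, 22, 28, 28])
```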
I have tried playing with the learning rate, adding dropout and normalization layers, and changing the dimension of the encoder’s compressed vector space, but nothing has helped so far. Please find my model and code below. Your help and suggestions would be appreciated.
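For the NaN losses specifically, these are the guards I’m experimenting with (a sketch with a stand-in model and batch, not my real training loop): normalizing inputs to [0, 1] before MSE, a small Adam learning rate, gradient clipping, and anomaly detection to locate the op that first produces NaN.

```python
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the real autoencoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # small LR
criterion = torch.nn.MSELoss()

torch.autograd.set_detect_anomaly(True)  # reports the op that produced NaN/Inf

x = torch.rand(8, 4)  # stand-in batch, values already in [0, 1]
optimizer.zero_grad()
loss = criterion(model(x), x)  # autoencoder-style reconstruction loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip grads
optimizer.step()
print(torch.isfinite(loss).item())  # True
```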
Thanks in advance