VAE Loss not decreasing

Akshay_Subramanian · June 13, 2019, 10:21am

I have implemented a Variational Autoencoder in PyTorch that works on SMILES strings (string representations of molecular structures).
When trained to reconstruct its own input string, the loss does not decrease between epochs.

I have tried removing the KL divergence loss and the sampling step and training only a plain autoencoder, but even this fails to converge.
I therefore assume that the error lies in the encoder, the decoder, or the training loop, and is probably not related to the loss or the sampling.

I have also tried the following, without success:

Adding more GRU layers to increase the capacity of the model.
Increasing and decreasing the learning rate.
Changing the optimizer from Adam to SGD.
Increasing and decreasing the batch size.

The following is a link to my code:
https://colab.research.google.com/drive/1LctSm_Emnn5sHpw_Hon8xL5fF4bmKRw5

The following is a link to an equivalent (same architecture) Keras model that trains successfully:
https://colab.research.google.com/drive/170Peseik03CFYpWPNyD8B8mxUGxTQx67

Akshay_Subramanian · June 16, 2019, 2:56am

@ptrblck Could you please have a look at the problem? I have been stuck on it for almost a week now. Any ideas would be helpful. Thanks.

ptrblck · June 16, 2019, 12:44pm

Could you explain the general workflow of your model a bit?
It seems that you are using nn.BCELoss with a softmax output from your decoder.
Also, does data represent a probability distribution in [0, 1]?
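
To make the question about the loss concrete, here is a minimal sketch (not taken from the linked notebook) of one common way to score a per-character prediction: passing raw logits to nn.CrossEntropyLoss together with class indices. The tensor sizes and the random stand-ins for the decoder output and the targets are assumptions made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only: 4 sequences, 6 positions, 5 possible characters.
batch, seq_len, vocab = 4, 6, 5

# Hypothetical stand-in for the decoder output *before* any softmax (raw logits).
logits = torch.randn(batch, seq_len, vocab)

# Hypothetical stand-in for one-hot encoded targets of the same shape.
target_idx = torch.randint(0, vocab, (batch, seq_len))
one_hot = nn.functional.one_hot(target_idx, num_classes=vocab).float()

# nn.CrossEntropyLoss applies log-softmax internally and expects class indices,
# so the one-hot targets are converted back with argmax and both tensors are
# flattened to [batch * seq_len, ...].
criterion = nn.CrossEntropyLoss()
ce = criterion(logits.reshape(-1, vocab), one_hot.argmax(dim=-1).reshape(-1))

# The same quantity written out by hand: mean negative log-probability
# of the correct character.
manual = -(one_hot * torch.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# For comparison: BCELoss on softmax probabilities treats every one of the
# `vocab` entries as an independent binary target, which gives a different value.
bce = nn.BCELoss()(torch.softmax(logits, dim=-1), one_hot)

print(f"cross entropy: {ce.item():.4f}, manual: {manual.item():.4f}, BCE: {bce.item():.4f}")
```

If the decoder already ends in a softmax, that layer would have to be dropped before using this criterion, since nn.CrossEntropyLoss applies log-softmax itself.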

I would suggest using a small data sample (e.g. 10 samples) and trying to overfit your model on it.
If your model cannot overfit this small sample, there might be other code bugs.
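
A rough sketch of such an overfitting check is shown here. The model is a placeholder MLP autoencoder rather than the GRU model from the notebook, and the data, sizes, learning rate, and step count are all assumptions chosen only to keep the example small.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes: 10 samples, 8 characters each, alphabet of 5 symbols.
n_samples, seq_len, vocab = 10, 8, 5
target_idx = torch.randint(0, vocab, (n_samples, seq_len))
data = nn.functional.one_hot(target_idx, num_classes=vocab).float()

# Placeholder autoencoder (a small MLP, not the model from the notebook).
model = nn.Sequential(
    nn.Flatten(),                      # [10, 8, 5] -> [10, 40]
    nn.Linear(seq_len * vocab, 32),
    nn.ReLU(),
    nn.Linear(32, seq_len * vocab),    # raw logits, no softmax
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

losses = []
for step in range(300):
    optimizer.zero_grad()
    logits = model(data).view(-1, vocab)           # [80, 5]
    loss = criterion(logits, target_idx.view(-1))  # targets: [80]
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

On a sample this small, the loss of a working training loop should fall far below its starting value; a loss that barely moves points to a problem in the model, the loss, or the loop rather than in the data size.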

Akshay_Subramanian · June 16, 2019, 1:50pm

Thanks for your reply @ptrblck.
Yes, I am using nn.BCELoss with a softmax output from the decoder.
data is a one-hot encoded tensor of shape [600, 120, 33] (120 is the length of each string and 33 is the size of the character set used to build these strings).

I tried training the model on a small sample of 10 strings. The initial loss was 20.636. It decreased very slowly but continuously, and stagnated at 19.803 after around 7500 epochs.
So I guess the model is not overfitting at all.

Please let me know if I can clarify anything else about the code.
Any ideas on what can be done, @ptrblck? Thanks.

Adnan-annan · November 17, 2020, 10:15am

@Akshay_Subramanian were you able to get the VAE working? I am facing a similar problem where the loss is not decreasing at all. Please share any suggestions.

Keyv_Krmn · January 4, 2021, 2:35am

I recommend first trying a very small dataset, as @ptrblck suggested. Also, see if you can get a plain autoencoder to converge first, since a VAE can be tricky to train. Check that you call optimizer.step(), then check the dimensions of the output, your loss function, and finally the hyperparameters. The best way to debug is to print the outputs of the model and see whether they change.
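
As a rough illustration of these checks, the sketch below runs a single training step and reports whether gradients are non-zero, whether each parameter tensor changed, and whether the model output changed. The model, the input, and the MSE loss are hypothetical placeholders, not the code from the thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder model and input, made up for this example.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(4, 8)

# Snapshot the parameters and the output before the update.
before = {name: p.detach().clone() for name, p in model.named_parameters()}
out_before = model(x).detach().clone()

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), x)
loss.backward()

# A gradient norm of zero (or a missing gradient) means nothing reaches that layer.
grad_norms = {name: p.grad.norm().item() for name, p in model.named_parameters()}

optimizer.step()

# After optimizer.step(), every trainable tensor should differ from its snapshot.
changed = {
    name: not torch.equal(before[name], p.detach())
    for name, p in model.named_parameters()
}
out_changed = not torch.equal(out_before, model(x).detach())

for name in before:
    print(f"{name}: grad norm {grad_norms[name]:.4f}, changed: {changed[name]}")
print("output changed:", out_changed)
```

A parameter that reports no change, or a zero gradient norm, narrows the search to that layer or to a missing backward() or optimizer.step() call.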