Why can't I get this simple model to overfit on a single training example?

I'm trying to do some experiments with embeddings and image generation, but I have not been able to get a model that generates good output. I created a very simple model, made up of an embedding layer and a linear layer, that takes a pair of an image and some text representing that image and then tries to regenerate the image. I keep getting random noise as the output instead of an image. To test, I am just trying to get the model to overfit on a single example, but I only get back random garbage. I train for 200 epochs. Even if the architecture is not good, shouldn't the model be able to overfit and duplicate the exact image? I use mean squared error on the pixels, comparing the training image to the generated image, and the image is only 40x40 pixels. Am I doing something fundamentally wrong? Can anyone give me pointers? Here is the full repo with training data: https://github.com/jtoy/overfit_image_pytorch

Here is what the model generates:
https://github.com/jtoy/overfit_image_pytorch/blob/master/results/0_0.png
Here is the training example I'm trying to overfit: https://github.com/jtoy/overfit_image_pytorch/blob/master/results/0_0target.png

I quickly checked your code, and here are a few questions/pointers.

I'm sure it doesn't generate the same target labels. Here is how it parsed the text:
Counter({',': 5, '(': 2, ')': 2, '\n': 2, '5': 2, 'fill': 1, 'r_two_five_five': 1, 'g_eight_five': 1, 'b_eight_five': 1, 'ellipse': 1, 'x_two_eight': 1, 'y_two_seven': 1})

for this training data:
https://github.com/jtoy/overfit_image_pytorch/blob/master/training_data/code.txt

And here is the line that parses the actual string: https://github.com/jtoy/overfit_image_pytorch/blob/master/build_vocab.py#L28

Given that, I will see if I can simplify the code further to show more clearly that this part of the code is not causing the problem.

The model does learn. I just trained with a bunch of different learning rates and got the best result with a learning rate of 0.01, where the loss was 0.1180.
Here is what the training loss looks like:
```
Epoch [0/200], Step [0/1], Loss: 0.9265, accuracy: 0.00 Perplexity: 2.5256
Epoch [1/200], Step [0/1], Loss: 2.8060, accuracy: 0.00 Perplexity: 16.5436
Epoch [2/200], Step [0/1], Loss: 0.6228, accuracy: 0.00 Perplexity: 1.8642
Epoch [3/200], Step [0/1], Loss: 0.8439, accuracy: 0.00 Perplexity: 2.3253
Epoch [4/200], Step [0/1], Loss: 0.9583, accuracy: 0.00 Perplexity: 2.6072
Epoch [5/200], Step [0/1], Loss: 0.9609, accuracy: 0.00 Perplexity: 2.6142
Epoch [6/200], Step [0/1], Loss: 0.9588, accuracy: 0.00 Perplexity: 2.6086
Epoch [7/200], Step [0/1], Loss: 0.9474, accuracy: 0.00 Perplexity: 2.5789
Epoch [8/200], Step [0/1], Loss: 0.9370, accuracy: 0.00 Perplexity: 2.5524
Epoch [9/200], Step [0/1], Loss: 0.9292, accuracy: 0.00 Perplexity: 2.5325
Epoch [10/200], Step [0/1], Loss: 0.9257, accuracy: 0.00 Perplexity: 2.5235
Epoch [11/200], Step [0/1], Loss: 0.9181, accuracy: 0.00 Perplexity: 2.5045
Epoch [12/200], Step [0/1], Loss: 0.9106, accuracy: 0.00 Perplexity: 2.4857
Epoch [13/200], Step [0/1], Loss: 0.9039, accuracy: 0.00 Perplexity: 2.4692
Epoch [14/200], Step [0/1], Loss: 0.8969, accuracy: 0.00 Perplexity: 2.4521
Epoch [15/200], Step [0/1], Loss: 0.8897, accuracy: 0.00 Perplexity: 2.4344
Epoch [16/200], Step [0/1], Loss: 0.8847, accuracy: 0.00 Perplexity: 2.4224
Epoch [17/200], Step [0/1], Loss: 0.8815, accuracy: 0.00 Perplexity: 2.4145
Epoch [18/200], Step [0/1], Loss: 0.8753, accuracy: 0.00 Perplexity: 2.3997
Epoch [19/200], Step [0/1], Loss: 0.8690, accuracy: 0.00 Perplexity: 2.3844
Epoch [20/200], Step [0/1], Loss: 0.8642, accuracy: 0.00 Perplexity: 2.3732

Epoch [126/200], Step [0/1], Loss: 0.1181, accuracy: 0.00 Perplexity: 1.1253
Epoch [127/200], Step [0/1], Loss: 0.1181, accuracy: 0.00 Perplexity: 1.1253
Epoch [128/200], Step [0/1], Loss: 0.1180, accuracy: 0.00 Perplexity: 1.1253
```

And the generated image is starting to look like the training data, but there is still tons of noise: https://cl.ly/3E1d0X0w1r0b

Am I wrong to expect the model to learn to duplicate/overfit the exact image with just a single training image? How can I make the model overfit this and get the same result?

If you can't overfit the single training example, maybe try increasing your model complexity?

I have been using more complex models (RNNs, deconv layers, CNNs, etc.) in various configurations. I've tried to reduce the model to the simplest form possible. In the current repo, the final layer is a linear layer that takes the output of an embedding layer and outputs 40x40 pixels x 3 (RGB channels). I'm not sure what I am doing wrong.

I'm not sure what output range you expect, but do you need the last ReLU in your model?
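
To make the output-range question concrete, here is a rough sketch of how one might check it. It assumes the target image is normalized to [-1, 1]; the thread does not say how the repo normalizes its images, so treat that as a guess. The helper name `relu_mse_floor` and the random target are made up for illustration. A ReLU only produces values >= 0, so any negative target pixel can never be matched, and that puts a floor under the MSE no matter how long you train.

```python
import torch


def relu_mse_floor(target: torch.Tensor) -> float:
    """Smallest MSE reachable when every output is forced to be >= 0.

    A pixel with target t >= 0 can be matched exactly (error 0).
    For t < 0 the closest non-negative output is 0, leaving an error of t**2.
    """
    # clamp(max=0) keeps negative targets and zeroes out the reachable ones.
    return target.clamp(max=0.0).pow(2).mean().item()


# Hypothetical 3x40x40 target, ASSUMED to be normalized to [-1, 1].
torch.manual_seed(0)
target = torch.rand(3, 40, 40) * 2 - 1

print(f"target range: [{target.min().item():.3f}, {target.max().item():.3f}]")
print(f"best MSE reachable behind a final ReLU: {relu_mse_floor(target):.4f}")
```

If the floor comes out at zero, the targets are already non-negative and the ReLU is not what limits the fit on range grounds alone.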

@fmassa Yes, that was the issue! Now I have a base to build back up from, thank you!
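
For anyone who lands here with the same symptom, below is a minimal sketch of the comparison, not the code from the repo. The class `TextToImage`, the function `overfit_one_example`, the mean-pooling over tokens, the Adam optimizer, the random token ids, and the random target in [-1, 1] are all assumptions made for illustration. Only the 40x40x3 output size, the MSE loss, the 200 epochs, and the learning rate of 0.01 come from the thread.

```python
import torch
import torch.nn as nn


class TextToImage(nn.Module):
    """Embedding -> mean over tokens -> linear -> 3x40x40 image (hypothetical layout)."""

    def __init__(self, vocab_size: int, embed_dim: int = 32, final_relu: bool = False):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.linear = nn.Linear(embed_dim, 3 * 40 * 40)
        self.final_relu = final_relu

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        pooled = self.embed(tokens).mean(dim=1)   # (batch, embed_dim)
        out = self.linear(pooled)                 # (batch, 4800)
        if self.final_relu:
            out = torch.relu(out)                 # forces every output to be >= 0
        return out.view(-1, 3, 40, 40)


def overfit_one_example(final_relu: bool, epochs: int = 200, lr: float = 0.01) -> float:
    """Train on one (tokens, image) pair and return the final MSE."""
    torch.manual_seed(0)                          # same data and init for both runs
    tokens = torch.randint(0, 20, (1, 12))        # one made-up token sequence
    target = torch.rand(1, 3, 40, 40) * 2 - 1     # one made-up image, ASSUMED in [-1, 1]
    model = TextToImage(vocab_size=20, final_relu=final_relu)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(tokens), target)
        loss.backward()
        optimizer.step()
    # Report the loss after the final update.
    with torch.no_grad():
        return loss_fn(model(tokens), target).item()


loss_with_relu = overfit_one_example(final_relu=True)
loss_without_relu = overfit_one_example(final_relu=False)
print(f"final ReLU kept:    {loss_with_relu:.4f}")
print(f"final ReLU removed: {loss_without_relu:.4f}")
```

Under these assumptions the run with the ReLU cannot fit the negative pixels, so its loss stalls well above the run without it. If your targets live in [0, 1] instead, the ReLU can still hurt: an output whose pre-activation goes negative gets zero gradient and may never recover.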