Limits of a 2-neuron bottleneck

Hello! I have some 64x64 pixel frames from a (simulated) video of a spaceship moving over a fixed background. The spaceship moves in a straight line with constant velocity from left to right (along the x-axis), and the frames are taken at equal time intervals. I can also place the ship at different y positions and let it move. In total I have 8 y positions and 64 frames for each y position (the exact numbers don't matter much). Intuitively, since the background is fixed and the shape of the ship never changes, all the information needed to reconstruct a frame is contained in the x and y position of the spaceship.

What I am trying to do is build a NN with an encoder, a decoder, and a bottleneck in the middle, and I want that bottleneck to have just 2 neurons. Ideally, the encoder would learn in these 2 neurons some function of x and y, and the decoder would invert that function to reproduce the original image. Here is my NN architecture (in PyTorch):

from torch import nn

class Rocket_E_NN(nn.Module):
    def __init__(self):
        super().__init__()
        
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1),          # B,  32, 32, 32
            nn.ReLU(True),
            nn.Conv2d(32, 32, 4, 2, 1),          # B,  32, 16, 16
            nn.ReLU(True),
            nn.Conv2d(32, 64, 4, 2, 1),          # B,  64,  8,  8
            nn.ReLU(True),
            nn.Conv2d(64, 64, 4, 2, 1),          # B,  64,  4,  4
            nn.ReLU(True),
            nn.Conv2d(64, 256, 4, 1),            # B, 256,  1,  1
            nn.ReLU(True),
            View((-1, 256*1*1)),                 # B, 256
            nn.Linear(256, 2),                   # B, 2
        )
            
    def forward(self, x):
        z = self.encoder(x)
        return z

class Rocket_D_NN(nn.Module):
    def __init__(self):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(2, 256),               # B, 256
            View((-1, 256, 1, 1)),               # B, 256,  1,  1
            nn.ReLU(True),
            nn.ConvTranspose2d(256, 64, 4),      # B,  64,  4,  4
            nn.ReLU(True),
            nn.ConvTranspose2d(64, 64, 4, 2, 1), # B,  64,  8,  8
            nn.ReLU(True),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), # B,  32, 16, 16
            nn.ReLU(True),
            nn.ConvTranspose2d(32, 32, 4, 2, 1), # B,  32, 32, 32
            nn.ReLU(True),
            nn.ConvTranspose2d(32, 3, 4, 2, 1),  # B, 3, 64, 64
        )
            
    def forward(self, z):
        x = self.decoder(z)
        return x
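
(View above is not a built-in PyTorch module; it is just the usual reshape helper so a view can be used inside nn.Sequential, something along these lines:)

from torch import nn

class View(nn.Module):
    """Reshape the input to the given shape inside an nn.Sequential."""
    def __init__(self, shape):
        super().__init__()
        self.shape = shape

    def forward(self, x):
        return x.view(*self.shape)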

And here is an example of one of the images I have (the original was much higher resolution, but I downscaled it to 64x64):

[example frame: the spaceship over the fixed background, downscaled to 64x64]

So after training it for around 2000 epochs with a batch size of 128, using Adam and trying several LR schedules (going from 1e-3 down to 1e-6), I can't get the loss below an RMSE of 0.010-0.015 (the pixel values are between 0 and 1). The reconstructed image looks OK by eye, but I need a lower loss for the purposes of my project. Is there any way to push the loss lower, or am I asking too much of the NN to distill all the information into these 2 numbers?
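
For reference, the training loop is roughly the following (a simplified sketch: the random tensor is only a placeholder for the real frames, the StepLR settings are just one example of decaying from 1e-3 towards 1e-6, and I compute the RMSE from an MSE loss):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

encoder, decoder = Rocket_E_NN(), Rocket_D_NN()
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.1)
criterion = nn.MSELoss()

frames = torch.rand(8 * 64, 3, 64, 64)            # placeholder for the real frames
loader = DataLoader(TensorDataset(frames), batch_size=128, shuffle=True)

for epoch in range(2000):
    for (batch,) in loader:
        optimizer.zero_grad()
        recon = decoder(encoder(batch))           # 2-neuron bottleneck in between
        loss = criterion(recon, batch)            # MSE on pixel values in [0, 1]
        loss.backward()
        optimizer.step()
    scheduler.step()
    rmse = loss.sqrt().item()                     # the RMSE numbers quoted above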

Have a look at the CoordConv paper, which uses additional input channels containing a coordinate grid to give the model explicit spatial information. I think this method might work quite well in your use case too.
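
The core idea is just to concatenate normalized x/y coordinate grids as extra channels before a convolution. A rough, untested sketch (not the paper's reference implementation) could look like this:

import torch
from torch import nn

class AddCoords(nn.Module):
    """Append two channels holding normalized y/x coordinates in [-1, 1]."""
    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return torch.cat([x, ys, xs], dim=1)

# in the encoder, the first conv then sees 3 + 2 input channels:
first_block = nn.Sequential(AddCoords(), nn.Conv2d(3 + 2, 32, 4, 2, 1))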

Thank you for this! It looks promising. I am a bit confused about the implementation details, though. For the encoder it is pretty straightforward to add 2 (or 3) more channels to the images, but how should I do this for the decoder? The input to my decoder is (ideally) a 2-dimensional vector, so how do I use the CoordDeconv on that? Thank you!

You could probably add the coordinate channels after the view operation in your decoder, or just try using the coord convs in your encoder only.
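
With the AddCoords module from the snippet above, that could look roughly like this (untested sketch; note that the in-channels of the following layer grow by 2):

# View and AddCoords as defined earlier in the thread
decoder = nn.Sequential(
    nn.Linear(2, 256),
    View((-1, 256, 1, 1)),
    nn.ReLU(True),
    AddCoords(),                            # B, 256 + 2, 1, 1
    nn.ConvTranspose2d(256 + 2, 64, 4),     # B, 64, 4, 4
    # ... rest of the decoder unchanged; AddCoords could also be
    # dropped in before the later ConvTranspose2d layers
)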

Let me know how your experiments worked out!

Thank you for your reply. I tried it in several ways: I added the CoordConv layer at every step, and also just at the beginning, and tried with 2 and 3 coordinate channels (the paper also tries a radial-distance channel), but I don't see any improvement. I also tried both my own implementation and some GitHub code, with the same results, so I am not sure what I am doing wrong. One thing that might explain it (though I'm not sure at all) is that in the paper they mostly evaluate the VAE and GAN results visually. My results also look really good visually, but when I compare the input and output pixel values, the error is still pretty big. Any further suggestions on what I should do?