Basic Fully connected NN - encoding a single image

Hi,

I’ve been playing around with bits and pieces of NN code, trying to understand how to build neural nets from scratch. (I’ve gone through the fast.ai course; now I’m really trying to understand torch and build things from the ground up.)

There’s a really great channel on YouTube that’s building a neural net library from scratch, and I’m trying to port the example setup over to torch, but I’m stumbling a bit.

The idea is to take two of the MNIST digits and, using a set of linear layers with sigmoid activations, encode the images. The end result would be that you can interpolate between the two images.

In the example he has on the YouTube channel, he uses three neurons in the first layer: the 1st for X, the 2nd for Y, and the 3rd as the interpolator. The images are given markers 0 and 1, and during inference the third neuron controls the interpolation, so 0 = image 1, 1 = image 2, and 0.5 is midway between the two shapes.

This is where I’ve got stuck, as using torch we’re not encoding x and y into separate neurons, and I’m not entirely sure how I’d add this interpolation neuron.

Has anyone got any ideas or pointers in the right direction? It also seems like my code works, but I don’t really get it, as during inference I just pass in the same data it was trained with. Or is that what an autoencoder is: same data in, same data out?

If anyone has literally any insights I’d be really grateful, as even this simple setup is kinda confusing to me.

This is my code so far:

from torch import nn
from torch.optim import Adam
from torchvision import datasets
from torchvision.transforms import ToTensor
import torch.nn.functional as F
import torch
from PIL import Image
from torchvision.transforms import ToPILImage

train = datasets.MNIST(root='data', download=True, train=True, transform=ToTensor())

im1 = train[0][0]
im2 = train[1][0]
img1 = ToPILImage()(im1)
img2 = ToPILImage()(im2)
img1.show()
img2.show()

totens = ToTensor()
img1tensor = totens(img1)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 10),
            nn.Sigmoid(),
            # nn.ReLU(),
            nn.Linear(10, 20),
            nn.Sigmoid(),
            # nn.ReLU(),
            nn.Linear(20, 20),
            nn.Sigmoid(),
            # nn.ReLU(),
            nn.Linear(20, 28*28),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

# random_data = torch.rand((1, 1, 28, 28))
data = img1tensor

model = NeuralNetwork()
result = model(data) # run data through model to see if it works just passing through
print(result.shape)
result = result.detach().reshape(28,28) # detach so ToPILImage can convert the tensor
img = ToPILImage()(result)
img.show()

model = NeuralNetwork().to('cpu') # change to cuda if you want
print(model)
optimizer = Adam(model.parameters(), lr = 1e-1)
loss_func = nn.MSELoss()
loss = 0

for epoch in range(100):
      optimizer.zero_grad() # reset the gradients back to zero
      # compute reconstructions
      outputs = model(data)
      # print(outputs.shape, data.shape)
      # compute training reconstruction loss
      train_loss = loss_func(outputs, data.reshape(1,784))
      # compute accumulated gradients
      train_loss.backward()
      # perform parameter update based on current gradients
      optimizer.step()
      # add the mini-batch training loss to epoch loss
      loss += train_loss.item()
      print(train_loss)
      result = model(data) # run data through model to see if it works just passing through
      result = result.detach().reshape(28,28) # detach before converting for display
      img = ToPILImage()(result)
      img.show()

And this is the video I’m trying to port from: tsoding daily - nn from scratch

I’ve tweaked the code a bit, and answered a couple of my own questions by playing around (like the autoencoder wonderings).

I’m now passing in empty data and random data, and getting reconstructions of the digit no matter the input.

I’ve also tweaked the sigmoid activations so there’s only one (instead of a sigmoid after each linear layer), and it still trains.

The bit that’s confusing me is the large number of trainable params; it’s including all 784 at the input and output, and I’m not 100% sure whether that’s correct.
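
Counting it out by hand for the architecture below, the number does check out: Linear(784, 3) is 784*3 + 3 = 2355 params, Linear(3, 1) is 4, Linear(1, 1) is 2, and Linear(1, 784) is 1*784 + 784 = 1568, so 3929 in total, and nearly all of it sits in the two layers that touch the 784 pixel values.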

I know it’s not the most glamorous or clever thing to be playing around with; I’m just trying to understand things from the ground up before getting more complex with conv nets.

I also set a manual seed for reproducibility.

from torch import nn
from torch.optim import Adam
from torchvision import datasets
from torchvision.transforms import ToTensor
import torch.nn.functional as F
import torch
from PIL import Image
from torchvision.transforms import ToPILImage

train = datasets.MNIST(root='data', download=True, train=True, transform=ToTensor())

im1 = train[0][0]
im2 = train[1][0]
img1 = ToPILImage()(im1)
img2 = ToPILImage()(im2)
print('these are the input images:')
img1.show()
img2.show()

totens = ToTensor()
img1tensor = totens(img1)

torch.manual_seed(7331)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 3),
            nn.Linear(3, 1),
            nn.Linear(1, 1),
            # nn.Sigmoid(),
            nn.Linear(1, 28*28),
            nn.Sigmoid(), # this is odd, still learns no matter where I put sigmoid, comm here, and uncomm the other above.
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

random_data = torch.rand((1, 1, 28, 28))
empty_data = torch.zeros((1, 1, 28, 28))
data = img1tensor

randt = ToPILImage()(random_data[0])
emptyt = ToPILImage()(empty_data[0])
print('this is rand tens img:')
randt.show()
print('this is empty tens img:')
emptyt.show()

model = NeuralNetwork()
print('this is passing through empty tensor to untrained net:')
result = model(empty_data) # run data through model to see if it works just passing through
# print(result.shape)
result = result.detach().reshape(28,28) # detach so ToPILImage can convert the tensor
img = ToPILImage()(result)
img.show()

model = NeuralNetwork().to('cpu') # change to cuda if you want
# print(model)
optimizer = Adam(model.parameters(), lr = 1e-1)
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
print('param count =', count_parameters(model))
loss_func = nn.MSELoss()
loss = 0
print('Beginning training:')

for epoch in range(100):
      optimizer.zero_grad() # reset the gradients back to zero
      # compute reconstructions
      outputs = model(empty_data) # passing through empty data now, instead of data
      # compute training reconstruction loss
      train_loss = loss_func(outputs, data.reshape(1,784))
      # compute accumulated gradients
      train_loss.backward()
      # perform parameter update based on current gradients
      optimizer.step()
      # add the mini-batch training loss to epoch loss
      loss += train_loss.item()
      if epoch % 100 == 0:
        print('loss:', train_loss.item())
        print(epoch)
        result = model(empty_data) # run random data through model to see if it works
        result = result.detach().reshape(28,28) # detach before converting for display
        img = ToPILImage()(result)
        img.show()

print("final train loss:", train_loss.item())
empty_data = torch.zeros((1, 1, 28, 28))
result = model(empty_data) # run data through model to see if it works just passing through
result = result.detach().reshape(28,28)
img = ToPILImage()(result)
print('reconstruction with empty data:')
img.show()

rand_data = torch.rand((1, 1, 28, 28))
result = model(rand_data) # run data through model to see if it works just passing through
result = result.detach().reshape(28,28)
img = ToPILImage()(result)
print('reconstruction with random data:')
img.show()

If anyone has any insights on how to get this closer to the original example in the video I’d be grateful. I still don’t think I’m doing it quite right.

This approach sounds correct, as you would usually use all inputs in the first layer and then output the same shape in an autoencoder.
I’m not sure what the author of the video uses and whether he is processing the input to create features first.

I think this totally makes sense and is a great way to get familiar with the framework.

Note that your current model is using stacked linear layers:

        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 3),
            nn.Linear(3, 1),
            nn.Linear(1, 1),
            # nn.Sigmoid(),
            nn.Linear(1, 28*28),
            nn.Sigmoid(), # this is odd, still learns no matter where I put sigmoid, comm here, and uncomm the other above.
        )

which can mathematically be collapsed to a single linear layer if no non-linear activation function is used between them, so you might want to revisit the architecture.
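
As a quick sanity check of that point (just a standalone sketch, not tied to your model sizes), you can fold two stacked nn.Linear layers into a single one by composing their weights and confirm the outputs match:

import torch
from torch import nn

torch.manual_seed(0)
lin1 = nn.Linear(4, 8)
lin2 = nn.Linear(8, 2)

# Compose the two affine maps into a single equivalent Linear(4, 2)
collapsed = nn.Linear(4, 2)
with torch.no_grad():
    collapsed.weight.copy_(lin2.weight @ lin1.weight)
    collapsed.bias.copy_(lin2.weight @ lin1.bias + lin2.bias)

x = torch.randn(5, 4)
print(torch.allclose(lin2(lin1(x)), collapsed(x), atol=1e-6))  # True

With a non-linearity (e.g. nn.Sigmoid()) between the two layers this folding is no longer possible, which is exactly why the activations matter.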

Awesome, thank you for all the encouraging words! I’ve seen your name lots around the forum, so I really appreciate you taking the time to have a gander at this!

So that confirms I was wrong to remove the sigmoids after each layer; there must be activations after each layer then, if, as you’re suggesting, the stack with only one activation is otherwise equivalent to a single fc layer. That’s one wonder out of the way!

In the video example, he normalises the pixel co-ords (this might be the key step I’m missing), so the X pixel positions are normalised from W (28) down to a 0-1 range, and the same for Y. In the basic example (before the interpolation) he’s able to use this net simply for upscaling the image. I feel that with the 28*28 inputs I’m literally just mapping back to exact pixels, which I’m not sure is what I’m after.
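
(As far as I understand it, that normalisation in torch terms is just something like this:)

import torch

w, h = 28, 28
norm_x = torch.arange(w) / (w - 1)  # 0.0 ... 1.0 across a row
norm_y = torch.arange(h) / (h - 1)  # 0.0 ... 1.0 down the rows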

I’ve been trying to play around with a vis library, to help at least visually debug or get across/explain the architecture in an easier way, so I added that at the bottom of this nb: Colab - basic_nn

This is the intended arch: 3 in, 1 out, where the third neuron controls the morph. For now it could just be 2 in, 1 out, with one input neuron each for normalised x and y. Does that make sense, and is it possible?

The direct mapping from 28x28 pixels would make sense if you want to reconstruct the input and learn e.g. an internal latent feature tensor.
However, it seems the author has reduced the spatial dimensions somehow, which I don’t quite understand yet. In the end you would need to reduce 28x28 values to only 2 or 3 input features. Did the author explain how this reduction was done and could you describe what “normalize” means in this context?

It’s an interesting one; I had to have a think about that. I looked at the code again and made some comments.

I’ve added the training data creation loop below, with over-verbose comments added in. I think the gist is that X is normalised spatially so it cycles 0-1 every 28 px (along each row), Y is normalised spatially so it ramps 0-1 over the full W*H of an image, and the tensor/matrix looks something like:

[normX, normY, interp, brightness] - pix 1
[normX, normY, interp, brightness] - pix 2
...
[normX, normY, interp, brightness] - pix 784 * (amount of images)

so each column would end up having W * H (* amount of images) entries.

I’m not sure I know the exact vocabulary for this stuff yet, but I guess the equivalent would be that each pixel represents a single element in the batch. Instead of having a tensor with multiple images, we have a tensor with multiple pixels.

In torch you’d normally have something like [1, 3, 28, 28], where the first dim denotes the image in the batch (just 1 above), the 2nd the number of colour channels (3 above), and the third and fourth the height and width (28 and 28).

The author has chosen to re-arrange things slightly, with:
(normX, normY, interp, brightness). Each entry/column has the same length/size.

So normX.size = w * h * amountOfImages
and so would normY, interp and brightness.

if(!trainset_created)
{
    const int w = img1.getWidth(); // get width of picture, for mnist is 28 for width
    const int h = img1.getHeight(); // get height of picture, for mnist is 28 for height
    
    for (int y = 0; y < h; ++y) // loop height values in image
    {
        for (int x = 0; x < w; ++x) // loop over width values in image
        {
            training_data[x][y] = img1.getPixelAt(x, y).getPixelARGB().getBlue(); // sample the pixel vals and put into 2d arr
            int ind = y * w + x; // our index to save into is current y * w + x
            float normX = x / (w - 1.0); // this normalises width spatially: if x is 27, and width is 28, then 27 / (28 - 1.0) = 1
                                         //                                  if x is 0,  and width is 28, then 0 / (28 - 1.0) = 0
            float normY = y / (h - 1.0); // same for y, so gives a range of 0 - 1
            float normBrightness = training_data[x][y] / 255.f; // this normalises brightness value, from 8bit val to range 0-1
            
            MAT_AT(mat, ind, 0) = normX; // So this is our main X // Goes to input 1 of net
            MAT_AT(mat, ind, 1) = normY; // this is out main Y // Goes to input 2 of net
            MAT_AT(mat, ind, 2) = 0.0f; // this would be the interpolation, maybe equiv to one hot encoding? // Goes to input three
            MAT_AT(mat, ind, 3) = normBrightness; // And this is the value actually passed into the net to train on
            //Looks like you're left with a matrix the size of, length pixel amount width * height (* amount of images):
            //[normX, normY, interp, brightness]
            //[normX, normY, interp, brightness]
            //++
            //continuing on for 784   for 1 pic
            //continuing on for 784*2 for 2 pic
        }
    }
    // this is for 2nd image
    for (int y = 0; y < h; ++y)
    {
        for (int x = 0; x < w; ++x)
        {
            training_data2[x][y] = img2.getPixelAt(x, y).getPixelARGB().getBlue();
            int ind = w * h + y * w + x; // this offsets the index by adding an additional w * h, so offset entry by 784 in mnist case
            float normX = x / (w - 1.0); // normalise spatially again
            float normY = y / (h - 1.0);
            float normBrightness = training_data2[x][y] / 255.f; // norm brightness
            
            MAT_AT(mat, ind, 0) = normX; // does same thing as above, save into
            MAT_AT(mat, ind, 1) = normY;
            MAT_AT(mat, ind, 2) = 1.0f; // This would be your interpolation encoding, input to neuron 3 of net.
            MAT_AT(mat, ind, 3) = normBrightness;
        }
        trainset_created = true;
    }
}

And this would be an example of what you’d see for training input:
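
(Reconstructing the first few rows from the loop above; MNIST’s border pixels are blank, so the brightness column stays at 0 until the rows reach the digit strokes:)

[0.000, 0.000, 0.0, 0.000]  - pix 1    (x=0, y=0, image 1)
[0.037, 0.000, 0.0, 0.000]  - pix 2    (x=1, y=0; normX = 1/27)
[0.074, 0.000, 0.0, 0.000]  - pix 3    (x=2, y=0; normX = 2/27)
...
[1.000, 1.000, 1.0, 0.000]  - pix 1568 (x=27, y=27, image 2, interp = 1)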

It’s definitely a different way than you’re presented with in normal torch docs!

OK, well, this is embarrassing to post as it’s so overly verbose for Python and could probably be simplified, but cards on the table: I’m not a Python programmer at all. I think this is what the training data should be / look like:

import numpy as np

im1 = im1.squeeze(dim=0) # squeeze to just get an array of (28, 28)
im1.shape

width = 28
height = 28
w = range(width)
h = range(height)
itr = 0

xArr = []
yArr = []
brightArr = []
interpArr = []

for y in h:
  for x in w:
    normX = x / (width - 1.0)   # matches the C++: x / (w - 1.0)
    xArr = np.append(xArr, normX)
    normY = y / (height - 1.0)  # matches the C++: y / (h - 1.0)
    yArr = np.append(yArr, normY)
    interpArr = np.append(interpArr, 0)
    itr = itr + 1
    #print("xpos:", x, "normX:", normX, "normY:", normY) # what's that line that only displays to certain number of decimals, this is a mess

for y, valy in enumerate(im1):
  for x, valx in enumerate(valy):
    #print(valx)
    brightArr = np.append(brightArr, valx)

print("These should all be same size", xArr.size, yArr.size, brightArr.size, interpArr.size)

finarr = [xArr, yArr, interpArr, brightArr]

print("0th pos is X normalised", finarr[0].shape , "1st pos is Y normalised", finarr[1].shape, "2nd pos is interpolation", finarr[2].shape, "3rd pos is brightness", finarr[3].shape)

#overly verbose, and probably not very "pythonic", but just getting the thing at least working before tweaking and finding simpler/better way

traintensor = torch.from_numpy(np.array(finarr))
traintensor.shape

The one thing I’m wondering about, or not sure of now, is how the pixel brightness is sent in to train on. If each input of the net is for X, Y, and interp, do I send the brightness into each neuron? I’m super confused now!!

Right, I think I’m a tiny bit closer, but still no cigar:

So what I was missing was that there’s a training input, which is the normalised X, normalised Y, and the interpolation number, and a training output, which would be the actual brightness values.

I’m also missing the batching step.
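
So I think the tensors I need to end up with look something like this; just a sketch of my current understanding (the make_samples helper is my own, not from the video), with one row per pixel and a DataLoader doing the batching:

import torch
from torch.utils.data import TensorDataset, DataLoader

def make_samples(img, interp_value, w=28, h=28):
    # img: a (28, 28) tensor of brightness values in 0-1 (e.g. im1 from earlier)
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    inputs = torch.stack([xs.flatten() / (w - 1),                     # normX
                          ys.flatten() / (h - 1),                     # normY
                          torch.full((w * h,), float(interp_value))   # interp marker
                          ], dim=1)
    targets = img.flatten().unsqueeze(1)  # brightness, shape (784, 1)
    return inputs.float(), targets.float()

in1, out1 = make_samples(im1.squeeze(0), 0.0)  # image 1 tagged with interp = 0
in2, out2 = make_samples(im2.squeeze(0), 1.0)  # image 2 tagged with interp = 1

inputs  = torch.cat([in1, in2])    # shape (1568, 3): one (normX, normY, interp) row per pixel
targets = torch.cat([out1, out2])  # shape (1568, 1): the brightness to predict

loader = DataLoader(TensorDataset(inputs, targets), batch_size=64, shuffle=True)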

I honestly thought it would be a nice little project, but it’s kinda difficult reverse-engineering someone else’s thinking and implementing it in another language/framework.

However it’s definitely given me a bit more confidence in hacking around with torch! Even if I’m still confused as all hell!

from torch import nn
from torch.optim import Adam
from torchvision import datasets
from torchvision.transforms import ToTensor
import torch.nn.functional as F
import torch
from PIL import Image
from torchvision.transforms import ToPILImage
import numpy as np

train = datasets.MNIST(root='data', download=True, train=True, transform=ToTensor())

im1 = train[0][0]
im2 = train[1][0]
img1 = ToPILImage()(im1)
img2 = ToPILImage()(im2)
print('these are the input images:')
img1.show()
img2.show()

im1 = im1.squeeze(dim=0)
im1.shape

width = 28
height = 28
w = range(width)
h = range(height)
itr = 0

xArr = []
yArr = []
brightArr = []
interpArr = []

for y in h:
  for x in w:
    normX = x / (width - 1.0)   # matches the C++: x / (w - 1.0)
    xArr = np.append(xArr, normX)
    normY = y / (height - 1.0)  # matches the C++: y / (h - 1.0)
    yArr = np.append(yArr, normY)
    interpArr = np.append(interpArr, 0)
    itr = itr + 1
    #print("xpos:", x, "normX:", normX, "normY:", normY) # what's that line that only displays to certain number of decimals, this is a mess

for y, valy in enumerate(im1):
  for x, valx in enumerate(valy):
    #print(valx)
    brightArr = np.append(brightArr, valx)

#### this is for the 2nd image
# for y in h:
#   for x in w:
#     normX = x / (width - 1.0)
#     xArr = np.append(xArr, normX)
#     normY = y / (height - 1.0)
#     yArr = np.append(yArr, normY)
#     interpArr = np.append(interpArr, 1)
#     itr = itr + 1
#     #print("xpos:", x, "normX:", normX, "normY:", normY) # what's that line that only displays to certain number of decimals, this is a mess

# for y, valy in enumerate(im2):
#   for x, valx in enumerate(valy):
#     #print(valx)
#     brightArr = np.append(brightArr, valx)

print("These should all be same size", xArr.size, yArr.size, brightArr.size, interpArr.size)

finarr  = [xArr, yArr, interpArr]
testarr = [brightArr]
print("0th pos is X normalised", finarr[0].shape , "1st pos is Y normalised", finarr[1].shape, "2nd pos is interpolation", finarr[2].shape, "3rd pos is brightness", testarr[0].shape)
#overly verbose, and probably not very "pythonic", but just getting the thing at least working before tweaking and finding simpler/better way
train_tensor = torch.from_numpy(np.array(finarr))
test_tensor  = torch.from_numpy(np.array(testarr))
print('test tensor:', test_tensor.shape, 'train input tensor:', train_tensor.shape)

torch.manual_seed(7331)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 3),
            nn.Sigmoid(),
            nn.Linear(3, 10),
            nn.Sigmoid(),
            nn.Linear(10, 10),
            nn.Sigmoid(),
            nn.Linear(10, 28*28),
            nn.Sigmoid(), # this is odd, still learns no matter where I put sigmoid, comm here, and uncomm the other above.
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits


random_data = torch.rand((3, 28*28))
empty_data = torch.zeros((3, 28*28))
test_data = test_tensor
train_data = train_tensor
test_data = test_data.to(torch.float32)
train_data = train_data.to(torch.float32)
print('you want these to be all the same datatype:',test_data.dtype, train_data.dtype, random_data.dtype, empty_data.dtype)
print(test_data.shape)

random_data = random_data.reshape(3,28,28)
randt  = ToPILImage()(random_data)
emptyt = ToPILImage()(empty_data)
print('this is rand tens img:')
randt.show()
print('this is empty tens img:')
emptyt.show()

model = NeuralNetwork()
print('this is passing through empty tensor to untrained net:')
result = model(empty_data) # run data through model to see if it works just passing through
print('output of model:', result.shape) # why is the output torch.Size([3, 784]) when the final layer outputs 28*28?

result = result.detach().reshape(3,28,28) # detach so ToPILImage can convert the tensor
img = ToPILImage()(result)
img.show()

model = NeuralNetwork().to('cpu') # change to cuda if you want
# print(model)
optimizer = Adam(model.parameters(), lr = 1e-1)
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
print('param count =', count_parameters(model))
loss_func = nn.MSELoss()
loss = 0
print('Beginning training:')

for epoch in range(200):
      optimizer.zero_grad() # reset the gradients back to zero
      # compute reconstructions
      outputs = model(empty_data) # passing through empty data now, instead of data
      # compute training reconstruction loss
      # print('data shape:', data.shape, 'output shape:', outputs.shape)
      train_loss = loss_func(outputs, test_data)
      # train_loss = loss_func(outputs, data.reshape(1,784))
      # compute accumulated gradients
      train_loss.backward()
      # perform parameter update based on current gradients
      optimizer.step()
      # add the mini-batch training loss to epoch loss
      loss += train_loss.item()
      if epoch % 100 == 0:
        print('loss:', train_loss.item())
        print(epoch)
        result = model(empty_data) # run random data through model to see if it works
        result = result.detach().reshape(3,28,28) # detach before converting for display
        img = ToPILImage()(result)
        img.show()

print("final train loss:", train_loss.item())
empty_data = torch.zeros((3, 28*28))
result = model(empty_data) # run data through model to see if it works just passing through
result = result.detach().reshape(3,28,28)
img = ToPILImage()(result)
print('reconstruction with empty data:')
img.show()

Yo, I figured it out. I had to start with the XOR model and work my way up to the image, but I now understand what’s going on, and how I was messing up. It was a simple transpose I was missing.

Here’s my notebook, going from XOR to a single image, using two input neurons (plus some hidden do-dads) and a single output neuron.

The results are not amazing, and I’m now starting to wonder if Adam is the best optimiser for this task. I’ll look into SGD, as that seemed to give better results in the original author’s net/experiments.

I’m also wondering whether the weights are being initialised in the best way for this model, i.e. whether I need to init the weights in the range -1 to 1 instead of 0 to 1.
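
(If I do try that, something along these lines looks like it would re-initialise the Linear weights uniformly in the -1 to 1 range; I haven’t checked yet whether it actually helps:)

from torch import nn

def init_uniform(module):
    # re-initialise every Linear layer's weights and biases in [-1, 1]
    if isinstance(module, nn.Linear):
        nn.init.uniform_(module.weight, -1.0, 1.0)
        nn.init.uniform_(module.bias, -1.0, 1.0)

model.apply(init_uniform)  # `model` here would be the net from the notebook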

Does that make a bit more sense now? You’re getting the model to learn a continuous mapping function for brightness, so if you were to pass in a larger grid, say 128 x 128, you could still normalise x and y to 0-1 and the model would be able to fill in the gaps. There’s no need to map pixels directly onto actual nn inputs.
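
For anyone who lands on this later, here is roughly the shape of what I ended up with. This is a sketch along those lines rather than the exact notebook code (hidden sizes, learning rate and epoch count are just my guesses), and it reuses the inputs/targets tensors from the data-building sketch further up the thread:

import torch
from torch import nn
from torch.optim import Adam

torch.manual_seed(7331)

# (normX, normY, interp) in -> brightness out
model = nn.Sequential(
    nn.Linear(3, 28), nn.Sigmoid(),
    nn.Linear(28, 28), nn.Sigmoid(),
    nn.Linear(28, 1), nn.Sigmoid(),
)
optimizer = Adam(model.parameters(), lr=1e-3)
loss_func = nn.MSELoss()

# `inputs` is (num_pixels, 3) and `targets` is (num_pixels, 1),
# built as in the per-pixel data sketch earlier (both images stacked together)
for epoch in range(2000):
    optimizer.zero_grad()
    loss = loss_func(model(inputs), targets)
    loss.backward()
    optimizer.step()

# Render at any resolution by sampling a denser normalised coordinate grid;
# interp = 0.5 should give a blend halfway between the two digits.
def render(model, interp, size=128):
    ys, xs = torch.meshgrid(torch.linspace(0, 1, size),
                            torch.linspace(0, 1, size), indexing='ij')
    coords = torch.stack([xs.flatten(), ys.flatten(),
                          torch.full((size * size,), float(interp))], dim=1)
    with torch.no_grad():
        return model(coords).reshape(size, size)

halfway = render(model, 0.5)  # a 128x128 morph between the two digits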