Reshape tensors to preserve variable information/structure

elemeo · December 27, 2021, 10:20am

I have a multidimensional dataset of shape (1827, 5).
From that dataset I extract the variable I want to predict, and I'm left with my X and y variables as follows:
X has size (1827, 4) and y has size (1827,).

Then I further split them into train and test datasets, giving 20% to the test dataset. The shapes I now have are these:

> print(X_train.shape)
> print(y_train.shape)
> print(X_test.shape)
> print(y_test.shape)

torch.Size([1461, 4])
torch.Size([1461])
torch.Size([366, 4])
torch.Size([366])

The batch size I have to use (I cannot change that) is 50.

My question is: when reshaping a tensor to fit my model, shouldn't I take the original dimensions into account? By dividing the train dataset into batches of 50, I will surely put data that belongs to the same observation into different batches, simply because 50 is not exactly divisible by 4.

My network architecture is this:

class NeuralNetwork(nn.Module):

    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.hidden1 = torch.nn.Linear(50, 25) # hidden layer
        self.hidden2 = torch.nn.Linear(25, 25) # hidden layer
        self.out = torch.nn.Linear(25, 1)      # output layer

    def forward(self, x):
        z = F.relu(self.hidden1(x)) # activation function for first hidden layer
        z = F.relu(self.hidden2(z)) # activation function for second hidden layer
        z = self.out(z)             # linear output
        return z

The first layer takes 50 as input because the batch size is 50, and the output layer outputs 1 because I'm doing regression. For the in-between layers I chose 25 because I read that a good number is the median between the input and output sizes.

This problem came to be because I tried training the model with the data as is and I got this error:

mat1 and mat2 shapes cannot be multiplied (50x4 and 50x25)

I know that changing the layers to output 4 will solve it, but I'm wondering if it's better to just reshape the data to (50, 25) instead.

ptrblck · December 30, 2021, 2:38am

This seems to be wrong, since the layer dimensions do not depend on the batch size.
You should define the layers using their expected input features, not the number of samples they would see during training/inference.
Based on your initial description you are dealing with 1827 samples where each sample has 4 features. If that’s the case, use self.hidden1 = torch.nn.Linear(4, 25) and let the DataLoader create the batches, each with a shape of [50, 4] (except the last one, which might be smaller).
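A minimal sketch of that setup, assuming random stand-in tensors with the training shapes from the question (the real data is not shown in the thread). The DataLoader slices along the first dimension only, so every row keeps its 4 features together:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden1 = nn.Linear(4, 25)   # in_features = 4 features per sample
        self.hidden2 = nn.Linear(25, 25)  # hidden width is a free choice
        self.out = nn.Linear(25, 1)       # single regression output

    def forward(self, x):
        z = F.relu(self.hidden1(x))
        z = F.relu(self.hidden2(z))
        return self.out(z)


# Assumption: random placeholder data with the shapes printed in the question.
X_train = torch.randn(1461, 4)
y_train = torch.randn(1461)

loader = DataLoader(TensorDataset(X_train, y_train), batch_size=50, shuffle=True)
model = NeuralNetwork()

batch_shapes = []
for xb, yb in loader:
    pred = model(xb)  # [batch, 4] -> [batch, 1]
    batch_shapes.append((tuple(xb.shape), tuple(pred.shape)))

print(len(batch_shapes), batch_shapes[0], batch_shapes[-1])
```

With 1461 training samples and a batch size of 50 this yields 29 full batches of shape [50, 4] plus a final batch holding the remaining 11 samples. Note that pred has shape [batch, 1], so the target would need yb.unsqueeze(1) (or the prediction a squeeze(1)) before computing a loss such as nn.MSELoss.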

elemeo · December 30, 2021, 9:42am

Alright, so the first hidden layer is based on the input features. However, how was the 25 chosen? Also, since the first layer outputs 25 features, the second layer will have an input of 25, but with what output?

Other than that, which I understand, I’m a bit confused about what happens to variable information when batching splits an observation across multiple batches.

ptrblck · December 30, 2021, 9:47am

I’ve reused the out_features from your code snippet as you’ve picked this number of output features.

Yes, that’s correct. The number of output features again depends on your choice.

Most layers process the samples of a batch independently, so you would only see the expected absolute error due to the limited numerical precision in their outputs. Exceptions are e.g. batchnorm layers, which calculate the stats used to normalize the inputs from the entire batch.
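A small sketch to illustrate this, using random inputs and an arbitrary seed (both are assumptions made for the example). A linear layer gives the same result for a sample whether it is processed inside a batch of 50 or alone, while a batchnorm layer in training mode does not:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(50, 4)  # one batch: 50 samples, 4 features each

# Linear layer: each row is transformed independently of the others.
layer = nn.Linear(4, 25)
with torch.no_grad():
    batched = layer(x)
    one_by_one = torch.cat([layer(x[i:i + 1]) for i in range(50)])

max_abs_diff = (batched - one_by_one).abs().max().item()
same = torch.allclose(batched, one_by_one, atol=1e-5)

# BatchNorm in training mode: statistics come from the whole batch,
# so the output for sample 0 depends on which other samples are present.
bn = nn.BatchNorm1d(4)
with torch.no_grad():
    full = bn(x)[0]        # sample 0 normalized with stats of all 50 samples
    small = bn(x[:2])[0]   # sample 0 normalized with stats of 2 samples
bn_same = torch.allclose(full, small, atol=1e-5)

print(same, max_abs_diff, bn_same)
```

Any nonzero max_abs_diff here comes from floating-point rounding only, whereas the batchnorm outputs differ because the normalization statistics themselves change with the batch contents.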

elemeo · December 30, 2021, 9:51am

Alright perfect, that helped a lot, thank you!

Kling · January 4, 2022, 10:04am

I was doing that wrong. Thanks, this is the info I was looking for.