Flatten vs unflatten concept


I am testing a simple linear model. I am not posting code, as this is more of a conceptual question in PyTorch. I have a 2D input with three features (columns) X, Y, Z, each with 100 samples. I feed it into the model:

Linear3 = 50, 1 (output is 100x1)

However, if I flatten the 2D input into a 1D vector of size (300,) and use the following model:

The flattened model converges faster and reaches the desired output in far fewer iterations, which is not the case for the unflattened model.

What I could think of is that in the unflattened model we have 3 features, whereas in the flattened case each sample is treated as a feature, so the flattened model is working in a high-dimensional space (300) instead of 3?

I would like to improve my conceptual understanding.


If you have 3 features, how can you input a vector of dimension 300?
Are you trying to feed the samples 100 at a time during training? If that’s the case, you should use a batch size of 100 rather than changing the network architecture.
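
To illustrate the batch-size suggestion, here is a minimal sketch. The random data and the single `nn.Linear(3, 1)` layer are assumptions for illustration only, not the model from the question. It shows that `batch_size=100` passes all 100 samples through in one step while the network keeps 3 input features.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical data: 100 samples, 3 features (X, Y, Z), one target per sample.
inputs = torch.randn(100, 3)
targets = torch.randn(100, 1)

# batch_size=100 delivers all 100 samples in a single step.
loader = DataLoader(TensorDataset(inputs, targets), batch_size=100)

# The architecture is unchanged: 3 input features, 1 output.
model = nn.Linear(3, 1)

batch_shapes = []
for x_batch, y_batch in loader:
    prediction = model(x_batch)
    batch_shapes.append((tuple(x_batch.shape), tuple(prediction.shape)))

print(batch_shapes)
```

The batch dimension (100) lives in the data, not in the layer sizes, so the same model would also accept batches of any other size.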

That is what I would also think. But again, the latter (network) is logically and fundamentally wrong.

May I ask why it is wrong? For linear layers, do we have to have a regression for each of the three features?

@Lelouch1, thanks. Even with an input of shape (100, 1) versus a flattened (100,), the flattened version converges faster. I know they are different inputs, but does it matter if the inputs are flattened, given that it converges faster for linear modelling?

Hi @Salman_Abbasi,

It seems that in the 1st case you’re defining a function that has 3 input nodes and 1 output node, and you’re passing 100 samples through it to learn a mapping between input and output (which is the correct way of modelling that data).

In the 2nd case, you have a network with 300 input nodes mapping to 100 output nodes. Here you’re treating every sample as an individual feature input to your network, so you’re basically saying that the 1st feature of every sample is an individual feature in its own right (which does not make sense at all).
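
As a rough illustration of how different the two setups are, the sketch below counts trainable parameters. It assumes each model is a single `nn.Linear` layer (the original models likely have hidden layers, so the exact numbers will differ); `count_parameters` is a hypothetical helper.

```python
from torch import nn


def count_parameters(module):
    """Total number of trainable scalars in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# 1st case: one sample (3 features) in, one value out.
per_sample_model = nn.Linear(3, 1)      # 3 weights + 1 bias

# 2nd case: all 100 samples flattened in, all 100 outputs out.
flattened_model = nn.Linear(300, 100)   # 300 * 100 weights + 100 biases

per_sample_params = count_parameters(per_sample_model)
flattened_params = count_parameters(flattened_model)

print(per_sample_params, flattened_params)
```

Under these assumptions the flattened layer has far more free parameters than there are target values (100), which is consistent with it fitting this one dataset quickly.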

The reason why the 2nd model works better is that you’re essentially training over all your data and memorizing the dataset, whereas in the 1st use case you have a function that takes any sample and maps it to an output. So, the network doesn’t memorize your data but is forced to learn a function underlying the true data distribution. For example, if you were to use, say, 50 samples, the 1st use-case model would work fine, but you’d get a mismatch on the input size for the 2nd model (because you’re treating your sample/batch size as an input, which is wrong!).
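
The 50-sample point can be checked with a small sketch. Again, the single-layer models and random tensors are assumptions for illustration, not the original code.

```python
import torch
from torch import nn

torch.manual_seed(0)

per_sample_model = nn.Linear(3, 1)      # 1st case: 3 features per sample
flattened_model = nn.Linear(300, 100)   # 2nd case: 100 samples baked into the layer

full = torch.randn(100, 3)   # 100 samples
half = torch.randn(50, 3)    # 50 samples

# The per-sample model handles any number of samples.
out_full = per_sample_model(full)
out_half = per_sample_model(half)

# The flattened model only accepts exactly 300 values.
flat_full = flattened_model(full.flatten())

try:
    flattened_model(half.flatten())   # 150 values instead of 300
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

print(out_full.shape, out_half.shape, flat_full.shape, mismatch_raised)
```

The flattened model is tied to the dataset size because the number of samples has become part of its architecture.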

TL;DR - 1st use-case is correct, 2nd use-case is wrong. 1st model is learning the true data (hence is subject to bias-variance trade-off). 2nd case is just memorizing the dataset directly, which is an incorrect approach!

@AlphaBetaGamma96, thanks for clearing that up. That really helped. Just curious: are there cases where we can use flattened inputs? Just to be clear, there is no second dimension in the inputs?

This is a conceptual question based on my work, where I am replacing machine learning with optimization. I do not intend to reuse the network. All I am doing is inputting data (preferably flattened), and the network outputs an improved version of it; we keep training until it reaches convergence (based on the constraints).

A follow-up question, based on your detailed explanation: what if I train on flattened data of 100 samples and then use another input of the same size? Would the network trained on the previous flattened data be able to generalize?


Hi @Salman_Abbasi,

Your input can be any size, but this “size” excludes the batch dimension, because your model should take a single sample and map it to an output of a given size. In natural language processing, for example, the inputs can be 2-dimensional (or 3-dimensional if you include the batch dimension). It also doesn’t make sense to me why you’d want to include all your samples as input nodes to your network, because 1) your network would depend on a fixed amount of data and 2) it isn’t learning how to map an input but the entire set of data, which isn’t typical in machine learning.
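
A small sketch of what "size excluding the batch dimension" means in practice: `nn.Linear` acts on the last dimension and treats every leading dimension as batch-like. The shapes below (8 sequences of 20 steps) are made up for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

layer = nn.Linear(3, 1)   # defined per sample: 3 features in, 1 value out

# A single sample, with no batch dimension at all.
single = layer(torch.randn(3))

# A batch of 100 samples.
batch = layer(torch.randn(100, 3))

# A batch of 8 sequences, each with 20 steps of 3 features
# (each sample is 2-D, and the batch makes the tensor 3-D).
sequence_batch = layer(torch.randn(8, 20, 3))

print(single.shape, batch.shape, sequence_batch.shape)
```

In each call the layer definition is identical; only the leading dimensions of the data change.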

If the two different datasets are related to each other, it might be able to learn something, but I doubt it. If it hasn’t seen the data (and there’s no link between the two datasets), it’ll most likely give random outputs. You could try this and see what happens, but I wouldn’t expect it to perform well.

Hi @AlphaBetaGamma96. Thanks again. You have clearly stated that this is not correct practice. But in my case, I do not want to reuse this network, just train it. I am trying to construct a seismic wavefield. Normally this is an optimization process where you input data and it is optimized; I am replacing that process. The output of the network will be the wave speed, which is fed to another network (another linear layer) to match the target. If it matches the target, we infer that the wave speed (the last-iteration output of the first network) is correct. The reason I am flattening is the convergence time: with the unflattened 2D input it takes close to a million iterations, whereas the flattened version converges in 4500 iterations. If you are familiar with wave equations, I can elaborate further. Thanks!