If the first linear layer has
in_features = 1 and I input
[1, 2, 3] into the model, how will that linear layer be trained? Will it be trained independently on 1, 2, and 3, so that the layer keeps track of the gradient for each input and the optimizer then uses the average of all their gradients? If so, is there a way to tell the optimizer to use a custom function instead of the average to combine them?
If the first linear layer has
nn.Linear just wraps a matrix multiplication, where the right matrix is a trainable in_features x out_features map. So the left matrix (the input) must have in_features columns (the last dimension) and an arbitrary number of rows (non-last dimensions are reshaped into a single one).
I assume your [1,2,3] denotes a Python list. Normal conversion to a tensor would produce a tensor with 3 columns, which is not compatible with in_features=1. Adding an extra trailing dimension (shape (3,1)) would enable independent training, applying the same transform to all three scalars.
I’m not sure I understand what you want from the gradients, but if your final network output still has that 3-element dimension, you can scale the loss along that dimension; this is similar to having weighted samples.
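Just to illustrate the shape handling (the out_features size here is made up):

```python
import torch
import torch.nn as nn

# nn.Linear(1, 4) expects the last dimension of the input to be 1,
# so the list [1, 2, 3] is given a trailing dimension: shape (3, 1).
layer = nn.Linear(1, 4)
x = torch.tensor([1.0, 2.0, 3.0]).unsqueeze(-1)  # shape (3, 1)
y = layer(x)  # same 1 -> 4 transform applied to each scalar, shape (3, 4)
```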
How are the values updated by default? Will it just be an average of the 3 gradients, as if they formed a batch in gradient descent?
but if your final network output still has that 3 element dimension
What if the output has a 1-element dimension? Then how will the previous layers with 3 elements train?
…having weighted samples
How do you do that in PyTorch?
Thank you for your help
This means that you combine elements somewhere, so they’re not independent. Your 3 intermediate outputs are just like hidden features, so it makes less sense to train them as independent (a 1x3 vs. a 3x3 map) or to scale their gradients, as they’ll be entangled later anyway.
PyTorch loss classes have a reduction='mean' argument. If you use reduction='none', you can adjust the elementwise losses before computing the mean.
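A minimal sketch of that (the weight values here are just illustrative):

```python
import torch
import torch.nn as nn

# reduction='none' keeps the per-element losses, so they can be
# weighted before reducing to a scalar.
criterion = nn.MSELoss(reduction='none')
pred = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([1.5, 2.0, 2.0])
weights = torch.tensor([0.2, 0.3, 0.5])  # hypothetical sample weights

elementwise = criterion(pred, target)     # shape (3,)
loss = (elementwise * weights).mean()     # scalar, backprop as usual
```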
…(1x3 vs 3x3 map) or scale gradients, as they’ll be entangled later anyway
Where does 3x3 come from?
What does it mean to scale gradients? Do you mean in a weighted sum?
I don’t understand what is “entangled” and why it affects scaling gradients.
Sorry, that was incorrect. With input shape (3,1) you have a 1 x out_features map. With the usual shape (*,3) you have a 3 x out_features map.
If you multiply the elementwise losses by some tensor C, all gradients are also multiplied by C. So that’s one of the ways to change gradients.
Never mind. Basically, for a 3d -> 1d function expressed with a neural net, the 3 inputs usually only work together.
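The loss-scaling point can be checked directly on a toy example:

```python
import torch

# Scaling elementwise losses by a tensor C scales the corresponding
# gradients by C as well (chain rule through the multiplication).
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
C = torch.tensor([1.0, 2.0, 0.5])

loss = (x ** 2 * C).sum()
loss.backward()
# d/dx of x^2 * C is 2 * x * C, i.e. each gradient carries its C factor
```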
But if I do have 3D -> 1D, then how will the training work? If the second layer has 3D but the output is 1D, how will the weights for the second layer get updated? Will it calculate 3 separate gradients for all 3 elements only for the second layer, while deeper layers have only 1 shared gradient?
That sounds right.
But this description somehow ignores the hidden layer and the non-linearity needed to get a universal function approximator.
So a minimalistic network looks like 3->64->relu->1; it has the linear maps 3x64 and 64x1. The thing is, your proposed 1d linear transformations, prepended to this network, won’t do anything beyond what the 3x64 map does.
You can of course prepend 1d non-linear transformations to that network, but I don’t see the point.
My goal is for the network to accept an arbitrary number of inputs and to learn how to do inference with information from all of the inputs. So if I apply a 1x5 linear transformation to the input
[, , ], the output will be a 3x5 tensor
[[a,b,c,d,e],[f,g,h,i,j],[k,l,m,n,o]]. Then if I aggregate the 3 outputs into
[[v,w,x,y,z]], the later 5xM layers will learn to use information from all inputs, while the early layer will learn to represent them in a way that allows for a useful aggregate.
Am I misunderstanding something? Can you elaborate so I can understand this better?
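A minimal sketch of that pipeline, assuming mean() as the aggregate (the aggregation function and the layer sizes M=2 are my own placeholders, not fixed by the discussion):

```python
import torch
import torch.nn as nn

encode = nn.Linear(1, 5)   # the per-input 1x5 map
head = nn.Linear(5, 2)     # a later 5xM layer, M=2 here

inputs = torch.tensor([[1.0], [2.0], [3.0]])  # any number of rows works
features = encode(inputs)                     # shape (3, 5)
pooled = features.mean(dim=0, keepdim=True)   # aggregate to shape (1, 5)
out = head(pooled)                            # shape (1, 2)
```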
Ok, with variable input lengths this makes some sense. But you’ll probably need a deep network (MLP) for the 1d->5d transformation, as a 1x5 linear map just represents a line in 5d space; I’m not sure that’s of any use with your aggregation method. In most scenarios another linear map 5xH will be applied to that 5-column tensor, so at least some non-linearity is needed
(otherwise (input @ 1x5) @ 5xH = input @ (1x5 @ 5xH) = input @ (1xH), and the first linear layer is not needed).
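That collapse is easy to verify numerically (H=7 is an arbitrary choice):

```python
import torch

# Two stacked linear maps with no non-linearity in between
# are equivalent to a single linear map, by associativity.
x = torch.randn(3, 1)
A = torch.randn(1, 5)   # the proposed 1x5 layer
B = torch.randn(5, 7)   # a following 5xH layer, H=7 here

out_two = (x @ A) @ B   # two layers
out_one = x @ (A @ B)   # one combined 1x7 layer
```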
Alright, that makes sense!
Can you point me to the PyTorch docs that imply the gradients of the first layers (before aggregation) will be averaged?
Do you have a better idea for doing this kind of aggregation?
I don’t know if such a doc exists. That’s just how gradients work with tensor broadcasting (if a scalar from one of the inputs is used multiple times, you sum the relevant derivatives).
That’s a bit too abstract. Without positional info, you’re basically limited to mean() and exponential smoothing; self-attention, maybe. If you allow input position to affect the output, you’re in the realm of the usual sequence handlers: RNNs, transformers/attention, causal convolutions, etc.
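The broadcasting point can be seen in a tiny example:

```python
import torch

# A broadcast scalar parameter is "used" once per input element,
# so its gradient is the sum of the per-use derivatives.
w = torch.tensor([2.0], requires_grad=True)   # shape (1,), broadcast over x
x = torch.tensor([1.0, 2.0, 3.0])

loss = (w * x).sum()
loss.backward()
# w.grad accumulates d(loss)/dw from every use: 1 + 2 + 3 = 6
```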
Hmm, I guess I will read the tensor broadcasting docs then.
In my problem only the first input matters; the order of the subsequent inputs (2nd, 3rd, 4th, etc.) does not matter, so I think we can use a weighted sum. How would exponential smoothing and self-attention work here?
Exponential smoothing is just a recursive weighted mean, but it is not permutation invariant; to model permutation invariance you should use a simple mean(). If the first input is special, you’re free to handle it separately.
Self-attention: I don’t understand it well enough to elaborate. Its use for aggregation is described in the Attentive Neural Processes paper.