Computation of nn.Linear and nn.Embedding

TL;DR: I don't know how I should picture the input being processed, whether as a single vector or as a sequence of vectors, because the multiplications involved look different in each case.

import torch
import torch.nn as nn

lin = nn.Linear(6, 10, True)  # in_features=6, out_features=10, bias=True
W = lin.weight  # BUT: W will be a 10 x 6 matrix (out_features x in_features)!
bias = lin.bias

I want to manually compute what nn.Linear does to understand what is happening.
In PyTorch, vectors are ROW vectors. In math, people write them as COLUMN vectors. So funny stuff happens when you try to code up a mathematical formulation.

x = torch.randn(6) # 6 dimensional ROW vector.
out1 = lin(x) # 10 dimensional ROW vector.

If we want to get this manually, we do:

out2 = torch.matmul(W, x) + bias

But hold on! This computation shouldn't make sense, because W is a 10 x 6 matrix and x looks like a 1 x 6 matrix. But because x is just a 1-D tensor of dimension 6, PyTorch is able to treat it like a 6 x 1 matrix (COLUMN vector) here. That's what the docs say. It doesn't work if you use torch.mm (which only accepts 2-D matrices), and as far as I can tell @ is just shorthand for torch.matmul, so it behaves the same way.
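As a sanity check (reusing the lin, W, bias and x from above; out3 is just a name I made up for the other ordering), both manual versions match the layer output:

out3 = torch.matmul(x, W.T) + bias   # (6,) @ (6, 10) -> (10,)
print(torch.allclose(lin(x), out2))  # True
print(torch.allclose(lin(x), out3))  # True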

So let's try again, but with a different x.

x = torch.rand(2, 6) # Two 6-dimensional row vectors. 
out1 = lin(x) # Works
out2 = torch.matmul(x, W.T) + bias # Now all of a sudden we have to care about the dimensions again

So I struggle to understand how I should imagine the input being processed. When I model the layer, can I think of the input as a single vector? But during training you can pass in a whole tensor of vectors, and I don't know whether it computes the result sequentially for each individual vector like before, or does some fancy transpose to pull it off in one go. I really hate that I have no idea what it does.
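One way to check this, as a rough sketch with the same lin and the 2 x 6 x from above (out_rows is just my own name), is to compare the batched call against a per-row loop:

out_rows = torch.stack([lin(row) for row in x])  # apply the layer to each row vector separately
print(torch.allclose(lin(x), out_rows))          # True: the batch output is just the per-row outputs stacked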

According to the docs, it computes: y = x A^T + b.
However, it's unclear what A is. Is it `lin.weight`, or does it have the dimensions I gave to the constructor?
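If A really is lin.weight (the out_features x in_features matrix), then the functional form should reproduce the layer output. A quick check with the same lin and x as above (out_f is my own name):

import torch.nn.functional as F

out_f = F.linear(x, lin.weight, lin.bias)  # computes x @ lin.weight.T + lin.bias
print(torch.allclose(lin(x), out_f))       # True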

To connect this back to nn.Embedding

embed = nn.Embedding(8, 6)  # num_embeddings=8, embedding_dim=6
E = embed.weight  # THIS IS AN 8 x 6 matrix

So the transposing stuff doesn't happen here. At least I have somewhat more control, because the only thing nn.Embedding does is generate a random matrix and let me access its rows by passing in a tensor of indices. Mathematically, that is equivalent to multiplying that matrix with one-hot vectors (all entries zero except a single 1), but if a computer can access the rows by index anyway, it seems fine to omit that multiplication. Still a bit annoying that one layer transposes its weight and the other does not.
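Just to convince myself of the one-hot equivalence, a small sketch with the embed and E from above (idx, one_hot and the out_* names are mine):

idx = torch.tensor([3, 5])
one_hot = torch.nn.functional.one_hot(idx, num_classes=8).float()  # shape (2, 8)
out_lookup = embed(idx)   # row lookup, shape (2, 6)
out_matmul = one_hot @ E  # same rows via matrix multiplication
print(torch.allclose(out_lookup, out_matmul))  # True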

I'm not sure if this is what you mean, but the difference in your tests likely comes from inconsistent dims:

x = torch.rand(2)    # this creates a 1D tensor of shape (2,): [a, b]
x = torch.rand(1, 2) # this creates a 2D tensor of shape (1, 2): [[a, b]]

The reason torch might be able to treat the first x as needed is that there is no 2nd dim defined.

If W = lin.weight is 10 x 2 (i.e. the layer is nn.Linear(2, 10)), x must be n x 2 and the output will be n x 10.
If you pass x with only 1 dim, torch likely just treats it as a single row (conceptually an x.unsqueeze(0)). The weight itself is initialized with something like torch.empty(out_size, in_size), which is why even if you init the layer as nn.Linear(2, 10) the weight matrix will be 10 x 2 → exactly the reason you had to write W.T in your approach.
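A quick sketch of what I mean (lin2 is a fresh layer so it doesn't clash with your lin):

lin2 = nn.Linear(2, 10)
print(lin2.weight.shape)               # torch.Size([10, 2]) -> (out_features, in_features)
x1 = torch.rand(2)                     # 1D input, no batch dim
x2 = torch.rand(5, 2)                  # batch of 5 row vectors
print(lin2(x1).shape, lin2(x2).shape)  # torch.Size([10]) torch.Size([5, 10])
print(torch.allclose(lin2(x2), x2 @ lin2.weight.T + lin2.bias))  # True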

To answer your question about the fancy transpose:
Not quite; the weight is simply created already transposed, i.e. stored with shape (out_size, in_size) rather than (in_size, out_size), and the forward formula y = x @ W.T accounts for that.

In short: if your first tensor had been 2D you would probably have had to care about the dims again, but without a 2nd dim torch likely handles it as needed.

For your embedding question, I don't think it's comparable: arg 1 in nn.Embedding is the number of embeddings (the max token count), not necessarily the input size, and arg 2 is the embedding dim, which in this case is basically the output dimension.

E.g. if you create an embedding layer with (10, 64), you can pass in torch.randint(0, 10, (n,)) and will get the embeddings with shape (n x 64), or you could even pass input = torch.randint(0, 10, (n, m)) and you will get output of shape (n x m x 64).
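A small sketch of those shapes (embed2 and the idx_* names are just my examples):

embed2 = nn.Embedding(10, 64)
idx_1d = torch.randint(0, 10, (4,))    # n = 4
idx_2d = torch.randint(0, 10, (4, 3))  # n = 4, m = 3
print(embed2(idx_1d).shape)            # torch.Size([4, 64])
print(embed2(idx_2d).shape)            # torch.Size([4, 3, 64])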

Disclaimer: this is just what I think is happening; I have not looked into the documentation.