Why does the tensor dimension not seem to matter here?

Hello everybody,

In this NMT Tutorial (section “Training the Model”), the initial decoder_input is defined like this:

decoder_input = torch.tensor([[SOS_token]], device=device)

This returns a tensor that looks like this:

tensor([[0]], device='cuda:0') torch.Size([1, 1])


Now, after doing a forward pass in the Decoder a few lines below, the next decoder_input is defined like this:

decoder_input = target_tensor[di]

This returns a tensor that looks like this:

tensor([129], device='cuda:0') torch.Size([1])


For context, the decoder input is fed into the forward() function as inp:

    def forward(self, inp, hidden, encoder_outputs):
        embedded = self.embedding(inp).view(1, 1, -1)
        ...

The embedding layer is defined as this:

self.embedding = nn.Embedding(self.output_size, self.hidden_size)

(output_size is the vocab size of the output language, i.e. the decoder's target vocabulary.)
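
For reference, here is a minimal shape check of the two inputs going through such an embedding layer (hidden_size = 256 and vocab_size = 1000 are just placeholder values, not necessarily the tutorial's):

import torch
import torch.nn as nn

SOS_token = 0        # as in the tutorial
hidden_size = 256    # placeholder embedding dimension
vocab_size = 1000    # placeholder vocabulary size

embedding = nn.Embedding(vocab_size, hidden_size)

a = torch.tensor([[SOS_token]])  # shape [1, 1], the initial decoder_input
b = torch.tensor([129])          # shape [1], like target_tensor[di]

# nn.Embedding keeps the input's dimensions and appends embedding_dim:
print(embedding(a).shape)  # torch.Size([1, 1, 256])
print(embedding(b).shape)  # torch.Size([1, 256])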

Why does the difference in shape not matter here? If it does, what should I use and why?

Hi,
In the simplest terms, nn.Embedding(4, 3) will act as a look-up table for 3-dimensional vectors corresponding to 4 indices 0, 1, 2, 3.

The input shape matters inasmuch as it determines the output shape, like so:

import torch
import torch.nn as nn

x = torch.tensor([1, 2, 0, 3])    # a single vector, 1-dimensional
y = torch.tensor([[1, 2, 0, 3]])  # batch_size * sequence_length
emb = nn.Embedding(4, 3)
x_emb = emb(x)
y_emb = emb(y)
print((x_emb, y_emb))

gives:

(tensor([[ 0.4080,  1.3991,  1.1883],
         [-0.3503,  0.1206,  0.2660],
         [ 0.8378, -0.3656,  1.6117],
         [ 1.5515, -1.0316,  1.6244]], grad_fn=<EmbeddingBackward0>),
 tensor([[[ 0.4080,  1.3991,  1.1883],
          [-0.3503,  0.1206,  0.2660],
          [ 0.8378, -0.3656,  1.6117],
          [ 1.5515, -1.0316,  1.6244]]], grad_fn=<EmbeddingBackward0>))

Notice the difference in output shapes.
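
Printing the shapes directly makes the point explicit:

print(x_emb.shape)  # torch.Size([4, 3])     -> (sequence_length, embedding_dim)
print(y_emb.shape)  # torch.Size([1, 4, 3])  -> (batch_size, sequence_length, embedding_dim)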

I think using an extra dimension for the batch_size brings clarity. Nevertheless, it depends on your use case.

Batch size, that makes sense. I see your point, thanks for your quick reply!
If I understand you right, this would mean that y in your example corresponds to a batch size of 1.
And since the output of the embedding layer is reshaped using .view(1, 1, -1), the final output of the embedding layer (embedded) will always have the same shape, regardless of the extra dimension for the batch size.

Right.

Yes, the output from the embedding layer will have the shape (1, 1, x), where x depends on the batch size, sequence length, and embedding dimension, since .view(1, 1, -1) flattens everything into the last dimension.
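
As a quick sanity check of that conclusion (again with placeholder sizes):

import torch
import torch.nn as nn

embedding = nn.Embedding(1000, 256)  # placeholder vocab_size, hidden_size

a = torch.tensor([[0]])   # shape [1, 1]
b = torch.tensor([129])   # shape [1]

# both input shapes collapse to the same shape after .view(1, 1, -1):
print(embedding(a).view(1, 1, -1).shape)  # torch.Size([1, 1, 256])
print(embedding(b).view(1, 1, -1).shape)  # torch.Size([1, 1, 256])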