Difference between F.embedding and indexing

import torch
import torch.nn.functional as F

w = torch.rand(10000, 256)
x = torch.tensor([0, 1, 2])

y1 = F.embedding(x, w)
y2 = w[x]

print((y1-y2).abs().sum().item())
# Out: 0

It seems like F.embedding and NumPy-style indexing can be used interchangeably, but F.embedding is about 20% faster. Are there any other differences between these two methods, for example in their behavior during the backward pass?
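
One thing I tried, to at least sanity-check the backward pass, is comparing the gradients the two lookups leave on identical copies of the weight (just a quick sketch; I'd expect both to accumulate the same dense gradient into w):

w1 = torch.rand(10000, 256, requires_grad=True)
w2 = w1.detach().clone().requires_grad_(True)
x = torch.tensor([0, 1, 2])

F.embedding(x, w1).sum().backward()
w2[x].sum().backward()

# should print 0 if both paths produce the same dense gradient on the weight
print((w1.grad - w2.grad).abs().sum().item())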

Apparently the weight parameter of F.embedding has to be a 2D tensor; otherwise it fails to backprop:

w = torch.rand(10000, 10, 256, requires_grad=True)  # note: 3D weight
o = torch.rand(256, 10000, requires_grad=True)      # output projection

x = torch.tensor([0, 1, 2])
y = F.embedding(x, w)
logits = y.mean(dim=1) @ o
loss = F.cross_entropy(logits, x)
loss.backward()
# RuntimeError: shape '[3, 256]' is invalid for input of size 7680
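
For comparison, plain indexing seems to handle the 3D weight without complaint (same setup as above, just swapping F.embedding for w[x]):

w = torch.rand(10000, 10, 256, requires_grad=True)
o = torch.rand(256, 10000, requires_grad=True)

x = torch.tensor([0, 1, 2])
y = w[x]                        # shape [3, 10, 256]
logits = y.mean(dim=1) @ o      # shape [3, 10000]
loss = F.cross_entropy(logits, x)
loss.backward()                 # no error; w.grad has shape [10000, 10, 256]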

I’ve also discovered that multiplying the embedding matrix with one-hot vectors is much faster than F.embedding, but creating the one-hot vectors with F.one_hot is extremely slow. Instead, I preallocate a “whiteboard” tensor filled with zeros and scatter the indices onto it to build the one-hot vectors; this is much faster than creating a new tensor at each training step.

batch_size = 128
seq_len = 100
vocab_size = 10000
w = torch.rand(vocab_size, 256, requires_grad=True)
whiteboard = torch.zeros([batch_size, seq_len, vocab_size])

for i in range(100):
  x = torch.randint(vocab_size, size=[batch_size, seq_len])
  onehot = whiteboard.scatter(2, x[:,:,None], 1)  # out-of-place scatter: whiteboard itself stays all zeros
  # onehot = F.one_hot(x, vocab_size).float()  # very slow
  
  y = onehot @ w
  # y = F.embedding(x, w)

  # ...
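
For anyone who wants to reproduce the comparison, this is roughly how I timed the variants on CPU (a rough sketch with a throwaway bench helper; the numbers will depend on hardware, dtypes, and shapes, and CUDA timing would additionally need torch.cuda.synchronize()):

import time

batch_size, seq_len, vocab_size = 128, 100, 10000
w = torch.rand(vocab_size, 256)
whiteboard = torch.zeros([batch_size, seq_len, vocab_size])
x = torch.randint(vocab_size, size=[batch_size, seq_len])

def bench(fn, iters=20):
    # crude wall-clock timing, forward pass only
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return time.perf_counter() - t0

print("F.embedding   :", bench(lambda: F.embedding(x, w)))
print("indexing      :", bench(lambda: w[x]))
print("scatter + @   :", bench(lambda: whiteboard.scatter(2, x[:, :, None], 1) @ w))
print("F.one_hot + @ :", bench(lambda: F.one_hot(x, vocab_size).float() @ w))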