How does backward work for EmbeddingBag?

Let’s assume we have

import torch
import torch.nn as nn

params = torch.cuda.FloatTensor(16384, 256)  # already on the GPU; the extra .cuda() was redundant
table = nn.EmbeddingBag.from_pretrained(params, freeze=False, sparse=True)  # freeze=False so the weight receives a gradient
idx = torch.randint(low=0, high=16384, size=(8192, 64), dtype=torch.int64).cuda()
ugrad = torch.cuda.FloatTensor(8192, 256)
tmp = table(idx)
tmp.backward(ugrad)

I assume calling the backward should behave as follows:

  1. Read the upstream gradient: 8192 x 256 x 4 bytes ≈ 8.4 MB read.
  2. Write the corresponding gradient value for each looked-up embedding row: 8192 x 64 x 256 x 4 bytes ≈ 537 MB written.
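The two volumes above are just byte-count arithmetic, which can be sanity-checked directly (sizes taken from the snippet; this says nothing about what the kernels actually do):

```python
# Expected read/write volumes for the backward pass (MB = 1e6 bytes).
batch, bag_size, dim, fp32_bytes = 8192, 64, 256, 4

upstream_read = batch * dim * fp32_bytes             # one gradient row per bag
per_index_write = batch * bag_size * dim * fp32_bytes  # one gradient row per looked-up index

print(upstream_read / 1e6)    # 8.388608   -> ~8.4 MB
print(per_index_write / 1e6)  # 536.870912 -> ~537 MB
```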

However, the behavior I see in Nsight suggests that my understanding is not accurate. In the profiler, running the backward launches two major kernels:

  1. indexSelectLargeIndex that reads 12 MB and writes 516 MB - similar to what I expect.
  2. unrolled_elementwise_kernel that reads 539 MB and writes 539 MB - I'm confused about what this kernel does.
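One way to reason about the split is to write out, in plain tensor ops, what a mean-mode backward has to produce per looked-up index (mean is EmbeddingBag's default mode). The mapping of these two steps onto the two profiled kernels is my guess from the kernel names, not something I've verified in the PyTorch source; the small CPU sketch below (illustrative sizes, dense gradient) only checks that the math itself matches autograd:

```python
import torch

rows, dim, batch, bag = 100, 16, 8, 4
weights = torch.randn(rows, dim)
# mode defaults to "mean"; freeze=False so the weight gets a gradient
emb = torch.nn.EmbeddingBag.from_pretrained(weights, freeze=False)
idx = torch.randint(0, rows, (batch, bag))
ugrad = torch.randn(batch, dim)

emb(idx).backward(ugrad)

# Step 1 (index-select-like): replicate each bag's upstream gradient row for
# every index in the bag -> one (dim,)-row per lookup, (batch, bag, dim).
per_index = ugrad.unsqueeze(1).expand(batch, bag, dim)
# Step 2 (elementwise): scale by 1/bag_size -- the backward of the mean.
per_index = per_index / bag

# Scatter-add the per-index rows into a dense weight gradient and compare.
manual = torch.zeros(rows, dim)
manual.index_add_(0, idx.reshape(-1), per_index.reshape(-1, dim))
print(torch.allclose(manual, emb.weight.grad, atol=1e-5))  # True
```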

I would appreciate it if someone could shed some light on how backward on embedding_bag operates and where these numbers come from.

Thank you!


I am not a specialist in EmbeddingBag, but from what I remember, the default mode performs averaging over the entries in each bag. So that second kernel would correspond to the backward of that averaging, no?

Thank you @albanD for your response.
I think the first kernel corresponds to the averaging. That's why it reads a rather small amount of data (the upstream gradients) and writes a large amount of data (the gradient values w.r.t. all of the vectors whose average was computed).
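For what it's worth, with sparse=True the gradient that backward materializes is a sparse COO tensor rather than a dense (16384, 256) buffer, which is one plausible place a large dense write of per-index gradient rows could come from. A small CPU example (illustrative sizes, assuming the default mean mode) shows the sparse gradient and checks it against the scatter-added dense one:

```python
import torch

rows, dim, batch, bag = 100, 16, 8, 4
emb = torch.nn.EmbeddingBag.from_pretrained(
    torch.randn(rows, dim), freeze=False, sparse=True)  # mode defaults to "mean"
idx = torch.randint(0, rows, (batch, bag))
ugrad = torch.randn(batch, dim)

emb(idx).backward(ugrad)
g = emb.weight.grad
print(g.is_sparse)  # True: a sparse COO gradient, not a dense (rows, dim) one

# Densifying it reproduces the usual scatter-added mean-mode gradient.
manual = torch.zeros(rows, dim)
manual.index_add_(0, idx.reshape(-1),
                  (ugrad.unsqueeze(1) / bag).expand(batch, bag, dim).reshape(-1, dim))
print(torch.allclose(g.to_dense(), manual, atol=1e-5))  # True
```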