Let’s assume we have the following setup:

import torch
import torch.nn as nn

params = torch.randn(16384, 256, device='cuda')
table = nn.EmbeddingBag.from_pretrained(params, sparse=True)
table.requires_grad_(True)
idx = torch.randint(low=0, high=16384, size=(8192, 64), dtype=torch.int64, device='cuda')
ugrad = torch.randn(8192, 256, device='cuda')
tmp = table(idx)
tmp.backward(ugrad)
I assumed that calling backward would behave as follows:
- Read the upstream gradient: 8192 x 256 x 4 bytes ≈ 8.4 MB read.
- Write one gradient row for each index looked up by the embedding bag: 8192 x 64 x 256 x 4 bytes ≈ 537 MB written.
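The arithmetic behind those two numbers can be checked quickly (assuming float32, i.e. 4 bytes per element):

```python
# Back-of-envelope check of the expected memory traffic in backward.
batch, bag_size, dim, fp32 = 8192, 64, 256, 4

# Upstream gradient read once: one (batch x dim) float32 tensor.
upstream_read = batch * dim * fp32
# One dim-wide gradient row written per looked-up index.
grad_write = batch * bag_size * dim * fp32

print(upstream_read / 1e6)  # ≈ 8.4 MB
print(grad_write / 1e6)     # ≈ 536.9 MB
```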
However, profiling backward in Nsight suggests that my understanding is not accurate. In the profiler, running backward launches two major kernels:

1. indexSelectLargeIndex, which reads 12 MB and writes 516 MB, roughly in line with what I expect.
2. unrolled_elementwise_kernel, which reads 539 MB and writes 539 MB; I am not sure what this kernel does.
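For reference, the gradient that backward actually materializes can be inspected directly. A scaled-down CPU sketch of the same setup (sizes chosen arbitrarily); if my understanding is right, the sparse gradient holds one value row per looked-up index, which is what the ~537 MB figure would correspond to at full size:

```python
import torch
import torch.nn as nn

# Scaled-down CPU version of the setup above (sizes are arbitrary).
params = torch.randn(128, 8)
table = nn.EmbeddingBag.from_pretrained(params, sparse=True)
table.requires_grad_(True)
idx = torch.randint(low=0, high=128, size=(16, 4), dtype=torch.int64)
ugrad = torch.randn(16, 8)

table(idx).backward(ugrad)
g = table.weight.grad

print(g.is_sparse)               # True: sparse=True yields a sparse COO gradient
# Before coalescing, there is one value row per looked-up index
# (here 16 bags x 4 indices = 64 rows of width 8).
print(tuple(g._values().shape))
```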
I would appreciate it if someone could shed some light on how the backward pass of embedding_bag operates and where these numbers come from.
Thank you!