Let’s assume we have the following setup:

import torch
import torch.nn as nn

params = torch.randn(16384, 256, device='cuda')
table = nn.EmbeddingBag.from_pretrained(params, sparse=True)
table.requires_grad_(True)
idx = torch.randint(low=0, high=16384, size=(8192, 64), dtype=torch.int64, device='cuda')
ugrad = torch.randn(8192, 256, device='cuda')
tmp = table(idx)
tmp.backward(ugrad)
I assumed that calling backward would behave as follows:
- Read the upstream gradient: 8192 x 256 x 4 bytes ≈ 8.4 MB read.
- Write one gradient row for each index looked up by the embedding bag: 8192 x 64 x 256 x 4 bytes ≈ 537 MB written.
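The arithmetic behind those two numbers can be checked quickly (assuming float32, i.e. 4 bytes per element):

```python
# Back-of-envelope check of the expected memory traffic in backward.
batch, bag_size, dim, fp32 = 8192, 64, 256, 4

# Upstream gradient read once: one (batch x dim) float32 tensor.
upstream_read = batch * dim * fp32
# One dim-wide gradient row written per looked-up index.
grad_write = batch * bag_size * dim * fp32

print(upstream_read / 1e6)  # ≈ 8.4 MB
print(grad_write / 1e6)     # ≈ 536.9 MB
```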
However, profiling backward in Nsight suggests that my understanding is not accurate. In the profiler, running backward launches two major kernels:

1. indexSelectLargeIndex, which reads 12 MB and writes 516 MB, roughly in line with what I expect.
2. unrolled_elementwise_kernel, which reads 539 MB and writes 539 MB; I am not sure what this kernel does.
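For reference, the gradient that backward actually materializes can be inspected directly. A scaled-down CPU sketch of the same setup (sizes chosen arbitrarily); if my understanding is right, the sparse gradient holds one value row per looked-up index, which is what the ~537 MB figure would correspond to at full size:

```python
import torch
import torch.nn as nn

# Scaled-down CPU version of the setup above (sizes are arbitrary).
params = torch.randn(128, 8)
table = nn.EmbeddingBag.from_pretrained(params, sparse=True)
table.requires_grad_(True)
idx = torch.randint(low=0, high=128, size=(16, 4), dtype=torch.int64)
ugrad = torch.randn(16, 8)

table(idx).backward(ugrad)
g = table.weight.grad

print(g.is_sparse)               # True: sparse=True yields a sparse COO gradient
# Before coalescing, there is one value row per looked-up index
# (here 16 bags x 4 indices = 64 rows of width 8).
print(tuple(g._values().shape))
```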
I would appreciate it if someone could shed some light on how the backward pass of embedding_bag operates and where these numbers come from.
Thank you!