Let’s assume we have:

```python
import torch
import torch.nn as nn

params = torch.cuda.FloatTensor(16384, 256)  # 16384 x 256 fp32 embedding table on the GPU
table = nn.EmbeddingBag.from_pretrained(params, sparse=True)
table.requires_grad_(True)  # from_pretrained freezes the weight by default
idx = torch.randint(low=0, high=16384, size=(8192, 64),
                    dtype=torch.int64).cuda()  # 8192 bags of 64 indices each
ugrad = torch.cuda.FloatTensor(8192, 256)  # upstream gradient, one row per bag
tmp = table(idx)  # forward output: (8192, 256)
tmp.backward(ugrad)
```
I assumed that calling backward would behave as follows:
- Read the upstream gradient: 8192 x 256 x 4 bytes ≈ 8.4 MB read.
- Write the corresponding gradient row for each looked-up index: 8192 x 64 x 256 x 4 bytes ≈ 537 MB written (see the back-of-envelope check after this list).
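
For reference, here is the arithmetic behind those two numbers (variable names are mine):

```python
# Back-of-envelope traffic estimate for the backward pass.
bags, bag_size, dim, fp32 = 8192, 64, 256, 4
upstream_read = bags * dim * fp32               # 8,388,608 B  ~ 8.4 MB
per_index_write = bags * bag_size * dim * fp32  # 536,870,912 B ~ 537 MB
print(upstream_read / 1e6, per_index_write / 1e6)  # 8.388608 536.870912
```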
However, the behavior I see in Nsight suggests that my understanding is not accurate. In the profiler, running the backward launches two major kernels:
- `indexSelectLargeIndex`, which reads 12 MB and writes 516 MB. This is close to what I expect.
- `unrolled_elementwise_kernel`, which reads 539 MB and writes 539 MB. I’m confused about what this kernel does.
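
My best guess, which I’d like confirmed: the first kernel is an index_select that materializes one 256-float row of the upstream gradient per looked-up index (the 12 MB read would then be `ugrad` at ~8.4 MB plus the int64 indices at 8192 x 64 x 8 bytes ≈ 4.2 MB), and the second kernel is an elementwise pass over that materialized tensor, e.g. the 1/bag_size scaling that the default `mode='mean'` would need. A rough PyTorch-level emulation of that hypothesis (not the actual implementation):

```python
# Hypothetical emulation of what the sparse backward might materialize.
# One row of ugrad for each of the 8192 * 64 looked-up indices:
row_ids = torch.arange(8192, device="cuda").repeat_interleave(64)
values = ugrad.index_select(0, row_ids)  # (524288, 256): ~537 MB written
# A single elementwise op over `values` (e.g. dividing by the bag size
# for mode='mean') reads ~537 MB and writes ~537 MB, which is in the
# same ballpark as the 539 MB in/out of unrolled_elementwise_kernel.
values = values / 64
```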
I would appreciate it if someone could shed some light on how the backward of `embedding_bag` operates and where these numbers come from.