EmbeddingBag vs Embedding performance

I’ve been trying to use the new EmbeddingBag layer to improve the performance of parts of my models where I first perform indexing into an Embedding layer, then sum or mean operations on the resulting embeddings.
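For context, the two formulations are meant to compute the same thing; a quick equivalence check, written against the current nn.Embedding / nn.EmbeddingBag API rather than the 0.2.0-era code below:

```python
import torch
from torch import nn

torch.manual_seed(0)
num_embeddings, dim = 50, 8
weight = torch.randn(num_embeddings, dim)

# Give both layers identical weights
emb = nn.Embedding(num_embeddings, dim)
bag = nn.EmbeddingBag(num_embeddings, dim, mode='sum')
with torch.no_grad():
    emb.weight.copy_(weight)
    bag.weight.copy_(weight)

indices = torch.randint(0, num_embeddings, (4, 10))  # 4 bags of 10 indices each

# Embedding lookup followed by an explicit sum over the bag dimension
out_emb = emb(indices).sum(dim=1)

# EmbeddingBag over flat indices, with one offset per bag
offsets = torch.arange(0, indices.numel(), indices.size(1))
out_bag = bag(indices.view(-1), offsets)

print(torch.allclose(out_emb, out_bag, atol=1e-5))  # should print True (up to float tolerance)
```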

Unfortunately, I have found that using EmbeddingBag makes my models slower, often more than three times slower. Running simple indexing operations in a loop suggests that, for the simple case of an embedding lookup followed by a sum, the EmbeddingBag layer is 40% slower than Embedding followed by a sum on the CPU, and about 25% slower on the GPU.

I used the following snippet (on PyTorch 0.2.0):

import time

import torch
from torch.nn import Embedding, EmbeddingBag
from torch.autograd import Variable


def time_layer(layer_class, repetitions=100):

    layer = layer_class(10000, 32)
    # 100 "bags" of 10 indices each
    indices = Variable(torch.ones(1000).long().view(100, 10))

    # Start offset of each bag within the flattened index tensor
    offsets = Variable(torch.arange(0, indices.numel(), indices.size(1)).long())

    if torch.cuda.is_available():
        layer = layer.cuda()
        indices = indices.cuda()
        offsets = offsets.cuda()

    start = time.time()
    for _ in range(repetitions):

        if isinstance(layer, EmbeddingBag):
            # EmbeddingBag sums within each bag internally
            result = layer(indices.view(-1), offsets).view(indices.size(0), -1)
        else:
            # Embedding followed by an explicit sum over the bag dimension
            result = layer(indices)
            result = result.sum(1)

        loss = result.sum()
        loss.backward()

    print(result.size())

    return time.time() - start


if __name__ == '__main__':

    embedding_time = time_layer(Embedding)
    embedding_bag_time = time_layer(EmbeddingBag)

    print('Embedding: {}, EmbeddingBag: {}, ratio: {}'.format(
        embedding_time,
        embedding_bag_time,
        embedding_time / embedding_bag_time))

I generated these timings:

  • CPU: Embedding: 0.10044050216674805, EmbeddingBag: 0.1490950584411621, ratio: 0.6736675461741425
  • GPU: Embedding: 0.18432116508483887, EmbeddingBag: 0.2421271800994873, ratio: 0.7612576374494734

Am I doing something that’s obviously wrong?


Currently, EmbeddingBag is only optimized on the GPU. The CPU implementation is quite naive. I’ll work on optimizing the CPU implementation for the next release.

Thank you for the speedy response!

I gleaned from the source comments that the CPU implementation is a temporary solution; however, even on the GPU EmbeddingBag is substantially slower. In the snippet I posted above, it’s ~25% slower; if you increase the size of the embedding layer and the number of indices used to

    layer = layer_class(100000, 256)
    indices = Variable(torch.ones(10 ** 5).long().view(10 ** 3, 10 ** 2))

it becomes twice as slow as the Embedding layer followed by a sum (measured on an Amazon p2.xlarge instance).

@smth To add a little bit more detail, I ran nvprof on both embedding and embedding bag layers; the results are in this gist.

The crux of the problem seems to be that cunn_LookupTableBag_accGradParametersKernel is about twice as slow as cunn_LookupTable_accGradParametersKernel, far outweighing any gains from not allocating intermediate results when summing over dimensions of the embedding output.
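For anyone who wants a similar per-operator breakdown without nvprof, PyTorch's built-in profiler can produce one. A sketch against a recent PyTorch API (the analysis above used nvprof on a much older version):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

layer = nn.EmbeddingBag(10000, 32, mode='sum')
indices = torch.randint(0, 10000, (1000,))   # 100 bags of 10 indices, flattened
offsets = torch.arange(0, indices.numel(), 10)

# Profile CPU (and GPU, if available) activity for one forward + backward pass
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    out = layer(indices, offsets)
    out.sum().backward()

# Show the five most expensive operators
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```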

Any updates on this? Is it worth using EmbeddingBag?

For reference to googlers like me, see

CPU embedding bag has been improved a lot by @cpuhrsch and many others.

Since you linked those issue/PRs, you should see that they are already solved/merged… So I’m not sure why you asked.

Thanks. The merged PR shows an improvement, but there is no comparison to check whether the docs' claim of a large performance improvement over a regular Embedding holds.

I didn’t do benchmarks personally, but from the code it should be more efficient than achieving the same thing with Embedding. Let me know what you find if you do a benchmark yourself 🙂


I do realize I'm waking up an old thread; apologies. However, I don't get the results I expected: see the code below.

Running a slight modification of this benchmark code (to load the GPU sufficiently and synchronize CUDA) on PyTorch 1.3 (driver version 435.21, CUDA 10.1, Titan RTX) does not support the claim that EmbeddingBag is more efficient: on the CPU it is faster (4.8 s vs 6.2 s), but on the GPU it is marginally slower.
Caveat: I've mostly been working with vision and have only just started looking at embeddings to combine text and vision; if I am using embeddings incorrectly, please let me know.
Perhaps I need to run a deeper network to get the full benefit of Embedding/EmbeddingBag?

import time
import torch
import torch.nn as nn
def time_layer(layer_class, repetitions=100, device='cpu'):
    # Use a 10x larger vocabulary and batch on the GPU to keep it busy
    mul = 1
    if device == "cuda":
        mul = 10

    layer = layer_class(100000 * mul, 512)
    layer.to(device=device)
    # 1000 * mul bags of 10 indices each
    indices = torch.ones(10000 * mul, device=device, dtype=torch.int64).view(1000 * mul, 10)
    offsets = torch.arange(0, indices.numel(), indices.size(1), device=device, dtype=torch.int64)

    start = time.time()
    for _ in range(repetitions):
        if isinstance(layer, nn.EmbeddingBag):
            result = layer(indices.view(-1), offsets).view(indices.size(0), -1)
        else:
            result = layer(indices)
            result = result.sum(1)

        loss = result.sum()
        loss.backward()

    print(result.size(), result.device)
    # Wait for queued GPU kernels before stopping the clock
    if device == "cuda":
        torch.cuda.synchronize()
    return time.time() - start

if __name__ == '__main__':
    DEVICE = 'cpu'
    print(DEVICE)
    embedding_time = time_layer(nn.Embedding, device=DEVICE)
    embedding_bag_time = time_layer(nn.EmbeddingBag, device=DEVICE)

    print('Embedding:\t{},\nEmbeddingBag:\t{}'.format(
        embedding_time,
        embedding_bag_time))
    
    Z = torch.zeros(1, device='cuda') # isolate cuda init from timing
    DEVICE = 'cuda'
    print(DEVICE)

    embedding_time = time_layer(nn.Embedding, device=DEVICE)
    embedding_bag_time = time_layer(nn.EmbeddingBag, device=DEVICE)

    print('Embedding:\t{},\nEmbeddingBag:\t{}'.format(
        embedding_time,
        embedding_bag_time))
Results (`python embedbag.py`):
cpu
torch.Size([1000, 512]) cpu
torch.Size([1000, 512]) cpu
Embedding:      6.203256607055664,
EmbeddingBag:   4.765470743179321
cuda
torch.Size([10000, 512]) cuda:0
torch.Size([10000, 512]) cuda:0
Embedding:      1.6973094940185547,
EmbeddingBag:   1.9424304962158203

Is this code using EmbeddingBag incorrectly?
Thank you
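One caveat that applies to the wall-clock numbers throughout this thread: CUDA executes kernels asynchronously, so the clock should only be started and stopped after a torch.cuda.synchronize(). A minimal timing helper illustrating the pattern (a sketch, not the exact code used above; `timed` is a hypothetical name):

```python
import time
import torch

def timed(fn, repetitions=100):
    """Wall-clock timing that accounts for CUDA's asynchronous kernel launches."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # drain any pending GPU work before starting the clock
    start = time.time()
    for _ in range(repetitions):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return time.time() - start

# Example: time a small matrix multiply on whatever device is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
a = torch.randn(256, 256, device=device)
print(timed(lambda: a @ a))
```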