EmbeddingBag vs Embedding performance

I’ve been trying to use the new EmbeddingBag layer to improve the performance of parts of my models where I first perform indexing into an Embedding layer, then sum or mean operations on the resulting embeddings.
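For context, the two formulations are meant to compute the same thing; a quick equivalence check, written against the current nn.Embedding / nn.EmbeddingBag API rather than the 0.2.0-era code below:

```python
import torch
from torch import nn

torch.manual_seed(0)
num_embeddings, dim = 50, 8
weight = torch.randn(num_embeddings, dim)

# Give both layers identical weights
emb = nn.Embedding(num_embeddings, dim)
bag = nn.EmbeddingBag(num_embeddings, dim, mode='sum')
with torch.no_grad():
    emb.weight.copy_(weight)
    bag.weight.copy_(weight)

indices = torch.randint(0, num_embeddings, (4, 10))  # 4 bags of 10 indices each

# Embedding lookup followed by an explicit sum over the bag dimension
out_emb = emb(indices).sum(dim=1)

# EmbeddingBag over flat indices, with one offset per bag
offsets = torch.arange(0, indices.numel(), indices.size(1))
out_bag = bag(indices.view(-1), offsets)

print(torch.allclose(out_emb, out_bag, atol=1e-5))  # should print True (up to float tolerance)
```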

Unfortunately, I have found that using EmbeddingBag makes my models slower, often more than three times slower. Running simple indexing operations in a loop suggests that, for the simple case of an embedding lookup followed by a sum, the EmbeddingBag layer is 40% slower than Embedding followed by a sum on the CPU, and about 25% slower on the GPU.

I used the following snippet (on PyTorch 0.2.0):

import time

import torch
from torch.nn import Embedding, EmbeddingBag
from torch.autograd import Variable


def time_layer(layer_class, repetitions=100):

    layer = layer_class(10000, 32)
    # 100 "bags" of 10 indices each
    indices = Variable(torch.ones(1000).long().view(100, 10))

    # Start offset of each bag within the flattened index tensor
    offsets = Variable(torch.arange(0, indices.numel(), indices.size(1)).long())

    if torch.cuda.is_available():
        layer = layer.cuda()
        indices = indices.cuda()
        offsets = offsets.cuda()

    start = time.time()
    for _ in range(repetitions):

        if isinstance(layer, EmbeddingBag):
            # EmbeddingBag sums within each bag internally
            result = layer(indices.view(-1), offsets).view(indices.size(0), -1)
        else:
            # Embedding followed by an explicit sum over the bag dimension
            result = layer(indices)
            result = result.sum(1)

        loss = result.sum()
        loss.backward()

    print(result.size())

    return time.time() - start


if __name__ == '__main__':

    embedding_time = time_layer(Embedding)
    embedding_bag_time = time_layer(EmbeddingBag)

    print('Embedding: {}, EmbeddingBag: {}, ratio: {}'.format(
        embedding_time,
        embedding_bag_time,
        embedding_time / embedding_bag_time))

I generated these timings:

  • CPU: Embedding: 0.10044050216674805, EmbeddingBag: 0.1490950584411621, ratio: 0.6736675461741425
  • GPU: Embedding: 0.18432116508483887, EmbeddingBag: 0.2421271800994873, ratio: 0.7612576374494734

Am I doing something that’s obviously wrong?


Currently, EmbeddingBag is only optimized on the GPU. The CPU implementation is quite naive. I’ll work on optimizing the CPU implementation for the next release.

Thank you for the speedy response!

I gleaned from the source comments that the CPU implementation is a temporary solution; however, even on the GPU EmbeddingBag is substantially slower. In the snippet I posted above, it’s ~25% slower; if you increase the size of the embedding layer and the number of indices used to

    layer = layer_class(100000, 256)
    indices = Variable(torch.ones(10 ** 5).long().view(10 ** 3, 10 ** 2))

it becomes twice as slow as the Embedding layer followed by a sum (measured on an Amazon p2.xlarge instance).

@smth To add a little bit more detail, I ran nvprof on both embedding and embedding bag layers; the results are in this gist.

The crux of the problem seems to be that cunn_LookupTableBag_accGradParametersKernel is about twice as slow as cunn_LookupTable_accGradParametersKernel, far outweighing any gains from not allocating intermediate results when summing over dimensions of the embedding output.
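For anyone who wants a similar per-operator breakdown without nvprof, PyTorch's built-in profiler can produce one. A sketch against a recent PyTorch API (the analysis above used nvprof on a much older version):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

layer = nn.EmbeddingBag(10000, 32, mode='sum')
indices = torch.randint(0, 10000, (1000,))   # 100 bags of 10 indices, flattened
offsets = torch.arange(0, indices.numel(), 10)

# Profile CPU (and GPU, if available) activity for one forward + backward pass
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    out = layer(indices, offsets)
    out.sum().backward()

# Show the five most expensive operators
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```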

Any updates on this? Is it worth using EmbeddingBag?

For reference to googlers like me, see

CPU embedding bag has been improved a lot by @cpuhrsch and many others.

Since you linked those issue/PRs, you should see that they are already solved/merged… So I’m not sure why you asked.

Thanks. The merged PR shows an improvement, but there is no comparison to check whether the docs' claim of a large performance improvement over a regular Embedding holds.

I didn’t do benchmarks personally, but from the code it should be more efficient than achieving the same thing with Embedding. Let me know what you find if you do a benchmark yourself 🙂


I do realize I'm waking up an old thread; apologies. However, I don't get the results I expected: see the code below.

Running a slight modification of this benchmark code (to load the GPU sufficiently and synchronize CUDA) on PyTorch 1.3 (driver version 435.21, CUDA 10.1, Titan RTX) does not support the claim that EmbeddingBag is more efficient: on the CPU it is faster (4.8 s vs 6.2 s), but on the GPU it is marginally slower.
Caveat: I've mostly been working with vision and have only just started looking at embeddings to combine text and vision; if I am using embeddings incorrectly, please let me know.
Perhaps I need to run a deeper network to get the full benefit of Embedding/EmbeddingBag?

import time
import torch
import torch.nn as nn
def time_layer(layer_class, repetitions=100, device='cpu'):
    # Use a 10x larger vocabulary and batch on the GPU to keep it busy
    mul = 1
    if device == "cuda":
        mul = 10

    layer = layer_class(100000 * mul, 512)
    layer.to(device=device)
    # 1000 * mul bags of 10 indices each
    indices = torch.ones(10000 * mul, device=device, dtype=torch.int64).view(1000 * mul, 10)
    offsets = torch.arange(0, indices.numel(), indices.size(1), device=device, dtype=torch.int64)

    start = time.time()
    for _ in range(repetitions):
        if isinstance(layer, nn.EmbeddingBag):
            result = layer(indices.view(-1), offsets).view(indices.size(0), -1)
        else:
            result = layer(indices)
            result = result.sum(1)

        loss = result.sum()
        loss.backward()

    print(result.size(), result.device)
    # Wait for queued GPU kernels before stopping the clock
    if device == "cuda":
        torch.cuda.synchronize()
    return time.time() - start

if __name__ == '__main__':
    DEVICE = 'cpu'
    print(DEVICE)
    embedding_time = time_layer(nn.Embedding, device=DEVICE)
    embedding_bag_time = time_layer(nn.EmbeddingBag, device=DEVICE)

    print('Embedding:\t{},\nEmbeddingBag:\t{}'.format(
        embedding_time,
        embedding_bag_time))
    
    Z = torch.zeros(1, device='cuda') # isolate cuda init from timing
    DEVICE = 'cuda'
    print(DEVICE)

    embedding_time = time_layer(nn.Embedding, device=DEVICE)
    embedding_bag_time = time_layer(nn.EmbeddingBag, device=DEVICE)

    print('Embedding:\t{},\nEmbeddingBag:\t{}'.format(
        embedding_time,
        embedding_bag_time))
Results (`python embedbag.py`):
cpu
torch.Size([1000, 512]) cpu
torch.Size([1000, 512]) cpu
Embedding:      6.203256607055664,
EmbeddingBag:   4.765470743179321
cuda
torch.Size([10000, 512]) cuda:0
torch.Size([10000, 512]) cuda:0
Embedding:      1.6973094940185547,
EmbeddingBag:   1.9424304962158203

Is this code using EmbeddingBag incorrectly?
Thank you
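One caveat that applies to the wall-clock numbers throughout this thread: CUDA executes kernels asynchronously, so the clock should only be started and stopped after a torch.cuda.synchronize(). A minimal timing helper illustrating the pattern (a sketch, not the exact code used above; `timed` is a hypothetical name):

```python
import time
import torch

def timed(fn, repetitions=100):
    """Wall-clock timing that accounts for CUDA's asynchronous kernel launches."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # drain any pending GPU work before starting the clock
    start = time.time()
    for _ in range(repetitions):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return time.time() - start

# Example: time a small matrix multiply on whatever device is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
a = torch.randn(256, 256, device=device)
print(timed(lambda: a @ a))
```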