nn.Embedding performance issue?

On the CPU, nn.functional.embedding takes roughly twice as long for FP32 as for FP16 and BF16, which is reasonable given the difference in per-element size.

However, on CUDA the FP32 time is very close to the FP16 and BF16 times, which seems unreasonable.
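
A quick back-of-the-envelope calculation of the data moved per call (my own arithmetic, based on the sizes used in the benchmark below) puts the per-element size difference in perspective:

# Output tensor size per embedding call, for the sizes used in the benchmark.
token_count, n = 384 * 1024, 768
elements = token_count * n                              # 301,989,888 elements
print('FP32: %.2f GB written' % (elements * 4 / 1e9))   # ~1.21 GB
print('FP16: %.2f GB written' % (elements * 2 / 1e9))   # ~0.60 GB
# The gathered reads from `weight` add roughly the same amount again.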

I ran the following CUDA benchmark on a Colab T4 GPU; the output is pasted below.

import torch
import torch.nn as nn
import time

device = 'cuda'

vocab_size = 8192
n = 768
token_count = 384 * 1024
loop_times = 1000

for dtype in [torch.bfloat16, torch.float16, torch.float]:
    indices = torch.randint(0, vocab_size, (token_count,), dtype=torch.int, device=device)
    weight = torch.rand((vocab_size, n), dtype=dtype, device=device)

    # warm up
    for i in range(0, 10):
        output = nn.functional.embedding(indices, weight)
        torch.cuda.synchronize()

    # time loop_times calls, synchronizing after each one so the kernel has finished
    start = time.time_ns()
    for i in range(0, loop_times):
        output = nn.functional.embedding(indices, weight)
        torch.cuda.synchronize()
    end = time.time_ns()

    print('elapsed time: %f ms' % ((end - start) / 1e6 / loop_times))
    print(output.shape)
    print(output.dtype)

Output:

elapsed time: 10.571431 ms
torch.Size([393216, 768])
torch.bfloat16
elapsed time: 10.287250 ms
torch.Size([393216, 768])
torch.float16
elapsed time: 12.261419 ms
torch.Size([393216, 768])
torch.float32
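
In case the host-side time.time_ns() plus per-iteration synchronize() is skewing the GPU numbers, here is an alternative timing sketch with CUDA events (reusing the indices, weight, and loop_times from the loop above):

# Time the same lookup with CUDA events instead of wall-clock time,
# so no host/device synchronization happens inside the measured region.
start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

start_evt.record()
for _ in range(loop_times):
    output = nn.functional.embedding(indices, weight)
end_evt.record()
torch.cuda.synchronize()
print('event time: %f ms' % (start_evt.elapsed_time(end_evt) / loop_times))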

Changing device = 'cuda' to device = 'cpu' gives the output below.

elapsed time: 376.148446 ms
torch.Size([393216, 768])
torch.bfloat16
elapsed time: 355.628159 ms
torch.Size([393216, 768])
torch.float16
elapsed time: 720.934025 ms
torch.Size([393216, 768])
torch.float32

GPUs are optimized for the dense matrix operations in linear and convolutional layers, but an embedding lookup is essentially a memory gather and doesn't benefit from that acceleration. And because the gather is spread across many threads running in parallel, the per-element size difference between FP32 and FP16/BF16 makes little difference to the GPU runtime.
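
One way to sanity-check the memory-lookup explanation is to turn the measured times into an effective bandwidth; a rough sketch using my numbers above (and assuming the T4's ~320 GB/s peak memory bandwidth):

# Effective bandwidth estimate: each call reads ~token_count * n elements
# from `weight` and writes the same number to `output` (index reads ignored).
token_count, n = 384 * 1024, 768
for elem_size, ms in [(4, 12.26), (2, 10.29)]:   # FP32 and FP16 times from above
    gb = 2 * token_count * n * elem_size / 1e9   # bytes moved per call
    print('%d-byte elements: ~%.0f GB/s' % (elem_size, gb / (ms / 1e3)))
# This comes out around 200 GB/s for FP32 and 120 GB/s for FP16, both well
# below the T4's ~320 GB/s peak, so neither run saturates memory bandwidth.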