With nn.functional.embedding on the CPU, the time taken with FP32 weights is roughly double the time taken with FP16 and BF16 weights, which seems reasonable given the difference in per-element size.
However, on CUDA the FP32 time is very close to the FP16 and BF16 times, which I don't understand.
I ran the CUDA code below on a Colab T4 GPU, and the output is pasted after it.
import torch
import torch.nn as nn
import time

device = 'cuda'
vocab_size = 8192
n = 768
token_count = 384 * 1024
loop_times = 1000

for dtype in [torch.bfloat16, torch.float16, torch.float]:
    indices = torch.randint(0, vocab_size, (token_count,), dtype=torch.int, device=device)
    weight = torch.rand((vocab_size, n), dtype=dtype, device=device)

    # warm up
    for i in range(0, 10):
        output = nn.functional.embedding(indices, weight)
    torch.cuda.synchronize()

    start = time.time_ns()
    for i in range(0, loop_times):
        output = nn.functional.embedding(indices, weight)
    torch.cuda.synchronize()
    end = time.time_ns()

    print('elapsed time: %f ms' % ((end - start) / 1e6 / loop_times))
    print(output.shape)
    print(output.dtype)
Output:
elapsed time: 10.571431 ms
torch.Size([393216, 768])
torch.bfloat16
elapsed time: 10.287250 ms
torch.Size([393216, 768])
torch.float16
elapsed time: 12.261419 ms
torch.Size([393216, 768])
torch.float32
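In case the manual time.time_ns loop is suspect, here is a minimal sketch of how the same measurement could be cross-checked with torch.utils.benchmark.Timer, which handles warm-up and CUDA synchronization itself. This is only an illustration of the method; I am not reporting its numbers here.

import torch
import torch.nn as nn
import torch.utils.benchmark as benchmark

device = 'cuda'
vocab_size, n, token_count = 8192, 768, 384 * 1024

for dtype in [torch.bfloat16, torch.float16, torch.float]:
    indices = torch.randint(0, vocab_size, (token_count,), dtype=torch.int, device=device)
    weight = torch.rand((vocab_size, n), dtype=dtype, device=device)
    timer = benchmark.Timer(
        stmt='nn.functional.embedding(indices, weight)',
        globals={'nn': nn, 'indices': indices, 'weight': weight},
    )
    # blocked_autorange picks the iteration count and synchronizes the GPU for timing
    print(dtype, timer.blocked_autorange())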
Then I changed device = 'cuda' to device = 'cpu' and reran the script. The output is pasted below.
elapsed time: 376.148446 ms
torch.Size([393216, 768])
torch.bfloat16
elapsed time: 355.628159 ms
torch.Size([393216, 768])
torch.float16
elapsed time: 720.934025 ms
torch.Size([393216, 768])
torch.float32
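For context on why I expected roughly a 2x gap, here is the back-of-the-envelope memory-traffic estimate I had in mind, assuming the lookup is purely bandwidth-bound (my own rough numbers, not a measurement):

# Per call: read one weight row per token from the table and write one output row per token.
token_count = 384 * 1024
n = 768
for name, bytes_per_elem in [('fp16/bf16', 2), ('fp32', 4)]:
    row_bytes = n * bytes_per_elem
    traffic_gb = 2 * token_count * row_bytes / 1e9  # gathered reads + output writes
    print('%s: ~%.2f GB moved per call' % (name, traffic_gb))
# -> fp16/bf16: ~1.21 GB, fp32: ~2.42 GB, hence my expectation of roughly 2x runtime for FP32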