Efficiency of two types of attention differs between CPU and GPU

I am trying to implement a dot-product attention block. There are two possible implementations with different computational cost: one computes $z = (QK^T)V/n$, the other computes $z = Q(K^TV)/n$. Theoretically, the first has quadratic complexity in the sequence length $n$, while the second has linear complexity.
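To make the difference concrete, here is a rough back-of-the-envelope FLOP count per batch and head (a sketch only; constant factors and the division by $n$ are ignored, and the shapes are the ones used in the code below):

n, dk, dv = 256, 16, 16

# type 1: (Q K^T) V  ->  an (n x dk)(dk x n) matmul followed by an (n x n)(n x dv) matmul
flops_type1 = n * n * dk + n * n * dv    # grows as O(n^2 * d)

# type 2: Q (K^T V)  ->  a (dk x n)(n x dv) matmul followed by an (n x dk)(dk x dv) matmul
flops_type2 = dk * n * dv + n * dk * dv  # grows as O(n * d^2)

print(flops_type1, flops_type2)  # 2097152 vs 131072 for these shapes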

The CPU implementation is as follows:

import time
import torch 

b = 5 
n = 256
h = 4 
dk = 16
dv = 16

Q = torch.rand((b, h, n, dk))
K = torch.rand((b, h, n, dk))
V = torch.rand((b, h, n, dv))

# type 1: z = (Q K^T) V
times = []
niters = 10000
for i in range(niters):
    torch.cuda.synchronize()
    start_epoch = time.time()
    QK = torch.einsum('bhnd,bhmd->bhnm', Q, K)
    v1 = torch.einsum('bhnm,bhmd->bhnd', QK, V)
    end_epoch = time.time()
    elapsed = end_epoch - start_epoch
    times.append(elapsed)

avg_time = sum(times)/niters
print(avg_time) # result on my machine 0.0004075302600860596

# type 2: z = Q (K^T V)

times = []
niters = 10000
for i in range(niters):
    torch.cuda.synchronize()
    start_epoch = time.time()
    KV = torch.einsum('bhni,bhnj->bhij', K, V)
    v2 = torch.einsum('bhni,bhij->bhnj', Q, KV)
    end_epoch = time.time()
    elapsed = end_epoch - start_epoch
    times.append(elapsed)

avg_time = sum(times)/niters
print(avg_time) # result on my machine 0.00023783543109893798

Here, in the CPU case, type 1 is clearly slower than type 2.
However, if we put Q, K, V on CUDA, the result on my machine for type 1 is 0.00014664556980133057 and for type 2 is 0.0001561680793762207.

On the GPU, type 2 is slower than type 1, which is inconsistent with the CPU result.

I'm quite confused by this result.

Based on your code snippet, you are not synchronizing the GPU before stopping the timers.
Since CUDA operations are executed asynchronously, you are currently profiling the dispatching, kernel launches, etc., but not the actual GPU execution time, as the kernel execution might not have finished yet.
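As a rough sketch (assuming Q, K, and V have already been moved to the GPU), you could either call torch.cuda.synchronize() right before reading the end time, or use CUDA events, which measure the elapsed time on the device itself:

import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
QK = torch.einsum('bhnd,bhmd->bhnm', Q, K)
v1 = torch.einsum('bhnm,bhmd->bhnd', QK, V)
end.record()

torch.cuda.synchronize()        # wait until the recorded events have completed
print(start.elapsed_time(end))  # elapsed time in milliseconds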

Thank you, I changed the code to the following and it works:

import time 
import torch

b = 5 
n = 2048
h = 4 
dk = 16
dv = 16

Q = torch.rand((b, h, n, dk)).cuda()
K = torch.rand((b, h, n, dk)).cuda()
V = torch.rand((b, h, n, dv)).cuda()

# type 1: z = (Q K^T) V
times = []
niters = 10000
for i in range(niters):
    torch.cuda.synchronize()
    start_epoch = time.time()
    QK = torch.einsum('bhnd,bhmd->bhnm', Q, K)
    v1 = torch.einsum('bhnm,bhmd->bhnd', QK, V)
    torch.cuda.synchronize()  # wait for the GPU kernels to finish before stopping the timer
    end_epoch = time.time()
    elapsed = end_epoch - start_epoch
    times.append(elapsed)

avg_time = sum(times)/niters
print(avg_time)

# type 2: z = Q (K^T V)
times = []
niters = 10000
for i in range(niters):
    torch.cuda.synchronize()
    start_epoch = time.time()
    KV = torch.einsum('bhni,bhnj->bhij', K, V)
    v2 = torch.einsum('bhni,bhij->bhnj', Q, KV)
    torch.cuda.synchronize()  # wait for the GPU kernels to finish before stopping the timer
    end_epoch = time.time()
    elapsed = end_epoch - start_epoch
    times.append(elapsed)

avg_time = sum(times)/niters
print(avg_time)

The result for type 1 is 0.003689867615699768 and for type 2 is 0.0018505871534347535, which is consistent with the theory.
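To double-check the scaling, a sweep over the sequence length using the same timing loop could look like this (a rough sketch; the helper time_attention is just an illustrative name, and a CUDA device is assumed):

import time
import torch

def time_attention(n, b=5, h=4, dk=16, dv=16, niters=100):
    # Returns the average (type 1, type 2) times for a given sequence length n.
    Q = torch.rand((b, h, n, dk)).cuda()
    K = torch.rand((b, h, n, dk)).cuda()
    V = torch.rand((b, h, n, dv)).cuda()
    t1 = t2 = 0.0
    for _ in range(niters):
        torch.cuda.synchronize()
        start = time.time()
        QK = torch.einsum('bhnd,bhmd->bhnm', Q, K)
        torch.einsum('bhnm,bhmd->bhnd', QK, V)
        torch.cuda.synchronize()
        t1 += time.time() - start

        torch.cuda.synchronize()
        start = time.time()
        KV = torch.einsum('bhni,bhnj->bhij', K, V)
        torch.einsum('bhni,bhij->bhnj', Q, KV)
        torch.cuda.synchronize()
        t2 += time.time() - start
    return t1 / niters, t2 / niters

for n in (512, 1024, 2048, 4096):
    print(n, time_attention(n))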

Hi, I have another question about this. Type 2 should be faster, but in practice, if we don't measure the GPU execution time, the overall execution does not speed up (maybe due to CUDA's asynchronous execution), so it seems that using type 2 won't boost the overall speed. Do you have any suggestions on this problem?

I don't understand the claim, as your initial profiling was invalid.
As previously described, you were not profiling the actual kernel execution but the host-side work, i.e. the dispatching, kernel launches, etc.
In a real scenario you would want to profile the time it takes to get a valid result from the computation, not only the time it takes for the CPU to "return" from the call, since you would want to actually use the result values.
The proper profiling seems to match your expectations.
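As a side note, torch.utils.benchmark takes care of the CUDA synchronization (and warmup) for you. A minimal sketch for the two variants, assuming the Q, K, V tensors defined above are reused:

import torch
from torch.utils import benchmark

t1 = benchmark.Timer(
    stmt="torch.einsum('bhnm,bhmd->bhnd', torch.einsum('bhnd,bhmd->bhnm', Q, K), V)",
    globals={'torch': torch, 'Q': Q, 'K': K, 'V': V},
)
t2 = benchmark.Timer(
    stmt="torch.einsum('bhni,bhij->bhnj', Q, torch.einsum('bhni,bhnj->bhij', K, V))",
    globals={'torch': torch, 'Q': Q, 'K': K, 'V': V},
)
print(t1.timeit(1000))  # type 1: (Q K^T) V
print(t2.timeit(1000))  # type 2: Q (K^T V)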

I see, I will check my code in detail. Thank you!