Efficiency of two types of attention differs between CPU and GPU

I am trying to implement a dot-product attention block. There are two possible implementations with different computational cost: one computes $z = (QK^T)V/n$, the other computes $z = Q(K^TV)/n$. Theoretically, the first has quadratic complexity in the sequence length $n$, while the second has linear complexity.
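To make the difference concrete, here is a rough back-of-the-envelope FLOP count per batch and head (a sketch only; constant factors and the division by $n$ are ignored, and the shapes are the ones used in the code below):

n, dk, dv = 256, 16, 16

# type 1: (Q K^T) V  ->  an (n x dk)(dk x n) matmul followed by an (n x n)(n x dv) matmul
flops_type1 = n * n * dk + n * n * dv    # grows as O(n^2 * d)

# type 2: Q (K^T V)  ->  a (dk x n)(n x dv) matmul followed by an (n x dk)(dk x dv) matmul
flops_type2 = dk * n * dv + n * dk * dv  # grows as O(n * d^2)

print(flops_type1, flops_type2)  # 2097152 vs 131072 for these shapes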

The CPU implementation is as follows:

import time
import torch 

b = 5 
n = 256
h = 4 
dk = 16
dv = 16

Q = torch.rand((b, h, n, dk))
K = torch.rand((b, h, n, dk))
V = torch.rand((b, h, n, dv))

# type 1: z = (Q K^T) V
times = []
niters = 10000
for i in range(niters):
    torch.cuda.synchronize()
    start_epoch = time.time()
    QK = torch.einsum('bhnd,bhmd->bhnm', Q, K)
    v1 = torch.einsum('bhnm,bhmd->bhnd', QK, V)
    end_epoch = time.time()
    elapsed = end_epoch - start_epoch
    times.append(elapsed)

avg_time = sum(times)/niters
print(avg_time) # result on my machine 0.0004075302600860596

# type 2: z = Q (K^T V)

times = []
niters = 10000
for i in range(niters):
    torch.cuda.synchronize()
    start_epoch = time.time()
    KV = torch.einsum('bhni,bhnj->bhij', K, V)
    v2 = torch.einsum('bhni,bhij->bhnj', Q, KV)
    end_epoch = time.time()
    elapsed = end_epoch - start_epoch
    times.append(elapsed)

avg_time = sum(times)/niters
print(avg_time) # result on my machine 0.00023783543109893798

Here, in the CPU case, type 1 is clearly slower than type 2.
However, if we put Q, K, V on CUDA, the result on my machine for type 1 is 0.00014664556980133057 and for type 2 is 0.0001561680793762207.

On the GPU, type 2 is slower than type 1, which is inconsistent with the CPU result.

I'm quite confused by this result.

Based on your code snippet, you are not synchronizing the GPU before stopping the timers.
Since CUDA operations are executed asynchronously, you are currently profiling the dispatching, kernel launches, etc., but not the actual GPU execution time, as the kernel execution might not have finished yet.
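As a rough sketch (assuming Q, K, and V have already been moved to the GPU), you could either call torch.cuda.synchronize() right before reading the end time, or use CUDA events, which measure the elapsed time on the device itself:

import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
QK = torch.einsum('bhnd,bhmd->bhnm', Q, K)
v1 = torch.einsum('bhnm,bhmd->bhnd', QK, V)
end.record()

torch.cuda.synchronize()        # wait until the recorded events have completed
print(start.elapsed_time(end))  # elapsed time in milliseconds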

Thank you, I changed the code to the following and it works:

import time 
import torch

b = 5 
n = 2048
h = 4 
dk = 16
dv = 16

Q = torch.rand((b, h, n, dk)).cuda()
K = torch.rand((b, h, n, dk)).cuda()
V = torch.rand((b, h, n, dv)).cuda()

# type 1: z = (Q K^T) V
times = []
niters = 10000
for i in range(niters):
    torch.cuda.synchronize()
    start_epoch = time.time()
    QK = torch.einsum('bhnd,bhmd->bhnm', Q, K)
    v1 = torch.einsum('bhnm,bhmd->bhnd', QK, V)
    torch.cuda.synchronize()  # wait for the GPU kernels to finish before stopping the timer
    end_epoch = time.time()
    elapsed = end_epoch - start_epoch
    times.append(elapsed)

avg_time = sum(times)/niters
print(avg_time)

# type 2: z = Q (K^T V)
times = []
niters = 10000
for i in range(niters):
    torch.cuda.synchronize()
    start_epoch = time.time()
    KV = torch.einsum('bhni,bhnj->bhij', K, V)
    v2 = torch.einsum('bhni,bhij->bhnj', Q, KV)
    torch.cuda.synchronize()  # wait for the GPU kernels to finish before stopping the timer
    end_epoch = time.time()
    elapsed = end_epoch - start_epoch
    times.append(elapsed)

avg_time = sum(times)/niters
print(avg_time)

The result for type 1 is 0.003689867615699768 and for type 2 is 0.0018505871534347535, which is consistent with the theory.
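To double-check the scaling, a sweep over the sequence length using the same timing loop could look like this (a rough sketch; the helper time_attention is just an illustrative name, and a CUDA device is assumed):

import time
import torch

def time_attention(n, b=5, h=4, dk=16, dv=16, niters=100):
    # Returns the average (type 1, type 2) times for a given sequence length n.
    Q = torch.rand((b, h, n, dk)).cuda()
    K = torch.rand((b, h, n, dk)).cuda()
    V = torch.rand((b, h, n, dv)).cuda()
    t1 = t2 = 0.0
    for _ in range(niters):
        torch.cuda.synchronize()
        start = time.time()
        QK = torch.einsum('bhnd,bhmd->bhnm', Q, K)
        torch.einsum('bhnm,bhmd->bhnd', QK, V)
        torch.cuda.synchronize()
        t1 += time.time() - start

        torch.cuda.synchronize()
        start = time.time()
        KV = torch.einsum('bhni,bhnj->bhij', K, V)
        torch.einsum('bhni,bhij->bhnj', Q, KV)
        torch.cuda.synchronize()
        t2 += time.time() - start
    return t1 / niters, t2 / niters

for n in (512, 1024, 2048, 4096):
    print(n, time_attention(n))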

Hi, I have another question about this. Type 2 should be faster, but in practice, if we don't measure the GPU execution time, the overall execution does not speed up (maybe due to CUDA's asynchronous execution), so it seems that using type 2 won't boost the overall speed. Do you have any suggestions on this problem?

I don't understand the claim, as your initial profiling was invalid.
As previously described, you were not profiling the actual kernel execution but the host-side work, i.e. the dispatching, kernel launches, etc.
In a real scenario you would want to profile the time it takes to get a valid result from the computation, not only the time it takes for the CPU to "return" from the call, since you would want to actually use the result values.
The proper profiling seems to match your expectations.
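As a side note, torch.utils.benchmark takes care of the CUDA synchronization (and warmup) for you. A minimal sketch for the two variants, assuming the Q, K, V tensors defined above are reused:

import torch
from torch.utils import benchmark

t1 = benchmark.Timer(
    stmt="torch.einsum('bhnm,bhmd->bhnd', torch.einsum('bhnd,bhmd->bhnm', Q, K), V)",
    globals={'torch': torch, 'Q': Q, 'K': K, 'V': V},
)
t2 = benchmark.Timer(
    stmt="torch.einsum('bhni,bhij->bhnj', Q, torch.einsum('bhni,bhnj->bhij', K, V))",
    globals={'torch': torch, 'Q': Q, 'K': K, 'V': V},
)
print(t1.timeit(1000))  # type 1: (Q K^T) V
print(t2.timeit(1000))  # type 2: Q (K^T V)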

I see, I will check my code in detail. Thank you!