I am testing the code below. I want to improve its speed by segmenting the data into "batches", hoping to exploit the GPU's asynchronous (overlapping) execution. However, when I set batch to 16 or even larger, the runtime is almost the same as with batch=1. I don't understand why the computation time didn't drop. Am I misunderstanding how asynchronous execution works?
import time

import numpy as np
import torch

def improved_efficient_matmul(a, c, index, batch=2):
    """
    :param a: N * I * J
    :param c: M * J * K (one matrix selected per row of a via index)
    :param index: N, maps each row of a to a matrix in c
    :param batch: number of batches to split the N dimension into
    :return: N * I * K
    """
    per_batch_len = a.shape[0] // batch
    tmp = {}
    for b in range(batch):
        # one size-1 matmul per row, concatenated into a per-batch result
        tmp[b] = torch.cat(
            [torch.matmul(a[i + per_batch_len * b:i + per_batch_len * b + 1, :, :],
                          c[index[i + per_batch_len * b], :, :])
             for i in range(per_batch_len)],
            dim=0)
    # keys were inserted in ascending order, so this sort is a no-op
    tmp = dict(sorted(tmp.items(), key=lambda x: x[0]))
    out = []
    for k in tmp.keys():
        out.append(tmp[k])
    out = torch.cat(out, dim=0)
    return out
# 2**20 inputs, each paired with one of 2**14 weight matrices in b
rad = np.random.randint(0, high=16384, size=1048576)
rad = torch.from_numpy(rad).long()
a = torch.rand([1048576, 1, 64]).cuda()
b = torch.rand([16384, 64, 64]).cuda().requires_grad_()
print(torch.cuda.memory_allocated() // 1024 // 1024)  # allocated, MiB
print("max", torch.cuda.max_memory_allocated() // 1024 // 1024)
# synchronize before and after timing so the wall clock covers the GPU work
torch.cuda.synchronize()
start = time.time()
out1 = improved_efficient_matmul(a, b, rad, 1)
torch.cuda.synchronize()
end = time.time()
print(end - start)
print(torch.cuda.memory_allocated() // 1024 // 1024)
print("max", torch.cuda.max_memory_allocated() // 1024 // 1024)
print(out1.shape)
This post and the GTC presentation linked in my other post in the same thread might be interesting.
TL;DR: use CUDA streams and make sure compute resources are actually available. Matmuls tend to have high occupancy, so you might not be able to overlap them.
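In case it helps, here is a minimal sketch of what stream-based overlap could look like (the chunk count, shapes, and variable names are illustrative, not taken from your code). Whether any overlap actually happens depends on whether the kernels leave SMs free:

import torch

# launch independent matmuls on separate CUDA streams;
# large matmuls usually saturate the GPU, so this often shows no speedup
a = torch.rand(1024, 1, 64, device="cuda")
b = torch.rand(1024, 64, 64, device="cuda")

n_streams = 4
streams = [torch.cuda.Stream() for _ in range(n_streams)]
chunks_a = a.chunk(n_streams, dim=0)
chunks_b = b.chunk(n_streams, dim=0)
outs = [None] * n_streams

torch.cuda.synchronize()  # make sure the inputs are fully initialized
for i in range(n_streams):
    with torch.cuda.stream(streams[i]):
        # each chunk's kernel is queued on its own stream
        outs[i] = torch.matmul(chunks_a[i], chunks_b[i])
torch.cuda.synchronize()  # wait for all streams before using the results

out = torch.cat(outs, dim=0)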
Many thanks for your suggestions. I actually want to compute a matmul between two large tensors, which causes an out-of-memory (OOM) error on my device if I call torch.matmul(A, B) directly. So I wanted to reduce the memory cost by doing the matmul one row at a time while keeping the speed.
I have just found where my misunderstanding is in the code above: simply splitting the two tensors into batches achieves what I want, i.e. torch.matmul(A[batch, …], B[index[batch], …]), where batch is now a slice of rows rather than a single row (the code above effectively does one row per matmul, which is why it is so slow). Previously, I was worried that B[index[batch]] would create a new copy tensor and increase the GPU memory cost. It does create a copy, but only of the current batch, so the peak memory stays small.
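For completeness, here is a minimal sketch of the batched version I ended up with (the function name and the batch count are mine, for illustration):

import numpy as np
import torch

def batched_indexed_matmul(a, c, index, batch=16):
    """a: N*I*J, c: M*J*K, index: N values in [0, M); returns N*I*K."""
    per_batch_len = a.shape[0] // batch  # assumes batch divides N evenly
    out = []
    for bi in range(batch):
        s = slice(bi * per_batch_len, (bi + 1) * per_batch_len)
        # c[index[s]] does materialize a copy, but only of one batch,
        # so peak memory grows with the batch size, not with N
        out.append(torch.matmul(a[s, :, :], c[index[s], :, :]))
    return torch.cat(out, dim=0)

rad = torch.from_numpy(np.random.randint(0, 16384, size=1048576)).long().cuda()
a = torch.rand([1048576, 1, 64]).cuda()
b = torch.rand([16384, 64, 64]).cuda()
out = batched_indexed_matmul(a, b, rad, batch=16)
print(out.shape)  # torch.Size([1048576, 1, 64])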