We have a PyTorch-based inference system that makes heavy use of CUDA graph capture and replay. Looking at profiles, every time there is a cudaMemset there is a gap in the execution that looks like this:
In this case, the (4-byte) cudaMemsetAsync takes 1.7us, but the gap between kernels is almost 13us. PyTorch itself makes very little use of memset (it seems to prefer fill kernels), but plenty of libraries use memset as part of kernel prep (here, the TransformerEngine fp8 layernorm kernel).
I was able to reproduce this effect with a small microbenchmark. It looks roughly like this:
```python
import time

import torch

z = torch.zeros(1, device="cuda")
x = torch.randn(512, 1024, device="cuda")
w = torch.randn(1024, 1024, device="cuda")

# [warmup, setup, etc], then:

# capture
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    for _ in range(1000):
        torch.ops.memset_cuda.memset_cuda(z)  # Case 1: use memset
        # z.zero_()  # Case 2: use vectorized_elementwise_kernel
        torch.matmul(x, w.T)

# Measure graph time
torch.cuda.synchronize()
tic = time.time()
g.replay()
torch.cuda.synchronize()
toc = time.time()
print(f"{(toc - tic) * 1000.}ms")
```
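Wall-clock timing around `synchronize()` also picks up host-side launch overhead; CUDA events give a device-side measurement of the replay alone. A minimal sketch of the same measurement with events (`time_replay_ms` is a hypothetical helper name; assumes `g` is the captured graph from above):

```python
import torch

def time_replay_ms(g: "torch.cuda.CUDAGraph") -> float:
    # Device-side timing of a single graph replay via CUDA events
    # recorded on the current stream.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    g.replay()
    end.record()
    end.synchronize()
    return start.elapsed_time(end)  # milliseconds
```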
On a single H100-SXM5, I get these measurements for the graph time:
- using memset: 34.8ms
- using zero_: 33.4ms
This is a ~4% end-to-end slowdown (34.8ms vs. 33.4ms) due to memset vs. a fill kernel! Looking at the profile, I see gaps similar to the ones above, though their relative size is smaller. Any ideas why this might be? Or any workarounds? In our own code we can prefer the fill kernels, but it is impossible to work around, e.g., cuBLAS's use of memset. Tagging @ptrblck for NV awareness.
Aside: the call to memset_cuda is a 5-line C++ extension I added that simply calls:
```cpp
cudaMemsetAsync(x.data_ptr<float>(), 0, x.numel() * sizeof(float),
                at::cuda::getCurrentCUDAStream().stream());
```
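For code we control, the workaround mentioned above is just to zero through PyTorch's fill path, which is captured as a kernel node rather than a memset node. A minimal, device-agnostic sketch (`zero_with_fill` is a hypothetical helper name; on CUDA this dispatches to vectorized_elementwise_kernel instead of cudaMemsetAsync):

```python
import torch

def zero_with_fill(t: torch.Tensor) -> torch.Tensor:
    # In-place fill; during CUDA graph capture this records a kernel
    # node, unlike cudaMemsetAsync (which records a memset node).
    return t.zero_()

# The sketch also runs on CPU, so it can be tried without a GPU:
z = torch.ones(4)
zero_with_fill(z)
print(z.tolist())  # → [0.0, 0.0, 0.0, 0.0]
```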
