Why is there idle time between kernels before and after gpu_memset?

Hello,

While profiling a PyTorch program with NVIDIA Nsight Systems, I noticed an issue, shown in the attached figure: there is a significant idle gap between a gpu_memset operation and the kernel executions immediately before and after it.
Why does the gpu_memset operation cause such long gaps between kernels? Is there a way to avoid this?
Thanks!!!

My example code is:

import torch
from torch import nn

torch.cuda.set_device(1)

# Small convolution in inference mode on GPU 1
conv = nn.Conv2d(384, 384, kernel_size=(3, 1), padding=(1, 0)).to(device="cuda:1").eval()
x = torch.randn(1, 384, 8, 8, device="cuda:1")

# Eager warmup run before capture
conv(x)

# Capture the forward pass into a CUDA graph, then replay it
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    conv(x)

g.replay()
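
For reference, the torch.cuda.graph documentation recommends warming up on a side stream before capture rather than on the default stream. I don't know whether it changes the memset behavior, but this is what that variant would look like (a sketch reusing conv and x from above; the iteration count of 3 is arbitrary):

# Warmup on a side stream, per the torch.cuda.graph docs: run a few
# eager iterations on a non-default stream, then make the default
# stream wait on it before capturing.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        conv(x)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    conv(x)

g.replay()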
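
And in case it helps narrow things down, the replay can be wrapped in an NVTX range so the gap lines up with a named span in the Nsight Systems timeline (a minimal sketch; the range name "graph_replay" is arbitrary):

# Named NVTX span shows up as a marker row in Nsight Systems,
# making the idle gap around the replay easier to attribute.
torch.cuda.nvtx.range_push("graph_replay")
g.replay()
torch.cuda.nvtx.range_pop()
torch.cuda.synchronize()  # ensure the replayed work appears on the timeline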