cudaLaunchKernel takes 99% of CPU time

I am trying to profile the text detection model from ymy-k/DPText-DETR ("DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer", AAAI 2023) using torch.profiler.
According to the results, cudaLaunchKernel takes 99.19% of the CPU time.
I also tried other, non-PyTorch profilers, and they show that the following part consumes most of the time:

    def gen_encoder_output_proposals(self, memory, memory_padding_mask, spatial_shapes):
        N_, S_, C_ = memory.shape
        base_scale = 4.0
        proposals = []
        _cur = 0
        for lvl, (H_, W_) in enumerate(spatial_shapes):
            # recover the per-level padding mask and count the valid rows/columns
            mask_flatten_ = memory_padding_mask[:, _cur:(_cur + H_ * W_)].view(N_, H_, W_, 1)
            valid_H = torch.sum(~mask_flatten_[:, :, 0, 0], 1)
            valid_W = torch.sum(~mask_flatten_[:, 0, :, 0], 1)

            # build an (x, y) grid of proposal centers for this level, normalized by the valid area
            grid_y, grid_x = torch.meshgrid(torch.linspace(0, H_ - 1, H_, dtype=torch.float32, device=memory.device),
                                            torch.linspace(0, W_ - 1, W_, dtype=torch.float32, device=memory.device))
            grid = torch.cat([grid_x.unsqueeze(-1), grid_y.unsqueeze(-1)], -1)

            scale = torch.cat([valid_W.unsqueeze(-1), valid_H.unsqueeze(-1)], 1).view(N_, 1, 1, 2)
            grid = (grid.unsqueeze(0).expand(N_, -1, -1, -1) + 0.5) / scale
            wh = torch.ones_like(grid) * 0.05 * (2.0 ** lvl)
            proposal = torch.cat((grid, wh), -1).view(N_, -1, 4)
            proposals.append(proposal)
            _cur += (H_ * W_)
        output_proposals = torch.cat(proposals, 1)
        output_proposals_valid = ((output_proposals > 0.01) & (output_proposals < 0.99)).all(-1, keepdim=True)
        output_proposals = torch.log(output_proposals / (1 - output_proposals))  # inverse sigmoid
        output_proposals = output_proposals.masked_fill(memory_padding_mask.unsqueeze(-1), float('inf'))
        output_proposals = output_proposals.masked_fill(~output_proposals_valid, float('inf'))

        output_memory = memory
        output_memory = output_memory.masked_fill(memory_padding_mask.unsqueeze(-1), float(0))
        output_memory = output_memory.masked_fill(~output_proposals_valid, float(0))
        output_memory = self.enc_output_norm(self.enc_output(output_memory))
        return output_memory, output_proposals

So I tried to play around with this part, creating some random tensors like this:

    spatial_shapes = torch.tensor([[125, 125],
                                   [ 63,  63],
                                   [ 32,  32],
                                   [ 16,  16]], device='cuda')
    # spatial_shapes = torch.randint(16, 224, [4, 2], device='cuda')  # earlier attempt with random shapes, overridden above
    # shape must equal the total number of flattened positions, i.e. the sum of H_ * W_ over all levels
    shape = int((spatial_shapes[:, 0] * spatial_shapes[:, 1]).sum())
    memory = torch.randn([1, shape, 256], device='cuda')
    memory_padding_mask = torch.randn([1, shape], device='cuda') > 0.5

    tensors = []
    tensors.append((memory, memory_padding_mask, spatial_shapes))

    from torch.profiler import profile, record_function, ProfilerActivity

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], with_stack=True) as prof:
        with record_function("model_inference"):
            for memory, memory_padding_mask, spatial_shapes in tensors:
                r = gen_encoder_output_proposals(memory, memory_padding_mask, spatial_shapes)
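
To inspect the result, a summary along these lines can be printed and a trace exported (a minimal sketch using the standard torch.profiler reporting helpers; the trace file name is arbitrary):

    # Sorting by self CPU time surfaces launch overhead such as cudaLaunchKernel,
    # while sorting by self CUDA time shows which kernels actually run long on the GPU.
    print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
    print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20))
    prof.export_chrome_trace("gen_proposals_trace.json")  # viewable in chrome://tracing or Perfetto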

And I got the same result. Warming up the model doesn't improve the performance.
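
For reference, this is roughly how I warm up and time it (a minimal sketch; the `timed` helper and the iteration counts are arbitrary, it assumes the function and the `tensors` list from the snippets above, and torch.cuda.synchronize() is used so the measured wall time covers both the launch overhead and the actual GPU work):

    import time
    import torch

    def timed(fn, args, n_warmup=5, n_iters=50):
        # warm-up iterations so one-time costs (CUDA context creation, allocator) are excluded
        for _ in range(n_warmup):
            fn(*args)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            fn(*args)
        torch.cuda.synchronize()  # wait for all launched kernels before stopping the clock
        return (time.perf_counter() - start) / n_iters

    for args in tensors:
        print(timed(gen_encoder_output_proposals, args))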

However, if I first run it on e.g. tensors[:1] and then execute the whole loop, I get much better results.

So, is there a weak part that should be replaced, or is this a problem of incorrect profiling?

For a more detailed understanding of your workload I would recommend profiling your code with e.g. Nsight Systems as described here. In the timeline view you would be able to see if your workload is CPU-limited and where the bottlenecks are.
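
To make this region easy to spot in the timeline, you could additionally wrap the call in an NVTX range, e.g. (a minimal sketch; the range name is arbitrary):

    import torch

    # Wrap the region of interest in an NVTX range so it shows up
    # as a named span in the Nsight Systems timeline.
    torch.cuda.nvtx.range_push("gen_encoder_output_proposals")
    r = gen_encoder_output_proposals(memory, memory_padding_mask, spatial_shapes)
    torch.cuda.nvtx.range_pop()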

@ptrblck Thanks for the recommendation.
I tried to use Nsight Systems and got the following result:

[screenshot of the Nsight Systems timeline]

So I still wonder: is it okay for cudaLaunchKernel to take most of the total time?

What do you see in the timeline? Are the kernels packed tightly, or do you see gaps (white space) between their launches? Were you able to isolate the bottleneck?