Variable model inference speed

I am trying to optimize the SadTalker inference code, and this is the section where the face renderer runs.

When I profile with line_profiler and torch.profiler, I see very confusing output. In the current form of the code, line 32, idx_tensor = torch.FloatTensor(idx_tensor).type_as(pred).to(device), is shown to take the bulk of the time (88%), versus 11% for the model call on line 128.

I can reproduce this with torch.profiler as well:

            with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
                with record_function("keypoint transformation"):
                    kp_driving = keypoint_transformation(kp_canonical, he_driving)
            print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=2))
                
            kp_norm = kp_driving
            if use_half:
                source_image = source_image.half()
                kp_source = {k: v.half() for k, v in kp_source.items()}
                kp_norm = {k: v.half() for k, v in kp_norm.items()}
            with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
                with record_function("model_inference"):
                    out = generator(source_image, kp_source=kp_source, kp_driving=kp_norm)
            print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=2))

This prints:

STAGE:2024-01-08 22:07:34 17294:17294 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-08 22:07:34 17294:17294 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-08 22:07:34 17294:17294 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
    keypoint transformation         0.35%     548.000us       100.00%     157.668ms     157.668ms             1  
                aten::empty         0.00%       7.000us         0.00%       7.000us       1.750us             4  
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 157.668ms

STAGE:2024-01-08 22:07:34 17294:17294 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-08 22:07:34 17294:17294 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-08 22:07:34 17294:17294 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                 model_inference         2.27%     739.000us       100.00%      32.553ms      32.553ms             1  
                    aten::conv2d         0.77%     252.000us         6.73%       2.191ms      27.049us            81  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 32.553ms
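(Incidentally, the tables above are sorted by cuda_time_total even though only ProfilerActivity.CPU is recorded, so no GPU time is actually being collected. A minimal sketch of recording both activities, falling back to CPU-only when no GPU is present; the matmul is just a stand-in workload:)

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Record GPU activity too, when available, so cuda_time_total is meaningful.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

x = torch.randn(256, 256)
with profile(activities=activities, record_shapes=True) as prof:
    with record_function("matmul_demo"):
        y = x @ x

table = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5)
print(table)
```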

Internally, keypoint_transformation calls headpose_pred_to_degree, which is where the majority of the time goes; using time.time() I localized it to the index-tensor creation statement.

To resolve this, I hoist the index tensor out of the function:

idx_tensor = torch.FloatTensor([idx for idx in range(66)]).to("cuda")
def headpose_pred_to_degree(pred):  
    pred = F.softmax(pred)
    degree = torch.sum(pred*idx_tensor, 1) * 3 - 99
    return degree
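(For reference, a self-contained version of this hoisted function. The device fallback and the explicit dim=1 on softmax are my additions; the original call omits the dim argument, which PyTorch warns about.)

```python
import torch
import torch.nn.functional as F

# Pick a device once; SadTalker normally runs this on "cuda".
device = "cuda" if torch.cuda.is_available() else "cpu"

# Build the 66-bin index tensor once, outside the hot path.
idx_tensor = torch.arange(66, dtype=torch.float32, device=device)

def headpose_pred_to_degree(pred):
    # pred: (batch, 66) logits over angle bins.
    # dim=1 is an assumption; the original call relies on the implicit dim.
    pred = F.softmax(pred, dim=1)
    # Expected value over bin indices, mapped from bin index to degrees.
    degree = torch.sum(pred * idx_tensor, dim=1) * 3 - 99
    return degree
```

With all-zero logits the softmax is uniform, so the expected bin index is 32.5 and the result is 32.5 * 3 - 99 = -1.5 degrees, which makes for an easy sanity check.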

However, now the model itself takes much longer in inference: the per-hit time in line_profiler went from 30943.1e-06 to 275509.8e-06 (roughly 31 ms to 276 ms per call). The output with torch.profiler:

STAGE:2024-01-08 22:09:23 17469:17469 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-08 22:09:23 17469:17469 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-08 22:09:23 17469:17469 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
    keypoint transformation        50.98%     416.000us       100.00%     816.000us     816.000us             1  
              aten::softmax         0.61%       5.000us         3.06%      25.000us       8.333us             3  
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 816.000us

STAGE:2024-01-08 22:09:23 17469:17469 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-08 22:09:23 17469:17469 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-08 22:09:23 17469:17469 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                 model_inference         0.92%       1.703ms       100.00%     184.203ms     184.203ms             1  
                    aten::conv2d         0.12%     220.000us         1.20%       2.209ms      27.272us            81  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 184.203ms

Why is this happening and how can I resolve this?

CUDA operations are executed asynchronously: the host launches kernels and continues immediately, and only blocks at a synchronizing operation. A host-side timer therefore tends to bill the first synchronizing call for all the GPU work still in flight, so the apparent hotspot moved when you changed the code rather than any work actually getting slower.
Profile the code with a visual profiler (e.g. Nsight Systems, or the Chrome trace exported by torch.profiler) and check the actual timeline to see how long each operation takes, or explicitly synchronize via torch.cuda.synchronize() before starting and stopping host timers when profiling specific operations.
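A minimal sketch of the explicit-synchronization approach, using a hypothetical timed helper (any callable stands in for your generator):

```python
import time
import torch

def timed(fn, *args, **kwargs):
    # Drain any GPU work queued *before* the region of interest,
    # so earlier kernels are not billed to this timer.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    # Wait for the kernels launched by fn to actually finish; without this,
    # the host timer only measures kernel *launch* overhead.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start
```

With this, out, elapsed = timed(generator, source_image, kp_source=kp_source, kp_driving=kp_norm) gives a wall-clock figure attributable to the generator alone; torch.cuda.Event-based timing is a lower-overhead alternative for measuring GPU-side duration.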