Are intermediate Variables actually lazy? (profiling listing inside)

I was trying to profile my code with line_profiler, because cProfile doesn't give me much useful information (for example, it doesn't measure things like a[x]). It looks like the actual computations happen only when the resulting values are needed: in the code below, if I add a print, the runtime splits almost exactly between the lines where evaluation is forced (the print and the cast to a Python float).
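For reference, this is roughly how I run line_profiler; the file and function names below are made up for illustration, and the @profile decorator is injected as a builtin by kernprof, so there is no import for it:

# test.py -- profile with: kernprof -l -v test.py
import torch.nn.functional as F

@profile  # injected by kernprof at runtime; NameError under plain python
def evaluate(model, loader):
    test_loss_t = 0.0
    for data, target in loader:
        output_var = model(data)  # [B, C, H, W]
        class_n = output_var.size(1)
        output_flat = output_var.permute(0, 2, 3, 1).contiguous().view(-1, class_n)
        cross_ent = F.cross_entropy(output_flat, target.view(-1), size_average=False)
        test_loss_t += cross_ent.data[0]
    return test_loss_t

Here is the profile with the print(output_var.sum()) added: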

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
64       102       862955   8460.3      1.8          output_var = model(data)  # [B, C, H, W]
65       101          600      5.9      0.0          class_n = output_var.size(1)
66       101     20781085 205753.3     42.8          print(output_var.sum())
67       101        10971    108.6      0.0          output_flat = output_var.permute(0, 2, 3, 1).contiguous().view(-1, class_n)
68       101        15572    154.2      0.0          cross_ent = F.cross_entropy(output_flat, target.view(-1), size_average=False)
69       101     18728730 185433.0     38.6          test_loss_t += cross_ent.data[0]
70       100         5518     55.2      0.0          pred = output_var.data.max(1)[1]  # [B, H, W]

And here is the same section without the print on line 66:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
64       102       862955   8460.3      1.8          output_var = model(data)  # [B, C, H, W]
65       102          606      5.9      0.0          class_n = output_var.size(1)
66       102         7586     74.4      0.0          output_flat = output_var.permute(0, 2, 3, 1).contiguous().view(-1, class_n)
67       102        12863    126.1      0.0          cross_ent = F.cross_entropy(output_flat, target.view(-1), size_average=False)
68       102     39538526 387632.6     81.3          test_loss_t += cross_ent.data[0]

See how the total runtime just gets redistributed: without the print, almost all of it lands on the cross_ent.data[0] line instead.

Am I interpreting this correctly? If so, is there a way to change this behaviour for profiling purposes? Thanks.

Is this on CUDA? CUDA kernel launches are asynchronous: the call returns immediately, and the kernel actually finishes running later, at whatever point your Python code needs the result. You can disable this with CUDA_LAUNCH_BLOCKING=1, but be prepared for slower code.
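For example, a minimal sketch of enabling it from inside the script; the flag has to be set before the first CUDA context is created, so it goes at the very top (alternatively, export it in the shell before launching Python):

import os

# Must be set before CUDA is initialized, i.e. before anything touches
# the GPU; after that, every kernel launch blocks until the kernel has
# actually finished, so time is attributed to the launching line.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# Alternatively, keep launches asynchronous and force completion at a
# known point before reading a timer:
# torch.cuda.synchronize()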

Indeed! Very useful for profiling, thank you! In my case, the performance drop was minimal.

Also, you should seriously consider using nvprof/nvvp.
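For example, a minimal sketch of getting labelled PyTorch ops onto the nvprof/nvvp timeline; the model and input below are stand-ins for illustration:

import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, kernel_size=3).cuda()  # stand-in model
data = torch.randn(1, 3, 32, 32).cuda()        # stand-in input

# Ops executed inside this block emit NVTX ranges, which nvprof/nvvp
# display as labelled bars on the CUDA timeline.
# Run with e.g.: nvprof -o trace.nvprof python test.py
with torch.autograd.profiler.emit_nvtx():
    out = model(data)
torch.cuda.synchronize()  # make sure all launched kernels are captured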