Are intermediate Variables actually lazy? (profiling listing inside)

I was trying to profile my code with line_profiler, because cProfile doesn't give me much useful information (for example, it doesn't measure things like a[x]). It looks like the actual computations happen only when the resulting values are needed: in the code below, if I add a print, the runtime splits almost exactly between the lines where evaluation is forced (the print and the cast to a Python float).
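For reference, this is roughly how I run line_profiler; the file and function names below are made up for illustration, and the @profile decorator is injected as a builtin by kernprof, so there is no import for it:

# test.py -- profile with: kernprof -l -v test.py
import torch.nn.functional as F

@profile  # injected by kernprof at runtime; NameError under plain python
def evaluate(model, loader):
    test_loss_t = 0.0
    for data, target in loader:
        output_var = model(data)  # [B, C, H, W]
        class_n = output_var.size(1)
        output_flat = output_var.permute(0, 2, 3, 1).contiguous().view(-1, class_n)
        cross_ent = F.cross_entropy(output_flat, target.view(-1), size_average=False)
        test_loss_t += cross_ent.data[0]
    return test_loss_t

Here is the profile with the print(output_var.sum()) added: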

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
64       102       862955   8460.3      1.8          output_var = model(data)  # [B, C, H, W]
65       101          600      5.9      0.0          class_n = output_var.size(1)
66       101     20781085 205753.3     42.8          print(output_var.sum())
67       101        10971    108.6      0.0          output_flat = output_var.permute(0, 2, 3, 1).contiguous().view(-1, class_n)
68       101        15572    154.2      0.0          cross_ent = F.cross_entropy(output_flat, target.view(-1), size_average=False)
69       101     18728730 185433.0     38.6          test_loss_t += cross_ent.data[0]
70       100         5518     55.2      0.0          pred = output_var.data.max(1)[1]  # [B, H, W]

And here is the same section without the print on line 66:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
64       102       862955   8460.3      1.8          output_var = model(data)  # [B, C, H, W]
65       102          606      5.9      0.0          class_n = output_var.size(1)
66       102         7586     74.4      0.0          output_flat = output_var.permute(0, 2, 3, 1).contiguous().view(-1, class_n)
67       102        12863    126.1      0.0          cross_ent = F.cross_entropy(output_flat, target.view(-1), size_average=False)
68       102     39538526 387632.6     81.3          test_loss_t += cross_ent.data[0]

See how the total runtime just gets redistributed: without the print, almost all of it lands on the cross_ent.data[0] line instead.

Am I interpreting this correctly? If so, is there a way to change this behaviour for profiling purposes? Thanks.

Is this on CUDA? CUDA kernel launches are asynchronous: the call returns immediately, and the kernel actually finishes running later, at whatever point your Python code needs the result. You can disable this with CUDA_LAUNCH_BLOCKING=1, but be prepared for slower code.
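For example, a minimal sketch of enabling it from inside the script; the flag has to be set before the first CUDA context is created, so it goes at the very top (alternatively, export it in the shell before launching Python):

import os

# Must be set before CUDA is initialized, i.e. before anything touches
# the GPU; after that, every kernel launch blocks until the kernel has
# actually finished, so time is attributed to the launching line.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# Alternatively, keep launches asynchronous and force completion at a
# known point before reading a timer:
# torch.cuda.synchronize()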

Indeed! Very useful for profiling, thank you! In my case, the performance drop was minimal.

Also, you should seriously consider using nvprof/nvvp.
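For example, a minimal sketch of getting labelled PyTorch ops onto the nvprof/nvvp timeline; the model and input below are stand-ins for illustration:

import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, kernel_size=3).cuda()  # stand-in model
data = torch.randn(1, 3, 32, 32).cuda()        # stand-in input

# Ops executed inside this block emit NVTX ranges, which nvprof/nvvp
# display as labelled bars on the CUDA timeline.
# Run with e.g.: nvprof -o trace.nvprof python test.py
with torch.autograd.profiler.emit_nvtx():
    out = model(data)
torch.cuda.synchronize()  # make sure all launched kernels are captured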