Torch.autograd.profiler doesn't save much

I have a training script that I launch with python -m torch.distributed.launch --nproc_per_node=1 --use_env train.py. I would like to profile it, so I did something like this:

import torch.autograd.profiler as profiler
...
    with profiler.profile() as prof:
        with profiler.record_function("training"):
            print("Start training")
            for epoch in range(epochs):
                ...
    if is_main_process():
        prof.export_chrome_trace(str(output_dir / 'trace.json'))  # str() because some versions only accept a string path

The resulting JSON contains almost nothing:

[{"name": "training", "ph": "X", "ts": 140.203, "dur": 139.73299999999998, "tid": 1, "pid": "CPU functions", "args": {}}]

What should I do to get detailed running times for all the operations in my training loop?

Thank you very much in advance for your help!

The reason lies in this line: you put too much code inside this block.

with profiler.record_function("training"):
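
For example, something like this keeps each record_function block small and gives you a per-operator breakdown via key_averages() in addition to the chrome trace. This is only a minimal sketch: the model, data, and optimizer below are toy placeholders, not from your script.

import torch
import torch.autograd.profiler as profiler

model = torch.nn.Linear(128, 10)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = torch.randn(32, 128)
target = torch.randint(0, 10, (32,))

with profiler.profile() as prof:
    for _ in range(3):  # a few iterations are enough for profiling
        with profiler.record_function("forward"):
            loss = criterion(model(data), target)
        with profiler.record_function("backward"):
            optimizer.zero_grad()
            loss.backward()
        with profiler.record_function("optimizer_step"):
            optimizer.step()

# aggregated per-operator statistics, sorted by total CPU time
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
prof.export_chrome_trace("trace.json")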

@hhaoao Thanks. But how do we know how much is too much?

I would personally summarize the uses of record_function in the following points:

  1. Wrapping a segment whose individual operations are redundant detail, when you only need aggregate statistics for the segment as a whole.
  2. Labelling snippets so they are easier to find when viewing the trace (see the sketch after this list).

Of course, there are more uses than these; you can explore them yourself.
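
As an illustration of those two points, here is a small sketch (the segment names are made up for the example):

import torch
import torch.autograd.profiler as profiler

x = torch.randn(64, 64)

with profiler.profile() as prof:
    # point 1: only the aggregate cost of this block matters,
    # so a single label over it is enough
    with profiler.record_function("preprocessing"):
        x = (x - x.mean()) / (x.std() + 1e-6)
    # point 2: the label appears as a named slice in the trace,
    # making this segment easy to find among the raw aten::* events
    with profiler.record_function("matmul_block"):
        y = x @ x.t()

prof.export_chrome_trace("labelled_trace.json")  # open in chrome://tracing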