I have a training script that I launch with `python -m torch.distributed.launch --nproc_per_node=1 --use_env train.py`. I would like to profile it, so I did something like this:
```python
import torch.autograd.profiler as profiler
...
with profiler.profile() as prof:
    with profiler.record_function("training"):
        print("Start training")
        for epoch in range(epochs):
            ...
if is_main_process():
    prof.export_chrome_trace(output_dir / 'trace.json')
```
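For reference, here is a minimal self-contained repro of what I'm doing, with a dummy linear model and a toy loop standing in for my real training code (`epochs`, `is_main_process`, and `output_dir` in the snippet above come from my actual script):

```python
import torch
import torch.autograd.profiler as profiler

# Dummy model and optimizer standing in for my real training setup
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

with profiler.profile() as prof:
    with profiler.record_function("training"):
        for epoch in range(2):
            x = torch.randn(4, 10)
            loss = model(x).sum()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Write the collected events out in Chrome trace format
prof.export_chrome_trace("trace.json")
```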
The resulting JSON contains almost nothing:

```json
[{"name": "training", "ph": "X", "ts": 140.203, "dur": 139.73299999999998, "tid": 1, "pid": "CPU functions", "args": {}}]
```
What should I do to get the detailed running time of every operation in my training loop?
Thank you very much in advance for your help!