Are there any demo scripts for benchmarking DDP training? And how can I confirm that all processes have been killed?
I found `pytorch/benchmarks/distributed/ddp/compare/compare_ddp.py` in the repo, which contains the following:
```python
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    # record_shapes=True,  # Causes seg fault in export_chrome_trace
    # with_stack=True,  # Causes seg fault with EFA
    # with_flops=True,  # Causes seg fault in export_chrome_trace
    record_shapes=False,
    with_stack=False,
    with_flops=False,
    on_trace_ready=my_tensorboard_trace_handler(
        f"tb/{now.strftime('%Y_%m_%d_%H_%M_%S')}", rank, use_gzip=True
    ),
) if args.profile else contextlib.nullcontext() as prof:
    for i in range(n_iters):
        before_forward_event.record()
        out = model(inputs)
        after_forward_event.record()
```
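For context, here is a minimal sketch (not from the script above) of how such CUDA events are typically turned into milliseconds: `Event.elapsed_time` is only valid after both events have completed, so a synchronize comes first. The model and input here are made up just to have something to time, and the sketch falls back to `time.perf_counter` on a CUDA-less machine:

```python
import time

import torch

# Hypothetical tiny model and input, just to have something to time.
model = torch.nn.Linear(128, 128)
inputs = torch.randn(32, 128)

if torch.cuda.is_available():
    model, inputs = model.cuda(), inputs.cuda()
    before = torch.cuda.Event(enable_timing=True)
    after = torch.cuda.Event(enable_timing=True)
    before.record()
    out = model(inputs)
    after.record()
    # elapsed_time is only valid once both events have completed.
    torch.cuda.synchronize()
    fwd_ms = before.elapsed_time(after)
else:
    # CPU fallback: plain wall-clock timing.
    t0 = time.perf_counter()
    out = model(inputs)
    fwd_ms = (time.perf_counter() - t0) * 1000.0

print(f"forward time: {fwd_ms:.3f} ms")
```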
I wonder whether this is accurate: `profile` may add extra overhead to the model, and `event.record` may take more time than it would without the `with profile` context.
You are right; the profiler does introduce some overhead. However, we find that this overhead is typically negligible.
When you want to benchmark DDP, what are the metrics you are interested in?
Thank you for your reply. I am interested in the forward and backward time in each GPU and the average of each GPU.
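To get the average across GPUs, each rank's measured time can be all-reduced over the process group. A single-process `gloo` sketch for illustration only; in a real DDP job the rank, world size, and rendezvous settings come from the launcher (e.g. `torchrun`), and the measured value here is made up:

```python
import os

import torch
import torch.distributed as dist

# Single-process group for illustration; a real multi-GPU launch gets
# RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT from the launcher.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Suppose each rank measured its own forward time in milliseconds.
fwd_ms = torch.tensor([12.5])  # this rank's measurement (made-up value)

# Sum across ranks, then divide by world size to get the average.
dist.all_reduce(fwd_ms, op=dist.ReduceOp.SUM)
avg_ms = fwd_ms.item() / dist.get_world_size()
print(f"average forward time across ranks: {avg_ms:.2f} ms")

dist.destroy_process_group()
```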
Hey @JuyiLin, here are some examples of how to get those numbers:
getting fwd/bwd time: event_demo.py · GitHub
getting DDP comm time: ddp_comm_time.py · GitHub