How to benchmark DDP?

Are there any demo scripts for benchmarking DDP training? And how can I confirm that all processes have been killed?

The DDP part of this repo may be helpful: GitHub - mrshenli/ptd_benchmark

I found pytorch/benchmarks/distributed/ddp/compare/compare_ddp.py in the repo.

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        # record_shapes=True, # Causes seg fault in export_chrome_trace
        # with_stack=True, # Causes seg fault with EFA
        # with_flops=True, # Causes seg fault in export_chrome_trace
        record_shapes=False,
        with_stack=False,
        with_flops=False,
        on_trace_ready=my_tensorboard_trace_handler(f"tb/{now.strftime('%Y_%m_%d_%H_%M_%S')}", rank, use_gzip=True)
    ) if args.profile else contextlib.nullcontext() as prof:
        for i in range(n_iters):
            before_forward_event.record()           
            out = model(inputs)
            after_forward_event.record()
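For context, `before_forward_event` and `after_forward_event` in that script are `torch.cuda.Event(enable_timing=True)` instances; after the loop, the elapsed time can be read back once the recorded events have completed. A minimal sketch of that pattern (guarded so it only runs when a GPU is present):

```python
import torch

if torch.cuda.is_available():
    # Events must be created with enable_timing=True to support elapsed_time()
    before_forward_event = torch.cuda.Event(enable_timing=True)
    after_forward_event = torch.cuda.Event(enable_timing=True)

    before_forward_event.record()
    # ... forward pass would run here ...
    after_forward_event.record()

    torch.cuda.synchronize()  # wait until both recorded events have completed
    fwd_ms = before_forward_event.elapsed_time(after_forward_event)  # milliseconds
    print(f"forward time: {fwd_ms:.3f} ms")
```

Note that `record()` is asynchronous; the synchronize before `elapsed_time` is what makes the measurement valid.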

I wonder whether this measurement is accurate. The profiler may add extra overhead to the model, and `event.record` may take longer than running without the `with profile` context.

You are right; the profiler does introduce some slight overhead. However, we find that this overhead is typically negligible.
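One way to gauge the overhead yourself is to time the same workload with and without the profiler. A minimal CPU-only sketch (the workload, matrix size, and iteration count are arbitrary choices for illustration):

```python
import contextlib
import time

import torch
from torch.profiler import profile, ProfilerActivity

def bench(use_profiler: bool, n_iters: int = 50) -> float:
    """Return wall-clock seconds for a small workload,
    optionally wrapped in the PyTorch profiler."""
    x = torch.randn(128, 128)
    ctx = (
        profile(activities=[ProfilerActivity.CPU])
        if use_profiler
        else contextlib.nullcontext()
    )
    t0 = time.perf_counter()
    with ctx:
        for _ in range(n_iters):
            x = (x @ x).clamp(-1, 1)  # keep values bounded across iterations
    return time.perf_counter() - t0

baseline = bench(use_profiler=False)
profiled = bench(use_profiler=True)
print(f"baseline: {baseline:.4f}s, profiled: {profiled:.4f}s")
```

The difference between the two numbers is an upper bound on what the profiler costs for this workload; for realistic models the relative overhead is usually much smaller.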

When you want to benchmark DDP, what are the metrics you are interested in?


Thank you for your reply. I am interested in the forward and backward time on each GPU, and in the average across GPUs.

Hey @JuyiLin, here are some examples of how to get those numbers:

getting fwd/bwd time: event_demo.py · GitHub
getting DDP comm time: ddp_comm_time.py · GitHub
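Building on those examples: once each rank has its own forward or backward time (e.g. via `torch.cuda.Event.elapsed_time`), the cross-GPU average can be obtained with an `all_reduce`. A sketch assuming a DDP process group is already initialized (`average_across_ranks` is a hypothetical helper, not from the gists above):

```python
import torch
import torch.distributed as dist

def average_across_ranks(local_ms: float) -> float:
    """Average a per-rank timing (milliseconds) across all DDP ranks.

    Assumes the default process group is initialized and every rank
    calls this with its own measurement.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    t = torch.tensor([local_ms], device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # SUM then divide is backend-agnostic
    return (t / dist.get_world_size()).item()
```

Summing and dividing by `get_world_size()` works on any backend, whereas `ReduceOp.AVG` is only available on some backends.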
