How to benchmark DDP?

Are there any demo scripts for benchmarking DDP training? And how can I confirm that all processes have been killed?

The DDP part of this repo may be helpful: GitHub - mrshenli/ptd_benchmark

I found pytorch/benchmarks/distributed/ddp/compare/compare_ddp.py in the repo.

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        # record_shapes=True, # Causes seg fault in export_chrome_trace
        # with_stack=True, # Causes seg fault with EFA
        # with_flops=True, # Causes seg fault in export_chrome_trace
        record_shapes=False,
        with_stack=False,
        with_flops=False,
        on_trace_ready=my_tensorboard_trace_handler(f"tb/{now.strftime('%Y_%m_%d_%H_%M_%S')}", rank, use_gzip=True)
    ) if args.profile else contextlib.nullcontext() as prof:
        for i in range(n_iters):
            before_forward_event.record()           
            out = model(inputs)
            after_forward_event.record()
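For context, `before_forward_event` and `after_forward_event` in that script are `torch.cuda.Event(enable_timing=True)` instances; after the loop, the elapsed time can be read back once the recorded events have completed. A minimal sketch of that pattern (guarded so it only runs when a GPU is present):

```python
import torch

if torch.cuda.is_available():
    # Events must be created with enable_timing=True to support elapsed_time()
    before_forward_event = torch.cuda.Event(enable_timing=True)
    after_forward_event = torch.cuda.Event(enable_timing=True)

    before_forward_event.record()
    # ... forward pass would run here ...
    after_forward_event.record()

    torch.cuda.synchronize()  # wait until both recorded events have completed
    fwd_ms = before_forward_event.elapsed_time(after_forward_event)  # milliseconds
    print(f"forward time: {fwd_ms:.3f} ms")
```

Note that `record()` is asynchronous; the synchronize before `elapsed_time` is what makes the measurement valid.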

I wonder whether this measurement is accurate. The profiler may add extra overhead to the model, and `event.record` may take longer than running without the `with profile` context.

You are right; the profiler does introduce some slight overhead. However, we find that this overhead is typically negligible.
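One way to gauge the overhead yourself is to time the same workload with and without the profiler. A minimal CPU-only sketch (the workload, matrix size, and iteration count are arbitrary choices for illustration):

```python
import contextlib
import time

import torch
from torch.profiler import profile, ProfilerActivity

def bench(use_profiler: bool, n_iters: int = 50) -> float:
    """Return wall-clock seconds for a small workload,
    optionally wrapped in the PyTorch profiler."""
    x = torch.randn(128, 128)
    ctx = (
        profile(activities=[ProfilerActivity.CPU])
        if use_profiler
        else contextlib.nullcontext()
    )
    t0 = time.perf_counter()
    with ctx:
        for _ in range(n_iters):
            x = (x @ x).clamp(-1, 1)  # keep values bounded across iterations
    return time.perf_counter() - t0

baseline = bench(use_profiler=False)
profiled = bench(use_profiler=True)
print(f"baseline: {baseline:.4f}s, profiled: {profiled:.4f}s")
```

The difference between the two numbers is an upper bound on what the profiler costs for this workload; for realistic models the relative overhead is usually much smaller.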

When you want to benchmark DDP, what are the metrics you are interested in?


Thank you for your reply. I am interested in the forward and backward time on each GPU, and in the average across GPUs.

Hey @JuyiLin, here are some examples of how to get those numbers:

getting fwd/bwd time: event_demo.py · GitHub
getting DDP comm time: ddp_comm_time.py · GitHub
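Building on those examples: once each rank has its own forward or backward time (e.g. via `torch.cuda.Event.elapsed_time`), the cross-GPU average can be obtained with an `all_reduce`. A sketch assuming a DDP process group is already initialized (`average_across_ranks` is a hypothetical helper, not from the gists above):

```python
import torch
import torch.distributed as dist

def average_across_ranks(local_ms: float) -> float:
    """Average a per-rank timing (milliseconds) across all DDP ranks.

    Assumes the default process group is initialized and every rank
    calls this with its own measurement.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    t = torch.tensor([local_ms], device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # SUM then divide is backend-agnostic
    return (t / dist.get_world_size()).item()
```

Summing and dividing by `get_world_size()` works on any backend, whereas `ReduceOp.AVG` is only available on some backends.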
